...

Code Block

language	bash
title	Handy 1 liner to determine frequencies of reads mapping to each reference. Command broken down below.

echo -e "File\tReference\tReads"; for sam in *.sam;do name=$(echo $sam | sed 's/.sam//');grep -v "^@" $sam | awk -F '\t' '{print $3}' | sort | uniq -c | awk -v sam="$name" '{sum+=$1}{print sam,$2,$1}END{print sam,"total",sum}' OFS='\t';done

...

Again the format is not great, but can be copy pasted to Excel or similar for column alignment. and in either case this will produce a single row for each sample and single column for each reference, while at the same time, changing the * category (which from reading info about samtools you would see is a read that does not map to any reference) to a more human understood 'unmapped' classification.

You were promised a description of the complicated for loop above.

No Format

example bash command to generate input file
for sam in *.sam;do name=$(echo $sam | sed 's/.sam//');grep -v "^@" $sam | awk -F '\t' '{print $3}' | sort | uniq -c | awk -v sam="$name" '{sum+=$1}{print sam,$2,$1}END{print sam,"total",sum}' OFS='\t';done > ref_counts.tsv


for sam in *.sam;                           # for loop looking at sam files
    do                                      # standard bash for loop formating
    name=$(echo $sam | sed 's/.sam//');     # strip .sam ending from files
    grep -v "^@" $sam |                     # remove header lines from consideration
    awk -F '\t' '{print $3}' |              # Pull the 3rd column from the sam file. 3rd column is name of chromosome read mapped to
    sort |                                  # Sort chromosome names (required for uniq to work correctly)
    uniq -c |                               # take unique chromosome names and return the count of each
    awk                                     # Start of long awk command
        -v sam="$name"                      # define awk variable "sam" using shell variable "$name" (see above) for later printing
        '{sum+=$1}                          # define variable 'sum' as running count of how many total reads were evaluated
        {print sam,$2,$1}                   # for each line (chromosome) print the sam file name, chromosome name, and number of reads mapped to that chromosome
        END{print sam,"total",sum}'         # after all lines (chromosomes) dealt with, print sam file name, the word "total", and the value of variable 'sum' (total number of reads)
        OFS='\t';                           # OutputFieldSeparator set to tab so output will be tsv
    done                                    # standard bash for loop end
> ref_counts.tsv                            # redirect output to ref_counts.tsv file

The above box is a copy paste from the header information found in the bt2-mapping-ref-count-table.py file.

Code Block

language	bash
title	To quickly view the entire script consider the following command

less `which bt2-mapping-ref-count-table.py`

Tip
Note 1 last trick here where the text between the ` marks (backtick NOT single quote) is evaluated fist allowing you to open the script for viewing in the less viewer. Frustratingly, once you hit the first ` mark, you lose the ability to tab complete often making it easier to type the which command with tab completion, then go back to add the ` marks.

Next steps

As noted above, the human trios tutorial deals with calling SNVs on multiple samples at once. Consider how you could combine the two steps

...

Versions Compared

Old Version 2

New Version 3

Key

Next steps

Page Comparison

Versions Compared

Old Version 2

New Version 3

Key

Next steps