Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Code Block
languagebash
titleHandy 1 liner to determine frequencies of reads mapping to each reference. Command broken down below.
echo -e "File\tReference\tReads"; for sam in *.sam;do name=$(echo $sam | sed 's/.sam//');grep -v "^@" $sam | awk -F '\t' '{print $3}' | sort | uniq -c | awk -v sam="$name" '{sum+=$1}{print sam,$2,$1}END{print sam,"total",sum}' OFS='\t';done

...

Again the format is not great, but can be copy pasted to Excel or similar for column alignment. and in either case this will produce a single row for each sample and single column for each reference, while at the same time, changing the * category (which from reading info about samtools you would see is a read that does not map to any reference) to a more human understood 'unmapped' classification.

You were promised a description of the complicated for loop above. 

No Format
example bash command to generate input file
for sam in *.sam;do name=$(echo $sam | sed 's/.sam//');grep -v "^@" $sam | awk -F '\t' '{print $3}' | sort | uniq -c | awk -v sam="$name" '{sum+=$1}{print sam,$2,$1}END{print sam,"total",sum}' OFS='\t';done > ref_counts.tsv


for sam in *.sam;                           # for loop looking at sam files
    do                                      # standard bash for loop formating
    name=$(echo $sam | sed 's/.sam//');     # strip .sam ending from files
    grep -v "^@" $sam |                     # remove header lines from consideration
    awk -F '\t' '{print $3}' |              # Pull the 3rd column from the sam file. 3rd column is name of chromosome read mapped to
    sort |                                  # Sort chromosome names (required for uniq to work correctly)
    uniq -c |                               # take unique chromosome names and return the count of each
    awk                                     # Start of long awk command
        -v sam="$name"                      # define awk variable "sam" using shell variable "$name" (see above) for later printing
        '{sum+=$1}                          # define variable 'sum' as running count of how many total reads were evaluated
        {print sam,$2,$1}                   # for each line (chromosome) print the sam file name, chromosome name, and number of reads mapped to that chromosome
        END{print sam,"total",sum}'         # after all lines (chromosomes) dealt with, print sam file name, the word "total", and the value of variable 'sum' (total number of reads)
        OFS='\t';                           # OutputFieldSeparator set to tab so output will be tsv
    done                                    # standard bash for loop end
> ref_counts.tsv                            # redirect output to ref_counts.tsv file

The above box is a copy paste from the header information found in the bt2-mapping-ref-count-table.py file. 

Code Block
languagebash
titleTo quickly view the entire script consider the following command
less `which bt2-mapping-ref-count-table.py`
Tip

Note 1 last trick here where the text between the ` marks (backtick NOT single quote) is evaluated fist allowing you to open the script for viewing in the less viewer. Frustratingly, once you hit the first ` mark, you lose the ability to tab complete often making it easier to type the which command with tab completion, then go back to add the ` marks.


Next steps

As noted above, the human trios tutorial deals with calling SNVs on multiple samples at once.  Consider how you could combine the two steps

...