Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Tip
titleReservations

Use our summer school reservation (CoreNGS) when submitting batch jobs to get higher priority on the ls6 normal queue.

Code Block
languagebash
titleRequest an interactive (idev) node
# Request a 180 minute interactive node on the normal queue using our reservation
idev -m 120 -N 1 -A OTH21164 -r CoreNGS
idev -m 120 -N 1 -A TRA23004 -r CoreNGS

# Request a 120 minute idev node on the development queue 
idev -m 120 -N 1 -A OTH21164 -p development
idev -m 120 -N 1 -A TRA23004 -p development
Code Block
languagebash
titleSubmit a batch job
# Using our reservation
sbatch --reseservation=CoreNGS <batch_file>.slurm

Note that the reservation name (CoreNGS) is different from the TACC allocation/project for this class, which is OTH21164.



Table of Contents

Overview

...

Suppose we want to know how many alignments included insertions or deletions (indels) versus the reference. Looking at the CIGAR field definition table above, we see that insertions are denoted with the character I and deletions with the character D. We'll use this information, along with samtools view, cut and grep, to count the number of indels.

...

The result should be 1.22283 %.In addition to the rate, converted to a percentage, we also output the literal percent sign ( % ).

Notes:

  • print 100*6697/547664 calculates the percentage rate
  • The double quotes ( " )

...

  • surround a literal string in awk

...

    • here the percent sign ( % )
  • The comma ( , ) between the number and the string says to separate the two output fields (the rate and the literal text) with whatever the default Output Field Separator (OFS) is.
    • By default,

...

    • awk'Output Field Separator is a single space
      • while it's default Input Field Separator (FS) is whitespace (any number of space and Tab characters)

...

So what if we want to get fancy and do all this in a one-liner command? We can use echo and backtick evaluation to put both numbers in a string, then pipe that string to awk. This time we use awk body code, and refer to the two whitespace-separated fields being passed in: total count ($1) and indel count ($2).

...

Tip

awk also has a printf function, which can take the standard formatting commands (see https://en.wikipedia.org/wiki/Printf_format_string#Type_field).

Code Block
languagebash
awk 'BEGIN{ printf("%.2f %%\n", 100*6697/547664) }'

Notes:

  • The printf arguments are enclosed in parentheses since it is a true awk function.
  • The 1st argument is the format string, enclosed in double quotes.
    • The %.2f format specifier says to output a floating point number with 2 digits after the decimal place.
    • The %% format specifier is used to output a single, literal percent sign ( % ).
    • Unlike the standard print statement, the printf function does not automatically append a newline to the output, so \n is added to the format string here.
  • Remaining All arguments to printf after the format string are the values to be substituted for the format specifiers (here our percentage computation).
    • There must be one argument for every item that is being formatted
      • i.e., one for every percent sign ( % ) except the %% literal.

Filtering by location range

...

Code Block
languagebash
cd $SCRATCH/core_ngs/samtools

# count the number of reads mapped to chromosome 2 (chrII)
samtools view -c -F 0x4 yeast_pe.sort.bam chrII

# count the number of reads mapped to chromosomes 1 or M (chrI, chrM)
samtools view -c -F 0x4 yeast_pe.sort.bam chrI chrM

# count the number of reads mapped to chromosomes 1 that overlap coordinates 1000-2000
samtools view -c -F 0x4 yeast_pe.sort.bam chrI:1000-2000

# since there are only 20 reads in the chrI:1000-2000 region, examine them individually
samtools view -F 0x4 yeast_pe.sort.bam chrI:1000-2000

# look at a subset of field for chrI:1000-2000 reads
# fields:  2=flags, 3=contig, 4=start, 5=mapping quality, 6=CIGAR, 9=insert size
samtools view -F 0x4 yeast_pe.sort.bam chrI:1000-2000 | cut -f 2-6,9

...

Code Block
languagebash
# look at a subset of field for chrI:1000-2000 reads
#  fields: 2=flags, 3=contig, 4=start, 5=mapping quality, 6=CIGAR, 9=insert size
samtools view -F 0x4 yeast_pe.sort.bam chrI:1000-2000 | cut -f 2-6,9 | \
  awk 'BEGIN{FS=OFS="\t"}
       {$1 = sprintf("0x%x", $1); print}'

...

  • If a command line is continued on more than one line, the line continuation character ( \ ) is used.
  • There is a sprintf (string print formatted) function in awk, just as there is in many modern programming languages.
  • We use sprintf to re-format the 1st input field (here the cut flags) in hex
    • using the format directive (%x)
    • also adding a literal "0x" prefix to denote the numeric base is hexadecimal
  • The results of sprintf are stored back into the 1st field ($1), replacing the original value.
  • awk's print statement is then used with no other arguments to print all its input fields, including the changed $1.

...

Expand
titleAnswer 2 (using SAM flag)

Here's another way of doing it:

Code Block
languagebash
samtools view -c -F 0x4 -f 0x2 yeast_pe.sort.bam chrI:1000-2000


About mapping quality

Mapping qualities are quality is a measure of how likely a given sequence alignment to its reported location is correct. If a read's mapping quality is low (especially if it is zero, or mapQ 0 for short) the read maps to multiple locations on the genome (they are multi-hit or multi-mapping reads), and we can and we can't be sure whether the reported location is the correct one. These are called multi-mapping or multi-hit reads.

Aligners also differ in whether they report alternate alignments for multi-hit reads, and if they do, how many. Some things to keep in mind:

...

  • bwa aln (global alignment) and bowtie2 with default parameters (both --local default end-to-end mode) report at most one location for a read that maps
    • this will be the location with the best mapping quality and alignment
    • if a given read maps equally well to multiple locations, these aligners pick one location at random
      • bwa aln will always report a 0 mapping quality for these multi-hit reads
      • bowtie2 will report a low mapping quality (< 10), based on the complexity (information content) of the sequence
  • bwa mem (local alignment) can always report more than one location for a mapped read
    • its definition of a secondary alignment is different (and a bit non-standard)
      • if one part of a read maps to one location and another part maps somewhere else (e.g. because of RNA splicing), the longer alignment is marked as primary and the shorter as secondary.
    • there is no way to disable reporting of secondary alignments with bwa mem.
      • but they can be filtered from the sorted BAM with -F 0x100 (secondary alignment flag = 0).
  • bowtie2 can be configured to report more than one location for a mapped read
    • the -k N option says to report up to N alignments for a read
  • most transcriptome-aware RNA-seq aligners by default report more than one location for a mapped read
    • e.g. hisat2, star, tophat2.
    • when reads are quantified (counted with respect to genes), multiply-mapped reads can be counted fractionally
      • e.g. if a read maps to 5 genes, it can be counted as 1/5 for each of the genes

...

Expand
titleAnswer


Code Block
languagebash
titlesolution
samtools view -c yeast_pe.sort.bam
samtools view -c yeast_pe.sort.filt.bam

# or to be really fancy...
echo "`samtools view -c yeast_pe.sort.bam` \
      `samtools view -c yeast_pe.sort.filt.bam`" | \
       awk '{printf "%d original reads\n", $1;
             printf "%d Q20 reads (%d read pairs)\n", $2, $2/2;
             printf "%0.2f%% high-quality reads\n", 100*$2/$1}'

There were 1,184,360 alignment records in the original BAM, and only 456,890 in the quality-filtered BAM, around 38% of our starting reads.

Since we have only properly paired reads, the filtered BAM will contain equal numbers of both R1s and R2s. So the number of read pairs is 456890/2 or 228451.

Exercise: How many primary aligned reads (0x100 = 0) are in Copy the bwa_local.sort.dup.bam file?

...

local alignment files here:

Code Block
Expand
titleSolution
Code Block
languagebash
titlesolution
samtools view -F 0x104 -c bwa_local.sort.dup.bam

There are 874791 primary alignments.

Code Block
languagebash
titlesolution
samtools view -F 0x4 -f 0x100 -c bwa_local.sort.dup.bam
And 133209 secondary alignments.
languagebash
titleCombining SAM flags

samtools view only pays attention to one -F or -f option, so to specify more than one flag value you need to combine them into one number. For example, combining 0x100 and 0x4 yields 0x104.

Expand
titleHint

You want both the secondary (0x100) and unmapped (0x4) flags to be 0.

solution
cds; cd core_ngs/samtools
cp $CORENGS/catchup/bwa_script/bwa_local.sort.dup.bam* .
cp $CORENGS/catchup/bwa_script/bwa_local.samstats.txt  .

Exercise: How many primary aligned reads (0x100 = 0) are in the bwa_local.sort.dup.bam file?

Tip
titleCombining SAM flags

samtools view only pays attention to one -F or -f option, so to specify more than one flag value you need to combine them into one number. For example, combining 0x100 and 0x4 yields 0x104.


Expand
titleHint

You want both the secondary (0x100) and unmapped (0x4) flags to be 0.


Expand
titleSolution


Code Block
languagebash
titlesolution
samtools view -F 0x104 -c bwa_local.sort.dup.bam

There are 874791 primary alignments.

Code Block
languagebash
titlesolution
samtools view -F 0x4 -f 0x100 -c bwa_local.sort.dup.bam

And 133209 secondary alignments.

Look at the statistics for the bwa mem (local) alignment in the bwa_local.samstats.txt file:

Code Block
languagebash
titlesolution
-----------------------------------------------
             Aligner:       bwa
     Total sequences:   1317569
        Total mapped:   1008000 (76.5 %)
      Total unmapped:    309569 (23.5 %)
             Primary:    874791 (86.8 %)
           Secondary:    133209 (13.2 %)
          Duplicates:    428418 (42.5 %)
          Fwd strand:    502782 (49.9 %)
          Rev strand:    505218 (50.1 %)
          Unique hit:    947418 (94.0 %)
           Multi hit:     60582 (6.0 %)
           Soft clip:    543390 (53.9 %)
           All match:    328038 (32.5 %)
              Indels:      6679 (0.7 %)
             Spliced:
-----------------------------------------------
       Total PE seqs:   1317569
      PE seqs mapped:   1008000 (76.5 %)
        Num PE pairs:    658784
   F5 1st end mapped:    413159 (62.7 %)
   F3 2nd end mapped:    594841 (90.3 %)
     PE pairs mapped:    384462 (58.4 %)
     PE proper pairs:    282664 (42.9 %)
-----------------------------------------------