Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Code Block
languagebash
# look at a subset of field for chrI:1000-2000 reads
# fields:  2=flags, 3=contig, 4=start, 5=mapping quality, 6=CIGAR, 9=insert size
samtools view -F 0x4 yeast_pe.sort.bam chrI:1000-2000 | cut -f 2-6,9 | \
  awk 'BEGIN{FS=OFS="\t"}
       {$1 = sprintf("0x%x", $1); print}'

...

Aligners also differ in whether they report alternate alignments for multi-hit reads, and if they do, how many. Some things to keep in mind:

...

  • bwa aln (global alignment) and bowtie2 with default parameters (both --local default end-to-end mode) report at most one location for a read that maps
    • this will be the location with the best mapping quality and alignment
    • if a given read maps equally well to multiple locations, these aligners pick one location at random
      • bwa aln will always report a 0 mapping quality for these multi-hit reads
      • bowtie2 will report a low mapping quality (< 10), based on the complexity (information content) of the sequence
  • bwa mem (local alignment) can always report more than one location for a mapped read
    • its definition of a secondary alignment is different (and a bit non-standard)
      • if one part of a read maps to one location and another part maps somewhere else (e.g. because of RNA splicing), the longer alignment is marked as primary and the shorter as secondary.
    • there is no way to disable reporting of secondary alignments with bwa mem.
      • but they can be filtered from the sorted BAM with -F 0x100 (secondary alignment flag = 0).
  • bowtie2 can be configured to report more than one location for a mapped read
    • the -k N option says to report up to N alignments for a read
  • most transcriptome-aware RNA-seq aligners by default report more than one location for a mapped read
    • e.g. hisat2, star, tophat2.
    • when reads are quantified (counted with respect to genes), multiply-mapped reads can be counted fractionally
      • e.g. if a read maps to 5 genes, it can be counted as 1/5 for each of the genes

...

Expand
titleAnswer


Code Block
languagebash
titlesolution
samtools view -c yeast_pe.sort.bam
samtools view -c yeast_pe.sort.filt.bam

# or to be really fancy...
echo "`samtools view -c yeast_pe.sort.bam` \
      `samtools view -c yeast_pe.sort.filt.bam`" | \
       awk '{printf "%d original reads\n", $1;
             printf "%d Q20 reads (%d read pairs)\n", $2, $2/2;
             printf "%0.2f%% high-quality reads\n", 100*$2/$1}'

There were 1,184,360 alignment records in the original BAM, and only 456,890 in the quality-filtered BAM, around 38% of our starting reads.

Since we have only properly paired reads, the filtered BAM will contain equal numbers of both R1s and R2s. So the number of read pairs is 456890/2 or 228451.

Exercise: How many primary aligned reads (0x100 = 0) are in the Copy the bwa local alignment files here:

Code Block
languagebash
titlesolution
cds; cd core_ngs/samtools
cp $CORENGS/catchup/bwa_script/bwa_local.sort.dup.bam

...

Tip
* .
cp $CORENGS/catchup/bwa_script/bwa_local.samstats.txt  .

Exercise: How many primary aligned reads (0x100 = 0) are in the bwa_local.sort.dup.bam file?

Tip
titleCombining SAM flags

samtools view only pays attention to one -F or -f option, so to specify more than one flag value you need to combine them into one number. For example, combining 0x100 and 0x4 yields 0x104.

...

Expand
titleSolution


Code Block
languagebash
titlesolution
samtools view -F 0x104 -c bwa_local.sort.dup.bam

There are 874791 primary alignments.

Code Block
languagebash
titlesolution
samtools view -F 0x4 -f 0x100 -c bwa_local.sort.dup.bam

And 133209 secondary alignments.

Look at the statistics for the bwa mem (local) alignment in the bwa_local.samstats.txt file:

Code Block
languagebash
titlesolution
-----------------------------------------------
             Aligner:       bwa
     Total sequences:   1317569
        Total mapped:   1008000 (76.5 %)
      Total unmapped:    309569 (23.5 %)
             Primary:    874791 (86.8 %)
           Secondary:    133209 (13.2 %)
          Duplicates:    428418 (42.5 %)
          Fwd strand:    502782 (49.9 %)
          Rev strand:    505218 (50.1 %)
          Unique hit:    947418 (94.0 %)
           Multi hit:     60582 (6.0 %)
           Soft clip:    543390 (53.9 %)
           All match:    328038 (32.5 %)
              Indels:      6679 (0.7 %)
             Spliced:
-----------------------------------------------
       Total PE seqs:   1317569
      PE seqs mapped:   1008000 (76.5 %)
        Num PE pairs:    658784
   F5 1st end mapped:    413159 (62.7 %)
   F3 2nd end mapped:    594841 (90.3 %)
     PE pairs mapped:    384462 (58.4 %)
     PE proper pairs:    282664 (42.9 %)
-----------------------------------------------