Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Since BAM files are binary, they can't be viewed directly using standard Unix file viewers such as more, less and head. We have seen how samtools view can be used to binary-format BAM files into text format for viewing. But samtools view also has options that let you do powerul filtering of the output. We focus on this filtering capability in this set of exercises.

The most common samtools view filtering options are:

  • -q N – only report alignment records with mapping quality of at least N (>= N).
  • -f 0xXX – only report alignment records where the specified flags XX are all set (are all 1)
    • you can provide the flags in decimal, or as here as hexidecimal
  • -F 0xXX – only report alignment records where the specified flags XX are all cleared (are all 0)

Setup

Login to login5.ls5.tacc.utexas.edu and set up a directory for these exercises and copy an indexed BAM file there. This is a renamed version of the yeast_pairedend.sort.bam file from our Alignment workflow exercises.

...

Code Block
languagebash
# if needed...
cd $SCRATCH/core_ngs/alignment/samtools
module load samtools

# count the number of reads mapped to chromosome 2 (chrII)
samtools view -c -F 0x4 yeast_pe.sort.bam chrII

# count the number of reads mapped to chromosomes 1 or M (chrI, chrM)
samtools view -c -F 0x4 yeast_pe.sort.bam chrI chrM

# count the number of reads mapped to chromosomes 1 that overlap coordinates 1000-2000
samtools view -c -F 0x4 yeast_pe.sort.bam chrI:1000-2000

# since there are only 20 reads in the chrI:1000-2000 region, examine them individually
samtools view -F 0x4 yeast_pe.sort.bam chrI:1000-2000

# look at a subset of field for chrI:1000-2000 reads
#   2=flags, 3=contig, 4=start, 5=mapping quality, 6=CIGAR, 9=insert size
samtools view -F 0x4 yeast_pe.sort.bam chrI:1000-2000 | cut -f 2-6,9
Tip
titlesamtools view -L <bed_file>

You can also provide a BED-format file with one record for each desired overlap region: samtools view -L <bed_file>. This is a quick way to perform one of the functions of bedtools intersect (in BAM to BAM mode).


Since you will find yourself wanting to interpret the flags field, and it's easier to do that when it is represented as hexadecimal, let's use awk to help do that. 

...

Expand
titleAnswer

Here's one way to do it:

Code Block
languagebash
samtools view -F 0x4 yeast_pe.sort.bam chrI:1000-2000 | awk '{$2 = sprintf("0x%x", $2); print $2}' | grep -c '3$'

Filtering high-quality reads

The code block below details some basic samtools functionality:

Code Block
languagebash
titlebasic samtools functionality
samtools view -o outfile_view.bam infile.bam  # use the -c option to just count alignment records
samtools sort infile.bam outfile.sorted.bam
samtools index aln.sorted.bam

First, logon to stampede and copy the file yeast_pairedend.bam to your $SCRATCH directory:

Code Block
languagebash
titlecopy the yeast_pairedend.bam file from yesterday to a new directory "samtools"
ssh user@login8.stampede.tacc.utexas.edu
cd $SCRATCH/core_ngs
mkdir samtools
cd samtools
cp  $SCRATCH/core_ngs/alignment/yeast_pairedend.bam .

Samtools flags and mapping rate: calculating the proportion of mapped reads in an aligned bam file

We have a sorted, indexed BAM file.  Now we can use other samtools functionality to filter this file and count mapped vs unmapped reads in a given region. samtools allows you to sort based on certain flags that are specified on page 4 on the sam format specification.  We'll focus on a couple, below.

Here are three of the most useful flags to sort on. We'll be using the unmapped flag.

Code Block
languagebash
titlea few samtools flags
# SAM specifications common flag usage
0x04 = unmapped
0x02 = part of a properly aligned pair
0x400 = optical duplicate  # look at samtools rmdup if you need to remove these sequences

Exercise 2: count unmapped reads vs total reads on chromosome III for the yeast_pairedend_sort.bam file you created above.  What proportion of the reads are mapped?

...

titleExercise 2 solution

...

About mapping quality

Mapping qualities are a measure of how likely a given sequence alignment to its reported location is correct. If a read's mapping quality is low (especially if it is zero, or mapQ 0 for short) the read maps to multiple locations on the genome (they are multi-hit reads), and we can't be sure whether the reported location is the correct one.

Aligners also differ in whether they report alternate alignments for multi-hit r reads. Some things to keep in mind:

  • Alternate locations for a mapped read are are flagged as secondary (flag 0x100)
  • While they often provide valuable information, secondary reads must be filtered for some downstream applications
    • e.g., ChIP-seq peak calling and variant analysis with GATK.
  • When secondary reads are reported, the total number of alignment records in the BAM file is greater than the number of reads in the input FASTQ files
    • this affects how the true mapping rate must be calculated
    • true mapping rate = (pirmary mapped reads) / (total BAM file sequences - secondary mapped reads)

Here are some examples of how different aligners handle reporting of multi-hit reads and their mapping qualities:

  • bwa aln (global alignment) and bowtie2 with default parameters (both --local and --global) report at most one location for a read that maps
    • this will be the location with the best mapping quality and alignment
    • if a read maps equally well to multiple locations, these aligners pick one location at random
      • bwa aln will always report a 0 mapping quality for these multi-hit reads
      • bowtie2 will report a low mapping quality (< 10), based on the complexity (information content) of the sequence
  • bwa mem (local alignment) can always report more than one location for a mapped read
    • for example, if one part of a read maps to one location and other part maps somewhere else (e.g. because of RNA splicing)
    • there is no way to disable reporting of secondary reads with bwa mem.
  • bowtie2 can be configured to report more than one location for a mapped read
    • the -k N option says to report up to N alignments for a read
  • most transcriptome-aware RNA-seq aligners by default can report more than one location for a mapped read
    • e.g. tophat2, hisat2, star.

Create a BAM containing high-quality, properly paired reads

Using our yeast_pe.sort.bam file, let's do some some quality filtering.

Exercise:  Use samtools view with -F, -f and -q options to create a BAM with unmapped and low quality (< mq 20) reads removed, and only containing mapped, properly paired, high-quality (mapQ 20+) reads. Call the output file yeast_pe.sort.filt.bam.

Expand
titleExercise 3 solution
Code Block
languagebash
titlesolution
module# load samtools                                     # if needed
samtools view -c yeast_pairedend_sort.bam chrIII         # total number of reads on this chromosome: 15503
samtools view -c -F 0X04 yeast_pairedend_sort.bam chrIII # reads on this chromosome which are unmapped: 13973

So the total proportion of reads that were unmapped on chromosome III is 13973/15503 or 90.1%, which is really high!  Only ~10% of reads on this chromosome were able to be mapped to the genome.

Filtering bam files based on mapped status and mapping quality using samtools view

Mapping qualities are a measure of how likely a given sequence alignment to a location is correct. The lowest score is a mapping quality of zero, or mq0 for short. The reads map to multiple places on the genome, and we can't be sure of where the reads originated. To improve the quality of our data, we can remove these low quality reads from our sorted and indexed file.

...

if needed...
cd $SCRATCH/core_ngs/alignment/samtools
module load samtools

samtools view -b -F 0x04 -f 0x2 -q 20 -o yeast_pe.sort.filt.bam yeast_pe.sort.bam

Exercise: If our original BAM contained secondary reads, (0x100 = 1) how would we exclude those also?

Tip
titleCombining SAM flags

samtools view only pays attention to one -F or -f option, so to specify more than one flag value you need to combine them into one number. For example, combining 0x100 and 0x4 yields 0x104.

Expand
titleHint

You want both the secondary (0x100) and unmapped (0x4) flags to be 0.


Expand
titleExercise 3 solution
Code Block
languagebash
titlesolution
module load samtools # if needed
samtools view -b -F 0x04 0x104 -f 0x2 -q 120 -o yeast_pairedend_pe.sort.mapped.q1filt.bam yeast_pairedend_sort.bam

Here is what my output looks like when I use ls -lah after running the above commands:

Code Block
languagebash
titlesamtools filtering output
-rwxrwxrwx 1 awh394 G-801021  42M May 26 14:26 yeast_pairedend_sort.mapped.q1pe.sort.bam