...
Since BAM files are binary, they can't be viewed directly using standard Unix file viewers such as more, less and head. We have seen how samtools view can be used to binary-format BAM files into text format for viewing. But samtools view also has options that let you do powerul filtering of the output. We focus on this filtering capability in this set of exercises.
The most common samtools view filtering options are:
- -q N – only report alignment records with mapping quality of at least N (>= N).
- -f 0xXX – only report alignment records where the specified flags XX are all set (are all 1)
- you can provide the flags in decimal, or as here as hexidecimal
- -F 0xXX – only report alignment records where the specified flags XX are all cleared (are all 0)
Setup
Login to login5.ls5.tacc.utexas.edu and set up a directory for these exercises and copy an indexed BAM file there. This is a renamed version of the yeast_pairedend.sort.bam file from our Alignment workflow exercises.
...
Code Block | ||
---|---|---|
| ||
# if needed... cd $SCRATCH/core_ngs/alignment/samtools module load samtools # count the number of reads mapped to chromosome 2 (chrII) samtools view -c -F 0x4 yeast_pe.sort.bam chrII # count the number of reads mapped to chromosomes 1 or M (chrI, chrM) samtools view -c -F 0x4 yeast_pe.sort.bam chrI chrM # count the number of reads mapped to chromosomes 1 that overlap coordinates 1000-2000 samtools view -c -F 0x4 yeast_pe.sort.bam chrI:1000-2000 # since there are only 20 reads in the chrI:1000-2000 region, examine them individually samtools view -F 0x4 yeast_pe.sort.bam chrI:1000-2000 # look at a subset of field for chrI:1000-2000 reads # 2=flags, 3=contig, 4=start, 5=mapping quality, 6=CIGAR, 9=insert size samtools view -F 0x4 yeast_pe.sort.bam chrI:1000-2000 | cut -f 2-6,9 |
Tip | ||
---|---|---|
| ||
You can also provide a BED-format file with one record for each desired overlap region: samtools view -L <bed_file>. This is a quick way to perform one of the functions of bedtools intersect (in BAM to BAM mode). |
Since you will find yourself wanting to interpret the flags field, and it's easier to do that when it is represented as hexadecimal, let's use awk to help do that.
...
Expand | |||||
---|---|---|---|---|---|
| |||||
Here's one way to do it:
|
Filtering high-quality reads
The code block below details some basic samtools functionality:
Code Block | ||||
---|---|---|---|---|
| ||||
samtools view -o outfile_view.bam infile.bam # use the -c option to just count alignment records
samtools sort infile.bam outfile.sorted.bam
samtools index aln.sorted.bam |
First, logon to stampede and copy the file yeast_pairedend.bam to your $SCRATCH directory:
Code Block | ||||
---|---|---|---|---|
| ||||
ssh user@login8.stampede.tacc.utexas.edu
cd $SCRATCH/core_ngs
mkdir samtools
cd samtools
cp $SCRATCH/core_ngs/alignment/yeast_pairedend.bam . |
Samtools flags and mapping rate: calculating the proportion of mapped reads in an aligned bam file
We have a sorted, indexed BAM file. Now we can use other samtools functionality to filter this file and count mapped vs unmapped reads in a given region. samtools allows you to sort based on certain flags that are specified on page 4 on the sam format specification. We'll focus on a couple, below.
Here are three of the most useful flags to sort on. We'll be using the unmapped flag.
Code Block | ||||
---|---|---|---|---|
| ||||
# SAM specifications common flag usage
0x04 = unmapped
0x02 = part of a properly aligned pair
0x400 = optical duplicate # look at samtools rmdup if you need to remove these sequences |
Exercise 2: count unmapped reads vs total reads on chromosome III for the yeast_pairedend_sort.bam file you created above. What proportion of the reads are mapped?
...
title | Exercise 2 solution |
---|
...
About mapping quality
Mapping qualities are a measure of how likely a given sequence alignment to its reported location is correct. If a read's mapping quality is low (especially if it is zero, or mapQ 0 for short) the read maps to multiple locations on the genome (they are multi-hit reads), and we can't be sure whether the reported location is the correct one.
Aligners also differ in whether they report alternate alignments for multi-hit r reads. Some things to keep in mind:
- Alternate locations for a mapped read are are flagged as secondary (flag 0x100)
- While they often provide valuable information, secondary reads must be filtered for some downstream applications
- e.g., ChIP-seq peak calling and variant analysis with GATK.
- When secondary reads are reported, the total number of alignment records in the BAM file is greater than the number of reads in the input FASTQ files
- this affects how the true mapping rate must be calculated
- true mapping rate = (pirmary mapped reads) / (total BAM file sequences - secondary mapped reads)
Here are some examples of how different aligners handle reporting of multi-hit reads and their mapping qualities:
- bwa aln (global alignment) and bowtie2 with default parameters (both --local and --global) report at most one location for a read that maps
- this will be the location with the best mapping quality and alignment
- if a read maps equally well to multiple locations, these aligners pick one location at random
- bwa aln will always report a 0 mapping quality for these multi-hit reads
- bowtie2 will report a low mapping quality (< 10), based on the complexity (information content) of the sequence
- bwa mem (local alignment) can always report more than one location for a mapped read
- for example, if one part of a read maps to one location and other part maps somewhere else (e.g. because of RNA splicing)
- there is no way to disable reporting of secondary reads with bwa mem.
- bowtie2 can be configured to report more than one location for a mapped read
- the -k N option says to report up to N alignments for a read
- most transcriptome-aware RNA-seq aligners by default can report more than one location for a mapped read
- e.g. tophat2, hisat2, star.
Create a BAM containing high-quality, properly paired reads
Using our yeast_pe.sort.bam file, let's do some some quality filtering.
Exercise: Use samtools view with -F, -f and -q options to create a BAM with unmapped and low quality (< mq 20) reads removed, and only containing mapped, properly paired, high-quality (mapQ 20+) reads. Call the output file yeast_pe.sort.filt.bam.
Expand | |||||||
---|---|---|---|---|---|---|---|
| |||||||
So the total proportion of reads that were unmapped on chromosome III is 13973/15503 or 90.1%, which is really high! Only ~10% of reads on this chromosome were able to be mapped to the genome. |
Filtering bam files based on mapped status and mapping quality using samtools view
Mapping qualities are a measure of how likely a given sequence alignment to a location is correct. The lowest score is a mapping quality of zero, or mq0 for short. The reads map to multiple places on the genome, and we can't be sure of where the reads originated. To improve the quality of our data, we can remove these low quality reads from our sorted and indexed file.
...
|
Exercise: If our original BAM contained secondary reads, (0x100 = 1) how would we exclude those also?
Tip | ||
---|---|---|
| ||
samtools view only pays attention to one -F or -f option, so to specify more than one flag value you need to combine them into one number. For example, combining 0x100 and 0x4 yields 0x104. |
Expand | ||
---|---|---|
| ||
You want both the secondary (0x100) and unmapped (0x4) flags to be 0. |
Expand | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||||
Here is what my output looks like when I use ls -lah after running the above commands:
|