Introduction to Samtools - manipulating and filtering bam files
As Nathan showed you yesterday, the main type of output from aligning reads to a databases is a binary alignment file, or BAM file. These files are compressed, so they can't be viewed using standard unix file viewers such as more, less and head. Samtools allows you to manipulate the .bam files - they can be converted into a non-binary format (SAM format specification here) and can also be ordered and sorted based on the quality of the alignment. This is a good way to remove low quality reads, or make a BAM file restricted to a single chromosome.
We'll be focusing on just a few of samtools functions in this series of exercises. Since most aligners produce a BAM file, we'll work on some basic manipulations of the BAM files we produced from our alignments yesterday.
Most functionality while using BAM files can be described as such:
- SAM files are converted into BAM files (samtools view)
- BAM files are sorted by reference coordinates (samtools sort)
- Sorted BAM files are indexed (samtools index)
- Sorted, indexed BAM files are filtered based on location, flags, mapping quality (samtools view with filtering options)
Take a look here for a detailed manual page for each function in samtools. These steps presume that you are using a mapper/aligners such as bwa, which records both mapped and unmapped reads - make sure you check how the aligner writes it's output to SAM/BAM format, or you may get a strange surprise in your output aligned files!
The code block below details some basic samtools functionality:
samtools view -o outfile_view.bam infile.bam # use the -c option to just count alignment records samtools sort infile.bam outfile.sorted.bam samtools index aln.sorted.bam
First, logon to stampede and copy the file yeast_pairedend.bam to your $SCRATCH directory:
ssh user@login8.stampede.tacc.utexas.edu cds mkdir samtools cd samtools cp $SCRATCH/core_ngs/alignment/yeast_pairedend.bam .
Sorting and Indexing a bam file: samtools index, sort
Now that we have a BAM file, we need to index it. All BAM files need an index, as they tend to be large and the index allows us to perform computationally complex operations on these files without it taking days to complete.
Exercise 1: Sort and index the file "yeast_pairedend.bam", then examine the files you created
Samtools flags and mapping rate: calculating the proportion of mapped reads in an aligned bam file
We have a sorted, indexed BAM file. Now we can use other samtools functionality to filter this file and count mapped vs unmapped reads in a given region. samtools allows you to sort based on certain flags that are specified on page 4 on the sam format specification. We'll focus on a couple, below.
Here are three of the most useful flags to sort on. We'll be using the unmapped flag.
# SAM specifications common flag usage 0x04 = unmapped 0x02 = part of a properly aligned pair 0x400 = optical duplicate # look at samtools rmdup if you need to remove these sequences
Exercise 2: count unmapped reads vs total reads on chromosome III for the yeast_pairedend_sort.bam file you created above. What proportion of the reads are mapped?
Filtering bam files based on mapped status and mapping quality using samtools view
Mapping qualities are a measure of how likely a given sequence alignment to a location is correct. The lowest score is a mapping quality of zero, or mq0 for short. The reads map to multiple places on the genome, and we can't be sure of where the reads originated. To improve the quality of our data, we can remove these low quality reads from our sorted and indexed file.
Exercise 3: Remove unmapped and low quality reads from your bam file.
Now that we have a BAM file with only high-quality mapped reads, we are ready to manipulate this file using bedtools.