SAM format and samtools

Output files of mapping

Mappers spit out SAM/BAM files as output files. BAM is just a binary format of SAM.

Here is a specification of SAM format SAM specification.

Commonly, SAM files are processed in this order:

SAM files are converted into BAM files (samstools view)
BAM files are sorted by reference coordinates (samtools sort)
Sorted BAM files are indexed (samtools index)

Each step above can be done with commands below

samtools view -bS <samfile> > <bamfile>
samtools sort <bamfile> <prefix of sorted bamfile>
samtools index <sorted bamfile>

Copy an exercise SAM file to your SCRATCH directory

cds
mkdir samtools_exercise
cd samtools_exercise
cp /corral-repl/utexas/BioITeam/core_ngs_tools/yeast_stuff/yeast_chip.sam .

Exercise
Generate "yeast_chip.bam", a sorted bam file named "yeast_chip_sort.bam", and an index file of the sorted bam file

solution

module load samtools
samtools view -bS yeast_chip.sam > yeast_chip.bam
samtools sort yeast_chip.bam yeast_chip_sort
samtools index yeast_chip_sort.bam

Sorting takes quite big memory and long time. In order to save time, copy the results.

CATCH UP

cp /corral-repl/utexas/BioITeam/core_ngs_tools/yeast_stuff/yeast_chip_sort.bam .
cp /corral-repl/utexas/BioITeam/core_ngs_tools/yeast_stuff/yeast_chip_sort.bam.bai .

Basic mapping stats

samtools flagstat <bamfile>

prints out a host of useful stats about your mapping results, such as reads mapped. It can print a lot more information like % properly paired, # of duplicates, but it's simply relying on the information encoded in the second field of the SAM file - the bitwise flag field.

If you're using a mapper that doesn't generate a SAM file, stop what you are doing; go back and get a mapper that makes a SAM file.

Exercise
Print out basic stats about yeast_chip.bam

solution

samtools flagstat yeast_chip_sort.bam

samtools idxstats <indexed bam file>

reports on stats related to the chromosome-based indexing done by samtools index. For each sequence of the reference, it provides:

Sequence name (usually "chr1", etc.)
BP in that sequence
Reads mapping to that sequence
Reads not mapping to that sequence

In miRNA alignment, reference sequence name (3rd field in a sam file) is name of miRNA. Therefore, idxstats gives the number of mapped read on each miRNA.

Watch out for mapper-specific data

Not all mappers actually write data to that bitfield, or may not populate a field if you don't instruct the mapper right; for example, if you have paired-end data but tell the mapper to map both reads as single-end reads, you won't get pairing stats.

Not all mappers write the same information to a BAM file - most won't write unmapped reads unless you ask them to; some will list reads that map to multiple sites in the genome multiple times, others only once, others never.

A dirty secret is that these are often not published because they can raise questions that are uncomfortable or not easy to answer and often irrelevant to the subject of the publication. Because they are often not published, it can be difficult to use these numbers as QC statistics without some context. You should keep track of these as your experiments progress; they can be telltales of contamination (e.g. non-reference sequences), poor isolation (e.g. varying amounts of ribosomal RNA), or other library prep issues (e.g. adaptor dimers).

Additional SAMtools tricks

Extract/print sub alignments in BAM format.

If no region is specified in samtools view command, all the alignments will be printed; otherwise only alignments overlapping the specified regions will be output. A region can be presented, for example, in the following format: ‘chr2’ (the whole chr2), ‘chr2:1000000’ (region starting from 1,000,000bp) or ‘chr2:1,000,000-2,000,000’

Exercise
Count the number of mapped reads overlapping between chromosome III 123456 and chromosome III 124456

solution

samtools view yeast_chip_sort.bam chrIII:123456-124456 | wc -l

Extract/print mapped sub alignments in BAM format

As mentioned above, a bam/sam file includes or does not include unmapped reads depending on mappers or options on mappers. If you use bwa with default options, the output bam includes unmapped reads. In order to extract mapped reads from a bam file, use -F option in samtools view command. -F INT means "Skip alignments with bits present in INT". In other words, -F INT filters reads that have the INT in their flag. Please take a look at page 4 on SAM specification. 0X4 flag is for segment unmapped.

Exercise
Count the number of reads mapped on chromosome III

solution

samtools view -F 0X04 yeast_chip.bam chrIII | wc -l

Extract/print reversely mapped sub alignments in BAM format

If you have a strand-specific RNA-seq data, a mapping direction of a read is critical. For example, you may want to count only reversely-mapped reads on a (-)-direction gene. Directionality of mapped read is also recorded on flag.

Exercise 3
Count the number of reversely-mapped reads overlapping between chromosome III: 123456 and chromosome III: 124456
Hint: flag 0x10 = SEQ being reverse complemented

solution

samtools view -F 0X04 -f 0X10 yeast_chip.bam chrIII:123456-124456 | wc -l

Instead of using samtools, you can do the same (or more sophisticated) task by using linux command tools such as sed and awk.

Exercise 4
Count the total number of mappped reads in 'yeast_chip.sam' (There are at least three ways)

solution

samtools flagstat yeast_chip_sort.bam
samtools view -F 0x04 yeast_chip_sort.bam | wc -l 
cat yeast_chip.sam | grep -v '^@' | awk 'BEGIN{FS="\t"} and($2,0x04)!=4 {print $0}' | wc -l

Before moving on the next step, what on earth is awk? awk is a program language that processes line by line in order to manipulate or extract text. There are a lot of built-in functions in awk, and you can use for-loop, conditional, array, and so on in awk to do complicated tasks. However, learning just three basic concepts can make your analysis a lot easier: i) FS (Field Separator), OFS (Output Field Separator) ii) $integer (field number) iii) arithmetic

Guess the output of a following command:

cat yeast_chip.sam | grep -v '^@' | awk 'BEGIN{OFS=":"} {print $1,$2+10000,$3}' | head

solution

HWI-ST1097:104:D13TNACXX:2:1101:1103:12285:10077:*

HWI-ST1097:104:D13TNACXX:2:1101:1103:12285:10141:*

HWI-ST1097:104:D13TNACXX:2:1101:1105:43321:11113:chrXII

HWI-ST1097:104:D13TNACXX:2:1101:1105:43321:10181:chrXII

HWI-ST1097:104:D13TNACXX:2:1101:1107:11182:10077:*

HWI-ST1097:104:D13TNACXX:2:1101:1107:11182:10141:*

HWI-ST1097:104:D13TNACXX:2:1101:1108:96291:10163:chrIII

HWI-ST1097:104:D13TNACXX:2:1101:1108:96291:10083:chrIII

HWI-ST1097:104:D13TNACXX:2:1101:1116:91356:11123:chrXII

HWI-ST1097:104:D13TNACXX:2:1101:1116:91356:11171:chrXII

More advanced info from a sam file

Since the sam file is tab-delimited text, it's easy to use linux command line tools to tell you other useful information quickly. For example, if you'd like to get a quick histogram of the CIGAR scores in your sam file, you could do this:

head -100000 yeast_chip.sam | awk 'BEGIN {FS="\t"} {print $6}' | sort | uniq -c | sort -n -r | head

Instead of the CIGAR score (field 6, "$6" to awk), a histogram field 9 might be more interesting to the person who made your library. If your mapper wrote to field 9 right.

Keep in mind you have to do this on an UNSORTED BAM file - where the reads are not sorted by genome location. If you try to look at at sorted BAM file, the "head" command will show you the first part of the first reference sequence rather than a random sample of your reads.

For sparse mapping data like ChIP-Seq or RNA-Seq data, you can quickly find the most heavily covered spot in your genome with something like this:

head -100000 yeast_chip.sam | awk 'BEGIN {FS="\t"} {print $3"\t"$4}' | sort | uniq -c | sort -n -r | head

If you're looking for the most heavily covered 1 kb region, let awk do the (crude) math for you:

head -100000 yeast_chip.sam | awk 'BEGIN {FS="\t"} {print $3"\t"int($4/1000)}' | sort | uniq -c | sort -n -r | head

Finally, note that the SAM file specification says that all fields after field 11 are custom, in the format TAG:VTYPE:VALUE. This can be very confusing because multiple mappers may use the same two letter code for TAG but mean something different - know your mapper!

SAM files act as repositories

Although most mappers assume fastq input files, the SAM file concept is intended to be a working repository of sequences at any stage of analysis. It is general enough to hold the raw data from multiple different samples within one SAM file so that, for example, a Bayesian genotyping tool can formulate a stronger association with a putative alternate allele when it scans across an entire family rather than separately through individuals. This information is encoded in the RG field in the SAM file header and on each raw read.

After mapping, other programs can be used to correct for false positive SNP calls near indels, or adjust base quality values by assessing actual sequencing errors, etc. and then store that data back in the BAM file.

More Exercises

Explore the sam files created by various mappers:

Compare similarities and differences in mapping stats.
Parse a SAM file to count the number of unique start sites in your mapping data; again, compare mappers.
Look at the different optional fields (field 12) produced by different mappers
If time permits, try different options on some mappers and use samtools flagstat or awk to look at the differences.