Samtools mpileup (GVA14)
The utility of samtools mpileup
The samtools mpileup
command can form the basis of a basic genotyper directly.
The command itself essentially transposes a bam file. The rows of a sam/bam file are oriented around reads and give you very little context of the reference. samtools mpileup rows are oriented around genome coordinates with information about all reads (base-pairs really) underlying that position.
Start with the basic command and see what it does on one of the BAM files (head gives us just the first 10 lines):
samtools mpileup NA12892.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam | head
You'll see that the second to last field is what we're after. The period "." and comma "," marks indicate that a read was present matching the reference - "." if the read is on the forward strand, "," if the read is on the reverse strand. A/a, C/c, G/g, or T/t will be printed when a read contains that base (upper case = forward strand, lower case = reverse strand). There are special characters indicating the start and end of reads. You can also get the base quality scores out if you'd like too (last column). Here is the full format specification.
Now add information about the reference and notice that you now have the reference base:
samtools mpileup -f ref/hs37d5.fa NA12892.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam | head
You see from these first few lines that there aren't many base-pairs at each location. That's because the read coverage is very low there. Let's add a filter using awk that requires there to be at least 5 reads over a location. This is easy because the fourth column contains the read depth at that location:
samtools mpileup -f ref/hs37d5.fa NA12892.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam | tr /acgt/ /ACGT/ | awk '{if ($4>5) {print $2"\t"$3"\t"$5}}' | head
For good measure, I've also capitalized all the base-pairs (I don't care about strand information for the moment) and I'm only printing out the chromosome 20 location ($2), the reference base ($3), and the pileup of bases at that location ($5).
Now, I'd like to only look at lines that have variants. I'd like to skip spurious sequencing errors, so I'll add one more awk command that counts the number of alternate base-pairs at a location and only prints out the line of the total alternate base-pairs are at least 40% of the read depth at that location (I've also added linux line continuation characters ("\") to separate this one command across many lines so you can read it more clearly:
samtools mpileup -f ref/hs37d5.fa NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam | tr /acgt/ /ACGT/ | \ awk '{if ($4>5) {print $2"\t"$3"\t"$5}}' | \ awk '{m=0; \ for (i=1;i<=length($3);i++) {\ char=substr($3,i,1); \ if ($2!=char&&(char=="A"||char=="C"||char=="G"||char=="T")) { m++\ }\ } \ if (m>0.4*length($3)) {print $0"\t"m} \ }' | head
Now, redirect that to a file and check to see how concordant it is with your samtools or GATK output!
Welcome to the University Wiki Service! Please use your IID (yourEID@eid.utexas.edu) when prompted for your email address during login or click here to enter your EID. If you are experiencing any issues loading content on pages, please try these steps to clear your browser cache.