Atlassian uses cookies to improve your browsing experience, perform analytics and research, and conduct advertising. Accept all cookies to indicate that you agree to our use of cookies on your device.
Atlassian uses cookies to improve your browsing experience, perform analytics and research, and conduct advertising. Accept all cookies to indicate that you agree to our use of cookies on your device. Atlassian cookies and tracking notice, (opens new window)
The initial steps in calling variants for diploid or multi-ploid organisms with NGS data are the same as what we've already seen:
Map the raw data to a suitable reference
Look for SNPs and small-scale indels
Expectations of output are quite different however, which can add statistical power to uncovering variation in populations or organisms with more than two expected variants at the same location.
If you happen to be working with a model organism with extensive external data (ESPECIALLY HUMAN), then there are even more sophisticated tools like the Broad Institute's GATK that can improve both sensitivity and specificity of your variant calls.
Keep in mind that this type of trio (or familial) analysis has been exceptionally powerful for identifying rare childhood diseases. The most prominent publication in this area is this first example of whole exome sequencing saving a life. There are many other publications since and some review articles such as this one. Familial analysis is critical for rare, autosomal dominant diseases because, almost by definition, the mutations may be "private" to each individual so we can't look across big populations to find one single causative mutation. But within families, we can identify bad private mutations in single genes or pathways and then look across populations to find commonality at the gene or pathway level to explain a phenotype.
Many example datasets are available from the 1000 genomes project specifically for method evaluation and training. We'll explore a trio (mom, dad, child). Their accession numbers are NA12892, NA12891, and NA12878 respectively. To make the exercise run more quickly, we'll focus on data only from chromosome 20.
All the data we'll use is located here:
Diploid genome (human) example files
$BI/ngs_course/human_variation
This directory contains raw data (the .fastq files), mapped data (the .bam files) and variant calls (the .vcf files). It also contains the subdirectory ref with special references.
The 1000 Genomes project is really oriented to producing .vcf files; the file "ceu20.vcf" contains all the latest genotypes from this trio based on abundant data from the project.
.bam files containing a subset of mapped human whole exome data are also available on these three; those are the three files "NA*.bam".
We've pre-run samtools and GATK on each sample individually - those are the *GATK.vcf and *samtools.vcf files.
We've also pre-run samtools and GATK on the trio, resulting in GATK.all.vcf and samtools.all.vcf.
Note if you're trying this on Stampede: The $BI directory is not accessible from compute nodes on Stampede so you will need to make a copy of your data on $SCRATCH and update file locations accordingly to get this demo to run.
Optional: Mapping Exercise
Let's use Anna Battenhouse's shell script align_bwa.sh to align the fastq files to the hg19 version of the genome, and the python program launcher_creator.py to create the job submission script.
Don't forget that for this to work, you need to have appended $BI/bin to your path.
Move into your scratch directory and then try to figure out how to create and qsub the align_bwa.sh command to align the data file $BI/ngs_course/human_variation/allseqs_R1.fastq against the hg19 reference. Call the output "test".
Make job submission script for mapping & variant calling
Caution: If you are using this example outside an SSC course, you must use the -a option to specify a valid allocation (e.g. BME_2012)
login1$ echo "align_bwa.sh $BI/ngs_course/human_variation/allseqs_R1.fastq test hg19 1" > commands
launcher_creator.py -l map_call.sge -n map_call -t 01:00:00 -j commands
Job file has 1 lines.
Using 12 cores.
Launcher successfully created. Type "qsub map_call.sge" to queue your job.
login1$ qsub map_call.sge
-----------------------------------------------------------------
-- Welcome to the Lonestar4 Westmere/QDR IB Linux Cluster --
-----------------------------------------------------------------
--> Checking that you specified -V...
--> Checking that you specified a time limit...
--> Checking that you specified a queue...
--> Setting project...
--> Checking that you specified a parallel environment...
--> Checking that you specified a valid parallel environment name...
--> Checking that the number of PEs requested is valid...
--> Ensuring absence of dubious h_vmem,h_data,s_vmem,s_data limits...
--> Requesting valid memory configuration (23.4G)...
--> Verifying HOME file-system availability...
--> Verifying WORK file-system availability...
--> Verifying SCRATCH file-system availability...
--> Checking ssh setup...
--> Checking that you didn't request more cores than the maximum...
--> Checking that you don't already have the maximum number of jobs...
--> Checking that you don't already have the maximum number of jobs in queue development...
--> Checking that your time limit isn't over the maximum...
--> Checking available allocation...
--> Submitting job...
Your job 586249 ("map_call") has been submitted
Note that the input is paired-end data.
Your directory should have content like this when done (from ls -lt):
Mapping output
-rw-r--r-- 1 sphsmith G-801020 5732 May 20 23:01 map_call.o586338
-rw------- 1 sphsmith G-801020 392 May 20 23:01 test.flagstat.txt
-rw------- 1 sphsmith G-801020 2175952 May 20 23:01 test.sorted.bam.bai
-rw------- 1 sphsmith G-801020 334782188 May 20 23:01 test.sorted.bam
-rw-r--r-- 1 sphsmith G-801020 13135 May 20 23:00 map_call.e586338
-rw------- 1 sphsmith G-801020 411244396 May 20 23:00 test.bam
-rw------- 1 sphsmith G-801020 45695040 May 20 22:49 test.read2.sai
-rw------- 1 sphsmith G-801020 45372400 May 20 22:39 test.read1.sai
-rw-r--r-- 1 sphsmith G-801020 0 May 20 22:26 map_call.pe586338
and samtools flagstat test.sorted.bam should yield:
samtools flagstat results
Running flagstat...
4546280 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
3992274 + 0 mapped (87.81%:nan%)
4546280 + 0 paired in sequencing
2273140 + 0 read1
2273140 + 0 read2
40290 + 0 properly paired (0.89%:nan%)
3636946 + 0 with itself and mate mapped
355328 + 0 singletons (7.82%:nan%)
44128 + 0 with mate mapped to a different chr
15634 + 0 with mate mapped to a different chr (mapQ>=5)
Single-sample variant calling with samtools
We would normally use the BAM file from the previous mapping step to call variants in this raw data. However, for the purposes of this course we will use the actual BAM file provided by the 1000 Genomes Project (from which the .fastq file above was derived, leading to some oddities in it). $BI/ngs_course/human_variation/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam
With samtools, this is a two-step process:
samtools mpileup command transposes the mapped data in a sorted BAM file fully to genome-centric coordinates. It starts at the first base on the first chromosome for which there is coverage and prints out one line per base. Each line has information on every base observed in the raw data at that base position along with a lot of auxiliary information depending on which flags are set. It calculates the Bayseian prior probability given by the data, but does not estimate a real genotype.
bcftools with a few options added uses the prior probability distribution and the data to calculate a genotype for the variants detected.
Here are the commands, piped together (ONLY run this directly if you are in an idev session - NOT on a head node!):
Note that the reference chosen for mpileup must be exactly the same as the reference used to create the bam file. The 1000 genomes project has created it's own reference and so the command listed above won't work - we have to use the 1000 genomes reference which is $BI/ngs_course/human_variation/ref/hs37d5.fa. We could have chosen another mapper if we'd wanted to though.
Exercise:
Write a modified version of the command above to use the proper reference into a commands file, create the job submission script with launcher_creator.py and submit the job.
As we did for mapping, we need to submit these to the Lonestar queue:
login1$ echo "samtools mpileup -uf /corral-repl/utexas/BioITeam/ngs_course/human_variation/ref/hs37d5.fa \
> /corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
> | bcftools view -vcg - > test.raw.vcf" > commands
login1$ launcher_creator.py -l call_vars.sge -n call_vars -t 01:00:00 -j commands
Job file has 1 lines.
Using 12 cores.
Launcher successfully created. Type "qsub call_vars.sge" to queue your job.
login1$ qsub call_vars.sge
-----------------------------------------------------------------
-- Welcome to the Lonestar4 Westmere/QDR IB Linux Cluster --
-----------------------------------------------------------------
--> Checking that you specified -V...
--> Checking that you specified a time limit...
--> Checking that you specified a queue...
--> Setting project...
--> Checking that you specified a parallel environment...
--> Checking that you specified a valid parallel environment name...
--> Checking that the number of PEs requested is valid...
--> Ensuring absence of dubious h_vmem,h_data,s_vmem,s_data limits...
--> Requesting valid memory configuration (23.4G)...
--> Verifying HOME file-system availability...
--> Verifying WORK file-system availability...
--> Verifying SCRATCH file-system availability...
--> Checking ssh setup...
--> Checking that you didn't request more cores than the maximum...
--> Checking that you don't already have the maximum number of jobs...
--> Checking that you don't already have the maximum number of jobs in queue development...
--> Checking that your time limit isn't over the maximum...
--> Checking available allocation...
--> Submitting job...
Your job 586369 ("call_vars") has been submitted
After your single-sample variant calling job completes
If you don't want to wait, CHANGE TO A NEW DIRECTORY and do this: ln -s $BI/ngs_course/human_variation/NA12878.chrom20.samtools.vcf test.raw.vcf and continue with the rest of these exercises. Remember to return to your job directory if you want to see your actual data.
You can examine some of the variant calls with:
Scanning a vcf file
more test.raw.vcf
Note the extensive header information, followed by an ordered list of variants.
A bcftools auxiliary perl script called vcfutils.pl can provide some useful stats. It's safe to run this on the head node since it's quick. This is not part of a standard TACC module but it's in our common $BI/bin directory.
Keep in mind that variant files only record variation that can be seen with the data provided. Where ever sample sequence exactly matches the reference (i.e. is homozygous wildtype relative to the reference) there will be no data. Which looks the same as if you had no data in those regions; this leads us to our next topic.
Multiple-sample variant calling with samtools
This is all fine, but if you're actually trying to study human (or other organism) genetics, you must discriminate homozygous WT from a lack of data. This is done by providing many samples to the variant caller simultaneously. This concept extends further to populations; calling variants across a large and diverse population provides a stronger Bayesian prior probability distribution for more sensitive detection.
To instruct samtools to call variants across many samples, you must simply give it mapped data with each sample tagged separately. Samtools allows two methods to do this:
By providing separate bam files for each sample, like this:
samtools multi-sample variants: separate bam files
The output file from the first option is in $BI/ngs_course/human_variation
CAVEAT: If you intend to use the second option, you must make sure you tag your reads with the right RG tag; this can easily be done during the samse or sampe stage of mapping with bwa with the -r option, using samtools merge, with picard tools AddOrReplaceReadGroup command, or with your own perl/python/bash commands. Note that our align_bwa.sh script takes care of this detail for us.
Identify the lineage
If genetics works, you should be able to identify the child based strictly on the genotypes. Can you do it?
You're trying to find the genotypes in the all.samtools.vcf and all.GATK.vcf files, and then use your knowledge of Mendelian inheritance to figure out which of the three samples is the only one that could be a child of the other two.
This linux one-liner should give you a snapshot of data sufficient to figure it out:
3) If the variant quality score is greater than 500, then print fields 2 (SNP position), 10, 11, and 12 (the 3 genotypes).
4) Filter for only lines that have at least one homozygous SNP (exercise to the reader to understand why...)
5) Break the genotype call apart from other information about depth: "sed" turns the colons into spaces so that awk can just print the genotype fields.
6) Take the last 100 lines, sort them, then count the unique lines
Here is my interpretation of the data:
1) This method effectively looks at a very narrow genomic region, probably within a homologous recombination block.
2) The most telling data: the child will have heterozygous SNPs from two homozygous parents.
3) So all this data is consistent with column 1 (NA12878) being the child:
In theory, GATK does a much better job ruling out false positives. But are there some SNPs GATK calls with high confidence that Samtools doesn't call at all?
1) Yank out all the variant calls (comments start with '#') and add a string of "AAAAAAA" to them to make "sort" do it's job in a way that "join" will later like
2) Join the two files using their chromosome position as the join field, but also include any lines from the GATK file that DON'T match the samtools file. Use "awk" to figure out which ones came only from GATK (they are missing a bunch of fields from the samtools variant calls), look only for those that GATK has labeled "PASS" and write them to a file.
3) Sort the resultant file on the variant quality value - take the top 10 lines.
You will note that many of these are complex variants, particularly insertions, so it's not too surprising that GATK does better. But here's a SNP that GATK does much better on:
chr21:34278313
It has an interesting quantitative signature though... you might want to look at it in IGV.
If you don't need a variant caller that uses a full Bayesian inference engine with robust filtering, protections, etc. but just want to look for weird base-artifacts within your mapped data, check out samtools mpileup output directly.