...
If you happen to be working with a model organism with extensive external data (ESPECIALLY HUMAN), then there are even more sophisticated tools like the Broad Institute's GATK that can improve both sensitivity and specificity of your variant calls.
Example: The CEU Trio from the 1000 Genomes Project
Many example datasets are available from the 1000 genomes project specifically for method evaluation and training. We'll explore a trio (mom, dad, child). Their accession numbers are NA12892, NA12891, and NA12878 respectively. To make the exercise run more quickly, we'll focus on data only from chromosome 20.
All the data we'll use is located here:
Code Block | ||
---|---|---|
| ||
$BI/ngs_course/human_variation
|
This directory contains raw data (the .fastq files), mapped data (the .bam files) and variant calls (the .vcf files). It also contains the subdirectory ref
with special references.
The 1000 Genomes project is really oriented to producing .vcf files; the file "ceu20.vcf" contains all the latest genotypes from this trio based on abundant data from the project.
.bam files containing a subset of mapped human whole exome data are also available on these three; those are the three files "NA*.bam".
We've pre-run samtools and GATK on each sample individually - those are the *GATK.vcf and *samtools.vcf files.
We've also pre-run samtools and GATK on the trioKeep in mind that this type of trio (or familial) analysis has been exceptionally powerful for identifying rare childhood diseases. The most prominent publication in this area is this first example of whole exome sequencing saving a life. There are many other publications since and some review articles such as this one. Familial analysis is critical for rare, autosomal dominant diseases because, almost by definition, the mutations may be "private" to each individual so we can't look across big populations to find one single causative mutation. But within families, we can identify bad private mutations in single genes or pathways and then look across populations to find commonality at the gene or pathway level to explain a phenotype.
Example: The CEU Trio from the 1000 Genomes Project
Many example datasets are available from the 1000 genomes project specifically for method evaluation and training. We'll explore a trio (mom, dad, child). Their accession numbers are NA12892, NA12891, and NA12878 respectively. To make the exercise run more quickly, we'll focus on data only from chromosome 20.
All the data we'll use is located here:
Code Block | ||
---|---|---|
| ||
$BI/ngs_course/human_variation
|
This directory contains raw data (the .fastq files), mapped data (the .bam files) and variant calls (the .vcf files). It also contains the subdirectory ref
with special references.
The 1000 Genomes project is really oriented to producing .vcf files; the file "ceu20.vcf" contains all the latest genotypes from this trio based on abundant data from the project.
.bam files containing a subset of mapped human whole exome data are also available on these three; those are the three files "NA*.bam".
We've pre-run samtools and GATK on each sample individually - those are the *GATK.vcf and *samtools.vcf files.
We've also pre-run samtools and GATK on the trio, resulting in GATK.all.vcf and samtools.all.vcf.
...
We'll return to this example data shortly to demonstrate a much more involved tool, GATK, to do the same steps.
Note if you're trying this on Stampede: The $BI directory is not accessible from compute nodes on Stampede so you will need to make a copy of your data on $SCRATCH and update file locations accordingly to get this demo to run.
...
Expand | |||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| |||||||||||||||||||||||||||||||||||||||||||||
Let's use Anna Battenhouse's shell script Don't forget that for this to work, you need to have appended
Move into your scratch directory and then try to figure out how to create and
Note that the input is paired-end data.
|
...
Single-sample variant calling with samtools
We would normally use the BAM file from the previous mapping step to call variants in this raw data. However, for the purposes of this course we will use the actual BAM file provided by the 1000 Genomes Project (from which the .fastq
file above was derived, leading to some oddities in it).$BI/ngs_course/human_variation/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam
With samtools, this is a two-step process:
samtools mpileup
command transposes the mapped data in a sorted BAM file fully to genome-centric coordinates. It starts at the first base on the first chromosome for which there is coverage and prints out one line per base. Each line has information on every base observed in the raw data at that base position along with a lot of auxiliary information depending on which flags are set. It calculates the Bayseian prior probability given by the data, but does not estimate a real genotype.bcftools
with a few options added uses the prior probability distribution and the data to calculate a genotype for the variants detected.
Here are the commands, piped together:
...
title | Calling variants using samtools and bcftools |
---|
would normally use the BAM file from the previous mapping step to call variants in this raw data. However, for the purposes of this course we will use the actual BAM file provided by the 1000 Genomes Project (from which the .fastq
file above was derived, leading to some oddities in it).$BI/ngs_course/human_variation/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam
With samtools, this is a two-step process:
samtools mpileup
command transposes the mapped data in a sorted BAM file fully to genome-centric coordinates. It starts at the first base on the first chromosome for which there is coverage and prints out one line per base. Each line has information on every base observed in the raw data at that base position along with a lot of auxiliary information depending on which flags are set. It calculates the Bayseian prior probability given by the data, but does not estimate a real genotype.bcftools
with a few options added uses the prior probability distribution and the data to calculate a genotype for the variants detected.
Here are the commands, piped together (ONLY run this directly if you are in an idev session - NOT on a head node!):
Code Block | ||
---|---|---|
| ||
samtools mpileup -uf ref/hs37d5.fa \
NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
| bcftools view -vcg - > test.raw.vcf
|
or via qsub:
Code Block |
---|
launcher_creator.py -n samtools_test -b "samtools mpileup -uf $BI/ref_genome/fasta/ucsc/ucsc.hg19.fasta \ $BI/ngs_course/human_variation//hs37d5.fa NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \ | bcftools view -vcg - > test.raw.vcf" -t 01:00:00 -q development -a CCBB -m samtools qsub samtools_test.sge |
Note that the reference chosen for mpileup must be exactly the same as the reference used to create the bam file. The 1000 genomes project has created it's own reference and so the command listed above won't work - we have to use the 1000 genomes reference which is $BI/ngs_course/human_variation/ref/hs37d5.fa
. We could have chosen another mapper if we'd wanted to though.
...
You can also get some quick stats with some linux one-liners on this page; there are more thorough analysis programs built to work with vcf's.
...
By providing separate bam files for each sample, like this:
Code Block title samtools multi-sample variants: separate bam files samtools mpileup -uf ref/hs37d5.fa \ NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \ NA12891.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \ NA12892.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \ | bcftools view -vcg - > all.samtools.vcf
By providing one or more bam files, each containing mapped reads from multiple samples tagged with unique samtools
@RG
tags.Code Block title samtools multi-sample variants: one or more bam files using @RG samtools mpileup -uf hs37d5.fa all.bam | bcftools view -vcg - > all.raw.vcf
...
Expand | ||
---|---|---|
| ||
You're trying to find the genotypes in the |
Expand | |||||||||
---|---|---|---|---|---|---|---|---|---|
| |||||||||
This linux one-liner should give you a snapshot of data sufficient to figure it out:
|
GATK
GATK deserves it's own page which is here, but we've already run it and will now look at some differences in the SNP calls.
...
You can't get gene-context information from bcftools - you have to shift to variant annotation tools to do that.
Side-note on samtools mpileup
...