Human Trios

Diploid genomes

The initial steps in calling variants for diploid or polyploid organisms with NGS data are the same as those we've already seen:

  1. Map the raw data to a suitable reference

  2. Look for SNPs and small-scale indels.

Expectations of the output are quite different, however, and those expectations can add statistical power to uncovering variation in populations, or in organisms with more than two expected variants at the same location. If you happen to be working with a model organism with extensive external data (ESPECIALLY human), then there are even more sophisticated tools, like the Broad Institute's GATK, that can improve both the sensitivity and specificity of your variant calls.

Trio (or familial) analysis has been exceptionally powerful for identifying rare childhood diseases. The most prominent publication in this area is this first example of whole exome sequencing saving a life. There have been many other publications since, and some review articles such as this one. Familial analysis is critical for rare, autosomal dominant diseases because, almost by definition, the mutations may be "private" to each individual, so we can't look across big populations to find one single causative mutation. But within families we can identify damaging private mutations in single genes or pathways, and then look across populations to find commonality at the gene or pathway level to explain a phenotype.

Example: The CEU Trio from the 1000 Genomes Project

Many example datasets are available from the 1000 Genomes Project specifically for method evaluation and training. We'll explore a trio (mom, dad, child); their accession numbers are NA12892, NA12891, and NA12878, respectively. To make the exercise run more quickly, we'll focus only on data from chromosome 20.

All the data we'll use is located here:

Diploid genome (human) example files
$BI/ngs_course/human_variation

This directory contains the following:

  1. compressed raw data (the .fastq.gz files)

  2. mapped data (the .bam files)

  3. variant calls (the .vcf files)

  4. the subdirectory ref with special references

  5. .bam files containing a subset of mapped human whole-exome data for each of the three individuals; these are the three "NA*.bam" files.

  6. We've pre-run samtools and GATK on each sample individually - those are the *GATK.vcf and *samtools.vcf files.

  7. We've also pre-run samtools and GATK on the trio, resulting in GATK.all.vcf and samtools.all.vcf.

  8.  The 1000 Genomes project is really oriented to producing .vcf files; the file "ceu20.vcf" contains all the latest genotypes from this trio based on abundant data from the project. 


Here we show the steps for:

  1. Mapping

  2. Single-sample variant calling with samtools

  3. Multiple-sample variant calling with samtools

We'll return to this example data later to demonstrate a much more involved tool, GATK, to do the same steps in another tutorial.

Optional: Mapping Exercise (simplified way to map with bwa)

Let's use Anna Battenhouse's shell script align_bwa.sh to align the paired-end fastq file to the hg19 version of the genome, calling the output trios_mapping. The input fastq file can be found at $BI/ngs_course/human_variation/allseqs_R1.fastq or, if you have already copied today's materials to your scratch directory, at $SCRATCH/BDIB_Human_tutorial/raw_files. Throughout the course we have been telling you to copy things to a local location before working with them, and you can continue to do so, but it is important to remember that this is not necessarily a requirement.

If you use launcher_creator.py to create the job submission script, be sure to append &> "meaningful file name" to the end of the command to capture the standard output and standard error.

 

Type align_bwa.sh without any options to gain insight into the program's expectations. As you might have put together, the "BWA" part of the script name represents the BWA mapping tool, but what isn't obvious is that it also uses SAMtools. Make sure both are available before starting the commands.

If on an iDev node:
module load bwa
module load samtools
cds
cd BDIB_Human_tutorial
align_bwa.sh raw_files/allseqs_R1.fastq trios_mapping hg19 1
Make job submission script for mapping & variant calling
echo "align_bwa.sh $BI/ngs_course/human_variation/allseqs_R1.fastq trios_mapping hg19 &> aln.trios_mapping.log" > commands  # an alternative to building the commands file with nano; use nano if more comfortable
launcher_creator.py -l map_call.sge -n map_call -t 01:00:00 -j commands -A UT-2015-05-18
module load bwa
module load samtools
qsub map_call.sge

 

 

"ls -lt" output (a way of checking whether things completed correctly)
-rw-rw-r-- 1 ded G-802740       392 May 28 03:13 trios_mapping.flagstat.txt
-rw-rw-r-- 1 ded G-802740   2149304 May 28 03:13 trios_mapping.sorted.bam.bai
-rw-rw-r-- 1 ded G-802740 334770401 May 28 03:13 trios_mapping.sorted.bam
-rw-rw-r-- 1 ded G-802740 411155732 May 28 03:12 trios_mapping.bam
-rw-rw-r-- 1 ded G-802740  62257040 May 28 03:01 trios_mapping.read2.sai
-rw-rw-r-- 1 ded G-802740  61884104 May 28 02:51 trios_mapping.read1.sai
"samtools flagstat trios_mapping.sorted.bam" expected results
4546280 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
3992266 + 0 mapped (87.81%:nan%)
4546280 + 0 paired in sequencing
2273140 + 0 read1
2273140 + 0 read2
40234 + 0 properly paired (0.88%:nan%)
3636928 + 0 with itself and mate mapped
355338 + 0 singletons (7.82%:nan%)
43366 + 0 with mate mapped to a different chr
15352 + 0 with mate mapped to a different chr (mapQ>=5)

 

Single-sample variant calling with samtools

We would normally use the BAM file from the previous mapping step to call variants in this raw data. However, for the purposes of this course we will use the actual BAM file provided by the 1000 Genomes Project (from which the .fastq file above was derived, which leads to some oddities in that file).

$SCRATCH/BDIB_Human_tutorial/raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam

With samtools, this is a two-step process:

  1. The samtools mpileup command transposes the mapped data in a sorted BAM file fully to genome-centric coordinates. It starts at the first base on the first chromosome for which there is coverage and prints out one line per base. Each line has information on every base observed in the raw data at that position, along with a lot of auxiliary information depending on which flags are set. It calculates the Bayesian prior probability given by the data, but does not estimate a real genotype.

  2. The bcftools view command, with a few options added, uses the prior probability distribution and the data to calculate a genotype for the variants detected.

Here are the commands, piped together (ONLY run this directly if you are in an idev session - NOT on a head node!):

Make sure you are on an idev node, or submit as job

This command will take quite a bit of time to complete, so you may want to open a new terminal window and log back into Lonestar to do other things while it continues to run.

Calling variants using samtools and bcftools
cds
cd BDIB_Human_tutorial
mkdir samtools_example
cd samtools_example
samtools mpileup -uf ../raw_files/ref/hs37d5.fa ../raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam | bcftools view -vcg - > trios_tutorial.raw.vcf
launcher_creator.py -n samtools_test -b "samtools mpileup -uf ref/hs37d5.fa NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam | bcftools view -vcg - > trios_tutorial.raw.vcf" -t 01:00:00 -q development -A UT-2015-05-18 -m samtools
qsub samtools_test.sge

 

Note that the reference chosen for mpileup must be exactly the same as the reference used to create the BAM file. The 1000 Genomes Project created its own reference, so the command listed above won't work - we have to use the 1000 Genomes reference, which is $BI/ngs_course/human_variation/ref/hs37d5.fa or $SCRATCH/BDIB_Human_tutorial/raw_files/hs37d5.fa. We could have chosen another mapper if we'd wanted to, though.

Exercise:

Write a modified version of the command above to use the proper reference into a commands file, create the job submission script with launcher_creator.py and submit the job.

As we did for mapping, we need to submit these to the Lonestar queue:

Submission of variant calling commands
echo "samtools mpileup -uf $SCRATCH/BDIB_Human_tutorial/raw_files/hs37d5.fa $SCRATCH/BDIB_Human_tutorial/raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam | bcftools view -vcg - > trios_tutorial.raw.vcf" > commands
launcher_creator.py -l call_vars.sge -n call_vars -t 01:00:00 -j commands -A UT-2015-05-18
qsub call_vars.sge

 

After your single-sample variant calling job completes

If on an iDev node, you will need to put the running command into the background first
mkdir single-sample-variant-speedup
cd single-sample-variant-speedup
ln -s $SCRATCH/BDIB_Human_tutorial/raw_files/NA12878.chrom20.samtools.vcf trios_tutorial.raw.vcf  # creates a symbolic link named trios_tutorial.raw.vcf pointing at the pre-computed NA12878.chrom20.samtools.vcf


You can continue with the tutorial in this directory, but remember you can return to your job directory if you want to see your actual data.

more trios_tutorial.raw.vcf

Note the extensive header information, followed by an ordered list of variants.

The bcftools auxiliary perl script vcfutils.pl can provide some useful stats. It's safe to run this on the head node since it's quick. It is not part of a standard TACC module, but it's in our common $BI/bin directory.

Running vcfutils.pl
vcfutils.pl qstats trios_tutorial.raw.vcf
login1$ vcfutils.pl qstats trios_tutorial.raw.vcf
QUAL #non-indel #SNPs #transitions #joint ts/tv #joint/#ref #joint/#non-indel
186 1011 1011 757 0 2.9803 0.0000 0.0000 2.9803
142 2036 2036 1441 0 2.4218 0.0000 0.0000 2.0059
122 3091 3091 2156 0 2.3059 0.0000 0.0000 2.1029
109 4177 4177 2925 0 2.3363 0.0000 0.0000 2.4259
99.5 5237 5237 3647 0 2.2937 0.0000 0.0000 2.1361
92 6273 6273 4354 0 2.2689 0.0000 0.0000 2.1489
85.5 7328 7328 5105 0 2.2964 0.0000 0.0000 2.4704
79 8352 8352 5811 0 2.2869 0.0000 0.0000 2.2201
74 9369 9369 6497 0 2.2622 0.0000 0.0000 2.0725
69 10553 10553 7311 0 2.2551 0.0000 0.0000 2.2000
65 11782 11782 8176 0 2.2673 0.0000 0.0000 2.3764
61 13059 13059 9090 0 2.2902 0.0000 0.0000 2.5179
57 14280 14280 9945 0 2.2941 0.0000 0.0000 2.3361
53 15301 15301 10656 0 2.2941 0.0000 0.0000 2.2935
48.8 16323 16323 11347 0 2.2803 0.0000 0.0000 2.0876
45 17460 17460 12122 0 2.2709 0.0000 0.0000 2.1409
41.8 18639 18639 12899 0 2.2472 0.0000 0.0000 1.9328
39.8 19660 19660 13572 0 2.2293 0.0000 0.0000 1.9339
37.8 21013 21013 14496 0 2.2243 0.0000 0.0000 2.1538
35.8 22309 22309 15380 0 2.2197 0.0000 0.0000 2.1456
33.8 23568 23568 16251 0 2.2210 0.0000 0.0000 2.2448
31.8 24846 24846 17101 0 2.2080 0.0000 0.0000 1.9860
29.8 26171 26171 17975 0 2.1931 0.0000 0.0000 1.9379
28 27308 27308 18688 0 2.1680 0.0000 0.0000 1.6816
26 28654 28654 19551 0 2.1478 0.0000 0.0000 1.7867
24 29947 29947 20371 0 2.1273 0.0000 0.0000 1.7336
22 31050 31050 21065 0 2.1097 0.0000 0.0000 1.6968
20 32081 32081 21719 0 2.0960 0.0000 0.0000 1.7347
17.1 33251 33251 22446 0 2.0774 0.0000 0.0000 1.6411
13.2 34447 34447 23238 0 2.0732 0.0000 0.0000 1.9604
10.4 35751 35751 24067 0 2.0598 0.0000 0.0000 1.7453
9.53 36772 36772 24724 0 2.0521 0.0000 0.0000 1.8049
8.65 38023 38023 25539 0 2.0457 0.0000 0.0000 1.8693
7.8 39301 39301 26360 0 2.0369 0.0000 0.0000 1.7965
6.98 40665 40665 27196 0 2.0192 0.0000 0.0000 1.5833
6.2 42394 42394 28243 0 1.9958 0.0000 0.0000 1.5352
5.46 44018 44018 29211 0 1.9728 0.0000 0.0000 1.4756
4.77 45553 45553 30168 0 1.9609 0.0000 0.0000 1.6557
4.13 47020 47020 31081 0 1.9500 0.0000 0.0000 1.6480
3.54 48572 48572 32063 0 1.9422 0.0000 0.0000 1.7228
3.01 50463 50463 33139 0 1.9129 0.0000 0.0000 1.3202

 

Here is an honest write-up about the Ts/Tv metric, referencing this 2002 paper by Ebersberger. Bottom line: a Ts/Tv (also called Ti/Tv) between about 2.1 and 2.8 is OK. You can also get some quick stats with some Linux one-liners on this page; there are more thorough analysis programs built to work with VCFs.

tacc:$SCRATCH/BDIB_Human_tutorial/single-sample-variant-speedup$ grep -c INDEL trios_tutorial.raw.vcf
3432
tacc:$SCRATCH/BDIB_Human_tutorial/single-sample-variant-speedup$ cat trios_tutorial.raw.vcf | awk 'BEGIN {FS=";"} {for (i=1;i<=NF;i++) {if (index($i,"AF1")!=0) {print $i} }}' | \
    awk 'BEGIN {FS="="} {print int($2*10)/10}' | sort | uniq -c | sort -n -r | head
  36757 1
  19404 0.5
   1819 0.4
     36 0.6
     17 0.8
     16 0.7
      1 0

Keep in mind that variant files only record variation that can be seen with the data provided. Wherever the sample sequence exactly matches the reference (i.e. is homozygous wild-type relative to the reference) there will be no data - which looks exactly the same as having no data at all in those regions. This leads us to our next topic.
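As a toy illustration with a fabricated two-line VCF body (not the course data): a site whose reads all match the reference simply never appears in the file, so an absent position is ambiguous between "homozygous reference" and "no coverage".

```shell
# Two fake variant records on chromosome 20, at positions 100 and 300 only.
printf '20\t100\t.\tA\tG\t50\t.\tDP=20\n20\t300\t.\tC\tT\t60\t.\tDP=15\n' > toy.vcf

# Position 200 has no record -- hom-ref, or simply no reads? The VCF can't say.
awk '$2 == 200 { found = 1 }
     END { if (!found) print "20:200 absent: hom-ref or no coverage?" }' toy.vcf
```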

Multiple-sample variant calling with samtools

This is all fine, but if you're actually trying to study human (or other organism) genetics, you must discriminate homozygous WT from a lack of data. This is done by providing many samples to the variant caller simultaneously. This concept extends further to populations; calling variants across a large and diverse population provides a stronger Bayesian prior probability distribution for more sensitive detection.

To instruct samtools to call variants across many samples, you must simply give it mapped data with each sample tagged separately. Samtools allows two methods to do this:

  1. By providing separate bam files for each sample, like this:

    samtools multi-sample variants: separate bam files

    samtools mpileup -uf $SCRATCH/BDIB_Human_tutorial/raw_files/ref/hs37d5.fa \
        $SCRATCH/BDIB_Human_tutorial/raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
        $SCRATCH/BDIB_Human_tutorial/raw_files/NA12891.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
        $SCRATCH/BDIB_Human_tutorial/raw_files/NA12892.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
        | bcftools view -vcg - > trios_tutorial.all.samtools.vcf
  2. By providing one or more bam files, each containing mapped reads from multiple samples tagged with unique samtools @RG tags.

    samtools multi-sample variants: one or more bam files using @RG

    samtools mpileup -uf $SCRATCH/BDIB_Human_tutorial/raw_files/ref/hs37d5.fa all.bam | bcftools view -vcg - > trios_tutorial.all.raw.vcf

The output file from the first option is in $BI/ngs_course/human_variation as all.samtools.vcf

CAVEAT: If you intend to use the second option, you must make sure you tag your reads with the right @RG tag. This can easily be done during the samse or sampe stage of mapping with bwa's -r option, with samtools merge, with the Picard AddOrReplaceReadGroups command, or with your own perl/python/bash commands. Note that our align_bwa.sh script takes care of this detail for us.
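For illustration (the sample name and PL tag here are hypothetical), the string given to -r is just a tab-delimited @RG header line whose ID and SM fields let samtools keep the samples apart:

```shell
# Hypothetical sample; ID and SM are what downstream tools use to split samples.
SAMPLE=NA12878
RG="@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tPL:ILLUMINA"

printf '%b\n' "$RG"   # the header line as it will appear in the BAM

# Not run here -- how it would be passed during the sampe stage:
# bwa sampe -r "$RG" hg19.fa read1.sai read2.sai read1.fastq read2.fastq > tagged.sam
```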

Identify the lineage

If genetics works, you should be able to identify the child based strictly on the genotypes.  Can you do it?

You're trying to find the genotypes in the all.samtools.vcf and all.GATK.vcf files, and then use your knowledge of Mendelian inheritance to figure out which of the three samples is the only one that could be a child of the other two. 

cat all.samtools.vcf | head -10000 | awk '{if ($6>500) {print $2"\t"$10"\t"$11"\t"$12}}' | grep "0/0" | sed s/':'/' '/g | awk '{print $2"\t"$5"\t"$8}' | tail -100 | sort | uniq -c
tacc:/scratch/01057/sphsmith/human_variation$ cat all.samtools.vcf | head -10000 | awk '{if ($6>500) {print $2"\t"$10"\t"$11"\t"$12}}' | grep "0/0" | sed s/':'/' '/g | awk '{print $2"\t"$5"\t"$8}' | tail -100 | sort | uniq -c
     12 0/0 0/1 0/0
      5 0/0 0/1 0/1
      3 0/1 0/0 0/0
      4 0/1 0/0 0/1
      8 0/1 0/0 1/1
     43 0/1 0/1 0/0
     24 0/1 1/1 0/0
      1 1/1 0/1 0/0

Here are the steps going into this command:

1) Dump the contents of all.samtools.vcf

2) Take the first 10,000 lines

3) If the variant quality score is greater than 500, then print fields 2 (SNP position), 10, 11, and 12 (the 3 genotypes).

4) Filter for only lines where at least one sample has a homozygous reference (0/0) genotype (exercise to the reader to understand why...)

5) Break the genotype call apart from other information about depth: "sed" turns the colons into spaces so that awk can just print the genotype fields.

6) Take the last 100 lines, sort them, then count the unique lines
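To see these mechanics on a single line of input, here is the same pipeline run on one fabricated VCF line (all values invented; the sample fields use a samtools-style GT:PL:DP format):

```shell
# One made-up VCF data line: 9 fixed columns, then three samples as GT:PL:DP.
printf '20\t68303\t.\tA\tG\t999\t.\tDP=100\tGT:PL:DP\t0/1:12,0:10\t0/0:0,30:12\t1/1:200,0:9\n' |
awk '{if ($6>500) {print $2"\t"$10"\t"$11"\t"$12}}' | # keep pos + 3 sample columns if QUAL > 500
grep "0/0" |                                          # require a homozygous-reference genotype
sed s/':'/' '/g |                                     # colons -> spaces so awk sees subfields
awk '{print $2"\t"$5"\t"$8}'                          # print just the three GT subfields
```

This prints the three genotypes 0/1, 0/0 and 1/1; the position field is dropped by the last awk, just as in the full command.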

 

Here is my interpretation of the data:

1) This method effectively looks at a very narrow genomic region, probably within a homologous recombination block.

2) The most telling data: the child will have heterozygous SNPs from two homozygous parents.

3) So all this data is consistent with column 1 (NA12878) being the child:

12 0/0 0/1 0/0
5 0/0 0/1 0/1
4 0/1 0/0 0/1
8 0/1 0/0 1/1
43 0/1 0/1 0/0
24 0/1 1/1 0/0

"Outlier" data are:

3 0/1 0/0 0/0

1 1/1 0/1 0/0

 

This is, in fact, the correct assessment - NA12878 is the child.
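The Mendelian reasoning above can be sketched as a tiny shell helper (our own toy function, not a standard tool; it simply checks whether each child allele can be drawn from a different parent, ignoring de novo mutations and phasing):

```shell
# mendel_ok CHILD PARENT1 PARENT2 -- genotypes in VCF 0/1 style.
mendel_ok() {
  echo "$1 $2 $3" | awk '{
    split($1, c, "/"); split($2, p1, "/"); split($3, p2, "/")
    ok = 0
    for (i = 1; i <= 2; i++) {           # try both allele-to-parent assignments
      a = c[i]; b = c[3 - i]
      if ((a == p1[1] || a == p1[2]) && (b == p2[1] || b == p2[2])) ok = 1
    }
    print (ok ? "consistent" : "inconsistent")
  }'
}

mendel_ok "0/1" "0/0" "1/1"   # het child from opposite homozygous parents -> consistent
mendel_ok "1/1" "0/0" "0/0"   # a 1/1 child is impossible from two 0/0 parents -> inconsistent
```

Run against the tallied rows above with column 1 (NA12878) as the child, every row except the two "outlier" rows comes out consistent.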

GATK

GATK deserves its own page, which is here, but we've already run it and will now look at some differences in the SNP calls.

Look at the differences between single- and multiple-sample SNP calls, and between Samtools/Bcftools SNP calls and GATK SNP calls

To keep things a bit more separated, use these commands to gather the relevant data files
cds
cd BDIB_Human_tutorial
mkdir method_comparison
cd method_comparison
cp ../raw_files/*.vcf .

 

Try to find your own way of answering each of the following, then compare it to the results of these one-liners.

A note on one-liners

One-liners represent more advanced work, and it is unlikely that you would come up with the same approach without spending a lot of time on these kinds of tasks. They also use several Linux commands not covered in this class, though answers may be possible in other ways that do not require those commands. The commands are listed here in a more advanced form to serve as a future reference if you are trying to answer these or similar questions.

for file in `ls *.vcf`; do echo "File: $file `cat $file | grep -v '^#' | wc -l`"; done
cat NA128*samtools.vcf | grep -v '^#' | awk '{print $2}' | sort | uniq > all.single.samtools.snps
cat all.samtools.vcf | grep -v '^#' | awk '{print $2}' | sort | uniq > all.multiple.samtools.snps
comm all.single.samtools.snps all.multiple.samtools.snps | awk 'BEGIN {print "Only in single\tOnly in multiple\tIn both"; FS="\t"} {i[NF]++} END {print i[1]"\t"i[2]"\t"i[3]}'
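To see what the final comm | awk step is counting, here it is run on two tiny made-up position lists (comm marks lines unique to file 1, unique to file 2, or common with 0, 1, or 2 leading tabs, which awk turns into NF values of 1, 2, and 3):

```shell
# Hypothetical position lists, already in lexicographic order as comm requires.
printf '100\n200\n300\n' > single.pos      # stand-in for the single-sample positions
printf '200\n300\n400\n500\n' > multi.pos  # stand-in for the multi-sample positions

comm single.pos multi.pos | awk 'BEGIN {print "Only in single\tOnly in multiple\tIn both"; FS="\t"} {i[NF]++} END {print i[1]"\t"i[2]"\t"i[3]}'
```

Here the counts come out 1, 2 and 2: position 100 is only in the first list, 400 and 500 only in the second, and 200 and 300 are shared.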

 

In theory, GATK does a much better job ruling out false positives. 

grep -v '^#' all.GATK.vcf | awk '{$2=$2"AAAAAAA"; print}' | sort -k 2 > g.v
grep -v '^#' all.samtools.vcf | awk '{$2=$2"AAAAAAA"; print}' | sort -k 2 > s.v
join -a 1 -1 2 -2 2 g.v s.v | awk '{if (NF==12) {print}}' | grep PASS > g.v.pass.only
cat g.v.pass.only | sort -k 6 -n -r | head

 

What's going on here:

1) Yank out all the variant calls (comment lines start with '#') and append "AAAAAAA" to the position field so that "sort" orders lines in a way that "join" will later accept

2) Join the two files using their chromosome position as the join field, but also include any lines from the GATK file that DON'T match the samtools file. Use "awk" to figure out which lines came only from GATK (they are missing a bunch of fields that would have come from the samtools calls), keep only those that GATK has labeled "PASS", and write them to a file.

3) Sort the resultant file on the variant quality value - take the top 10 lines.

 

You will note that many of these are complex variants, particularly insertions, so it's not too surprising that GATK does better. But here's a SNP that GATK does much better on:

chr21:34278313

It has an interesting quantitative signature though... you might want to look at it in IGV.

 

Other notes on bcftools

bcftools has many other sub-commands, such as performing association tests and estimating allele frequency spectra.

You can't get gene-context information from bcftools - you have to shift to variant annotation tools to do that.

Side-note on samtools mpileup

If you don't need a variant caller that uses a full Bayesian inference engine with robust filtering, protections, etc., but just want to look for weird base artifacts within your mapped data, check out the samtools mpileup output directly.
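As a sketch of that idea (the two pileup lines below are fabricated, and real pileup strings also carry ^/$ and indel markup that this toy ignores): in the read-bases column, '.' and ',' are reference matches while letters are mismatching bases, so counting letters gives a crude per-position mismatch tally.

```shell
# Two made-up pileup lines: chrom, pos, ref base, depth, read bases, base quals.
printf '20\t100\tA\t5\t..,.,\tIIIII\n20\t101\tC\t5\t.,TT,\tIIIII\n' |
awk '{ n = gsub(/[ACGTNacgtn]/, "", $5)   # count (and strip) mismatching base letters
       print $2": "n" of "$4" reads mismatch the reference" }'
```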