Filtering and Screening Variants (GVA14)

Filtering

On any variant caller you use, there will be an inherent trade-off between sensitivity and specificity. Typically, you would carry forward as much data as practical at each processing step, deferring final judgement until later so you have more information. For example, you might not want to exclude low coverage regions in one sample from your raw data because you may be able to infer a genotype based on family information later.

This typically leads to NGS pipelines that maximize sensitivity, leading to a substantial number of false positives. Filtering after variant calling can be useful to eliminate false positives based on all the data available after numerous analyses.

In the samtools/bcftools world, the vcfutils.pl script provides a means to filter SNPs on many criteria.

Exercises

Explore some filter settings for vcfutils.pl varFilter to see how many SNPs get filtered out, using the linux tool xargs to do a parameter sweep.

Filtering, counting, and parameter iteration
# First - create a tiny shell script to run vcfutils and take a single parameter:
echo "#\!/bin/bash" > tiny.sh
echo "echo \"Sweeping with vcfutils.pl, min read depth of: \$1\" " >> tiny.sh
echo "vcfutils.pl varFilter -Q 20 -d \$1 test.raw.vcf | grep -v '^#' | wc -l " >> tiny.sh

# Now make it executable
chmod +x tiny.sh

# Use xargs to do a lovely sweep
echo 2 3 4 5 6 7 | xargs -n 1 tiny.sh
 Output
Sweeping with vcfutils.pl, min read depth of: 2
41443
Sweeping with vcfutils.pl, min read depth of: 3
32652
Sweeping with vcfutils.pl, min read depth of: 4
22861
Sweeping with vcfutils.pl, min read depth of: 5
14605
Sweeping with vcfutils.pl, min read depth of: 6
9112
Sweeping with vcfutils.pl, min read depth of: 7
5809
Exercise

Try modifying tiny.sh on your own to sweep through other filter parameters.

Exercise

Use bedtools and some linux built-in commands to compare the single-sample vcf file(s) to the multiple-sample vcf file (you might need module load bedtools unless you've installed that in your .profile). Do the following:

  1. Count the total number of variants called in test.raw.vcf and all.samtools.vcf
  2. Count how many of the SNPS in these two files are common
  3. Calculate the average and maximum quality values for the SNPs that are in test.raw.vcf but NOT in all.samtools.vcf
  4. Calculate the average and maximum quality values for the SNPs that are in test.raw.vcf
 Hints
  1. Investigate grep -c
  2. Try intersectBed and wc
  3. Try subtractBed and some awk
    Note that for intersections and subtractions on structured data like this, you can use the linux join command too.
 Solution
Comparison of single- and multiple-sample vcf files using linux and bedtools
# This command just counts the # of called variants in test.raw.vcf (from individual NA12878)
grep -c -v '^#' test.raw.vcf

# This command just counts the # of called variants in all 3 individuals
grep -c -v '^#' $BI/ngs_course/human_variation/all.samtools.vcf

# Found out how many are common between the two
intersectBed -a test.raw.vcf -b $BI/ngs_course/human_variation/all.samtools.vcf | wc -l

# Take all those that are not in all.samtools.vcf and examine their quality (in field 6 of the vcf file)
subtractBed -a test.raw.vcf -b $BI/ngs_course/human_variation/all.samtools.vcf | \
   awk 'BEGIN {max=0} {sum+=$6; if ($6>max) {max=$6}} END {print "Average qual: "sum/NR "\tMax qual: " max}'

# Look at all the qualities from the NA12878 variants
grep -v '^#' test.raw.vcf | awk 'BEGIN {max=0} {sum+=$6; if ($6>max) {max=$6}} END {print "Average qual: "sum/NR "\tMax qual: " max}'
 Output
Output of comparison of single- and multiple-sample vcf files
login2$ # This command just counts the # of called variants in test.raw.vcf (from individual NA12878)
login2$ grep -c -v '^#' test.raw.vcf
58049
login2$
login2$ # This command just counts the # of called variants in all 3 individuals
login2$ grep -c -v '^#' $BI/ngs_course/human_variation/all.samtools.vcf
80972
login2$
login2$ # Found out how many are common between the two
login2$ intersectBed -a test.raw.vcf -b $BI/ngs_course/human_variation/all.samtools.vcf | wc -l
52039
login2$
login2$ # Take all those that are not in all.samtools.vcf and examine their quality (in field 6 of the vcf file)
login2$ subtractBed -a test.raw.vcf -b $BI/ngs_course/human_variation/all.samtools.vcf | \
>    awk 'BEGIN {max=0} {sum+=$6; if ($6>max) {max=$6}} END {print "Average qual: "sum/NR "\tMax qual: " max}'
Average qual: 8.70595	Max qual: 212
login2$
login2$ # Look at all the qualities from the NA12878 variants
login2$ grep -v '^#' test.raw.vcf | awk 'BEGIN {max=0} {sum+=$6; if ($6>max) {max=$6}} END {print "Average qual: "sum/NR "\tMax qual: " max}'
Average qual: 43.97	Max qual: 225

This data shows that most of the variants that were called on sample NA12878 but are not in the multi-sample variant calls have lower quality.

For major bonus points and a great THANK YOU from Scott, compute the mean and standard deviation of the intersected and subtracted SNPs from NA12878 vs all and then perform a t-test to make sure the differences are statistically significant using only linux command line tools (probably in a shell script). Yes, it's probably easier in Python, Perl, or R.

 

Other linux utilities useful for making subsets of VCF files and comparing them

 

Make files containing all the het & hom alt alleles from a vcf, and simplify it somewhat:
cat NA12878.raw.vcf | awk 'BEGIN {FS="\t"} {print $2 "\t" substr($10,1,3) "\t" $4 "\t" $5}' \
  | sort -n | grep "0/1" > NA12878.raw.vcf.simple.het
cat NA12878.raw.vcf | awk 'BEGIN {FS="\t"} {print $2 "\t" substr($10,1,3) "\t" $4 "\t" $5}' \
  | sort -n | grep "1/1" > NA12878.raw.vcf.simple.hom

 

 

 

Make a file containing all the het & hom alt alleles from a vcf, with the same simplification we used above:
cat NA12891.raw.vcf | awk 'BEGIN {FS="\t"} {print $2 "\t" substr($10,1,3) "\t" $4 "\t" $5}' \
  | sort -n | grep "0/1" | sort > NA12891.raw.vcf.simple.het
cat NA12892.raw.vcf | awk 'BEGIN {FS="\t"} {print $2 "\t" substr($10,1,3) "\t" $4 "\t" $5}' \
  | sort -n | grep "0/1" | sort > NA12892.raw.vcf.simple.het
cat NA12891.raw.vcf | awk 'BEGIN {FS="\t"} {print $2 "\t" substr($10,1,3) "\t" $4 "\t" $5}' \
  | sort -n | grep "1/1" | sort > NA12891.raw.vcf.simple.hom
cat NA12892.raw.vcf | awk 'BEGIN {FS="\t"} {print $2 "\t" substr($10,1,3) "\t" $4 "\t" $5}' \
  | sort -n | grep "1/1" | sort > NA12892.raw.vcf.simple.hom

 

  1. Now count how many GT are het in both of the second two (parents) but hom in the first (child):

    join NA12892.raw.vcf.simple.het NA12891.raw.vcf.simple.het > both.het
    join both.het NA12878.raw.vcf.simple.hom | wc -l
    

    (would you have expected this result?)

 

  1. Now find which GT are hom in both of the second two (parents) but het in the first (child):

    join NA12892.raw.vcf.simple.hom NA12891.raw.vcf.simple.hom > both.hom
    join both.hom NA12878.raw.vcf.simple.het | wc -l