DNAseq Variant Calling Pipeline

Identification and annotation of SNPs and/or somatic mutations relative to a reference genome. 10-hour minimum ($730 internal, $930 external) per project.

1. Quality Assessment

Data quality is assessed with FastQC and SAMStat; the results are evaluated before any downstream analysis.

  • Deliverables:

    • reports generated by FastQC and SAMStat

    • hybrid selection metrics calculated using Picard are also available

  • Tools Used:

    • FastQC: (Andrews 2010) used to generate quality summaries of data:

      • Per base sequence quality report: useful for deciding if trimming necessary.

      • Sequence duplication levels: evaluation of library complexity.

      • Overrepresented sequences: evaluation of adapter contamination.

    • SAMStat: (Lassmann et al. 2011) provides summary statistics at both the FASTQ and SAM/BAM alignment levels.

    • Picard CalculateHsMetrics: (http://broadinstitute.github.io/picard) evaluates hybrid selection protocols (target coverage and AT/GC dropout levels).
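
To illustrate what the per base sequence quality report summarizes, here is a minimal Python sketch that computes the mean Phred score at each read position. It is not FastQC's implementation. It assumes four-line FASTQ records and Phred+33 encoding, and the function name per_base_mean_quality and the two example reads are made up for illustration.

```python
import io
from collections import defaultdict


def per_base_mean_quality(handle, offset=33):
    """Return the mean Phred quality at each read position (0-based).

    Assumes four-line FASTQ records and Phred+33 encoding (both assumptions).
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for i, line in enumerate(handle):
        if i % 4 != 3:  # the 4th line of each record holds the quality string
            continue
        for pos, ch in enumerate(line.rstrip("\n")):
            totals[pos] += ord(ch) - offset
            counts[pos] += 1
    return [totals[p] / counts[p] for p in sorted(counts)]


# Two made-up reads; the second has a low-quality base at position 2.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACG\n+\nII#\n")
means = per_base_mean_quality(example)
print(means)
```

A drop in the mean toward the 3' end of the reads is the usual signal that trimming may be worthwhile.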

2. Mapping

Reads are mapped to the reference genome using BWA-mem (alternative algorithms available on request).

  • Deliverables:

    • BAM files from the initial alignment (BWA-mem by default, though other algorithms are available if desired)

    • BAM files resulting from further processing using GATK

  • Tools Used:

    • BWA-mem: (Li 2013) primary aligner used to generate first pass read alignments (BWA-aln and BWA-sampe also available if desired, as are bowtie/bowtie2).

    • GATK: (McKenna et al. 2010, Van der Auwera et al. 2013) IndelRealigner and BaseRecalibrator applied to correct misalignments around indels and to improve the accuracy and dispersion of individual base quality scores.
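
As a rough sketch of the first-pass alignment step, the Python snippet below assembles a bwa mem command line without running it. The -t (threads) and -R (read group) options are standard bwa mem options. The file names, sample name, thread count, and the helper bwa_mem_command are hypothetical, and the exact options used in this pipeline are not stated here. A read group is included because GATK tools expect one.

```python
import shlex


def bwa_mem_command(reference, fastq1, fastq2=None, threads=4, sample="sample1"):
    """Build (but do not run) a bwa mem command as an argument list.

    All argument values are placeholders chosen for illustration.
    """
    # bwa expects the two characters backslash + t as the field separator here.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    cmd = ["bwa", "mem", "-t", str(threads), "-R", read_group, reference, fastq1]
    if fastq2 is not None:  # paired-end data: second FASTQ follows the first
        cmd.append(fastq2)
    return cmd


cmd = bwa_mem_command("ref.fa", "reads_1.fq.gz", "reads_2.fq.gz", threads=8)
print(shlex.join(cmd))  # shell-quoted form, suitable for logging
```

bwa mem writes SAM to standard output, so in practice the command would be redirected or piped into a tool that sorts and converts to BAM.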

3a. Variant Calling Option 1: GATK

Genome Analysis Toolkit (GATK) used to call SNPs and indels according to best practices recommended by the Broad Institute.

  • Deliverables:

    • individual sample vcf files output by HaplotypeCaller

    • merged, regenotyped vcf file produced by GenotypeGVCFs and recalibrated by VariantRecalibrator

  • Tools Used (GATK):

    • HaplotypeCaller: reassembles "active regions" and applies PairHMM algorithm to select most likely genotype

    • GenotypeGVCFs: jointly re-genotypes, re-annotates and merges individual sample gVCFs from HaplotypeCaller into single aggregated vcf file

    • VariantRecalibrator: recalibrates variant call probabilities based on call annotations
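
The idea of selecting the most likely genotype can be shown with a toy Python example. This is not the PairHMM algorithm or GATK code. It only covers the final step for a biallelic site, assuming the Phred-scaled genotype likelihoods (the PL field of a VCF record) are already available in the order 0/0, 0/1, 1/1. The genotype quality is taken as the gap between the two smallest PL values, capped at 99, which is a common convention assumed here. The function name and PL values are made up.

```python
GENOTYPES = ("0/0", "0/1", "1/1")  # VCF genotype ordering for a biallelic site


def most_likely_genotype(pl):
    """Pick the genotype with the smallest Phred-scaled likelihood (PL).

    Returns (genotype, quality), where quality is the difference between
    the second-smallest and smallest PL, capped at 99 (assumed convention).
    """
    if len(pl) != 3:
        raise ValueError("expected three PL values for a biallelic site")
    best = min(range(3), key=lambda i: pl[i])
    ordered = sorted(pl)
    quality = min(ordered[1] - ordered[0], 99)
    return GENOTYPES[best], quality


# Made-up PL values: the heterozygous genotype has the lowest PL.
call = most_likely_genotype([120, 0, 85])
print(call)
```

A small gap between the two best PL values means the call is ambiguous, which is one reason joint genotyping across samples can change individual calls.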

3b. Variant Calling Option 2: Somatic Mutation Identification

MuTect and MutSig from the Broad Institute are available for calling somatic mutations; other methods may be available upon request.

  • Deliverables:

    • MuTect and MutSig output files.

4. Annotation

Further annotation of variant calls may be provided using ANNOVAR.

  • Deliverables:

    • ANNOVAR output in tabular format (plain text, CSV, or Excel, as desired).

  • Tools Used:

    • ANNOVAR: (Wang et al. 2010) provides functional annotation of genetic variation encompassing multiple modalities (e.g., gene and region annotation and/or filtration based on established data sets).
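
As an example of working with the tabular deliverable, the Python sketch below keeps only exonic nonsynonymous variants from a tab-delimited annotation table. The column names Func.refGene, ExonicFunc.refGene, and Gene.refGene are assumed to follow ANNOVAR's refGene-based output; the columns actually delivered depend on the annotation databases requested. The gene names and variants are made up.

```python
import csv
import io


def exonic_nonsynonymous(handle):
    """Return rows annotated as exonic nonsynonymous SNVs.

    Column names are an assumption based on refGene-style ANNOVAR output.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row
        for row in reader
        if row.get("Func.refGene") == "exonic"
        and row.get("ExonicFunc.refGene") == "nonsynonymous SNV"
    ]


# Small in-memory table standing in for a delivered file (made-up content).
table = io.StringIO(
    "Chr\tStart\tEnd\tRef\tAlt\tFunc.refGene\tGene.refGene\tExonicFunc.refGene\n"
    "1\t1000\t1000\tA\tG\texonic\tGENE_A\tnonsynonymous SNV\n"
    "1\t2000\t2000\tC\tT\tintronic\tGENE_A\t.\n"
    "2\t3000\t3000\tG\tA\texonic\tGENE_B\tsynonymous SNV\n"
)
hits = exonic_nonsynonymous(table)
print([row["Gene.refGene"] for row in hits])
```

For a real file, the in-memory table would be replaced by an opened file handle, for example open(path, newline="").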