Tumor/normal analysis with Virmid (GVA14)

Tumor/normal analyses

Tumor/normal analyses are unique for two key reasons:

  1. "Tumor" is rarely a homogeneous material, both because tumor genomes are inherently unstable (and so may be quite diverse even within a single "tumor" microenvironment) and because most pathology samples contain at least some normal tissue.  The surgeon's job, after all, is to remove ALL the tumor - the pathologist's needs are secondary to the patient's health!
  2. USUALLY, tumor biologists are interested in mutations unique to the tumor ("somatic mutations") relative to the patient's normal tissue ("germ line").

 

Point #1 above means that our typical highly stringent rules for calling homozygous alternate or heterozygous alleles may go out the window.  If just a few cells in a larger tumor body carry a strong driver mutation in only one allele that makes them metastatic, then the "signature" of metastasis might be just a few percent of total reads.  And that may REALLY matter for diagnosis and treatment!
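To put a rough number on "a few percent of total reads": assuming the region is diploid with no copy-number change (an idealization that real tumors often violate), a heterozygous mutation carried by a fraction f of the cells in the sample should show up in about f/2 of the reads. The sketch below only does that arithmetic; the function name expected_vaf is made up for illustration.

```shell
# Expected fraction of reads showing a heterozygous mutation, assuming the
# region is diploid with no copy-number change (an idealization).
# $1 = fraction of cells in the sample that carry the mutation (0 to 1)
expected_vaf() {
    awk -v f="$1" 'BEGIN { printf "%.3f\n", f / 2 }'
}

expected_vaf 0.10   # 10% of cells carry it: prints 0.050
expected_vaf 0.04   #  4% of cells carry it: prints 0.020
```

So a mutation present in 4% of the cells would be expected in only about 2% of reads, which is well below what a standard heterozygous-call threshold would accept.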

But as you've already seen, variant calling produces plenty of false positives as it is, so relaxing stringency even further could make finding the mutations relevant to the cancer a nightmare.

Point #2 is what helps reduce false positives significantly.  Even more so than in trio analysis, a patient's own germ line is the best "reference" (which we assume is "healthy" or at least "not cancer") for subtracting variants not related to the tumor.  This may not help completely with samples that have a very low percentage of the cells we care about (e.g. the metastatic cells mentioned above), but it's way better than nothing!
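The "subtraction" idea can be sketched naively with standard Unix tools: reduce each call set to chrom:pos:ref:alt keys and keep only the keys seen in the tumor. This is just an illustration of the concept, not how a somatic caller actually works (a real caller considers the reads from both samples, not just finished call lists). The function names and the file names tumor.vcf and normal.vcf are placeholders.

```shell
# Print variant keys (chrom:pos:ref:alt) from a VCF, sorted and de-duplicated.
# Header lines starting with '#' are skipped.
vcf_keys() {
    awk -F'\t' '!/^#/ { print $1 ":" $2 ":" $4 ":" $5 }' "$1" | sort -u
}

# Print keys present in the tumor calls ($1) but absent from the normal calls ($2).
tumor_only() {
    tumor_keys=$(mktemp)
    normal_keys=$(mktemp)
    vcf_keys "$1" > "$tumor_keys"
    vcf_keys "$2" > "$normal_keys"
    # comm -23 keeps only lines unique to the first (tumor) file
    comm -23 "$tumor_keys" "$normal_keys"
    rm -f "$tumor_keys" "$normal_keys"
}

# Usage (placeholder file names):
# tumor_only tumor.vcf normal.vcf
```

A plain set difference like this misses the point made above about low-fraction variants: anything the caller never reported in the tumor cannot be recovered by subtraction.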

Virmid

Virmid is a relatively new tool that SPHS likes a lot.  It was published in August of 2013, and downloads are hosted on SourceForge.  It is already installed here: $BI/bin/virmid-1.1.0.

It has some very nice features:

  1. It explicitly models the mixture of normal & tumor in the sample you designate as "tumor", trying to maximize sensitivity without compromising specificity too much.
  2. It's very simple to run - there aren't tons of options to fiddle with, as there are with GATK.
  3. It's very smart - not only does it report somatic mutations (those variants unique to the tumor), it also looks at the overall quantity of data to estimate any changes in heterozygosity.  This additional information about large-scale deletions or duplications in the tumor genome can be valuable for interpreting the disease state.

The downside is that it does NOT consider any external information like dbSNP, indel databases, etc., and it does not give results in "gene context", so you still have to annotate the results (e.g. with Annovar). But that's not too hard now that you know what you're doing!

How to run Virmid

Virmid is yet another Java program - try this to validate the installation:

java -jar $BI/bin/virmid-1.1.0/Virmid.jar

You should get the abbreviated help information back.
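If the command fails instead of printing help, a quick check along these lines can tell you whether the jar path or Java itself is the problem. It assumes $BI is set as elsewhere in this course; check_virmid is a made-up helper name, not part of Virmid.

```shell
# Report whether java is on the PATH and whether the Virmid jar is readable.
# $1 = path to Virmid.jar
# Returns 0 if both checks pass, 1 otherwise.
check_virmid() {
    status=0
    if ! command -v java >/dev/null 2>&1; then
        echo "java not found on PATH (try: module load java64)"
        status=1
    fi
    if [ ! -r "$1" ]; then
        echo "cannot read jar: $1"
        status=1
    fi
    if [ "$status" -eq 0 ]; then
        echo "looks OK: $1"
    fi
    return "$status"
}

# Usage:
# check_virmid "$BI/bin/virmid-1.1.0/Virmid.jar"
```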

 

Try running Virmid on the trio data we already have - perhaps with the child NA12878 as "Disease" and the parent NA12891 as "Normal".  It may not make a lot of biological sense, but it should give output you can look at as an example.  Remember that the relevant reference is in the subdirectory ref.


If you are inside your copy of the human_variation subdirectory, this should work:

Commands to submit a virmid job on Lonestar
launcher_creator.py -n virmid -b "java -d64 -Xms512m -Xmx4g -jar /work/01057/sphsmith/virmid-1.1.0/Virmid.jar -R ref/hs37d5.fa -N NA12891.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam -D NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam" -t 02:00:00 -q normal -a CCBB -m java64
 
(You may have to edit the virmid.sge file so that it says "module load java64" and not just "java64".)
 
qsub virmid.sge
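The virmid.sge edit mentioned above (done before running qsub) can also be made non-interactively. This is a minimal sketch that assumes the bad line consists of nothing but "java64", possibly indented; the exact layout of the generated file is not shown on this page, so inspect it with grep first. The helper name fix_module_line is made up.

```shell
# Rewrite any line that is just "java64" (optionally indented) to
# "module load java64". Lines that already say "module load java64"
# are left untouched.
# $1 = path to the job file
fix_module_line() {
    sed 's/^\([[:space:]]*\)java64[[:space:]]*$/\1module load java64/' "$1" > "$1.tmp" &&
        mv "$1.tmp" "$1"
}

# Usage:
# grep -n java64 virmid.sge      # inspect the file first
# fix_module_line virmid.sge
```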

 

As an exercise for the reader - create a "mixed tumor sample" by making a copy of the raw read data from NA12891 and then adding in a small fraction (maybe 30% to start) of the raw read data from NA12878.  Run Virmid again with that as the Disease sample and NA12878 as the Normal sample.
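One possible shortcut for this exercise is to mix at the alignment level with samtools rather than going back to raw reads. That is a substitution for what the exercise literally asks, and the merged file will carry read groups from both samples, which may or may not matter to Virmid (untested here). The sketch below only prints the commands so you can review them before running anything. The helper name mix_cmds, the seed 42, and the output name mixed.chrom20.bam are arbitrary choices; file names containing spaces are not handled.

```shell
# Print (do not run) samtools commands that merge all of one BAM with a
# random subsample of another. Pipe the output to sh to execute it.
# $1 = BAM used in full      $2 = BAM to subsample
# $3 = fraction to keep (e.g. 0.30)      $4 = output BAM
mix_cmds() {
    frac=${3#0}    # "0.30" -> ".30"; samtools view -s takes SEED.FRACTION
    echo "samtools view -b -s 42$frac -o sub.tmp.bam $2"
    echo "samtools merge $4 $1 sub.tmp.bam"
    echo "samtools index $4"
    echo "rm sub.tmp.bam"
}

mix_cmds NA12891.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
         NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
         0.30 mixed.chrom20.bam
```

Once the printed commands look right, appending "| sh" to the mix_cmds call runs them, and mixed.chrom20.bam can then be passed to Virmid with -D.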