Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Migrated to Confluence 5.3

...

We'll return to this example data shortly to demonstrate a much more involved tool, GATK, to do the same steps.

Note if you're trying this on Stampede: The $BI directory is not accessible from compute nodes on Stampede so you will need to make a copy of your data on $SCRATCH and update file locations accordingly to get this demo to run.

...

Expand
titleExpand here if you'd like to try this on your own...

Let's use Anna Battenhouse's shell script align_bwa.sh to align the fastq files to the hg19 version of the genome, and the python program launcher_creator.py to create the job submission script.

Don't forget that for this to work, you need to have appended $BI/bin to your path.

Expand
Show me how to modify my path...
Show me how to modify my path...

BI="/corral-repl/utexas/BioITeam"
PATH=$PATH:$BI/bin
export PATH

Move into your scratch directory and then try to figure out how to create and qsub the align_bwa.sh command to align the data file $BI/ngs_course/human_variation/allseqs_R1.fastq against the hg19 reference. Call the output "test".

 

Expand
Solution
Solution
Code Block
titleMake job submission script for mapping & variant calling
echo "align_bwa.sh $BI/ngs_course/human_variation/allseqs_R1.fastq test hg19 1 > aln.test.log 2>&1" > commands
launcher_creator.py -l map_call.sge -n map_call -t 01:00:00 -j commands
module load bwa
module load samtools
qsub map_call.sge

Caution: If you are using this example outside an SSC course, you must use the -a option to specify a valid allocation (e.g. BME_2012)


Expand
Output
Output
Code Block
login1$ echo "align_bwa.sh $BI/ngs_course/human_variation/allseqs_R1.fastq test hg19 1" > commands
launcher_creator.py -l map_call.sge -n map_call -t 01:00:00 -j commands
Job file has 1 lines.
Using 12 cores.
Launcher successfully created. Type "qsub map_call.sge" to queue your job.
login1$ qsub map_call.sge
-----------------------------------------------------------------
-- Welcome to the Lonestar4 Westmere/QDR IB Linux Cluster --
-----------------------------------------------------------------
--> Checking that you specified -V...
--> Checking that you specified a time limit...
--> Checking that you specified a queue...
--> Setting project...
--> Checking that you specified a parallel environment...
--> Checking that you specified a valid parallel environment name...
--> Checking that the number of PEs requested is valid...
--> Ensuring absence of dubious h_vmem,h_data,s_vmem,s_data limits...
--> Requesting valid memory configuration (23.4G)...
--> Verifying HOME file-system availability...
--> Verifying WORK file-system availability...
--> Verifying SCRATCH file-system availability...
--> Checking ssh setup...
--> Checking that you didn't request more cores than the maximum...
--> Checking that you don't already have the maximum number of jobs...
--> Checking that you don't already have the maximum number of jobs in queue development...
--> Checking that your time limit isn't over the maximum...
--> Checking available allocation...
--> Submitting job...

Your job 586249 ("map_call") has been submitted

Note that the input is paired-end data.

Expand
Output
Output

Your directory should have content like this when done (from ls -lt):

Code Block
titleMapping output
-rw-r--r-- 1 sphsmith G-801020      5732 May 20 23:01 map_call.o586338
-rw------- 1 sphsmith G-801020       392 May 20 23:01 test.flagstat.txt
-rw------- 1 sphsmith G-801020   2175952 May 20 23:01 test.sorted.bam.bai
-rw------- 1 sphsmith G-801020 334782188 May 20 23:01 test.sorted.bam
-rw-r--r-- 1 sphsmith G-801020     13135 May 20 23:00 map_call.e586338
-rw------- 1 sphsmith G-801020 411244396 May 20 23:00 test.bam
-rw------- 1 sphsmith G-801020  45695040 May 20 22:49 test.read2.sai
-rw------- 1 sphsmith G-801020  45372400 May 20 22:39 test.read1.sai
-rw-r--r-- 1 sphsmith G-801020         0 May 20 22:26 map_call.pe586338

and samtools flagstat test.sorted.bam should yield:

Code Block
titlesamtools flagstat results
Running flagstat...
4546280 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
3992274 + 0 mapped (87.81%:nan%)
4546280 + 0 paired in sequencing
2273140 + 0 read1
2273140 + 0 read2
40290 + 0 properly paired (0.89%:nan%)
3636946 + 0 with itself and mate mapped
355328 + 0 singletons (7.82%:nan%)
44128 + 0 with mate mapped to a different chr
15634 + 0 with mate mapped to a different chr (mapQ>=5)

...

You can also get some quick stats with some linux one-liners on this page; there are more thorough analysis programs built to work with vcf's.

...

Expand
titleOne solution...

This linux one-liner should give you a snapshot of data sufficient to figure it out:

Code Block
cat all.samtools.vcf | head -10000 | awk '{if ($6>500) {print $2"\t"$10"\t"$11"\t"$12}}' | grep "0/0" | sed s/':'/' '/g | awk '{print $2"\t"$5"\t"$8}' | tail -100 | sort | uniq -c
Expand
titleHere's the output and discussion
Code Block
tacc:/scratch/01057/sphsmith/human_variation$ cat all.samtools.vcf | head -10000 | awk '{if ($6>500) {print $2"\t"$10"\t"$11"\t"$12}}' | grep "0/0" | sed s/':'/' '/g | awk '{print $2"\t"$5"\t"$8}' | tail -100 | sort | uniq -c 
     12 0/0	0/1	0/0
      5 0/0	0/1	0/1
      3 0/1	0/0	0/0
      4 0/1	0/0	0/1
      8 0/1	0/0	1/1
     43 0/1	0/1	0/0
     24 0/1	1/1	0/0
      1 1/1	0/1	0/0

 

Here are the steps going into this command:

1) Dump the contents of all.samtools.vcf

2) Take the first 10,000 lines

3) If the variant quality score is greater than 500, then print fields 2 (SNP position), 10, 11, and 12 (the 3 genotypes).

4) Filter for only lines that have at least one homozygous SNP (exercise to the reader to understand why...)

5) Break the genotype call apart from other information about depth: "sed" turns the colons into spaces so that awk can just print the genotype fields.

6) Take the last 100 lines, sort them, then count the unique lines

 

Here is my interpretation of the data:

1) This method effectively looks at a very narrow genomic region, probably within a homologous recombination block.

2) The most telling data: the child will have heterozygous SNPs from two homozygous parents.

3) So all this data is consistent with column 1 (NA12878) being the child:

	 12 0/0	0/1	0/0
5 0/0 0/1 0/1
4 0/1 0/0 0/1
8 0/1 0/0 1/1
43 0/1 0/1 0/0
24 0/1 1/1 0/0

"Outlier" data are:

      3 0/1	0/0	0/0
      1 1/1	0/1	0/0
 

This is, in fact, the correct assessment - NA12878 is the child.

 

 

GATK

GATK deserves it's own page which is here, but we've already run it and will now look at some differences in the SNP calls.

...

You can't get gene-context information from bcftools - you have to shift to variant annotation tools to do that.

Side-note on samtools mpileup

...