Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

We'll return to this example data later to demonstrate a much more involved tool, GATK, to do the same steps in another tutorial.

...

Expand
titleExpand here if you'd like to try this on your own...

Let's use Anna Battenhouse's shell script align_bwa.sh to align the paired end fastq file to the hg19 version of the genome calling the output trios_mapping. The input fastq file can be found at $BI/ngs_course/human_variation/allseqs_R1.fastq or if you have already copied today's materials to your scratch directory at $SCRATCH/BDIB_Human_tutorial/raw_files. Throughout the course we have been telling you to copy things to a local location before working with them, and you can continue to do so, but it is important to remember this is not necessarily a requirement.

If you use launcher_creator.py to create the job submission script. Be sure to append &> "meaningful file name" to the end of the script to capture the standard output and standard

 

Expand
titleClick here for a hint

Type align_bwa.sh without any options to gain insight into the programs expectations. As you might have put together, the "BWA" part of the script name represents the BWA mapping tool, but what isn't obvious is that it also uses SAMtools. Make sure that these things are available before starting the commands

Code Block
languagebash
titleIf on an iDev node:
collapsetrue
module load bwa
module load samtools
cds
cd BDIB_Human_tutorial
align_bwa.sh raw_files/allseqs_R1.fastq trios_mapping hg19 1
Expand
titleIf you got an error message check here for a possible reason why
Code Block
titleIf your error message was this:
Fastq read1 File 'raw_files/allseqs_R1.fastq' not found...exiting
Code Block
languagebash
titleDid you remember to do this after the copy command finished?
cd $SCRATCH/BDIB_Human_tutorial/rawfiles
gunzip *.gz  # this will unzip all the compressed files you have just copied
Code Block
languagebash
titleMake job submission script for mapping & variant calling
collapsetrue
echo "align_bwa.sh $BI/ngs_course/human_variation/allseqs_R1.fastq trios_mapping hg19 &> aln.trios_mapping.log" > commands  # this is an alternative way to generate a commands file than nano, use nano if more comfortable, or try this new option
launcher_creator.py -l map_call.sge -n map_call -t 01:00:00 -j commands -A UT-2015-05-18
module load bwa
module load samtools
qsub map_call.sge

 


Code Block
title"ls -lt" output ... (a way of checking if things completed correctly
collapsetrue
-rw-rw-r-- 1 ded G-802740       392 May 28 03:13 trios_mapping.flagstat.txt
-rw-rw-r-- 1 ded G-802740   2149304 May 28 03:13 trios_mapping.sorted.bam.bai
-rw-rw-r-- 1 ded G-802740 334770401 May 28 03:13 trios_mapping.sorted.bam
-rw-rw-r-- 1 ded G-802740 411155732 May 28 03:12 trios_mapping.bam
-rw-rw-r-- 1 ded G-802740  62257040 May 28 03:01 trios_mapping.read2.sai
-rw-rw-r-- 1 ded G-802740  61884104 May 28 02:51 trios_mapping.read1.sai
Code Block
title"samtools flagstat trios_mapping.sorted.bam" expected results
collapsetrue
4546280 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
3992266 + 0 mapped (87.81%:nan%)
4546280 + 0 paired in sequencing
2273140 + 0 read1
2273140 + 0 read2
40234 + 0 properly paired (0.88%:nan%)
3636928 + 0 with itself and mate mapped
355338 + 0 singletons (7.82%:nan%)
43366 + 0 with mate mapped to a different chr
15352 + 0 with mate mapped to a different chr (mapQ>=5)

...

We would normally use the BAM file from the previous mapping step to call variants in this raw data. However, for the purposes of this course we will use the actual BAM file provided by the 1000 Genomes Project (from which the .fastq file above was derived, leading to some oddities in it). 

$SCRATCH/BDIB_Human_tutorial/raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam

...

Here is an honest write-up about the Ts/Tv metricreferencing this 2002 paper by Ebersberger.  Bottom line: between about 2.1 and 2.8 is OK for Ts/Tv (also called Ti/Tv). You can also get some quick stats with some linux one-liners on this page; there are more thorough analysis programs built to work with vcf's.

...

Expand
titleExplanation of command, and discussion of the output

Here are the steps going into this command:

1) Dump the contents of all.samtools.vcf

2) Take the first 10,000 lines

3) If the variant quality score is greater than 500, then print fields 2 (SNP position), 10, 11, and 12 (the 3 genotypes).

4) Filter for only lines that have at least one homozygous SNP (exercise to the reader to understand why...)

5) Break the genotype call apart from other information about depth: "sed" turns the colons into spaces so that awk can just print the genotype fields.

6) Take the last 100 lines, sort them, then count the unique lines

 

Here is my interpretation of the data:

1) This method effectively looks at a very narrow genomic region, probably within a homologous recombination block.

2) The most telling data: the child will have heterozygous SNPs from two homozygous parents.

3) So all this data is consistent with column 1 (NA12878) being the child:

	 12 0/0	0/1	0/0
5 0/0 0/1 0/1
4 0/1 0/0 0/1
8 0/1 0/0 1/1
43 0/1 0/1 0/0
24 0/1 1/1 0/0

"Outlier" data are:

      3 0/1	0/0	0/0
      1 1/1	0/1	0/0
 

This is, in fact, the correct assessment - NA12878 is the child.

GATK

GATK deserves it's own page which is here, but we've already run it and will now look at some differences in the SNP calls.

...

You can't get gene-context information from bcftools - you have to shift to variant annotation tools to do that.

Side-note on samtools mpileup

...