Page Comparison

...

We'll return to this example data later to demonstrate a much more involved tool, GATK, to do the same steps in another tutorial.

...

Expand

title	Expand here if you'd like to try this on your own...

Let's use Anna Battenhouse's shell script align_bwa.sh to align the paired end fastq file to the hg19 version of the genome calling the output trios_mapping. The input fastq file can be found at $BI/ngs_course/human_variation/allseqs_R1.fastq or if you have already copied today's materials to your scratch directory at $SCRATCH/BDIB_Human_tutorial/raw_files. Throughout the course we have been telling you to copy things to a local location before working with them, and you can continue to do so, but it is important to remember this is not necessarily a requirement.

If you use launcher_creator.py to create the job submission script. Be sure to append &> "meaningful file name" to the end of the script to capture the standard output and standard

Expand

title	Click here for a hint

Type align_bwa.sh without any options to gain insight into the programs expectations. As you might have put together, the "BWA" part of the script name represents the BWA mapping tool, but what isn't obvious is that it also uses SAMtools. Make sure that these things are available before starting the commands

Code Block

language	bash
title	If on an iDev node:
collapse	true

module load bwa
module load samtools
cds
cd BDIB_Human_tutorial
align_bwa.sh raw_files/allseqs_R1.fastq trios_mapping hg19 1

Expand

title	If you got an error message check here for a possible reason why

Code Block

title	If your error message was this:

Fastq read1 File 'raw_files/allseqs_R1.fastq' not found...exiting

Code Block

language	bash
title	Did you remember to do this after the copy command finished?

cd $SCRATCH/BDIB_Human_tutorial/rawfiles
gunzip *.gz  # this will unzip all the compressed files you have just copied

Code Block

language	bash
title	Make job submission script for mapping & variant calling
collapse	true

echo "align_bwa.sh $BI/ngs_course/human_variation/allseqs_R1.fastq trios_mapping hg19 &> aln.trios_mapping.log" > commands  # this is an alternative way to generate a commands file than nano, use nano if more comfortable, or try this new option
launcher_creator.py -l map_call.sge -n map_call -t 01:00:00 -j commands -A UT-2015-05-18
module load bwa
module load samtools
qsub map_call.sge

Code Block

title	"ls -lt" output ... (a way of checking if things completed correctly
collapse	true

-rw-rw-r-- 1 ded G-802740       392 May 28 03:13 trios_mapping.flagstat.txt
-rw-rw-r-- 1 ded G-802740   2149304 May 28 03:13 trios_mapping.sorted.bam.bai
-rw-rw-r-- 1 ded G-802740 334770401 May 28 03:13 trios_mapping.sorted.bam
-rw-rw-r-- 1 ded G-802740 411155732 May 28 03:12 trios_mapping.bam
-rw-rw-r-- 1 ded G-802740  62257040 May 28 03:01 trios_mapping.read2.sai
-rw-rw-r-- 1 ded G-802740  61884104 May 28 02:51 trios_mapping.read1.sai

Code Block

title	"samtools flagstat trios_mapping.sorted.bam" expected results
collapse	true

4546280 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
3992266 + 0 mapped (87.81%:nan%)
4546280 + 0 paired in sequencing
2273140 + 0 read1
2273140 + 0 read2
40234 + 0 properly paired (0.88%:nan%)
3636928 + 0 with itself and mate mapped
355338 + 0 singletons (7.82%:nan%)
43366 + 0 with mate mapped to a different chr
15352 + 0 with mate mapped to a different chr (mapQ>=5)

...

We would normally use the BAM file from the previous mapping step to call variants in this raw data. However, for the purposes of this course we will use the actual BAM file provided by the 1000 Genomes Project (from which the .fastq file above was derived, leading to some oddities in it).

$SCRATCH/BDIB_Human_tutorial/raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam

...

Here is an honest write-up about the Ts/Tv metric, referencing this 2002 paper by Ebersberger. Bottom line: between about 2.1 and 2.8 is OK for Ts/Tv (also called Ti/Tv). You can also get some quick stats with some linux one-liners on this page; there are more thorough analysis programs built to work with vcf's.

...

Expand

title	Explanation of command, and discussion of the output

Here are the steps going into this command:

1) Dump the contents of all.samtools.vcf

2) Take the first 10,000 lines

3) If the variant quality score is greater than 500, then print fields 2 (SNP position), 10, 11, and 12 (the 3 genotypes).

4) Filter for only lines that have at least one homozygous SNP (exercise to the reader to understand why...)

5) Break the genotype call apart from other information about depth: "sed" turns the colons into spaces so that awk can just print the genotype fields.

6) Take the last 100 lines, sort them, then count the unique lines

Here is my interpretation of the data:

1) This method effectively looks at a very narrow genomic region, probably within a homologous recombination block.

2) The most telling data: the child will have heterozygous SNPs from two homozygous parents.

3) So all this data is consistent with column 1 (NA12878) being the child:

	 12 0/0	0/1	0/0
      5 0/0	0/1	0/1
      4 0/1	0/0	0/1
      8 0/1	0/0	1/1
     43 0/1	0/1	0/0
     24 0/1	1/1	0/0

"Outlier" data are:

      3 0/1	0/0	0/0

      1 1/1	0/1	0/0

This is, in fact, the correct assessment - NA12878 is the child.

GATK

GATK deserves it's own page which is here, but we've already run it and will now look at some differences in the SNP calls.

...

You can't get gene-context information from bcftools - you have to shift to variant annotation tools to do that.

Side-note on `samtools mpileup`

...

Versions Compared

Old Version 4

New Version Current

Key

GATK

Side-note on `samtools mpileup`

Page Comparison

Versions Compared

Old Version 4

New Version Current

Key

GATK

Side-note on samtools mpileup

Side-note on `samtools mpileup`