...
We'll return to this example data later to demonstrate a much more involved tool, GATK, to do the same steps in another tutorial.
...
Expand |
---|
title | Expand here if you'd like to try this on your own... |
---|
|
Let's use Anna Battenhouse's shell script align_bwa.sh to align the paired end fastq file to the hg19 version of the genome calling the output trios_mapping. The input fastq file can be found at $BI/ngs_course/human_variation/allseqs_R1.fastq or if you have already copied today's materials to your scratch directory at $SCRATCH/BDIB_Human_tutorial/raw_files. Throughout the course we have been telling you to copy things to a local location before working with them, and you can continue to do so, but it is important to remember this is not necessarily a requirement. If you use launcher_creator.py to create the job submission script. Be sure to append &> "meaningful file name" to the end of the script to capture the standard output and standard Expand |
---|
title | Click here for a hint |
---|
| Type align_bwa.sh without any options to gain insight into the programs expectations. As you might have put together, the "BWA" part of the script name represents the BWA mapping tool, but what isn't obvious is that it also uses SAMtools. Make sure that these things are available before starting the commands Code Block |
---|
language | bash |
---|
title | If on an iDev node: |
---|
collapse | true |
---|
| module load bwa
module load samtools
cds
cd BDIB_Human_tutorial
align_bwa.sh raw_files/allseqs_R1.fastq trios_mapping hg19 1
|
Expand |
---|
title | If you got an error message check here for a possible reason why |
---|
| Code Block |
---|
title | If your error message was this: |
---|
| Fastq read1 File 'raw_files/allseqs_R1.fastq' not found...exiting |
Code Block |
---|
language | bash |
---|
title | Did you remember to do this after the copy command finished? |
---|
| cd $SCRATCH/BDIB_Human_tutorial/rawfiles
gunzip *.gz # this will unzip all the compressed files you have just copied |
|
Code Block |
---|
language | bash |
---|
title | Make job submission script for mapping & variant calling |
---|
collapse | true |
---|
| echo "align_bwa.sh $BI/ngs_course/human_variation/allseqs_R1.fastq trios_mapping hg19 &> aln.trios_mapping.log" > commands # this is an alternative way to generate a commands file than nano, use nano if more comfortable, or try this new option
launcher_creator.py -l map_call.sge -n map_call -t 01:00:00 -j commands -A UT-2015-05-18
module load bwa
module load samtools
qsub map_call.sge
|
|
Code Block |
---|
title | "ls -lt" output ... (a way of checking if things completed correctly |
---|
collapse | true |
---|
| -rw-rw-r-- 1 ded G-802740 392 May 28 03:13 trios_mapping.flagstat.txt
-rw-rw-r-- 1 ded G-802740 2149304 May 28 03:13 trios_mapping.sorted.bam.bai
-rw-rw-r-- 1 ded G-802740 334770401 May 28 03:13 trios_mapping.sorted.bam
-rw-rw-r-- 1 ded G-802740 411155732 May 28 03:12 trios_mapping.bam
-rw-rw-r-- 1 ded G-802740 62257040 May 28 03:01 trios_mapping.read2.sai
-rw-rw-r-- 1 ded G-802740 61884104 May 28 02:51 trios_mapping.read1.sai |
Code Block |
---|
title | "samtools flagstat trios_mapping.sorted.bam" expected results |
---|
collapse | true |
---|
| 4546280 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
3992266 + 0 mapped (87.81%:nan%)
4546280 + 0 paired in sequencing
2273140 + 0 read1
2273140 + 0 read2
40234 + 0 properly paired (0.88%:nan%)
3636928 + 0 with itself and mate mapped
355338 + 0 singletons (7.82%:nan%)
43366 + 0 with mate mapped to a different chr
15352 + 0 with mate mapped to a different chr (mapQ>=5) |
|
...
We would normally use the BAM file from the previous mapping step to call variants in this raw data. However, for the purposes of this course we will use the actual BAM file provided by the 1000 Genomes Project (from which the .fastq
file above was derived, leading to some oddities in it).
$SCRATCH/BDIB_Human_tutorial/raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam
...
Here is an honest write-up about the Ts/Tv metric, referencing this 2002 paper by Ebersberger. Bottom line: between about 2.1 and 2.8 is OK for Ts/Tv (also called Ti/Tv). You can also get some quick stats with some linux one-liners on this page; there are more thorough analysis programs built to work with vcf's.
...
Expand |
---|
title | Explanation of command, and discussion of the output |
---|
|
Here are the steps going into this command: 1) Dump the contents of all.samtools.vcf 2) Take the first 10,000 lines 3) If the variant quality score is greater than 500, then print fields 2 (SNP position), 10, 11, and 12 (the 3 genotypes). 4) Filter for only lines that have at least one homozygous SNP (exercise to the reader to understand why...) 5) Break the genotype call apart from other information about depth: "sed" turns the colons into spaces so that awk can just print the genotype fields. 6) Take the last 100 lines, sort them, then count the unique lines Here is my interpretation of the data: 1) This method effectively looks at a very narrow genomic region, probably within a homologous recombination block. 2) The most telling data: the child will have heterozygous SNPs from two homozygous parents. 3) So all this data is consistent with column 1 (NA12878) being the child: 12 0/0 0/1 0/0 5 0/0 0/1 0/1 4 0/1 0/0 0/1 8 0/1 0/0 1/1 43 0/1 0/1 0/0 24 0/1 1/1 0/0 "Outlier" data are: 3 0/1 0/0 0/0 1 1/1 0/1 0/0 This is, in fact, the correct assessment - NA12878 is the child. |
GATK
GATK deserves it's own page which is here, but we've already run it and will now look at some differences in the SNP calls.
...
You can't get gene-context information from bcftools - you have to shift to variant annotation tools to do that.
Side-note on samtools mpileup
...