...

  1. compressed raw data (the .fastq.gz files)
  2. mapped data (the .bam files)
  3. variant calls (the .vcf files)
  4. the subdirectory ref with special references
  5. .bam files containing a subset of mapped human whole exome data are also available for these three individuals; those are the three files "NA*.bam".
  6. We've pre-run samtools and GATK on each sample individually; those are the *GATK.vcf and *samtools.vcf files.
  7. We've also pre-run samtools and GATK on the trio, resulting in GATK.all.vcf and samtools.all.vcf. (These files are from old versions.)
  8. The 1000 Genomes project is really oriented to producing .vcf files; the file "ceu20.vcf" contains all the latest genotypes from this trio based on abundant data from the project.

Single-sample variant calling with bcftools

We would normally use the BAM file from a previous mapping step to call variants in this raw data. However, for the purposes of this course we will use the actual BAM file provided by the 1000 Genomes Project (from which the .fastq file above was derived, leading to some oddities in it). As a bonus tutorial, you could map the data yourself using what you learned in the bowtie2 tutorial and then use the resulting .bam files.

...

Warning
Remember to make sure you are on an idev node

You are probably not on an idev node right now, because copying the files while on an idev node causes problems, as discussed. Remember that the hostname command and showq -u can be used to check whether you are on one of the login nodes or one of the compute nodes.

If you need more information or help re-launching a new idev node, please see this tutorial.

You should request at least 60 minutes for the idev session so that the commands have time to finish running.
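The node check described in this warning can be sketched as follows. This is only an illustration: the classify_host helper is hypothetical, and it assumes that login nodes have "login" somewhere in their hostname, which may not hold on every system. The showq call is skipped when the command is not installed.

```shell
# Hypothetical helper: guess the node type from a hostname.
# ASSUMPTION: login nodes contain "login" in their name.
classify_host() {
  case "$1" in
    *login*) echo "login" ;;
    *)       echo "other" ;;
  esac
}

# Fall back to uname -n if the hostname command is unavailable.
current_host=$(hostname 2>/dev/null || uname -n)
echo "Current host: $current_host ($(classify_host "$current_host"))"

# showq -u lists your own jobs; only run it where the scheduler tools exist.
if command -v showq >/dev/null 2>&1; then
  showq -u
else
  echo "showq not found; skipping the queue check."
fi
```

If the host is reported as "login", start an idev session before running any of the analysis commands.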

Recall that we installed samtools and bcftools in our GVA-SNV conda environment. Make sure you have activated your GVA-SNV environment and that you have access to samtools and bcftools version 1.15.1.
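One possible way to check both versions in a single step is sketched below. The check_version function is a hypothetical helper, not part of samtools or bcftools; it only assumes that each tool prints its name and version on the first line of its --version output.

```shell
# Activate the environment first (left commented so the sketch is safe to paste):
# conda activate GVA-SNV

expected="1.15.1"

# Hypothetical helper: report whether a tool is on PATH and whether the
# first line of its --version output mentions the expected version.
check_version() {
  tool="$1"
  want="$2"
  if ! command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: not found (is the environment activated?)"
    return 1
  fi
  first_line=$("$tool" --version | head -n 1)
  case "$first_line" in
    *"$want"*) echo "$tool: OK ($first_line)" ;;
    *)         echo "$tool: expected $want, got: $first_line"; return 1 ;;
  esac
}

# "|| true" keeps the shell session alive even if a check fails.
check_version samtools "$expected" || true
check_version bcftools "$expected" || true
```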

...

What to do if you do not get version 1.15.1 for both samtools and bcftools in the above version checks?

If you are not seeing the correct versions, there is a problem with either activating or creating your environment. Try to activate the environment again, go back to the SNV tutorial, or ask for help before continuing.

...

One potential issue with this type of approach is that vcf files only record variation that can be seen with the data provided. When all reads mapping to a given location exactly match the reference (i.e. the sample is homozygous wildtype relative to the reference), no record is written for that location, which looks the same as if you had no data in that region. This leads us to our next topic.
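The ambiguity can be seen in a toy example. The sketch below builds a tiny made-up VCF (the contig name, positions, and qualities are invented for illustration) and counts records at two positions; count_at is a hypothetical helper.

```shell
# Build a tiny, made-up VCF with two variant records.
toy_vcf=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chrToy\t100\t.\tA\tG\t50\t.\t.\n'
  printf 'chrToy\t300\t.\tC\tT\t60\t.\t.\n'
} > "$toy_vcf"

# Hypothetical helper: count records at a position (column 2),
# skipping header lines that start with '#'.
count_at() {
  awk -v pos="$2" '!/^#/ && $2 == pos { n++ } END { print n + 0 }' "$1"
}

echo "records at 100: $(count_at "$toy_vcf" 100)"
echo "records at 200: $(count_at "$toy_vcf" 200)"
```

Position 200 has no record, and nothing in the file says whether that site was covered by reads matching the reference or had no reads at all.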

Multiple-sample variant calling with bcftools

Not being able to distinguish between no data and wildtype is not the end of the world for a single sample, but if you're actually trying to study human (or other organism) genetics, you must discriminate homozygous WT from a lack of data. This is done by providing many samples to the variant caller simultaneously. This concept extends further to populations; calling variants across a large and diverse population provides a stronger Bayesian prior probability distribution for more sensitive detection.
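As a rough sketch of what "providing many samples simultaneously" can look like, bcftools mpileup accepts several BAM files at once, and its output can be piped to bcftools call. The file names below are hypothetical placeholders and this is not necessarily the exact command used in this course. The sketch prints the pipeline instead of running it so it can be inspected first; multi_sample_cmd is a hypothetical helper.

```shell
# HYPOTHETICAL file names; substitute your own reference and BAM files.
ref="reference.fasta"
bams="sample1.bam sample2.bam sample3.bam"
out="multi-sample.sketch.vcf"

# Print the pipeline rather than run it.
#   mpileup -f : reference FASTA the reads were mapped against
#   call -m    : multiallelic caller
#   call -v    : output variant sites only
#   call -o    : output file
multi_sample_cmd() {
  echo "bcftools mpileup -f $1 $2 | bcftools call -m -v -o $3"
}

multi_sample_cmd "$ref" "$bams" "$out"
```

Because all of the BAM files go through one mpileup step, the caller evaluates every sample at every candidate site, so a sample that matches the reference at a site that is variable in another sample is reported with a genotype rather than being absent.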

...

Based on the discussion above, we are selecting the first solution and providing details of how this command should be run. As this command will generate very little output and take ~30 minutes to complete, you are once again reminded that the output file from this option is available at $BI/gva_course/GVA.multi-sample.vcf if you want to work with it without waiting for the analysis to run yourself.
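If you would rather start from the pre-computed file, a minimal sketch for copying it into your working directory is shown below. It assumes the $BI variable is set in your shell as elsewhere in the course; fetch_precomputed is a hypothetical helper that reports a missing file instead of failing silently.

```shell
# Hypothetical helper: copy the pre-computed multi-sample VCF from the
# course directory (first argument) into a destination directory (second).
fetch_precomputed() {
  src="$1/gva_course/GVA.multi-sample.vcf"
  if [ -f "$src" ]; then
    cp "$src" "$2" && echo "copied $src"
  else
    echo "not found: $src (is \$BI set?)"
    return 1
  fi
}

# Copy into the current directory; "|| true" keeps the session alive on failure.
fetch_precomputed "${BI:-}" . || true
```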

...