Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  1. compressed raw data (the .fastq.gz files)
  2. mapped data (the .bam files)
  3. variant calls (the .vcf files)
  4. the subdirectory ref with special references
  5. .bam files containing a subset of mapped human whole exome data are also available on these three; those are the three files "NA*.bam".
  6. We've pre-run samtools and GATK on each sample individually - those are the *GATK.vcf and *samtools.vcf files.
  7. We've also pre-run samtools and GATK on the trio, resulting in GATK.all.vcf and samtools.all.vcf. (these files are from old versions)
  8.  The 1000 Genomes project is really oriented to producing .vcf files; the file "ceu20.vcf" contains all the latest genotypes from this trio based on abundant data from the project. 

...

Warning
titleRemember to make sure you are on an idev done

It is unlikely that you are currently on an idev node as copying the files while on an idev node causes problems as discussed. Remember the hostname command and showq -u can be used to check if you are on one of the login nodes or one of the compute nodes.

If you need more information or help re-launching a new idev node, please see this tutorial.

DO NOT run these commands on the head node

Recall that we installed samtools on our GVA2021 conda installation. Make sure you have activated your GVA2021 environment and you have access to samtools and bcftools version 1.9.

...

  1. By providing separate bam files for each sample. Note that in the following code block, he trailing \ symbols tells the command line that you are not done entering the command and not to execute any commands until you hit return without a preceding \ mark.

    Note
    titleWarning, DO NOT run this command yet.
    Code Block
    languagebash
    titlesamtools multi-sample variants: separate bam files
    #samtools mpileup -uf $SCRATCH/GVA_Human_trios/raw_files/ref/hs37d5.fa \
      $SCRATCH/GVA_Human_trios/raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
      $SCRATCH/GVA_Human_trios/raw_files/NA12891.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
      $SCRATCH/GVA_Human_trios/raw_files/NA12892.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
        | bcftools call -v -c - > multi_sample/trios_tutorial.all.samtools.vcf
    


    The output file from this option is in 
    $BI/ngsgva_course/human_variation as GVA2021.all.samtools.vcf.samtools.vcf if you want to work with it without having to wait on the analysis to run personally.


  2. By providing one or more bam files, each containing mapped reads from multiple samples tagged with unique samtools @RG tags.

    Note
    titleDo not run this command

    This command comes with a sizable caveat: If you intend to use this option, you must make sure you tag your reads with the right RG tag; this can easily be done during the samse or sampe stage of mapping with bwa with the -r option, using samtools merge, with picard tools AddOrReplaceReadGroup command, or with your own perl/python/bash commands. The problem is that it requires prior knowledge of all the samples you are going to group together ahead of time, and individual sample input files have to be merged together. There are obviously times that this will be the desired and easier, but for the purpose of this course, we believe it makes more sense to keep all input files separate and only have a single merged output file. As part of the tutorial we do provide the appropriate files necessary for the following command to run.

    Code Block
    titlesamtools multi-sample variants: one or more bam files using @RG
    samtools mpileup -u -f $SCRATCH/GVA_Human_trios/raw_files/ref/hs37d5.fa all.bam | bcftools call -v -c - > trios_tutorial.all.raw.vcf
    
    Tip
    titleAn observation for the individuals interested in phylogenies

    I have a hunch that the use of read tags could be useful for your analysis IF you have a finite set of samples that you intend to analyze, have all the data available at once, and are willing to repeat your analysis from near scratch in the event of either of these factors changing. This is an excellent example of the importance of finding papers that have done similar analyses, and mimicking as much of what they have done as possible/reasonable.

Based on the discussion above, we are selecting the first solution and providing details of how this command should be run. As this command will generate very little output and take ~45 minutes to complete, you are once again reminded that the output file from this option is in available $BI/ngsgva_course/human_variation as GVA2021.all.samtools.vcf vcf if you want to work with it without having to wait on the analysis to run personally.

...

Expand
titleHint...

You're trying to find the genotypes in the trios_tutorial.all.samtools.vcf file, and then use your knowledge of Mendelian inheritance to figure out which of the three samples is the only one that could be a child of the other two. 

Tip
titleIf you are working with provided data rather than generating the data yourself

Notice that if you are working with data provided rather than data generated you may have 2 different file names "trios_tutorial.all.samtools.vcf" if you analyzed the data yourself and "GVA2021.all.samtools.vcf" if you copied the data from the BioITeam directory.

In the next code block you will either need to change the name of the file that you print to the screen with the cat command, or you need to change the name of the file using the move command.

Code Block
languagebash
titleThis linux one-liner should give you a snapshot of data sufficient to figure it out:
collapsetrue
cat trios_tutorial.all.samtools.vcf | tail -10000 | awk '{if ($6>500) {print $2"\t"$10"\t"$11"\t"$12}}' | grep "0/0" | sed s/':'/'\t'/g | awk '{print $2"\t"$4"\t"$6}' | tail -100 | sort | uniq -c | sort -n -r

...