Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

If you see 316 lines of a long scrolling output detailing some copyright information and a bunch of different commands everything is correctly loaded. While individual tools will require different options and the program itself takes many different options only 3 things are ALWAYS required:

flagDescription
-T

Tool name, what tool are you trying to use
-RReference sequence file
-IInput bam file

Stealing a nice mnemonic devices from a GATK toturial (which is condensed below), these 3 arguments don't have to be in this order, but if you learn them in this order, you will be able to remember them if you TRI. Remember, specific tools will require additional arguments.

...

Tutorial: Use GATK to count the number of reads in a bam file

Using the above following information we will use gatk the CountReads tool to count the number of reads in the SRR030257.bam file which was from the NC_012967.fasta reference file. pay Pay attention to the the words in bold and the table/discussion in the previous tutorial section and see if you can figure out how to do this on your own.

Expand
titleNeed a hint?

Don't forget you will still need to start your command with java -jar /opt/apps/gatk/3.5.0/GenomeAnalysisTK.jar to envoke java and gatk before specifying your arguments.

Code Block
titleClick here for the solution
collapsetrue
java -jar /opt/apps/gatk/3.5.0/GenomeAnalysisTK.jar -T CountLoci -R exampleFASTA.fasta -I exampleBAM.bam
Expand
Expected output:
titleCheck your answer

gatk  CountReads -R NC_012967.1.fasta -I SRR030257.bam


Did you get the result you expected?

Expand
titleDid this command do what you expected? click here and continue once you decide if you think it worked as expected

The foloowing line tells us things did not go as intended.

A USER ERROR has occurred: Fasta dict file file:///$SCRATCH/GVA_GATK/NC_012967.1.dict for reference file:///$SCRATCH/GVA_GATK/NC_012967.1.fasta does not exist. Please see http://gatkforums.broadinstitute.org/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference for help creating it

In following that very helpful link you'll see a discussion about a .dict file that is required for fasta references that might normally be generated by GATK but as we did not do the read mapping inside of GATK we lack it. Based on the reading on the linked web page I came up with the following solution to create that dictionary file.

Code Block
languagebash
titleNote that i handle creating the dictionary file directly through GATK rather than through isolating a specific .jar file
gatk CreateSequenceDictionary -R NC_012967.1.fasta -O NC_012967.1.dict

Now if we retry our CountReads command we get very different output (note that yours will have subtle differences in things like the names of directories):

No Format
Using GATK jar /opt/tacc_mounts/home1/01821/ded/local/bin/gatk-package-4.1.2.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/tacc_mounts/home1/01821/ded/local/bin/gatk-package-4.1.2.0-local.jar CountReads -R NC_012967.1.fasta -I SRR030257.bam
09:27:58.776 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/tacc_mounts/home1/01821/ded/local/bin/gatk-package-4.1.2.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
May 31, 2019 9:27:59 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
09:27:59.973 INFO  CountReads - ------------------------------------------------------------
09:27:59.973 INFO  CountReads - The Genome Analysis Toolkit (GATK) v4.1.2.0
09:27:59.973 INFO  CountReads - For support and documentation go to https://software.broadinstitute.org/gatk/
09:27:59.973 INFO  CountReads - Executing as ded@login1 on Linux v4.4.103-6.38-default amd64
09:27:59.973 INFO  CountReads - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_151-b12
09:27:59.973 INFO  CountReads - Start Date/Time: May 31, 2019 9:27:58 AM CDT
09:27:59.973 INFO  CountReads - ------------------------------------------------------------
09:27:59.973 INFO  CountReads - ------------------------------------------------------------
09:27:59.974 INFO  CountReads - HTSJDK Version: 2.19.0
09:27:59.974 INFO  CountReads - Picard Version: 2.19.0
09:27:59.974 INFO  CountReads - HTSJDK Defaults.COMPRESSION_LEVEL : 2
09:27:59.974 INFO  CountReads - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
09:27:59.974 INFO  CountReads - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
09:27:59.974 INFO  CountReads - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
09:27:59.974 INFO  CountReads - Deflater: IntelDeflater
09:27:59.974 INFO  CountReads - Inflater: IntelInflater
09:27:59.974 INFO  CountReads - GCS max retries/reopens: 20
09:27:59.974 INFO  CountReads - Requester pays: disabled
09:27:59.974 INFO  CountReads - Initializing engine
09:28:00.322 INFO  CountReads - Done initializing engine
09:28:00.322 INFO  ProgressMeter - Starting traversal
09:28:00.322 INFO  ProgressMeter -        Current Locus  Elapsed Minutes       Reads Processed     Reads/Minute
09:28:11.059 INFO  CountReads - 7600360 read(s) filtered by: WellformedReadFilter 
09:28:11.061 INFO  ProgressMeter -             unmapped              0.2                     0              0.0
09:28:11.062 INFO  ProgressMeter - Traversal complete. Processed 0 total reads in 0.2 minutes.
09:28:11.062 INFO  CountReads - Shutting down engine
[May 31, 2019 9:28:11 AM CDT] org.broadinstitute.hellbender.tools.CountReads done. Elapsed time: 0.21 minutes.
Runtime.totalMemory()=2847932416
Tool returned:
0


What in all that are we actually looking for you might ask? 

No Format
nopaneltrue
09:28:11.059 INFO  13:05:21,093 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 33CountReads - 7600360 read(s) filtered by: WellformedReadFilter 

09:28:11.062 INFO  ProgressMeter - Traversal complete. Processed 0 total reads (in 0.00%)2 minutes.

This tells us that the bam file contains 33 7600360 total reads and that none were removed by any filtering options. The lack of anything being removed should make sense since we didn't try to filter anything out. Can you figure out why this might be different than the total 9555594 reads that were present in the original fastq files SRR030257_1.fastq and SRR030257_1.fastq (check them in your GVA_bowtie2_mapping directory)?

Expand
titleClick here for an answer

The .bam file (and .sam file only deal with reads that correctly mapped back to to the genome within the bounds of the bowtie2 mapping program. This means the 'missing reads did not map. Perhaps they are adapter dimers or reads that have adapters present on the ends causing mapping problems (could be checked with fastQC fairly simply), Perhaps they are a plasmid not present in the reference genome (could try to assemble this plasmid with SPAdes).


As mentioned this is a very small introduction to GATK adapted from one of the broad's tutorials which can be found here http://gatkforums.broadinstitute.org/gatk/discussion/1209/howto-run-the-gatk-for-the-first-time. Feel free to explore that link and the other tutorial links for taking GATK further.

...