Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Overview

As mentioned in the introduction tutorial as well as the read processing tutorial, read processing can make a huge impact on downstream work. While cutadapt which was introduced in the read processing tutorial is great for quick evaluation or dealing with a single bad sample, it is not as robust as some other trimmers in particular when it comes to removing sequence that you know shouldn't be present but may exist in odd orientations (such as adapter sequences from the library preparation).

Note
titleA note on the adapter file used here

The adapter file listed here is likely the correct one to use for standard library preps that have been generated in the last few years, but may not be appropriate for all library preps (such as single end sequencing adapters, or nextera based preps, and certainly not appropriate for PacBio generated data). look Look to both the trimmomatic documentation and your experimental procedures at the bench to figure out if the adapter file is sufficient or if you need to create your own.


Learning objectives:

  1. Install trimmomatic

  2. Set up a small script to work around the annoying java invocation

  3. Remove adapter sequences from some plasmids and evaluate effect on read quality, or assembly.

...

Trimmomatic's home page can be found at this link which includes links to the paper discussing the program, and a user manual. Trimmomatic is far above average for as far as programs go, most will not have a user manual, may not have been updated since originally published, etc. This is what one thing that makes it such a good tool.

Checking for installation

Code Block
languagebash
titleTo verify you have placed the .jar file in a location that is already in your path try the following java invocation to pull up the help information.
java -jar $HOME/local/bin/trimmomatic-0.39.jar

If the above command works, jump down to the section on making a bash script. Otherwise continue with the next section to install the program

Installing using wget

In a new web browser window/tab, navigate to the trimmomatic home page. In the Downloading Trimmomatic section; right click on the 'binary' link for version 0.39 and copy that link address.

Info
titleWhich to choose binary files or uncompiled source code
The binary files will be what you want 100 out of 100 times, likely until you begin working with a specific program that you identify bugs in, submit them to the developers, they actually respond (most programs are not in active development), they try to address them, and begin asking you to try using the compiled version to check different scenarios. 

Use the wget command to download the link you just copied to a new folder named src in your $WORK directory.

Code Block
languagebash
titleUsing the mkdir command to create a folder named 'src' inside of your $WORK directory
collapsetrue
mkdir $WORK/src
cd $WORK/src

If you already have a src directory, you'll get a very benign error message stating that the folder already exists and thus can not be created. 

Code Block
languagebash
titleThe wget command is very simple. It has 2 parts: 1. the command 'wget', and 2. the location of the file you want to download.
collapsetrue
wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip

You should see a download bar showing you the file has begun downloading, when complete the ls command will show you a new compressed file named "Trimmomatic-0.39.zip". Next we need to uncompress this file, and copy the executable file to a location already in our $PATH variable.

Code Block
languagebash
unzip Trimmomatic-0.39.zip
cp Trimmomatic-0.39/trimmomatic-0.39.jar $HOME/local/bin

If you don't see the zip file or are unable to cd into the 0.39 directory after unzipping it let the instructor know.

Code Block
languagebash
titleTo verify you have placed the .jar file in a location that is already in your path try the following java invocation to pull up the help information.
java -jar $HOME/local/bin/trimmomatic-0.39.jar

When you compare how wordy and complicated that is to the other programs you have encountered in the course, it makes sense that we would want a simpler way of accessing the program which is exactly what we will do next.

...

The goal here will be to create an executable file with 2 lines. The first line will specify that it is a bash script. The second line will be the command we want to run, followed by information needed by bash to know to put the rest of the command you type after the executable file name.

  1. Launch nano with the command: nano $HOME/local/bin/trimmomatic
  2. In the nano window enter the following 2 lines exactly as they are typed

    No Format
    #!/bin/bash
    java -jar $HOME/local/bin/trimmomatic-0.39.jar $@
  3. write and exit nano with control+o and control+x
  4. make the trimmomatic file you just created executable 

    Code Block
    languagebash
    titlemaking file executable
    chmod +x $HOME/local/bin/trimmomatic

Verify that you have successfully made your bash script by typing 'trimmomatic' (try using the tab key in the middle as further sign things are working) and then press enter. You should see the same help that you saw above.

Trimming adapter sequences

Example generic command

Code Block
languagebash
titleExample command for trimming illumina paired end adapters
trimmomatic PE <READ1> <READ2> -baseout Trim_Reads/<basename> ILLUMINACLIP:<FASTAadapterFILE>:4:30:10 MINLEN:30

Breaking down the parts of the above command:

...

ILLUMINACLIP

...

<FASTAadapterFILE>

...

:4:30:10

...

 MINLEN:30

...

All of the above has been put together using the trimmomatic manual and experience in our lab given the adapters we use and the library kits we prepare the samples with.

What does this look like in practice you may ask? Read on for some examples of trimming a single sample, or trimming a large number of samples.

Trimming a single sample

Get some data

Code Block
languagebash
titleset up directories and copy files
mkdir -p $SCRATCH/GVA_trimmomatic_1sample/Trim_Reads
cd $SCRATCH/GVA_trimmomatic_1sample
cp $BI/gva_course/plasmid_qc/E1-7* .

cp $WORK/src/Trimmomatic-0.39/adapters/TruSeq3-PE-2.fa .

The ls command should show you 2 gzipped fastq files and a fasta file

Trim the fastq files

The following command can be run on the head node. Like with FastQC if we are dealing with less than say 1-2Million reads, it is reasonable to run the command on the head node unless we have 100s of samples in which case submitting to the queue will be faster as the files can be trimmed all at once rather than 1 at a time. Use what you have learned in the class to determine if you think this command should be run on the head node. (this was covered in more detail in the first part of the evaluating and processing read quality tutorial.)

Code Block
languagebash
titleFiguring out how many reads are in each file
collapsetrue
zgrep -c "^+$" *.fastq.gz

...

languagebash
titleExample command for trimming illumina paired end adapters

...

Another sign of its popularity is that while the home page lists information for installing it that is not conda based, searching for the tool on anaconda still reveals this page https://anaconda.org/bioconda/trimmomatic which is a conda installation for the tool. Anytime you have other people in the community working with a tool and sharing resources like this it is a good sign.

Expand
titleSince at this point we have several different conda environments going, take a moment to think about what environment a read processing tool would work well in.

There actually are not a lot of "wrong" answers here at least from the theoretical side. As read processing takes place upstream of other analysis steps it makes sense to put it in almost any environment. The 1 notable exception might be the SVdetect environment that we specifically only installed the SVDetect tool as it wouldn't make a lot of sense to activate the SVDetect environment to process the reads, activate a different environment to map the reads, and reactivate the SVDetect environment to use that tool

Practically, as we have seen in other tutorials, it may cause conflicts with other programs on a given environment making it better or worse to install in different locations. My first thought of where to try to install it was on the multiqc environment, and it appears to work there without concern (single message about a single dependent package being superseded by higher priority channel). If you have not done that tutorial, will also install into the GVA2021 environment with no messages about potential issues with other packages. If you want to try installing it in another environment feel free to try, if you are unsure if it will create problems based on the output, ask.


As noted in our IVG tutorial, when we dealt with the readseq program, trimmomatic is also a java based program, and like the conda installation of readseq, the conda installation of trimmomatic includes a bash wrapper script around the java invocation. For those interested in how such wrapper scripts are made, last year's tirmmomatic tutorial actually built a trimmomatic wrapper to avoid the java envokation. Using "-version" verify that you have version 0.39 of trimmomatic installed.

Info

Notice that this time pulling up the version information requires a single - rather than double. When looking for things I expect or hope a program to have I always start from the assumption that the word will have a double dash, while a single letter will have a single dash. This is only mentioned now when it is not true to illustrate the point that it is not a perfect rule.

Examples of things I try without looking:

  1. --version
  2. --help
  3. -version
  4. -help


Trimming adapter sequences

Example generic command

Code Block
languagebash
titleExample command for trimming illumina paired end adapters
trimmomatic PE <READ1> <READ2> -baseout Trim_Reads/<basename> ILLUMINACLIP:<FASTAadapterFILE>:4:30:10 MINLEN:30

Breaking down the parts of the above command:

PartPurposereplace with/note
trimmomatictell the computer you are using the trimmmoatic wrapper 
PEtell trimmomatic program you are using the paired end mode
<READ1>fastq file read1 you are trying to trimactual name of fastq file
<READ2>fastq file read2 you are trying to trimactual name of paired fastq file
-baseoutadd a prefix to the resulting trimmed files
Trimreads/<basename>put all trimmed reads in a new folder. This is very good practicetypically you will replace with the first part of the fastq file name (before the Snumber or Lnumber depending on the file usually)
ILLUMINACLIP
tell trimmomatic program how you want to trim the adapters
<FASTAadapterFILE>
name of file you containing your adaptersgood practice to copy the fasta file you want to use to the current directory
:4:30:10
allow 4 mismatches
require 30 bp of overlap between R1 and R2 to identify the fragment as being less than the read length
Require 10bp of sequence to match before removing anything

 MINLEN:30
discard any reads that are less than 30bp after trimming

All of the above has been put together using the trimmomatic manual and experience in our lab given the adapters we use and the library kits we prepare the samples with.

What does this look like in practice you may ask? Read on for some examples of trimming a single sample, or trimming a large number of samples.

Trimming a single sample

Get some data

Code Block
languagebash
titleset up directories and copy files
mkdir -p $SCRATCH/GVA_trimmomatic_1sample/Trim_Reads
cd $SCRATCH/GVA_trimmomatic_1sample
cp $BI/gva_course/plasmid_qc/E1-7* .
cp $HOME/miniconda3/envs/*/share/trimmomatic/adapters/TruSeq3-PE-2.fa .

The ls command should show you 2 gzipped fastq files and a fasta file. You may notice that here that we used a wildcard in the middle of our copy path. This is done so that regardless of what environment you installed trimmomatic into, it should still be able to find the correct adapter file. The downside is that if you have installed it in more than one environment, you may get messages like this:

No Format
cp: will not overwrite just-created ‘./TruSeq2-PE.fa’ with ‘/home1/0004/train402/miniconda3/envs/GVA-breseq/share/trimmomatic/adapters/TruSeq2-PE.fa’
cp: will not overwrite just-created ‘./TruSeq2-PE.fa’ with ‘/home1/0004/train402/miniconda3/envs/GVA-multiqc/share/trimmomatic/adapters/TruSeq2-PE.fa’

This is actually not concerning as all 3 of these files are from the same source, are identical not just in name, but also in content, and we don't care which of the files we use.

Trim the fastq files

The following command can be run on the head node. Like with FastQC if we are dealing with less than say 1-2Million reads, it is reasonable to run the command on the head node unless we have 100s of samples in which case submitting to the queue will be faster as the files can be trimmed all at once rather than 1 at a time. Use what you have learned in the class to determine if you think this command should be run on the head node. (this was covered in more detail in the first part of the evaluating and processing read quality tutorial.)

Code Block
languagebash
titleFiguring out how many reads are in each file
collapsetrue
zgrep -c "^+$" *.fastq.gz
Code Block
languagebash
titleExample command for trimming illumina paired end adapters
trimmomatic PE E1-7_S187_L001_R1_001.fastq.gz E1-7_S187_L001_R2_001.fastq.gz -baseout Trim_Reads/E1-7.fastq.gz ILLUMINACLIP:TruSeq3-PE-2.fa:4:30:10 MINLEN:30

Evaluating the output

The last 2 lines of text you get should read:

No Format
Input Read Pairs: 6891 Both Surviving: 2391 (34.70%) Forward Only Surviving: 1844 (26.76%) Reverse Only Surviving: 2644 (38.37%) Dropped: 12 (0.17%)
TrimmomaticPE: Completed successfully

2nd to last line tells us:

  1. we had 6891 total reads
  2. 34.7% of reads both R1 and R2 were long enough to be kept after trimming
  3. 26.76% of reads and 38.37% of reads only 1 of the reads were long enough and/or not a complete duplicate of the other read
  4. only 0.17% of reads were discarded  for both R1 and R2. This is a rough estimate of adapter dimer being present in the sample.

From the last line alone we would believe things worked correctly. Unfortunately if we run the command with an adapter file that we don't have access to (say you have a typo in the name of the adapter file) our full output would look like this:

No Format
TrimmomaticPE: Started with arguments:
 E1-7_S187_L001_R1_001.fastq.gz E1-7_S187_L001_R2_001.fastq.gz -baseout Trim_Reads/E1-7.fastq.gz ILLUMINACLIP:TruSeq3-PE-2.fa:4:30:10 MINLEN:30

Evaluating the output

The last 2 lines of text you get should read:

No Format
Input Read Pairs: 6891 Both Surviving: 2391 (34.70%) Forward Only Surviving: 1844 (26.76%) Reverse Only Surviving: 2644 (38.37%) Dropped: 12 (0.17%)
TrimmomaticPE: Completed successfully

From the last line we know things worked correctly.

2nd to last line tells us:

  1. we had 6891 total reads
  2. 34.7% of reads both R1 and R2 were long enough to be kept after trimming
  3. 26.76% of reads and 38.37% of reads only 1 of the reads were long enough and/or not a complete duplicate of the other read
  4. only 0.17% of reads were discarded  for both R1 and R2. This is a rough estimate of adapter dimer being present in the sample.

When we interrogate the Trim_Reads folder with ls we see 4 files:

  1. _1P
  2. _2P
  3. _1U
  4. _2U

1/2 represent read 1 and read2 ... P/U represent paired and unpaired reads. They are kept separate as many programs require R1 and R2 files to be the same length when being used as input.

Trim all the samples from the multiqc tutorial

As mentioned above, if you have already done the multiqc tutorial, you can use your new trimmomatic command to remove the adatper sequences from all 544 samples.

Get some data

Code Block
languagebash
titleset up directories and copy files
mkdir -p $SCRATCH/GVA_trimmomatic_multiqcSamples/Trim_Reads
cd $SCRATCH/GVA_trimmomatic_multiqcSamples
cp -r $BI/gva_course/plasmid_qc/ Raw_Reads/
cp $WORK/src/Trimmomatic-0.39/adapters/TruSeq3-PE-2.fa .

 The ls command will now show 2 directories, and a fasta file.

Trim the fastq files

Just as we used a for loop to set up a set of FastQC commands in the multiqc tutorial, we can use a similar for loop to generate a single file with 272 trim commands for the 544 total files.

The following command will pair all R1 reads in the Raw_Reads folder with its R2 pair, determine the base name, and generate a command to trim the file 

Code Block
languagebash
titlefor loop to generate all trim commands needed
for r1 in Raw_Reads/*_R1_*.fastq.gz; do r2=$(echo $r1|sed 's/_R1_/_R2_/'); name=$(echo $r1|sed 's/_R1_001.fastq.gz//'|sed 's/Raw_Reads\///'); echo "trimmomatic PE $r1 $r2 -baseout Trim_Reads/$name.fastq.gz ILLUMINACLIP:TruSeq3-PE-2.fa:4:30:10 MINLEN:30";done > trim_commands

Use the head and wc -l to see what the output is and how much there is of it respectively.

Again as we discussed in the multiqc tutorial, running this number of commands is kind of a boarder line case, there are not a lot of total reads, but there are a large number of samples so we are likely to benefit from not being run on an idev node. 

Code Block
languagebash
titlesubmit the job to run on the que
launcher_creator.py -N 1 -w 48 -n "trimmomatic" -t 00:10:00 -a "UT-2015-05-18" -j trim_commands -m "module load launcher/3.0.3"
sbatch trimmomatic
Using templated Output files: Trim_Reads/E1-7_1P.fastq.gz Trim_Reads/E1-7_1U.fastq.gz Trim_Reads/E1-7_2P.fastq.gz Trim_Reads/E1-7_2U.fastq.gz
java.io.FileNotFoundException: /scratch/0004/train402/GVA_trimmomatic_1sample/TruSeq3-PE-2.fa (No such file or directory)
	at java.io.FileInputStream.open0(Native Method)
	at java.io.FileInputStream.open(FileInputStream.java:195)
	at java.io.FileInputStream.<init>(FileInputStream.java:138)
	at org.usadellab.trimmomatic.fasta.FastaParser.parse(FastaParser.java:54)
	at org.usadellab.trimmomatic.trim.IlluminaClippingTrimmer.loadSequences(IlluminaClippingTrimmer.java:110)
	at org.usadellab.trimmomatic.trim.IlluminaClippingTrimmer.makeIlluminaClippingTrimmer(IlluminaClippingTrimmer.java:71)
	at org.usadellab.trimmomatic.trim.TrimmerFactory.makeTrimmer(TrimmerFactory.java:32)
	at org.usadellab.trimmomatic.Trimmomatic.createTrimmers(Trimmomatic.java:59)
	at org.usadellab.trimmomatic.TrimmomaticPE.run(TrimmomaticPE.java:552)
	at org.usadellab.trimmomatic.Trimmomatic.main(Trimmomatic.java:80)
Quality encoding detected as phred33
Input Read Pairs: 6891 Both Surviving: 6891 (100.00%) Forward Only Surviving: 0 (0.00%) Reverse Only Surviving: 0 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully

Considering this output, we also want to be very suspicious of a trimming that results in 100% of the reads doing anything.

When we interrogate the Trim_Reads folder with ls we see 4 files:

  1. _1P
  2. _2P
  3. _1U
  4. _2U

1/2 represent read 1 and read2 ... P/U represent paired and unpaired reads. They are kept separate as many programs require R1 and R2 files to be the same length when being used as input.

Trim all the samples from the multiqc tutorial

As mentioned above, if you have already done the multiqc tutorial, you can use your new trimmomatic command to remove the adapter sequences from all 544 samples.

Get some data

Code Block
languagebash
titleset up directories and copy files
mkdir -p $SCRATCH/GVA_trimmomatic_multiqcSamples/Trim_Reads
cd $SCRATCH/GVA_trimmomatic_multiqcSamples
cp -r $BI/gva_course/plasmid_qc/ Raw_Reads/
cp $HOME/miniconda3/envs/*/share/trimmomatic/adapters/TruSeq3-PE-2.fa .

 Again we may see additional output about not overwriting files if we have trimmomatic installed on multiple environments. The ls command will now show 2 directories, and a fasta file.

Trim the fastq files

Just as we used a for loop to set up a set of FastQC commands in the multiqc tutorial, we can use a similar for loop to generate a single file with 272 trim commands for the 544 total files.

The following command will pair all R1 reads in the Raw_Reads folder with its R2 pair, determine the base name, and generate a command to trim the file 

Code Block
languagebash
titlefor loop to generate all trim commands needed
for r1 in Raw_Reads/*_R1_*.fastq.gz; do r2=$(echo $r1|sed 's/_R1_/_R2_/'); name=$(echo $r1|sed 's/_R1_001.fastq.gz//'|sed 's/Raw_Reads\///'); echo "trimmomatic PE $r1 $r2 -baseout Trim_Reads/$name.fastq.gz ILLUMINACLIP:TruSeq3-PE-2.fa:4:30:10 MINLEN:30";done > trim_commands

Use the head and wc -l to see what the output is and how much there is of it respectively.

Again as we discussed in the multiqc tutorial, running this number of commands is kind of a boarder line case, there are not a lot of total reads, but there are a large number of samples so we are likely to benefit from not being run on an idev node. 

Code Block
languagebash
titleModify your slurm file
cp /corral-repl/utexas/BioITeam/gva_course/GVA2021.launcher.slurm trim.slurm
nano trim.slurm

Again while in nano you will edit most of the same lines you edited in the in the breseq tutorial. Note that most of these lines have additional text to the right of the line. This commented text is present to help remind you what goes on each line, leaving it alone will not hurt anything, removing it may make it more difficult for you to remember what the purpose of the line is

Line numberAs isTo be
16

#SBATCH -J jobName

#SBATCH -J mutli_trimmomatic
17

#SBATCH -n 1

#SBATCH -n 4

21

#SBATCH -t 12:00:00

#SBATCH -t 0:20:00

22

##SBATCH --mail-user=ADD

#SBATCH --mail-user=<YourEmailAddress>

23

##SBATCH --mail-type=all

#SBATCH --mail-type=all

27

conda activate GVA2021

conda activate GVA2021

31

export LAUNCHER_JOB_FILE=commands

export LAUNCHER_JOB_FILE=trim_commands

The changes to lines 22 and 23 are optional but will give you an idea of what types of email you could expect from TACC if you choose to use these options. Just be sure to pay attention to these 2 lines starting with a single # symbol after editing them.

Line 27 assumes you named your breseq environment GVA-breseq in the earlier tutorial. 

Again use ctl-o and ctl-x to save the file and exit.


Code Block
languagebash
titlesubmit the job to run on the que
sbatch trim.slurm


The job should take less than 5 10 minutes once it starts if everything is working correctly, the showq -u command can be used to check for the job finishing.

...

Code Block
languagebash
titlesubmit the job to run on the que
ls Trim_Reads | wc -l -l
grep -c "TrimmomaticPE: Completed successfully" Queue_job.o*
grep -c "TrimmomaticPE: Completed successfully100.00%" trimmomaticQueue_job.eo*

The above 2 3 commands are expected to show 1088 and 272 respectivelyand 0 respectively, if you see other answers it suggests that something went wrong with the trimming command itself. If so remember I'm on zoom if you need help looking at whats going on.'m on zoom if you need help looking at whats going on. The most common error that I expect will be that you ran out of ram by trying to process too many samples at once (such as if you used a -n 68 option in your .slurm file). 

Beyond the job finishing successfully, the best way to evaluate this data would actually be to move back to the multiqc tutorial and repeat the analysis there that was done on the raw files for the trimmed files here.

...

Code Block
languagebash
titlecommand taken from my history to trim all files
echo -e "\nTotal read pairs:";wc -l trim_commands;echo "Successful trimmings:"; tail -n 1 trimLogs/*|grep -c "TrimmomaticPE: Completed successfully";echo "Potential trimming errors:"; cat trimLogs/*|grep -c "100.00%"

This gives me a quick readout of if all of the individual commands finished correctly.

...

  1. Consider moving back over to the multiqc tutorial use multiqc to determine well the trimming worked. \ 
  2. The reads could then further be assembled in the Spades tutorial as they are all plasmid samples.


Return to GVA2020GVA2021