
Overview

As mentioned in the introduction tutorial as well as the read processing tutorial, read processing can have a huge impact on downstream work. While cutadapt, which was introduced in the read processing tutorial, is great for quick evaluation or dealing with a single bad sample, it is not as robust as some other trimmers, particularly when it comes to removing sequence that you know shouldn't be present but may exist in odd orientations (such as adapter sequences from the library preparation). This tutorial is adapted from the 2021 trimmomatic tutorial, which sought to do the same basic things as fastp: get rid of adapter sequences first and foremost, ideally even before FastQC, so that any quality- or length-based improvements are made on actual data rather than artifacts. The single biggest reason why fastp is now the instructor's preferred trimming program is this box taken from the trimmomatic tutorial:

A note on the adapter file used here

The adapter file listed here is likely the correct one to use for standard library preps generated in the last few years, but it may not be appropriate for all library preps (such as single-end sequencing adapters or Nextera-based preps, and it is certainly not appropriate for PacBio-generated data). Look to both the trimmomatic documentation and your experimental procedures at the bench to figure out if the adapter file is sufficient or if you need to create your own.

The more collaborative your work is, the less confidence you will have in picking the correct adapter file with trimmomatic. Thanks to conda installations it can be pretty easy to test multiple different adapter files, but fastp does all the guesswork for you, and can generate some interesting graphs itself.

Learning objectives:

  1. Install fastp

  2. Remove adapter sequences from some plasmid samples and evaluate the effect on read quality or assembly.

Installing fastp

fastp's home page can be found on GitHub and has links to the paper discussing the program, installation instructions for conda, and information on each of the different options available to the program. This is far more information than the average program will have: most will not have a user manual, and many have not been updated since they were originally published. The fact that fastp has continued to be updated since publication is part of what makes it such a good tool; the more people who use it, the more likely problems are to be found, and having a group actively improving the program significantly increases its longevity.

 Since at this point we have several different conda environments going, take a moment to think about what environment a read processing tool would work well in.

There actually are not a lot of "wrong" answers here, at least from the theoretical side. As read processing takes place upstream of basically all other analysis steps, it makes sense to put it in almost every environment.

Practically, though, that means fastp would have to be installed in every environment, which starts to defeat the purpose of having different environments at all. As will be discussed on Friday, you might want to start thinking about grouping programs into chunks. Almost no matter what analysis you do, you are going to want to trim adapters (fastp), check the quality (fastqc), and likely compare to other similar samples (multiqc). So putting all these programs into a single "read pre-processing" environment seems like a good grouping.

At this point in the class you can start making your own calls about what environments you want to put programs in, or what names you want to give them. While you can keep using the same names and groupings I suggest, feedback from last year indicated that having to decide how to modify commands for different environments was itself a helpful exercise.

Example command for creating a new environment
conda create -n GVA-ReadPrepProcessing -c bioconda -c conda-forge fastp fastqc multiqc
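To check that the installation worked, activate the new environment and ask fastp to print its version (the environment name here assumes you used the suggested command above; adjust it if you picked your own name):

Example commands to verify the installation
conda activate GVA-ReadPrepProcessing
fastp --version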


Trimming adapter sequences

Example command for trimming illumina paired end adapters
fastp -i <READ1> -I <READ2> -o <TRIM1> -O <TRIM2> --threads # --detect_adapter_for_pe -j <LOG.json> -h <LOG.html>

Breaking down the parts of the above command:

Part | Purpose | Replace with / note
fastp | tell the computer you are using the fastp program |
-i <READ1> | fastq file of read 1 you are trying to trim | actual name of the R1 fastq file
-I <READ2> | fastq file of read 2 you are trying to trim | actual name of the paired R2 fastq file
-o <TRIM1> | output file of trimmed fastq file of read 1 | desired name of the trimmed R1 fastq file
-O <TRIM2> | output file of trimmed fastq file of read 2 | desired name of the paired trimmed R2 fastq file
--threads # | use more processors | number of additional processors (68 max on stampede2)
--detect_adapter_for_pe | automatically detect adapter sequences based on the paired end reads, and remove them |
-j <LOG.json> | json file with information about how the trim was accomplished; can be helpful for looking at multiple samples, similar to a multiqc analysis | name of the json file you want to use
-h <LOG.html> | html file with information similar to the json file, but with graphs | name of the html file you want to use

All of the above has been put together from the output of the fastp --help command.

Trimming a single sample

Get some data

set up directories and copy files
mkdir -p $SCRATCH/GVA_fastp_1sample/Trim_Reads
cd $SCRATCH/GVA_fastp_1sample
cp $BI/gva_course/plasmid_qc/E1-7* .

The ls command should show you 2 gzipped fastq files. You may notice that here we used a wildcard in the middle of our copy path for the first time. This is done so that you can grab both R1 and R2 easily without having to type out the full command. Double tab will help tell you when you have a sufficiently specific base name to only get the files you are after.
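If you ever want to see exactly what a wildcard will match before actually copying anything, ls accepts the same pattern:

preview what the wildcard matches
ls $BI/gva_course/plasmid_qc/E1-7*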

Trim the fastq files

The following command can be run on the head node. As with FastQC, if we are dealing with fewer than, say, 1-2 million reads, it is reasonable to run the command on the head node, unless we have 100s of samples, in which case submitting to the queue will be faster as the files can be trimmed all at once rather than 1 at a time. Use what you have learned in the class to determine if you think this command should be run on the head node. (This was covered in more detail in the first part of the evaluating and processing read quality tutorial.)

Figuring out how many reads are in each file
zgrep -c "^+$" *.fastq.gz
Example command for trimming illumina paired end adapters
fastp -i E1-7_S187_L001_R1_001.fastq.gz -I E1-7_S187_L001_R2_001.fastq.gz -o E1-7_S187_L001_R1_001.trim.fastq.gz -O E1-7_S187_L001_R2_001.trim.fastq.gz -w 4 --detect_adapter_for_pe 
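Once the trim finishes, the same zgrep trick from above can be pointed at the trimmed files (which all end in .trim.fastq.gz given the -o and -O names used here) to count how many reads survived in each file:

Figuring out how many reads remain after trimming
zgrep -c "^+$" *.trim.fastq.gz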

Evaluating the output

Unlike trimmomatic's single summary line, fastp prints a multi-block summary to the screen as it finishes. In that text, look for:

  1. the "before filtering" and "after filtering" blocks for read 1 and read 2, which report total reads and total bases so you can see how much sequence was removed
  2. the "Filtering result" block, which reports how many reads passed the filter and how many failed for being too short or too low quality
  3. the count of reads with adapters trimmed, which gives a rough sense of how much adapter (including adapter dimer) was present in the sample
  4. the estimated duplication rate and insert size peak

Since the command above did not include the -j or -h options, fastp writes its reports to the default fastp.json and fastp.html files in the current directory. The html report in particular is worth transferring back to your local machine and opening in a web browser.
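If you would rather pull numbers out on the command line than open the html report, the json report can be queried directly. A minimal sketch, assuming the default fastp.json name and the summary/before_filtering/after_filtering key layout used by recent fastp versions:

Example command for pulling read counts out of the json report
python -c "import json; s = json.load(open('fastp.json'))['summary']; print('reads before:', s['before_filtering']['total_reads'], 'reads after:', s['after_filtering']['total_reads'])"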

Trim all the samples from the multiqc tutorial

As mentioned above, if you have already done the multiqc tutorial, you can use your new fastp command to remove the adapter sequences from all 544 files (272 paired-end samples).

Get some data

set up directories and copy files
mkdir -p $SCRATCH/GVA_fastp_multiqcSamples/Trim_Reads
cd $SCRATCH/GVA_fastp_multiqcSamples
cp -r $BI/gva_course/plasmid_qc/ Raw_Reads/

The ls command will now show 2 directories. Note that unlike with trimmomatic, there is no need to copy over an adapter file; fastp will detect the adapters for us.

Trim the fastq files

Just as we used a for loop to set up a set of FastQC commands in the multiqc tutorial, we can use a similar for loop to generate a single file with 272 trim commands for the 544 total files.

The following command will pair each R1 file in the Raw_Reads folder with its R2 pair, determine the base name, and generate a command to trim that pair of files.

for loop to generate all trim commands needed
for r1 in Raw_Reads/*_R1_*.fastq.gz; do r2=$(echo $r1|sed 's/_R1_/_R2_/'); name=$(echo $r1|sed 's/_R1_001.fastq.gz//'|sed 's/Raw_Reads\///'); echo "fastp -i $r1 -I $r2 -o Trim_Reads/${name}_R1.trim.fastq.gz -O Trim_Reads/${name}_R2.trim.fastq.gz --detect_adapter_for_pe -j $name.json -h $name.html"; done > trim_commands

Use head and wc -l to see what the output is and how much of it there is, respectively.
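For example:

inspect the generated commands
head -2 trim_commands
wc -l trim_commands

wc -l should report 272 if every R1 file found its R2 pair.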

Again, as we discussed in the multiqc tutorial, running this number of commands is kind of a borderline case: there are not a lot of total reads, but there are a large number of samples, so we are likely to benefit from submitting to the queue rather than running the commands one at a time on an idev node.

Modify your slurm file
cp /corral-repl/utexas/BioITeam/gva_course/GVA2021.launcher.slurm trim.slurm
nano trim.slurm

Again, while in nano you will edit most of the same lines you edited in the breseq tutorial. Note that most of these lines have additional text to the right of the line. This commented text is present to help remind you what goes on each line; leaving it alone will not hurt anything, while removing it may make it more difficult for you to remember the purpose of the line.

Line number | As is | To be
16 | #SBATCH -J jobName | #SBATCH -J multi_fastp
17 | #SBATCH -n 1 | #SBATCH -n 4
21 | #SBATCH -t 12:00:00 | #SBATCH -t 0:20:00
22 | ##SBATCH --mail-user=ADD | #SBATCH --mail-user=<YourEmailAddress>
23 | ##SBATCH --mail-type=all | #SBATCH --mail-type=all
27 | conda activate GVA2021 | conda activate GVA-ReadPrepProcessing
31 | export LAUNCHER_JOB_FILE=commands | export LAUNCHER_JOB_FILE=trim_commands

The changes to lines 22 and 23 are optional, but will give you an idea of what types of email you can expect from TACC if you choose to use these options. Just be sure these 2 lines start with a single # symbol after editing them.

Line 27 assumes you named your environment GVA-ReadPrepProcessing as suggested earlier in this tutorial; if you installed fastp into a differently named environment, activate that one instead.

Again use ctrl-o and ctrl-x to save the file and exit.


submit the job to run on the queue
sbatch trim.slurm


The job should take less than 10 minutes once it starts if everything is working correctly; the showq -u command can be used to check for the job finishing.

Evaluating the output

evaluate the trimming output
ls Trim_Reads | wc -l
ls *.json | wc -l
ls *.html | wc -l

The above 3 commands are expected to show 544, 272, and 272 respectively (each sample produces 2 trimmed fastq files plus 1 json and 1 html report); if you see other answers, it suggests that something went wrong with the trimming command itself. If so, remember I'm on zoom if you need help looking at what's going on. The most common error that I expect will be that you ran out of RAM by trying to process too many samples at once (such as if you used a -n 68 option in your .slurm file).

Beyond the job finishing successfully, the best way to evaluate this data is to move back to the multiqc tutorial and repeat the analysis that was done there on the raw files, this time on the trimmed files.
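A minimal sketch of that follow-up, assuming the Trim_Reads layout above (the fastqc_output directory name is just a suggestion):

Example commands for re-running fastqc and multiqc on the trimmed reads
mkdir fastqc_output
fastqc Trim_Reads/*.fastq.gz -o fastqc_output -t 4
multiqc fastqc_output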

Personal experience

The for loop above focuses just on generating the trim commands. In my experience that is only half of the job; the other half is capturing the individual outputs so you can evaluate how many reads were trimmed, and in what way, for each sample. Perhaps you will find this command helpful in your own work:

command taken from my history to trim all files
mkdir trimLogs; mkdir Trim_Reads; for r1 in Raw_Reads/*_R1_*.fastq.gz; do r2=$(echo $r1|sed 's/_R1_/_R2_/'); name=$(echo $r1|sed 's/_R1_001.fastq.gz//'|sed 's/Raw_Reads\///'); echo "fastp -i $r1 -I $r2 -o Trim_Reads/${name}_R1.trim.fastq.gz -O Trim_Reads/${name}_R2.trim.fastq.gz --detect_adapter_for_pe -j trimLogs/$name.json -h trimLogs/$name.html >& trimLogs/$name.log"; done > trim_commands

After the job completes, the following command is useful for evaluating its success. fastp's final line of screen output reports its version and the time used, so counting the log files that contain that line counts the commands that ran to completion:

command taken from my history to evaluate the trim logs
echo -e "\nTotal command count:"; wc -l trim_commands; echo "Successful trimmings:"; grep -l "time used" trimLogs/*.log | wc -l

This gives me a quick readout of whether all of the individual commands finished correctly.

Further, if you were to use this for loop rather than the one listed in the tutorial above, you would see that each sample generates its own log file in the trimLogs directory. When investigated with cat/more/less/tail/nano, each log is identical to the output you saw when you trimmed a single sample, allowing you to figure out how different samples are being processed.

Optional next steps:

  1. Consider moving back over to the multiqc tutorial and using multiqc to determine how well the trimming worked.
  2. The reads could then be assembled in the SPAdes tutorial, as they are all plasmid samples.


Return to GVA2021
