...

Overview

As mentioned in the introduction tutorial as well as the read processing tutorial, read processing can make a huge impact on downstream work. While cutadapt, which was introduced in the read processing tutorial, is great for quick evaluation or for dealing with a single bad sample, it is not as robust as some other trimmers, particularly when it comes to removing sequence that you know shouldn't be present but may exist in odd orientations (such as adapter sequences from the library preparation). This tutorial is adapted from the 2021 trimmomatic tutorial, which sought to do the same basic things as fastp: get rid of adapter sequences first and foremost, ideally even before FastQC, so that any quality- or length-based improvements are made on actual data rather than artifacts. The single biggest reason fastp is now the instructor's preferred trimming program is this box taken from the trimmomatic tutorial:

...

The more collaborative your work is, the less confidence you will have in picking the correct adapter file for trimmomatic, and while conda installations make it pretty easy to test multiple different adapter files, fastp does all the guesswork for you and can generate some interesting graphs itself.

...

Expand
titleSince at this point we have several different conda environments going, take a moment to think about what environment a read processing tool would work well in.

There actually are not many "wrong" answers here, at least from the theoretical side. Since read processing takes place upstream of basically all other analysis steps, it makes sense to put it in almost every environment.

Practically, though, that would mean installing it in every environment, which starts to defeat the purpose of having different environments at all. As will be discussed on Friday, you might want to start thinking about grouping programs into chunks. Almost no matter what analysis you do, you are going to want to trim adapters (fastp), check the quality (fastqc), and likely compare to other similar samples (multiqc). Putting all these programs into a single "read pre-processing" environment therefore seems like a good grouping.

At this point in the class you can start making your own calls about what environments you want to put programs in and what names you want to give them. You can keep using the same names and groupings suggested here, but feedback from last year indicated that having to decide how to modify commands for different environments was itself helpful.

...

Code Block
languagebash
titleExample command for creating a new environment
conda create -n GVA-ReadPreProcessing -c conda-forge -c bioconda fastp fastqc multiqc

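Once the environment is created, it needs to be activated before the new tools are on your PATH. A quick sanity check is to ask each tool for its version (these are the standard version flags for all three programs):

Code Block
languagebash
titleActivating and checking the new environment
conda activate GVA-ReadPreProcessing
fastp --version
fastqc --version
multiqc --version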

Trimming adapter sequences

...

The following command can be run on the head node. As with FastQC, if we are dealing with fewer than, say, 1-2 million reads, it is reasonable to run the command on the head node, unless we have hundreds of samples, in which case submitting to the queue will be faster since the files can all be trimmed at once rather than one at a time. Use what you have learned in the class to determine whether you think this command should be run on the head node. (This was covered in more detail in the first part of the evaluating and processing read quality tutorial.)

Code Block
languagebash
titleFiguring out how many reads are in each file
collapsetrue
zgrep -c "^+$" Raw_Reads/*.fastq.gz

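The exact fastp command used in the tutorial is elided here, but as a point of reference, a minimal sketch of a paired-end fastp invocation consistent with the output files discussed in the questions below might look like the following (the file names are taken from those answers; fastp.html and fastp.json are produced automatically because no -h or -j names are given):

Code Block
languagebash
titleSketch of a paired-end fastp command
mkdir -p Trim_Reads
fastp -i Raw_Reads/E1-7_S187_L001_R1_001.fastq.gz -I Raw_Reads/E1-7_S187_L001_R2_001.fastq.gz \
      -o Trim_Reads/E1-7_S187_L001_R1_001.trim.fastq.gz -O Trim_Reads/E1-7_S187_L001_R2_001.trim.fastq.gz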
...


  1. Expand
    titleWhat are the 4 new files that were generated, and where did they come from?
    • E1-7_S187_L001_R2_001.trim.fastq.gz and E1-7_S187_L001_R1_001.trim.fastq.gz

      • These were created with the -o and -O options; they are in the Trim_Reads folder, and you likely found them using the ls command
    • fastp.html and fastp.json
      • These are log files created by default since we didn't specify their names. This is part of why the -j and -h options were discussed above with the general command. 
      • While the json file can be evaluated in the terminal (cat, less, more, head, tail), the html file has to be transferred back to your computer to view (see the transfer sketch below).

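For reference, transferring the html report back is typically done with scp run from your local computer. This is only a sketch; the username, hostname, and remote path are placeholders to replace with your own details:

Code Block
languagebash
titleExample scp transfer (run from your local computer, not TACC)
# username, host, and path below are placeholders; substitute your own
scp username@ls6.tacc.utexas.edu:/path/to/working/directory/fastp.html .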


  2. Expand
    titleHow many reads were left after trimming?
    • 5884 paired end reads

    • 11768 total reads

      • You likely found this out by using the zgrep command, or from the following blocks that printed as the command ran:

        No Format
        Read1 after filtering:
        total reads: 5884
        total bases: 791763
        Q20 bases: 782948(98.8867%)
        Q30 bases: 765510(96.6842%)
        
        Read2 after filtering:
        total reads: 5884
        total bases: 791763
        Q20 bases: 711414(89.8519%)
        Q30 bases: 658164(83.1264%)
        
        Filtering result:
        reads passed filter: 11768
        reads failed due to low quality: 2014
        reads failed due to too many N: 0
        reads failed due to too short: 0
        reads with adapter trimmed: 3970
        bases trimmed due to adapters: 193972

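To confirm these numbers yourself, the same zgrep trick used on the raw reads works on the trimmed output; each trimmed file should report 5884 reads:

Code Block
languagebash
titleVerifying read counts in the trimmed files
zgrep -c "^+$" Trim_Reads/*.trim.fastq.gz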



  3. Expand
    titleHow big was our fragment size? How is this estimated, what might make it inaccurate?
    • From the information generated while the command ran we see:

      • No Format
        Insert size peak (evaluated by paired-end reads): 171


      • This tells us that the insert size peak was 171 bases, and that it was estimated by looking at the overlap between the read pairs (for 151bp PE reads, insert size = 302 minus the number of overlapping bases, so a 131-base overlap implies a 171-base insert). It is potentially inaccurate because read pairs that do not overlap cannot be used to estimate the size.
    • If you transferred the .html file back to your laptop, you would see this relevant histogram:

      • The general section of the summary at the top of the html tells us that the insert size peak was 171 bases, while the histogram tells us that 50% of our data is <18 or >272 bases



  4. Expand
    titleDid our sample have any adapter present?
    • If you only look at the information that printed to the screen, you probably answered "No"
      • you likely saw the following block and thought it was the end of the answer:

      • No Format
        Detecting adapter sequence for read1...
        No adapter detected for read1
        
        Detecting adapter sequence for read2...
        No adapter detected for read2


    • A fuller answer might be "maybe", "probably", or "I'm not sure", because:
      • 1. Not finding any adapter at all would be extremely rare
      • 2. If 45% of our reads have an insert size of 171 bases and we did 151bp PE sequencing, we should be able to find adapter sequences
      • 3. In the filtering results we see:

      • No Format
        Filtering result:
        reads passed filter: 11768
        reads failed due to low quality: 2014
        reads failed due to too many N: 0
        reads failed due to too short: 0
        reads with adapter trimmed: 3970
        bases trimmed due to adapters: 193972


    • If you look at the html file you probably answered "yes"
      • There is a section for Read1 and Read2 adapters which shows a growing stretch of DNA recreating the Illumina adapter sequences.

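If you would rather not transfer the html file, the adapter information can also be pulled from the json report in the terminal. A quick sketch (the "adapter_cutting" key name is an assumption based on recent fastp versions and may differ in yours):

Code Block
languagebash
titleChecking the json report for adapter trimming
# "adapter_cutting" is assumed to be the relevant key in fastp's json report
grep -A 3 '"adapter_cutting"' fastp.json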


  5. Expand
    titleWhy was answering whether there were adapters present not straightforward?

    As we saw in our fastqc reports (overrepresented sequences having "no hit" and adapter content staying at the bottom of the graph), for something to be classified as an "adapter" in the first section of the printed information, it has to meet certain criteria that in this (and many other) instances are perhaps a bit too stringent. 



  6. Expand
    titleWhat other interesting things can you find from this command?

    This is pretty open ended; take a look at the html file in particular, see what does or doesn't make sense, and consider asking a question if you would like to know more.


...

Trim all the samples from the multiqc tutorial

As mentioned above, if you have already done the multiqc tutorial, you can use your new fastp command to remove the adapter sequences from all 544 samples, or make other changes based on what you saw in the fastqc/multiqc reports. 

...

Code Block
languagebash
titleModify your slurm file
cp /corral-repl/utexas/BioITeam/gva_course/GVA.launcher.slurm trim.slurm
nano trim.slurm

Again, while in nano you will edit most of the same lines you edited in the breseq tutorial. Note that most of these lines have additional text to the right of the line. This commented text is there to remind you what goes on each line; leaving it alone will not hurt anything, while removing it may make it harder to remember what each line is for.

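The core of the launcher setup is a commands file with one fastp command per line. As a hypothetical sketch (the _R1_001/_R2_001 Illumina naming pattern and the trim_commands file name are assumptions, not part of the tutorial), a loop like the following could generate it; per-sample -j and -h names are included so each run does not overwrite the default fastp.html/fastp.json reports:

Code Block
languagebash
titleSketch: generating a commands file for the launcher
mkdir -p Trim_Reads
for R1 in Raw_Reads/*_R1_001.fastq.gz; do
    R2=${R1/_R1_/_R2_}                    # matching R2 file, assuming Illumina paired naming
    base=$(basename $R1 _R1_001.fastq.gz) # sample name without the read suffix
    echo "fastp -i $R1 -I $R2 -o Trim_Reads/${base}_R1_001.trim.fastq.gz -O Trim_Reads/${base}_R2_001.trim.fastq.gz -j ${base}.json -h ${base}.html"
done > trim_commands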
...

Based on what we have already seen, it is probably not surprising that using 68 (really 16) threads and only evaluating 1 sample at a time took approximately the same amount of time as it did when running on an idev node, as those conditions are functionally equivalent. What may be surprising is the lack of improvement despite running 4x more samples at the same time. Potential hypotheses:

  1. The amount of overhead the job launcher has in checking when a command finishes and starting the next command is similar to the amount of overhead in splitting reads into 4 sets
  2. The biggest contributor to time is not something that can be improved with additional threads

...

Info

The take away here is that "should I use more threads per command, or launch more simultaneous commands" is not a simple question to answer. Perhaps as you become more familiar with a particular command you can tease these apart, but more often than not you will find yourself balancing the two: launching jobs with 4-16 threads each and as many simultaneous commands as the 68 available threads can accommodate (for example, 17 commands at 4 threads each).

...

  1. Consider moving back over to the multiqc tutorial and using multiqc to determine how well the trimming worked. 
  2. The reads could then be further assembled in the Spades tutorial, as they are all plasmid samples.


Return to GVA2023 home page