fastp - GVA2022

Overview

As mentioned in the introduction tutorial as well as the read processing tutorial, read processing can have a huge impact on downstream work. While cutadapt, which was introduced in the read processing tutorial, is great for quick evaluation or for dealing with a single bad sample, it is not as robust as some other trimmers, particularly when it comes to removing sequence that you know shouldn't be present but may exist in odd orientations (such as adapter sequences from the library preparation). This tutorial is adapted from the 2021 trimmomatic tutorial, which sought to do the same basic things as fastp: get rid of adapter sequences first and foremost, ideally even before fastQC, so that any quality- or length-based improvements are made on actual data, not artifacts. The #1 biggest reason fastp is now the instructor's preferred trimming program is this box taken from the trimmomatic tutorial:

A note on the adapter file used here

The adapter file listed here is likely the correct one to use for standard library preps that have been generated in the last few years, but may not be appropriate for all library preps (such as single end sequencing adapters, nextera based preps, and certainly not appropriate for PacBio generated data). Look to both the trimmomatic documentation and your experimental procedures at the bench to figure out if the adapter file is sufficient or if you need to create your own.

The more collaborative your work is, the less confidence you will have in picking the correct adapter file with trimmomatic. While conda installations make it fairly easy to test several different adapter files, fastp does all the guesswork for you, and can generate some interesting graphs itself.

Learning objectives:

  1. Install fastp

  2. Remove adapter sequences from some plasmids and evaluate the effect on read quality or assembly.

Installing fastp

fastp's home page can be found on github and has links to the paper describing the program, installation instructions for conda, and information on each of the different options available to the program. This is far above the quality of the average program: most will not have a user manual (or not nearly so detailed a one), may not have been updated since they were originally published (or may never have been published at all), etc. That fastp has been updated since publication is one of the things that makes it such a good tool: the more people who use it, the more likely problems are to be found, and having a group actively improving the program significantly increases its longevity.

There actually are not a lot of "wrong" answers here, at least from the theoretical side. Since read processing takes place upstream of basically all other analysis steps, it makes sense to put it in almost every environment.

Practically, though, that means fastp would have to be installed in every environment, which starts to defeat the purpose of having different environments at all. As will be discussed on Friday, you might want to start thinking about grouping programs into chunks. Almost no matter what analysis you do, you are going to want to trim adapters (fastp), check the quality (fastqc), and likely compare to other similar samples (multiqc). So putting all three programs into a single "read pre-processing" environment seems like a good grouping.

At this point in the class you can start making your own calls about which environments you want to put programs in, and what names you want to give them. You can keep using the same names and groupings I suggest, but last year there was feedback that having to decide how to modify commands based on different environments was itself a helpful exercise.

Example command for creating a new environment
conda create -n GVA-ReadPreProcessing -c bioconda -c conda-forge fastp fastqc multiqc



Trimming adapter sequences

Example generic command

Example command for trimming illumina paired end adapters
fastp -i <READ1> -I <READ2> -o <TRIM1> -O <TRIM2> --threads # --detect_adapter_for_pe -j <LOG.json> -h <LOG.html>

Breaking down the parts of the above command:

| Part | Purpose | Replace with / note |
| --- | --- | --- |
| fastp | tell the computer you are using the fastp program | |
| -i <READ1> | fastq file of read 1 you are trying to trim | actual name of the read 1 fastq file |
| -I <READ2> | fastq file of read 2 you are trying to trim | actual name of the paired read 2 fastq file |
| -o <TRIM1> | output file for the trimmed read 1 fastq | desired name of the trimmed read 1 fastq file |
| -O <TRIM2> | output file for the trimmed read 2 fastq | desired name of the paired trimmed read 2 fastq file |
| --threads # | use more processors to make the command run faster | number of processors to use (68 max on stampede2) |
| --detect_adapter_for_pe | automatically detect adapter sequences based on the paired end reads, and remove them | |
| -j <LOG.json> | json file with information about how the trim was accomplished; can be helpful for comparing multiple samples, similar to a multiqc analysis | name of the json file you want to use |
| -h <LOG.html> | html file with information similar to the json file, but with graphs | name of the html file you want to use |

All of the above has been put together from the fastp --help command.
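The -j json report mentioned above is easy to mine once you have several samples. The sketch below writes a small mock report first so it can run anywhere (the file name trim.json and the read counts are made up for the demo; the summary/before_filtering/after_filtering keys follow the layout of fastp's real JSON report, which -j would write for you):

```shell
# Mock up the summary section of a fastp JSON report (stand-in for a real -j output)
cat > trim.json <<'JSON'
{"summary": {"before_filtering": {"total_reads": 200000},
             "after_filtering":  {"total_reads": 198500}}}
JSON

# Pull the before/after read totals out of the report
python3 - <<'PY'
import json
report = json.load(open("trim.json"))
print("reads before filtering:", report["summary"]["before_filtering"]["total_reads"])
print("reads after filtering: ", report["summary"]["after_filtering"]["total_reads"])
PY
```

Pointed at real fastp reports, the same few lines let you tabulate trimming losses across many samples without opening each html report.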

Trimming a single sample

Get some data

set up directories and copy files
mkdir -p $SCRATCH/GVA_fastp_1sample/Trim_Reads $SCRATCH/GVA_fastp_1sample/Raw_Reads
cd $SCRATCH/GVA_fastp_1sample
cp $BI/gva_course/plasmid_qc/E1-7* Raw_Reads

The ls command should show you 2 gzipped fastq files. You may notice that here we used a wildcard in the middle of our copy path for the first time. This lets you grab both R1 and R2 easily without having to type out two full commands. Double tab will help tell you when you have a sufficiently specific base name to only get the files you are after.

According to mkdir's --help information: "-p, --parents     no error if existing, make parent directories as needed", so it is allowing us to make nested directories rather than having to make them one at a time. Additionally, separating the two paths with a space creates both directories with a single command.
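Both tricks are easy to try out on their own. This small self-contained demo uses invented directory and file names so it can run anywhere:

```shell
# mkdir -p makes two nested paths in one command (all names here are made up)
mkdir -p demo_project/Raw_Reads demo_project/Trim_Reads

# fake R1/R2 files so the wildcard copy has something to grab
touch demo_project/E1-7_R1.fastq.gz demo_project/E1-7_R2.fastq.gz

# a mid-path wildcard picks up both reads with one cp
cp demo_project/E1-7* demo_project/Raw_Reads
ls demo_project/Raw_Reads
```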

Almost every command has more information about it that can be read at the command line

We have used -h and --help, tried calling commands without any options, and mentioned the 'man' command throughout the course for the various programs we have installed. Here we see that the same framework gives us more information about even the most basic of commands, without even needing the internet.



Trim the fastq files

The following command can be run on the head node. As with FastQC, if we are dealing with fewer than roughly 1-2 million reads it is reasonable to run the command on the head node, unless we have 100s of samples, in which case submitting to the queue will be faster since the files can all be trimmed at once rather than one at a time. Use what you have learned in the class to determine whether you think this command should be run on the head node. (This was covered in more detail in the first part of the evaluating and processing read quality tutorial.)

Figuring out how many reads are in each file
zgrep -c "^+$" Raw_Reads/*.fastq.gz
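If you want to see why that zgrep pattern works, remember that a fastq record is 4 lines and the separator line that is exactly "+" appears once per read. Here is a tiny made-up two-read file pair you can count the same way (all names and sequences invented for the demo):

```shell
# build a 2-read demo fastq pair and count records with the same zgrep trick
mkdir -p demo_count
printf '@read1\nACGTACGT\n+\nFFFFFFFF\n@read2\nTTGGCCAA\n+\nFFFFFFFF\n' \
  | gzip > demo_count/demo_R1.fastq.gz
printf '@read1\nACGTACGT\n+\nFFFFFFFF\n@read2\nTTGGCCAA\n+\nFFFFFFFF\n' \
  | gzip > demo_count/demo_R2.fastq.gz
zgrep -c "^+$" demo_count/*.fastq.gz   # reports 2 reads per file
```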
Example command for trimming illumina paired end adapters
fastp -i Raw_Reads/E1-7_S187_L001_R1_001.fastq.gz -I Raw_Reads/E1-7_S187_L001_R2_001.fastq.gz -o Trim_Reads/E1-7_S187_L001_R1_001.trim.fastq.gz -O Trim_Reads/E1-7_S187_L001_R2_001.trim.fastq.gz -w 4 --detect_adapter_for_pe

The most likely cause of an error here is that you forgot to activate your new conda environment. If you have a different issue, you will likely want to ask a question.
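When you eventually have many samples rather than one, the same command can be wrapped in a loop. The sketch below only echoes the fastp commands it would run (remove the echo to actually trim); the touch line fabricates demo files so the sketch runs anywhere, and the _R1_/_R2_ substitution assumes the standard Illumina naming shown above:

```shell
# dry-run loop over paired fastq files; remove "echo" to actually run fastp
mkdir -p Raw_Reads Trim_Reads
# demo files so the sketch runs anywhere; skip this line if you have real data
touch Raw_Reads/E1-7_S187_L001_R1_001.fastq.gz Raw_Reads/E1-7_S187_L001_R2_001.fastq.gz

for R1 in Raw_Reads/*_R1_001.fastq.gz; do
    R2=${R1/_R1_/_R2_}                                    # matching read 2 file
    T1=Trim_Reads/$(basename "${R1%.fastq.gz}").trim.fastq.gz
    T2=Trim_Reads/$(basename "${R2%.fastq.gz}").trim.fastq.gz
    echo fastp -i "$R1" -I "$R2" -o "$T1" -O "$T2" -w 4 --detect_adapter_for_pe
done
```

For 100s of samples you would put the real (echo-less) loop in a job script and submit it to the queue as discussed above.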

Evaluating the output

Using everything you have learned so far in the class, can you answer the following questions?