...

fastp's home page can be found on github and has links to the paper discussing the program, installation instructions for conda, and information on each of the different options available to the program. This is far more documentation than the average program will have: most will not have a user manual (or not nearly so detailed a one), may not have been updated since originally published (or may never have been published at all), etc. Continued updates since publication are part of what makes fastp such a good tool: the more people who use it, the more likely problems are to be found, and having a group actively improving the program significantly increases its longevity.

...

Code Block
languagebash
titleExample command for creating a new environment
conda create -n GVA-ReadPreProcessing -c bioconda -c conda-forge fastp fastqc multiqc

...

Breaking down the parts of the above command:

Part | Purpose | Replace with/note
fastp | tell the computer you are using the fastp program |
-i <READ1> | fastq file of read 1 you are trying to trim | actual name of the R1 fastq file
-I <READ2> | fastq file of read 2 you are trying to trim | actual name of the paired R2 fastq file
-o <TRIM1> | output file of trimmed fastq read 1 | desired name of the trimmed R1 fastq file
-O <TRIM2> | output file of trimmed fastq read 2 | desired name of the paired trimmed R2 fastq file
-w # (--thread) | use more processors to make the command run faster | number of processors to use (68 max on stampede2)
--detect_adapter_for_pe | automatically detect adapter sequences based on paired end read overlap, and remove them |
-j <LOG.json> | json file with information about how the trim was accomplished; can be helpful for looking at multiple samples, similar to a multiqc analysis | name of the json file you want to use
-h <LOG.html> | html file with information similar to the json file, but with graphs | name of the html file you want to use

All of the above was put together from the fastp --help command.

...

Code Block
languagebash
titleset up directories and copy files
mkdir -p $SCRATCH/GVA_fastp_1sample/Trim_Reads $SCRATCH/GVA_fastp_1sample/Raw_Reads
cd $SCRATCH/GVA_fastp_1sample
cp $BI/gva_course/plasmid_qc/E1-7* Raw_Reads/

The ls Raw_Reads command should show you 2 gzipped fastq files. You may notice that we used a wildcard in the middle of our copy path for the first time. This is done so that you can grab both R1 and R2 easily without having to type out two full commands. Double tab will help tell you when you have a sufficiently specific base name to only get the files you are after.
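
As a hypothetical illustration (toy file names in a throwaway temp directory, not the course data), here is how a mid-path wildcard grabs both reads of a pair in one command:

```shell
# Toy demonstration of a mid-path wildcard copy (hypothetical file names)
demo=$(mktemp -d)
mkdir -p "$demo/source" "$demo/Raw_Reads"
touch "$demo/source/E1-7_S187_L001_R1_001.fastq.gz" \
      "$demo/source/E1-7_S187_L001_R2_001.fastq.gz" \
      "$demo/source/OTHER_S1_L001_R1_001.fastq.gz"

# E1-7* expands to every file whose name begins with E1-7,
# so R1 and R2 are copied together while the decoy file is left behind
cp "$demo"/source/E1-7* "$demo/Raw_Reads/"

ls "$demo/Raw_Reads" | wc -l    # 2
```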



Expand
titleBonus question: what is the -p option doing in the mkdir command

According to the --help information: "-p, --parents     no error if existing, make parent directories as needed", so it allows us to make nested directories rather than having to make them 1 at a time. Additionally, we separate the two paths with a space to create 2 directories at the same time.
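
A quick sketch (in a throwaway temp directory) of what -p buys us:

```shell
# Toy demonstration of mkdir -p (throwaway temp directory)
demo=$(mktemp -d)

# one command, two nested paths: -p creates the missing parent "project"
# and raises no error even though neither directory exists yet
mkdir -p "$demo/project/Trim_Reads" "$demo/project/Raw_Reads"

# running it again is harmless: "no error if existing"
mkdir -p "$demo/project/Trim_Reads"

ls "$demo/project"    # Raw_Reads  Trim_Reads
```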

Info
titleAlmost every command has more information about it that can be read at the command line

We have used -h and --help, tried calling commands without any options, and mentioned the 'man' command throughout the course for the various programs we have installed; here we see we can use that same framework to access more information about even the most basic of commands, without even needing the internet.



Trim the fastq files

The following command can be run on the head node. As with FastQC, if we are dealing with fewer than roughly 1-2 million reads it is reasonable to run the command on the head node; if we had 100s of samples, submitting to the queue would be faster since the files could all be trimmed at once rather than 1 at a time. Use what you have learned in the class to determine if you think this command should be run on the head node. (This was covered in more detail in the first part of the evaluating and processing read quality tutorial.)

Code Block
languagebash
titleFiguring out how many reads are in each file
collapsetrue
zgrep -c "^+$" Raw_Reads/*.fastq.gz
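
Why does this count reads? Each fastq record is 4 lines, and the 3rd is a separator line containing just "+". A toy, hand-made example is below (note that a quality line could in rare cases also be exactly "+", which would slightly inflate the count):

```shell
# Toy two-read fastq file showing why lines that are exactly "+" count reads
demo=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTTT\n+\nIIII\n' | gzip > "$demo/toy.fastq.gz"

# -c counts matching lines; ^+$ anchors the match to lines that are only "+"
zgrep -c "^+$" "$demo"/*.fastq.gz    # 2
```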


Code Block
languagebash
titleExample command for trimming illumina paired end adapters
fastp -i Raw_Reads/E1-7_S187_L001_R1_001.fastq.gz -I Raw_Reads/E1-7_S187_L001_R2_001.fastq.gz -o Trim_Reads/E1-7_S187_L001_R1_001.trim.fastq.gz -O Trim_Reads/E1-7_S187_L001_R2_001.trim.fastq.gz -w 4 --detect_adapter_for_pe 


Expand
titleDid you get an error?

The most likely cause here is that you forgot to activate your new conda environment. If you have another issue, you will likely want to ask a question.

Evaluating the output

Using everything you have learned so far in the class, can you answer the following questions?


  1. Expand
    titleWhat are the 4 new files that were generated, and where did they come from?
    • E1-7_S187_L001_R2_001.trim.fastq.gz and E1-7_S187_L001_R1_001.trim.fastq.gz

      • These were created with the -o and -O options; they are in the Trim_Reads folder, and you likely found them using the ls command
    • fastp.html and fastp.json
      • These are log files created by default since we didn't specify their names. This is part of why -j and -h were discussed above with the general command. 
      • While the json file can be evaluated in the terminal (cat, less, more, head, tail), the html file has to be transferred back to your computer to view.



  2. Expand
    titleHow many reads were left after trimming?
    • 5884 paired end reads

    • 11768 total reads

      • You likely found this out from using the zgrep command, or from the following blocks that printed as the command ran:

        No Format
        Read1 after filtering:
        total reads: 5884
        total bases: 791763
        Q20 bases: 782948(98.8867%)
        Q30 bases: 765510(96.6842%)
        
        Read2 after filtering:
        total reads: 5884
        total bases: 791763
        Q20 bases: 711414(89.8519%)
        Q30 bases: 658164(83.1264%)
        
        Filtering result:
        reads passed filter: 11768
        reads failed due to low quality: 2014
        reads failed due to too many N: 0
        reads failed due to too short: 0
        reads with adapter trimmed: 3970
        bases trimmed due to adapters: 193972




  3. Expand
    titleHow big was our fragment size? How is this estimated, what might make it inaccurate?
    • From the information generated while the command ran we see:

      • No Format
        Insert size peak (evaluated by paired-end reads): 171


      • This tells us that the insert size peak was 171 bases, and that it was estimated by looking at the overlap between the read pairs. It is potentially inaccurate because read pairs that do not overlap cannot be used to estimate the size.
    • If you transferred the .html file back to your laptop, you would see this relevant histogram:
      (insert size histogram image from the fastp html report)
      • The general section of the summary at the top of the html tells us that the average insert size was 171, while the histogram tells us that 50% of our data is <18 or >272 bases



  4. Expand
    titleDid our sample have any adapter present?
    • If you only look at the information that printed to the screen, you probably answered "No"
      • you likely saw the following block and thought this was the end of the answer:

      • No Format
        Detecting adapter sequence for read1...
        No adapter detected for read1
        
        Detecting adapter sequence for read2...
        No adapter detected for read2


    • A fuller answer might be "maybe", "probably", or "I'm not sure", as:
      • 1. Finding no adapter at all would be very rare
      • 2. If 45% of our reads have an insert size of 171 bases, and we did 151bp PE sequencing, we should be able to find adapter sequences
      • 3. in the filtering results we see:

      • No Format
        Filtering result:
        reads passed filter: 11768
        reads failed due to low quality: 2014
        reads failed due to too many N: 0
        reads failed due to too short: 0
        reads with adapter trimmed: 3970
        bases trimmed due to adapters: 193972


    • If you look at the html file you probably answered "yes"
      • There is a section for Read1 and Read2 adapters which show a growing stretch of DNA which recreates the illumina adapter sequences.



  5. Expand
    titleWhy was answering if there were adapters present not straightforward?

    As we saw in our fastqc reports (overrepresented sequences having "no hit" and adapter content staying at the bottom of the graph), for something to be classified as an "adapter" in the first section of the printed information, it has to meet certain criteria that in this (and many other instances) are perhaps a bit too stringent.



  6. Expand
    titleWhat other interesting things can you find from this command?

    This is pretty open ended, take a look at the html file in particular, see what of it does or doesn't make sense and consider asking a question if you would like to know more.



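The insert-size and adapter questions above reduce to simple arithmetic. A sketch assuming 2x151 bp paired-end reads (the read length is an assumption taken from the discussion above):

```shell
read_len=151   # assumed read length (151 bp paired-end sequencing)
insert=171     # insert size peak reported by fastp

# read pairs overlap whenever the insert is shorter than the combined read length
overlap=$(( 2 * read_len - insert ))
echo "overlap at the insert peak: $overlap bases"    # 131

# adapter read-through only happens when the insert is shorter than one read;
# at the 171-base peak there is none, but shorter fragments will contain adapter
short_insert=120
echo "adapter bases per read at insert=$short_insert: $(( read_len - short_insert ))"    # 31
```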

Trim all the samples from the multiqc tutorial

As mentioned above, if you have already done the multiqc tutorial, you can use your new fastp command to remove the adapter sequences from all 544 files, or make other changes based on what you saw in the fastqc/multiqc reports.

Get some data

Code Block
languagebash
titleset up directories and copy files
mkdir -p $SCRATCH/GVA_fastp_multiqcSamples/Trim_Reads $SCRATCH/GVA_fastp_multiqcSamples/04_Trim_Logs
cd $SCRATCH/GVA_fastp_multiqcSamples
cp -r $BI/gva_course/plasmid_qc/ Raw_Reads/

The ls command will now show the directories you just created, and, like in the multiqc tutorial, you can use the ls and wc commands to figure out how many files you will be working with.
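
For example (toy empty files standing in for the real data), counting files with ls and wc:

```shell
# Toy stand-ins for the real fastq files (hypothetical names)
demo=$(mktemp -d)
touch "$demo"/S{1..4}_R1.fastq.gz "$demo"/S{1..4}_R2.fastq.gz

# each line of ls output is one file; wc -l counts the lines
ls "$demo"/*.fastq.gz | wc -l    # 8 files = 4 samples x 2 reads each
```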

Trim the fastq files

Just as we used a for loop to set up a set of FastQC commands in the multiqc tutorial, we can use a similar for loop to generate a single file with 272 trim commands for the 544 total files.

The following command will pair all R1 reads in the Raw_Reads folder with its R2 pair, determine the base name, and generate a command to trim the file 

Code Block
languagebash
titlefor loop to generate all trim commands needed
for R1 in Raw_Reads/*_R1_*001.fastq.gz; do R2=$(echo $R1 | sed 's/_R1_/_R2_/'); name=$(echo $R1 | sed 's/_R1_001.fastq.gz//' | sed 's/Raw_Reads\///'); echo "fastp -i $R1 -I $R2 -o Trim_Reads/$name.trim.R1.fastq.gz -O Trim_Reads/$name.trim.R2.fastq.gz -w 4 --detect_adapter_for_pe -j 04_Trim_Logs/$name.json -h 04_Trim_Logs/$name.html &> 04_Trim_Logs/$name.log.txt"; done > fastp.commands

Use head and wc -l to see what the output is and how much of it there is, respectively.
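
Before submitting hundreds of commands, it can be worth a sanity check that every R1 file really has an R2 partner. A hypothetical sketch (toy file names) using the same _R1_ to _R2_ substitution as the loop above:

```shell
# Toy paired files (hypothetical names following the Illumina convention)
demo=$(mktemp -d); mkdir "$demo/Raw_Reads"
touch "$demo"/Raw_Reads/A_S1_L001_R{1,2}_001.fastq.gz \
      "$demo"/Raw_Reads/B_S2_L001_R{1,2}_001.fastq.gz

missing=0
for r1 in "$demo"/Raw_Reads/*_R1_*.fastq.gz; do
    r2=${r1/_R1_/_R2_}          # same R1 -> R2 substitution as the loop above
    if [ ! -f "$r2" ]; then
        echo "no pair found for $r1"
        missing=$(( missing + 1 ))
    fi
done
echo "files without a pair: $missing"    # 0
```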

Again, as we discussed in the multiqc tutorial, running this number of commands is a borderline case: there are not a lot of total reads, but there are a large number of samples, and our command requests additional processors, so it should not be run on the head node.

Code Block
languagebash
titleModify your slurm file
cp /corral-repl/utexas/BioITeam/gva_course/GVA2022.launcher.slurm trim.slurm
nano trim.slurm

Again, while in nano you will edit most of the same lines you edited in the breseq tutorial. Note that most of these lines have additional text to the right; this commented text is there to remind you what goes on each line. Leaving it alone will not hurt anything, while removing it may make it more difficult for you to remember the purpose of the line.

Line number | As is | To be
16 | #SBATCH -J jobName | #SBATCH -J multi_fastp
17 | #SBATCH -n 1 | #SBATCH -n 4
21 | #SBATCH -t 12:00:00 | #SBATCH -t 0:20:00
22 | ##SBATCH --mail-user=ADD | #SBATCH --mail-user=<YourEmailAddress>
23 | ##SBATCH --mail-type=all | #SBATCH --mail-type=all
27 | conda activate GVA2021 | conda activate GVA-ReadPreProcessing
29 | export LAUNCHER_JOB_FILE=commands | export LAUNCHER_JOB_FILE=fastp.commands

The changes to lines 22 and 23 are optional but will give you an idea of what types of email you can expect from TACC if you choose to use these options. Just be sure these 2 lines start with a single # symbol after editing them.

Line 27 assumes you named your environment GVA-ReadPreProcessing when you created it at the start of this tutorial.


Again use ctrl-o and ctrl-x to save the file and exit.

...

  1. Consider moving back over to the multiqc tutorial and using multiqc to determine how well the trimming worked.
  2. The reads could then be further assembled in the Spades tutorial, as they are all plasmid samples.


Return to GVA2022 home page