...
fastp's home page can be found on GitHub and has links to the paper describing the program, installation instructions for conda, and information on each of the options available to the program. This is far above the quality of documentation the average program will have, as most will not have a user manual (or not nearly so detailed a one), may not have been updated since originally published (or may never have been published), etc. Continued updates since publication are part of what makes fastp such a good tool: the more people who use it, the more likely problems are to be found, and having a group actively improving the program significantly increases its longevity.
...
| Code Block | ||||
|---|---|---|---|---|
| ||||
conda create -n GVA-ReadPreProcessing -c bioconda -c conda-forge fastp fastqc multiqc |
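Once the environment is built, activate it and confirm the new programs are available; this assumes you kept the environment name used in the create command above:

| Code Block | ||||
|---|---|---|---|---|
| ||||
conda activate GVA-ReadPreProcessing
fastp --version
fastqc --version |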
...
Breaking down the parts of the above command:
| Part | Purpose | Replace with/note |
|---|---|---|
| fastp | tell the computer you are using the fastp program | |
| -i <READ1> | fastq file of read 1 you are trying to trim | actual name of the R1 fastq file |
| -I <READ2> | fastq file of read 2 you are trying to trim | actual name of the paired R2 fastq file |
| -o <TRIM1> | output file for the trimmed read 1 fastq | desired name of the trimmed R1 fastq file |
| -O <TRIM2> | output file for the trimmed read 2 fastq | desired name of the paired trimmed R2 fastq file |
| -w # (--thread) | use more processors, making the command run faster | number of threads to use (Stampede2 nodes have 68 cores, though as noted later fastp will use at most 16) |
| --detect_adapter_for_pe | automatically detect adapter sequences based on the paired end reads, and remove them | |
| -j <LOG.json> | json file with information about how the trim was accomplished; can be helpful for looking at multiple samples, similar to a multiqc analysis | name of the json file you want to use |
| -h <LOG.html> | html file with information similar to the json file, but with graphs | name of the html file you want to use |
All of the above has been put together from the fastp --help command.
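Putting those pieces together, a generic paired-end command looks like the sketch below. The <READ1>, <READ2>, <TRIM1>, <TRIM2>, and <LOG> names are placeholders, not real files; the actual names used in this tutorial are filled in a few steps further down.

| Code Block | ||||
|---|---|---|---|---|
| ||||
fastp -i <READ1>.fastq.gz -I <READ2>.fastq.gz -o <TRIM1>.fastq.gz -O <TRIM2>.fastq.gz -w 4 --detect_adapter_for_pe -j <LOG>.json -h <LOG>.html |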
...
| Code Block | ||||
|---|---|---|---|---|
| ||||
mkdir -p $SCRATCH/GVA_fastp_1sample/Trim_Reads $SCRATCH/GVA_fastp_1sample/Raw_Reads
cd $SCRATCH/GVA_fastp_1sample
cp $BI/gva_course/plasmid_qc/E1-7* Raw_Reads/ |
The ls Raw_Reads command should show you 2 gzipped fastq files. You may notice that here we used a wildcard in the middle of our copy path for the first time. This is done so that you can grab both R1 and R2 easily without having to type out two full commands. Double tab will help tell you when you have a sufficiently specific base name to only get the files you are after.
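If you want to see what a wildcard will match before using it in a command (an optional sanity check), you can ls the same pattern, and then confirm the copy worked:

| Code Block | ||||
|---|---|---|---|---|
| ||||
ls $BI/gva_course/plasmid_qc/E1-7*
ls Raw_Reads |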
Trim the fastq files
The following command can be run on the head node. As with FastQC, if we are dealing with fewer than roughly 1-2 million reads it is reasonable to run the command on the head node, unless we have 100s of samples, in which case submitting to the queue will be faster since the files can be trimmed all at once rather than one at a time. Use what you have learned in the class to determine if you think this command should be run on the head node. (This was covered in more detail in the first part of the evaluating and processing read quality tutorial.)
| Code Block | ||||||
|---|---|---|---|---|---|---|
| ||||||
zgrep -c "^+$" Raw_Reads/*.fastq.gz |
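The zgrep command above counts the "+" separator line that occurs once per read in each gzipped fastq file. As a purely optional cross-check, you can count all lines in a file and divide by 4, since every fastq record is exactly 4 lines; the file name below is the R1 file copied earlier:

| Code Block | ||||
|---|---|---|---|---|
| ||||
echo $(( $(zcat Raw_Reads/E1-7_S187_L001_R1_001.fastq.gz | wc -l) / 4 )) |

With an idea of how many reads are present, the trimming command for this sample is: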
| Code Block | ||||
|---|---|---|---|---|
| ||||
fastp -i Raw_Reads/E1-7_S187_L001_R1_001.fastq.gz -I Raw_Reads/E1-7_S187_L001_R2_001.fastq.gz -o Trim_Reads/E1-7_S187_L001_R1_001.trim.fastq.gz -O Trim_Reads/E1-7_S187_L001_R2_001.trim.fastq.gz -w 4 --detect_adapter_for_pe |
| Expand | ||
|---|---|---|
| ||
Most likely cause here is that you forgot to activate your new conda environment. If you have another issue, you will likely want to ask a question. |
Evaluating the output
Using everything you have learned so far in the class, can you answer the following questions?
| Expand | What are the 4 new files that were generated, and where did they come from? |
|---|---|
| E1-7_S187_L001_R1_001.trim.fastq.gz and E1-7_S187_L001_R2_001.trim.fastq.gz
- These were created with the -o and -O options, they are in the Trim_Reads folder, and you likely found them using the ls command.
fastp.html and fastp.json
- These are log files created by default since we didn't specify their names. This is part of why -j and -h were discussed above with the general command.
- While the json file can be evaluated in the terminal (cat, less, more, head, tail), the html file has to be transferred back to your computer to view. |
| Expand | How many reads were left after trimming? |
|---|---|
| 5884 paired end reads (11768 total reads).
You likely found this out from using the zgrep command, or from the following blocks that printed as the command ran:
| No Format |
|---|
Read1 after filtering:
total reads: 5884
total bases: 791763
Q20 bases: 782948(98.8867%)
Q30 bases: 765510(96.6842%)

Read2 after filtering:
total reads: 5884
total bases: 791763
Q20 bases: 711414(89.8519%)
Q30 bases: 658164(83.1264%)

Filtering result:
reads passed filter: 11768
reads failed due to low quality: 2014
reads failed due to too many N: 0
reads failed due to too short: 0
reads with adapter trimmed: 3970
bases trimmed due to adapters: 193972 |
| Expand | How big was our fragment size? How is this estimated, and what might make it inaccurate? |
|---|---|
| From the information generated while the command ran we see:
| No Format |
|---|
Insert size peak (evaluated by paired-end reads): 171 |
- This tells us that the insert size peak was 171 bases, and that it was estimated by looking at the overlap between the read pairs. It is potentially inaccurate as reads which do not overlap each other cannot be used to estimate the size.
- If you transferred the .html file back to your laptop, you would also see the relevant insert size histogram.
- The general section of the summary at the top of the html tells us that the insert size peak was 171, while the histogram tells us that 50% of our data is <18 or >272 bases. |
| Expand | Did our sample have any adapter present? |
|---|---|
| If you only look at the information that printed to the screen, you probably answered "No":
- You likely saw the following block and thought this was the end of the answer:
| No Format |
|---|
Detecting adapter sequence for read1...
No adapter detected for read1

Detecting adapter sequence for read2...
No adapter detected for read2 |
A fuller answer might be "maybe", "probably", or "I'm not sure", as:
1. Not finding any adapter at all would be very rare.
2. If 45% of our reads have an insert size of 171 bases, and we did 151bp PE sequencing, we should be able to find adapter sequences.
3. In the filtering results we see:
| No Format |
|---|
Filtering result:
reads passed filter: 11768
reads failed due to low quality: 2014
reads failed due to too many N: 0
reads failed due to too short: 0
reads with adapter trimmed: 3970
bases trimmed due to adapters: 193972 |
If you looked at the html file you probably answered "yes":
- There is a section for Read1 and Read2 adapters which shows a growing stretch of DNA that recreates the Illumina adapter sequences. |
| Expand | Why was answering whether adapters were present not straightforward? |
|---|---|
| Like we saw in our fastqc reports (overrepresented sequences having "no hit" and adapter content staying at the bottom of the graph), for something to be classified as an "adapter" in the first section of the printed information, it has to meet certain criteria that in this (and many other) instances is perhaps a bit too stringent. |
| Expand | What other interesting things can you find from this command? |
|---|---|
| This is pretty open ended; take a look at the html file in particular, see what does or doesn't make sense, and consider asking a question if you would like to know more. |
Trim all the samples from the multiqc tutorial
As mentioned above, if you have already done the multiqc tutorial, you can use your new fastp command to remove the adapter sequences from all 544 samples or make other changes based on what you saw from the fastqc/multiqc reports.
Get some data
| Code Block | ||||
|---|---|---|---|---|
| ||||
mkdir -p $SCRATCH/GVA_fastp_multiqcSamples/Trim_Reads
cd $SCRATCH/GVA_fastp_multiqcSamples
cp -r $BI/gva_course/plasmid_qc/ Raw_Reads/ |
The ls command will now show 2 directories, and like in the multiqc tutorial, you can use ls and wc commands to figure out how many files you will be working with.
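For example, assuming the cp command placed the fastq files directly inside Raw_Reads, the following counts them (each sample contributes an R1 and an R2 file):

| Code Block | ||||
|---|---|---|---|---|
| ||||
ls Raw_Reads/*.fastq.gz | wc -l |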
Trim the fastq files
Just as we used a for loop to set up a set of FastQC commands in the multiqc tutorial, we can use a similar for loop to generate a single file with 272 trim commands for the 544 total files.
The following command will pair each R1 file in the Raw_Reads folder with its R2 mate, determine the sample's base name, and generate a command to trim that pair of files:
| Code Block | ||||
|---|---|---|---|---|
| ||||
for R1 in Raw_Reads/*_R1_001.fastq.gz; do R2=$(echo $R1| sed 's/_R1_/_R2_/'); name=$(echo $R1|sed 's/_R1_001.fastq.gz//'|sed 's/Raw_Reads\///'); echo "fastp -i $R1 -I $R2 -o Trim_Reads/$name.trim.R1.fastq.gz -O Trim_Reads/$name.trim.R2.fastq.gz -w 4 --detect_adapter_for_pe -j Trim_Logs/$name.json -h Trim_Logs/$name.html &> Trim_Logs/$name.log.txt";done > fastp.commands |
Use head and wc -l to see what the output looks like and how much of it there is, respectively.
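Concretely, fastp.commands is the file the for loop just created, so:

| Code Block | ||||
|---|---|---|---|---|
| ||||
head fastp.commands
wc -l fastp.commands |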
| Tip | ||
|---|---|---|
| ||
The "&>" pipe is one of the most important you can learn when working with multiple samples, and running the commands on a distributed network. In addition to seeing 272 variations on the same command we ran above, you also see the command ends with "&> Trim_Logs/sample.log.txt". What this says is take all of the errors (&) and all of the standard output (>) that would normally print to the screen and instead put it in the file listed next. As we will discuss on Friday, when we submit jobs to the queue, we dont get to see those jobs run, we only get to see the final result. By using "&>" we're able to store this in a file so we can look at it later. For some programs (fastQC)this will only help with evaluating things if something goes wrong, but in the case of fastp (and others) there is actual data printed to the screen that is informative |
Again, as we discussed in the multiqc tutorial, running this number of commands is a borderline case between running in an idev session and running as a submitted job: there are not a lot of total reads, but there are a large number of samples. Because our command requests additional processors we should not run it on the head node, and if we instead used only a single processor, the job would take so long on the head node that even waiting in the queue would be faster.
Do only one of the following, but do read through both options as there are different discussions about the process in each.
Submitting as a job
| Code Block | ||||
|---|---|---|---|---|
| ||||
cp /corral-repl/utexas/BioITeam/gva_course/GVA2022.launcher.slurm trim.slurm
nano trim.slurm |
Again, while in nano you will edit most of the same lines you edited in the breseq tutorial. Note that most of these lines have additional text to the right of the line. This commented text is there to help remind you what goes on each line; leaving it alone will not hurt anything, while removing it may make it more difficult for you to remember what the purpose of the line is.
| Line number | As is | To be |
|---|---|---|
| 16 | #SBATCH -J jobName | #SBATCH -J multi_fastp |
| 17 | #SBATCH -n 1 | #SBATCH -n 17 |
| 21 | #SBATCH -t 12:00:00 | #SBATCH -t 0:20:00 |
| 22 | ##SBATCH --mail-user=ADD | #SBATCH --mail-user=<YourEmailAddress> |
| 23 | ##SBATCH --mail-type=all | #SBATCH --mail-type=all |
| 29 | export LAUNCHER_JOB_FILE=commands | export LAUNCHER_JOB_FILE=fastp.commands |
Line 17 being set to -n 17 allows 17 jobs to run at the same time; since our command uses -w 4 (4 threads), this job will use all 68 threads available on the node. The changes to lines 22 and 23 are optional but will give you an idea of what types of email you can expect from TACC if you choose to use these options. Just be sure that these 2 lines start with a single # symbol after editing them.
Again use ctl-o and ctl-x to save the file and exit.
| Expand | ||
|---|---|---|
| ||
Creating your own empty folder before running a command will never cause a problem, but not creating one can cause problems if the program can't create it for you. As you work with these programs more and more you will either: 1. get a feel for which type of program you have and generate folders yourself, with occasional errors and loss of time/effort, or 2. get frustrated with the aforementioned losses and just always create your own folders. |
| Code Block | ||||
|---|---|---|---|---|
| ||||
mkdir -p Trim_Reads Trim_Logs
sbatch trim.slurm |
The job should take less than 10 minutes once it starts if everything is working correctly; the showq -u command can be used to check whether the job has finished.
Running on idev
The alternative to running all the commands as a submitted job is of course to run them on an idev node. In the multiqc tutorial, we were able to tell fastqc to analyze all samples just by using the "*" wildcard, as the only required input to fastqc is the file name. Here our command is much more intricate, which may seem to preclude running interactively, since we would never type 272 nearly identical commands. Obviously there is a trick for this.
| Tip | |||||||
|---|---|---|---|---|---|---|---|
| |||||||
Thus far, every program we have run, from ls to fastp, has been able to run (and to autocomplete using the tab key) specifically because it is in our $PATH variable. The .bashrc file you worked with on Monday modifies this slightly to give access to some non-standard locations, and any use of the module system automatically adds the relevant commands to your $PATH. The real star here, and the reason this is something you have likely not had to pay much attention to (unlike in previous years), is conda. Every time you activate a conda environment, conda modifies your $PATH variable to give you access to the programs associated with that environment without costing you access to all the other basic programs/commands you need.
|
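You can see this for yourself. With your environment active, which shows where the fastp executable lives (it should point into your conda environment's bin directory), and splitting $PATH onto separate lines shows that same directory near the front:

| Code Block | ||||
|---|---|---|---|---|
| ||||
which fastp
echo $PATH | tr ':' '\n' | head |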
Before we jump to making our commands file executable and executing it, we want to change it slightly. Above we used -w 4 to specify that we wanted 4 processors for every command. That worked great when we were also launching 17 processes at the same time, as it used all 68 threads, but when executing a commands file from the command line without the help of the queue system, only 1 sample at a time will run, so you might think we should increase to 68 processors. There are two ways to make that change:
| Code Block | ||||
|---|---|---|---|---|
| ||||
#1 change the for loop:
for R1 in Raw_Reads/*_R1_001.fastq.gz; do R2=$(echo $R1| sed 's/_R1_/_R2_/'); name=$(echo $R1|sed 's/_R1_001.fastq.gz//'|sed 's/Raw_Reads\///'); echo "fastp -i $R1 -I $R2 -o Trim_Reads/$name.trim.R1.fastq.gz -O Trim_Reads/$name.trim.R2.fastq.gz -w 68 --detect_adapter_for_pe -j Trim_Logs/$name.json -h Trim_Logs/$name.html &> Trim_Logs/$name.log.txt";done > fastp.commands
#2 use sed to do an in file replacement (something new you haven't seen before in this class)
sed -i 's/ -w 4 / -w 68 /g' fastp.commands |
Note that if you use the sed command above, you need to be very careful about what you choose to match. If you just matched "4" and replaced it with "68", the commands file would also change any file name containing a 4 into one containing 68, and all of those samples would fail. When using sed to do replacements, always make sure you have a unique handle to match (here, the surrounding spaces and the -w flag).
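If you want to preview what the sed replacement will change before editing the file in place, an optional sanity check is to run it without -i and diff the result against the original:

| Code Block | ||||
|---|---|---|---|---|
| ||||
sed 's/ -w 4 / -w 68 /g' fastp.commands | diff fastp.commands - | head |

Once you are satisfied with the contents of fastp.commands, make it executable and run it: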
| Code Block | ||||
|---|---|---|---|---|
| ||||
mkdir -p Trim_Reads Trim_Logs
chmod +x fastp.commands
./fastp.commands |
Once the command is started continue reading below.
Comparing different run options
In previous years it has been common for students to ask what the fastest way of getting a large set of samples analyzed is with respect to threads, nodes, and tasks. Here we have an opportunity to do just that, with some surprising results. Since we have been working with idev sessions all along, we'll start with the following:
| run type | processors (-w) | time |
|---|---|---|
| idev | -w 68 | 16 minutes |
| idev | -w 4 | 14 minutes |
This brings us to our first question: how can using fewer processors speed things up? I am not completely sure, but I do have some hypotheses, and expect the answer to be somewhere in the middle:
- We have so few reads (tens of thousands) that the computation required to split them into 68 different groups, read them, modify them, write them back to the growing trimmed files, and perform statistics on all of them (the overhead) actually exceeds whatever speed bonus is available from the extra threads.
If you used this command and looked in the .log.txt file that was generated thanks to our "&>" redirect, you will find an interesting note:
| No Format |
|---|
WARNING: fastp uses up to 16 threads although you specified 68 |
This means that despite requesting that fastp use 68 threads, it is only capable of using (and only used) 16, so the difference in the number of threads actually in play is even smaller than it appears. Unfortunately this does not seem to be documented on the fastp website or in the help information.
Given that threads only affect the speed of a single sample, I also attempted to trim all the read files at once by telling the idev node to run them in the background (this was done by adding a single & as the last character on each line of the commands file).
| run type | processors (-w) | time |
|---|---|---|
| idev background | -w 68 | NA error after 54 samples finished |
| idev background | -w 1 | NA error after 148 samples finished |
While both of these runs errored out after ~1/5 and ~1/2 of the total samples respectively, making them a non-viable single-pass method, the samples that did finish did so in <10 seconds. Using grep we could generate a list of samples which did finish, store those in a file, then use another grep command with the -v and -f options to remove the finished samples from the commands file. This can be very helpful if you have a large number of samples and your run timed out, as you can focus on just the samples that have not yet finished instead of having to rerun all samples or manually edit the commands file. Since this run failed due to lack of memory, not lack of time, it is less reasonable to try to tackle it in this manner.
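As a sketch of that rescue approach (not needed for this tutorial's run, and assuming the per-sample .log.txt files created by our commands end with fastp's "time used" line, as shown in the Evaluating the output section below), the finished samples could be filtered out of the commands file like this:

| Code Block | ||||
|---|---|---|---|---|
| ||||
# list the base names of samples whose logs contain fastp's closing line
grep -l "time used:" Trim_Logs/*.log.txt | sed 's/.*\///; s/\.log\.txt//' > finished_samples.txt
# keep only the commands whose sample name is not in that list
grep -v -f finished_samples.txt fastp.commands > fastp.commands.remaining
wc -l fastp.commands.remaining |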
That brings us to our last set of runs: those working with the commands as a submitted job.
| run type | processors (-w), jobs at once (-n) | time |
|---|---|---|
| sbatch | -w 4 -n 17 | 2 minutes |
| sbatch | -w 1 -n 68 | 2 minutes |
| sbatch | -w 68 -n 1 | 17 minutes |
Based on what we have already seen, it is probably not surprising that using 68 (really 16) threads and only evaluating 1 sample at a time took approximately the same amount of time as it did when running on an idev node, as those conditions are functionally equivalent. What may be surprising is the lack of improvement despite running 4x more samples at the same time. Again, some hypotheses:
- The amount of overhead the job launcher has in checking when a command finishes and starting the next command is similar to the amount of overhead in splitting reads into 4 sets
- The biggest contributor to time is not something that can be improved with additional threads
| Info |
|---|
The takeaway here is that "should I use more threads per command, or launch more simultaneous commands?" is not a simple thing to answer. Perhaps as you become more familiar with a particular command you can tease these apart, but more often than not you will likely find yourself balancing the two, launching jobs with 4-16 threads each and as many simultaneous commands as the 68 available threads can accommodate. |
Evaluating the output
The first thing you always want to check when working with a lot of commands simultaneously is whether they all finished correctly (note above how running all 272 jobs in the background at basically the same time did not finish correctly). Typically this is where the log files we generate with "&>" come in handy. If you look at the tail of any one of the files you are likely to see something like:
| No Format |
|---|
Duplication rate: 6.5207%
Insert size peak (evaluated by paired-end reads): 80
JSON report: Trim_Logs/E2-1_S189_L001.json
HTML report: Trim_Logs/E2-1_S189_L001.html
fastp -i Raw_Reads/E2-1_S189_L001_R1_001.fastq.gz -I Raw_Reads/E2-1_S189_L001_R2_001.fastq.gz -o Trim_Reads/E2-1_S189_L001.trim.R1.fastq.gz -O Trim_Reads/E2-1_S189_L001.trim.R2.fastq.gz -w 68 --detect_adapter_for_pe -j Trim_Logs/E2-1_S189_L001.json -h Trim_Logs/E2-1_S189_L001.html
fastp v0.23.2, time used: 3 seconds |
Typically what you should look for is some kind of anchor that you can pass to grep that is as far down the file as possible. Sometimes you will be lucky and the program will actually print something like "successfully completed". In our case the last line looks promising: "fastp v0.23.2, time used:" seems likely to be printed as the last step in the program.
| Code Block | ||||
|---|---|---|---|---|
| ||||
ls Trim_Logs/*.log.txt|wc -l
wc -l fastp.commands
tail -n 1 Trim_Logs/*.log.txt|grep -c "^fastp v0.23.2, time used:" |
The above 3 commands are all expected to return 272. If they don't, remember I'm on Zoom if you need help figuring out what's going on.
Beyond the job finishing successfully, the best way to evaluate this data would be to actually use it. That could mean running it through multiqc to see how the trimming has affected overall quality, assembling the reads, or mapping them to a reference (not applicable in this case since the samples are from a variety of sources and you have not been provided reference files).
Optional next steps:
- Consider moving back over to the multiqc tutorial and using multiqc to determine how well the trimming worked (see the sketch after this list).
- The reads could then further be assembled in the Spades tutorial as they are all plasmid samples.
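For the first option, since multiqc was included in the conda create command at the top of this tutorial, a minimal sketch is to point multiqc at the folder holding the fastp json/log files; the output directory name here is just an example:

| Code Block | ||||
|---|---|---|---|---|
| ||||
multiqc Trim_Logs/ -o multiqc_fastp_trimmed |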