...
In previous years it has been common to ask what the fastest way is to get a large set of samples analyzed with respect to threads, nodes, and tasks. Here we have an opportunity to do just that, and have some surprising results. Since we have been working with idev sessions all along, we'll start with the following:
run type | processors (-w) | time |
---|---|---|
idev | -w 68 | 16 minutes |
idev | -w 4 | 14 minutes |
This brings us to our first question: how can using fewer processors speed things up? I am not completely sure, but I do have some hypotheses, and expect the answer to be somewhere in the middle:
...
Given that threads only affect the speed of a single sample, I also attempted to trim all the read files at once by telling the idev node to run them in the background (this was done by adding a single & as the last character on each line of the commands file). A sketch of one way to do that is shown below.
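Rather than editing every line by hand, sed can append the & for you. This is just a minimal sketch: it assumes your commands file is named trim_commands (the name used later in this tutorial) and writes a backgrounded copy rather than editing the original.
Code Block |
---|
# append " &" to every line so each command runs in the background (assumes the file is named trim_commands)
sed 's/$/ \&/' trim_commands > trim_commands_background |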
run type | processors (-w) | time |
---|---|---|
idev background | -w 68 | NA error after 54 samples finished |
idev background | -w 1 | NA error after 148 samples finished |
While both of these errored after only ~1/5 and ~1/2 of the total samples had finished (making them a non-viable single-pass method), the samples that did finish did so in <10 seconds. Using grep we could generate a list of samples which did finish, store those in a file, then use another grep command with the -v and -f options to remove the finished samples from the analysis. This can be very helpful if you have a large number of samples and your run timed out, as you can focus on just the samples that have not yet finished instead of having to rerun all samples or manually edit the commands file to remove the samples which finished. Since this run failed due to lack of memory, not lack of time, it's less reasonable to try to tackle it in this manner.
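As an illustration of that grep trick, here is a minimal sketch. The file names are assumptions: it expects per-sample logs like the trimLogs/<sample>.log files created by the loop in the Personal experience section below, a success string like the "TrimmomaticPE: Completed successfully" line used elsewhere in this tutorial, and a commands file named trim_commands.
Code Block |
---|
# list the samples whose logs report success (assumes trimLogs/<sample>.log naming)
grep -l "TrimmomaticPE: Completed successfully" trimLogs/*.log | sed 's/.*\///; s/\.log//' > finished_samples
# keep only the commands whose sample name is NOT in the finished list
grep -v -f finished_samples trim_commands > remaining_commands |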
That brings us to our last set of runs: those working with the commands as a submitted job.
run type | processors (-w), jobs at once (-n) | time |
---|---|---|
sbatch | -w 4 -n 17 | 2 minutes |
sbatch | -w 1 -n 68 | 2 minutes |
sbatch | -w 68 -n 1 | 17 minutes |
Based on what we have already seen, it is probably not surprising that using 68 (really 16) threads and only evaluating 1 sample at a time took approximately the same amount of time as it did when running on an idev node, as those conditions are functionally equivalent. What may be surprising is the lack of improvement despite running 4x more samples at the same time. Again, some hypotheses:
...
Info |
---|
The take away here is that "should I use more threads per command, or launch more simultaneous commands" is not a simple thing to answer. Perhaps as you become more familiar with a particular command you can tease these apart, but more often than not you will likely find yourself balancing the two, launching jobs with 4-16 threads each and as many simultaneous commands as the 68 available threads can accommodate. |
Evaluating the output
Let's look at how the trimming actually went. The first thing you always want to check when working with a lot of commands simultaneously is whether they all finished correctly (note above how running all 272 jobs in the background at basically the same time did not finish correctly). Typically this is where the log files we generate with "&>" come in handy:
Code Block |
---|
ls Trim_Reads | wc -l
grep -c "TrimmomaticPE: Completed successfully" Queue_job.o*
grep -c "100.00%" Queue_job.o* |
The above 3 commands are expected to return 1088, 272, and 0 respectively. If you see other answers, it suggests that something went wrong with the trimming command itself. If so, remember I'm on zoom if you need help looking at what's going on. The most common error that I expect will be that you ran out of RAM by trying to process too many samples at once (such as if you used a -n 68 option in your .slurm file).
Beyond the job finishing successfully, the best way to evaluate this data would be to move back to the multiqc tutorial and repeat the analysis that was done there on the raw files, this time on the trimmed files here.
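If you want a quick version of that without leaving this tutorial, something along these lines would work. This is only a sketch: it assumes fastqc and multiqc are available on your PATH, that the trimmed reads are in Trim_Reads/ as above, and the Trim_QC directory name is made up for this example.
Code Block |
---|
mkdir Trim_QC
# re-run fastqc on the trimmed reads, then aggregate the reports with multiqc
fastqc -o Trim_QC Trim_Reads/*.fastq.gz
multiqc -o Trim_QC Trim_QC |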
Personal experience
The for loop above focuses just on generating the trim commands. In my experience that is only half of the job; the other half is capturing the individual outputs so you can evaluate how many reads were trimmed in what way for each sample. Perhaps you will find this command helpful in your own work:
Code Block |
---|
mkdir trimLogs; mkdir Trim_Reads; for r1 in Raw_Reads/*_R1_*.fastq.gz; do r2=$(echo $r1|sed 's/_R1_/_R2_/'); name=$(echo $r1|sed 's/_R1_001.fastq.gz//'|sed 's/Raw_Reads\///'); echo "trimmomatic PE $r1 $r2 -baseout Trim_Reads/$name.fastq.gz ILLUMINACLIP:TruSeq3-PE-2.fa:4:30:10 MINLEN:30 >& trimLogs/$name.log"; done > trim_commands |
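Before submitting, it's worth a quick sanity check that the commands file came out the way you expect. This is just a suggested check, not part of the loop itself:
Code Block |
---|
# how many commands were generated, and what the first one looks like
wc -l trim_commands
head -n 1 trim_commands |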
After the job completes, the log files are useful for evaluating its success; if you look at the tail of any 1 of the files you are likely to see something like:
No Format |
---|
Duplication rate: 6.5207%
Insert size peak (evaluated by paired-end reads): 80
JSON report: Trim_Logs/E2-1_S189_L001.json
HTML report: Trim_Logs/E2-1_S189_L001.html
fastp -i Raw_Reads/E2-1_S189_L001_R1_001.fastq.gz -I Raw_Reads/E2-1_S189_L001_R2_001.fastq.gz -o Trim_Reads/E2-1_S189_L001.trim.R1.fastq.gz -O Trim_Reads/E2-1_S189_L001.trim.R2.fastq.gz -w 68 --detect_adapter_for_pe -j Trim_Logs/E2-1_S189_L001.json -h Trim_Logs/E2-1_S189_L001.html
fastp v0.23.2, time used: 3 seconds |
Typically what you should look for is some kind of anchor that you can pass to grep that is as far down the file as possible. Sometimes you will be lucky and the program will actually print something like "successfully completed". In our case the last line looks promising: "fastp v0.23.2, time used:" seems likely to be printed as the last step in the program.
Code Block |
---|
wc -l fastp.commands
ls Trim_Logs/*.log.txt | wc -l
tail -n 1 Trim_Logs/*.log.txt | grep -c "^fastp v0.23.2, time used:" |
This gives me a quick readout of whether all of the individual commands finished correctly.
...
The above 3 commands are all expected to return 272.
If they return something else, remember I'm on zoom if you need help looking at what's going on.
Beyond the job finishing successfully, the best way to evaluate this data would be to actually use it. That could mean running it through multiqc to see how the trimming has affected overall quality, or assembling the reads, or mapping them to a reference (not applicable in this case since they are from a variety of sources and you have not been provided reference files).
Optional next steps:
- Consider moving back over to the multiqc tutorial and using multiqc to determine how well the trimming worked.
- The reads could then be assembled in the Spades tutorial, as they are all plasmid samples; a rough example command is sketched below.
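For the assembly option, a command along these lines would be the starting point. This is only a sketch: the read files shown are one sample's trimmed pair using the fastp output naming from earlier in this tutorial, the output directory name is made up, and you should follow the Spades tutorial for the options it actually recommends.
Code Block |
---|
# plasmidSPAdes on a single sample's trimmed read pair (file and directory names are examples only)
spades.py --plasmid -1 Trim_Reads/E2-1_S189_L001.trim.R1.fastq.gz -2 Trim_Reads/E2-1_S189_L001.trim.R2.fastq.gz -o Spades_E2-1 |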
...