...

Code Block
languagebash
titlefor loop to generate all trim commands needed
for R1 in Raw_Reads/*_R1_001.fastq.gz; do R2=$(echo $R1| sed 's/_R1_/_R2_/'); name=$(echo $R1|sed 's/_R1_001.fastq.gz//'|sed 's/Raw_Reads\///'); echo "fastp -i $R1 -I $R2 -o Trim_Reads/$name.trim.R1.fastq.gz -O Trim_Reads/$name.trim.R2.fastq.gz -w 4 --detect_adapter_for_pe -j Trim_Logs/$name.json -h Trim_Logs/$name.html &> Trim_Logs/$name.log.txt";done > fastp.commands

Use the head and wc -l commands to see what the output looks like and how much of it there is, respectively.
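
For example (assuming you wrote the commands to fastp.commands as above):

Code Block
languagebash
titleinspect the generated commands file
head fastp.commands
wc -l fastp.commands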


Tip
titleCheat Sheet

The "&>" redirect is one of the most important operators you can learn when working with multiple samples and running commands on a distributed network.

In addition to seeing 272 variations on the same command we ran above, you also see that each command ends with "&> Trim_Logs/sample.log.txt". This says: take all of the errors (the &) and all of the standard output (the >) that would normally print to the screen and instead put them in the file named next. As we will discuss on Friday, when we submit jobs to the queue we don't get to see those jobs run, we only get to see the final result. By using "&>" we store that screen output in a file so we can look at it later. For some programs (FastQC) this will only help with evaluating things if something goes wrong, but in the case of fastp (and others) there is actual data printed to the screen that is informative.
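
As a minimal illustration of the redirect itself (using ls and a deliberately nonexistent path instead of fastp), both streams end up in the file:

Code Block
languagebash
titleminimal "&>" example
# stdout (the listing of /etc) and stderr (the error about /nonexistent)
# both go into combined.log.txt instead of printing to the screen
ls /etc /nonexistent &> combined.log.txt
cat combined.log.txt

# the equivalent long form redirect:
ls /etc /nonexistent > combined.log.txt 2>&1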

Again, as we discussed in the multiqc tutorial, running this number of commands is a borderline case between being better to run in idev or as a submitted job: there are not a lot of total reads, but there are a large number of samples. Because our command requests additional processors we should not run on the head node, and if we wanted to use only a single processor, the job would take so long to run on the head node that even waiting in the queue system would be faster.

Do only one of the following, but do read through both options, as each includes a different discussion of the process.

Submitting as a job

Code Block
languagebash
titleModify your slurm file
cp /corral-repl/utexas/BioITeam/gva_course/GVA2022.launcher.slurm trim.slurm
nano trim.slurm

Again, while in nano you will edit most of the same lines you edited in the breseq tutorial. Note that most of these lines have additional text to the right. This commented text is there to remind you what goes on each line; leaving it alone will not hurt anything, while removing it may make it harder to remember the purpose of the line.

Line number | As is | To be
16 | #SBATCH -J jobName | #SBATCH -J multi_fastp
17 | #SBATCH -n 1 | #SBATCH -n 17
21 | #SBATCH -t 12:00:00 | #SBATCH -t 0:20:00
22 | ##SBATCH --mail-user=ADD | #SBATCH --mail-user=<YourEmailAddress>
23 | ##SBATCH --mail-type=all | #SBATCH --mail-type=all
29 | export LAUNCHER_JOB_FILE=commands | export LAUNCHER_JOB_FILE=fastp.commands


Line 17 being set to -n 17 allows 17 jobs to run at the same time; since each command uses -w 4 (4 threads), the job will use all 68 threads available. The changes to lines 22 and 23 are optional but will give you an idea of what types of email you can expect from TACC if you choose to use these options. Just be sure those 2 lines start with a single # symbol after editing them.


Again use ctrl-o and ctrl-x to save the file and exit.

Expand
titleAs discussed elsewhere, some programs can generate a new folder for output if you specify both a folder location and a file name, while others can't (expand for more information). In this case neither fastp nor native bash (&>) can create directories, so you will need to do so before you submit your job.

Creating your own empty folder before running a command will never cause a problem, but not creating one can cause problems if the program can't create it for you. As you work with these programs more and more, you will either: 1. get a feel for which type of program you have and generate folders yourself, with occasional errors and lost time/effort, or 2. get frustrated with those losses and just always create your own folders.
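
One defensive habit worth knowing (not required here) is mkdir's -p flag, which creates a folder if it is missing and silently does nothing if it already exists, so it is always safe to run:

Code Block
languagebash
titlemkdir -p is safe to run repeatedly
mkdir -p Trim_Reads Trim_Logs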


Code Block
languagebash
titlemake the output folders and submit the job to the queue
mkdir Trim_Reads Trim_Logs
sbatch trim.slurm

The job should take less than 10 minutes once it starts if everything is working correctly. The showq -u command can be used to check whether the job has finished.
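
As a reminder:

Code Block
languagebash
titlecheck on your queued job
showq -u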


Running on idev

The alternative to running all the commands as a submitted job is of course to run them on an idev node. In the multiqc tutorial, we were able to tell fastqc to analyze all samples just by using the "*" wildcard, as the only required input to fastqc is the file name. Here our command is much more intricate, which may seem to preclude running interactively, as nobody would type 272 nearly identical commands. Obviously there is a trick for this.

Tip
titleCheat Sheet: chmod +x and the $PATH variable

Thus far, all the programs we have run, from ls to fastp, have been able to run (and to autocomplete using the tab key) specifically because they are in our $PATH variable. The .bashrc file you worked with on Monday modifies this slightly to give access to some non-standard locations, and any use of the module system automatically adds the relevant commands to your $PATH. The real star here, and the reason this is something you have likely not had to pay much attention to (unlike in previous years), is conda. Every time you activate a conda environment, conda modifies your $PATH variable to give you access to the programs associated with that environment without costing you access to all the other basic programs/commands you need.
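
You can watch this happen yourself; "GVA2022" below stands in for whatever your environment is actually named (a hypothetical name):

Code Block
languagebash
titlewatch conda modify $PATH
echo $PATH                # note the list of directories
conda activate GVA2022    # substitute your actual environment name
echo $PATH                # the environment's bin directory now appears at the front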

At the same time, that means a program you install in a random location on $SCRATCH can't be run, because it is not in your path. You are left with 2 options: 1. modify your path so the computer sees your new program when it looks through its list of commands, or 2. specifically tell the computer the location of the file you want to run. We can take advantage of the 2nd option to run our large list of commands in 2 steps: 1. giving our commands file execution permissions, and 2. telling the command line to execute our file of commands.

Code Block
languagebash
titleGeneric steps to run a list of commands
chmod +x <filename>
./<filename>
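
For completeness, option 1 (modifying your path) can be as simple as the following; the directory name here is hypothetical:

Code Block
languagebash
titleoption 1: add a directory to your $PATH (hypothetical location)
export PATH=$PATH:$SCRATCH/my_programs    # programs in this folder can now be run by name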


Before we jump to making our commands file executable and executing it, we want to change it slightly. Above we used -w 4 to specify that every command should use 4 threads. This worked great when we were also launching 17 processes at the same time, as it used all 68 threads; but when executing a commands file from the command line without the help of the queue system, only 1 sample at a time will launch, so you might think we need to increase each command to 68 threads.

Code Block
languagebash
title2 different ways to increase to 68 threads
#1 change the for loop:
for R1 in Raw_Reads/*_R1_001.fastq.gz; do R2=$(echo $R1| sed 's/_R1_/_R2_/'); name=$(echo $R1|sed 's/_R1_001.fastq.gz//'|sed 's/Raw_Reads\///'); echo "fastp -i $R1 -I $R2 -o Trim_Reads/$name.trim.R1.fastq.gz -O Trim_Reads/$name.trim.R2.fastq.gz -w 68 --detect_adapter_for_pe -j Trim_Logs/$name.json -h Trim_Logs/$name.html &> Trim_Logs/$name.log.txt";done > fastp.commands

#2 use sed to do an in-file replacement (something new you haven't seen before in this class):
sed -i 's/-w 4/-w 68/' fastp.commands


These were rough timings observed for the different approaches:

Approach | Result
idev, -w 4 | 14 minutes
idev, -w 68 | 16 minutes
idev, -w 68, with & (backgrounded) | error after 54 samples, <1 minute
idev, -w 1, with & (backgrounded) | error after 148 samples, <1 minute
submitted job, -w 4, -n 17 | ~2 minutes
submitted job, -w 68, -n 1 | ? minutes
submitted job, -w 1, -n 68 | ? minutes
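
If you want to generate numbers like these yourself, bash's built-in time can wrap the executable commands file (run from within an idev session, after making the file executable as described above):

Code Block
languagebash
titletime the full run of commands
time ./fastp.commands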

Evaluating the output

Code Block
languagebash
titleevaluate the output
ls Trim_Reads | wc -l
grep -c "100.00%" Queue_job.o*

...