Page Comparison

...

Tip

title	Reservations

Use our summer school reservation (CoreNGS-Wed) when submitting batch jobs to get higher priority on the ls6 normal queue today.

Code Block

language	bash
title	Request an interactive (idev) node

# Request a 180 minute interactive node on the normal queue using our reservation
idev -m 120 -N 1 -A OTH21164 -r CoreNGS
idev -m 120 -N 1 -A TRA23004 -r CoreNGS

# Request a 120 minute idev node on the development queue 
idev -m 120 -N 1 -A OTH21164 -p development
idev -m 120 -N 1 -A TRA23004 -p development

Code Block

language	bash
title	Submit a batch job

# Using our reservation
sbatch --reseservation=CoreNGS <batch_file>.slurm

Note that the reservation name (CoreNGS-Wed) is different from the TACC allocation/project for this class, which is OTH21164.

...

You can't run a web browser directly from your "dumb terminal" command line environment. The FastQC results have to be placed where a web browser can access them. One way to do this is to copy the results back to your laptop, for example by using scp from your computer (read more at Copying files from TACC to your laptop).

For convenience, we put an example FastQC report at this URL:
https://web.corral.tacc.utexas.edu/BioinformaticsResource/CoreNGS/yeast_stuff/Sample_Yeast_L005_R1.cat_fastqc/fastqc_report.html

...

Expand

title	Answer

The Per base sequence quality report does not look good. The data should probably be trimmed (to 40 or 50 bp) before alignment.

...

.

...

Using MultiQC to consolidate multiple QC reports

...

Expand

title	Make sure you're in a idev session

Make sure you're in an idev session. If you're in an idev session, the hostname command will display a name like c455-020.ls6.tacc.utexas.edu. But if you're on a login node the hostname will be something like login1.ls6.tacc.utexas.edu.

If you're on a login node, start an idev session like this:

Code Block

language	bash
title	Start an idev session

idev -m 120 -N 1 -A OTH21164 -r CoreNGSday3  # or -A TRA23004
# or
idev -m 120 -N 1 -A OTH21164 -p development  # or -A TRA23004

Code Block

language	bash

# Load the main BioContainers module if you have not already
module load biocontainers  # may take a while

# Load the multiqc module and ask for its usage information
module load multiqc
multiqc --help | more

...

Expand

title	Setup (if needed)

Code Block

language	bash
title	Set up directory for working with FASTQs

export CORENGS=/work/projects/BioITeam/projects/courses/Core_NGS_Tools

# Create a $SCRATCH area to work on data for this course,
# with a sub-direct[1ory for pre-processing raw fastq files
mkdir -p $SCRATCH/core_ngs/fastq_prep

# Make a symbolic links to the original yeast data:
cd $SCRATCH/core_ngs/fastq_prep
ln -s -fsf $CORENGS/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz
ln -s -fsf $CORENGS/yeast_stuff/Sample_Yeast_L005_R2.cat.fastq.gz

...

The -l 50 option says that base 50 should be the last base (i.e., trim down to 50 bases)
The -Q 33 option specifies how base Qualities on the 4th line of each FASTQ entry are encoded.
- The FASTX Toolkit is an older program written in the time when Illumina base qualities were encoded differently, so its default does not work for modern FASTQ files.
  - These days Illumina base qualities follow the Sanger FASTQ standard (Phred score + 33 to make an ASCII character).
- This option is not really required here because we're just hard trimming, so the program doesn't have to interpret the quality scores. But the -Q 33 option would be required if you were trimming according to base qualities.
- Note that the fastq_trimmer help does not document this -Q option! But they do talk about it on their website.

Exercise: compressing fastx_trimmer output

...

Expand

title	Hint

Code Block

language	bash

echo $(( `zcat Sample_H54_miRNA_L004_R1.cat.fastq.gz | wc -l` / 4 ))
# or
zcat Sample_H54_miRNA_L004_R1.cat.fastq.gz | wc -l | awk '{print $1 / 4}'

Read more about Arithemetic in bash and more about awk in Some Linux commands: awk

Expand

title	Answer

Looking at the FASTQ file names, we see this is two lanes of single-end reads (L004 and L005).

The data from lane 4 has 2,001,337 reads, the data from lane 5 has 2,022,237 reads.

...

Expand

title	Answer

These are 101-base reads.

wc -c counts the "invisible" newline character, so subtract 1 from the character count it returns for a line.

Here's a way to strip the trailing newline characters from the quality scores string before calling wc -c to count the characters. We use the echo -n option that tells echo not to include the trailing newline in its output. We gemerate generate that text using sub-shell evaluation (an alternative to backtick evaluation) of that zcat ... command:

Code Block

language	bash

echo -n $( zcat Sample_H54_miRNA_L004_R1.cat.fastq.gz | head -2 | tail -1 ) | wc -c

...

Code Block

language	bash

# Remember, FASTQ files have 4 lines per read
zcat Sample_H54_miRNA_L004_R1.cat.fastq.gz | head -2000 > miRNA_test.fq

Now execute cutadapt like this: cutadapt like this. Note that the backslash ( \ ) here is just a line continuation character so that we can split a long command onto multiple lines to make it more readable.

Expand

title	Setup (if needed)

Code Block

language	bash
title	Setup for cutadapt on miRNA FASTQ

export CORENGS=/work/projects/BioITeam/projects/courses/Core_NGS_Tools
mkdir -p $SCRATCH/core_ngs/fastq_prep
cd $SCRATCH/core_ngs/fastq_prep
cp $CORENGS/human_stuff/miRNA_test.fq .

...

Here's one of those cases where you have to be careful about separating standard output and standard error.
- cutadapt writes its FASTQ output to standard output by default, and writes summary information to standard error.
In this command we first redirect standard error to a log file named miRNA_test.cuta.log using the 2> syntax, in order to capture cutadapt diagnostics.
- Then the remaining standard output is piped to gzip, whose output is the redirected to a new compressed FASTQ file. (Read more about Standard streams and redirection)

...

Expand

title	Interpreting cutadapt I/O options

The cutadapt --help output describes its usage as follows:

cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq

From this we see that the input.fastq is a required argument. Clearly, it can be a FASTQ file, and it can be compressed based on this help:

Compressed input and output is supported andauto-detected from the file name (.gz, .xz, .bz2).Use the file name '-' for standard input/output.

And this says that input reads can also be provided on standard input, if that argument is a hyphen ( - ). So input data can come:

from a file named as an argument:
- cutadapt -a CGTAATTCGCG -o small.trim.fq small.fq
- and that input.fastq file can be provided in one of three compression formats
from standard input if the input.fastq argument is replaced with a dash ( - )
- cat small.fq | cutadapt -a CGTAATTCGCG -o small.trim.fq -

What about cutadapt output (the trimmed reads)? The brackets around the usage -o option indicate that the resulting trimmed FASTQ can be written to a file, but is not by default. This implies that cutadapt by default writes its results to standard output. So output can go

to a file, using the -o option
- cutadapt -a CGTAATTCGCG -o small.trim.fq small.fq
to standard output without the -o option
- cutadapt -a CGTAATTCGCG small.fq 1> small.trim.fq

Finally, as we've seen, cutadapt also writes diagnostic output. Where does it go? The usage line doesn't say anything about diagnostics explicitly. But in the Output section of cutadapt --help:

-o FILE, --output=FILE Write trimmed reads to FILE. FASTQ or FASTA format is chosen depending on input. The summary report is sent to standard output. Use '{name}' in FILE to demultiplex reads into multiple files. Default: write to standard output

Careful reading of this suggests that:

When the trimmed output is sent to a file with the -o output.fastq option,
- diagnostics are written to standard output
  - so can be redirected to a log file with 1> small.trim.log
- cutadapt -a CGTAATTCGCG -o small.trim.fq small.fq 1> small.trim.log
But when the -o option is omitted, and output goes to standard output,
- diagnostics must be written to standard error
  - so can be redirected to a log file with 2> trim.log
- cutadapt -a CGTAATTCGCG small.fq 1> small.trim.fq 2> small.trim.log

...

Now we're going to run cutadapt on the larger FASTQ files, and also perform paired-end adapter trimming on some yeast paired-end RNA-seq data.

Since batch jobs can't be submitted from an idev session, make sure you are back on a login node (just exit the idev session).

First stage the 4 FASTQ files we will work on:

Code Block

language	bash
title	Setup for cutadapt

mkdir -p $SCRATCH/core_ngs/cutadapt
cd $SCRATCH/core_ngs/cutadapt
cp $CORENGS/human_stuff/Sample_H54_miRNA_L004_R1.cat.fastq.gz .
cp $CORENGS/human_stuff/Sample_H54_miRNA_L005_R1.cat.fastq.gz .
cp $CORENGS/custom_tracksalignment/Yeast_RNAseq_L002_R1.fastq.gz .
cp $CORENGS/custom_tracksalignment/Yeast_RNAseq_L002_R2.fastq.gz .

...

Tip

title	Paired-end RNA fastq trimming script

The BioITeam has an a number of useful NGS scripts that can be executed by anyone on ls6. or stampede2 stampede3. They are located in the /work/projects/BioITeam/common/script/ directory.

For groups that participate in BRCF pods, the scripts are available in /mnt/bioi/script on any compute server.

...

Or use this "cat to MARKER" trick, also known as an heredoc. The MARKER tag can be anything; below it is EOL.

...

Next create a batch submission script for your job and submit it to the normal queue with a maximum run time of 2 hours.Since batch jobs can't be submitted from an idev session, make sure you are back on a login node (just exit the idev session)1 hour.

Code Block

language	bash
title	Create and submit cutadapt batch script

cd $SCRATCH/core_ngs/cutadapt
launcher_creator.py -j cuta.cmds -n cuta -t 01:00:00 -a OTH21164 -q normal 
sbatch --reservation=CoreNGS-Wed cuta.slurm
showq -u

# or, if you're not on the reservation:
launcher_creator.py -j cuta.cmds -n cuta -t 01:00:00 -a OTH21164 -q development
sbatch cuta.slurm
showq -u

...

H54_miRNA_L004.cuta.log, H54_miRNA_L005.cuta.log, yeast_rnaseq.cuta.log
- these are the main execution log files, one for each trim_adapters.sh command
H54_miRNA_L004.acut.pass0.log, H54_miRNA_L005.acut.pass0.log
- these are cutadapt statistics files for the single-end adapter trimming
- their contents will look like our small example above
yeast_rnaseq.acut.pass1.log, yeast_rnaseq.acut.pass2.log
- these are cutadapt statistics files from paired-end trimming of the R1 and R2 adapters, respectively.

...

Expand

title	Hint

Code Block

language	bash

echo "$((`zcat yeast_rnaseq_R1.cuta.fastq.gz | wc -l` / 4))"
zcat yeast_rnaseq_R2.cuta.fastq.gz | wc -l | awk '{printf("%d\n", $1/4)}'

For more on printf, which is available in most programming languages, see see https://en.wikipedia.org/wiki/Printf#Format_specifier or https://alvinalexander.com/programming/printf-format-cheat-sheet/

...

Versions Compared

Old Version 142

New Version Current

Key

Using MultiQC to consolidate multiple QC reports