Page Comparison

...

Expand

title	Setup (if needed)

Code Block

language	bash
title	Set up directory for working with FASTQs

export CORENGS=/work/projects/BioITeam/projects/courses/Core_NGS_Tools

# Create a $SCRATCH area to work on data for this course,
# with a sub-direct[1ory for pre-processing raw fastq files
mkdir -p $SCRATCH/core_ngs/fastq_prep

# Make a symbolic links to the original yeast data:
cd $SCRATCH/core_ngs/fastq_prep
ln -s -fsf $CORENGS/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz
ln -s -fsf $CORENGS/yeast_stuff/Sample_Yeast_L005_R2.cat.fastq.gz

...

Expand

title	Answer

These are 101-base reads. wc -c counts the "invisible" newline character, so subtract 1 from the character count it returns for a line.

Here's a way to strip the trailing newline characters from the quality scores string before calling wc -c to count the characters. We use the echo -n option that tells echo not to include the trailing newline in its output. We gemerate generate that text using sub-shell evaluation (an alternative to backtick evaluation) of that zcat ... command:

Code Block

language	bash

echo -n $( zcat Sample_H54_miRNA_L004_R1.cat.fastq.gz | head -2 | tail -1 ) | wc -c

...

Here's one of those cases where you have to be careful about separating standard output and standard error.
- cutadapt writes its FASTQ output to standard output by default, and writes summary information to standard error.
In this command we first redirect standard error to a log file named miRNA_test.cuta.log using the 2> syntax, in order to capture cutadapt diagnostics.
- Then the remaining standard output is piped to gzip, whose output is the redirected to a new compressed FASTQ file. (Read more about Standard streams and redirection)

...

Expand

title	Interpreting cutadapt I/O options

The cutadapt --help output describes its usage as follows:

cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq

From this we see that the input.fastq is a required argument. Clearly, it can be a FASTQ file, and it can be compressed based on this help:

Compressed input and output is supported andauto-detected from the file name (.gz, .xz, .bz2).Use the file name '-' for standard input/output.

And this says that input reads can also be provided on standard input, if that argument is a hyphen ( - ). So input data can come:

from a file named as an argument:
- cutadapt -a CGTAATTCGCG -o small.trim.fq small.fq
- and that input.fastq file can be provided in one of three compression formats
from standard input if the input.fastq argument is replaced with a dash ( - )
- cat small.fq | cutadapt -a CGTAATTCGCG -o small.trim.fq -

What about cutadapt output (the trimmed reads)? The brackets around the usage -o option indicate that the resulting trimmed FASTQ can be written to a file, but is not by default. This implies that cutadapt by default writes its results to standard output. So output can go

to a file, using the -o option
- cutadapt -a CGTAATTCGCG -o small.trim.fq small.fq
to standard output without the -o option
- cutadapt -a CGTAATTCGCG small.fq 1> small.trim.fq

Finally, as we've seen, cutadapt also writes diagnostic output. Where does it go? The usage line doesn't say anything about diagnostics explicitly. But in the Output section of cutadapt --help:

-o FILE, --output=FILE Write trimmed reads to FILE. FASTQ or FASTA format is chosen depending on input. The summary report is sent to standard output. Use '{name}' in FILE to demultiplex reads into multiple files. Default: write to standard output

Careful reading of this suggests that:

When the trimmed output is sent to a file with the -o output.fastq option,
- diagnostics are written to standard output
  - so can be redirected to a log file with 1> small.trim.log
- cutadapt -a CGTAATTCGCG -o small.trim.fq small.fq 1> small.trim.log
But when the -o option is omitted, and output goes to standard output,
- diagnostics must be written to standard error
  - so can be redirected to a log file with 2> trim.log
- cutadapt -a CGTAATTCGCG small.fq 1> small.trim.fq 2> small.trim.log

...

Code Block

language	bash
title	Setup for cutadapt

mkdir -p $SCRATCH/core_ngs/cutadapt
cd $SCRATCH/core_ngs/cutadapt
cp $CORENGS/human_stuff/Sample_H54_miRNA_L004_R1.cat.fastq.gz .
cp $CORENGS/human_stuff/Sample_H54_miRNA_L005_R1.cat.fastq.gz .
cp $CORENGS/custom_tracksalignment/Yeast_RNAseq_L002_R1.fastq.gz .
cp $CORENGS/custom_tracksalignment/Yeast_RNAseq_L002_R2.fastq.gz .

...

Tip

title	Paired-end RNA fastq trimming script

The BioITeam has an a number of useful NGS scripts that can be executed by anyone on ls6. or stampede2 stampede3. They are located in the /work/projects/BioITeam/common/script/ directory.

For groups that participate in BRCF pods, the scripts are available in /mnt/bioi/script on any compute server.

...

Or use this "cat to MARKER" trick, also known as an heredoc. The MARKER tag can be anything; below it is EOL.

...

Code Block

language	bash
title	Create and submit cutadapt batch script

cd $SCRATCH/core_ngs/cutadapt
launcher_creator.py -j cuta.cmds -n cuta -t 01:00:00 -a OTH21164 -q normal 
sbatch --reservation=CoreNGS-Wed cuta.slurm
showq -u

# or, if you're not on the reservation:
launcher_creator.py -j cuta.cmds -n cuta -t 01:00:00 -a OTH21164 -q development
sbatch cuta.slurm
showq -u

...

H54_miRNA_L004.cuta.log, H54_miRNA_L005.cuta.log, yeast_rnaseq.cuta.log
- these are the main execution log files, one for each trim_adapters.sh command
H54_miRNA_L004.acut.pass0.log, H54_miRNA_L005.acut.pass0.log
- these are cutadapt statistics files for the single-end adapter trimming
- their contents will look like our small example above
yeast_rnaseq.acut.pass1.log, yeast_rnaseq.acut.pass2.log
- these are cutadapt statistics files from paired-end trimming of the R1 and R2 adapters, respectively.

...

Expand

title	Hint

Code Block

language	bash

echo "$((`zcat yeast_rnaseq_R1.cuta.fastq.gz | wc -l` / 4))"
zcat yeast_rnaseq_R2.cuta.fastq.gz | wc -l | awk '{printf("%d\n", $1/4)}'

For more on printf, which is available in most programming languages, see see https://en.wikipedia.org/wiki/Printf#Format_specifier or https://alvinalexander.com/programming/printf-format-cheat-sheet/

...

Versions Compared

Old Version 142

New Version 143

Key