Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Expand
titleSetup (if needed)


Code Block
languagebash
titleSet up directory for working with FASTQs
export CORENGS=/work/projects/BioITeam/projects/courses/Core_NGS_Tools

# Create a $SCRATCH area to work on data for this course,
# with a sub-direct[1ory for pre-processing raw fastq files
mkdir -p $SCRATCH/core_ngs/fastq_prep

# Make a symbolic links to the original yeast data:
cd $SCRATCH/core_ngs/fastq_prep
ln -s -fsf $CORENGS/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz
ln -s -fsf $CORENGS/yeast_stuff/Sample_Yeast_L005_R2.cat.fastq.gz


...

Expand
titleAnswer

These are 101-base reads. wc -c counts the "invisible" newline character, so subtract 1 from the character count it returns for a line.

Here's a way to strip the trailing newline characters from the quality scores string before calling wc -c to count the characters. We use the echo -n option that tells echo not to include the trailing newline in its output. We gemerate generate that text using sub-shell evaluation (an alternative to backtick evaluation) of that zcat ... command:

Code Block
languagebash
echo -n $( zcat Sample_H54_miRNA_L004_R1.cat.fastq.gz | head -2 | tail -1 ) | wc -c


...

  • Here's one of those cases where you have to be careful about separating standard output and standard error
    • cutadapt writes its FASTQ output to standard output by default, and writes summary information to standard error.
  • In this command we first redirect standard error to a log file named miRNA_test.cuta.log using the 2> syntax, in order to capture cutadapt diagnostics.
    • Then the remaining standard output is piped to gzip, whose output is the redirected to a new compressed FASTQ file. (Read more about Standard streams and redirection)

...

Expand
titleInterpreting cutadapt I/O options

The cutadapt --help output describes its usage as follows:

    cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq

From this we see that the input.fastq is a required argument. Clearly, it can be a FASTQ file, and it can be compressed based on this help:

Compressed input and output is supported and
auto-detected from the file name (.gz, .xz, .bz2).
Use the file name '-' for standard input/output.

And this says that input reads can also be provided on standard input, if that argument is a hyphen ( - ). So input data can come:

  • from a file named as an argument:
    • cutadapt -a CGTAATTCGCG -o small.trim.fq  small.fq
    • and that input.fastq file can be provided in one of three compression formats
  • from standard input if the input.fastq argument is replaced with a dash ( - )
    • cat small.fq | cutadapt -a CGTAATTCGCG -o small.trim.fq  -

What about cutadapt output (the trimmed reads)? The brackets around the usage -o option indicate that the resulting trimmed FASTQ can be written to a file, but is not by default. This implies that cutadapt by default writes its results to standard output. So output can go

  • to a file, using the -o option
    • cutadapt -a CGTAATTCGCG -o small.trim.fq  small.fq
  • to standard output without the -o option
    • cutadapt -a CGTAATTCGCG small.fq 1> small.trim.fq

Finally, as we've seen, cutadapt also writes diagnostic output. Where does it go? The usage line doesn't say anything about diagnostics explicitly. But in the Output section of cutadapt --help:

   -o FILE, --output=FILE
        Write trimmed reads to FILE. FASTQ or FASTA format is
        chosen depending on input. The summary report is sent
        to standard output. Use '{name}' in FILE to
        demultiplex reads into multiple files. Default: write
       
to standard output

Careful reading of this suggests that:

  • When the trimmed output is sent to a file with the -o output.fastq option,
    • diagnostics are written to standard output
      • so can be redirected to a log file with 1> small.trim.log
    • cutadapt -a CGTAATTCGCG -o small.trim.fq  small.fq 1> small.trim.log
  • But when the -o option is omitted, and output goes to standard output,
    • diagnostics must be written to standard error
      • so can be redirected to a log file with 2> trim.log
    • cutadapt -a CGTAATTCGCG small.fq 1> small.trim.fq 2> small.trim.log

...

Code Block
languagebash
titleSetup for cutadapt
mkdir -p $SCRATCH/core_ngs/cutadapt
cd $SCRATCH/core_ngs/cutadapt
cp $CORENGS/human_stuff/Sample_H54_miRNA_L004_R1.cat.fastq.gz .
cp $CORENGS/human_stuff/Sample_H54_miRNA_L005_R1.cat.fastq.gz .
cp $CORENGS/custom_tracksalignment/Yeast_RNAseq_L002_R1.fastq.gz .
cp $CORENGS/custom_tracksalignment/Yeast_RNAseq_L002_R2.fastq.gz .

...

Tip
titlePaired-end RNA fastq trimming script

The BioITeam has an a number of useful NGS scripts that can be executed by anyone on ls6. or stampede2 stampede3. They are located in the /work/projects/BioITeam/common/script/ directory.

For groups that participate in BRCF pods, the scripts are available in /mnt/bioi/script on any compute server.

...

Or use this "cat to MARKER" trick, also known as an heredoc. The MARKER tag can be anything; below it is EOL.

...

Code Block
languagebash
titleCreate and submit cutadapt batch script
cd $SCRATCH/core_ngs/cutadapt
launcher_creator.py -j cuta.cmds -n cuta -t 01:00:00 -a OTH21164 -q normal 
sbatch --reservation=CoreNGS-Wed cuta.slurm
showq -u

# or, if you're not on the reservation:
launcher_creator.py -j cuta.cmds -n cuta -t 01:00:00 -a OTH21164 -q development
sbatch cuta.slurm
showq -u

...

  • H54_miRNA_L004.cuta.log, H54_miRNA_L005.cuta.log, yeast_rnaseq.cuta.log
    • these are the main execution log files, one for each trim_adapters.sh command
  • H54_miRNA_L004.acut.pass0.log, H54_miRNA_L005.acut.pass0.log
    • these are cutadapt statistics files for the single-end adapter trimming
    • their contents will look like our small example above
  • yeast_rnaseq.acut.pass1.log, yeast_rnaseq.acut.pass2.log
    • these are cutadapt statistics files from paired-end trimming of the R1 and R2 adapters, respectively.

...

Expand
titleHint


Code Block
languagebash
echo "$((`zcat yeast_rnaseq_R1.cuta.fastq.gz | wc -l` / 4))"
zcat yeast_rnaseq_R2.cuta.fastq.gz | wc -l | awk '{printf("%d\n", $1/4)}'

For more on printf, which is available in most programming languages, see see https://en.wikipedia.org/wiki/Printf#Format_specifier or https://alvinalexander.com/programming/printf-format-cheat-sheet/

...