Part 3: Working with 3rd party program I/O
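Recall the three standard Unix streams. Each has a number, a name, and its own redirection syntax: standard input (0, redirected with <), standard output (1, redirected with > or 1>), and standard error (2, redirected with 2>).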
3rd party tool file and stream handling
Third party bioinformatics tools are often written to perform sub-command processing; that is, they have a top-level program that handles multiple sub-commands. Examples include the bwa NGS aligner and the samtools and bedtools tool suites.
To see their menu of sub-commands, you usually just need to enter the top-level command, or <command> --help. Similarly, sub-command usage is usually available as <command> <sub-command> or <command> <sub-command> --help.
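To make the sub-command pattern concrete, here is a minimal sketch of how such a top-level program might dispatch on its first argument. The mytool function and its index and align sub-commands are hypothetical, invented only for illustration; real tools such as bwa or samtools implement this in their own way.

```shell
#!/bin/sh
# Hypothetical toy tool: "mytool" is not a real program, just an illustration
# of a top-level command that dispatches to sub-commands.
mytool() {
    case "$1" in
        index) echo "mytool index: build an index for <ref.fa>" ;;
        align) echo "mytool align: align <reads.fq> against <idxbase>" ;;
        ""|--help)
            # No sub-command (or --help): print the menu of sub-commands
            echo "Usage: mytool <command> [options]"
            echo "Commands: index, align"
            ;;
        *)
            # Unknown sub-command: complain on standard error and fail
            echo "mytool: unknown command '$1'" >&2
            return 1
            ;;
    esac
}

mytool            # no arguments: prints the menu of sub-commands
mytool align      # sub-command alone: prints that sub-command's usage
```

With a real tool suite the same exploration would look like entering bwa on its own, then bwa mem, although the exact help behavior varies from tool to tool.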
3rd party tools and standard streams
Many tools write their main output to standard output by default, but have options to write it to a file instead.
Similarly, tools often write processing status and diagnostics to standard error, and it is usually your responsibility to redirect this elsewhere (e.g. to a log file).
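The split between main output and diagnostics can be tried out without any bioinformatics software installed. The sketch below uses a hypothetical shell function, fake_aligner, as a stand-in for a real tool; the function name, its output lines, and the file names are all made up for illustration.

```shell
#!/bin/sh
# "fake_aligner" is a hypothetical stand-in for a real 3rd party tool.
fake_aligner() {
    echo "[fake_aligner] processed 2 reads" >&2    # diagnostics -> standard error
    printf 'read1\tchrI\t100\n'                     # main output -> standard output
    printf 'read2\tchrI\t250\n'
    echo "[fake_aligner] done" >&2
}

workdir=$(mktemp -d)   # scratch directory so no existing files are touched

# Main output goes to a file; diagnostics still appear on the terminal
fake_aligner > "$workdir/out.tsv"

# Main output goes to one file, diagnostics to a separate log file
fake_aligner > "$workdir/out.tsv" 2> "$workdir/run.log"

cat "$workdir/run.log"
```

Because the two streams are redirected separately, the log messages never end up mixed into the data file.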
Finally, tools may support taking their main input from standard input, but need a "placeholder" argument where you would usually specify a file. That standard input placeholder is usually a single dash (-) but can also be a reserved word such as stdin.
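The dash placeholder convention is not specific to bioinformatics tools. As a small illustration, the standard cat utility also treats a - argument as "read standard input here". The file names below are hypothetical and are created in a scratch directory.

```shell
#!/bin/sh
workdir=$(mktemp -d)   # scratch directory for the example files
printf 'header\n' > "$workdir/header.txt"

# cat treats "-" as "read standard input at this position", so the piped
# lines are written out after the contents of header.txt
printf 'read1\nread2\n' | cat "$workdir/header.txt" - > "$workdir/combined.txt"

cat "$workdir/combined.txt"
```

The resulting file holds the header line followed by the two piped lines, showing that the dash occupied the position where a second file name would normally go.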
Now let's see how these concepts fit together when running 3rd party tools.
Exercise 2-3 bwa mem
Display the bwa mem sub-command usage using the more pager.
Where does the bwa mem sub-command write its output?
How can this be changed?
bwa mem also writes diagnostic progress as it runs, to standard error. This is typical for tools that may run for an extended period of time.
Show how you would invoke bwa mem to capture both its alignment output and its progress diagnostics. Use input from a my_fastq.fq file and ./refs/hg38 as the <idxbase>. (The resulting expression isn't expected to work!)
A real example:
cd ~/gzips

# Diagnostic progress is written to standard error, which is
# mapped to the Terminal
bwa mem /mnt/bioi/ref_genome/bwa/bwtsw/sacCer3/sacCer3.fa \
  sm2.fq.gz > small.sam

# Diagnostic progress on standard error is redirected to a log file
bwa mem /mnt/bioi/ref_genome/bwa/bwtsw/sacCer3/sacCer3.fa \
  sm2.fq.gz > small.sam 2>small.log
cat small.log
Exercise 2-4 cutadapt
The cutadapt adapter trimming command reads NGS sequences from a FASTQ file, and writes adapter-trimmed reads to a FASTQ file. Find its usage.
Where does cutadapt write its output by default? How can that be changed?
Where does cutadapt read its input from by default? How can that be changed? Can the input FASTQ be in compressed format?
Where does cutadapt write its diagnostic output by default? How can that be changed?
Real examples:
cd ~/gzips

# No -o option, so output is written to standard output, redirected here
# Summary report is written to standard error, which goes to the Terminal
cutadapt -a AGATCGGAAGAGCACACGTCTGA small.fq > trimmed.fq

# Same as above, but summary report is redirected to a log file
cutadapt -a AGATCGGAAGAGCACACGTCTGA small.fq > trimmed.fq 2>trim.log
cat trim.log

# Use -o to specify the output file
# Pipe the fastq data in, specifying "-" as the placeholder argument
# Summary report will go to standard output; redirect to a log file
cat small.fq | \
  cutadapt -a AGATCGGAAGAGCACACGTCTGA -o trimmed.fq - 1>tr.log
cat tr.log