Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

A good shell script should also be relatively easy to call. That's why, for example, we have this script takes only a short name of the desired reference and uses it to select the correct path, and only requires the name of the R1 fastq for paired-end reads, using that path to determine the name of the R2 fastq file. While we won't go into the details of that defaulting in this discussion, you might want to look at those parts of the code for ideas on how to accomplish similar goals.

The real work!

After the first part of align_bwa.sh has performed some initial error checks and established the execution environment, the script gets about doing the real work. For example, when doing a single-end alignment, it makes a call to bwa aln passing the pathname prefix for the indexed reference genome files and the input fastq file name, and redirecting the output (which normally goes to standard output) to a .sai file named using the output prefix specified by the user:

...

Finally, the last few lines of the script should declare success in a way that can be grep'd for. Ours uses this boilerplate text:

Code Block

We can then check that all of our scripts have done their proper work like this:

Code Block

find . -name "*.log" | xargs grep 'completed successfully' | wc -l

This will print the number of log files that have the magic success words, and we can compare that number against the number of scripts we actually ran.

The real work!

After the first part of align_bwa.sh has performed some initial error checks and established the execution environment, the script gets about doing the real work. For example, when doing a single-end alignment, it makes a call to bwa aln passing the pathname prefix for the indexed reference genome files and the input fastq file name, and redirecting the output (which normally goes to standard output) to a .sai file named using the output prefix specified by the user. We then use our "belts and suspenders" approach to error checking to make sure all went well.

Code Block

bwa aln $REF_PFX $IN_FQ > $OUT_PFX.sai
ckRes $? "bwa aln";
ckFileSz "$OUT_PFX.sai";

Note that .sai is a proprietary binary format used by bwa. Most aligners have some equivalent "intermediate" format that can then be translated in to a .sam or .bam file. For bwa, the command to extract alignments is samse (single end alignment) or sampe (paired end alignment). Here's what the single-end call looks like:

Code Block

bwa samse -r "$RG" $REF_PFX $OUT_PFX.sai $IN_FQ | samtools view -b -S - > $OUT_PFX.bam;
ckRes $? "bwa samse";
ckFileSz "$OUT_PFX.bam";

The call to bwa samse requires the same pathname prefix for the indexed reference genome files and input fastq file name passed to bwa aln. It also takes the .sai binary alignment file name. In addition, we provide read group information (the -r "$RG" option) which will be stored in the .bam header (see the script comments for more information).

Since we want a binary .bam as output, but bwa samse (and sampe) produce .sam text output, we pipe the .sam file output to samtools view to convert it to .bam output, which is then redirected to an output file named using the user's output prefix. This command chaining or "piping" avoids having to write then read an intermediate .sam file. Note the dash on the samtools view -b -S - command line means samtools should look for its input data on standard input instead of in a file.

When aligning paired-end reads, bwa aligns each set of read ends independently, then uses pairing information when the alignments are extracted (for example, to compute the insert size between reads where both ends aligned). So the call to bwa sampe in our script takes arguments for fastq and .sai files for each end.

At this point the .sam/.bam file produced has a header, and then one line for each read end that was processed. Read pairs are listed one after the other, in the same name order as the input fastq file: this is referred to as read name ordering. While useful for some applications, most downstream tools (such as the IGV visualization program) require a .bam that is sorted by location (location ordered). A location consists a contig name, as defined in the original .fasta file used to generate the reference index (e.g. chr14) and a start position. The names of the contigs and their lengths are kept in the .sam/.bam header, which is why the header is required for sorting.

The actual bam sorting and indexing are straightforward calls to samtools (although you might want to check out the -m maximum memory option for samtools sort; it can speed up sorting of large files considerably):

Code Block

echo "Sorting '$OUT_PFX.bam'...";
samtools sort $OUT_PFX.bam $OUT_PFX.sorted;
ckRes $? "samtools sort";
ckFileSz "$OUT_PFX.sorted.bam";

echo "Indexing '$OUT_PFX.sorted.bam'...";
samtools index $OUT_PFX.sorted.bam;
ckRes $? "samtools index";
ckFileSz "$OUT_PFX.sorted.bam.bai";

Finally, we call samtools flagstat to report alignment statistics:

Code Block

samtools flagstat $OUT_PFX.sorted.bam | tee $OUT_PFX.flagstat.txt
ckRes $? "samtools flagstat";
ckFileSz "$OUT_PFX.flagstat.txt";

Note we tee the output so that the information is both stored in a file and appears in the execution log. Output from samtools flagstat looks like this for paired end alignments:

Code Block

4029634 + 0 in total (QC-passed reads + QC-failed reads)
2492274 + 0 duplicates
3292433 + 0 mapped (81.71%:-nan%)
4029634 + 0 paired in sequencing
2014817 + 0 read1
2014817 + 0 read2
3158008 + 0 properly paired (78.37%:-nan%)
3181258 + 0 with itself and mate mapped
111175 + 0 singletons (2.76%:-nan%)
5339 + 0 with mate mapped to a different chr
3711 + 0 with mate mapped to a different chr (mapQ>=5)

To summarize the statistics from all your (possibly parallel) alignments, you could do something like this:

Code Block

find . -name "*.flagstat.txt" | xarg grep 'mapped ('

Daechan original