
...

This script's approach to error checking assumes that many, many things can go wrong. That assumption comes from experience: every error check in this script tests for something that has actually gone wrong for us in the past :)

...

Here is another function, ckRes, that checks the result code passed in as its first argument. It uses the text passed as its second argument either to print a diagnostic message (by calling our friend err) or to print a message showing that the task completed, and when:

...
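For readers who want something concrete to experiment with, here is a minimal sketch of what a pair of functions like err and ckRes might look like. The function bodies, the message wording and the exit code 255 are our assumptions for illustration only; the versions in the actual script may well differ.

```shell
# Hypothetical sketch, not the script's actual code.

# err: print a diagnostic message and stop the script.
err() {
  echo "** ERROR: $1 on $(date)"
  exit 255   # assumed exit code; any non-zero value would do
}

# ckRes: $1 is a result code, $2 is text describing the task.
ckRes() {
  if [ "$1" = "0" ]; then
    echo "..$2 completed on $(date)"
  else
    err "$2 returned non-zero exit code $1"
  fi
}
```

Called as ckRes $? "bwa aln" right after a command, this either logs the completion time or halts the script with a diagnostic.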

Finally, the last few lines of the script should declare success in a way that can be grep'd for. Ours uses this boilerplate text:

Code Block

...


echo "---------------------------------------------------------";
echo "All bwa alignment tasks completed successfully!";
echo "`date`";
echo "---------------------------------------------------------";
exit 0;

We can check that all of our scripts have done their proper work using something like this:

Code Block
find . -name "*.log" | xargs grep 'completed successfully' | wc -l
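Counting successes tells you how many scripts finished, but not which ones did not. A small variation, sketched here, lists the log files that lack the success message instead. The function name list_failed_logs is made up for this example; it relies only on the standard -L option of grep, which prints the names of files that contain no match.

```shell
# Hypothetical helper: print every *.log file under directory $1
# that does NOT contain the success message.
list_failed_logs() {
  find "$1" -name "*.log" -exec grep -L 'completed successfully' {} +
}
```

Running list_failed_logs . from the directory holding your logs should print nothing when every job succeeded.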

...

The real work!

After the first part of align_bwa.sh has performed some initial error checks and established the execution environment, the script gets down to doing the real work. For example, when doing a single-end alignment, it calls bwa aln, passing the pathname prefix for the indexed reference genome files and the input fastq file name. The output (which normally goes to standard output) is redirected to a .sai file named using the output prefix specified by the user. We then use our "belts and suspenders" approach to error checking to make sure all went well: we check both the command's result code and the output file it was supposed to produce.
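The second half of that "belts and suspenders" check is a test on the output file itself, since a command can return success and still leave behind a missing or empty file. Here is a minimal sketch of a file check in that spirit. The body of ckFileSz below is our guess at the idea, not the script's actual code, and it uses return where a real script would more likely exit.

```shell
# Hypothetical sketch of a file-size check, not the script's actual code.
# Succeeds only if file $1 exists and is non-empty.
ckFileSz() {
  if [ ! -s "$1" ]; then
    echo "** ERROR: output file $1 is missing or empty"
    return 1   # a real script would probably stop here instead
  fi
  echo "..$1 exists and is non-empty"
}
```

The -s test is true only for a file that exists and has a size greater than zero, so this catches both a command that never created its output and one that created it but wrote nothing.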

...

Code Block
bwa samse -r "$RG" $REF_PFX $OUT_PFX.sai $IN_FQ | samtools view -b -S - > $OUT_PFX.bam; ckRes $? "bwa samse"; ckFileSz "$OUT_PFX.bam";

The call to bwa samse requires the same pathname prefix for the indexed reference genome files and input fastq file name passed to bwa aln. It also takes the .sai binary alignment file name. In addition, we provide read group information (the -r "$RG" option) which will be stored in the .bam header (see the script comments for more information).

Since we want a binary .bam as output, but bwa samse (and sampe) produce .sam text output, we pipe the .sam output to samtools view to convert it to .bam, which is then redirected to an output file named using the user's output prefix. This command chaining or "piping" avoids having to write and then read an intermediate .sam file. Note that the dash on the samtools view -b -S - command line means samtools should look for its input data on standard input instead of in a file.
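One caveat about checking results after a pipe: in bash, $? after a pipeline normally holds the result code of the last command only (here samtools view), so a failure in the first command can go unnoticed unless the pipefail option is set. We can't tell from the excerpt whether the script sets it. The bash-specific sketch below shows the difference, using false and cat as stand-ins for bwa samse and samtools view; run_pipe is a made-up name for this example.

```shell
# Bash-specific sketch. run_pipe stands in for
# "bwa samse ... | samtools view ..." where the FIRST stage fails.
run_pipe() { false | cat > /dev/null; }

# Default behavior: only the last command's result code counts.
if run_pipe; then plain="ok"; else plain="failed"; fi

# With pipefail, a failure anywhere in the pipeline is reported.
set -o pipefail
if run_pipe; then strict="ok"; else strict="failed"; fi
set +o pipefail

echo "without pipefail: $plain; with pipefail: $strict"
# prints: without pipefail: ok; with pipefail: failed
```

A file-size check on the final output gives some protection here as well, since a failed first stage often leaves a suspiciously small output file.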

When aligning paired-end reads, bwa aligns each set of read ends independently, then uses pairing information when the alignments are extracted (for example, to compute the insert size between reads where both ends aligned). So the call to bwa sampe in our script takes arguments for fastq and .sai files for each end.

At this point the .sam/.bam file produced has a header, and then one line for each read end that was processed. Read pairs are listed one after the other, in the same name order as the input fastq file: this is referred to as read name ordering. While useful for some applications, most downstream tools (such as the IGV visualization program) require a .bam that is sorted by location (location ordered). A location consists of a contig name, as defined in the original .fasta file used to generate the reference index (e.g. chr14), and a start position. The names of the contigs and their lengths are kept in the .sam/.bam header, which is why the header is required for sorting.
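To make the two orderings concrete, here is a toy illustration using plain text and the standard sort command rather than real alignment files. The three-column records (read name, contig, start position) and the function names are invented for this example. Note one simplification: this sketch orders contigs alphabetically, whereas samtools sort orders them as they are listed in the header.

```shell
# Toy records in read name order: name, contig, start (tab-separated).
name_ordered() {
  printf 'readA\tchr14\t5000\n'
  printf 'readA\tchr14\t5200\n'
  printf 'readB\tchr2\t300\n'
  printf 'readB\tchr2\t450\n'
  printf 'readC\tchr14\t120\n'
  printf 'readC\tchr14\t310\n'
}

# Location order: by contig name (field 2), then numerically by start (field 3).
location_ordered() {
  name_ordered | LC_ALL=C sort -t "$(printf '\t')" -k2,2 -k3,3n
}
```

After sorting, the two readC lines come first (chr14 at 120 and 310), followed by readA (chr14 at 5000 and 5200), then readB on chr2. Reads from the same pair stay adjacent here only because their positions happen to be close together.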

The actual bam sorting and indexing are straightforward calls to samtools (although you might want to check out the -m maximum memory option for samtools sort; it can speed up sorting of large files considerably):

...

Version	Old Version 17	New Version 18
Changes made by	Anna Battenhouse	Anna Battenhouse
Saved on	May 20, 2012	May 20, 2012
