Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Migrated to Confluence 4.0

...

A shell is a program that takes your commands from the keyboard and gives them to the operating system. Most Linux systems utilize Bourne Again SHell (bash), but there are several additional shell programs on a typical Linux system such as ksh, tcsh, and zsh. The simplest way to check which shell your machine has is to type any random letters and hit enter. For example, Lonestar in TACC uses bash.

Code Block
login1$ ffkakldk
-bash: ffkakldk: command not found

...

  • The first line tells the shell which program to use to execute this file (here, the bash program).
  • The 2nd line sets the shell variable TEXT to the first command line argument.
  • The 3rd line defaults the value of TEXT to the string "Shell World" if no command line argument is provided.
  • Remaining lines echo some text, substituting the value passed in on the command line.

...

The script can be run, with or without command line arguments, by explicitly invoking bash as follows:

Code Block
user$ bash hello.sh
-------------------
Hello, Shell World!
-------------------
user$ bash hello.sh Goddess
-------------------
Hello, Goddess!
-------------------

...

This says that anyone can read the file, the owner (you) or anyone in your group can modify it (write permission), but no one can execute it. We use the chmod program to allow anyone to execute the script:

...

The first real script you will likely find yourself wanting is one that performs a standard set of alignment tasks such as mapping, bam file creation and statistics reporting. The script we want for the bwa aligner would do the following:

  1. Aligns a fastq file to a pre-made reference genome
  2. Extracts alignments from bwa's proprietary binary .sai file to a .sam file
  3. Converts the .sam file into a .bam file using samtools
  4. Sort and index the .bam file so that it can be viewed in IGV.
  5. Count the number of aligned and unaligned reads, and calculate the mapping rate.

...

The first thing you will notice about this script is that there is a lot of argument and error checking -- more than the actual "work" code! This is a hallmark of a well-written shell script, especially one that will be run at TACC by many processors at a time.

...

Why write a function this simple? Because we're going to build on it, writing more specialized error checking functions, and we want all of them to print out the boilerplate text ("...exiting") before they exit. If we ever want to change this boilerplate text we only have to change it in one function! And, since the text is well-known and not likely to be written by successful programs, we can easily grep for it in our execution log files. For a single file:

...

A time-honored convention is that all programs whether shell scripts, built-in shell commands, user-written scripts or other programs, exit with a return code of 0 if all went well, or with any other integer return code if not. Calling programs can then check the return code to see if something went wrong. In the bash shell, the just-executed program's return code is placed in the special $? variable, which should be checked right away because doing anything else will reset it. So for example, to check whether a call to bwa aln returned 0 (ok, keep going) or not (bad, exit with message):

...

It first calls ckFile to see if the file exists. Only if it does do the further statements get executed. The file size check is performed by piping the result of ls -l <file> to an awk script that just echos the file size part of the line (field 5). This chained command is executed by putting it in back quotes and its result stored in the SZ variable, which is then checked to see if it was the string "0".

...

Finally, the last few lines of the script should declare success in a way that can be grep'd for. Ours uses this boilerplate text:

...

After the first part of align_bwa.sh has performed some initial error checks and established the execution environment, the script gets about doing the real work. For example, when doing a single-end alignment, it makes a call to bwa aln passing the pathname prefix for the indexed reference genome files and the input fastq file name, then redirecting the output (which normally goes to standard output) to a .sai file named using the output prefix specified by the user. We then use our "belts and suspenders" approach to error checking to make sure all went well.

...

Note that .sai is a proprietary binary format used by bwa. Most aligners have some equivalent "intermediate" format that can then be translated in to a .sam or .bam file. For bwa, the command to extract alignments is samse (single end alignment) or sampe (paired end alignment). Here's what the single-end call looks like:

Code Block
bwa samse -r "$RG" $REF_PFX $OUT_PFX.sai $IN_FQ | samtools view -b -S - > $OUT_PFX.bam;
ckRes $? "bwa samse";
ckFileSz "$OUT_PFX.bam";

The call to bwa samse requires the same pathname prefix for the indexed reference genome files and input fastq file name passed to bwa aln. It also takes the .sai binary alignment file name. In addition, we provide read group information (the -r "$RG" option) which will be stored in the .bam header (see the script comments for more information).

Since we want a binary .bam as output, but bwa samse (and sampe) produce .sam text output, we pipe the .sam file output to samtools view to convert it to .bam output, which is then redirected to an output file named using the user's output prefix. This command chaining or "piping" avoids having to write then read an intermediate .sam file. Note the dash on the samtools view -b -S - command line means samtools should look for its input data on standard input instead of in a file.

When aligning paired-end reads, bwa aligns each set of read ends independently, then uses pairing information when the alignments are extracted (for example, to compute the insert size between reads where both ends aligned). So the call to bwa sampe in our script takes arguments for fastq and .sai files for each end.

At this point the .sam/.bam file produced has a header, and then one line for each read end that was processed. Read pairs are listed one after the other, in the same name order as the input fastq file: this is referred to as read name ordering. While useful for some applications, most downstream tools (such as the IGV visualization program) require a .bam that is sorted by location (location ordered). A location consists a contig name, as defined in the original .fasta file used to generate the reference index (e.g. chr14) and a start position. The names of the contigs and their lengths are kept in the .sam/.bam header, which is why the header is required for sorting.

The actual bam sorting and indexing are straightforward calls to samtools (although you might want to check out the -m maximum memory option for samtools sort; it can speed up sorting of large files considerably):

...

Code Block
samtools flagstat $OUT_PFX.sorted.bam | tee $OUT_PFX.flagstat.txt
ckRes $? "samtools flagstat";
ckFileSz "$OUT_PFX.flagstat.txt";

Note we tee the output so that the information is both stored in a file and appears in the execution log. Output from samtools flagstat looks like this for paired end alignments:

...