...
- Name of the input fastq file (or the R1 file if paired).
- A prefix to use when writing output files (e.g. <prefix>.bam).
- Name of an reference genome to use. The script will find the appropriate reference index based on this value.
- A flag indicating whether single end or paired end alignment should be done. 0 = single, 1 = paired.
A good shell script should be relatively easy to call. That's why, for example, we have this script takes only a short name of the desired reference and uses it to select the correct path, and only requires the name of the R1 fastq for paired-end reads, using that path to determine the name of the R2 fastq file.
Here is a completed Example BWA alignment script.You may want to open it in a separate window so you can read along as it is discussed here. It is also available in the course materials as align_bwa.sh.
...
The first thing you will notice about this script is that there is a lot of argument and error checking -- more than the actual "work" code. ! This is a hallmark of a well-written shell script, especially one that will be run at TACC by many processors at a time.
...
The approach this script takes to error checking is that many many things can go wrong (this . This is from experience; : every error check in this script checks for something that has gone wrong for us in the past :)
...
Why write a function this simple? Because we're going to build on it, writing more specialized error checking functions, and we want all of them to print out the boilerplate text ("...exiting") before they exit. If we ever want to change this boilerplate text we only have to change it in one function! And, since the text is well-known and not likely to be written by successful programs, we can easily grep for it in our execution log files. For a single file:
...
Here is a more specialized ckFile function that checks for the existence of the filename file name passed as its first argument:
| Code Block |
|---|
ckFile() {
if [ ! -e "$1" ]; then
err "$2 File '$1' not found";
fi
}
|
If the filename file name passed as the first argument exists, nothing happens when ckFile is called. If the file does not exist, the shell script exits at the line where the ckFile called after printing out a diagnostic message that includes our boilerplate (because this function calls err).
...
A time-honored convention is that all programs whether shell scripts, built-in shell commands, user-written scripts or other programs, exit with a return code of 0 if all went well, or with any other integer return code if not. Calling programs can then check the return code to see if something went wrong. In the bash shell, the just-executed program's return code is placed in the special xxx $? variable, which should be checked right away because doing anything else in the shell will reset it. So for example, to check whether a call to bwa aln returned 0 (ok, keep going) or not (bad, exit with message):
...
- Programs don't always return a non-0 return code. For example, if called with no arguments just to show usage they often return 0 (after all, there was no error, even if nothing was done). Even well-written programs sometimes neglect to return non-0 exit codes in some circumstances.
- Sometimes a program (or the shell) creates an empty output file before doing anything else. So if it doesn't return a non-0 error code, you can think the program ran fine, even if the file exists. This happens often enough to warrant a special check.
Here's a ckFileSz function that accomplishes this goal, building on the ckFile and err functions:
| Code Block |
|---|
ckFileSz() {
ckFile $1 $2;
SZ=`ls -l $1 | awk '{print $5}'`;
if [ "$SZ" == "0" ]; then
err "$2 file '$1' is zero length";
fi
}
|
It first calls ckFile to see if the file exists. Only if it does do the further statements get executed. The file size check is performed by piping the result of ls -l <file> to and awk script that just echos the file size part of the line (field 5). This chained command is executed by putting it in backquotes and its result stored in the SZ variable, which is then checked to see if it was the string "0".
Other niceties
A good shell script should be relatively easy to call. That's why, for example, we have this script takes only a short name of the desired reference and uses it to select the correct path, and only requires the name of the R1 fastq for paired-end reads, using that path to determine the name of the R2 fastq file.