Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  1. Name of the input fastq file (or the R1 file if paired).
  2. A prefix to use when writing output files (e.g. <prefix>.bam).
  3. Name of an reference genome to use. The script will find the appropriate reference index based on this value.
  4. A flag indicating whether single end or paired end alignment should be done. 0 = single, 1 = paired.

A good shell script should be relatively easy to call. That's why, for example, we have this script takes only a short name of the desired reference and uses it to select the correct path, and only requires the name of the R1 fastq for paired-end reads, using that path to determine the name of the R2 fastq file.

Here is a completed Example BWA alignment script.You may want to open it in a separate window so you can read along as it is discussed here. It is also available in the course materials as align_bwa.sh.

...

Let's say you run 20 alignments in parallel at TACC. How do you know if they all completed successfully? If they some did not, which ones? And how do you tell what went wrong? Do you really want to poke around in 20 directories/files to figure it out?Also, a good shell script should be relatively easy to call. That's why, for example, this script takes only a short name of the desired reference and uses it to select the correct path, and only requires the name of the R1 fastq for paired-end reads, and uses that path to determine the name of the R2 fastq file.

The approach this script takes to error checking is that many many things can go wrong (this is from experience; every error check in this script checks for something that has gone wrong for us in the past :)

...

This function takes one argument, echoes it along with some boilerplate text ("...exiting") and exits. You might call it like this from a shell script:

Code Block
if [ "abc" != -e "$MY_FILE"ABC" ]; then
   err "Filestring '$MY_FILEabc' doesis not exist the same as 'ABC'"; 
fi

...

Why write a function this simple? Because we're going to build on it, writing more specialized error checking functions, and we want all of them to print out the boilerplate text ("...exiting") before they exit. That way If we want to change this boilerplate text we only have to change it in one function! And, since the text is well-known and not likely to be written by successful programs, we can easily grep for this well-know text it in our execution log files. For a single file:

...

Or even better, for any log file in subdirectories any subdirectory of the current directory:

Code Block
find . -name "*.log" | xargs grep 'exiting'

Here is a more specialized ckFile function that checks for the existence of the filename passed as its first argument:

Code Block

ckFile() {
  if [ ! -e "$1" ]; then
    err "$2 File '$1' not found";
  fi
}

If the filename passed as the first argument exists, nothing happens when ckFile is called. If the file does not exist, the shell script exits at the line where the ckFile called after printing out a diagnostic message that includes our boilerplate (because this function calls err.

The function can be called with one or two arguments, for example:

Code Block

ckFile myFile.fastq
ckFile myFile.fastq "Input fastq"

For a further refinement of file checking, we also check that the file's size is non-0. Why? Because:

Daechan original