A brief introduction to shell scripting. Please see /corral-repl/utexas/BioITeam/ngs_intro/scripts for several example scripts.
| Table of Contents | ||
|---|---|---|
|
What and why is Shell
...
Scripting?
A shell is a program that takes your commands from the keyboard and gives them to the operating system. Most Linux systems utilize Bourne Again SHell (bash), but there are several additional shell programs on a typical Linux system such as ksh, tcsh, and zsh. The simplest way to check which shell your machine has is to type any random letters and hit enter. For example, Lonestar in TACC uses bash.
| Code Block |
|---|
login1$ ffkakldk -bash: ffkakldk: command not found |
...
Exercise 1 - Hello world
Below is the easiest a simple shell script in the world. Create and run this script.
...
that takes one argument (the text to print after "Hello") and echos it.
- The first line tells the shell which program to use to execute this file (here, the bash program).
- The 2nd line sets the shell variable TEXT to the first command line argument.
- The 3rd line defaults the value of TEXT to the string "Shell World" if no command line argument is provided.
- Remaining lines echo some text, substituting the value passed in on the command line.
| Code Block |
|---|
#!/bin/bash TEXT=$1 : ${TEXT:="Shell World"} echo "-------------------" echo "Hello, Shell World$TEXT!" echo "-------------------" |
Open your favorite text editor, enter these lines, and save as testhello.sh (note the file extension for shell scripts is .sh). Using cd command, go to Then open a Terminal window and change into the directory where test.sh is. Then, execute the script by entering './test.sh'
Exercise 2
You already generated reference file. Now, you want to do the following task in order.
- Map a fastq file on reference genome using BWA
- Convert .sai file to .sam file
- Convert the .sam file into .bam file using Samtools
- Count the number of aligned and unaligned reads, respectively, and calculate the mapping rate.
Below is an example script for yeast. This works only in my machine because my yeast reference genome is under /Users/Daechan/Desktop/ref/ and the sample input file name is "subset.test.fastq" Change refPath and inFile variables, save as basic_script.sh, and execute it by entering "./basic_script.sh"
...
was saved. For example:
| Code Block |
|---|
cd /Desktop |
The script can be run, with or without command line arguments, by explicitly invoking bash as follows:
| Code Block |
|---|
user$ bash hello.sh
-------------------
Hello, Shell World!
-------------------
user$ bash hello.sh Goddess
-------------------
Hello, Goddess!
-------------------
|
There is a shortcut, though. Since we have the line at the top of this file that names the program that should run it, we should be able to execute the script just by typing in its pathname like this (where ./ means current directory):
| Code Block |
|---|
user$ ./hello.sh
-bash: ./hello.sh: Permission denied
|
But there''s a complication. Welcome to the world of Unix permissions! The script file must be marked executable for this to work. To see what the current permissions are:
| Code Block |
|---|
user$ ls -la hello.sh
-rw-rw-r-- 1 user group 122 May 18 01:16 hello.sh
|
This says that anyone can read the file, the owner (you) or anyone in your group can modify it (write permission), but no one can execute it. We use the chmod program to allow anyone to execute the script:
| Code Block |
|---|
username$ chmod +r hello.sh
username$ ls -la hello.sh
-rwxrwxr-x 1 user group 122 May 18 01:16 hello.sh
|
Now hello.sh can be invoked directly:
| Code Block |
|---|
username$ ./hello.sh "Expert scripter" ------------------- Hello, Expert scripter! ---------"; refPath="/Users/Daechan/Desktop/ref/sacCer1.fa"; inFile="subset.test.fastq" PFX="subset.test.output" saiFile="$PFX.sai"; samFile="$PFX.sam"; bamFile="$PFX.bam"; echo "Reference File: $refPath" echo "Input File name: $inFile" echo "Prefix for SAI BAM SAM files: $PFX" echo "SAI file name: $saiFile" echo "SAM file name: $samFile" echo "BAM file name: $bamFile" echo "------------------------------------------"; echo "Align subset.test.fastq on yeast genome"; echo "------------------------------------------"; bwa aln $refPath $inFile > $saiFile; echo "...Done"; echo "" echo "------------------------------------------"; echo "Convert the SAI file into the SAM file"; echo "------------------------------------------"; bwa samse $refPath $saiFile $inFile > $samFile; echo "...Done"; echo "" echo "------------------------------------------"; echo "Convert the SAM file into the BAM file"; echo "------------------------------------------"; samtools view -bS $samFile > $bamFile; echo "...Done"; echo "" echo "------------------------------------------"; echo "Calculate the simple statistics"; echo "------------------------------------------"; numMapRd=$(samtools view -F 0x04 $bamFile | wc -l ); echo "The number of mapped read: $numMapRd"; echo "" numUnmapRd=$(samtools view -f 0x04 $bamFile | wc -l ); echo "The number of unmapped read: $numUnmapRd" echo "" echo "Mapping rate is : $[$[numMapRd*100]/$[numMapRd+numUnmapRd]]%" |
Exercise3
You can repeatedly use the above script for multiple datasets by allowing the script to take arguments. Below script will give you the same result as above. Plus, variant calling file (vcf) will be generated. Before execute the script, save the below lines as args_script.sh and enter "./args_script.sh" without arguments. It will give you information about positional arguments. If you want to map the input file "subset.test.fastq" onto yeast reference genome and put a prefix "test" on all output files, enter "./args_script.sh sacCer1 subset.test.fastq test
| Code Block |
|---|
#!/bin/bash assembly=$1; inFile=$2; PFX=$3; saiFile="$PFX.sai"; samFile="$PFX.sam"; bamFile="$PFX.bam"; sortBam="$PFX.sorted.bam"; bcfFile="$PFX.bcf"; vcfFile="$PFX.vcf"; usage() { echo "------------------------------------------"; echo "args_script.sh <assembly> <inFile> <PFX>"; echo "Required Arguments" echo " assembly hg18 mm9 sacCer1"; echo " inFile path to input sequence file"; echo " outPFX prefix for output files"; echo "------------------------------------------"; exit 1; } if [ "$assembly" == "" ]; then usage ; fi if [ "$inFile" == "" ]; then usage ; fi if [ "$PFX" == "" ]; then usage ; fi if [ "$assembly" == "sacCer1" ]; then ref="/Users/Daechan/Desktop/ref/sacCer1.fa"; elif [ "$assembly" == "mm9" ]; then ref="/Users/Daechan/Desktop/ref/mm9.fa"; elif [ "$assembly" == "hg18" ]; then ref="/Users/Daechan/Desktop/ref/hg18.fa"; fi echo "------------------------------------------"; echo "Define variables"; echo "------------------------------------------"; echo "Reference File: $assembly" echo "Input File name: $inFile" echo "Prefix for SAI BAM SAM files: $PFX" echo "SAI file name: $saiFile" echo "SAM file name: $samFile" echo "BAM file name: $bamFile" echo "Sorted BAM file name: $sortBam" echo "BCF file name: $bcfFile" echo "VCF file name: $vcfFile" echo "" echo "------------------------------------------"; echo "Align subset.test.fastq on yeast genome"; echo "---------- |
Note that when we supplied the text "Expert scripter", we put it in quotes, which group the two words into one argument to the script. Without the quotes, the word "Expert" would be seen by the script as argument 1 and "scripter" would be seen as argument 2 (which our script ignores).
BWA alignment script
The first real script you will likely find yourself wanting is one that performs a standard set of alignment tasks such as mapping, bam file creation and statistics reporting. The script we want for the bwa aligner would do the following:
- Aligns a fastq file to a pre-made reference genome
- Extracts alignments from bwa's proprietary binary .sai file to a .sam file
- Converts the .sam file into a .bam file using samtools
- Sort and index the .bam file so that it can be viewed in IGV.
- Count the number of aligned and unaligned reads, and calculate the mapping rate.
Since we want to use this script on different datasets, it should take some arguments on the command line telling it what to work on. Let's have it take the following arguments:
- Name of the input fastq file (or the R1 file if paired).
- A prefix to use when writing output files (e.g. <prefix>.bam).
- Name of an reference genome to use. The script will find the appropriate reference index based on this value.
- A flag indicating whether single end or paired end alignment should be done. 0 = single, 1 = paired.
Here is a completed Example BWA alignment script.You may want to open it in a separate window so you can read along as it is discussed here. It is also available in the course materials as align_bwa.sh.
What could possibly go wrong?
The first thing you will notice about this script is that there is a lot of argument and error checking -- more than the actual "work" code! This is a hallmark of a well-written shell script, especially one that will be run at TACC by many processors at a time.
Let's say you run 20 alignments in parallel at TACC. How do you know if they all completed successfully? If some did not, which ones? And how do you tell what went wrong? Do you really want to poke around in 20 directories/files to figure it out?
The approach this script takes to error checking is that many many things can go wrong. This is from experience: every error check in this script checks for something that has gone wrong for us in the past :)
Script functions
Shell scripts can define functions, which are a convenient way to avoid repeating the same few lines of code again and again. This philosophy of code writing is called DRY, for Don't Repeat Yourself. Like shell scripts themselves, shell script functions take their arguments on the line that invokes them, and refers to them as $1, $2, etc.
Let's look at the simplest function in the align_bwa.sh script:
| Code Block |
|---|
err() {
echo "$1...exiting";
exit 1; # any non-0 code means error
}
|
This function takes one argument, echoes it along with some boilerplate text ("...exiting") and exits. You might call it like this from a shell script:
| Code Block |
|---|
if [ "abc" != "ABC" ]; then
err "string 'abc' is not the same as 'ABC'";
fi
|
Why write a function this simple? Because we're going to build on it, writing more specialized error checking functions, and we want all of them to print out the boilerplate text ("...exiting") before they exit. If we ever want to change this boilerplate text we only have to change it in one function! And, since the text is well-known and not likely to be written by successful programs, we can easily grep for it in our execution log files. For a single file:
| Code Block |
|---|
grep 'exiting' myrun.log
|
Or even better, for any log file in any subdirectory of the current directory:
| Code Block |
|---|
find . -name "*.log" | xargs grep 'exiting'
|
Here is a more specialized ckFile function that checks for the existence of the file name passed as its first argument:
| Code Block |
|---|
ckFile() {
if [ ! -e "$1" ]; then
err "$2 File '$1' not found";
fi
}
|
If the file name passed as the first argument exists, nothing happens when ckFile is called. If the file does not exist, the shell script exits at the line where the ckFile called after printing out a diagnostic message that includes our boilerplate (because this function calls err).
The function can be called with one or two arguments, for example:
| Code Block |
|---|
ckFile myFile.fastq
ckFile myFile.fastq "Input fastq"
|
Here is another function, ckRes, that checks the result code passed in as its first argument. It uses the text passed as its second argument either to print a diagnostic message (by calling our friend err) or to print a message showing that the task completed, and when:
| Code Block |
|---|
ckRes() {
if [ "$1" == "0" ]; then
echo "..Done $2 `date`";
else
err "$2 returned non-0 exit code $1";
fi
}
|
A time-honored convention is that all programs whether shell scripts, built-in shell commands, user-written scripts or other programs, exit with a return code of 0 if all went well, or with any other integer return code if not. Calling programs can then check the return code to see if something went wrong. In the bash shell, the just-executed program's return code is placed in the special $? variable, which should be checked right away because doing anything else will reset it. So for example, to check whether a call to bwa aln returned 0 (ok, keep going) or not (bad, exit with message):
| Code Block |
|---|
bwa aln $REF_PFX $IN_FQ > $OUT_PFX.sai
ckRes $? "bwa aln"
|
If the alignment was successful, a message like this will be written to the execution log:
| Code Block |
|---|
..Done bwa aln Sun May 20 15:00:03 CDT 2012
|
If the program's return code was non-zero, a message like this will be written, and the script will terminate.
| Code Block |
|---|
bwa aln returned non-0 exit code 1...exiting
|
And yet one more wrinkle. For a further refinement of file checking, we also check that the file's size is non-0. Why? Because:
- Programs don't always return a non-0 return code. For example, if called with no arguments just to show usage they often return 0 (after all, there was no error, even if nothing was done). Even well-written programs sometimes neglect to return non-0 exit codes in some circumstances.
- Sometimes a program (or the shell) creates an empty output file before doing anything else. So if it doesn't return a non-0 error code, you can think the program ran fine, even if the file exists. This happens often enough to warrant a special check.
Here's a ckFileSz function that accomplishes this goal, building on the ckFile and err functions:
| Code Block |
|---|
ckFileSz() {
ckFile $1 $2;
SZ=`ls -l $1 | awk '{print $5}'`;
if [ "$SZ" == "0" ]; then
err "$2 file '$1' is zero length";
fi
}
|
It first calls ckFile to see if the file exists. Only if it does do the further statements get executed. The file size check is performed by piping the result of ls -l <file> to an awk script that just echos the file size part of the line (field 5). This chained command is executed by putting it in back quotes and its result stored in the SZ variable, which is then checked to see if it was the string "0".
Of course a program could still produce a file with a non-0 size then error with a non-0 exit code. How might you address this possibility?
OK, enough of boring (but neccessary!) error checking. Onward to more interesting things!
Other niceties
Another time-honored program-writing convention is to provide your users with information on how to run the program The first thing our script does, after capturing its first 4 command line arguments in variables, is to check whether the last required argument (PAIRED, the 4th argument) is empty, and if so, prints detailed usage information. So when someone doesn't know or remember what arguments the script takes (perhaps you, 6 months from now), they can just invoke the script with no arguments to find out:
| Code Block |
|---|
user$ ./align_bwa.sh -------------------------------------------------"; bwa aln $ref $inFile > $saiFile; echo "...Done"; echo "" echo "------------------------------------------"; echoAlign "Convertfastq the SAI file into the SAM file"; echo "------------------------------------------"; bwa samse $ref $saiFile $inFile > $samFile; echo "...Done"; echo "" echo "------------------------------------------"; echo "Convert the SAM file into the BAM file"; echo "------------------------------------------"; samtools view -bS $samFile > $bamFile; echo "...Done"; echo "" echo "------------------------------------------"; echo "Calculate the simple statistics"; echo "------------------------------------------"; numMapRd=$(samtools view -F 0X04 $bamFile | wc -l ); echo "The number of mapped read: $numMapRd"; echo "" numUnmapRd=$(samtools view -f 0X04 $bamFile | wc -l ); echo "The number of unmapped read: $numUnmapRd" echo "" echo "Mapping rate is : $[$[numMapRd*100]/$[numMapRd+numUnmapRd]]%" echo "------------------------------------------"; echo "Call variants"; echo "------------------------------------------"; echo "Sorting BAM file..."; samtools sort $bamFile $sortBam echo "" echo "Indexing reference sequences in ref.fa by samtools faidx..."; samtools faidx $ref echo "" echo "Generating bcf file..."; samtools mpileup -uf $ref $sortBam | bcftools view -bvcg - > $bcfFile echo "" echo "Generating vcf file..."; bcftools view $bcfFile > $vcfFile echo "" echo "...Done" data with bwa, producing a sorted indexed BAM file. align_bwa.sh in_file out_pfx assembly paired(0|1) in_file For single-end alignments, path of the input fastq file. For paired-end alignemtts, path to the the R1 fastq file which must contain the string '_R1.' in its name. The corresponding 'R2' must have the same path except for '_R1' out_pfx Desired prefix of output files. assembly One of: hg19 hg18 mm10 mm9 sacCer3 sacCer3 ecoli paired 0 = single end alignment; 1 = paired end. Example: align_bwa.sh my.fastq mrna_b1_ln1 hg18 0 align_bwa.sh my_L001_R1.fastq swi6_b2_ln1 sacCer3 1 |
A good shell script should also be relatively easy to call. That's why, for example, we have this script takes only a short name of the desired reference and uses it to select the correct path, and only requires the name of the R1 fastq for paired-end reads, using that path to determine the name of the R2 fastq file. While we won't go into the details of that defaulting in this discussion, and these specific choices may not be appropriate for your environment, you might want to look at those parts of the script for ideas on how to accomplish similar goals.
Finally, the last few lines of the script should declare success in a way that can be grep'd for. Ours uses this boilerplate text:
| Code Block |
|---|
echo "---------------------------------------------------------";
echo "All bwa alignment tasks completed successfully!";
echo "`date`";
echo "---------------------------------------------------------";
exit 0;
|
We can check that all of our scripts have done their proper work using something like this:
| Code Block |
|---|
find . -name "*.log" | xargs grep 'completed successfully' | wc -l
|
This will print the number of log files that have the magic success words, and we can compare that number against the number of scripts we actually ran.
The real work!
After the first part of align_bwa.sh has performed some initial error checks and established the execution environment, the script gets about doing the real work. For example, when doing a single-end alignment, it makes a call to bwa aln passing the pathname prefix for the indexed reference genome files and the input fastq file name, then redirecting the output (which normally goes to standard output) to a .sai file named using the output prefix specified by the user. We then use our "belts and suspenders" approach to error checking to make sure all went well.
| Code Block |
|---|
bwa aln $REF_PFX $IN_FQ > $OUT_PFX.sai
ckRes $? "bwa aln";
ckFileSz "$OUT_PFX.sai";
|
Note that .sai is a proprietary binary format used by bwa. Most aligners have some equivalent "intermediate" format that can then be translated in to a .sam or .bam file. For bwa, the command to extract alignments is samse (single end alignment) or sampe (paired end alignment). Here's what the single-end call looks like:
| Code Block |
|---|
bwa samse -r "$RG" $REF_PFX $OUT_PFX.sai $IN_FQ | samtools view -b -S - > $OUT_PFX.bam;
ckRes $? "bwa samse";
ckFileSz "$OUT_PFX.bam";
|
The call to bwa samse requires the same pathname prefix for the indexed reference genome files and input fastq file name passed to bwa aln. It also takes the .sai binary alignment file name. In addition, we provide read group information (the -r "$RG" option) which will be stored in the .bam header (see the script comments for more information).
Since we want a binary .bam as output, but bwa samse (and sampe) produce .sam text output, we pipe the .sam file output to samtools view to convert it to .bam output, which is then redirected to an output file named using the user's output prefix. This command chaining or "piping" avoids having to write then read an intermediate .sam file. Note the dash on the samtools view -b -S - command line means samtools should look for its input data on standard input instead of in a file.
When aligning paired-end reads, bwa aligns each set of read ends independently, then uses pairing information when the alignments are extracted (for example, to compute the insert size between reads where both ends aligned). So the call to bwa sampe in our script takes arguments for fastq and .sai files for each end.
At this point the .sam/.bam file produced has a header, and then one line for each read end that was processed. Read pairs are listed one after the other, in the same name order as the input fastq file: this is referred to as read name ordering. While useful for some applications, most downstream tools (such as the IGV visualization program) require a .bam that is sorted by location (location ordered). A location consists a contig name, as defined in the original .fasta file used to generate the reference index (e.g. chr14) and a start position. The names of the contigs and their lengths are kept in the .sam/.bam header, which is why the header is required for sorting.
The actual bam sorting and indexing are straightforward calls to samtools (although you might want to check out the -m maximum memory option for samtools sort; it can speed up sorting of large files considerably):
| Code Block |
|---|
echo "Sorting '$OUT_PFX.bam'...";
samtools sort $OUT_PFX.bam $OUT_PFX.sorted;
ckRes $? "samtools sort";
ckFileSz "$OUT_PFX.sorted.bam";
echo "Indexing '$OUT_PFX.sorted.bam'...";
samtools index $OUT_PFX.sorted.bam;
ckRes $? "samtools index";
ckFileSz "$OUT_PFX.sorted.bam.bai";
|
Finally, we call samtools flagstat to report alignment statistics:
| Code Block |
|---|
samtools flagstat $OUT_PFX.sorted.bam | tee $OUT_PFX.flagstat.txt
ckRes $? "samtools flagstat";
ckFileSz "$OUT_PFX.flagstat.txt";
|
Note we tee the output so that the information is both stored in a file and appears in the execution log. Output from samtools flagstat looks like this for paired end alignments:
| Code Block |
|---|
4029634 + 0 in total (QC-passed reads + QC-failed reads)
2492274 + 0 duplicates
3292433 + 0 mapped (81.71%:-nan%)
4029634 + 0 paired in sequencing
2014817 + 0 read1
2014817 + 0 read2
3158008 + 0 properly paired (78.37%:-nan%)
3181258 + 0 with itself and mate mapped
111175 + 0 singletons (2.76%:-nan%)
5339 + 0 with mate mapped to a different chr
3711 + 0 with mate mapped to a different chr (mapQ>=5)
|
To summarize the statistics from all your (possibly parallel) alignments, you could do something like this:
| Code Block |
|---|
find . -name "*.flagstat.txt" | xarg grep 'mapped ('
|