What and why is Shell Script?

A shell is a program that takes your commands from the keyboard and gives them to the operating system. Most Linux systems utilize Bourne Again SHell (bash), but there are several additional shell programs on a typical Linux system such as ksh, tcsh, and zsh. The simplest way to check which shell your machine has is to type any random letters and hit enter. For example, Lonestar in TACC uses bash.

login1$ ffkakldk
-bash: ffkakldk: command not found

Let’s assume that you go to a restaurant. The rule of the restaurant is to order one by one: order a drink and get it, then order an appetizer and get it, then order dishes and so forth. What a stupid ordering system is! To make matters worse, even if you want to order the exact same meal you ate before, you must go through the tedious process again. Shell scripting addresses this problem. A shell script is series of commands written in a plain text file. Instead of entering commands one by one, you can store the sequence of commands to text file and tell the shell to execute this text file. When you want to repeatedly execute the series of command lines for multiple datasets, the shell script can automate your task and save lots of time.

Exercise 1 - Hello world

Below is a simple shell script that takes one argument (the text to print after "Hello") and echos it.

#!/bin/bash
TEXT=$1
: ${TEXT:="Shell World"}
echo "-------------------"
echo "Hello, $TEXT!"
echo "-------------------"

Open your favorite text editor, enter these lines, and save as hello.sh (note the file extension for shell scripts is .sh). Then open a Terminal window and change into the directory where the script was saved. For example:

cd /Desktop

The script can be run, with or without command line arguments, by explicitly invoking bash as follows:

user$ bash hello.sh
-------------------
Hello, Shell World!
-------------------
user$ bash hello.sh Goddess
-------------------
Hello, Goddess!
-------------------

There is a shortcut, though. Since we have the line at the top of this file that names the program that should run it, we should be able to execute the script just by typing in its pathname like this (where ./ means current directory):

user$ ./hello.sh
-bash: ./hello.sh: Permission denied

But there''s a complication. Welcome to the world of Unix permissions! The script file must be marked executable for this to work. To see what the current permissions are:

user$ ls -la hello.sh
-rw-rw-r-- 1 user group 122 May 18 01:16 hello.sh

This says that anyone can read the file, the owner (you) or anyone in your group can modify it (write permission), but no one can execute it. We use the chmod program to allow anyone to execute the script:

username$ chmod +r hello.sh
username$ ls -la hello.sh
-rwxrwxr-x 1 user group 122 May 18 01:16 hello.sh

Now hello.sh can be invoked directly:

username$ ./hello.sh "Expert scripter"
-------------------
Hello, Expert scripter!
-------------------

Note that when we supplied the text "Expert scripter", we put it in quotes, which group the two words into one argument to the script. Without the quotes, the word "Expert" would be seen by the script as argument 1 and "scripter" would be seen as argument 2 (which our script ignores).

Exercise 2

You already generated reference file. Now, you want to do the following tasks in order.

  1. Align a fastq file to a reference genome using bwa
  2. Extract alignments from bwa's proprietary binary .sai file to a .sam file
  3. Convert the .sam file into a .bam file using samtools
  4. Count the number of aligned and unaligned reads, and calculate the mapping rate. 

Below is an example script for yeast. This works only in my machine because my yeast reference genome is under /Users/Daechan/Desktop/ref/ and the sample input file name is "subset.test.fastq" Change refPath and inFile variables, save as basic_script.sh, and execute it by entering "./basic_script.sh"

#!/bin/bash
echo "------------------------------------------";
echo "Define variables";
echo "------------------------------------------";
refPath="/Users/Daechan/Desktop/ref/sacCer1.fa";
inFile="subset.test.fastq"
PFX="subset.test.output"
saiFile="$PFX.sai";
samFile="$PFX.sam";
bamFile="$PFX.bam";

echo "Reference File: $refPath"
echo "Input File name: $inFile"
echo "Prefix for SAI BAM SAM files: $PFX"
echo "SAI file name: $saiFile"
echo "SAM file name: $samFile"
echo "BAM file name: $bamFile"

echo "------------------------------------------";
echo "Align subset.test.fastq on yeast genome";
echo "------------------------------------------";
bwa aln $refPath $inFile > $saiFile;
echo "...Done";
echo ""

echo "------------------------------------------";
echo "Convert the SAI file into the SAM file";
echo "------------------------------------------";
bwa samse $refPath $saiFile $inFile > $samFile;
echo "...Done";
echo ""

echo "------------------------------------------";
echo "Convert the SAM file into the BAM file";
echo "------------------------------------------";
samtools view -bS $samFile > $bamFile;
echo "...Done";
echo ""

echo "------------------------------------------";
echo "Calculate the simple statistics";
echo "------------------------------------------";
numMapRd=$(samtools view -F 0x04 $bamFile | wc -l );
echo "The number of mapped read: $numMapRd";
echo ""

numUnmapRd=$(samtools view -f 0x04 $bamFile | wc -l );
echo "The number of unmapped read: $numUnmapRd"
echo ""
echo "Mapping rate is : $[$[numMapRd*100]/$[numMapRd+numUnmapRd]]%"

Exercise3

You can repeatedly use the above script for multiple datasets by allowing the script to take arguments. Below script will give you the same result as above. Plus, variant calling file (vcf) will be generated. Before execute the script, save the below lines as args_script.sh and enter "./args_script.sh" without arguments. It will give you information about positional arguments. If you want to map the input file "subset.test.fastq" onto yeast reference genome and put a prefix "test" on all output files, enter "./args_script.sh sacCer1 subset.test.fastq test

#!/bin/bash

assembly=$1;
inFile=$2;
PFX=$3;
saiFile="$PFX.sai";
samFile="$PFX.sam";
bamFile="$PFX.bam";
sortBam="$PFX.sorted.bam";
bcfFile="$PFX.bcf";
vcfFile="$PFX.vcf";

usage() {
    echo "------------------------------------------";
    echo "args_script.sh <assembly> <inFile> <PFX>";
    echo "Required Arguments"
    echo "  assembly   hg18 mm9 sacCer1";
    echo "  inFile     path to input sequence file";
    echo "  outPFX     prefix for output files";
    echo "------------------------------------------";
    exit 1;
}

if [ "$assembly" == "" ]; then usage ; fi
if [ "$inFile" == "" ]; then usage ; fi
if [ "$PFX" == "" ]; then usage ; fi

if [ "$assembly" == "sacCer1" ]; then
    ref="/Users/Daechan/Desktop/ref/sacCer1.fa";
elif [ "$assembly" == "mm9" ]; then
    ref="/Users/Daechan/Desktop/ref/mm9.fa";
elif [ "$assembly" == "hg18" ]; then
    ref="/Users/Daechan/Desktop/ref/hg18.fa";
fi

echo "------------------------------------------";
echo "Define variables";
echo "------------------------------------------";

echo "Reference File: $assembly"
echo "Input File name: $inFile"
echo "Prefix for SAI BAM SAM files: $PFX"
echo "SAI file name: $saiFile"
echo "SAM file name: $samFile"
echo "BAM file name: $bamFile"
echo "Sorted BAM file name: $sortBam"
echo "BCF file name: $bcfFile"
echo "VCF file name: $vcfFile"
echo ""


echo "------------------------------------------";
echo "Align subset.test.fastq on yeast genome";
echo "------------------------------------------";
bwa aln $ref $inFile > $saiFile;
echo "...Done";
echo ""

echo "------------------------------------------";
echo "Convert the SAI file into the SAM file";
echo "------------------------------------------";
bwa samse $ref $saiFile $inFile > $samFile;
echo "...Done";
echo ""

echo "------------------------------------------";
echo "Convert the SAM file into the BAM file";
echo "------------------------------------------";
samtools view -bS $samFile > $bamFile;
echo "...Done";
echo ""

echo "------------------------------------------";
echo "Calculate the simple statistics";
echo "------------------------------------------";
numMapRd=$(samtools view -F 0X04 $bamFile | wc -l );
echo "The number of mapped read: $numMapRd";
echo ""

numUnmapRd=$(samtools view -f 0X04 $bamFile | wc -l );
echo "The number of unmapped read: $numUnmapRd"
echo ""
echo "Mapping rate is : $[$[numMapRd*100]/$[numMapRd+numUnmapRd]]%"

echo "------------------------------------------";
echo "Call variants";
echo "------------------------------------------";
echo "Sorting BAM file...";
samtools sort $bamFile $sortBam
echo ""
echo "Indexing reference sequences in ref.fa by samtools faidx...";
samtools faidx $ref
echo ""
echo "Generating bcf file...";
samtools mpileup -uf $ref $sortBam | bcftools view -bvcg - > $bcfFile
echo ""
echo "Generating vcf file...";
bcftools view $bcfFile > $vcfFile
echo ""
echo "...Done"