Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Code Block
languagebash
titleStart an idev session
idev -m 180 -N 1 -A OTH21164 -r CoreNGS      # or -A TRA23004-Thu  

idev -m 120 -N 1 -A OTH21164 -p development  # or -A TRA23004


Code Block
languagebash
module load biocontainers  # takes a while
module load bwa
bwa

...

Code Block
languagebash
titlePrepare BWA reference directory for sacCer3
mkdir -p $SCRATCH/core_ngs/references/bwa/sacCer3
cd $SCRATCH/core_ngs/references/bwa/sacCer3

# Create a symbolic link to the sacCer3 FASTA file so it can
# be accessed directly from this directory
ln -sf ../../fasta/sacCer3.fa
ls -l

...

Code Block
languagebash
titlePrepare to align yeast data
mkdir -p $SCRATCH/core_ngs/alignment/yeast_bwa
cd $SCRATCH/core_ngs/alignment/yeast_bwa

# Create symbolic links to the fastq and sacCer3 reference directories
# so we can access them directly from this alignment directory
ln -sf ../fastq
ln -sf ../../references/bwa/sacCer3
ls -l

As our workflow indicated, we first use bwa aln on the R1 and R2 FASTQs, producing a BWA-specific .sai intermediate binary files.

Exercise: What does does bwa aln needs in the way of arguments?

...

And where does it write its binary output?

bash

Examine the sub-command's usage:

bwa

aln

Expand
titleHint
Code Block
language

Required arguments are a <prefix> of the bwa index files, and the input FASTQ file. There are lots of options, but here is a summary of the most important ones.

...

Other options control the details of how much a mismatch or gap is penalized, limits on the number of acceptable hits per read, and so on. Much more information can be found on the BWA manual page.

For a basic alignment like this, we can just go with the default alignment parameters.

Note that bwa writes its (binary) output to standard output by default, so we need to redirect that to a .sai file.

For simplicity, we will just execute these commands directly, one at a time. Each command should only take few minutes and you will see bwa's progress messages in your terminal.

Code Block
languagebash
titlebwa aln commands for yeast R1 and R2
# If not already loaded: module load biocontainers module load bwa cd $SCRATCH/core_ngs/alignment/yeast

2>&1 | more

If you just type bwa aln | more, the text will probably scroll off your Terminal. This is because bwa always writes its usage to standard error, not to standard output

The pipe ( | ) only connects standard output from bwa aln to standard input of the more command – but no standard output is generated.

The standard error text that is generated just goes to your Terminal, bypassing more.


Expand
titleAnswer

The bwa aln usage says:

     Usage: bwa aln [options] <prefix> <in.fq>

Required arguments are a <prefix> of the bwa index files, and the input FASTQ file.

There are no arguments specified for the program's output results, but later in the Options section it says

                   -f FILE file to write output to instead of stdout

so we know output results are written to standard output by default (unless the -f option is specified).

There are lots of options, but here is a summary of the most important ones.

OptionEffect
-lSpecifies the length of the seed (default = 32)
-kSpecifies the number of mismatches allowable in the seed of each alignment (default = 2)
-nSpecifies the number of mismatches (or fraction of bases in a given alignment that can be mismatches) in the entire alignment (including the seed) (default = 0.04)
-tSpecifies the number of threads

Other options control the details of how much a mismatch or gap is penalized, limits on the number of acceptable hits per read, and so on. Much more information can be found on the BWA manual page. For a basic alignment like this, we can just go with the default alignment parameters.

Since bwa writes its (binary) output to standard output by default, so we need to redirect that to a .sai file.

For simplicity, we will just execute these commands directly, one at a time. Each command should only take few minutes and you will see bwa's progress messages in your terminal.

Code Block
languagebash
titlebwa aln commands for yeast R1 and R2
# If not already loaded:
module load biocontainers
module load bwa

cd $SCRATCH/core_ngs/alignment/yeast_bwa
bwa aln sacCer3/sacCer3.fa fastq/Sample_Yeast_L005_R1.cat.fastq.gz > yeast_pe_R1.sai
bwa aln sacCer3/sacCer3.fa fastq/Sample_Yeast_L005_R2.cat.fastq.gz > yeast_pe_R2.sai

...

Tip
titleMake sure your output files are not empty

Double check that output was written by doing ls -lh and making sure the file sizes listed are not 0..

When redirection ( > ) to a file is used, the target file is created whether or not the command produced any output!

Exercise: How long did it take to align the R2 file?

Expand
titleAnswer

The last few lines of bwa's execution output should look something like this:

Code Block
languagebash
[bwa_aln] 17bp reads: max_diff = 2
[bwa_aln] 38bp reads: max_diff = 3
[bwa_aln] 64bp reads: max_diff = 4
[bwa_aln] 93bp reads: max_diff = 5
[bwa_aln] 124bp reads: max_diff = 6
[bwa_aln] 157bp reads: max_diff = 7
[bwa_aln] 190bp reads: max_diff = 8
[bwa_aln] 225bp reads: max_diff = 9
[bwa_aln_core] calculate SA coordinate... 50.76 sec
[bwa_aln_core] write to the disk... 0.07 sec
[bwa_aln_core] 262144 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 50.35 sec
[bwa_aln_core] write to the disk... 0.07 sec
[bwa_aln_core] 524288 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 13.64 sec
[bwa_aln_core] write to the disk... 0.01 sec
[bwa_aln_core] 592180 sequences have been processed.
[main] Version: 0.7.17-r1188
[main] CMD: /usr/local/bin/bwa aln sacCer3/sacCer3.fa fastq/Sample_Yeast_L005_R1.cat.fastq.gz
[main] Real time: 8578.584186 sec; CPU: 8377.825597 sec

So the R2 alignment took ~85 ~78 seconds (~1.4 3 minutes).

Since you have your own private compute node, you can use all its resources. It has 128 cores, so re-run the R2 alignment asking for 60 execution threads.

...

Expand
titleAnswer

The last few lines of bwa's execution output should look something like this:

Code Block
languagebash
[bwa_aln] 17bp reads: max_diff = 2
[bwa_aln] 38bp reads: max_diff = 3
[bwa_aln] 64bp reads: max_diff = 4
[bwa_aln] 93bp reads: max_diff = 5
[bwa_aln] 124bp reads: max_diff = 6
[bwa_aln] 157bp reads: max_diff = 7
[bwa_aln] 190bp reads: max_diff = 8
[bwa_aln] 225bp reads: max_diff = 9
[bwa_aln_core] calculate SA coordinate... 266.70 sec
[bwa_aln_core] write to the disk... 0.04 sec
[bwa_aln_core] 262144 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 268.94 sec
[bwa_aln_core] write to the disk... 0.03 sec
[bwa_aln_core] 524288 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 72.26 sec
[bwa_aln_core] write to the disk... 0.01 sec
[bwa_aln_core] 592180 sequences have been processed.
[main] Version: 0.7.17-r1188
[main] CMD: /usr/local/bin/bwa aln -t 60 sacCer3/sacCer3.fa fastq/Sample_Yeast_L005_R2.cat.fastq.gz
[main] Real time: 7.931 sec; CPU: 179.153 sec

So the R2 alignment took only ~8 seconds (real time), or 10+ times as fast as with only one processing thread.

Note, though, that the CPU time with 60 threads was greater (~180 sec) than with only 1 thread (~85 sec). That's because of the thread management overhead when using multiple threads.
 524288 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 72.26 sec
[bwa_aln_core] write to the disk... 0.01 sec
[bwa_aln_core] 592180 sequences have been processed.
[main] Version: 0.7.17-r1188
[main] CMD: /usr/local/bin/bwa aln -t 60 sacCer3/sacCer3.fa fastq/Sample_Yeast_L005_R2.cat.fastq.gz
[main] Real time: 5.612 sec; CPU: 166.315 sec

So the R2 alignment took only ~6 seconds (real time), or 10+ times as fast as with only one processing thread.

Note, though, that the CPU time with 60 threads was greater (~166 sec) than with only 1 thread (~78 sec). That's because of the thread management overhead when using multiple threads.

Exercise: How would you capture the diagnostic output from bwa aln?

Expand
titleHint

Examine the sub-command's usage:

bwa aln 2>&1 | more

If you just type bwa aln | more, the text will probably scroll off your Terminal. This is because bwa always writes its usage to standard error, not to standard output

The pipe ( | ) only connects standard output from bwa aln to standard input of the more command – but no standard output is generated.

The standard error text that is generated just goes to your Terminal, bypassing more.


Expand
titleAnswer

The bwa aln usage says:

     Usage: bwa aln [options] <prefix> <in.fq>

There are no arguments specified for the program's output results, but later in the Options section it says

                   -f FILE file to write output to instead of stdout

so we know output results are written to standard output by default (unless the -f option is specified).

N


Next we use the bwa sampe command to pair the reads and output SAM format data. Just type that command in with no arguments to see its usage.

...

You may have noticed that some alignment records contain contig names (e.g. chrV) in field 3 while others contain an asterisk ( * ). The * means the record didn't map. We're going to use this heuristic along with cut to see about how many records represent aligned sequences. (Note this is not the strictly correct method of finding unmapped reads because not all unmapped reads have an asterisk in field 3. Later you'll see how to properly distinguish between mapped and unmapped reads using samtools.)

...