
Tip

title	Reservations

Use our summer school reservation (CoreNGS-Thu) when submitting batch jobs to get higher priority on the ls6 normal queue today:

sbatch --reservation=CoreNGS-Thu <batch_file>.slurm
idev -m 180 -N 1 -A OTH21164 -r CoreNGS-Thu

Code Block

language	bash
title	Request an interactive (idev) node

# Request a 120 minute interactive node on the normal queue using our reservation
idev -m 120 -N 1 -A OTH21164 -r CoreNGS
idev -m 120 -N 1 -A TRA23004 -r CoreNGS

# Request a 120 minute idev node on the development queue 
idev -m 120 -N 1 -A OTH21164 -p development
idev -m 120 -N 1 -A TRA23004 -p development

Code Block

language	bash
title	Submit a batch job

# Using our reservation
sbatch --reservation=CoreNGS <batch_file>.slurm

Note that the reservation name (CoreNGS) is different from the TACC allocation/project for this class, which is OTH21164.

Table of Contents

Overview

...

Code Block

language	bash
title	Start an idev session

idev -m 180 -N 1 -A OTH21164 -r CoreNGS-Thu

Then stage the sample datasets and references we will use.

...

Reference	Species	Base Length	Contig Number	Source	Download
hg19	Human	3.1 Gbp	25 (really 93)	UCSC	UCSC GoldenPath
sacCer3	Yeast	12.2 Mbp	17	UCSC	UCSC GoldenPath
mirbase v20	Human subset	160 Kbp	1908	miRBase	miRBase Downloads
vibCho (O395)	Vibrio cholerae	~4 Mbp	2	GenBank	GenBank Downloads

...

We've discovered a pattern (also known as a regular expression) to use in searching, and the command line tool that does regular expression matching is grep (short for "global regular expression print"). (Read more about grep and regular expressions)

Regular expressions are so powerful that nearly every modern computer language includes a "regex" module of some sort. There are many online tutorials for regular expressions, and several slightly different "flavors" of them. But the most common is the Perl style (http://perldoc.perl.org/perlretut.html), which was one of the first and is still the most powerful (there's a reason Perl was used extensively when assembling the human genome). We're only going to use simple regular expressions here, but learning more about them will pay handsome dividends for you in the future.

...

The -P option tells grep to use Perl-style regular expression patterns.
- This makes including special characters like Tab ( \t ), carriage return ( \r ) or linefeed ( \n ) much easier than with the default POSIX patterns.
- While it is not required here, it generally doesn't hurt to include this option.
'^>' is the regular expression describing the pattern we're looking for (described below)
sacCer3.fa is the file to search.
- lines with text that match our pattern will be written to standard output
- non matching lines will be omitted
We pipe to more just in case there are a lot of contig names.
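As a minimal sketch of the search described above, the following builds a tiny made-up FASTA file and pulls out its contig name lines. The helper name contig_names and the toy sequences are assumptions for illustration only. The -P option is left out because, as noted, this pattern does not need it.

```shell
# Print the contig name lines (lines starting with >) of a FASTA file.
# The single quotes keep the shell from interpreting ^ or >.
contig_names() {
    grep '^>' "$1"
}

# Build a tiny two-contig FASTA file (made-up data) to search.
toy_fa=$(mktemp)
printf '>chrI\nACGT\nGGCC\n>chrII\nTTAA\n' > "$toy_fa"

contig_names "$toy_fa"            # matching lines go to standard output
contig_names "$toy_fa" | wc -l    # count the contigs instead of listing them
```

On a real reference you would pass the FASTA path (for example sacCer3.fa) in place of the temporary file, and pipe to more if the list is long.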

...

As we have seen, during command line parsing and evaluation the shell will often look for special metacharacters on the command line that mean something to it (for example, the $ in front of an environment variable name, like in $SCRATCH). Well, regular expressions treat the $ specially too – but in a completely different way! Those single quotes tell the shell "don't look inside here for special characters – treat this as a literal string and pass it to the program". The shell will obey, will strip the single quotes off the string, and will pass the actual pattern, ^>, to the grep program. (Read more about Literal characters and metacharacters and Quoting in the shell)

So what does ^> mean to grep? We know that contig name lines always start with a > character, so > is a literal for grep to use in its pattern match.
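The difference between what the shell sees and what grep sees can be sketched with a few one-liners. The variable DEMO and the sample lines are made up for illustration.

```shell
# Hypothetical variable, for illustration only
DEMO="/scratch/demo"

echo "$DEMO"    # double quotes: the shell expands the variable
echo '$DEMO'    # single quotes: the text is passed through literally

# To grep, ^ anchors a match at the start of a line and $ at the end.
# Single quotes deliver these patterns to grep untouched.
printf '>chrI\nAC>GT\n' | grep '^>'    # only the line that starts with >
printf 'ACGT\nACGA\n'   | grep 'T$'    # only the line that ends with T
```

Quoting matters here for safety as well as correctness: left unquoted, the > in ^> would be read by the shell as an output redirection rather than passed to grep as part of the pattern.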

...

alignment type	aligner options	pros	cons
global with bwa	single end reads: bwa aln <R1>, bwa samse; paired end reads: bwa aln <R1>, bwa aln <R2>, bwa sampe	simple to use (take default options); good for basic global alignment	multiple steps needed
global with bowtie2	bowtie2	extremely configurable; can be used for RNAseq alignment (after adapter trimming) because of its many options	complex (many options)
local with bwa	bwa mem	simple to use (take default options); very fast; no adapter trimming needed; good for simple RNAseq analysis; the secondary alignments it reports can provide splice junction information	always produces alignments with secondary reads; must be filtered if not desired
local with bowtie2	bowtie2 --local	extremely configurable; no adapter trimming needed; good for small RNA alignment because of its many options	complex (many options)

...

We're going to skip the trimming step for now and see how it goes. We'll perform steps 2 - 5 now, leaving samtools for a later exercise since steps 6 - 10 are common to nearly all post-alignment workflows.

...

Code Block

language	bash
title	Start an idev session

idev -m 180 -N 1 -A OTH21164 -r CoreNGS-Thu      # or -A TRA23004

idev -m 120 -N 1 -A OTH21164 -p development  # or -A TRA23004

Code Block

language	bash

module load biocontainers  # takes a while
module load bwa
bwa

...

Expand

title	Answer

The last few lines of bwa's execution output should look something like this:

Code Block

language	bash

[bwa_aln] 17bp reads: max_diff = 2
[bwa_aln] 38bp reads: max_diff = 3
[bwa_aln] 64bp reads: max_diff = 4
[bwa_aln] 93bp reads: max_diff = 5
[bwa_aln] 124bp reads: max_diff = 6
[bwa_aln] 157bp reads: max_diff = 7
[bwa_aln] 190bp reads: max_diff = 8
[bwa_aln] 225bp reads: max_diff = 9
[bwa_aln_core] calculate SA coordinate... 50.76 sec
[bwa_aln_core] write to the disk... 0.07 sec
[bwa_aln_core] 262144 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 50.35 sec
[bwa_aln_core] write to the disk... 0.07 sec
[bwa_aln_core] 524288 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 13.64 sec
[bwa_aln_core] write to the disk... 0.01 sec
[bwa_aln_core] 592180 sequences have been processed.
[main] Version: 0.7.17-r1188
[main] CMD: /usr/local/bin/bwa aln sacCer3/sacCer3.fa fastq/Sample_Yeast_L005_R1.cat.fastq.gz
[main] Real time: 85.185584 sec; CPU: 83.598825 sec

So the R2 alignment took ~85 seconds (~1.4 minutes).

Since you have your own private compute node, you can use all its resources. It has 128 cores, so re-run the R2 alignment asking for 60 execution threads.

...

Expand

title	Answer

The last few lines of bwa's execution output should look something like this:

Code Block

language	bash

[bwa_aln] 17bp reads: max_diff = 2
[bwa_aln] 38bp reads: max_diff = 3
[bwa_aln] 64bp reads: max_diff = 4
[bwa_aln] 93bp reads: max_diff = 5
[bwa_aln] 124bp reads: max_diff = 6
[bwa_aln] 157bp reads: max_diff = 7
[bwa_aln] 190bp reads: max_diff = 8
[bwa_aln] 225bp reads: max_diff = 9
[bwa_aln_core] calculate SA coordinate... 266.70 sec
[bwa_aln_core] write to the disk... 0.04 sec
[bwa_aln_core] 262144 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 268.94 sec
[bwa_aln_core] write to the disk... 0.03 sec
[bwa_aln_core] 524288 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 72.26 sec
[bwa_aln_core] write to the disk... 0.01 sec
[bwa_aln_core] 592180 sequences have been processed.
[main] Version: 0.7.17-r1188
[main] CMD: /usr/local/bin/bwa aln -t 60 sacCer3/sacCer3.fa fastq/Sample_Yeast_L005_R2.cat.fastq.gz
[main] Real time: 7.013931 sec; CPU: 179.813153 sec

So the R2 alignment took only ~8 seconds (real time), or 10+ times as fast as with only one processing thread.

Note, though, that the CPU time with 60 threads was greater (~180 sec) than with only 1 thread (~85 sec). That's because of the thread management overhead when using multiple threads.
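The comparison can be made concrete with a little arithmetic. This is only a sketch using the rounded timings quoted here (about 85 seconds single-threaded, about 8 seconds of real time and about 180 seconds of CPU time with 60 threads). The helper name ratio is made up for illustration.

```shell
# ratio A B: print A divided by B to one decimal place
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

ratio 85 8      # real-time speedup from using 60 threads
ratio 180 85    # how many times more total CPU time the 60-thread run used
```

The real-time speedup is far below 60, which is consistent with the overhead noted above: adding threads shortens the wait but costs more total CPU work.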

...

Code Block

language	bash
title	Cut syntax for a single field

tail yeast_pe.sam | cut -f 3

By default cut assumes the field delimiter is Tab, which is the delimiter used in the majority of NGS file formats. You can specify a different delimiter with the -d option.
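A short sketch of both cases, using made-up one-line inputs rather than a real SAM file:

```shell
# Default delimiter is Tab: take the 3rd field of a Tab-separated line
printf 'read1\t0\tchrI\t100\n' | cut -f 3

# Use -d to name a different delimiter, here a comma
printf 'read1,0,chrI,100\n' | cut -d ',' -f 3
```

Both commands print chrI, since only the delimiter differs between the two inputs.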

...

Code Block

language	bash
title	Count aligned SAM records

grep -v -P '^@' yeast_pe.sam | cut -f 3 | grep -v '*' | wc -l

Read more at Some Linux commands: Advanced commands
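To see what each stage of that pipeline contributes, here is a sketch run against a tiny made-up SAM-like file. Only the first four columns are included, and the helper name count_aligned and the file contents are assumptions for illustration.

```shell
# Count records whose 3rd field (the contig name) is not "*"
count_aligned() {
    grep -v '^@' "$1" |   # drop header lines, which start with @
        cut -f 3 |        # keep only the contig name field
        grep -v '*' |     # drop unaligned records, whose contig is *
        wc -l             # count what is left
}

# Toy file: 2 header lines, 2 aligned records, 1 unaligned record
toy_sam=$(mktemp)
printf '@HD\tVN:1.0\n@SQ\tSN:chrI\tLN:1000\n' >  "$toy_sam"
printf 'r1\t0\tchrI\t10\n'                    >> "$toy_sam"
printf 'r2\t4\t*\t0\n'                        >> "$toy_sam"
printf 'r3\t16\tchrI\t55\n'                   >> "$toy_sam"

count_aligned "$toy_sam"
```

With this toy input the count is 2, because the two header lines and the one record with * in the contig field are filtered out.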

Exercise: About how many records represent aligned sequences? What alignment rate does this represent?

Expand

title	Answer

The expression above returns 612,968. There were 1,184,360 records total, so the percentage is:

Code Block

language	bash
title	Calculate alignment rate

awk 'BEGIN{print 612968/1184360}'

or about 52%. Not great.

Note we perform this calculation in awk's BEGIN block, which is always executed, instead of the body block, which is only executed for lines of input. And here we call awk without piping it any input. See Linux fundamentals: cut,sort,uniq,grep,awk
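The contrast between the two kinds of block can be sketched as follows. The helper name percent is made up for illustration, and printf is used only to format the result as a percentage.

```shell
# percent PART TOTAL: print PART as a percentage of TOTAL.
# Everything happens in BEGIN, so awk needs no input at all.
percent() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf "%.1f%%\n", 100 * part / total }'
}

percent 612968 1184360    # prints 51.8%

# The body block runs once per input line, so it must be fed a line:
echo "612968 1184360" | awk '{ printf "%.1f%%\n", 100 * $1 / $2 }'
```

Both forms print the same value. The BEGIN form is convenient when the numbers are already known, while the body form fits naturally at the end of a pipeline that produces them.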

Exercise: What might we try in order to improve the alignment rate?

...

Expand

title	Make sure you're in an idev session

Code Block

language	bash
title	Start an idev session

idev -m 120 -N 1 -A OTH21164 -r CoreNGS-Thu     # or -A TRA23004

idev -m 90 -N 1 -A OTH21164 -p development  # or -A TRA23004

Code Block

language	bash

# If not already loaded
module load biocontainers  # takes a while

module load samtools
samtools

...

Exercise: What samtools view option will include the header records in its output? Which option would show only the header records?

Expand

title	Hint

Note that samtools (like bwa) writes its help to standard error, but less and more only accept input on standard input. So the syntax redirecting standard error to standard output ( 2>&1 ) must be used before the pipe to less or more.

samtools view 2>&1 | less

then search for "header" ( /header )
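Why the redirection is needed can be sketched with a stand-in function that, like samtools or bwa, writes its help text to standard error. The function show_help and its message are made up for illustration and are not real samtools output.

```shell
# Stand-in for a tool that prints its usage message on standard error
show_help() {
    echo "usage: demo [--header]" >&2
}

# Discarding stderr: nothing travels down the pipe, so wc counts 0 lines
show_help 2>/dev/null | wc -l

# 2>&1 merges stderr into stdout before the pipe, so wc counts 1 line
show_help 2>&1 | wc -l
```

A pipe only carries standard output, which is why the help text must be merged into it before less or more can page through it.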

Expand

title	Answer

samtools view -h shows header records along with alignment records.

samtools view -H shows header records only.

...

Here we use the tee command, which copies its standard input to standard output and also writes it to the specified file.
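A minimal sketch of that behavior, using a temporary file and made-up text:

```shell
log=$(mktemp)

# The text appears on standard output and is also saved in the file
echo "alignment finished" | tee "$log"
cat "$log"

# With -a, tee appends to the file instead of overwriting it
echo "second line" | tee -a "$log"
wc -l < "$log"
```

This is handy for long-running commands: you can watch the output as it is produced and still have a copy to search later.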

...
