Page Comparison

...

Tool

TACC

Version

Download

Manual

Example

Bowtie2

Code Block
module load bowtie/2.2.6

You may recall we added this to our .bashrc file yesteray yesterday so it is already loaded

2.1.0

link

#Bowtie2

...

Code Block
$BI/gva_course/mapping/data

...

File Name

...

Description

...

Sample

...

SRR030257_1.fastq

...

Paired-end Illumina, First of pair, FASTQ format

...

You may recognize this as the same files we used for the fastqc and fastx_toolkit tutorial. If you chose to improve the quality of R2 reads using fastx_toolkit as you did for R1 in the tutorial, you could use the improved reads in this tutorial to see what a difference the improved reads can make for read mapping.

File Name	Description	Sample
`SRR030257_1.fastq`	Paired-end Illumina, First of pair, FASTQ format	Re-sequenced E. coli genome
`SRR030257_2.fastq`	Paired-end Illumina, Second of pair, FASTQ format	Re-sequenced E. coli genome
`NC_012967.1.gbk`	Reference Genome in Genbank format	E. coli B strain REL606

The easiest way to run the tutorial is to copy this entire directory into a new folder called "bowtie2mappingBDIB_bowtie2_mapping" on your $SCRATCH space and then run all of the commands from inside that directory. See if you can figure out how to do that. When you're in the right place, you should get output like this from the ls command.

If you need a little help but don't want the answer yet, click the triangle...

Code Block
tacc:~$ ls NC_012967.1.gbk SRR030257_1.fastq SRR030257_2.fastq
Expand
title	SRR030257_2.fastq.gz

Expand

title	Hint

Remember that to copy an entire folder requires the use of the recursive (-r) option.

Code Block

language	bash
title	Still stuck? click here for the correct code
collapse	true

cds
cp -r $BI/gva_course/mapping/data bowtie2MappingTutorialBDIB_bowtie2_mapping
cd bowtie2MappingTutorialBDIB_bowtie2_mapping
ls

Useful commands

Often you will have general questions about your sequencing files that you want to answer before or after starting your actual analysis. Here we show you some very handy commands after a warning:

Warning

title	Beware the cat command when working with NGS data

NGS data can be quite large, a single lane of an Illumina Hi-Seq run generates 2 files each with 100s of millions of lines. Printing all of that can take an enormous amount of time and may crash your terminal long before it finishes. If you find yourself in a seemingly endless scroll of sequence (or anything else for that matter) remember ctrl+c will kill whatever command you just executed

Reminder about Linux 1 liners

Below are several commands we've already been using, and some new ones put together to improve your skills.

Code Block

language	bash
title	How to look at the top/bottom of files to determine their type
collapse	true

head * # will show the top 10 lines to give you an idea of the file type/structure
tail * # will show the last 10 lines to determine if the file is a repetitive structure

...

Code Block

language	bash
title	How to determine the total number of sequences in a fastq file
collapse	true

wc -l *  # and then divide by 4 using the your knowledge of fastq files
 
# OROR 
 
grep ^@SRR030257 SRR030257_1.fastq | wc -l
 
# OR
grep --count ^@SRR030257 SRR030257_1.fastq

...

The bp_seqconvert.pl script that is installed as part of Bioperl is one helpful utility for converting between many common sequence formats. On TACC, the Bioperl modules are installed, but the helper script isn't. So, we've put it in a place that you can run it from for your convenience. However, remember that any time that you use the script you must have the bioperl module loaded. We also took care of this for you when we edited your ~/.profile_userbashrc file in the Linux introduction.

Run the script without any arguments to get the help message:

Code Block
module load gcc module load bioperl bp_seqconvert.pl

Exercises

...

Bowtie2 is a complete rewrite of bowtie. It is currently the latest and greatest in the eyes of one very picky instructor (and his postdoc/gradstudent) in terms of configurability, sensitivity, and speed.

Create a fresh output directory named bowtie2. We are going to create a specific output directory for the bowtie2 mapper within the directory that has the input files so that you can compare the results of other mappers if you choose to do the other tutorials.

...

After years of teaching bwa mapping along with bowtie2, we've decided that you will be the first class to use only bowtie2 since we never recommend anyone use bwa. For some more details about the differences between them see the bonus presentation, and if you find a compelling reason to use bwa rather than bowtie2, we'd love to hear from you.

Create a fresh output directory named bowtie2. We are going to create a specific output directory for the bowtie2 mapper within the directory that has the input files so that you can compare the results of other mappers if you choose to do the other tutorials.

Code Block

language	bash
title	Commands for making a directory and changing into it
collapse	true

mkdir bowtie2

Next, load make sure the bowtie2 module is loaded (we use module spider to get the current name, which may not be bowtie/2.12.06 if you re-run this tutorial):

Expand

title	Click here for a hint without the answer

Remember in our earlier tutorial we discussed the use of lonestar's module commands "spider" and "load" to install new functionality

Code Block

language	bash
title	click here for the answer without having to go back through the previous tutorial
collapse	true

module spiderlist bowtie
 
module# load bowtie/2.1.0

Expand

title	OPTIONAL -- How to check what version of bowtie2 was loaded?

Here are a few of the possibilities that will work.

Code Block

language	bash
title	In this case all of these methods will work, that may not be true of all programs

bowtie2 --version
module list
which bowtie2

Note that which can be very useful for making sure you are running the executable that you think you are running, especially if you install your own programs. In particular make sure that the path matches up to what you expect. The most common situations arise from wanting to

This should return the following 2 lines:
# Currently Loaded Modules Matching: bowtie
#  1) bowtie/2.2.6
 
# If you do not see the above output, run these commands, and repeat this block
module spider bowtie
module load bowtie/2.2.6

Expand

title	OPTIONAL -- How to check what version of bowtie2 was loaded?

Here are a few of the possibilities that will work.

Code Block

language	bash
title	In this case all of these methods will work, that may not be true of all programs

bowtie2 --version
module list
which bowtie2

Note that which can be very useful for making sure you are running the executable that you think you are running, especially if you install your own programs. In particular make sure that the path matches up to what you expect. The most common situations arise from wanting to run a simplistically named script in your $HOME directory conflicting with something of the same name in the $BI directories or TACC modules.

Now convert the reference file from GenBank to FASTA using what you learned above. Name the new output file NC_012967.1.fasta and put it in the same directory as NC_012967.1.gbk.

...

Try reading the help to figure out how to run the command yourself. Remember these are paired-end reads.

Warning

title	IMPORTANT

This command

...

can take a while (~5 minutes) and is extremely taxing. This is longer than we want to run a job on the head node (especially when all of us are doing it at once). In fact, in previous years, TACC has noticed the spike in usage

...

So, you will want to submit the full job to the cluster like you learned in the introduction.

But first, try to figure out the command and start it in interactive mode. Remember these are paired-end reads. Use control-c to stop the job once you are sure it is running without an immediate error! Then, submit your command that is working to the TACC queue.

Warning

title	Submit to the TACC queue or run in an idev shell

Create a commands file and use launcher_creator.py followed by qsub.

Expand

title	click here for a hint before getting the answer

launcher_creator.py -h will give you insight to how to use that command.

Code Block

language	bash
title	commands for the commands file if you can't work them out yourself
collapse	true

bowtie2 -t -x bowtie2/NC_012967.1 -1 SRR030257_1.fastq -2 SRR030257_2.fastq -S bowtie2/SRR030257.sam  # the -t command is not required for the mapping, but it can be particularly informative when you begin comparing different mappers

Code Block

language	bash
title	Click here for the specific launcher_creator.py commands
collapse	true

launcher_creator.py -n "bowtie2" -t 00:10:00

Your final output file is in SAM format. It's just a text file, so you can peek at it and see what it's like inside. Two warnings though:

SAM files can be enormously humongous text files (maybe >1 gigabytes). Attempting to open the entire file at once can cause your computer to lock up or your text editor to crash. You are generally safer only looking at a portion at a time using linux commands like head or grep or using a viewer like IGV, which we will cover later.
SAM files have some rather complicated information encoded as text, like a binary encoded FLAGS field and CIGAR strings. We'll take a look at some of these later, if we have time.

Still, you should recognize some of the information on a line in a SAM file from the input FASTQ, and some of the other information is relatively straightforward to understand, like the position where the read mapped. Give this a try:

Code Block
head bowtie2/SRR030257.sam

Expand

title	What do you think the 4th and 8th columns mean(click for answer)?

If you thought the answer was the mapping coordinates of the read pairs you were right!

Multithreaded execution

We have actually massively under-utilized Lonestar in this example. We submitted a job that reserved a single node on the cluster, but that node has 12 processors. Bowtie was only using one of those processors (a single "thread")! For programs that support multithreaded execution (and most mappers do because they are obsessed with speed) we could have sped things up by using all 12 processors for the bowtie process.

Expand

title	See if you can figure out how to re-run this using all 12 processors. Click here for a hint

You need to use the -p, for "processors" option. Since we had 12 processors available to our job.

Code Block

language	bash
title	click here to check your answer
collapse	true

bowtie2 -t -p 12 -x bowtie2/NC_012967.1 -1 SRR030257_1.fastq -2 SRR030257_2.fastq -S bowtie2/SRR030257.sam

Try it out and compare the speed of execution by looking at the log files.when multiple students forgot to make sure they were on idev nodes and complained pretty forcefully to us about it. Let's not have this be one of those years. Use the showq -u command to make sure you are on an idev node.

Code Block

language	bash
title	Solution
collapse	true

bowtie2 -t -x bowtie2/NC_012967.1 -1 SRR030257_1.fastq -2 SRR030257_2.fastq -S bowtie2/SRR030257.sam  # the -t command is not required for the mapping, but it can be particularly informative when you begin comparing different mappers

Your final output file is in SAM format. It's just a text file, so you can peek at it and see what it's like inside. Two warnings though:

SAM files can be enormously humongous text files (maybe >1 gigabytes). Attempting to open the entire file at once can cause your computer to lock up or your text editor to crash. You are generally safer only looking at a portion at a time using linux commands like head or grep or more or using a viewer like IGV, which we will cover later.
SAM files have some rather complicated information encoded as text, like a binary encoded FLAGS field and CIGAR strings. We'll take a look at some of these later, if we have time, or they are covered in the bonus tutorial.

Still, you should recognize some of the information on a line in a SAM file from the input FASTQ, and some of the other information is relatively straightforward to understand, like the position where the read mapped. Give this a try:

Code Block
head bowtie2/SRR030257.sam

Expand

title	What do you think the 4th and 8th columns mean(click for answer)?

If you thought the answer was the mapping coordinates of the read pairs you were right!

Multithreaded execution

We have actually massively under-utilized Lonestar in this example. We ran the command using only a single processor (a single "thread") rather than the 48 we have available. For programs that support multithreaded execution (and most mappers do because they are obsessed with speed) we could have sped things up by using all 48 processors for the bowtie process.

Expand

title	See if you can figure out how to re-run this using all 48 processors. Click here for a hint

You need to use the -p, for "processors" option. Since we had 48processors available to our job.

Code Block

language	bash
title	click here to check your answer
collapse	true

bowtie2 -t -p 48 -x bowtie2/NC_012967.1 -1 SRR030257_1.fastq -2 SRR030257_2.fastq -S bowtie2/SRR030257.sam

Try it out and compare the speed of execution by looking at the log files.

Expand

title	How much faster was it using all 48 processors?

1 processor took ~5 minutes, 48 processors took ~ 1 minute. Can you think of any reasons why it was ~ 5x faster rather than 48x faster?

Expand

title	Answer

Anytime you use multiprocessing correctly things will go faster, but even if a program can divide the input perfectly among all available processors, and combine the outputs back together perfectly, there is "overhead" in dividing things up and recombining them. These are the types of considerations you may have to make with your data: When is it better to give more processors to a single sample? How fast do I actually need the data to come back?

If you want to launch many processes as part of one job, so that they are distributed one per node and use the maximum number of processors available, then you need to learn about the "wayness" of how you request nodes on Lonestar and possibly edit your *.sge scriptslurm script in the future (more on this on Friday), or make better use of running commands in the background using the & symbol at the end of the command line.

One consequence of using multithreading that might be confusing is that the aligned reads might appear in your output SAM file in a different order than they were in the input FASTQ. This happens because small sets of reads get continuously packaged, "sent" to the different processors, and whichever set "returns" fastest is written first. You can force them to appear in the same order (at a slight cost in speed) by adding the --reorder flag to your command, but is typically only necessary if the reads are already ordered or you intend to do some comparison between the input and output.

Optional Exercises for your free time

In the bowtie2 example, we mapped in --local mode. Try mapping in --end-to-end mode (aka global mode).
Do the BWA tutorial so you can compare their outputs.
- Did bowtie2 or BWA map more reads?
- In our examples, we mapped in paired-end mode. Try to figure out how to map the reads in single-end mode and create this output.
- Which aligner took less time to run? Are there any options you can change that:
  - Lead to a larger percentage of the reads being mapped? (increase sensitivity)
  - Speed up performance without causing many fewer reads to be mapped? (increase performance)

...

The next steps are often to view the output using a specific viewer on your local machine, or to begin identifying variant locations where the reads differ from the reference sequence. These will be the next things we cover in the course. Here is a link to help you return to the GVA 2015 2016 course schedule.

Versions Compared

Old Version 1

New Version 2

Key

Useful commands

Reminder about Linux 1 liners

Exercises

More reading about SAM files

Multithreaded execution

More reading about SAM files

Multithreaded execution

Optional Exercises for your free time