Read Mapping with bowtie2 Tutorial GVA2023
- 1 Overview
- 2 Learning Objectives
- 3 Theory
- 4 Tutorial: E. coli genome re-sequencing data
- 5 Reminders about working with sequencing files
- 6 Converting sequence file formats
- 6.1.1 Load the bioperl module and run the script without any options to display the help contents
- 6.1.2 On the head node, after you have installed the bioperl module, there are actually 2 instances of bp_seqconvert.pl available to you.
- 6.1.3 On the head node, after loading the bioperl module, you have access to the program in 2 different locations.
- 6.2 Convert a gbk reference to a embl reference
- 6.3 Converting from fastq to fasta format
- 7 Mapping with bowtie2
- 8 What comes after mapping?
- 9 Optional exercises
Overview
Once you know you are working with the best quality data (Evaluating Raw Sequencing data tutorial) possible, the first step in nearly every NGS analysis pipeline is to map sequencing reads to a reference genome. In this tutorial we'll explore these basic principles using bowtie2 on TACC.
The world of read mappers is settling down after being a bioinformatics Wild West where there was a new gun in town every week that promised to be a faster and more accurate shot than the current record holder. Things seem to have reached the point where there is mainly a trade-off between speed, accuracy, and configurability among read mappers that have remained popular. There are over 50 read mapping programs listed here. Each mapper has its own set of limitations (on the lengths of reads it accepts, on how it outputs read alignments, on how many mismatches there can be, on whether it produces gapped alignments, etc). It is possible a different read mapper would be better for your set of experiments. More will be discussed about selecting a good tool on Friday.
Other read mappers
Previous versions of this class and tutorial have covered using bowtie and bwa. Please consult these tutorials for more specific information on each mapping program. A previous version of this tutorial included a trimmed down version of the bwa tutorial if you just want the 'flavor' of what other read mappers involve.
Learning Objectives
This tutorial covers the commands necessary to use bowtie2 to map reads to a reference genome, and concepts applicable to many more mappers.
Become comfortable with the basic steps of indexing a reference genome, mapping reads, and converting output to
SAM/BAMformat for downstream analysis.Use bowtie2 to map reads from an E. coli Illumina data set to a reference genome and compare the output.
Theory
Please see the Introduction to mapping presentation on the course outline for more details of the theory behind read mapping algorithms and critical considerations for using these tools and references correctly.
Tutorial: E. coli genome re-sequencing data
The following DNA sequencing read data files were downloaded from the NCBI Sequence Read Archive via the corresponding European Nucleotide Archive record. They are Illumina Genome Analyzer sequencing of a paired-end library from a (haploid) E. coli clone that was isolated from a population of bacteria that had evolved for 20,000 generations in the laboratory as part of a long-term evolution experiment (Barrick et al, 2009). The reference genome is the ancestor of this E. coli population (strain REL606), so we expect the read sample to have differences from this reference that correspond to mutations that arose during the evolution experiment.
Transferring Data
Rather than having to download these files from the SRA or EUN and NCBI, these data files are available in the following directory:
$BI/gva_course/mapping/data
You may recognize this as the same files we used for the fastqc and cutadapt tutorial. If you chose to improve the quality of R2 reads using cutadapt as you did for R1 in the tutorial, you could use the improved reads in this tutorial to see what a difference the improved reads can make for read mapping.
File Name | Description | Sample |
|---|---|---|
| Paired-end Illumina, First of pair, FASTQ format | Re-sequenced E. coli genome |
| Paired-end Illumina, Second of pair, FASTQ format | Re-sequenced E. coli genome |
| Reference Genome in Genbank format | E. coli B strain REL606 |
The easiest way to run the tutorial is to copy this entire directory into a new folder called "GVA_bowtie2_mapping" on your $SCRATCH space and then run all of the commands from inside that directory. See if you can figure out how to do that. When you're in the right place, you should get output like this from the ls command.
tacc:/scratch/<#>/<UserName>/GVA_bowtie2_mapping$ ls
NC_012967.1.gbk SRR030257_1.fastq SRR030257_2.fastq SRR030257_2.fastq.gz
Reminders about working with sequencing files
Beware the cat command when working with NGS data
NGS data can be quite large, a single lane of an Illumina Hi-Seq run generates 2 files each with 100s of millions of lines. Printing all of that can take an enormous amount of time and will likely crash your terminal long before it finishes. If you find yourself in a seemingly endless scroll of sequence (or anything else for that matter) remember control+c will kill whatever command you just executed.
If hitting control+c several times doesn't work, control +z will stop the process, you then need to kill the process using kill %1 if control+z doesn't work, you may be best off closing the window, opening a new window, logging back in, and picking up where you left off. Note that for the purpose of this class, you should make sure to restart an idev session.
Remember, from the introduction tutorial, there are multiple ways to look at our sequencing files without using cat:
Command | useful for | bad if |
|---|---|---|
head | seeing the first lines of a file (10 by default) | file is binary |
tail | seeing the last lines of a file (10 by default) | file is binary |
cat | print all lines of a file to the screen | the file is big and/or binary |
less | opens the entire file in a separate program but does not allow editing | if you are going to type a new command based on the content, or forget the q key exits the view, or file is binary |
more | prints 1 page worth of a file to the screen, can hold enter key down to see next line repeatedly. Contents will remain when you scroll back up. | you forget that you hit the q key to stop stop looking at the file, or file is binary |
How to determine how many reads are in a fastq file
grep -c "^+$" SRR030257_1.fastq How to determine how long the reads are in a fastq file
sed -n 2p SRR030257_1.fastq | awk -F"[ATCGNatcgn]" '{print NF-1}'Converting sequence file formats
Occasionally you might download a sequence or have it emailed to you by a collaborator in one format, and then the program that you want to use demands that it be in another format. Why do they have to be so picky? Everybody has own favorite formats and/or those that they are the most familiar with but humans can typically pick the information they need out of comparable formats. Programs can only be written to assume a single type of format (or allow you to specify a format if the author is particularly generous), and can only find things in single locations based on that format.
While you could write your own sequence converter, hopefully it jumps out at you that this is something someone else must have done before. In situations like this, you can often spend a few minutes on google finding a stackoverlow question/answer that deals with something like this. Some will be in reference to how to code such things, but the particularly useful ones will be the ones that point to a program or repository where someone has already done this for you.
In this case the bp_seqconvert.pl perl script is included as part of the bioperl module package. Rather than attempt to find it as part of a conda package, or in some other repository we will use the module version. If needing this script in the future outside of TACC, https://metacpan.org/dist/BioPerl/view/bin/bp_seqconvert.
Load the bioperl module and run the script without any options to display the help contents
module load bioperl
bp_seqconvert.pl
The information in this box is related to the path variable, perl programming libraries, having multiple copies of a script/file available in your path, and computer architecture. If you are not interested in this, you can skip this box.
On the head node, after you have installed the bioperl module, there are actually 2 instances of bp_seqconvert.pl available to you.
module load bioperl/1.007002
which -a bp_seqconvert.plIf you run on an idev node you get 1 result related to the bioperl module, but if you run on the head node (outside idev) you get 2 results. On the head node, 1 points to the BioITeam near where you keep finding your data (/corral-repl/utexas/BioITeam/) which is part of the BioITeam, specifically the "bin" folder which is full of binary or (typically small) bash/python/perl/R scripts that someone has written to help the TACC community. The other is in a folder specifically associated with the bioperl module. You can load and unload the bioperl module to see the difference.
If you try to run the BioITeam version of the script (/corral-repl/utexas/BioITeam/bin/bp_seqconvert.pl) from the head node without the bioperl module loaded, you get an error message similar to the following:
module unload bioperl
bp_seqconvert.pl
Can't locate Bio/SeqIO.pm in @INC (@INC contains: /corral-repl/utexas/BioITeam//local/share/perl5 /corral-repl/utexas/BioITeam//perl5/lib/perl5/x86_64-linux-thread-multi /corral-repl/utexas/BioITeam//perl5/lib/perl5 /corral-repl/utexas/BioITeam//perl5/lib64/perl5/auto /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at /corral-repl/utexas/BioITeam/bin/bp_seqconvert.pl line 8.
BEGIN failed--compilation aborted at /corral-repl/utexas/BioITeam/bin/bp_seqconvert.pl line 8.We get this error message because because while perl is installed on stampede2, the required SeqIO.pm library is not available by default but it is easily installed with the bioperl module. As it is likely rare that you will need to convert sequence files between different format, bioperl is actually not listed as one of the modules on your .bashrc file in your $HOME directory that you set up yesterday, but if you find yourself using the command `module load bioperl` often, you may want to add it.
On the head node, after loading the bioperl module, you have access to the program in 2 different locations.
module load bioperl
which -a bp_seqconvert.pl
How does the computer know which location to use?
It will use whatever location it finds earliest in the $PATH,
which is the same as the top line in the which -a command output,
which is the same as the line printed if you run the `which` command without the "-a".
Using just the script name by itself, will use which ever is found first, but you can always force the computer to use a given copy by specifying the full path to the copy you want. Thus, the following 2 commands are not equal:
/corral-repl/utexas/BioITeam/bin/bp_seqconvert.pl
/home1/apps/bioperl/1.007002/bin/bp_seqconvert.pl While the commands are different, both copies can use the same bioperl library SeqIO.pm when the bioperl module is loaded and thus work.