...
Remember, from the introduction tutorial, there are multiple ways to look at our sequencing files without using cat:
Command | useful for | bad if |
---|
head | seeing the first lines of a file (10 by default) | file is binary |
tail | seeing the last lines of a file (10 by default) | file is binary |
cat | print all lines of a file to the screen | the file is big and/or binary |
less | opens the entire file in a separate program but does not allow editing | if you are going to type a new command based on the content, or forget the q key exits the view, or file is binary |
more | prints 1 page worth of a file to the screen, can hold enter key down to see next line repeatedly. Contents will remain when you scroll back up. | you forget that you hit the q key to stop stop looking at the file, or file is binary |
Code Block |
---|
language | bash |
---|
title | How to determine how many reads are in a fastq file |
---|
|
grep -c "^+$" SRR030257_1.fastq |
...
Run the script without any arguments to get the help message:
Code Block |
---|
module load gcc
module load bioperl
bp_seqconvert.pl
|
Convert a gbk reference to a embl reference
...
Sometimes you only want to work with a subset of a full data file to check for overall trends, or to try out a new piece of code. Convert only the first 10,000 lines of SRR030257_1.fastq
to FASTA
format.
Code Block |
---|
language | bash |
---|
title | Remember you can use the "|" character to have the output of head feed directly into the bp_seqconvert.pl script. |
---|
|
head -n 10000 SRR030257_1.fastq | bp_seqconvert.pl --from fastq --to fasta > SRR030257_1.fasta
|
...
Expand |
---|
title | What information was lost by this conversion? |
---|
|
...
Expand |
---|
title | Click here if you need a hint |
---|
|
Remember you can use the "|" character to have the output of head feed directly into the bp_seqconvert.pl script.Use the head command to look at the top of both the .fastq and .fasta file |
|
Code Block |
---|
| head SRR030257_1.fastq
head SRR030257_1.fasta |
The line of ASCII characters was lost. Remember, those are your "base quality scores". Many mappers will use the base quality scores to improve how the reads are aligned by not placing as much emphasis on poor bases. |
Mapping with bowtie2
Bowtie2 is a complete rewrite of bowtie. It is currently the latest and greatest in the eyes of one very picky professor (and his postdoc) in of an older program bowtie. In terms of configurability, sensitivity, and speed it is useful for a wide range of projects. After years of teaching bwa mapping along with bowtie2, bowtie2 alone is now taught as I never recommend anyone use bwa, and based on positive feedback we continue with this set up. For some more details about how read mappers work see the bonus presentation, and if you find a compelling reason to use bwa (or any other read mapper) rather than bowtie2 after the course is over, I'd love to hear from you.
...
Code Block |
---|
language | bash |
---|
title | Commands for making a directory and changing into itnamed bowtie2 |
---|
collapse | true |
---|
|
mkdir bowtie2 |
First you need to convert the reference file from GenBank to FASTA using what you learned above. Name the new output file NC_012967.1.fasta
and put it in the same directory as NC_012967.1.gbk
.
Code Block |
---|
|
expandbash | title | If you're stuck and want a hint without the answer... |
---|
| Use the Use the information you you learned above working with the bp_seqconvert.pl script |
| Code Block |
---|
language | bash |
---|
title | Click here to get the answer or to check your commandto convert the entire .gbk file into a .fa file | collapse | true |
---|
| bp_seqconvert.pl --from genbank --to fasta < NC_012967.1.gbk > NC_012967.1.fasta
|
Next, we want to make sure the bowtie2 module is loaded (we use module spider
to get the current name, which may not be bowtie/2.3.4
if you re-run this tutorial later):
Expand |
---|
title | Click here for a hint without the answer and some links back to where you would have learned this previously |
---|
|
Remember in our earlier tutorial we discussed the use of lonestar's module commands "spider" and "load" to install new functionality and "list", "keyword", and "avail" to find different modules. Code Block |
---|
language | bash |
---|
title | click here for the best answer |
---|
collapse | true |
---|
| module list bowtie |
|
...
Code Block |
---|
|
Lmod has detected the following error: These module(s) or extension(s) exist but cannot be loaded as requested: "bowtie/2.3.4"
Try: "module spider bowtie/2.3.4" to see how to load the module(s). |
...
Expand |
---|
title | OPTIONAL -- How to check what version of bowtie2 was loaded? |
---|
|
Here are a few of the possibilities that will work. Code Block |
---|
language | bash |
---|
title | In this case all of these methods will work, that may not be true of all programs |
---|
| module list
bowtie2 --version
module list
which bowtie2 | Note that Many programs have --version or -v options that can be passed to them to specifically print the version of the program and sometimes information about what papers the authors would like you to cite if you use the program. which can be very useful for making sure you are running the executable that you think you are running, especially if you install your own programs. In particular make sure that the path matches up to what you expect. The most common situations arise from wanting to run a simplistically named script in your $HOME directory conflicting with something of the same name in the $BI directories or TACC modules.
|
...
Expand |
---|
title | If you're stuck click here for an explanation of what arguments the command does need |
---|
|
The command requires 2 arguments. The first argument is the reference FASTA. The second argument is the "base" file name to use for the created index files. It will create a bunch of files beginning bowtie/NC_012967.1*. |
Code Block |
---|
language | bash |
---|
title | Click here to check your work, or for the answer if needed |
---|
collapse | true |
---|
|
bowtie2-build NC_012967.1.fasta bowtie2/NC_012967.1
|
Take a look at your output directory using ls bowtie2
to see what new files have appeared. These files are binary files, so looking at them with head
or tail isn't instructive and can cause issues with your terminal. If you insist on looking at them (or accidentally do so before you read this) and your terminal begins behaving oddly, simply close it and log back into lonestar with a new terminal, and start a new idev session.Why do so many different mapping programs create an index as a first step you may be wondering?
Info |
---|
title | you may be wondering why creating an index is a common first step for many different mapping programs. |
---|
|
Like an index for a book (in the olden days before Kindles and Nooks), creating an index for a computer database allows quick access to any "record" given a short "key". In the case of mapping programs, creating an index for a reference sequence allows it to more rapidly place a read on that sequence at a location where it knows at least a piece of the read matches perfectly or with only a few mismatches. By jumping right to these spots in the genome, rather than trying to fully align the read to every place in the genome, it saves a ton of time. Indexing is a separate step in running most mapping programs because it can take a LONG time if you are indexing a very large genome (like our own overly complicated human genome). Furthermore, you only need to index a genome sequence once, no matter how many samples you want to map. Keeping it as a separate step means that you can skip it later when you want to align a new sample to the same reference sequence. |
Finally, map the reads! The command you need is:
Try Again, try reading the help to figure out how to run the command yourself. Remember these are paired-end reads.
Warning |
---|
|
This command can take a while (~5 minutes) and is extremely taxing. This is longer than we want to run a job on the head node (especially when all of us are doing it at once). In fact, in previous years, TACC has noticed the spike in usage when multiple students forgot to make sure they were on idev nodes and complained pretty forcefully to us about it. Let's not have this be one of those years. Use the showq -u command to make sure you are on an idev node. If you are not scroll up to the start of this tutorial or see yesterday's tutorial for more informationidev node. If you are not sure if you are on an idev node, raise your hand and wespeak up on zoom and I'll show(q) -u what you are looking for. Yes, your instructor likes bad puns. My apologies. If you are not on an idev node, and need help to relaunch it, click over to the idev tutorial. |
Code Block |
---|
language | bash |
---|
title | Solution |
---|
collapse | true |
---|
|
bowtie2 -t -x bowtie2/NC_012967.1 -1 SRR030257_1.fastq -2 SRR030257_2.fastq -S bowtie2/SRR030257.sam # the -t command is not required for the mapping, but it can be particularly informative when you begin comparing different mappers |
...
- SAM files can be enormously humongous text files (potentially >1 gigabytes). Attempting to open the entire file at once can cause your computer to lock up or your text editor to crash. You are generally safer only looking at a portion at a time using linux commands like
head
or grep
or more or using a viewer like IGV, which we will cover in a later tutorial. - SAM files have some rather complicated information encoded as text, like a binary encoded FLAGS field and CIGAR strings. We'll take a look at some of these later, if we have time, or they are covered in the bonus presentation.
Still, you should recognize some of the information on a line in a SAM file from the input FASTQ, and some of the other information is relatively straightforward to understand, like the position where the read mapped. Give this a try:
...
Expand |
---|
title | See if you can figure out how to re-run this using all 48 processors. Click here for a hint |
---|
|
You need to use the -p , for "processors" option. Since we had 48processors available to our job. Code Block |
---|
language | bash |
---|
title | click here to check your answer |
---|
collapse | true |
---|
| bowtie2 -t -p 48 -x bowtie2/NC_012967.1 -1 SRR030257_1.fastq -2 SRR030257_2.fastq -S bowtie2/SRR030257.sam
|
Try it out and compare the speed of execution by looking at the log files. Expand |
---|
title | How much faster was it using all 48 processors? |
---|
| 1 processor took ~5 a little over 5 minutes, 48 processors took ~ 15 seconds minute. Can you think of any reasons why it was ~ 16x faster rather than 48x faster? Expand |
---|
| Anytime you use multiprocessing correctly things will go faster, but even if a program can divide the input perfectly among all available processors, and combine the outputs back together perfectly, there is "overhead" in dividing things up and recombining them. These are the types of considerations you may have to make with your data: When is it better to give more processors to a single sample? How fast do I actually need the data to come back? |
|
|
If you want to launch many processes as part of one job, so that they are distributed one per node and use the maximum number of processors available, then you need to learn think about the "wayness" of how you request nodes on Lonestar and possibly edit your launcher.slurm script in the future ((we'll go over this more on this on Friday), or make better use of running commands in the background using the & symbol at the end of the command line.
One consequence of using multithreading that might be confusing is that the aligned reads might appear in your output SAM file in a different order than they were in the input FASTQ. This happens because small sets of reads get continuously packaged, "sent" to the different processors, and whichever set "returns" fastest is written first. You can force them to appear in the same order (at a slight cost in speed) by adding the --reorder
flag to your command, but is typically only necessary if the reads are already ordered or you intend to do some comparison between the input and output.
Optional
...
exercises
In the bowtie2 example, we mapped in --local
mode. Try mapping in --end-to-end
mode (aka global mode).
- Do the BWA tutorial so you can compare their outputs.
- Did bowtie2 or BWA map more reads?
- In our examples, we mapped in paired-end mode. Try to figure out how to map the reads in single-end mode and create this output.
- Which aligner took less time to run? Are there any options you can change that:
- Lead to a larger percentage of the reads being mapped? (increase sensitivity)
- Speed up performance run time without causing many fewer reads to be mapped? (increase performance)
...
The next steps are often to view the output using a specific viewer on your local machine, or to begin identifying variant locations where the reads differ from the reference sequence. These will be the next things we cover in the course. Here is a link to help you return to the GVA 2019 2020 course schedule.