...
| Tip | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||||
Use our summer school reservation (CoreNGS-Thu) when submitting batch jobs to get higher priority on the ls6 normal queue.
Note that the reservation name (CoreNGS) is different from the TACC allocation/project for this class, which is OTH21164. |
...
After raw sequence files are generated (in FASTQ format), quality-checked, and preprocessed in some way, the next step in many NGS pipelines is mapping to a reference genome.
...
| Code Block | ||||
|---|---|---|---|---|
| ||||
idev -m 180 -N 1 -A OTH21164 -r CoreNGS-Thu |
Then stage the sample datasets and references we will use.
...
Here are the four reference genomes we will be using today, with some information about them. These are not necessarily the most recent versions of these references (e.g. the newest human reference genome is hg38 and the most recent miRBase version is v21). (See here for information about many more genomes.)
...
Regular expressions are so powerful that nearly every modern computer language includes a "regex" module of some sort. There are many online tutorials for regular expressions, and several slightly different "flavors" of them. But the most common is the Perl style (http://perldoc.perl.org/perlretut.html), which was one of the first and is still the most powerful (there's a reason Perl was used extensively when assembling the human genome). We're only going to use simple regular expressions here, but learning more about them will pay handsome dividends for you in the future.
First stage the FASTA files we'll need:
| Code Block | ||||
|---|---|---|---|---|
| ||||
# Stage the FASTA files
cds
mkdir -p core_ngs/references/fasta
cd core_ngs/references/fasta
cp $CORENGS/references/fasta/*.fa .
|
Here's how to execute grep to list contig names in a FASTA file.
| Code Block | ||||
|---|---|---|---|---|
| ||||
cd $SCRATCH/core_ngs/references/fasta
grep -P '^>' sacCer3.fa | more |
...
- The -P option tells grep to use Perl-style regular expression patterns.
- This makes including special characters like Tab ( \t ), carriage return ( \r ) or linefeed ( \n ) much easier than with the default POSIX patterns.
- While it is not required here, it generally doesn't hurt to include this option.
- '^>' is the regular expression describing the pattern we're looking for (described below).
- sacCer3.fa is the FASTA file to search.
- Lines with text that matches our pattern will be written to standard output.
- Non-matching lines will be omitted.
- We pipe to more just in case there are a lot of contig names.
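To experiment with the pattern without the staged references, here is a minimal sketch that runs the same kind of grep on a tiny, made-up FASTA file. The file name toy.fa, its contig names, and the variable names are invented for illustration; only the -P option and the '^>' pattern come from the command above. It assumes a GNU grep built with Perl-regex support, as on most Linux systems.

```shell
# Build a tiny, made-up FASTA file in a temporary directory (toy.fa is hypothetical)
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy.fa" <<'EOF'
>chrA
ACGTACGTAC
GGCCTTAA
>chrB
TTGACA
EOF

# '^>' matches only lines that start with ">", i.e. the FASTA header lines
contig_names=$(grep -P '^>' "$tmpdir/toy.fa")
echo "$contig_names"

# With -P, \t in the pattern stands for a Tab character;
# -c counts the matching lines instead of printing them
tab_hits=$(printf 'col1\tcol2\nno tabs here\n' | grep -c -P '\t')
echo "lines containing a Tab: $tab_hits"

# Clean up the temporary directory
rm -rf "$tmpdir"
```

Only the two header lines are printed by the first grep; the sequence lines are omitted because they do not start with ">".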
...
Exercise: How many lines does the sacCer3 reference have, and how many contigs are there in the sacCer3 reference?
| Expand | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||
Or use grep's -c option that says "just count the line matches"
|
...
| Expand | ||
|---|---|---|
| ||
There are 17 contigs, out of 243,167 total lines. |
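One way to arrive at counts like these is sketched below on a small, made-up FASTA file, so it can be run anywhere. The file toy.fa and the variable names are hypothetical; wc -l and grep -c are the standard tools assumed here, and this is not necessarily the exact solution hidden in the exercise.

```shell
# Build a tiny, made-up FASTA file in a temporary directory (toy.fa is hypothetical)
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy.fa" <<'EOF'
>chrA
ACGTACGTAC
GGCCTTAA
>chrB
TTGACA
EOF

# Total number of lines; reading from standard input makes wc print only the number
total_lines=$(wc -l < "$tmpdir/toy.fa")

# Number of header lines, i.e. contigs; -c counts matching lines instead of printing them
num_contigs=$(grep -c '^>' "$tmpdir/toy.fa")

echo "$num_contigs contigs, out of $total_lines total lines"

# Clean up the temporary directory
rm -rf "$tmpdir"
```

Running the same two commands on sacCer3.fa in the staged fasta directory should give the figures quoted above.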
Aligner overview
There are many aligners available, but we will concentrate on two of the most popular general-purpose ones: bwa and bowtie2. The table below outlines the available protocols for them.
...