Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Even though many mapping tools exist, a few individual programs have a dominant "market share" of the NGS world. In this section, we will primarily focus on two of the most versatile general-purpose ones: BWA and Bowtie2 (the latter being part of the Tuxedo suite which includes the transcriptome-aware RNA-seq aligner Tophat2 as well as other downstream quantifiaction quantification tools).

Stage the alignment data

...

Code Block
languagebash
titleStart an idev session
idev -m 180 -N 1 -A OTH21164 -r CoreNGSday4CoreNGS-Thu

Then stage the sample datasets and references we will use.

...

File NameDescriptionSample
Sample_Yeast_L005_R1.cat.fastq.gzPaired-end Illumina, First of pair, FASTQYeast ChIP-seq
Sample_Yeast_L005_R2.cat.fastq.gzPaired-end Illumina, Second of pair, FASTQYeast ChIP-seq
human_rnaseq.fastq.gzPaired-end Illumina, First of pair only, FASTQHuman RNA-seq
human_mirnaseq.fastq.gzSingle-end Illumina, FASTQHuman microRNA-seq
cholera_rnaseq.fastq.gzSingle-end Illumina, FASTQV. cholerae RNA-seq

Reference Genomes

...

Here are the four reference genomes we will be using today, with some information about them. These are not necessarily the most recent versions of these references (e.g. the newest human reference genome is hg38 and the most a recent miRBase annotation is  version is v21. (See here for information about many more genomes.)

...

We've discovered a pattern (also known as a regular expression) to use in searching, and the command line tool that does regular expression matching is grep (general regular expression parser). (Read more about grep here: Advanced commands: grep.and regular expressions)

Regular expressions are so powerful that nearly every modern computer language includes a "regex" module of some sort. There are many online tutorials for regular expressions, and several slightly different "flavors" of them. But the most common is the Perl style (http://perldoc.perl.org/perlretut.html), which was one of the fist and still the most powerful (there's a reason Perl was used extensively when assembling the human genome). We're only going to use simple regular expressions here, but learning more about them will pay handsome dividends for you in the future.

...

Code Block
languagebash
titlegrep to match contig names in a FASTA file
# If you haven't staged Stage the fastaFASTA files
cds
mkdir -p core_ngs/references/fasta
cd core_ngs/references/fasta
cp $CORENGS/references/fasta/*.fa .

cd $SCRATCH/core_ngs/references/fasta
grep -P '^>' sacCer3.fa | more

...

  • The -P option tells grep to Perl-style regular expression patterns. 
    • This makes including special characters like Tab \t  ), Carriage Return carriage return ( \r ) or Linefeed linefeed ( \n ) much easier that the default POSIX paterns.
    • While it is not required here, it generally doesn't hurt to include this option.
  • '^>' is the regular expression describing the pattern we're looking for (described below)

  • sacCer3.fa is the file to search. 
    • lines with text that match our pattern will be written to standard output
    • non matching lines will be omitted
  • We pipe to more just in case there are a lot of contig names.

Now down to the nuts and bolts of the pattern: '^>'

First, the single quotes around the pattern – this tells the bash shell to pass the exact string contents to grep.

As part of its friendly we have seen, during command line parsing and evaluation , the shell will often look for special characters metacharacters on the command line that mean something to it (for example, the the $ in front of an environment variable name, like in $SCRATCH). Well, regular expressions treat the $ specially too – but in a completely different way! Those Those single quotes tell tell the shell "don't look inside here for special characters – treat this as a literal string and pass it to the program". The shell will obey, will strip the single quotes off the string, and will pass the actual pattern,   ^>, to the grep program. (Note that the shell does look inside double quotes ( " ) for certain special signals, such as looking for environment variable names to evaluate. Read more about  program. (Read more about Quoting in the shell.)

So what does ^> mean to grep? We know that contig name lines always start with a > character, so > is a literal for grep to use in its pattern match.

We might be able to get away with just using this literal alone as our regex, specifying '>' as as the command line pattern argument. But for for grep, the more specific the pattern, the better. So we constrain where the > can appear on the line. The special carat ( ^ )  metacharacter represents "beginning of line". So ^> means "beginning of a line followed by a > character".

...