Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Tip
titleReservations

Use our summer school reservation (CoreNGS-Thu) when submitting batch jobs to get higher priority on the ls6 normal queue.

Code Block
languagebash
titleRequest an interactive (idev) node
# Request a 180 minute interactive node on the normal queue using our reservation
idev -m 120 -N 1 -A OTH21164 -r CoreNGS
idev -m 120 -N 1 -A TRA23004 -r CoreNGS-Thu

# Request a 120 minute idev node on the development queue 
idev -m 120 -N 1 -A OTH21164 -p development
idev -m 120 -N 1 -A TRA23004 -p
development


Code Block
languagebash
titleSubmit a batch job
# Using our reservation
sbatch --reseservation=CoreNGS-Thu <batch_file>.slurm

Note that the reservation name (CoreNGS) is different from the TACC allocation/project for this class, which is OTH21164.

...

After raw sequence files are generated (in FASTQ format), quality-checked, and pre-processed preprocessed in some way, the next step in many NGS pipelines is mapping to a reference genome.

...

Code Block
languagebash
titleStart an idev session
idev -m 180 -N 1 -A OTH21164 -r CoreNGS-Thu

Then stage the sample datasets and references we will use.

...

Here are the four reference genomes we will be using today, with some information about them. These are not necessarily the most recent versions of these references (e.g. the newest human reference genome is hg38 and a the most recent miRBase version is v21. (See here for information about many more genomes.)

...

Regular expressions are so powerful that nearly every modern computer language includes a "regex" module of some sort. There are many online tutorials for regular expressions, and several slightly different "flavors" of them. But the most common is the Perl style (http://perldoc.perl.org/perlretut.html), which was one of the fist and still the most powerful (there's a reason Perl was used extensively when assembling the human genome). We're only going to use simple regular expressions here, but learning more about them will pay handsome dividends for you in the future.Here's how to execute grep to list contig names in a FASTA file.

First stage the FASTA files we'll need:

Code Block
languagebash
titlegrep to match contig names in a FASTA file
# Stage the FASTA files
cds
mkdir -p core_ngs/references/fasta
cd core_ngs/references/fasta
cp $CORENGS/references/fasta/*.fa .

Here's how to execute grep to list contig names in a FASTA file.

Code Block
languagebash
titlegrep to match contig names in a FASTA file
cd $SCRATCH/core_ngs/references/fasta
grep -P '^>' sacCer3.fa | more

...

  • The -P option tells grep to Perl-style regular expression patterns. 
    • This makes including special characters like Tab \t ), carriage return ( \r ) or linefeed ( \n ) much easier that the default POSIX paterns.
    • While it is not required here, it generally doesn't hurt to include this option.
  • '^>' is the regular expression describing the pattern we're looking for (described below)

  • sacCer3.fa is is the FASTA file to search. 
    • lines with text that match our pattern will be written to standard output
    • non matching lines will be omitted
  • We pipe to more just in case there are a lot of contig names.

...

Expand
titleSetup (if needed)


Code Block
languagebash
# Copy the FASTA files for building references
mkdir -p $SCRATCH/core_ngs/references/fasta
cp $CORENGS/references/fasta/*.fa $SCRATCH/core_ngs/references/fasta/


Exercise: How many lines does the sacCer3 reference have, and how many contigs are there in the sacCer3 reference?

Expand
titleHint


Code Block
languagebash
cd $SCRATCH/core_ngs/references/fasta
grep -P '^>' sacCer3.fa | wc -l

# and to count the lines:
wc -l sacCer3.fa

Or use grep's -c option that says "just count the line matches"

Code Block
languagebash
grep -P -c '^>' sacCer3.fa


...

Expand
titleAnswer

There are 17 contigs, out of 243,167 total lines.

Aligner overview

There are many aligners available, but we will concentrate on two of the most popular general-purpose ones: bwa and bowtie2. The table below outlines the available protocols for them.

...