...

Stage the alignment data

First connect to ls6.tacc.utexas.edu and start an idev session. This should be second nature by now.

Code Block
languagebash
titleStart an idev session
idev -p normal -m 180 -A OTH21164 -N 1 -n 68

Then stage the sample datasets and references we will use.

Code Block
languagebash
titleGet the alignment exercises files
mkdir -p $SCRATCH/core_ngs/references/fasta
cp $CORENGS/references/*.fa     $SCRATCH/core_ngs/references/fasta/

mkdir -p $SCRATCH/core_ngs/alignment/fastq
cp $CORENGS/alignment/*fastq.gz $SCRATCH/core_ngs/alignment/fastq/
cd $SCRATCH/core_ngs/alignment/fastq

...

Searching a genome is computationally expensive and slow when done against the raw linear genomic sequence, so aligners require that references first be indexed to accelerate lookup. The aligners we are using each require their own index, but all use the same method (the Burrows-Wheeler Transform) to get the job done.

Building a reference index involves taking a FASTA file as input, with each contig (contiguous string of bases, e.g. a chromosome) as a separate FASTA entry, and producing an aligner-specific set of files as output. Those output index files are then used to perform the sequence alignment, and alignments are reported using coordinates referencing names and offset positions based on the original FASTA file contig entries.
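
To make the link between FASTA contig entries and alignment coordinates concrete, here is a minimal sketch that lists each contig name and its length for a toy reference. The file toy.fa and the contig names chrA and chrB are made up for illustration; they are not part of the course data. The index build step itself is not shown.

```bash
# Create a toy reference with two contigs (hypothetical names chrA and chrB).
workdir=$(mktemp -d)
cat > "$workdir/toy.fa" <<'EOF'
>chrA
ACGTACGTAC
GTACGT
>chrB
TTGACA
EOF

# Print each contig name and its total length in bases.
# A header line starts with '>'; every other line is sequence for that contig.
contig_lengths=$(awk '
  /^>/ { if (name != "") print name "\t" len; name = substr($1, 2); len = 0; next }
       { len += length($0) }
  END  { if (name != "") print name "\t" len }
' "$workdir/toy.fa")
echo "$contig_lengths"
```

An alignment reported against this reference would name one of these contigs and give an offset that falls within that contig's length.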

...

Code Block
languagebash
titleBWA hg19 index location
/work/projects/BioITeam/ref_genome/bwa/bwtsw/hg19

...

Tip

The BioITeam maintains a set of reference indexes for many common organisms and aligners. They can be found in aligner-specific sub-directories of the /work/projects/BioITeam/ref_genome area. E.g.:

Code Block
languagebash
/work/projects/BioITeam/ref_genome/
   bowtie2/
   bwa/
   hisat2/
   kallisto/
   star/
   tophat/


...

Regular expressions are so powerful that nearly every modern computer language includes a "regex" module of some sort. There are many online tutorials for regular expressions, and several slightly different "flavors" of them. But the most common is the Perl style (http://perldoc.perl.org/perlretut.html), which was one of the first and is still the most powerful (there's a reason Perl was used extensively when assembling the human genome). We're only going to use the simplest of regular expressions here, but learning more about them will pay handsome dividends for you in the future.

...

Code Block
languagebash
titlegrep to match contig names in a FASTA file
# If you haven't staged the fasta files
cds
mkdir -p core_ngs/references/fasta
cd core_ngs/references/fasta
cp $CORENGS/references/*.fa .

cd $SCRATCH/core_ngs/references/fasta
grep -P '^>' sacCer3.fa | more
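
If you want to see what the '^>' pattern does without the course files, the following sketch runs the same kind of match on a small made-up FASTA file. The file toy.fa and its contents are assumptions for illustration only. Plain grep is used here because this simple pattern does not need the Perl-style -P option.

```bash
# Build a tiny FASTA file with three contigs (made-up sequences).
workdir=$(mktemp -d)
cat > "$workdir/toy.fa" <<'EOF'
>chrI
ACGTACGT
>chrII
GGCCTTAA
>chrM
TTTTAAAA
EOF

# '^' anchors the match to the start of the line, so only header lines match.
grep '^>' "$workdir/toy.fa"

# -c counts matching lines instead of printing them: one per contig.
n_contigs=$(grep -c '^>' "$workdir/toy.fa")
echo "$n_contigs contigs"

# Strip the leading '>' to get bare contig names.
contig_names=$(grep '^>' "$workdir/toy.fa" | sed 's/^>//')
echo "$contig_names"
```

Sequence lines never start with '>', so anchoring the pattern with '^' is enough to separate headers from bases.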

...