Page Comparison

...

Stage the alignment data

First connect to ls6.tacc.utexas.edu and start an idev session. This should be second nature by now.

Code Block

language	bash
title	Start an idev session

idev -p normal -m 180 -A OTH21164 -N 1 -n 68

Then stage the sample datasets and references we will use.

Code Block

language	bash
title	Get the alignment exercises files

mkdir -p $SCRATCH/core_ngs/references/fasta
cp $CORENGS/references/*.fa     $SCRATCH/core_ngs/references/fasta/

mkdir -p $SCRATCH/core_ngs/alignment/fastq
cp $CORENGS/alignment/*fastq.gz $SCRATCH/core_ngs/alignment/fastq/
cd $SCRATCH/core_ngs/alignment/fastq

...

Searching a genome is computationally expensive and takes a long time if done directly against the linear genomic sequence, so aligners require that references first be indexed to accelerate lookup. The aligners we are using each require a different index, but all use the same method (the Burrows-Wheeler Transform) to get the job done.
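
To give a feel for what the Burrows-Wheeler Transform does, here is a minimal toy sketch in bash. It is an illustration only: the function name `bwt` and the input string are made up for this example, and the brute-force approach (sort every rotation, keep the last character of each) is not how real aligners build their indexes, which use far more memory- and time-efficient constructions.

```shell
#!/usr/bin/env bash
# Toy Burrows-Wheeler Transform (illustration only, assumed helper name).
# The input should end with a unique terminator such as '$'.
bwt() {
    local s="$1" n=${#1} i
    # Emit every rotation of the string, one per line.
    for ((i = 0; i < n; i++)); do
        printf '%s\n' "${s:i}${s:0:i}"
    done |
        LC_ALL=C sort |   # byte-order sort, so '$' sorts before the letters
        awk '{ printf "%s", substr($0, length($0), 1) } END { print "" }'
}

bwt 'banana$'   # prints annb$aa
```

Notice how identical characters tend to cluster in the output. That clustering is part of what makes the transformed text compact to store and quick to search.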

Building a reference index involves taking a FASTA file as input, with each contig (contiguous string of bases, e.g. a chromosome) as a separate FASTA entry, and producing an aligner-specific set of files as output. Those output index files are then used to perform the sequence alignment, and alignments are reported using coordinates referencing names and offset positions based on the original FASTA file contig entries.
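
Because alignments are reported against the contig names in the FASTA file, it can help to check those names and lengths before building or choosing an index. The sketch below does this with awk on a tiny FASTA file invented for the example (the contig names `chrA` and `chrB`, the sequences, and the helper name `contig_lengths` are all hypothetical).

```shell
#!/usr/bin/env bash
# Hypothetical two-contig FASTA, made up for illustration.
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>chrA
ACGTACGT
ACGT
>chrB
GGCC
EOF

# Print each contig name followed by its total length in bases.
contig_lengths() {
    awk '/^>/ { if (name != "") print name "\t" len
                name = substr($1, 2); len = 0; next }
         { len += length($0) }
         END { if (name != "") print name "\t" len }' "$1"
}

contig_lengths "$fasta"
```

On a real reference you would pass the path to your own FASTA file instead of the temporary one, e.g. `contig_lengths sacCer3.fa`.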

...

Code Block

language	bash
title	BWA hg19 index location

/work/projects/BioITeam/ref_genome/bwa/bwtsw/hg19

...

Tip

The BioITeam maintains a set of reference indexes for many common organisms and aligners. They can be found in aligner-specific sub-directories of the /work/projects/BioITeam/ref_genome area. E.g.:

Code Block

language	bash

/work/projects/BioITeam/ref_genome/
   bowtie2/
   bwa/
   hisat2/
   kallisto/
   star/
   tophat/

...

Regular expressions are so powerful that nearly every modern computer language includes a "regex" module of some sort. There are many online tutorials for regular expressions, and several slightly different "flavors" of them. But the most common is the Perl style (http://perldoc.perl.org/perlretut.html), which was one of the first and is still the most powerful (there's a reason Perl was used extensively when assembling the human genome). We're only going to use the simplest of regular expressions here, but learning more about them will pay handsome dividends for you in the future.

...

Code Block

language	bash
title	grep to match contig names in a FASTA file

# If you haven't staged the fasta files
cds
mkdir -p core_ngs/references/fasta
cd core_ngs/references/fasta
cp $CORENGS/references/*.fa .

cd $SCRATCH/core_ngs/references/fasta
grep -P '^>' sacCer3.fa | more
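
The same `^>` pattern can be extended a little. The sketch below is a small variation run on a toy FASTA file invented for the example (three made-up entries, not the real sacCer3.fa), and it uses `grep -E` rather than `grep -P` purely so it also runs where Perl-style matching is not compiled in. It counts the contigs, then keeps only names made of `chr` followed by Roman-numeral letters.

```shell
#!/usr/bin/env bash
# Toy FASTA with three hypothetical contigs.
fa=$(mktemp)
printf '>chrI\nACGT\n>chrII\nGGCC\n>chrM\nTTAA\n' > "$fa"

# -c counts matching lines: one header line per contig.
n_contigs=$(grep -c '^>' "$fa")

# ^ and $ anchor the whole line; [IVX]+ is one or more of I, V or X.
roman_only=$(grep -E '^>chr[IVX]+$' "$fa")

echo "$n_contigs contigs"
echo "$roman_only"
```

Anchoring with both `^` and `$` matters here: without the trailing `$`, a name that merely starts with `chrI` would also match.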

...

Versions Compared

Old Version 186

New Version 187
