...
Stage the alignment data
First connect to ls6.tacc.utexas.edu and start an idev session. This should be second nature by now.
Code Block | ||||
---|---|---|---|---|
| ||||
idev -p normal -m 180 -A OTH21164 -N 1 -n 68 |
Then stage the sample datasets and references we will use.
Code Block | ||||
---|---|---|---|---|
| ||||
mkdir -p $SCRATCH/core_ngs/references/fasta
mkdir -p $SCRATCH/core_ngs/alignment/fastq
cp $CORENGS/references/*.fa $SCRATCH/core_ngs/references/fasta/
cp $CORENGS/alignment/*fastq.gz $SCRATCH/core_ngs/alignment/fastq/
cd $SCRATCH/core_ngs/alignment/fastq |
...
Searching a genome is computationally expensive and would take a long time if done by scanning the linear genomic sequence directly, so aligners require that references first be indexed to accelerate lookup. The aligners we are using each require a different index, but they all use the same method (the Burrows-Wheeler Transform) to get the job done.
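To give a feel for what the Burrows-Wheeler Transform does, here is a toy sketch that computes it for a short string by sorting all rotations and keeping the last character of each. This is only an illustration: the function name `bwt` and the example string are made up here, and real aligners build their indexes with far more efficient algorithms than sorting every rotation.

```shell
# Toy Burrows-Wheeler Transform: NOT how real aligners build an index.
# bwt is a hypothetical helper name used only for this illustration.
bwt() {
    # 1. Print every rotation of the input string, one per line.
    # 2. Sort the rotations in plain byte order.
    # 3. Keep the last character of each sorted rotation.
    printf '%s\n' "$1" |
        awk '{ n = length($0)
               for (i = 1; i <= n; i++)
                   print substr($0, i) substr($0, 1, i - 1) }' |
        LC_ALL=C sort |
        awk '{ printf "%s", substr($0, length($0)) } END { print "" }'
}

# The trailing '$' marks the end of the text and sorts before any letter.
bwt 'banana$'    # prints annb$aa
```

Notice that the output groups identical letters together. That property is part of what makes the transformed text compact and quick to search.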
Building a reference index involves taking a FASTA file as input, with each contig (contiguous string of bases, e.g. a chromosome) as a separate FASTA entry, and producing an aligner-specific set of files as output. Those output index files are then used to perform the sequence alignment, and alignments are reported using coordinates referencing names and offset positions based on the original FASTA file contig entries.
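Since alignments are reported against contig names and positions from the FASTA file, it can help to look at what contigs a reference contains before indexing it. The sketch below lists each contig name and its length in bases. The demo file, the contig names (chrDemo1, chrDemo2) and the helper name `contig_lengths` are all made up for illustration; on real data you would pass one of the staged reference files instead.

```shell
# Build a tiny, made-up FASTA file so the example runs anywhere.
# The contig names and sequences here are hypothetical.
demo_fa=$(mktemp)
cat > "$demo_fa" <<'EOF'
>chrDemo1 first toy contig
ACGTACGTAC
GGCC
>chrDemo2 second toy contig
TTTTGGGG
EOF

# Print each contig name and its total length in bases, tab-separated.
# The name is taken as the first word after '>' on the header line.
contig_lengths() {
    awk '/^>/ { if (name != "") print name "\t" len
                name = substr($1, 2); len = 0; next }
         { len += length($0) }
         END { if (name != "") print name "\t" len }' "$1"
}

contig_lengths "$demo_fa"
```

For the toy file this reports chrDemo1 with 14 bases and chrDemo2 with 8 bases. An alignment position is only meaningful relative to these names and lengths.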
...
Code Block | ||||
---|---|---|---|---|
| ||||
/work/projects/BioITeam/ref_genome/bwa/bwtsw/hg19 |
...
Tip | |||||
---|---|---|---|---|---|
The BioITeam maintains a set of reference indexes for many common organisms and aligners. They can be found in aligner-specific sub-directories of the /work/projects/BioITeam/ref_genome area. E.g.:
|
...
Regular expressions are so powerful that nearly every modern computer language includes a "regex" module of some sort. There are many online tutorials for regular expressions, and several slightly different "flavors" of them. But the most common is the Perl style (http://perldoc.perl.org/perlretut.html), which was one of the first and is still the most powerful (there's a reason Perl was used extensively when assembling the human genome). We're only going to use the simplest of regular expressions here, but learning more about them will pay handsome dividends for you in the future.
...
Code Block | ||||
---|---|---|---|---|
| ||||
# If you haven't staged the fasta files
cds
mkdir -p core_ngs/references/fasta
cd core_ngs/references/fasta
cp $CORENGS/references/*.fa .
# Either way, go to the staged references and list the FASTA header lines
cd $SCRATCH/core_ngs/references/fasta
grep -P '^>' sacCer3.fa | more |
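To see why the `^` anchor matters in the pattern above, here is a small sketch on a made-up file. The file contents are hypothetical, including a deliberately odd line with a `>` in the middle. The pattern `^>` is simple enough that it behaves the same without `-P`, so that option is dropped here to keep the sketch portable.

```shell
# Make a tiny, made-up file; the second line is deliberately odd
# (a '>' in the middle of a line) to show what the anchor does.
demo_fa=$(mktemp)
cat > "$demo_fa" <<'EOF'
>chrDemo1 toy contig
ACGT>ACGT
>chrDemo2 toy contig
GGCC
EOF

# '^>' matches '>' only at the start of a line, i.e. the header lines.
n_headers=$(grep -c '^>' "$demo_fa")

# '>' with no anchor matches any line containing '>' anywhere.
n_any=$(grep -c '>' "$demo_fa")

echo "header lines: $n_headers, lines containing '>': $n_any"
```

Here the anchored pattern counts 2 lines while the unanchored one counts 3, so the anchor is what restricts the match to FASTA header lines.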
...