...
| Code Block |
|---|
mkdir -p $SCRATCH/my_rnaseq_course/day_4b/mirbase cd $SCRATCH/my_rnaseq_course/day_4b/mirbase cp /corral-repl/utexas/BioITeam/rnaseq_course_2015/day_4b/human_mirnaseq.fastq.gz . less human_mirnaseq.fastq.gz |
...
hairpin_cDNA_hsa.fa .
less hairpin_cDNA_hsa.fa |
If you run commands like wc -l or less on the reference FASTA, you will see how reduced the sequence space is. Additionally, each "contig" is a feature, meaning that rather than count reads in a genome (using something like HTSeq), we could simply count reads per contig, which is substantially faster. You can do this with ANY reference FASTA data, since most aligners can accept any FASTA as a reference index. To build the index:
| Code Block |
|---|
module load perl
module load bowtie/2.2.0
bowtie2-build hairpin_cDNA_hsa.fa hairpin_cDNA_hsa.fa |
That was fast! For comparison, doing the same command with the hg19 reference FASTA would take several hours (and should never be done on a login node). If you ls, you will see:
| Code Block |
|---|
hairpin_cDNA_hsa.fa
hairpin_cDNA_hsa.fa.1.bt2
hairpin_cDNA_hsa.fa.2.bt2
hairpin_cDNA_hsa.fa.3.bt2
hairpin_cDNA_hsa.fa.4.bt2
hairpin_cDNA_hsa.fa.rev.1.bt2
hairpin_cDNA_hsa.fa.rev.2.bt2 |
These files, together, constitute the bowtie2 reference index.
Data Staging
We will first copy them over from the BioITeam area on Corral, stage them in a directory in your scratch area, and look at them a little bit. The commands to do that would look something like this:
| Code Block |
|---|
mkdir -p $SCRATCH/my_rnaseq_course/day_4b
cd $SCRATCH/my_rnaseq_course/day_4b
cp /corral-repl/utexas/BioITeam/rnaseq_course_2015/day_4b/human_mirnaseq.fastq.gz .
less human_mirnaseq.fastq.gz |
...
The third line has the name attached after the "+", which is an artifact of a storage method that we won't go into here. However, everything else is basically the same - read name, followed by sequence, strand, and quality scores. However, note the string of A's towards the end. This is because, as for many very short RNAs, our read extends past the actual RNA fragment. In this case, the 'adapter' sequence is obvious - it is just a poly-A string. However, what if it wasn't? Indeed, working with publicly available small RNA data, you will often not know what the adapter is (it may not be obvious), or you might not even know if for data coming from your lab (if we're being honest).
Bowtie2 Local Alignment
So, we would like to use an alignment strategy that can intelligently ignore the parts that won't align to a reference (the 'adapter') and align correctly the parts that align well. This is called a 'local' alignment, in contrast to a 'global' alignment, which would count the 'mismatches' in the adapter against the alignment score. Fortunately, you have already used a local-alignment-capable aligner in this class. Tophat2 runs on the Bowtie2 alignment engine, which (if used directly, i.e. not with Tophat2), can perform local alignment.
To run the alignment, we execute a command that is very similar to BWA or Tophat2, but with different syntax:
| Code Block |
|---|
bowtie2 --local -L 16 -x mirbase/hairpin_cDNA_hsa.fa -U human_mirnaseq.fastq.gz -S human_mirnaseq. |
...
sam |
| Expand | ||
|---|---|---|
| ||
Parameters are:
|
You can create and submit a cmds file like you have already learned how to do - this alignment would only take around 30 minutes or less. However, if TACC is being very slow (which is likely), than you can get what the results will look like by copying the result to your area with the following commands
| Code Block |
|---|
cd $SCRATCH/my_rnaseq_course/day_4b
cp /corral-repl/utexas/BioITeam/rnaseq_course_2015/day_4b/human_mirnaseq.sam . |
miRNA Profiling
Recall that, because each contig in our reference is a feature, we do not need a true annotation to quantify our miRNAs. Instead, we just need to count how many reads aligned to each contig, and sort them accordingly. Fortunately, samtools has a simple utility called idxstats that reports exactly this. The following commands will produce this information by converting the SAM file to BAM, sorting and indexing it, then running idxstats:
| Code Block |
|---|
module load samtools/0.1.19
cd $SCRATCH/my_rnaseq_course/day_4b
samtools view -bS human_mirnaseq.sam > human_mirnaseq.bam
samtools sort human_mirnaseq.bam human_mirnaseq.sorted
samtools index human_mirnaseq.sorted.bam
samtools idxstats human_mirnaseq.sorted.bam |
Exercise #3: Ribonucleoprotein Immunoprecipitation and Sequencing (RIP-seq)
...