Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Code Block
mkdir -p $SCRATCH/my_rnaseq_course/day_4b/mirbase
cd $SCRATCH/my_rnaseq_course/day_4b/mirbase
cp /corral-repl/utexas/BioITeam/rnaseq_course_2015/day_4b/human_mirnaseq.fastq.gz .
less human_mirnaseq.fastq.gz

 

...

hairpin_cDNA_hsa.fa .
less hairpin_cDNA_hsa.fa

If you run commands like wc -l or less on the reference FASTA, you will see how reduced the sequence space is.  Additionally, each "contig" is a feature, meaning that rather than count reads in a genome (using something like HTSeq), we could simply count reads per contig, which is substantially faster.  You can do this with ANY reference FASTA data, since most aligners can accept any FASTA as a reference index.  To build the index:

Code Block
module load perl
module load bowtie/2.2.0
bowtie2-build hairpin_cDNA_hsa.fa hairpin_cDNA_hsa.fa

That was fast!  For comparison, doing the same command with the hg19 reference FASTA would take several hours (and should never be done on a login node).  If you ls, you will see:

Code Block
hairpin_cDNA_hsa.fa
hairpin_cDNA_hsa.fa.1.bt2
hairpin_cDNA_hsa.fa.2.bt2
hairpin_cDNA_hsa.fa.3.bt2
hairpin_cDNA_hsa.fa.4.bt2
hairpin_cDNA_hsa.fa.rev.1.bt2
hairpin_cDNA_hsa.fa.rev.2.bt2

These files, together, constitute the bowtie2 reference index.

Data Staging

We will first copy them over from the BioITeam area on Corral, stage them in a directory in your scratch area, and look at them a little bit.  The commands to do that would look something like this:

Code Block
mkdir -p $SCRATCH/my_rnaseq_course/day_4b
cd $SCRATCH/my_rnaseq_course/day_4b
cp /corral-repl/utexas/BioITeam/rnaseq_course_2015/day_4b/human_mirnaseq.fastq.gz .
less human_mirnaseq.fastq.gz

...

The third line has the name attached after the "+", which is an artifact of a storage method that we won't go into here. However, everything else is basically the same - read name, followed by sequence, strand, and quality scores.  However, note the string of A's towards the end.  This is because, as for many very short RNAs, our read extends past the actual RNA fragment.  In this case, the 'adapter' sequence is obvious - it is just a poly-A string.  However, what if it wasn't?  Indeed, working with publicly available small RNA data, you will often not know what the adapter is (it may not be obvious), or you might not even know if for data coming from your lab (if we're being honest).

Bowtie2 Local Alignment

So, we would like to use an alignment strategy that can intelligently ignore the parts that won't align to a reference (the 'adapter') and align correctly the parts that align well.  This is called a 'local' alignment, in contrast to a 'global' alignment, which would count the 'mismatches' in the adapter against the alignment score.  Fortunately, you have already used a local-alignment-capable aligner in this class.  Tophat2 runs on the Bowtie2 alignment engine, which (if used directly, i.e. not with Tophat2), can perform local alignment.

To run the alignment, we execute a command that is very similar to BWA or Tophat2, but with different syntax:

Code Block
bowtie2 --local -L 16 -x mirbase/hairpin_cDNA_hsa.fa -U human_mirnaseq.fastq.gz -S human_mirnaseq.

...

sam
Expand
titleWhat's going on?

Parameters are:

  • --local – local alignment mode
  • -L 16 – seed length 16
  • -x  mirbase/hairpin_cDNA_hsa.fa – prefix path of index files
  • -U human_mirnaseq.fastq.gz – FASTQ file for single-end (Unpaired) alignment
  • -S human_mirnaseq.sam – tells bowtie2 to report alignments in SAM format to the specified file

You can create and submit a cmds file like you have already learned how to do - this alignment would only take around 30 minutes or less.  However, if TACC is being very slow (which is likely),  than you can get what the results will look like by copying the result to your area with the following commands

Code Block
cd $SCRATCH/my_rnaseq_course/day_4b
cp /corral-repl/utexas/BioITeam/rnaseq_course_2015/day_4b/human_mirnaseq.sam .

miRNA Profiling

Recall that, because each contig in our reference is a feature, we do not need a true annotation to quantify our miRNAs.  Instead, we just need to count how many reads aligned to each contig, and sort them accordingly.  Fortunately, samtools has a simple utility called idxstats that reports exactly this.  The following commands will produce this information by converting the SAM file to BAM, sorting and indexing it, then running idxstats:

Code Block
module load samtools/0.1.19
cd $SCRATCH/my_rnaseq_course/day_4b
samtools view -bS human_mirnaseq.sam > human_mirnaseq.bam
samtools sort human_mirnaseq.bam human_mirnaseq.sorted
samtools index human_mirnaseq.sorted.bam
samtools idxstats human_mirnaseq.sorted.bam

 

 

 

Exercise #3: Ribonucleoprotein Immunoprecipitation and Sequencing (RIP-seq)

...