Dhivya's suggestions for QC cutoffs

Disclaimer:

These are the QC cutoffs that I (and some of the other bioinformaticians in our group) use with typical bulk RNA-Seq data. As always, please consider your dataset and its properties before deciding whether they are the best choice for you as well.


  1. Median quality per base:  Most RNA-seq datasets generated after 2018 have a median base quality in the 30-40 range across the entire read length. This is what I usually expect to see. If I see a significant drop-off in quality (<20) in the last x bases, what I do depends on the project: i. for mapping+differential expression projects, I do nothing; ii. for assembly projects, I trim those bases off.
  2. Sequence duplication:  It is typical to see duplication rates anywhere from 10-50% for standard bulk RNA-Seq data. Because we cannot tell whether these are true PCR duplicates or the result of highly expressed genes being covered by many reads, I do nothing about this. For tag-seq data, I or the core will run preprocessing scripts that use the internal barcode to remove duplicate reads caused by PCR amplification. Similarly, in single-cell RNA-Seq data, duplicates are handled using UMIs by preprocessing tools like Cell Ranger.
  3. Adaptor contamination: For mapping+differential expression projects, I do nothing if I see adaptor contamination below 30%, because mappers will simply soft-clip those regions. For assembly projects, I use Cutadapt or Trimmomatic to remove the adaptors.
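
The three rules above can be written down as a small decision helper. This is only a sketch in Python: the function name `qc_actions`, its parameters, and the project labels `"de"` and `"assembly"` are hypothetical, and the thresholds are simply the ones quoted in the list. The list does not say what to do when adaptor contamination reaches 30% in a differential expression project, so the sketch only flags that case for manual inspection (an assumption).

```python
def qc_actions(tail_median_quality, adaptor_fraction, project):
    """Suggest QC actions for one bulk RNA-seq sample.

    tail_median_quality: lowest median base quality over the last bases of the reads
    adaptor_fraction:    fraction of reads with adaptor content, from 0.0 to 1.0
    project:             "de" (mapping + differential expression) or "assembly"
    """
    if project not in ("de", "assembly"):
        raise ValueError("project must be 'de' or 'assembly'")

    actions = []
    if project == "assembly":
        # Assembly projects: trim the low-quality tail and remove adaptors.
        if tail_median_quality < 20:
            actions.append("trim low-quality tail bases")
        if adaptor_fraction > 0:
            actions.append("remove adaptors (Cutadapt or Trimmomatic)")
    elif adaptor_fraction >= 0.30:
        # Assumption: the list gives no rule for this case, so only flag it.
        actions.append("adaptor content at or above 30%: inspect manually")
    # Duplication is deliberately ignored for standard bulk RNA-seq.
    return actions


print(qc_actions(15, 0.10, "de"))        # nothing to do
print(qc_actions(15, 0.10, "assembly"))  # trim and remove adaptors
```

Keeping the rules in one function makes it easy to swap in your own cutoffs once you have looked at your own data.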


What's a good coverage for bulk RNA-seq data?

  1. ENCODE RNA-seq depth recommendations
  2. Typically, I find that 10x coverage of the transcriptome is enough for differential expression analysis with bulk RNA-seq. Spend your money on sequencing as many replicates as possible.
  3. For assembling a transcriptome, or for identifying rare or novel transcripts, 30-40x coverage in bulk RNA-seq would be good.
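
To turn a coverage target into a number of reads, a rough back-of-the-envelope calculation is: average coverage = (number of reads × read length) / transcriptome length. The Python sketch below shows that arithmetic. The transcriptome size (100 Mb) and read length (100 bp) are hypothetical values chosen only to keep the numbers round; for paired-end data, count each mate as one read. Keep in mind that this is an average, and real RNA-seq coverage is far from uniform because expression levels differ between genes.

```python
import math


def mean_coverage(n_reads, read_length_bp, transcriptome_bp):
    """Average coverage = total sequenced bases / transcriptome length."""
    return n_reads * read_length_bp / transcriptome_bp


def reads_needed(target_coverage, read_length_bp, transcriptome_bp):
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * transcriptome_bp / read_length_bp)


# Hypothetical example values: 100 Mb transcriptome, 100 bp reads.
TRANSCRIPTOME_BP = 100_000_000
READ_LENGTH_BP = 100

print(reads_needed(10, READ_LENGTH_BP, TRANSCRIPTOME_BP))  # differential expression
print(reads_needed(30, READ_LENGTH_BP, TRANSCRIPTOME_BP))  # assembly / novel transcripts
```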


Mapping suggestions:

  1. For mapping standard bulk RNA-seq to the transcriptome, use Kallisto. It's fast and accurate. BWA is another good option, but not as fast.
  2. For mapping tag-seq data or single cell RNA-Seq data to the genome, use STAR.
  3. If you are interested in identifying novel transcripts, map to the genome using a spliced aligner like STAR or HISAT2.
  4. Mapping %:  A good mapping percentage for RNA-Seq is typically 70% or above, provided you are mapping to a well-assembled transcriptome/genome reference from the same species as your data.
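
The mapping percentage is simply the mapped reads divided by the total reads, times 100. A minimal Python sketch of the check against the 70% guideline follows. The function names and the read counts are hypothetical; in practice you would take the two counts from your mapper's summary output.

```python
def mapping_percent(mapped_reads, total_reads):
    """Percentage of input reads that the mapper placed on the reference."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * mapped_reads / total_reads


def mapping_ok(mapped_reads, total_reads, cutoff=70.0):
    """True if the mapping percentage meets the cutoff (70% by default)."""
    return mapping_percent(mapped_reads, total_reads) >= cutoff


# Hypothetical counts for two samples.
print(mapping_percent(8_500_000, 10_000_000), mapping_ok(8_500_000, 10_000_000))
print(mapping_percent(6_000_000, 10_000_000), mapping_ok(6_000_000, 10_000_000))
```

A sample that falls below the cutoff is worth a closer look, for example to check that the reference is well assembled and from the right species.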

