Disclaimer:
These are QC cutoffs that I (and some of the other bioinformaticians in our group) use with typical bulk RNA-Seq data. But, as always, please consider your dataset and its properties before deciding whether these are the best for you as well.
- Median quality per base: Most RNA-Seq datasets generated after 2018 have a median base quality in the 30-40 range across the entire read length, and this is what I usually expect to see. If I see a significant drop-off in quality (<20) in the last x bases, what I do depends on the project: (i) for mapping + differential expression projects, I do nothing; (ii) for assembly projects, I trim those bases off.
- Sequence duplication: It is typical to see duplication rates anywhere from 10-50% for standard bulk RNA-Seq data. Because we cannot tell whether these come from true PCR duplicates or from highly expressed genes being covered by many reads, I do nothing about this. For tag-seq data, I or the core will run preprocessing scripts that use the internal barcode to remove duplicate reads caused by PCR amplification. Similarly, in single-cell RNA-Seq data, duplicates are handled using UMIs by preprocessing tools like Cell Ranger.
- Adaptor contamination: For mapping + differential expression projects, I do nothing if adaptor contamination is below 30%, because mappers will simply soft-clip those regions. For assembly projects, I use Cutadapt or Trimmomatic to remove the adaptors.
- Mapping %: A good mapping percentage for RNA-Seq is typically 70% or above, provided you are mapping to a well-assembled transcriptome/genome reference from the same species as your data.
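
As a rough illustration, the cutoffs above could be written down as a small decision helper. This is only a sketch: the `QCSummary` fields, the `recommend_actions` function, and the `"de"`/`"assembly"` project labels are hypothetical names made up for this example, and you would need to fill in the numbers yourself from your own QC reports. Duplication is left out on purpose, since nothing is done about it for standard bulk RNA-Seq.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QCSummary:
    """Hypothetical per-sample summary; field names are illustrative only."""
    min_tail_median_quality: float  # lowest median base quality in the last x bases
    adaptor_pct: float              # % of reads with adaptor contamination
    mapping_pct: float              # % of reads mapped to the reference


def recommend_actions(qc: QCSummary, project: str) -> List[str]:
    """Return suggested follow-ups for project type "de" or "assembly"."""
    if project not in ("de", "assembly"):
        raise ValueError("project must be 'de' or 'assembly'")

    actions = []

    # Quality drop-off (<20) at the read end: only trimmed for assembly projects.
    if project == "assembly" and qc.min_tail_median_quality < 20:
        actions.append("trim low-quality bases from read ends")

    # Adaptors: removed for assembly; ignored for mapping + DE when below 30%.
    if project == "assembly" and qc.adaptor_pct > 0:
        actions.append("remove adaptors (Cutadapt or Trimmomatic)")
    elif project == "de" and qc.adaptor_pct >= 30:
        # Assumption: the notes do not say what to do at 30% or more,
        # so this sketch only flags the sample for a manual look.
        actions.append("adaptor contamination >= 30%: inspect manually")

    # Mapping rate: 70% or above is considered good.
    if qc.mapping_pct < 70:
        actions.append("mapping below 70%: check reference quality and species")

    return actions


sample = QCSummary(min_tail_median_quality=18, adaptor_pct=12, mapping_pct=65)
print(recommend_actions(sample, "de"))
print(recommend_actions(sample, "assembly"))
```

For the same sample, the mapping + differential expression call only flags the low mapping rate, while the assembly call also asks for quality trimming and adaptor removal. Treat the thresholds as starting points to adjust for your own data, not fixed rules.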