FASTQ Quality Assurance Tools
The first order of business after receiving sequencing data should be to check your data quality. This often-overlooked step helps guide the manner in which you process the data, and can prevent many headaches.
FastQC
FastQC is a tool that produces a quality analysis report on FASTQ files.
First and foremost, the FastQC "Summary" should generally be ignored. Its "grading scale" (green - good, yellow - warning, red - failed) incorporates assumptions for a particular kind of experiment, and is not applicable to most real-world data. Instead, look through the individual reports and evaluate them according to your experiment type.
The FastQC reports I find most useful are:
1. The Per base sequence quality report, which can help you decide if sequence trimming is needed before alignment.
2. The Per Sequence Quality Score report, which can tell you if a subset of your reads just have poor quality scores. These reads can be completely filtered from analysis.
3. The Sequence Duplication Levels report, which helps you evaluate library enrichment / complexity. But note that different experiment types are expected to have vastly different duplication profiles.
4. The Overrepresented Sequences report, which helps you look for dominant sequences.
5. The Adapter Content report, which tells you about adapter contamination.
Note: For many of its reports, FastQC analyzes only the first 200,000 sequences in order to keep processing and memory requirements down.
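To make the idea behind the Per Sequence Quality Score report more concrete, here is a minimal sketch that computes the mean Phred quality of each read with awk. It assumes a FASTQ file with exactly four lines per record and Phred+33 quality encoding. The function name mean_read_quality and the tiny two-read input are made up for illustration; this is not how FastQC itself is implemented.

```shell
# Sketch: mean Phred quality per read (assumes 4-line FASTQ records, Phred+33).
# mean_read_quality is a hypothetical helper name, not part of FastQC.
mean_read_quality() {
  awk '
    BEGIN {
      # Build a character -> ASCII code lookup table (awk has no ord()).
      for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i
    }
    NR % 4 == 1 { name = substr($0, 2) }   # header line, drop the leading "@"
    NR % 4 == 0 {                          # quality line
      n = length($0); sum = 0
      for (j = 1; j <= n; j++) sum += ord[substr($0, j, 1)] - 33
      printf "%s\t%.1f\n", name, (n > 0 ? sum / n : 0)
    }
  ' "$1"
}

# Tiny made-up input: one high-quality read and one poor read.
demo_fq=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n####\n' > "$demo_fq"
mean_read_quality "$demo_fq"
# prints (tab-separated):
# read1   40.0
# read2   2.0
```

Reads like read2 here, whose mean quality is far below the rest, are the ones this report is designed to expose.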
Running FastQC
FastQC is available on lonestar5 as a module.
Here's how to run FastQC on our sample data:
module load biocontainers
module load fastqc
fastqc data/Sample1_R1.fastq
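By default FastQC writes its results next to the input file. If you would rather keep them separate, the sketch below sends them to a dedicated directory with FastQC's -o option. The wrapper name run_fastqc and the directory name fastqc_out are placeholders I made up for this example, and it assumes the fastqc command is already on your path (for instance after the module load commands).

```shell
# Sketch: run FastQC into a dedicated output directory.
# run_fastqc and fastqc_out are hypothetical names chosen for this example.
run_fastqc() {
  local fastq="$1" outdir="${2:-fastqc_out}"
  if ! command -v fastqc >/dev/null 2>&1; then
    echo "fastqc not found; load the module first" >&2
    return 1
  fi
  mkdir -p "$outdir"             # -o requires that the directory already exists
  fastqc -o "$outdir" "$fastq"
  ls -l "$outdir"                # show what FastQC produced
}

# Usage (after loading the fastqc module):
#   run_fastqc data/Sample1_R1.fastq
```

Wrapping the call in a function is only a convenience; the single line fastqc -o fastqc_out data/Sample1_R1.fastq does the same thing once the directory exists.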
Exercise: FastQC results
What did FastQC create?
Looking at FastQC output
You can't run a web browser directly from your "dumb terminal" command line environment, so the FastQC results have to be transferred to your own computer before you can look at the HTML report.
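If you only want a quick look before transferring anything, FastQC's zip archive also contains a plain-text file, fastqc_data.txt, that can be read in the terminal. The sketch below pulls out the per-base quality table from that file. The helper name per_base_quality is hypothetical, and the directory name in the usage comment assumes the archive for our sample was unpacked as Sample1_R1_fastqc.

```shell
# Sketch: print the per-base quality table from FastQC's text output.
# per_base_quality is a hypothetical helper. It relies on the layout of
# fastqc_data.txt, where each module starts with ">>Name" and ends with
# ">>END_MODULE".
per_base_quality() {
  awk '
    /^>>Per base sequence quality/ { inside = 1; next }   # module start
    /^>>END_MODULE/                { inside = 0 }         # module end
    inside                         { print }              # table rows
  ' "$1"
}

# Usage, assuming the zip was unpacked (for example with unzip) into
# Sample1_R1_fastqc/:
#   per_base_quality Sample1_R1_fastqc/fastqc_data.txt | cut -f 1,2
```

The first two columns are the base position and the mean quality at that position, which is usually enough for a rough impression. The HTML report is still the better way to judge the full distribution.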
Exercise: Should we trim this data?
Based on this FastQC output, should we trim this data?
Let's look at tools that can perform such manipulations on FASTQ files, should we need to.