The first order of business after receiving sequencing data should be to check your data quality. This often-overlooked step helps guide the manner in which you process the data, and can prevent many headaches.
FastQC is a tool that produces a quality analysis report on FASTQ files.
First and foremost, the FastQC "Summary" should generally be ignored. Its "grading scale" (green - good, yellow - warning, red - failed) incorporates assumptions for a particular kind of experiment, and is not applicable to most real-world data. Instead, look through the individual reports and evaluate them according to your experiment type.
The FastQC reports I find most useful are:
2. The Per Sequence Quality Score report, which can tell you whether a subset of your reads has poor quality scores overall. Such reads can be filtered out of the analysis entirely.
3. The Sequence Duplication Levels report, which helps you evaluate library enrichment / complexity. But note that different experiment types are expected to have vastly different duplication profiles.
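If the Per Sequence Quality Score report shows a distinct population of low-quality reads, one way to drop them is to filter on each read's mean quality. The following is only a minimal sketch: it assumes Phred+33 quality encoding and strict four-line FASTQ records, and the function name `filter_by_mean_quality` and the cutoff of 20 are illustrative choices of mine, not part of FastQC.

```shell
# Hypothetical helper: keep only reads whose mean Phred quality is >= min_q.
# Assumes Phred+33 encoding and strict 4-line FASTQ records (no wrapped lines).
# Usage: filter_by_mean_quality MIN_Q FILE.fastq
filter_by_mean_quality() {
    awk -v min_q="$1" '
        BEGIN {
            # awk has no ord(), so build a character -> ASCII code table.
            for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
        }
        { rec[NR % 4] = $0 }           # buffer the lines of the current record
        NR % 4 == 0 {                  # 4th line of a record = quality string
            n = length($0); sum = 0
            for (i = 1; i <= n; i++) sum += ord[substr($0, i, 1)] - 33
            if (n > 0 && sum / n >= min_q)
                print rec[1] "\n" rec[2] "\n" rec[3] "\n" rec[0]
        }
    ' "$2"
}
```

For example, `filter_by_mean_quality 20 data/Sample1_R1.fastq > filtered.fastq` would write the surviving reads to `filtered.fastq` (an output name chosen here for illustration). For paired-end data the mate file would need matching filtering, which this sketch does not handle.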
DETOUR: What are PCR duplicates?
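To get a rough feel for duplication outside of FastQC, you can count exact-duplicate sequences yourself. This is a sketch under stated assumptions, not FastQC's own algorithm: it assumes four-line FASTQ records, compares whole read sequences, and sorts the entire file, so it may be slow on large data. The function name `duplication_histogram` is made up for this example. Keep in mind that identical sequences are not necessarily PCR duplicates.

```shell
# Hypothetical helper: tabulate how many distinct sequences occur 1x, 2x, 3x, ...
# Output columns (tab-separated): copies, number of distinct sequences.
# Usage: duplication_histogram FILE.fastq
duplication_histogram() {
    awk 'NR % 4 == 2' "$1" |          # keep only the sequence line of each record
        sort | uniq -c |              # count occurrences of each distinct sequence
        awk '{ print $1 }' |          # keep just the per-sequence counts
        sort -n | uniq -c |           # count how many sequences share each count
        awk '{ print $2 "\t" $1 }'
}
```

A library dominated by the `1` row is mostly unique reads; large counts in the higher rows indicate heavy duplication, which may or may not be expected for your experiment type.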

Note: For many of its reports, FastQC analyzes only the first 200,000 sequences in order to keep processing and memory requirements down.
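Because each FASTQ record occupies four lines, the first 200,000 sequences correspond to the first 800,000 lines of the file. The helpers below are a small sketch of that arithmetic, assuming an uncompressed FASTQ file with strict four-line records; the names `count_reads` and `first_n_reads` are hypothetical.

```shell
# Hypothetical helper: number of reads = number of lines / 4.
# Usage: count_reads FILE.fastq
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Hypothetical helper: print the first N reads (N * 4 lines).
# Usage: first_n_reads N FILE.fastq
first_n_reads() {
    head -n $(( $1 * 4 )) "$2"
}
```

For instance, `first_n_reads 200000 data/Sample1_R1.fastq` would print roughly the portion of the file that such a report is based on.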
FastQC is available on lonestar5 as a module.
Here's how to run FastQC on our sample data:

    module load biocontainers
    module load fastqc
    fastqc data/Sample1_R1.fastq
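If you would rather keep the reports separate from your data, FastQC's `-o` option sends its output to another directory, which must already exist. The wrapper below is a minimal sketch of that pattern; the function name `run_fastqc` and the directory name in the usage example are my own choices, and it assumes the `fastqc` command is already on your path (for example after the module loads above).

```shell
# Hypothetical wrapper: write FastQC reports into a separate directory.
# Usage: run_fastqc OUTDIR FILE.fastq [MORE_FILES...]
run_fastqc() {
    outdir="$1"
    shift
    mkdir -p "$outdir"           # fastqc -o does not create the directory itself
    fastqc -o "$outdir" "$@"     # remaining arguments are the FASTQ files
}
```

Usage would look like `run_fastqc fastqc_out data/Sample1_R1.fastq`.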
Exercise: FastQC results
What did FastQC create?
You can't run a web browser directly from your "dumb terminal" command line environment. The FastQC results must be transferred to your own computer before you can view the HTML report.
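One common way to do the transfer is `scp`, run from a terminal on your own computer rather than on the cluster. This is only a sketch: the function name `fetch_fastqc_report` is hypothetical, and the login string and remote path in the usage line are placeholders you would replace with your own.

```shell
# Hypothetical helper: copy a FastQC HTML report from the cluster to the
# current directory. Run this on your own computer, not on the cluster.
# Usage: fetch_fastqc_report USER@HOST REMOTE_PATH
fetch_fastqc_report() {
    remote="$1"    # login string, e.g. your_username@cluster_hostname (placeholder)
    report="$2"    # path to the report on the cluster (relative to your home directory)
    scp "${remote}:${report}" .
}
```

For example, `fetch_fastqc_report your_username@cluster_hostname path/to/report.html`, after which the downloaded file can be opened in any web browser.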
Exercise: Should we trim this data?
Based on this FastQC output, should we trim this data?
The Per base sequence quality report shows that trimming the last 10 bases may be a good idea.
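To make concrete what "trimming the last 10 bases" means, here is a minimal awk sketch that shortens both the sequence and the quality line of every record by the same amount. It assumes strict four-line FASTQ records; the function name `trim_last_bases` is invented for illustration, and dedicated trimming tools are usually the better choice for real data.

```shell
# Hypothetical helper: remove the last N bases (and their quality values)
# from every read. Assumes strict 4-line FASTQ records.
# Usage: trim_last_bases N FILE.fastq
trim_last_bases() {
    awk -v n="$1" '
        # Lines 2 and 4 of each record are the sequence and quality strings.
        NR % 2 == 0 { print substr($0, 1, length($0) - n); next }
        { print }    # header and "+" lines pass through unchanged
    ' "$2"
}
```

For example, `trim_last_bases 10 data/Sample1_R1.fastq > trimmed.fastq` would write the trimmed reads to `trimmed.fastq` (a file name chosen here for illustration).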
Let's look at tools that can perform such manipulations on FASTQ files, should we need them.