The first order of business after receiving sequencing data should be to check your data quality. This often-overlooked step helps guide the manner in which you process the data, and can prevent many headaches.
FastQC
FastQC is a tool that produces a quality analysis report on FASTQ files.
Useful links:
- FastQC report for a good Illumina dataset
- FastQC report for a bad Illumina dataset
- Online documentation for each FastQC report
First and foremost, the FastQC "Summary" should generally be ignored. Its "grading scale" (green - good, yellow - warning, red - failed) incorporates assumptions for a particular kind of experiment, and is not applicable to most real-world data. Instead, look through the individual reports and evaluate them according to your experiment type.
The FastQC reports I find most useful are:
- The Per base sequence quality report, which can help you decide if sequence trimming is needed before alignment.
- The Sequence Duplication Levels report, which helps you evaluate library enrichment / complexity. But note that different experiment types are expected to have vastly different duplication profiles.
- The Overrepresented Sequences report, which helps evaluate adapter contamination.
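To get a feel for what the Per base sequence quality report summarizes, here is a minimal sketch that computes the mean quality score at each read position. It is not how FastQC itself is implemented. It assumes a FASTQ file with exactly four lines per record and Phred+33 quality encoding, and the function name per_base_mean_quality is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: mean Phred quality at each read position.
# Assumes 4 lines per FASTQ record and Phred+33 encoding (both assumptions).
# per_base_mean_quality is a hypothetical helper name.
per_base_mean_quality() {
    awk '
        BEGIN {
            # Build a character -> ASCII code lookup table.
            for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i
        }
        NR % 4 == 0 {                 # every 4th line is the quality string
            n = length($0)
            if (n > maxlen) maxlen = n
            for (p = 1; p <= n; p++) {
                sum[p] += ord[substr($0, p, 1)] - 33
                cnt[p]++
            }
        }
        END {
            # One line per position: position, then mean quality.
            for (p = 1; p <= maxlen; p++)
                printf "%d\t%.1f\n", p, sum[p] / cnt[p]
        }
    ' "$1"
}
# Usage: per_base_mean_quality data/SRR030257_1.fastq
```

If the mean quality drops steadily toward the end of the reads, that is the pattern that usually suggests trimming before alignment.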
Running FastQC
We use the command-line version of FastQC here (downloaded from the Babraham Bioinformatics web site; interactive GUI versions are also available for Windows and Macintosh). The commands below load it through the TACC module system; if no fastqc module is available on your system, a copy has also been installed in the $BI/bin/FastQC directory.
FastQC creates a sub-directory for each analyzed FASTQ file, so copy the file you want to look at into a local directory first. Here's how to run FastQC:
    module load fastqc
    fastqc data/SRR030257_1.fastq
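The copy-then-run step can be sketched as a small shell function. This is only an illustration under a few assumptions: the function name run_fastqc_on_copy and the working directory name in the usage line are hypothetical, and fastqc is assumed to already be on your PATH (for example after module load fastqc). The -o option tells FastQC which existing directory to write its output to.

```shell
#!/usr/bin/env bash
# Sketch only: copy a FASTQ file into a working directory, then run FastQC on the copy.
# run_fastqc_on_copy and its arguments are hypothetical names for illustration.
run_fastqc_on_copy() {
    local src="$1" workdir="$2"
    mkdir -p "$workdir" || return 1
    cp "$src" "$workdir/" || return 1
    local copy="$workdir/$(basename "$src")"
    if command -v fastqc >/dev/null 2>&1; then
        # -o names the (already existing) directory where output is written.
        fastqc -o "$workdir" "$copy"
    else
        echo "fastqc not found on PATH; staged $copy only" >&2
    fi
}
# Usage (directory name is an assumption):
#   run_fastqc_on_copy data/SRR030257_1.fastq fastqc_test
```

Working on a copy keeps the FastQC output out of the shared data directory.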
Exercise: FastQC results
What did FastQC create?
Answer: The Sample_Yeast_L005_R1.cat.fastq.gz file is what we analyzed, so FastQC created the other two items. Sample_Yeast_L005_R1.cat_fastqc is a directory (the "d" at the start of "drwxrwxr-x" in an ls -l listing), so use ls Sample_Yeast_L005_R1.cat_fastqc to see what's in it. Sample_Yeast_L005_R1.cat_fastqc.zip is just a zipped (compressed) version of the whole directory.
Looking at FastQC output
You can't run a web browser directly from your "dumb terminal" command line environment. The FastQC results have to be placed where a web browser can access them. We put a copy at this URL:
    http://web.corral.tacc.utexas.edu/BioITeam/
Exercise: Should we trim this data?
Based on this FastQC output, should we trim this data?
Answer: The Per base sequence quality report does not look good. The data should probably be trimmed (to 40 or 50 bp) before alignment.
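To make "trimmed to 40 or 50 bp" concrete, here is a minimal sketch of a fixed-length trim using awk. It assumes an uncompressed FASTQ file with exactly four lines per record; the function name trim_fastq_to_length is hypothetical, and in practice a dedicated trimming tool would normally be used instead.

```shell
#!/usr/bin/env bash
# Sketch only: hard-trim every read to at most N bases.
# Assumes an uncompressed FASTQ file with 4 lines per record.
# trim_fastq_to_length is a hypothetical helper name.
trim_fastq_to_length() {
    local len="$1" fastq="$2"
    awk -v len="$len" '
        # Lines 2 and 4 of each record hold the bases and the qualities;
        # cut both to the same length so they stay in sync.
        NR % 2 == 0 { print substr($0, 1, len); next }
        # Lines 1 and 3 (read name and "+" separator) pass through unchanged.
        { print }
    ' "$fastq"
}
# Usage: trim_fastq_to_length 50 data/SRR030257_1.fastq > trimmed.fastq
```

Reads that are already shorter than the requested length are left as they are.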
Let's look at tools for performing such manipulations on FASTQ files, should we need to.