The first order of business after receiving sequencing data should be to check your data quality. This often-overlooked step helps guide the manner in which you process the data, and can prevent many headaches.

FastQC

FastQC is a tool that produces a quality analysis report on FASTQ files.


First and foremost, the FastQC "Summary" should generally be ignored. Its "grading scale" (green - good, yellow - warning, red - failed) incorporates assumptions for a particular kind of experiment, and is not applicable to most real-world data. Instead, look through the individual reports and evaluate them according to your experiment type.

The FastQC reports I find most useful are:

1. The Per base sequence quality report, which can help you decide whether sequence trimming is needed before alignment.

2. The Per sequence quality scores report, which can tell you whether a subset of your reads simply has poor quality scores. Such reads can be filtered out of the analysis entirely.

3. The Sequence Duplication Levels report, which helps you evaluate library enrichment / complexity. Note, however, that different experiment types are expected to have vastly different duplication profiles.

   DETOUR: What are PCR duplicates?

4. The Overrepresented sequences report, which helps you spot individual sequences that dominate the library.

5. The Adapter Content report, which shows how much adapter contamination is present.



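Note: For some of its reports (notably Sequence Duplication Levels and Overrepresented sequences), FastQC analyzes only sequences from the start of the file in order to keep processing and memory requirements down. Older releases document this limit as the first 200,000 sequences; more recent releases document 100,000.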
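To judge how much of a file such a leading subset covers, it can help to count the reads first. The following is a minimal sketch, assuming an uncompressed FASTQ file with exactly four lines per record; `count_reads` is a hypothetical helper name, not part of FastQC.

```shell
# Count reads in an uncompressed FASTQ file.
# Assumes the standard 4-lines-per-record layout (no wrapped sequence lines).
count_reads() {
    awk 'END { printf "%d\n", NR / 4 }' "$1"
}

# Report the read count for the sample file, if it is present.
if [ -f data/Sample1_R1.fastq ]; then
    count_reads data/Sample1_R1.fastq
fi
```

If the count is far above the sampling limit, the affected reports describe only the beginning of the file.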

Running FastQC

FastQC is available on lonestar5 as a module.

Here's how to run FastQC on our sample data:

module load biocontainers
module load fastqc
fastqc data/Sample1_R1.fastq
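When there are several files to check, it may be tidier to send all reports to one directory. The sketch below wraps the same command using FastQC's `-o` (output directory) and `--extract` (also unpack the zipped report) options. The function name `run_fastqc` and the directory name `fastqc_out` are assumptions made for illustration.

```shell
# Run FastQC on one or more FASTQ files, writing reports to a chosen directory.
#   -o         directory to write reports into (must exist, so create it first)
#   --extract  unpack the zipped report in addition to writing the .zip
run_fastqc() {
    outdir=$1
    shift
    mkdir -p "$outdir"
    fastqc --extract -o "$outdir" "$@"
}

# Only attempt the run where fastqc is on the PATH and the sample file exists.
if command -v fastqc >/dev/null 2>&1 && [ -f data/Sample1_R1.fastq ]; then
    run_fastqc fastqc_out data/Sample1_R1.fastq
fi
```

Any further file names given after the output directory are passed straight through to `fastqc`.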

Exercise: FastQC results

What did FastQC create?


drwxrwx--- 4 daras G-801020  32768 May 16 14:03 Sample1_R1_fastqc
-rw-rw---- 1 daras G-801020 186116 May 16 13:58 Sample1_R1_fastqc.zip
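The extracted directory holds the HTML report along with plain-text versions of the same results. As a rough sketch, the module names can be listed from `fastqc_data.txt`, where each report section opens with a line starting `>>` and closes with `>>END_MODULE`. The helper name `list_modules` is hypothetical.

```shell
# List the report sections recorded in an extracted FastQC directory.
# Each section in fastqc_data.txt starts with ">>Name<TAB>status"
# and ends with ">>END_MODULE".
list_modules() {
    awk -F '\t' '/^>>/ && $1 != ">>END_MODULE" { print substr($1, 3) }' \
        "$1/fastqc_data.txt"
}

# Show the sections for the sample report, if it has been generated.
if [ -f Sample1_R1_fastqc/fastqc_data.txt ]; then
    list_modules Sample1_R1_fastqc
fi
```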


Looking at FastQC output

You can't run a web browser directly from your "dumb terminal" command line environment, so the FastQC results have to be transferred to your own computer before you can view the HTML report.
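One common way to do the transfer is `scp`, run from a terminal on your own computer rather than on the cluster. The sketch below only prints the command to paste locally; the user name, host name, and remote path shown are placeholders, and `print_scp_command` is a hypothetical helper.

```shell
# Print the scp command to run from a terminal on your own computer.
# All three arguments are placeholders: substitute your login, host, and path.
print_scp_command() {
    user=$1
    host=$2
    remote_path=$3
    printf 'scp %s@%s:%s .\n' "$user" "$host" "$remote_path"
}

# Example with made-up values. The .zip contains the full report,
# including the HTML page, so one file is enough to copy.
print_scp_command your_username your.cluster.hostname '~/Sample1_R1_fastqc.zip'
```

After copying, unzip the file locally and open the HTML report inside it in a browser.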


Exercise: Should we trim this data?

Based on this FastQC output, should we trim this data?

The Per base sequence quality report shows that trimming the last 10 bases may be a good idea.
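The same judgment can be roughed out on the command line from the text version of the report. This sketch prints the positions in the Per base sequence quality section of `fastqc_data.txt` whose mean quality falls below a cutoff. The cutoff of 20 is an arbitrary value chosen for illustration, not a recommendation, and `low_quality_positions` is a hypothetical helper name.

```shell
# Print base positions whose mean quality is below a cutoff.
#   $1  extracted FastQC report directory
#   $2  minimum acceptable mean quality (illustrative choice)
# In the section's table, column 1 is the base position (or a range such as
# 10-14) and column 2 is the mean quality at that position.
low_quality_positions() {
    awk -F '\t' -v min="$2" '
        /^>>Per base sequence quality/ { in_module = 1; next }
        /^>>END_MODULE/                { in_module = 0 }
        in_module && !/^#/ && $2 + 0 < min + 0 { print $1 "\t" $2 }
    ' "$1/fastqc_data.txt"
}

# Check the sample report, if it has been generated.
if [ -f Sample1_R1_fastqc/fastqc_data.txt ]; then
    low_quality_positions Sample1_R1_fastqc 20
fi
```

If the flagged positions cluster at the end of the read, that supports trimming those bases.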

Let's look at tools that can perform such manipulations on FASTQ files, should we need them.