Evaluating and processing raw sequencing data GVA2023


Overview

Before you start the alignment and analysis processes, it can be useful to perform some initial quality checks on your raw data. If you don't do this (or don't do it thoroughly enough), you may reach the end of your analysis with problems you can't explain: for example, a large portion of reads may not map to your reference, or the reads may map well except that their ends do not align at all. Both of these results can give you clues about how you need to process the reads to improve the quality of the data that you are putting into your analysis.

A stitch in time saves nine

For many years this tutorial has alternated between being optional, being required, and being a candidate for dropping altogether as the overall quality of data increases. A few years ago a colleague of mine spent several days trying to understand some data he got back before reaching out for help. After I spent a few additional hours running into a wall, we ran FastQC. Less than 30 minutes later it was clear the library had not been constructed correctly and could not be salvaged. I believe that makes this one of the most important tutorials available.

Luckily, read pre-processing has also gotten easier and faster. 

Learning Objectives

This tutorial covers the commands necessary to use several common programs for evaluating read files in FASTQ format and for processing them (if necessary).

  1. Use basic Linux commands to determine read count numbers and pull out specific reads.
  2. Diagnose common issues in FASTQ read files that will negatively impact analysis.
  3. Trim adaptor sequences and low quality regions from the ends of reads to improve analysis.

FASTQ data format

A common question is 'after you submit something for sequencing what do you get back?' The answer is FASTQ files.

While there are some additional log files that you may be able to get off the instrument, the reality is that none of those are actually 'data' about anything other than high-level instrument performance. The good news is you don't actually need anything else. For single-end sequencing you would have a single file, while paired-end sequencing provides 2 files: 1 for read1 and another for read2. Each file contains a repeating 4-line entry for each individual read.
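Because every read takes up exactly 4 lines, the number of reads in a file is simply its line count divided by 4. The following is a minimal sketch of that idea, assuming an uncompressed FASTQ file whose sequence lines are not wrapped. The two reads and the temporary file name are made up for illustration and are not part of the course data.

```bash
# Build a tiny two-read FASTQ file so the example is self-contained.
# These reads are invented for illustration; they are not course data.
tmp_fastq=$(mktemp)
cat > "$tmp_fastq" <<'EOF'
@read1
ACGTACGTAC
+
IIIIIIIIII
@read2
TTGCATGCAA
+
IIIIHHHHGG
EOF

# Each read occupies exactly 4 lines, so read count = line count / 4.
line_count=$(wc -l < "$tmp_fastq")
read_count=$(( line_count / 4 ))
echo "reads: $read_count"

# Pull out only the first read entry: header, sequence, '+', qualities.
first_entry=$(head -n 4 "$tmp_fastq")
echo "$first_entry"

# Clean up the temporary file.
rm -f "$tmp_fastq"
```

For paired-end data, running the same count on the read1 and read2 files is a quick sanity check, since both files should contain the same number of reads.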

The first 4-line FASTQ read entry in the $BI/gva_course/mapping/data/SRR030257_1.fastq file