Evaluating and processing raw sequencing data GVA2021

Overview

Before you start the alignment and analysis processes, it can be useful to perform some initial quality checks on your raw data. If you skip this step (or don't do it thoroughly), you may reach the end of your analysis with results you cannot explain: for example, a large portion of reads may not map to your reference, or the reads may map well except for their ends, which do not align at all. Both of these results give you clues about how to process the reads to improve the quality of the data that you are putting into your analysis.

A stitch in time saves nine

Over the years this tutorial has alternated between being optional, being required, and being a candidate for removal altogether as the overall quality of sequencing data has increased. Recently a colleague of mine spent several days trying to understand some data he got back before reaching out for help. After I spent a few additional hours running into a wall, FastQC was used. Less than 30 minutes later it was clear that the library had not been constructed correctly and could not be salvaged. I believe that makes this one of the most important tutorials available.

Learning Objectives

This tutorial covers the commands necessary to use several common programs for evaluating read files in FASTQ format and for processing them (if necessary).

  1. Use basic Linux commands to count the reads in a file and pull out specific reads.
  2. Diagnose common issues in FASTQ read files that will negatively impact analysis.
  3. Trim adaptor sequences and low quality regions from the ends of reads to improve analysis.

FASTQ data format

A common question is 'after you submit something for sequencing what do you get back?' The answer is FASTQ files.

While there are some additional log files that you may be able to get off the instrument, in reality none of them contain data about anything other than high-level instrument performance. The good news is that you don't actually need anything else. Single-end sequencing gives you a single file, while paired-end sequencing gives you 2 files: 1 for read1 and another for read2. Each file contains a repeating 4-line entry for each individual read.
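Because every read occupies exactly 4 lines, the number of reads in a file is the line count divided by 4. The following is a minimal sketch of that arithmetic, assuming an uncompressed FASTQ file with no wrapped sequence lines; the function name `count_reads` and the tiny two-read demo file are made up for illustration and are not part of the course data.

```shell
# count_reads FILE: print the number of reads in an uncompressed FASTQ file,
# assuming each read occupies exactly 4 lines (hypothetical helper name).
count_reads() {
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Tiny made-up two-read file, for illustration only (not course data).
demo=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > "$demo"
count_reads "$demo"
rm -f "$demo"
```

Dividing the line count by 4 is generally safer than counting lines that start with `@`, because `@` is also a valid quality character and can appear at the start of a quality line. On the course data the same function could be called as `count_reads $BI/gva_course/mapping/data/SRR030257_1.fastq`.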

The first 4-line FASTQ read entry in the $BI/gva_course/mapping/data/SRR030257_1.fastq file
@SRR030257.1 HWI-EAS_4_PE-FC20GCB:6:1:385:567/1
TTACACTCCTGTTAATCCATACAGCAACAGTATTGG
+
AAA;A;AA?A?AAAAA?;?A?1A;;????566)=*1
  1. Line 1 is the read identifier, which describes the machine, flowcell, cluster, grid coordinate, end and barcode for the read. Except for the end indicator (the trailing /1 in this example) and any barcode information, read identifiers will be identical for corresponding entries in the R1 and R2 fastq files.
  2. Line 2 is the sequence reported by the machine.
  3. Line 3 is almost always just '+'. Occasionally it repeats line 1 with the initial @ symbol changed to a +.
  4. Line 4 is a string of ASCII-encoded base quality scores, one character per base in the sequence. For each base, an integer quality score Q = -10 log10(probability that the base is wrong) is calculated, and 33 is added to it to give a code in the printable ASCII character range.

See the Wikipedia FASTQ format page for more information.
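As a minimal sketch of the encoding described in item 4, the shell function below converts a quality string back into integer scores by subtracting 33 from each character's ASCII code. The function name `phred_scores` is made up for illustration, and the sketch assumes the offset of 33 stated above.

```shell
# phred_scores STRING: print one quality score per line for a quality string,
# assuming the +33 offset described above (hypothetical helper name).
phred_scores() {
    printf '%s' "$1" |
        od -An -v -tu1 |
        awk '{ for (i = 1; i <= NF; i++) print $i - 33 }'
}

# Quality line of the example read shown earlier in this section.
phred_scores 'AAA;A;AA?A?AAAAA?;?A?1A;;????566)=*1'
```

The first character, A, has ASCII code 65 and so decodes to a score of 32, which corresponds to an error probability of 10^(-3.2), or roughly 0.0006. The ) near the end of the string has ASCII code 41 and decodes to 8, a much less reliable base call.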


Determine 2nd sequence in a FASTQ file

What is the 2nd sequence in the file $BI/gva_course/mapping/data/SRR030257_1.fastq?
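One possible way to approach a question like this, sketched under the assumption of strict 4-line records: the sequence of read N sits on line 4*(N-1)+2 of the file, so `sed` can print just that line. The function name `nth_sequence` and the small demo file are hypothetical and only illustrate the idea.

```shell
# nth_sequence FILE N: print the sequence line of the N-th read,
# assuming strict 4-line records, so the sequence is line 4*(N-1)+2.
nth_sequence() {
    sed -n "$(( 4 * ($2 - 1) + 2 ))p" "$1"
}

# Tiny made-up two-read file, for illustration only (not course data).
demo=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > "$demo"
nth_sequence "$demo" 2    # prints TTGA, the sequence of the second demo read
rm -f "$demo"
```

Pointing the same function at the course file with N set to 2 would answer the question. An alternative that shows the whole second record rather than just its sequence is `head -n 8 FILE | tail -n 4`.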