Evaluating and processing raw sequencing data GVA2021

Evaluating and processing raw sequencing data GVA2021

Overview

Before you start the alignment and analysis processes, it can be useful to perform some initial quality checks on your raw data. If you don't do this (or don't do this sufficiently), you may notice at the end of your analysis some things still are not clear: for example, maybe a large portion of reads do not map to your reference or maybe the reads map well, except the ends do not align at all. Both of these results can give you clues about how you need to process the reads to improve the quality of data that you are putting into your analysis.

A stitch in time saves nine

For many years this tutorial alternated between being included as an optional tutorial, a required tutorial, or if it should be ignored all together as the overall quality of data increases. Recently a colleague of mine spent several days working with and trying to understand some data he got back before reaching out for help, after I spent a few additional hours of running into a wall, FastQC was used. Less than 30 minutes later it was clear the library was not constructed correctly and could not be salvaged. I believe that makes this one of the most important tutorials available.

Learning Objectives

This tutorial covers the commands necessary to use several common programs for evaluating read files in FASTQ format and for processing them (if necessary).

  1. Use basic linux commands to determine read count numbers and pull out specific reads.

  2. Diagnose common issues in FASTQ read files that will negatively impact analysis.

  3. Trim adaptor sequences and low quality regions from the ends of reads to improve analysis.

FASTQ data format

A common question is 'after you submit something for sequencing what do you get back?' The answer is FASTQ files.

While there is some additional log files that you may be able to get off the instrument, the reality is none of those are actually 'data' of anything other than high level instrument performance. The good news is you don't actually need anything else. For single end sequencing you would have a single file, while paired end sequencing provides 2 files: 1 for read1 and another for read2. Each file contains a repeating 4-line entry for each individual read.

The first 4-line FASTQ read entry in the $BI/gva_course/mapping/data/SRR030257_1.fastq file
@SRR030257.1 HWI-EAS_4_PE-FC20GCB:6:1:385:567/1 TTACACTCCTGTTAATCCATACAGCAACAGTATTGG + AAA;A;AA?A?AAAAA?;?A?1A;;????566)=*1
  1. Line 1 is the read identifier, which describes the machine, flowcell, cluster, grid coordinate, end and barcode for the read. Except for the barcode information, read identifiers will be identical for corresponding entries in the R1 and R2 fastq files.

  2. Line 2 is the sequence reported by the machine.

  3. Line 3 is almost always just '+' . (occasionally the line will be the same as the first line except the intial @ symbol is changed to a +) 

  4. Line 4 is a string of Ascii-encoded base quality scores, one character per base in the sequence. For each base, an integer quality score = -10 log(probability base is wrong) is calculated, then added to 33 to make a number in the ASCII printable character range.

See the Wikipedia FASTQ format page for more information.

 

Determine 2nd sequence in a FASTQ file

What the 2nd sequence in the file $BI/gva_course/mapping/data/SRR030257_1.fastq is?

head $BI/gva_course/mapping/data/SRR030257_1.fastq

Line type

Value

Line type

Value

Read identifier

@SRR030257.2 HWI-EAS_4_PE-FC20GCB:6:1:407:767/1

Read Sequence

TAAGCCAGTCGCCATGGAATATCTGCTTTATTTAGC

Line 3

+

Quality score

AAAAAAAAAA;A?AA;A?AAAAAA8????+9=&=1;

If thats what you thought it was congratulations, if it is different, do you see where we got it from? If it doesn't make sense ask for help.

More advanced solutions to do slightly different things:

  1. The -n option can be used to control how many lines of a file are printed:

    Show just the first 2 reads

    head -n 8 $BI/gva_course/mapping/data/SRR030257_1.fastq
  2. The output of the head command can be piped to the tail command to isolate specific groups of lines:

    Show just the 2nd read

    head -n 8 $BI/gva_course/mapping/data/SRR030257_1.fastq | tail -n 4
  3. The grep command can be used to look for lines that contain only ACTG or N:

    head | tail | grep to isolate the 2nd read

    head -n 8 $BI/gva_course/mapping/data/SRR030257_1.fastq | tail -n 4 | grep "^[ACTGN]*$"

    head to show the first 10 lines, grep to only print the 3 sequence lines

    head $BI/gva_course/mapping/data/SRR030257_1.fastq | grep "^[ACTGN]*$"

    grep to only print the first 3 lines containing sequence

    grep -m 2 "^[ACTGN]*$" $BI/gva_course/mapping/data/SRR030257_1.fastq

    ^ by increasing the "-m" value we can now quickly get a block of sequence of any size of our choosing. This is the first truly useful command on this page. With a block of sequence, you can start to see things like:

    1. if the first/last bases are always the same

    2. if the reads are the same length

    3. if a single sequence shows up a huge number of times

Counting sequences

Often, the first thing you (or your boss) want to know about your sequencing run is simply, "how many reads are there?". For the $BI/gva_course/mapping/data/SRR030257_1.fastq file, the answer is 3,800,180. How can we figure that out?

The grep (or Global Regular Expression Print) command can be used to determine the number of lines which match some criteria as shown above. Above we used it to search for:

  1. anything from the group of ACTGN with the [] marking them as a group

  2. matching any number of times *

  3. from the beginning of the line ^

  4. to the end of the line $

Here, since we are only interested in the number of reads that we have, we can make use of knowing the 3rd line in the fastq file is a + and a + only, and grep's -c option to simply report the number of reads in a file.

grep -c "^+$" $BI/gva_course/mapping/data/SRR030257_1.fastq
No you don't!
grep -c "+" $BI/gva_course/mapping/data/SRR030257_1.fastq

Gives a result of 4,770,069 because the + sign is found in some of the quality score lines.

Remember computers always answer exactly what you ask, the trick is asking the right question

Without the anchors you asked the computer, "how many lines have a + symbol on them". With the anchors you asked "how many lines start with a + symbol and have no other characters on the line. Remember, this only works when we know for certain that line 3 is a "+" symbol by itself. This is where head/tail can be useful.

We can also check using similar methods (and give another example of different analysis giving us the same result):

grep to look specifically for sequences directly. This will work regardless of what line 3 is
grep -c "^[ACTGN]*$" $BI/gva_course/mapping/data/SRR030257_1.fastq
The wc command (word count) using the -l switch to tell it to count lines
wc -l $BI/gva_course/mapping/data/SRR030257_1.fastq