Evaluating and processing raw sequencing data GVA2023
- 1 Overview
- 2 Learning Objectives
- 3 FASTQ data format
- 4 Determine 2nd sequence in a FASTQ file
- 5 Counting sequences
- 6 Evaluating FASTQ files with FastQC
- 7 FASTQ Processing Tools
- 8 A note on versions
- 9 Optional Exercise: Improve the quality of R2 the same way you did for R1.
Overview
Before you start the alignment and analysis processes, it can be useful to perform some initial quality checks on your raw data. If you don't do this (or don't do this sufficiently), you may notice at the end of your analysis some things still are not clear: for example, maybe a large portion of reads do not map to your reference or maybe the reads map well, except the ends do not align at all. Both of these results can give you clues about how you need to process the reads to improve the quality of data that you are putting into your analysis.
A stitch in time saves nine
For many years this tutorial alternated between being included as an optional tutorial, a required tutorial, or if it should be ignored all together as the overall quality of data increases. A few years ago a colleague of mine spent several days working with and trying to understand some data he got back before reaching out for help, after I spent a few additional hours of running into a wall, FastQC was used. Less than 30 minutes later it was clear the library was not constructed correctly and could not be salvaged. I believe that makes this one of the most important tutorials available.
Luckily, read pre-processing has also gotten easier and faster.
Learning Objectives
This tutorial covers the commands necessary to use several common programs for evaluating read files in FASTQ format and for processing them (if necessary).
Use basic linux commands to determine read count numbers and pull out specific reads.
Diagnose common issues in FASTQ read files that will negatively impact analysis.
Trim adaptor sequences and low quality regions from the ends of reads to improve analysis.
FASTQ data format
A common question is 'after you submit something for sequencing what do you get back?' The answer is FASTQ files.
While there is some additional log files that you may be able to get off the instrument, the reality is none of those are actually 'data' of anything other than high level instrument performance. The good news is you don't actually need anything else. For single end sequencing you would have a single file, while paired end sequencing provides 2 files: 1 for read1 and another for read2. Each file contains a repeating 4-line entry for each individual read.
The first 4-line FASTQ read entry in the $BI/gva_course/mapping/data/SRR030257_1.fastq file
@SRR030257.1 HWI-EAS_4_PE-FC20GCB:6:1:385:567/1
TTACACTCCTGTTAATCCATACAGCAACAGTATTGG
+
AAA;A;AA?A?AAAAA?;?A?1A;;????566)=*1Line 1 is the read identifier, which describes the machine, flowcell, cluster, grid coordinate, end and barcode for the read. Except for the barcode information, read identifiers will be identical for corresponding entries in the R1 and R2 fastq files.
Line 2 is the sequence reported by the machine.
Line 3 is almost always just '+' . (occasionally the line will be the same as the first line except the initial @ symbol is changed to a +)
Line 4 is a string of Ascii-encoded base quality scores, one character per base in the sequence. For each base, an integer quality score = -10 log(probability base is wrong) is calculated, then added to 33 to make a number in the ASCII printable character range.
See the Wikipedia FASTQ format page for more information.
How to think of paired end files.
Often I will hear people refer to read1 as the "forward" read and read2 as the "reverse" read. While technically true in that the reads are on opposite strands, thinking of them in this manner seems to correlate with misunderstandings downstream. Specifically thinking of the reads as forward and reverse leads to thinking all read1 should map 5` to 3` on the genome while read2 should map 3` to 5` as complementary sequence. Further, when it comes to evaluating read mapping quality or variant support (covered in later tutorials), thinking of them in this manner results in incorrectly applying confidence checks in the believability of variant calls.
Read1 and Read2 come from the same fragment of DNA, but that piece of DNA can be placed between adapter sequences in either orientation. Saving application of additional terminology until after reads are mapped may help you keep things clearer in your head.
Determine 2nd sequence in a FASTQ file
What the 2nd sequence in the file $BI/gva_course/mapping/data/SRR030257_1.fastq is?
More advanced solutions to do slightly different things:
The -n option can be used to control how many lines of a file are printed:
Show just the first 2 reads
head -n 8 $BI/gva_course/mapping/data/SRR030257_1.fastqThe output of the head command can be piped to the
tailcommand to isolate specific groups of lines:Show just the 2nd read
head -n 8 $BI/gva_course/mapping/data/SRR030257_1.fastq | tail -n 4The grep command can be used to look for lines that contain only ACTG or N:
head | tail | grep to isolate the 2nd read
head -n 8 $BI/gva_course/mapping/data/SRR030257_1.fastq | tail -n 4 | grep "^[ACTGN]*$"head to show the first 10 lines, grep to only print the 3 sequence lines
head $BI/gva_course/mapping/data/SRR030257_1.fastq | grep "^[ACTGN]*$"grep to only print the first 3 lines containing sequence
grep -m 2 "^[ACTGN]*$" $BI/gva_course/mapping/data/SRR030257_1.fastq^ by increasing the "-m" value we can now quickly get a block of sequence of any size of our choosing. This is the first truly useful command on this page. With a block of sequence, you can start to see things like:
if the first/last bases are always the same
if the reads are the same length
if a single sequence shows up a huge number of times
Counting sequences
Often, the first thing you (or your boss) want to know about your sequencing run is simply, "how many reads are there?". For the $BI/gva_course/mapping/data/SRR030257_1.fastq file, the answer is 3,800,180. How can we figure that out?
The grep (or Global Regular Expression Print) command can be used to determine the number of lines which match some criteria as shown above. Above we used it to search for:
anything from the group of ACTGN with the [] marking them as a group
matching any number of times *
from the beginning of the line ^
to the end of the line $
Here, since we are only interested in the number of reads that we have, we can make use of knowing the 3rd line in the fastq file is a + and a + only, and grep's -c option to simply report the number of reads in a file.