Trimming low quality bases
There are a number of open source tools that can trim off 3' bases and produce a FASTQ file of the trimmed reads to use as input to the alignment program.
FASTX Toolkit
The FASTX-Toolkit provides a set of command line tools for manipulating fasta and fastq files. The available modules are described on their website. They include a fast fastx_trimmer utility for trimming fastq sequences (and quality score strings) before alignment.
...
- The -l 90 option says that base 90 should be the last base (i.e., trim down to 90 bases)
- The -Q 33 option specifies how base qualities on the 4th line of each fastq entry are encoded. The FASTX toolkit is an older program, written at a time when Illumina base qualities were encoded differently. These days Illumina base qualities follow the Sanger FASTQ standard (Phred score + 33 to make an ASCII character).
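To make the Phred + 33 encoding concrete, here is a minimal sketch that converts a quality string back into numeric scores. The function name phred33_scores is made up for this illustration and is not part of the FASTX toolkit; it only assumes a POSIX shell and awk.

```shell
# Hypothetical helper (not a FASTX tool): decode a Phred+33 quality string.
# Each character's ASCII code minus 33 is the Phred score of that base.
phred33_scores() {
    printf '%s\n' "$1" | awk '
    BEGIN {
        # Build a character -> ASCII code lookup for the printable range.
        for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
    }
    {
        out = ""
        for (j = 1; j <= length($0); j++) {
            q = ord[substr($0, j, 1)] - 33
            out = out (j > 1 ? " " : "") q
        }
        print out
    }'
}

# "I" is ASCII 73, "5" is ASCII 53 and "!" is ASCII 33.
phred33_scores 'II5!'    # prints: 40 40 20 0
```

So a quality line made up mostly of "I" characters means most bases were called at Q40, while "!" marks a base with quality 0.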
Exercise: fastx toolkit programs
What other fastx manipulation programs are part of the fastx toolkit?
Hint: type fastx_ and then press Tab (twice if needed) to see their names.
Exercise: What if you just want to get rid of reads that are too low in quality?
    fastq_quality_filter -q <N> -p <N> -i <inputfile> -o <outputfile>

    -q N: Minimum base quality score
    -p N: Minimum percent of bases that must have [-q] quality
Let's try it on our data: filter it to only include reads where at least 80% of the bases have a quality score of 20 or above.
    fastq_quality_filter -q 20 -p 80 -i data/Sample1_R1.fastq -Q 33 -o Sample1_R1.filtered.fastq
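If it helps to see what the -q and -p options mean, the sketch below re-implements the criterion in awk: a read is kept when at least the given percent of its bases reach the minimum quality. This is an illustration only, not the toolkit's actual code. The name quality_filter_sketch is hypothetical, and it assumes Phred+33 qualities, 4-line FASTQ records, and that a read sitting exactly on the percent threshold is kept (the real tool may treat that boundary differently).

```shell
# Hypothetical illustration of the -q/-p criterion (not the real tool).
# Usage: quality_filter_sketch MIN_QUAL MIN_PERCENT < in.fastq > out.fastq
quality_filter_sketch() {
    awk -v minq="$1" -v minp="$2" '
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
    { rec[(NR - 1) % 4] = $0 }            # collect the 4 lines of one record
    NR % 4 == 0 {
        qual = rec[3]; n = length(qual); good = 0
        for (j = 1; j <= n; j++)
            if (ord[substr(qual, j, 1)] - 33 >= minq) good++
        # keep the read when enough of its bases reach the threshold
        if (n > 0 && good * 100 >= minp * n)
            print rec[0] "\n" rec[1] "\n" rec[2] "\n" rec[3]
    }'
}

# Two toy reads: r1 has 4 of 5 bases at Q40 (80%), r2 has only 2 of 5 (40%).
printf '@r1\nACGTA\n+\nIIII!\n@r2\nACGTA\n+\nII!!!\n' | quality_filter_sketch 20 80
```

With -q 20 -p 80 only r1 is printed. Note that the whole read is either kept or dropped; nothing is trimmed.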
Exercise: Compare the results of fastx_trimmer vs fastq_quality_filter
    grep '^@HWI' Sample1_R1.trimmed.fastq | wc -l
    grep '^@HWI' Sample1_R1.filtered.fastq | wc -l
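The grep approach depends on every read name starting with @HWI, and "@" is also a legal quality character, so a quality line could in principle match too. A sketch of an alternative that does not look at read names is to divide the line count by 4. The helper name count_reads is hypothetical, and it assumes plain 4-line FASTQ records (no wrapped sequence lines).

```shell
# Hypothetical helper: count reads as (number of lines) / 4.
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Contrived example: the quality line of the first read happens to be "@HWI".
tmp=$(mktemp)
printf '@HWI-1\nACGT\n+\n@HWI\n@HWI-2\nACGT\n+\nIIII\n' > "$tmp"
grep -c '^@HWI' "$tmp"    # prints 3: the quality line is counted as well
count_reads "$tmp"        # prints 2
rm -f "$tmp"
```

On real data the two counts will almost always agree; if they don't, the line-based count is the one to trust.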
Adaptor Trimming
...
The GSAF website describes the flavors of Illumina adapter and barcode sequences in more detail: https://utexas.atlassian.net/wiki/display/GSAF/Illumina+-+all+flavors
FASTX Toolkit
One of the programs available as part of the fastx toolkit does a crude job of clipping adaptors out of sequences.
...
    fastx_clipper -a <adapter> -i <inputfile> -o <outputfile> -l <discardSeqsShorterThanN>
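To illustrate what clipping with a minimum length does, here is a deliberately simple awk sketch. It is not how fastx_clipper is implemented: it only finds exact, full-length copies of the adapter, cuts the read (and its quality string) at that point, and drops reads that end up shorter than the minimum length. Reads without the adapter are passed through unchanged, which is an assumption of this sketch. The name clip_adapter_sketch and the adapter TTTT are made up for the example.

```shell
# Hypothetical illustration of adapter clipping (exact matches only).
# Usage: clip_adapter_sketch ADAPTER MIN_LEN < in.fastq > out.fastq
clip_adapter_sketch() {
    awk -v adapter="$1" -v minlen="$2" '
    { rec[(NR - 1) % 4] = $0 }            # collect the 4 lines of one record
    NR % 4 == 0 {
        pos = index(rec[1], adapter)      # 0 when the adapter is absent
        if (pos > 0) {
            # cut sequence and qualities just before the adapter starts
            rec[1] = substr(rec[1], 1, pos - 1)
            rec[3] = substr(rec[3], 1, pos - 1)
        }
        if (length(rec[1]) >= minlen)
            print rec[0] "\n" rec[1] "\n" rec[2] "\n" rec[3]
    }'
}

# r1 is clipped to 6 bases, r2 is clipped to 2 bases and discarded,
# r3 has no adapter and is kept as-is.
printf '@r1\nACGTACTTTTGG\n+\nIIIIIIIIIIII\n@r2\nACTTTTGGGGGG\n+\nIIIIIIIIIIII\n@r3\nACGTACGTACGT\n+\nIIIIIIIIIIII\n' |
    clip_adapter_sketch TTTT 5
```

Real reads often contain only part of the adapter or a copy with sequencing errors, which an exact-match search like this one would miss.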
Cutadapt
The cutadapt program is an excellent tool for removing adapter contamination. The program is not available through TACC's module system but we've installed a copy in our $BI/bin directory. Cutadapt has some advantages over fastx_clipper:
...
Please refer to https://utexas.atlassian.net/wiki/display/GSAF/Illumina+-+all+flavors for Illumina library adapter layout.
...