
...

Here we will assume you have data from GSAF's Illumina HiSeq or MiSeq sequencer.

Learning Objectives

...

The GSAF website describes the flavors of Illumina adapter and barcode sequences in more detail: https://utexas.atlassian.net/wiki/display/GSAF/Illumina+-+all+flavors

Cutadapt

The cutadapt program is an excellent tool for removing adapter contamination. The program is not available through TACC's module system but we've installed a copy in our $BI/bin directory.

...

Expand

title	The gory details on the -a adapter sequence argument

Please refer to https://utexas.atlassian.net/wiki/display/GSAF/Illumina+-+all+flavors for the Illumina library adapter layout.

The top strand, 5' to 3', of a read sequence looks like this.

No Format

title	Illumina library read layout

<P5 capture> <indexRead2> <Read 1 primer> [insert] <Read 2 primer> <indexRead1> <P7 capture>

The -a argument to cutadapt is documented as the "sequence of adapter that was ligated to the 3' end". So we care about the <Read 2 primer> for R1 reads, and the <Read 1 primer> for R2 reads.

The "contaminant" for adapter trimming will be the <Read 2 primer> for R1 reads. There is only one Read 2 primer:

Code Block

title	Read 2 primer, 5' to 3', used as R1 sequence adapter

AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC

The "contaminant" for adapter trimming will be the <Read 1 primer> for R2 reads. However, there are three different Read 1 primers, depending on library construction:

No Format

title	Read 1 primer depends on library construction

TCTACACGTTCAGAGTTCTACAGTCCGACGATCA    # small RNA sequencing primer site
CAGGTTCAGAGTTCTACAGTCCGACGATCA        # "other"
TCTACACTCTTTCCCTACACGACGCTCTTCCGATCT  # TruSeq Read 1 primer site. This is the RC of the R2 adapter

Since R2 reads are the reverse complement of R1 reads, the R2 adapter contaminant will be the RC of the Read 1 primer used.

For ChIP-seq libraries, where reads come from both DNA strands, the TruSeq Read 1 primer is always used.
Its RC begins with the same 13 bases as the Read 2 primer (AGATCGGAAGAGC), so the adapter contaminant starts with the same sequence in R1 and R2 reads.
Therefore, for ChIP-seq libraries only one cutadapt command is needed:

Code Block

title	Cutadapt adapter sequence for ChIP-seq libraries, both R1 and R2 reads
cutadapt -a GATCGGAAGAGCACACGTCTGAACTCCAGTCAC
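
The line above shows only the adapter argument. As a rough sketch of a fuller invocation, the snippet below prints (rather than runs) one cutadapt command per read file, so it does not need cutadapt installed. The file names and the `trim_cmd` helper are hypothetical, chosen for illustration; `-o` is cutadapt's output-file option.

```bash
# Adapter from the ChIP-seq block above (used for both R1 and R2 reads).
ADAPTER=GATCGGAAGAGCACACGTCTGAACTCCAGTCAC

# Print (do not run) the cutadapt command for one FASTQ file.
# The output name is derived by inserting ".trimmed" before ".fastq".
trim_cmd() {
  local reads="$1"
  local out="${reads%.fastq}.trimmed.fastq"
  echo "cutadapt -a $ADAPTER -o $out $reads"
}

# Hypothetical input file names, for illustration only.
for reads in sample_R1.fastq sample_R2.fastq; do
  trim_cmd "$reads"
done
```

Once the printed commands look right, they can be pasted into the shell or piped to `bash` to actually run the trimming.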

For RNA-seq libraries, we use the small RNA sequencing primer as the Read 1 primer.
The contaminant is then the RC of this, minus the 1st and last bases:

No Format

title	Small RNA library Read 1 primer, 5' to 3', used as R2 sequence adapter

TCTACACGTTCAGAGTTCTACAGTCCGACGATCA    # R1 primer - small RNA sequencing Read 1 primer site, 5' to 3'
TGATCGTCGGACTGTAGAACTCTGAACGTGTAGA    # R2 adapter contaminant (RC of R1 small RNA sequencing Read 1 primer)
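
The reverse complement shown above can be reproduced on the command line. This is a minimal sketch that assumes the standard `rev` and `tr` utilities are available (they are on most Linux systems); the `revcomp` function name is our own.

```bash
# Reverse-complement a DNA sequence:
# reverse the string, then swap A<->T and C<->G (upper and lower case).
revcomp() {
  echo "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# Small RNA sequencing Read 1 primer site, 5' to 3'
revcomp TCTACACGTTCAGAGTTCTACAGTCCGACGATCA
# prints TGATCGTCGGACTGTAGAACTCTGAACGTGTAGA
```

The same function can be applied to whichever Read 1 primer your library used to get the matching R2 adapter sequence.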

...

Trimmomatic offers similar options to Flexbar, with the potential benefit that many Illumina adapter sequences are already "built-in". It is available here.

More Example Data

See if you can figure out what's wrong with these data sets (copy them to your $SCRATCH directory before analyzing them) and then process them to get rid of the problem(s). If you're very ambitious, you could also map them to the reference genomes and perform variant calling before and after cleaning them up to see how the results change. Each file has a different problem.
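
One quick check when inspecting an unfamiliar data set is to count how many reads contain the start of a known adapter. The sketch below is one possible way to do that with `awk`; the `count_reads_with` function and the two-read FASTQ are made up for illustration.

```bash
# Count reads in a FASTQ stream (stdin) whose sequence contains a given string.
# FASTQ records are 4 lines each; the sequence is line 2 of every record.
count_reads_with() {
  awk -v pat="$1" 'NR % 4 == 2 { total++; if (index($0, pat) > 0) hits++ }
                   END { printf "%d of %d reads\n", hits, total }'
}

# Tiny made-up FASTQ (two reads), for illustration only.
printf '@r1\nACGTAGATCGGAAGAGCAC\n+\nIIIIIIIIIIIIIIIIIII\n@r2\nACGTACGTACGTACGTACG\n+\nIIIIIIIIIIIIIIIIIII\n' \
  | count_reads_with AGATCGGAAGAGC
# prints "1 of 2 reads"
```

On real data, something like `zcat reads.fastq.gz | head -n 400000 | count_reads_with AGATCGGAAGAGC` would sample the first 100,000 reads (here `reads.fastq.gz` is a placeholder name).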

Example #1: Single-end Illumina MiSeq data for E. coli

Code Block

language	bash
title	Example read and reference files #1

$BI/gva_course/read_processing/JJM104_TAAGGCGA-TAGATCGC_L001_R1_001.fastq.gz
$BI/gva_course/read_processing/REL606.fna

Expand

title	What's wrong with this data?

This

Example #2: Paired-end Illumina Genome Analyzer IIx data for E. coli

Code Block

language	bash
title	Example read and reference files #2

$BI/gva_course/read_processing/61FTVAAXX_2_R1_ZDB172.fastq.gz
$BI/gva_course/read_processing/61FTVAAXX_2_R2_ZDB172.fastq.gz
$BI/gva_course/read_processing/REL606.fna

Expand

title	What's wrong with this data?

There was some sort of problem during library prep that highly biased the beginning of reads to "T". Unfortunately, post-processing can't help with this one. The read sequences are fine, but the coverage across the genome is so uneven that many regions of the genome were not sampled (have zero coverage) even though the volume of sequencing data was very high for this microbial genome. The facility had to do a new library prep and re-sequence to correct this issue.
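
A bias like this can be spotted by tallying the first base of every read. The following is a minimal sketch under the assumption of standard 4-line FASTQ records; the `first_base_counts` function and the three-read example are hypothetical.

```bash
# Tally the first base of every read in a FASTQ stream (stdin).
# The sequence is line 2 of each 4-line FASTQ record.
first_base_counts() {
  awk 'NR % 4 == 2 { n[substr($0, 1, 1)]++ }
       END { for (b in n) print b, n[b] }' | sort
}

# Tiny made-up FASTQ (three reads), for illustration only.
printf '@r1\nTACG\n+\nIIII\n@r2\nTTGA\n+\nIIII\n@r3\nGATC\n+\nIIII\n' | first_base_counts
# prints:
# G 1
# T 2
```

In an unbiased library the four bases should appear at the first position in roughly the proportions expected from the genome's composition; a count dominated by one base suggests a library prep problem like the one described above.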
