Page Comparison

...

Warning

When following along here, please start an idev session for running any example commands:

Code Block
idev -m 60 -q development

Table of Contents

Illumina sequence data format (FASTQ)

...

Code Block

title	A four-line FASTQ file entry representing one sequence


@HWI-ST1097:104:D13TNACXX:4:1101:1715:2142 1:N:0:CGATGT
GCGTTGGTGGCATAGTGGTGAGCATAGCTGCCTTCCAAGCAGTTATGGGAG
+
=<@BDDD=A;+2C9F<CB?;CGGA<<ACEE*1?C:D>DE=FC*0BAG?DB6

...

Expand

title	Answer

Code Block
head /corral-repl/utexas/BioITeam/ngs_course/intro_to_mapping/data/SRR030257_1.fastq

Executing the command above reports that the 2nd sequence has ID = @SRR030257.2 HWI-EAS_4_PE-FC20GCB:6:1:407:767/1, and the sequence TAAGCCAGTCGCCATGGAATATCTGCTTTATTTAGC

...

Code Block

title	Using wc -l to count lines


wc -l $BI/ngs_course/intro_to_mapping/data/SRR030257_1.fastq

...

Code Block

title	Using wc -l on a compressed file


gunzip -c $BI/web/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz | wc -l

...

Expand

title	How do I do math on the command line?

The bash shell has a really strange syntax for arithmetic: it uses a double-parenthesis operator. Go figure.

Code Block

title	Arithmetic in Bash


echo $((2368720 / 4))

FASTQ Quality Assurance tools

...

Code Block

title	Running FastQC example


# setup
cds
mkdir fastqc_test
cd fastqc_test
cp $BI/web/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz .

# running the program
$BI/bin/FastQC/fastqc Sample_Yeast_L005_R1.cat.fastq.gz

...

Expand

title	Answer

No Format

title	ls -l shows something like this


drwxrwxr-x 4 abattenh G-803889     4096 May 20 22:59 Sample_Yeast_L005_R1.cat_fastqc
-rw-rw-r-- 1 abattenh G-803889   198239 May 20 22:59 Sample_Yeast_L005_R1.cat_fastqc.zip
-rwxr-xr-x 1 abattenh G-803889 51065629 May 20 22:59 Sample_Yeast_L005_R1.cat.fastq.gz

The Sample_Yeast_L005_R1.cat.fastq.gz file is what we analyzed, so FastQC created the other two items. Sample_Yeast_L005_R1.cat_fastqc is a directory (the "d" in "drwxrwxr-x"), so use ls Sample_Yeast_L005_R1.cat_fastqc to see what's in it. Sample_Yeast_L005_R1.cat_fastqc.zip is just a Zipped (compressed) version of the whole directory.

...

Code Block

title	FastQC results URL


http://loving.corral.tacc.utexas.edu/bioiteam/yeast_stuff/Sample_Yeast_L005_R1.cat_fastqc/fastqc_report.html

...

Code Block

title	Running samstat on FASTQ example


# setup
cds
mkdir samstat_test
cd samstat_test
cp $BI/ngs_course/intro_to_mapping/data/SRR030257_1.fastq .

# run the program
$BI/bin/samstat SRR030257_1.fastq

...

Code Block

title	URL for viewing samstat results


http://loving.corral.tacc.utexas.edu/bioiteam/SRR030257_1.fastq.html

...

Code Block

title	Running samstat on BAM example


# setup
cds
mkdir samstat_test2
cd samstat_test2
cp $BI/web/yeast_stuff/yeast_chip_sort.bam .

# run the program
$BI/bin/samstat yeast_chip_sort.bam

...

Code Block

title	URL for viewing samstat results


http://loving.corral.tacc.utexas.edu/bioiteam/yeast_stuff/yeast_chip_sort.bam.html

...

Code Block

title	FASTX_toolkit module description


module spider fastx_toolkit
module load fastx_toolkit

...

No Format

title	fastx_trimmer example


gunzip -c $BI/web/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz | fastx_trimmer -l 50 -Q 33 > trimmed.fq

...

Expand

title	Answer

You could supply the -z option like this:

No Format

title	fastx_trimmer example


gunzip -c $BI/web/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz | fastx_trimmer -l 50 -Q 33 -z > trimmed.fq.gz

Or you could gzip the output yourself:

No Format

title	fastx_trimmer example


gunzip -c $BI/web/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz | fastx_trimmer -l 50 -Q 33 | gzip > trimmed.fq.gz

...

Expand

title	Hint

Type fastx_ then tab to see their names
See all the programs like this:

Code Block

title	fastx toolkit programs


ls $TACC_FASTX_BIN

Adapter trimming

...

The GSAF website describes the flavaors of Illumina adapter and barcode sequence in more detail https://wikisutexas.utexasatlassian.edunet/wiki/display/GSAF/Illumina+-+all+flavors

Cutadapt

The cutadapt program is an excellent tool for removing adapter contamination. The program is not available through TACC's module system but we've installed a copy in our $BI/bin directory.

...

Code Block

title	cutadapt command for R1 sequences


cutadapt -m 22 -O 10 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC

Code Block

title	cutadapt command for R2 sequences


cutadapt -m 22 -O 10 -a TGATCGTCGGACTGTAGAACTCTGAACGTGTAGA

...

Expand

title	The gory details on the -a adapter sequence argument

Please refer to https://wikisutexas.utexasatlassian.edunet/wiki/display/GSAF/Illumina+-+all+flavors for Illumina library adapter layout.

The top strand, 5' to 3', of a read sequence looks like this.

No Format

title	Illumina library read layout


<P5 capture> <indexRead2> <Read 1 primer> [insert] <Read 2 primer> <indexRead1> <P7 capture>

The -a argument to cutadapt is documented as the "sequence of adapter that was ligated to the 3' end". So we care about the <Read 2 primer> for R1 reads, and the <Read 1 primer> for R2 reads.

The "contaminent" for adapter trimming will be the <Read 2 primer> for R1 reads. There is only one Read 2 primer:

Code Block

title	Read 2 primer, 5' to 3', used as R1 sequence adapter


AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC

The "contaminent" for adapter trimming will be the <Read 1 primer> for R2 reads. However, there are three different Read 1 primers, depending on library construction:

No Format

title	Read 1 primer depends on library construction


TCTACACGTTCAGAGTTCTACAGTCCGACGATCA    # small RNA sequencing primer site
CAGGTTCAGAGTTCTACAGTCCGACGATCA        # "other"
TCTACACTCTTTCCCTACACGACGCTCTTCCGATCT  # TruSeq Read 1 primer site. This is the RC of the R2 adapter

Since R2 reads are the reverse complement of R1 reads, the R2 adapter contaminent will be the RC of the Read 1 primer used.

For ChIP-seq libraries where reads come from both DNA strands, the TruSeq Read 1 primer is always used.
Since it is the RC of the Read 2 primer, its RC is just the Read 1 primer back
Therefore, for ChIP-seq libraries only one cutadapt command is needed:

Code Block

title	Cutadapt adapter sequence for ChIP-seq lib

raries, both R1 and R2 reads}
cutadapt -a GATCGGAAGAGCACACGTCTGAACTCCAGTCAC

For RNAseq libraries, we use the small RNA sequencing primer as the Read 1 primer.
The contaminent is then the RC of this, minus the 1st and last bases:

No Format

title	Small RNA library Read 1 primer, 5' to 3', used as R2 sequence adapter


TCTACACGTTCAGAGTTCTACAGTCCGACGATCA    # R1 primer - small RNA sequencing Read 1 primer site, 5' to 3'
TGATCGTCGGACTGTAGAACTCTGAACGTGTAGA    # R2 adapter contaminent (RC of R1 small RNA sequencing Read 1 primer)

...

No Format

title	Example of trimming adaptor sequence from right end of reads


flexbar -n 1 --adapters adaptors.fna --source example.fastq --target example.ar --format fastq-sanger --adapter-threshold 2 --adapter-min-overlap 6 --adapter-trim-end RIGHT_TAIL

Code Block

title	Example adaptors.fna file


>adaptor1
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
>adaptor2
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG
>adaptor1_RC
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
>adaptor2_RC
CAAGCAGAAGACGGCATACGAGATNNNNNNNNGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT

Note that flexbar only searches for the exact sequences given (with options to allow for a given number of mismatches) not the reverse complement of those sequences therefore you must provide them yourself.

Trimmomatic

Trimmomatic offers similar options with the potential benefit that many illumina adaptor sequences are already "built-in". It is available here.

Versions Compared

Old Version 41

New Version Current

Key

Illumina sequence data format (FASTQ)

FASTQ Quality Assurance tools

Adapter trimming

Cutadapt