...
Warning |
---|
When following along here, please start an idev session for running any example commands: Code Block |
---|
idev -m 60 -q development
|
|
Illumina sequence data format (FASTQ)
...
Code Block |
---|
title | A four-line FASTQ file entry representing one sequence |
---|
|
@HWI-ST1097:104:D13TNACXX:4:1101:1715:2142 1:N:0:CGATGT
GCGTTGGTGGCATAGTGGTGAGCATAGCTGCCTTCCAAGCAGTTATGGGAG
+
=<@BDDD=A;+2C9F<CB?;CGGA<<ACEE*1?C:D>DE=FC*0BAG?DB6
|
...
Expand |
---|
|
Code Block |
---|
head /corral-repl/utexas/BioITeam/ngs_course/intro_to_mapping/data/SRR030257_1.fastq
|
Executing the command above reports that the 2nd sequence has ID = @SRR030257.2 HWI-EAS_4_PE-FC20GCB:6:1:407:767/1, and the sequence TAAGCCAGTCGCCATGGAATATCTGCTTTATTTAGC |
...
Code Block |
---|
title | Using wc -l to count lines |
---|
|
wc -l $BI/ngs_course/intro_to_mapping/data/SRR030257_1.fastq
|
...
Code Block |
---|
title | Using wc -l on a compressed file |
---|
|
gunzip -c $BI/web/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz | wc -l
|
...
Expand |
---|
title | How do I do math on the command line? |
---|
|
The bash shell has a really strange syntax for arithmetic: it uses a double-parenthesis operator. Go figure. Code Block |
---|
|
echo $((2368720 / 4))
|
|
FASTQ Quality Assurance tools
...
Code Block |
---|
title | Running FastQC example |
---|
|
# setup
cds
mkdir fastqc_test
cd fastqc_test
cp $BI/web/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz .
# running the program
$BI/bin/FastQC/fastqc Sample_Yeast_L005_R1.cat.fastq.gz
|
...
Expand |
---|
|
No Format |
---|
title | ls -l shows something like this |
---|
|
drwxrwxr-x 4 abattenh G-803889 4096 May 20 22:59 Sample_Yeast_L005_R1.cat_fastqc
-rw-rw-r-- 1 abattenh G-803889 198239 May 20 22:59 Sample_Yeast_L005_R1.cat_fastqc.zip
-rwxr-xr-x 1 abattenh G-803889 51065629 May 20 22:59 Sample_Yeast_L005_R1.cat.fastq.gz
|
The Sample_Yeast_L005_R1.cat.fastq.gz file is what we analyzed, so FastQC created the other two items. Sample_Yeast_L005_R1.cat_fastqc is a directory (the "d" in "drwxrwxr-x"), so use ls Sample_Yeast_L005_R1.cat_fastqc to see what's in it. Sample_Yeast_L005_R1.cat_fastqc.zip is just a Zipped (compressed) version of the whole directory. |
...
Code Block |
---|
|
http://loving.corral.tacc.utexas.edu/bioiteam/yeast_stuff/Sample_Yeast_L005_R1.cat_fastqc/fastqc_report.html
|
...
Code Block |
---|
title | Running samstat on FASTQ example |
---|
|
# setup
cds
mkdir samstat_test
cd samstat_test
cp $BI/ngs_course/intro_to_mapping/data/SRR030257_1.fastq .
# run the program
$BI/bin/samstat SRR030257_1.fastq
|
...
Code Block |
---|
title | URL for viewing samstat results |
---|
|
http://loving.corral.tacc.utexas.edu/bioiteam/SRR030257_1.fastq.html
|
...
Code Block |
---|
title | Running samstat on BAM example |
---|
|
# setup
cds
mkdir samstat_test2
cd samstat_test2
cp $BI/web/yeast_stuff/yeast_chip_sort.bam .
# run the program
$BI/bin/samstat yeast_chip_sort.bam
|
...
Code Block |
---|
title | URL for viewing samstat results |
---|
|
http://loving.corral.tacc.utexas.edu/bioiteam/yeast_stuff/yeast_chip_sort.bam.html
|
...
Code Block |
---|
title | FASTX_toolkit module description |
---|
|
module spider fastx_toolkit
module load fastx_toolkit
|
...
No Format |
---|
title | fastx_trimmer example |
---|
|
gunzip -c $BI/web/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz | fastx_trimmer -l 50 -Q 33 > trimmed.fq
|
...
Expand |
---|
|
You could supply the -z option like this: No Format |
---|
title | fastx_trimmer example |
---|
|
gunzip -c $BI/web/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz | fastx_trimmer -l 50 -Q 33 -z > trimmed.fq.gz
|
Or you could gzip the output yourself: No Format |
---|
title | fastx_trimmer example |
---|
|
gunzip -c $BI/web/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz | fastx_trimmer -l 50 -Q 33 | gzip > trimmed.fq.gz
|
|
...
Expand |
---|
|
Type fastx_ then tab to see their names See all the programs like this: Code Block |
---|
title | fastx toolkit programs |
---|
|
ls $TACC_FASTX_BIN
|
|
Adapter trimming
...
The GSAF website describes the flavaors of Illumina adapter and barcode sequence in more detail https://wikisutexas.utexasatlassian.edunet/wiki/display/GSAF/Illumina+-+all+flavors
Cutadapt
The cutadapt program is an excellent tool for removing adapter contamination. The program is not available through TACC's module system but we've installed a copy in our $BI/bin directory.
...
Code Block |
---|
title | cutadapt command for R1 sequences |
---|
|
cutadapt -m 22 -O 10 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
|
Code Block |
---|
title | cutadapt command for R2 sequences |
---|
|
cutadapt -m 22 -O 10 -a TGATCGTCGGACTGTAGAACTCTGAACGTGTAGA
|
...
Expand |
---|
title | The gory details on the *-a* adapter sequence argument |
---|
|
Please refer to https://wikisutexas.utexasatlassian.edunet/wiki/display/GSAF/Illumina+-+all+flavors for Illumina library adapter layout. The top strand, 5' to 3', of a read sequence looks like this. No Format |
---|
title | Illumina library read layout |
---|
|
<P5 capture> <indexRead2> <Read 1 primer> [insert] <Read 2 primer> <indexRead1> <P7 capture>
|
The -a argument to cutadapt is documented as the "sequence of adapter that was ligated to the 3' end". So we care about the <Read 2 primer> for R1 reads, and the <Read 1 primer> for R2 reads. The "contaminent" for adapter trimming will be the <Read 2 primer> for R1 reads. There is only one Read 2 primer: Code Block |
---|
title | Read 2 primer, 5' to 3', used as R1 sequence adapter |
---|
|
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
|
The "contaminent" for adapter trimming will be the <Read 1 primer> for R2 reads. However, there are three different Read 1 primers, depending on library construction: No Format |
---|
title | Read 1 primer depends on library construction |
---|
|
TCTACACGTTCAGAGTTCTACAGTCCGACGATCA # small RNA sequencing primer site
CAGGTTCAGAGTTCTACAGTCCGACGATCA # "other"
TCTACACTCTTTCCCTACACGACGCTCTTCCGATCT # TruSeq Read 1 primer site. This is the RC of the R2 adapter
|
Since R2 reads are the reverse complement of R1 reads, the R2 adapter contaminent will be the RC of the Read 1 primer used. For ChIP-seq libraries where reads come from both DNA strands, the TruSeq Read 1 primer is always used. Since it is the RC of the Read 2 primer, its RC is just the Read 1 primer back Therefore, for ChIP-seq libraries only one cutadapt command is needed: Code Block |
---|
title | Cutadapt adapter sequence for ChIP-seq lib |
---|
| raries, both R1 and R2 reads}
cutadapt -a GATCGGAAGAGCACACGTCTGAACTCCAGTCAC
|
For RNAseq libraries, we use the small RNA sequencing primer as the Read 1 primer. The contaminent is then the RC of this, minus the 1st and last bases: No Format |
---|
title | Small RNA library Read 1 primer, 5' to 3', used as R2 sequence adapter |
---|
|
TCTACACGTTCAGAGTTCTACAGTCCGACGATCA # R1 primer - small RNA sequencing Read 1 primer site, 5' to 3'
TGATCGTCGGACTGTAGAACTCTGAACGTGTAGA # R2 adapter contaminent (RC of R1 small RNA sequencing Read 1 primer)
|
|
...
No Format |
---|
title | Example of trimming adaptor sequence from right end of reads |
---|
|
flexbar -n 1 --adapters adaptors.fna --source example.fastq --target example.ar --format fastq-sanger --adapter-threshold 2 --adapter-min-overlap 6 --adapter-trim-end RIGHT_TAIL
|
Code Block |
---|
title | Example adaptors.fna file |
---|
|
>adaptor1
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
>adaptor2
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG
>adaptor1_RC
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
>adaptor2_RC
CAAGCAGAAGACGGCATACGAGATNNNNNNNNGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
|
Note that flexbar only searches for the exact sequences given (with options to allow for a given number of mismatches) not the reverse complement of those sequences therefore you must provide them yourself.
Trimmomatic
Trimmomatic offers similar options with the potential benefit that many illumina adaptor sequences are already "built-in". It is available here.