...

Here we will assume you have data from GSAF's Illumina HiSeq or MiSeq sequencer.

Learning Objectives

...

See if you can figure out what's wrong with these data sets (copy them to your $SCRATCH directory before analyzing them), and then process them to get rid of the problem(s). Each file has a different problem. If you're very ambitious, you could also map the reads to the reference genomes and perform variant calling before and after cleaning them up to see how the results change.
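As a first look at a copied file, it can help to check how many reads it holds and how long they are. The sketch below is only one way to do that: `fastq_summary` is a hypothetical helper name (not part of the course materials), and it assumes standard four-line FASTQ records in a gzipped file.

```bash
# Hypothetical helper: report read count and read-length range for a gzipped FASTQ.
# Assumes standard 4-line records (header, sequence, "+", qualities).
fastq_summary() {
    gzip -dc "$1" | awk 'NR % 4 == 2 {
        n++; len = length($0)
        if (n == 1 || len < min) min = len   # track shortest read
        if (len > max) max = len             # track longest read
    }
    END { printf "reads=%d min_len=%d max_len=%d\n", n, min, max }'
}

# Assumed usage: copy an example file to $SCRATCH first, then inspect the copy.
# cp $BI/gva_course/read_processing/61FTVAAXX_2_R1_ZDB172.fastq.gz $SCRATCH/
# fastq_summary $SCRATCH/61FTVAAXX_2_R1_ZDB172.fastq.gz
```

A wide gap between the minimum and maximum length, or a read count far from what you expected, is a hint worth following up with a dedicated quality-control tool.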

Example #1: Single-end Illumina MiSeq data for E. coli

Example read file and reference files #1

$BI/gva_course/read_processing_example/JJM104_TAAGGCGA-TAGATCGC_L001_R1_001.fastq.gz
$BI/gva_course/read_processing/REL606.fna

What's wrong with this data?

This

 

Example #2: Paired-end Illumina Genome Analyzer IIx data for E. coli

Example read and reference files #2

$BI/gva_course/read_processing/61FTVAAXX_2_R1_ZDB172.fastq.gz
$BI/gva_course/read_processing/61FTVAAXX_2_R2_ZDB172.fastq.gz
$BI/gva_course/read_processing/REL606.fna

What's wrong with this data?

Some sort of problem during library prep strongly biased the beginning of the reads toward "T". Unfortunately, post-processing can't help with this one. The read sequences themselves are fine, but coverage across the genome is so uneven that many regions were never sampled (they have zero coverage), even though the volume of sequencing data was very high for a microbial genome. The facility had to do a new library prep and re-sequence to correct this issue.
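One quick way to look for this kind of start-of-read bias is to tally the first base of every read. The following is a minimal sketch, not the method used in the course: `first_base_counts` is a hypothetical helper name, it assumes four-line FASTQ records, and it looks only at position 1 of each read.

```bash
# Hypothetical helper: count how often each base appears at position 1 of the reads.
# Assumes a gzipped FASTQ with standard 4-line records.
first_base_counts() {
    gzip -dc "$1" | awk 'NR % 4 == 2 { count[substr($0, 1, 1)]++ }
    END { for (b in count) printf "%s\t%d\n", b, count[b] }' | sort
}

# Assumed usage on a copy in $SCRATCH:
# first_base_counts $SCRATCH/61FTVAAXX_2_R1_ZDB172.fastq.gz
```

In an unbiased library the four bases should appear in proportions roughly reflecting the genome's composition, so one base dominating the tally is a warning sign.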