Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Overview:

This section provides directions for generating SSCS (Single Strand Consensus Sequence) reads and trimming molecular indexes from raw fastq files. 

Learning Objectives:

  1. Use python script to generate SSCS Reads.
  2. Use flexbar to trim molecular indexes from duplex seq libraries.

Tutorial: SSCS Reads

First we want to generate SSCS reads where we take advantage of the molecular indexes added during library prep. To do so we will use a "majority rules" python script (named SSCS_DCS.py) which was heavily modified by DED from a script originally created by Mike Schmitt and Scott Kennedy for the original duplex seq paper. This script can be found in the $BI/scripts directory. For the purpose of this tutorial, the paired end sequencing of sample DED110 (prepared with a molecular index library) has been placed in the  $BI/gva_course/mixed_population directory. Invoking the script is as simple as typing SSCS_DCS.py; adding -h will give a list of the available options. The goal of this command is to generate SSCS reads, for any molecular index where we have at least 2 reads present, and to generate a log file which will tell us some information about the data.

...

This should take ~15 minutes to complete in an idev shell. 

Error correction evaluation:

The SSCS_Log is a great place to start. Use the tail command to look at the last 8 lines of the log file to determine how many reads made it from raw reads to error corrected SSCS reads. 

...

The 3 columns are the read posistion, the number of bases changed, and the number of bases not changed. If you copy and paste these 3 columns into excel you can easily calculate the sum of the 2nd column to see that 446,104 bases were changed. The read position is based on the 5-3' sequence, and you should notice that generally the higher the read position, the more errors were corrected. This should make sense based on what we have talked about with decreasing quality scores as read length increases.

Tutorial (Trimmed Reads):

Similar to our comments about version control, it is not always necessary to update to the newest and best tools, sometimes the old ones work "well enough" and can be "left alone". For the purpose of this tutorial, we will be working with flexbar which like breseq is something that we have installed in the BioITeam as it is not a tacc module. Test that flexbar is installed and working correctly using the which command and 'flexbar -h'. If these commands work, note that this is the perfect example of the benefits of the BioITeam. This is has proven to be an extremely difficult command to install and configure and install in the past, leading me to place it in the $BI/bin directory, but it still required additional options (expand the next section to see what those may have been). Yet if it is working it strongly suggests some other member of the BioITeam found it available, and fixed it in some way so it works by default. While we will never again fight with flexbar to make it work somewhere new or with set of data that is different, as long as it is working we will keep using it. As a optional exercise you could figure out how to use the fastx_toolkit  using what you learned in the quality control tutorial to accomplish the same result.

...

Expand
titleclick here for the answer to which file is which

the trimmed_1/2.fastq is the trimmed fastq files

the trimmed_1/2.fastq.lengthdist is the length distribution file (which should have 2 lines: a header line and a line showing that all of the reads are 85 bp long now

Next step:

You should now have 3 new .fastq files which we will use to call variants in: DED110_SSCS.fastq, trimmed_1.fastq, and trimmed_2.fastq. You could take these files into a more in depth breseq tutorial that was prepared last year for comparisons of the specific mutations that are eliminated using the error correction (SSCS). Link to other optional tutorial.