Molecular Index Error correction GVA2019
Overview:
This section provides directions for generating SSCS (Single Strand Consensus Sequence) reads and trimming molecular indexes from raw fastq files.
Learning Objectives:
- Use a Python script to generate SSCS reads.
- Use cutadapt to trim molecular indexes from duplex seq libraries.
Tutorial: SSCS Reads
First, we want to generate SSCS reads by taking advantage of the molecular indexes added during library prep. To do so, we will use a "majority rules" Python script (named SSCS_DCS.py), which DED heavily modified from a script originally created by Mike Schmitt and Scott Kennedy for the original duplex seq paper. This script can be found in the $BI/bin directory. For the purpose of this tutorial, the paired-end sequencing data for sample DED110 (prepared with a molecular index library) has been placed in the $BI/gva_course/mixed_population directory. Invoking the script is as simple as typing SSCS_DCS.py; adding -h will list the available options. The goal of this command is to generate SSCS reads for any molecular index with at least 2 reads present, and to generate a log file that tells us some information about the data.
This should take 10 minutes or less to complete in an idev shell. In the meantime, we suggest looking over the alternative library prep presentation or the duplex sequencing paper itself.
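To make the "majority rules" idea concrete, here is a minimal sketch of how reads sharing a molecular index could be collapsed into one consensus read. This is not the code in SSCS_DCS.py: the function names, the toy indexes and reads, and the choice to write N where no base has a strict majority are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def majority_consensus(reads, min_reads=2):
    """Return a per-position majority-vote consensus, or None if too few reads."""
    if len(reads) < min_reads:
        return None
    # Compare only the positions that are present in every read.
    length = min(len(read) for read in reads)
    consensus = []
    for pos in range(length):
        counts = Counter(read[pos] for read in reads)
        base, votes = counts.most_common(1)[0]
        # Assumption: require a strict majority, otherwise mask with N.
        consensus.append(base if votes * 2 > len(reads) else "N")
    return "".join(consensus)


def build_sscs(tagged_reads, min_reads=2):
    """Group (molecular_index, sequence) pairs by index and collapse each group."""
    families = defaultdict(list)
    for index, seq in tagged_reads:
        families[index].append(seq)
    sscs = {}
    for index, seqs in families.items():
        consensus = majority_consensus(seqs, min_reads)
        if consensus is not None:
            sscs[index] = consensus
    return sscs


# Toy data: hypothetical molecular indexes and reads, not taken from DED110.
toy_reads = [
    ("AACG", "ACGTACGT"),
    ("AACG", "ACGTACGT"),
    ("AACG", "ACGAACGT"),  # one disagreeing base at position 3 (0-based)
    ("TTGC", "GGGTTTCC"),  # only one read for this index, so it is dropped
]
result = build_sscs(toy_reads)
print(result)
```

In the toy data, the single disagreeing base is outvoted by the two matching reads, and the index seen only once produces no consensus read, mirroring the "at least 2 reads" requirement described above.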
Error correction evaluation:
The SSCS_Log is a great place to start. Use the tail command to look at the last 8 lines of the log file to determine how many reads made it from raw reads to error corrected SSCS reads.
Perhaps more interesting is the number of errors removed. This is also available in the SSCS_Log file, but it sits in the middle of the file and has no good handle to grep for. One option is to cat the entire file and scroll around; another is to combine the tail and head commands to print only the specific lines.
The 3 columns are the read position, the number of bases changed, and the number of bases not changed. If you copy and paste these 3 columns into Excel, you can easily calculate the sum of the 2nd column to see that 446,104 bases were changed. The read position is based on the 5'-3' sequence, and you should notice that, in general, the higher the read position, the more errors were corrected. This should make sense based on what we have discussed about quality scores decreasing as read length increases.
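If you would rather not paste into a spreadsheet, the same column sum can be sketched in Python. The example below assumes the extracted lines are whitespace-separated plain integers in the order position, changed, not changed; the function name and the toy rows are made up for illustration and are not values from the DED110 log.

```python
def summarize_corrections(lines):
    """Sum changed and unchanged base counts from 'position changed unchanged' rows."""
    total_changed = 0
    total_unchanged = 0
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or differently shaped lines
        try:
            _position, changed, unchanged = (int(field) for field in fields)
        except ValueError:
            continue  # skip rows that are not purely numeric, such as headers
        total_changed += changed
        total_unchanged += unchanged
    return total_changed, total_unchanged


# Toy rows only; replace with the lines you extracted from the log.
toy_rows = [
    "position changed unchanged",
    "1 2 998",
    "2 3 997",
    "3 10 990",
]
changed_total, unchanged_total = summarize_corrections(toy_rows)
print(changed_total, unchanged_total)
```

Run on the real log lines, the first number returned should match the spreadsheet sum of the 2nd column.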
Tutorial (Trimmed Reads with cutadapt):
From our earlier tutorial on read quality control, you likely remember that you can load cutadapt as a module. If you feel like you need a hint to do this, pause and think for a minute and try some things. If you still can't get it, raise your hand and talk to us: this is something you should be able to do on your own by now, so we need to explain things differently.
Each of these commands will take 1-2 minutes to complete. Think about ways you could have run both commands at the same time. In the tutorials (including some of the optional ones) there are at least 3 ways that we have shown you to do it, and many more we haven't. How many can you come up with?
Of the 3 answers we show above, one of them will actually finish much sooner than the first. Do you know which one and why?
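As a conceptual aside on what trimming a molecular index means: if the index occupies a fixed number of bases at the 5' end of a read, removing it drops that many characters from both the sequence line and the quality line so the two stay the same length. The sketch below is not a replacement for cutadapt; the function name, the toy record, and the trim length of 4 are hypothetical, and the real index length depends on how the library was prepared.

```python
def trim_five_prime(record, n_bases):
    """Remove the first n_bases from the sequence and quality of a FASTQ record."""
    header, seq, plus, qual = record
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    # Slice both strings identically so each base keeps its own quality score.
    return (header, seq[n_bases:], plus, qual[n_bases:])


# Hypothetical record and trim length, for illustration only.
toy_record = ("@toy_read", "AACGTTGGACGTACGT", "+", "IIIIIIIIIIIIIIII")
trimmed = trim_five_prime(toy_record, 4)
print(trimmed)
```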
Checking the current contents of the directory will show that we have now made 2 new .trimmed.fastq files in addition to the trio of .fastq files we made in the error correction part of the tutorial. DED110_SSCS.fastq is the file of most interest to us for the follow-up tutorial, and both .trimmed.fastq files will be of interest as well. Rather than working with 3 files for 2 samples (error corrected and trimmed), use what you have learned about piping to generate a single file called DED110_all.trimmed.fastq, and check your work.
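One way to check your work is to confirm that the combined file holds exactly as many reads as its inputs put together. The sketch below assumes the standard 4-lines-per-read FASTQ layout and input files that end with a newline; the helper names and the toy file names are placeholders, and the toy files are written to a temporary directory so nothing in your working directory is touched.

```python
import os
import shutil
import tempfile


def count_fastq_records(path):
    """Count reads, assuming each read occupies exactly 4 lines."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def concatenate(inputs, output):
    """Append each input file to the output in order."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


# Toy demonstration; these file names are placeholders, not the course files.
workdir = tempfile.mkdtemp()
first = os.path.join(workdir, "toy_1.trimmed.fastq")
second = os.path.join(workdir, "toy_2.trimmed.fastq")
combined = os.path.join(workdir, "toy_all.trimmed.fastq")
with open(first, "w") as fh:
    fh.write("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n")
with open(second, "w") as fh:
    fh.write("@read3\nGGCC\n+\nIIII\n")

concatenate([first, second], combined)
expected = count_fastq_records(first) + count_fastq_records(second)
observed = count_fastq_records(combined)
print(expected, observed)
```

If the two numbers differ for your real files, the combined file is missing reads or contains some twice.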
Next step:
You should now have 2 new .fastq files in which we will call variants: DED110_SSCS.fastq and DED110_all.trimmed.fastq. Take these files into a more in-depth breseq tutorial to compare the specific mutations that are eliminated by the error correction (SSCS). Link to other tutorial.