Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Table of Contents

...

A common question is 'after you submit something for sequencing what do you get back?' The answer is FASTQ files.

While there is some additional log files that you may be able to get off the instrument, the reality is none of those are actually 'data' of anything other than high level instrument performance. The good news is you don't actually need anything else. For single end sequencing you would have a single file, while paired end sequencing provides 2 files: 1 for read1 and another for read2. Each file contains a repeating 4-line entry for each individual read.

...

Often, the first thing you (or your boss) want to know about your sequencing run is simply, "how many reads do I have?". For the $BI/gva_course/mapping/data/SRR030257_1.fastq file, the answer is 3,800,180. How can we figure that out?

The grep (or Global Regular Expression Print) command can be used to determine the number of lines which match some criteria as shown above. Above we searched for:

  1. anything from the group of ACTGN with the [] marking them as a group
  2. matching any number of times *
  3. from the beginning of the line ^
  4. to the end of the line $

Here, since we are only interested in the number of reads that we have, we can make use of knowing the 3rd line in the fastq file is a + and a + only, and grep's -c option to simply report the number of reads in a file.

Code Block
languagebash
titleCan you use the information above to write a grep command to count the number of reads in the same file?
collapsetrue
grep -c "^+$" $BI/gva_course/mapping/data/SRR030257_1.fastq

...

Code Block
languagebash
titlePossible solution
collapsetrue
cutadapt -o SRR030257_1.Adepleted.fastq -a AAAAAAAAAAAAAAAAAAAA -m 16 SRR030257_1.fastq

 

Command portionpurpose
-o SRR030257_1.Adepleted.fastqcreate this new output file emphasizing that the A bases have been depleted

-a AAAAAAAAAAAAAAAAA

remove bases containing this sequence. 20 As are listed here. Thinking further, how often do homopolymers exist naturally in genomes? when they do exist, how long do they tend to be?
-m 16discard any read shorter than 16 bases after sequence removed as these are more likely difficult to uniquely align to your reference
SRR030257_1.fastquse this file as input

From the summary printed to the screen you can see that this removed a little over 2.5M bp of sequence.

...

Code Block
languagebash
titleOne possible solution
collapsetrue
cutadapt -m 16 -l 34 -o SRR030257_1.trimmed.fastq SRR030257_1.fastq
Command portionpurpose
-o SRR030257_1.trimmed.fastqcreate this new output file emphasizing that the reads have been trimmed

-l 34

Sets the read length to 34
-m 16discard any read shorter than 16 bases after sequence removed as these are more likely difficult to uniquely align to your reference
SRR030257_1.fastquse this file as input

This time, we see we have now removed much more sequence: more than 7.5M bp. That tells us that we got rid of at least 5 million bp that were not part of a chain of As without telling us anything about if they were the lower quality bases that were dragging down the overall quality score. Additionally you may have noticed that trimming was much quicker as the command was more simple.

...

This is another example of different solutions giving the same final product, and how careful reading of documentation may improve your work. NGS data analysis is a results driven process, if the result is correct, and you know how you got the result it is correct as long as it is reproducible.

A note on versions 


Info

In our first tutorial we mentioned how knowing what version of a program you are using can be. When we loaded the the cutadapt module we didn't specify what version to load. Can you figure out what version you used, and what the most recent version of the program there is? .

Expand
titleHow to figure out the currently installed version

try using the module system or the program's help files

Code Block
languagebash
titleStill not sure?
collapsetrue
module spider cutadapt

cutadapt --version

Figuring out the most recent version is a little more complicated. Unlike programs on your computer like Microsoft Office or your internet browser, there is nothing in an installed program that tells you if you have the newest version or even what the newest version is. If you go to the programs website (easily found with google or this link), the changes section lists all the versions that have been list with v2.10 being released on April 22nd this year.

Expand
titleTake a moment to think about why there might be such a big discrepancy before clicking here for the list of possible reasons I put together.

The biggest reason is that someone at tacc has to go through a process of noticing that there is a new version, figuring out if all of the changes are compatible with tacc, installing it, and then fielding questions and problems from users who were used to using the old version and have problems with the new version somehow.

The next reason is that the existing version works, and if you read through some of the recent changes, are very small and do not effect the function of the program very much.


Together, this is why I encourage you to make note of what version of the programs you use when you use them (primarily by loading modules complete with the versions in your .bashrc file), and to consider installing programs yourself when appropriate (as is discussed in the advanced trimmomatic tutorial for read trimming).

This won't be the last time we mention different program versions.

Optional Exercise: Improve the quality of R2 the same way you did for R1.

...