...

Often, the first thing you (or your boss) want to know about your sequencing run is simply, "How many reads do I have?" For the $BI/gva_course/mapping/data/SRR030257_1.fastq file, the answer is 3,800,180. How can we figure that out?

The grep (Global Regular Expression Print) command can count the number of lines that match some criteria, as shown above. The pattern used above matched:

  1. anything from the group of ACTGN with the [] marking them as a group
  2. matching any number of times *
  3. from the beginning of the line ^
  4. to the end of the line $
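The four pieces listed above can be put together into a single pattern. The following is a minimal sketch, not necessarily the exact command used earlier on this page: toy.fastq is a made-up 3-read file created only so that the example runs anywhere.

```bash
# Build a tiny, made-up FASTQ file (3 reads) so the example runs anywhere.
printf '%s\n' \
  '@read1' 'ACGTN' '+' 'IIIII' \
  '@read2' 'GGGTA' '+' 'IIIII' \
  '@read3' 'AAAAA' '+' 'IIIII' > toy.fastq

# ^ anchors the start of the line, [ACTGN]* matches any run of those letters,
# and $ anchors the end, so only lines made up entirely of ACTGN are counted.
seq_lines=$(grep -c '^[ACTGN]*$' toy.fastq)
echo "$seq_lines"
```

Keep in mind that * also matches zero characters, so empty lines would be counted too, and that A, C, G, T and N are themselves valid quality-score characters, so a quality line could in principle match as well.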

Here, since we are only interested in the number of reads, we can take advantage of the fact that the 3rd line of each read's 4-line record in this fastq file is a + and only a +. Combined with grep's -c option, which reports the number of matching lines instead of printing them, this gives the number of reads in the file.

Can you use the information above to write a grep command to count the number of reads in the same file?

grep -c "^+$" $BI/gva_course/mapping/data/SRR030257_1.fastq
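One way to sanity-check a count like this is to compare it against the total number of lines divided by 4, since each read occupies 4 lines. The sketch below uses a made-up file, toy.fastq, rather than the course data, and assumes that every record is exactly 4 lines long (no wrapped sequences).

```bash
# Made-up 3-read FASTQ file used only for this illustration.
printf '%s\n' \
  '@read1' 'ACGTN' '+' 'IIIII' \
  '@read2' 'GGGTA' '+' 'IIIII' \
  '@read3' 'AAAAA' '+' 'IIIII' > toy.fastq

# Count the lines that consist of a single + and nothing else.
plus_count=$(grep -c '^+$' toy.fastq)

# Count all lines and divide by 4 (assumes 4 lines per read).
total_lines=$(wc -l < toy.fastq)
line_count=$(( total_lines / 4 ))

echo "grep: $plus_count  wc: $line_count"
```

If the two numbers disagree on a real file, the file may be truncated, or it may not follow the plain 4-line layout that both approaches assume.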

...

Answer

The Per base sequence quality report does not look great. If I were making the call to trim based solely on this, I'd probably pick 31 or 32 as the last base, as this is the first position at which the average quality score drops significantly. More importantly, nearly 1.5% of all the sequences are all A's according to the Overrepresented sequences report. This often comes up in Illumina MiSeq runs that have insert sizes shorter than the overall read length. Next we'll start looking at how to trim our data before continuing.
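The fraction of all-A reads can also be estimated directly from the command line. This is a rough sketch on a made-up 4-read file (toy.fastq), not the way FastQC computes its Overrepresented sequences table, and it assumes 4 lines per read.

```bash
# Made-up 4-read FASTQ file; exactly one read is all A's.
printf '%s\n' \
  '@read1' 'ACGTN' '+' 'IIIII' \
  '@read2' 'GGGTA' '+' 'IIIII' \
  '@read3' 'AAAAA' '+' 'IIIII' \
  '@read4' 'TTGCA' '+' 'IIIII' > toy.fastq

# Total reads: number of lines divided by 4.
total=$(awk 'END { print NR / 4 }' toy.fastq)

# NR % 4 == 2 selects only the sequence line of each record,
# and /^A+$/ keeps those made up of nothing but A.
poly_a=$(awk 'NR % 4 == 2 && /^A+$/ { n++ } END { print n + 0 }' toy.fastq)

# Express the all-A reads as a percentage of all reads.
pct=$(awk -v a="$poly_a" -v t="$total" 'BEGIN { printf "%.1f", 100 * a / t }')
echo "$poly_a of $total reads are all A ($pct%)"
```

Restricting the match to the second line of each record avoids accidentally counting a quality line that happens to contain only the character A.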

FASTQ Processing Tools

Cutadapt

There are a number of open-source tools that can trim off 3' bases and produce a FASTQ file of the trimmed reads to use as input to the alignment program. Cutadapt provides a simple command-line tool for manipulating FASTA and FASTQ files. The program description on its website gives good detail on all of its capabilities, along with examples for some common tasks. Cutadapt is also available through the TACC module system, so we can load it when we need it and not worry about it the rest of the time.
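As a rough illustration of trimming 3' bases with Cutadapt, the sketch below shortens every read to 32 bases, one of the cutoffs suggested by the quality report discussed earlier. The file names are made up, the module name cutadapt is an assumption about how TACC exposes the tool, and the cutoff that is right for your data may differ.

```bash
# Assumption: on TACC the tool is exposed as a module named "cutadapt";
# the real module name may differ (check with: module spider cutadapt).
# module load cutadapt

# Made-up single-read FASTQ file with a 36-base read.
seq12='ACGTACGTACGT'
qual12='IIIIIIIIIIII'
printf '%s\n' '@read1' "$seq12$seq12$seq12" '+' "$qual12$qual12$qual12" > toy.fastq

if command -v cutadapt >/dev/null 2>&1; then
  # -l 32 shortens each read to at most 32 bases by removing bases from the 3' end;
  # -o names the file that receives the trimmed reads.
  cutadapt -l 32 -o toy.trimmed.fastq toy.fastq
else
  echo "cutadapt not found; load or install it first" >&2
fi
```

A fixed-length cut like this is the simplest option; Cutadapt also supports quality-based and adapter-based trimming, which are described in its documentation.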

...