...


  1. What are the 4 new files that were generated, and where did they come from?
    • E1-7_S187_L001_R2_001.trim.fastq.gz and E1-7_S187_L001_R1_001.trim.fastq.gz

      • These were created with the -o and -O options and are in the Trim_Reads folder. You likely found them using the ls command.
    • fastp.html and fastp.json
      • These are log files created under their default names, since we didn't specify names for them. This is part of why -j and -h were discussed above with the general command.
      • While the json file can be viewed in the terminal (with cat, less, more, head, or tail), the html file has to be transferred back to your computer to view.
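As a minimal sketch of looking at the json file in the terminal, the helper below prints only the filtering section instead of the whole report. The function name is hypothetical, and it assumes the report is pretty-printed with one key per line and has a top-level "filtering_result" entry followed by five counts; check your own fastp.json with head or less if the output looks truncated.

```shell
# Print the "filtering_result" block of a fastp JSON report.
# Assumption: one key per line, and five entries after the
# "filtering_result" line (adjust -A if your report differs).
show_filtering_result() {
    grep -A 5 '"filtering_result"' "$1"
}

# Only runs when the report from the exercise is in the current directory.
if [ -f fastp.json ]; then
    show_filtering_result fastp.json
fi
```

If the key names match, the counts printed here should agree with the "Filtering result" block that fastp printed to the screen.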



  2. How many reads were left after trimming?
    • 5884 paired-end reads (read pairs)

    • 11768 total reads

      • You likely found this out from using the zgrep command, or from the following blocks that printed as the command ran:

        Read1 after filtering:
        total reads: 5884
        total bases: 791763
        Q20 bases: 782948(98.8867%)
        Q30 bases: 765510(96.6842%)
        
        Read2 after filtering:
        total reads: 5884
        total bases: 791763
        Q20 bases: 711414(89.8519%)
        Q30 bases: 658164(83.1264%)
        
        Filtering result:
        reads passed filter: 11768
        reads failed due to low quality: 2014
        reads failed due to too many N: 0
        reads failed due to too short: 0
        reads with adapter trimmed: 3970
        bases trimmed due to adapters: 193972
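One way to check these numbers yourself is to count the records in a trimmed file. The sketch below assumes standard 4-line FASTQ records and that you are in the directory containing the Trim_Reads folder; the function name is hypothetical. It counts lines rather than matching "@", since quality lines can also begin with "@".

```shell
# Count reads in a gzipped FASTQ file: each record spans exactly 4 lines.
count_reads() {
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}

# File name taken from the exercise output; skipped if it is not present.
f=Trim_Reads/E1-7_S187_L001_R1_001.trim.fastq.gz
if [ -f "$f" ]; then
    count_reads "$f"
fi

# Cross-check with the printed summary: R1 + R2 reads should equal
# the "reads passed filter" value.
echo $(( 5884 + 5884 ))   # 11768
```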




  3. How big was our fragment size? How is this estimated, what might make it inaccurate?
    • From the information generated while the command ran we see:

        Insert size peak (evaluated by paired-end reads): 171


      • This tells us that the most common (peak) insert size was 171 bases, and that it was estimated by looking at the overlap between the read pairs. It is potentially inaccurate because read pairs that do not overlap each other cannot be used to estimate the size.
    • If you transferred the .html file back to your laptop, you would see this relevant histogram:

      • The general section of the summary at the top of the html tells us that the insert size peak was 171, while the histogram tells us that 50% of our data has an insert size of <18 or >272 bases
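The upper limit of 272 is consistent with how an overlap-based estimate works: two reads can only be sized if they overlap by some minimum length. A rough sketch of that arithmetic follows, using the 151bp PE read length mentioned in this tutorial and assuming fastp's minimum overlap (the --overlap_len_require option) was left at what I believe is its default of 30.

```shell
read_len=151      # PE read length used for this library
min_overlap=30    # assumed fastp default for --overlap_len_require

# The largest insert two overlapping reads can span is both read
# lengths minus the shortest overlap that is still detected.
max_insert=$(( 2 * read_len - min_overlap ))
echo "largest insert size measurable by overlap: ${max_insert}"
```

Read pairs from longer fragments never overlap enough to be measured, which is one reason the reported peak may not describe the whole library.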



  4. Did our sample have any adapter present?
    • If you only looked at the information that printed to the screen, you probably answered "No"
      • You likely saw the following block and thought this was the end of the answer:

        Detecting adapter sequence for read1...
        No adapter detected for read1
        
        Detecting adapter sequence for read2...
        No adapter detected for read2


    • A fuller answer might be "maybe" or "probably" or "I'm not sure" as:
      • 1. Not finding any adapter would be very rare
      • 2. If 45% of our reads have an insert size of 171 bases, and we did 151bp PE sequencing, we should be able to find adapter sequences
      • 3. In the filtering results we see:

        Filtering result:
        reads passed filter: 11768
        reads failed due to low quality: 2014
        reads failed due to too many N: 0
        reads failed due to too short: 0
        reads with adapter trimmed: 3970
        bases trimmed due to adapters: 193972


    • If you looked at the html file you probably answered "yes"
      • There is a section for Read1 and Read2 adapters that shows a growing stretch of DNA which recreates the Illumina adapter sequences.
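Two quick checks can back up the "yes" answer from the command line. The sketch below first works out how many bases were removed per adapter-trimmed read from the numbers fastp printed, then defines a hypothetical helper that counts reads containing the start of a common Illumina adapter. The sequence AGATCGGAAGAGC is an assumption about this library's adapters; use the sequence shown in your html report if it differs.

```shell
# Average bases removed per adapter-trimmed read (values from the log).
awk 'BEGIN { printf "%.1f bases per trimmed read\n", 193972 / 3970 }'

# Count reads whose sequence line contains the adapter prefix.
# Assumption: 4-line FASTQ records, so sequences are on lines 2, 6, 10, ...
count_adapter_reads() {
    gzip -dc "$1" | awk 'NR % 4 == 2' | grep -c 'AGATCGGAAGAGC'
}

# Usage (placeholder path, point it at an untrimmed read file):
#   count_adapter_reads path/to/untrimmed_R1.fastq.gz
```

Running the helper on the untrimmed and trimmed files should show the count dropping after fastp if adapters really were present.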



  5. Why was answering whether adapters were present not straightforward?

    As we saw in our fastqc reports (overrepresented sequences having "no hit" and adapter content staying at the bottom of the graph), for something to be classified as an "adapter" in the first section of the printed information, it has to meet certain criteria that in this instance (and many others) are perhaps a bit too stringent.



  6. What other interesting things can you find from this command?

    This is pretty open-ended. Take a look at the html file in particular, see which parts of it do or don't make sense, and consider asking a question if you would like to know more.

    Info

    One of several things that may stand out to you is that a large fraction of reads end with stretches of "G". There are 2 things to note with this: 1. 2-color sequencing on Illumina (detailed information here) reads "no color" as "G". 2. This library is very fragmented and contains adapter dimers, meaning that in some cases there are only ~40bp downstream of the sequencing primer location, leaving 60 cycles that have no template available. If you look at the help for fastp, the following options may stand out to you as a way to deal with this:

    • -g, --trim_poly_g                    force polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data
    • --poly_g_min_len                 the minimum length to detect polyG in the read tail. 10 by default. (int [=10])
    • -G, --disable_trim_poly_g            disable polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data
    • -x, --trim_poly_x                    enable polyX trimming in 3' ends.
    • --poly_x_min_len                 the minimum length to detect polyX in the read tail. 10 by default. (int [=10])

    Consider rerunning the fastp command while adding "-g" to the command line and see how the results differ.
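A rerun might look something like the sketch below. The input paths are placeholders for whatever raw files you used originally, and the output and report names are hypothetical, chosen so the first run's files are not overwritten and the two html reports can be compared side by side. The command is printed first and only executed if fastp is installed and the inputs exist.

```shell
# Placeholder paths: substitute the raw read files from your original command.
raw_r1="path/to/raw_R1.fastq.gz"
raw_r2="path/to/raw_R2.fastq.gz"

# Inputs (-i/-I), outputs (-o/-O), named reports (-j/-h), plus -g to
# force polyG tail trimming. Output names here are hypothetical.
cmd="fastp -i $raw_r1 -I $raw_r2 -o Trim_Reads/R1.polyG.trim.fastq.gz -O Trim_Reads/R2.polyG.trim.fastq.gz -j fastp_polyG.json -h fastp_polyG.html -g"
echo "$cmd"

# Run only when fastp is available and both inputs are present.
# (Unquoted $cmd relies on the paths containing no spaces.)
if command -v fastp >/dev/null 2>&1 && [ -f "$raw_r1" ] && [ -f "$raw_r2" ]; then
    $cmd
fi
```

Comparing the "Filtering result" blocks and the per-cycle base content plots from the two runs is a reasonable place to start looking for differences.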





Trim all the samples from the multiqc tutorial

...