...

Last year's tutorial ran on lonestar5 using only the R1 data, and the runs took ~40 minutes to complete. You can review last year's tutorial for the specific commands if interested. On stampede2 with both R1 and R2, the run takes ~5 hours to complete, making it a task for the job queue system rather than an idev node. When we begin to make our slurm file, we will request 6 hours to be on the safe side (this behavior will be discussed on Friday).
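As a rough sketch of what that time request will look like (the full slurm file is covered on Friday; the job name, output file, and queue shown here are placeholders, not the tutorial's actual values), the run time limit is set with the -t directive:

    #SBATCH -J read_mapping        # job name (placeholder)
    #SBATCH -o read_mapping.o%j    # output file name; %j expands to the job ID
    #SBATCH -p normal              # queue (partition) to submit to (placeholder)
    #SBATCH -N 1                   # number of nodes requested
    #SBATCH -n 1                   # number of tasks requested
    #SBATCH -t 06:00:00            # requested run time of 6 hours (hh:mm:ss)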

Info: Why might "only" looking at R1 data be ok?

Throwing out half your data probably seems like a bad idea, and certainly a waste of money. Sometimes, however, it does not impact your analysis at all, and sometimes it can actually improve it.

1. If you have more coverage than you need (i.e. 200x coverage of a clonal sample), downsampling your data to use fewer reads can improve both the analysis speed and the accuracy, as fewer errors will be present. If the analysis pipeline you are using does not make use of paired end information, this can be accomplished by only looking at R1; if it does, you could use the head or tail commands to extract a given number of reads (see the sketch after this list). Some programs also include options to limit coverage to a certain level; these go about it in different ways but require no modification of your read files.

2. While rare, it is possible that R1 or R2 simply has much lower quality scores than the other read. In this case it would be appropriate to drop one read or the other from your analysis. This is something you could determine using fastqc and/or multiqc, which have their own tutorial at: MultiQC - fastQC summary tool -- GVA2021
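
Following up on point 1 above, a quick way to downsample when you only need a fixed number of reads is the head command. A minimal sketch (the file name and read count are hypothetical; each fastq record spans 4 lines, so multiply the number of reads you want by 4):

    # keep the first 100,000 reads (400,000 lines) of a hypothetical R1 fastq file
    head -n 400000 sample_R1.fastq > sample_R1.100k.fastq

If your pipeline does use paired end information, the same command would need to be run on both R1 and R2 so the two files stay in sync.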

...