Bedtools tutorial -- GVA2019
Introduction
Throughout the course we have focused on small data sets, a limited number of samples, and in some cases we have even purposefully capped the total number of reads you have access to. This was done to save time and let you watch the results tick by, rather than having you come in for 30 minutes, submit a job, wait an hour (or 6) before it starts running, and then wait another 10 hours for it to finish. The reality is that while you will sometimes work with a test sample or a small pilot project, Big Data in Biology means LOTS of data, and lots of data means needing to not just identify variants in 1 sample, but to identify commonality across different samples. Here we introduce you to bedtools, a program designed to make comparisons across different file types generated from different samples, or using different parameters of a given pipeline.
A note on version control
As mentioned in our samtools tutorial, different versions of software behave differently. Once again we have a situation where the BioITeam bedtools is available by default (version 2.20.1), while a different version is available on Lonestar (version 2.25.0). A KEY addition in version 2.21 (i.e. the version after the one available to you by default through the BioITeam as of this writing) was the ability of bedtools to scan multiple files simultaneously, rather than having to sequentially scan pairs of files. Using the old version, to identify the common variants of files A, B, C, and D, bedtools would have to be invoked a minimum of 3 times (a sketch of the equivalent commands follows the list below):
- A & B = E
- C & D = F
- E & F = G
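As a concrete sketch of that pairwise approach, the commands below chain `bedtools intersect` calls (by default, intersect reports the portion of each interval shared between -a and -b). The file names A.bed through G.bed are placeholders for illustration, not files from this course:

```bash
# Pairwise intersections with the old (2.20.1) bedtools:
bedtools intersect -a A.bed -b B.bed > E.bed   # A & B = E
bedtools intersect -a C.bed -b D.bed > F.bed   # C & D = F
bedtools intersect -a E.bed -b F.bed > G.bed   # E & F = G (common to all 4)

# The same idea as a simple wrapper: fold any number of files down one
# at a time, invoking bedtools (number of samples - 1) times.
cp A.bed running.bed
for f in B.bed C.bed D.bed; do
    bedtools intersect -a running.bed -b "$f" > tmp.bed && mv tmp.bed running.bed
done
```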
More broadly, it would have to be invoked a minimum of (number of samples - 1) times, so as you start adding more and more samples, the commands get more and more tedious to write. Hopefully after yesterday's tutorials, the sentence "you could write a wrapper to handle this for you" makes sense; if it doesn't, you should ask a question, even if the question is "I don't know what question to ask". With the new version the wrapper is not necessary, but the computational requirements increase substantially. This is especially true when looking for convergence (i.e. variants common across all samples), or worse still thresholds (i.e. variants present in at least x% of the total samples); a sketch of both follows below. The larger your data set, the more likely this becomes a problem. Always make sure your programs are actually completing (NOT just erroring out), and remember there are multiple ways to make a program finish running correctly, albeit at a slower rate (read the documentation, post on forums, reach out to former instructors, etc.).
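One way the newer bedtools handles both questions in a single pass (a sketch, not part of this tutorial's required commands) is `bedtools multiinter`, which reports, for each interval, how many of the input files contain it. File names are again placeholders, and multiinter expects its inputs to be sorted by chromosome and start position:

```bash
# Scan all files at once; column 4 of the output is the number of
# input files containing each interval.
bedtools multiinter -i A.bed B.bed C.bed D.bed > counts.bed

# Convergence: keep intervals present in all 4 files.
awk '$4 == 4' counts.bed > common_all.bed

# Threshold: keep intervals present in at least 75% (3 of 4) of the files.
awk '$4 >= 3' counts.bed > atleast3.bed
```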
Based on what was just said, what do you think the first two things you should do are?
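As a hedged hint (assuming TACC's Lmod module system, which this course uses elsewhere; the exact module name and versions available are assumptions to confirm for yourself), checking which bedtools you are actually running, and switching to the newer build, might look like this:

```bash
# Check which bedtools is first on your PATH, and its version.
which bedtools
bedtools --version

# See what bedtools modules exist on Lonestar, then load one.
module spider bedtools
module load bedtools
```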