Bedtools tutorial -- GVA2019

Introduction

Throughout the course we have focused on small data sets and a limited number of samples, and in some cases have even purposefully capped the total number of reads you have access to. This was done to save time and to let you watch the results tick by, rather than having you come in for 30 minutes, submit a job, wait an hour (or 6) for it to start running, and then wait another 10 hours for it to finish. The reality is that while you will sometimes work with a test sample or a small pilot project, Big Data in Biology means LOTS of data, and lots of data means needing not just to identify variants in 1 sample but to identify commonality across different systems. Here we introduce you to bedtools, a program designed to make comparisons across different file types generated from different samples or using different parameters of a given pipeline.

A note on version control

As mentioned in our samtools tutorial, different versions of software behave differently. Once again we have a situation where the BioITeam bedtools is available by default (version 2.20.1), and a different version is available on Lonestar (version 2.25.0). A KEY addition in version 2.21 (the version immediately after the one available to you by default through the BioITeam as of this writing) was the ability of bedtools to scan multiple files at once rather than having to scan pairs of files sequentially. Using the old version, to identify the common variants of files A, B, C and D, bedtools would have to be invoked a minimum of 3 times:

  1. A & B = E
  2. C & D = F
  3. E & F = G

More broadly, it would have to be invoked a minimum of (number of samples - 1) times. So as you start adding more and more samples, the commands get more and more difficult to write. Hopefully after yesterday's tutorials, the sentence "you could write a wrapper to handle this for you" makes sense; if it doesn't, you should ask a question, even if the question is "I don't know what question to ask". With the new version, however, the wrapper is no longer necessary, but the computational requirements increase substantially. This is especially true when looking for convergence (i.e. variants common across all samples), or worse still thresholds (i.e. variants present in at least x% of the total samples). The larger your data set, the more likely this is to be a problem. Always make sure your programs are actually completing (NOT just erroring out), and remember there are multiple ways to get a program to finish running correctly, albeit at a slower rate (read the documentation, post on forums, reach out to former instructors, etc.).
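To make the "wrapper" idea concrete, here is a minimal dry-run sketch. It only prints the pairwise commands an old-version workflow would need, so you can see that N files lead to N - 1 invocations. The function name pairwise_chain, the file names A.bed through D.bed, and the stepN.bed intermediates are all hypothetical, and it chains left to right rather than pairing files as in the A/B/C/D list above (the invocation count is the same). The exact intersect options you need will depend on your analysis.

```shell
# Dry-run sketch: print (do not run) a left-to-right chain of pairwise
# "bedtools intersect" commands. All file names here are hypothetical.
pairwise_chain() {
    current="$1"   # start with the first file
    shift
    step=1
    for next in "$@"; do
        out="step${step}.bed"   # hypothetical intermediate file name
        echo "bedtools intersect -a ${current} -b ${next} > ${out}"
        current="$out"          # the next round compares against this result
        step=$((step + 1))
    done
}

# 4 input files produce 3 commands (number of samples - 1)
pairwise_chain A.bed B.bed C.bed D.bed
```

Once the printed commands look right, one option is to redirect them into a file and submit that as your commands file, rather than typing each line by hand.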


Based on what was just said, what do you think the first 2 things you should do are?

 Need a hint?
  1. Different versions exist
  2. Can be very computationally intensive
 Think you know? Check your guess...

Because different versions exist, you need to make sure that you are using the right version:

Check that you remember how to see whether you have access to a command, find out what version it is, load a different version, and verify that the new version is the one being used
which bedtools  # check if you have access to bedtools
bedtools --version  # determine what version of bedtools you are using
 
# This will tell you that the bedtools you are using is in the $BI/bin directory and that you are using version v2.20.1-2-g6f0be00
 
module load bedtools  # load the TACC version of bedtools; remember that module spider, module avail, and module list can help you find exactly which module you are looking for
which bedtools
bedtools --version
 
# This will now tell you that you are using version v2.25.0, which is located in the /opt/apps/bedtools/2.25.0/bin directory.
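If you would rather have a script do this check than eyeball the version string, something along these lines could work. It is only a sketch: the function name supports_multi_file is made up for illustration, and the 2.21 cutoff is taken from the note on version control above. It takes the version string as an argument, so you can try it on the two version strings reported in this tutorial without having either version loaded.

```shell
# Sketch: succeed (exit status 0) if a bedtools version string is 2.21 or newer.
# supports_multi_file is a hypothetical helper name, not part of bedtools.
supports_multi_file() {
    ver="${1#v}"           # strip a leading "v":  v2.25.0 -> 2.25.0
    major="${ver%%.*}"     # text before the first dot:  2
    rest="${ver#*.}"       # text after the first dot:   25.0
    minor="${rest%%.*}"    # text before the next dot:   25
    [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 21 ]; }
}

# Try it on the two version strings seen in this tutorial
for v in v2.20.1-2-g6f0be00 v2.25.0; do
    if supports_multi_file "$v"; then
        echo "$v: can scan multiple files at once"
    else
        echo "$v: pairwise comparisons only"
    fi
done
```

In a real script you would pass in whatever version your own bedtools --version reports instead of the hard-coded strings.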


Becau