Structural Variant (SV) calling with SVDetect 2023
- 1 Overview
- 2 Learning objectives:
- 3 Calling SV with SVDetect:
- 3.1 Prepare your directories
- 3.2 Map data using bowtie2
- 3.3 Install SVDetect
- 3.3.1 conda installation
- 3.3.1.1 We can now activate our new environment
- 3.3.1.2 Using the "which -a" command shows us we actually have access to multiple different cpan executables. Friday's class will discuss more about how you can end up with multiple executable files named the same thing stored in different directories, and how the command line will treat them and why this can cause problems
- 3.3.1.3 Install Perl modules required for SVDetect. (I do not know why the words do and for are appearing in bold, they are not meant as some kind of hint).
- 3.3.1.4 relaunch cpan
- 3.3.1.5 On the cpan prompt
- 3.4 Analyze read mapping distribution
- 3.5 Running SVDetect
- 3.5.1 Using nano, create the file svdetect.conf with this text
- 3.5.2 OR take advantage of this one liner which captures the line listed above, pulls the 2 numbers to a pair of variables, and then replaces unknown1 and unknown2 with the correct values in the existing file
- 3.5.3 Commands to run SVDetect
Overview
The information in this tutorial deals ONLY with SHORT reads. Additional tutorials will be available later in the week for dealing with long-read sequencing from Oxford Nanopore, and the discussion of why long-read sequencing is far superior to short-read sequencing for identifying structural variants will be presented then. Here we focus on calling such variants from short-read data because the vast majority of existing data is short reads, short reads are typically cheaper to produce, and there is no reason to ignore the SVs that can be detected in such data even if there is a false discovery rate associated with such studies.
Most approaches for predicting structural variants require paired-end or mate-pair reads. They use the distribution of distances separating the reads in each pair to find outliers, and they also look for pairs with incorrect orientations. As mentioned during several of the presentations, many researchers choose to ignore these types of mutations. Combined with the increased difficulty of identifying them accurately, this means the community is less settled on the "best" way to analyze them. Here we present a tutorial on a somewhat older program, SVDetect. SVDetect uses configuration files rather than command line options (something you may encounter with other programs in your own work).
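To make the idea of "outlier distances and incorrect orientations" concrete, here is a minimal sketch that scans SAM records and reports pairs that look suspicious. It is not how SVDetect works internally; the function name `flag_odd_pairs` and its two threshold arguments are hypothetical, and the only things it relies on are the standard SAM columns (FLAG in column 2, RNEXT in column 7, TLEN in column 9).

```shell
# Hypothetical helper (not part of SVDetect): flag_odd_pairs MIN MAX < reads.sam
# Reports read pairs whose mates map to different references, map to the
# same strand, or have an insert size outside the MIN..MAX range.
flag_odd_pairs() {
  local min_len="$1" max_len="$2"
  awk -F'\t' -v min="$min_len" -v max="$max_len" '
    # Test one bit of the SAM FLAG using plain arithmetic (works in any awk).
    function bit(flag, value) { return int(flag / value) % 2 }
    /^@/ { next }                                   # skip SAM header lines
    {
      if (!bit($2, 1) || !bit($2, 64)) next         # first read of each pair only
      if (bit($2, 4) || bit($2, 8)) next            # skip pairs with an unmapped read
      if ($7 != "=") { print $1 "\tdifferent_reference"; next }
      tlen = ($9 < 0) ? -$9 : $9                    # absolute insert size
      if (bit($2, 16) == bit($2, 32)) print $1 "\tsame_strand"
      else if (tlen < min || tlen > max) print $1 "\tinsert_size_" tlen
    }'
}
```

Once you have a SAM file, something like `flag_odd_pairs 1000 5000 < 61FTVAAXX.sam | head` would list a few candidate pairs. The thresholds here are placeholders, not recommended values; choosing them properly is what the read mapping distribution step later in the tutorial is for.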
Other possible tools:
BreakDancer - hard to install prerequisites on TACC. Requires installing libgd and the notoriously difficult GD Perl module.
PEMer - hard to install prerequisites on TACC. Requires "ROOT" package.
Good discussion of some of the issues of predicting structural variation:
Comparison of many different SV tools
Friday's class will deal with how to identify a tool other than SVDetect that may be more appropriate for SV calling in short read data.
Learning objectives:
Identify structural variants in a new data set.
Work with a new type of program that uses configuration files rather than entering all information on a single command at the command line. This is similar to the queue system TACC uses which will be discussed on Friday.
Calling SV with SVDetect:
Here we'll look at an E. coli genome re-sequencing sample where a key mutation producing a new structural variant was responsible for a new phenotype involving citrate, something the Barrick lab has studied.
Prepare your directories
Suggested directory setup. Note that the copy command must be run while on the head node, not an idev node.
cds
cp -r $BI/gva_course/structural_variation/data GVA_sv_tutorial
cd GVA_sv_tutorial
This is Illumina mate-paired data (having a larger insert size than paired-end data) from genome re-sequencing of an E. coli clone.
| File Name | Description | Sample |
|---|---|---|
| 61FTVAAXX_2_1.fastq | Paired-end Illumina, First of mate-pair, FASTQ format | Re-sequenced E. coli genome |
| 61FTVAAXX_2_2.fastq | Paired-end Illumina, Second of mate-pair, FASTQ format | Re-sequenced E. coli genome |
| NC_012967.1.fasta | Reference Genome in FASTA format | E. coli B strain REL606 |
| NC_012967.1.lengths | Simple tab-delimited file based on the size of the reference, needed for SVDetect and provided so you don't have to create it yourself | |
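For your own projects you will not be handed a lengths file, so here is a rough sketch of how one could be built from a reference FASTA. The function name `fasta_lengths` is made up for this example, and the three-column layout it prints (index, sequence name, length) is an assumption about what SVDetect expects; compare its output with the provided NC_012967.1.lengths before relying on it.

```shell
# Hypothetical helper: fasta_lengths reference.fasta > reference.lengths
# Prints one tab-delimited line per sequence: index, name, length.
# ASSUMPTION: this column layout matches the lengths file SVDetect wants.
fasta_lengths() {
  awk '
    /^>/ {
      # A new header: report the previous sequence, then start counting again.
      if (name != "") printf "%d\t%s\t%d\n", ++n, name, len
      name = substr($1, 2)          # first word of the header, without ">"
      len = 0
      next
    }
    {
      gsub(/[ \t\r]/, "")           # ignore stray whitespace and Windows line endings
      len += length($0)
    }
    END { if (name != "") printf "%d\t%s\t%d\n", ++n, name, len }
  ' "$@"
}
```

Running `fasta_lengths NC_012967.1.fasta` and comparing the result with `cat NC_012967.1.lengths` is a quick way to see whether the assumed layout holds for this course's data.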
Map data using bowtie2
First we need to (surprise!) map the data. This will hopefully reinforce the bowtie2 tutorial you just completed.
Do not run on head node
Use hostname to verify you are still on the idev node.
If not, and you need help getting a new idev node, see this tutorial.
conda activate GVA-bowtie2-mapping
bowtie2-build NC_012967.1.fasta NC_012967.1
bowtie2 -t -p 48 -X 5000 --rf -x NC_012967.1 -1 61FTVAAXX_2_1.fastq -2 61FTVAAXX_2_2.fastq -S 61FTVAAXX.sam
New options not used in the mapping tutorial:
- --rf tells bowtie2 that your read pairs are in the "reverse-forward" orientation of a mate-pair library
- -X 5000 tells bowtie2 to not mark read pairs as discordant unless their insert size is greater than 5000 bases.
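If you want a quick look at how these two options played out in the mapping, the sketch below tallies read pairs by the "properly paired" bit that the aligner sets in the SAM FLAG column. The function name `pair_summary` is hypothetical, and "not_concordant" here simply means both mates mapped but the pair was not flagged as proper, which is broader than bowtie2's own definition of a discordant alignment.

```shell
# Hypothetical helper: pair_summary 61FTVAAXX.sam
# Counts read pairs (using the first read of each pair) by pairing status.
pair_summary() {
  awk -F'\t' '
    # Test one bit of the SAM FLAG using plain arithmetic (works in any awk).
    function bit(flag, value) { return int(flag / value) % 2 }
    /^@/ { next }                                   # skip SAM header lines
    bit($2, 1) && bit($2, 64) {                     # paired, first in pair
      if (bit($2, 4) || bit($2, 8)) unmapped++      # this read or its mate unmapped
      else if (bit($2, 2)) concordant++             # flagged as a proper pair
      else not_concordant++                         # both mapped, not a proper pair
    }
    END {
      printf "concordant\t%d\nnot_concordant\t%d\nunmapped_mate\t%d\n",
             concordant, not_concordant, unmapped
    }' "$@"
}
```

The not_concordant pairs are the interesting ones for structural variant calling, since they are the pairs whose spacing or orientation did not match what `--rf` and `-X 5000` told bowtie2 to expect.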
You may notice that these commands complete pretty quickly. Remember that speed is not necessarily representative of how taxing something is for TACC's head node; always try to be a good TACC citizen and do as much as you can on idev nodes or as job submissions.
Install SVDetect
This will be the most complicated installation yet. In addition to needing to install several different programs in the same conda installation command, we will need to install Perl modules through cpan. Unfortunately, the cpan network cannot be accessed from the compute nodes, so you must log out of your idev session using the logout command before continuing. If you are unsure whether you are in an idev session, remember you can use the hostname command to check.
conda installation
Like we saw in our samtools installation, we will need to install several programs at the same time to make sure they are all going to work with each other. In addition, we are going to create a new environment for working with SVDetect as some of the dependencies of SVDetect clash with those of samtools.
conda create --name GVA-SV -c conda-forge -c bioconda -c imperial-college-research-computing _libgcc_mutex perl libgcc-ng svdetect
cpan module installations
Again, make sure you are NOT on an idev node for working with cpan
We can now activate our new environment
conda activate GVA-SV
cpan
When you attempt to launch cpan, you may get a message similar to the following:
/home1/0004/train402/miniconda3/envs/svdetect/bin/perl: symbol lookup error: /home1/apps/bioperl/1.007002/lib/perl5/x86_64-linux-thread-multi/auto/version/vxs/vxs.so: undefined symbol: Perl_xs_apiversion_bootcheck
In the following block, note that each ellipsis stands for large blocks of scrolling text as different modules are downloaded and installed. The process will take several minutes in total; just be ready to execute the next command when you get the cpan prompt back.
Install Perl modules required for SVDetect. (I do not know why the words do and for are appearing in bold, they are not meant as some kind of hint).
# choose 'yes' to do as much automatically as possible
# choose 'local::lib' for the approach you want (as you don't have admin rights on TACC)
# choose 'yes' to automatically choose some CPAN mirror sites for you
...
# choose 'yes' to append the information to your .bashrc file
...
cpan[1]> install Config::General
...
cpan[2]> install Tie::IxHash
...
cpan[3]> install Parallel::ForkManager
...
cpan[4]> quit
How to fix cpan if downloads give errors
Several students have had trouble with the cpan downloads in the past that seem to be related to some kind of interruption in the initial download process. The following commands have solved the issue for at least 1 student. Please try the following if you were unable to get the above cpan downloads to work, and let me know if you continue to experience difficulties.
The above solution is based on steps 4-7 of this page. Again, this installation is known to be difficult; if you are having problems, please let me know.