SV calling with SVdetect

SV calling with SVdetect

Overview

Most approaches for predicting structural variants require you to have paired-end or mate-pair reads. They use the distribution of distances separating these reads to find outliers and also look at pairs with incorrect orientations.

Possible tools:

  • BreakDancer - hard to install prerequisites on TACC. Requires installing libgd and the notoriously difficult GD Perl module.

  • PEMer - hard to install prerequisites on TACC. Requires "ROOT" package.

  • SVDetect - good instructions, relatively hefty configuration files.

Good discussion of some of the issues of predicting structural variation:

Example: E. coli genome with structural variation

Here's an E. coli genome re-sequencing sample where a key mutation producing a new structural variant was responsible for a new phenotype.

suggested directory set up
cds cp -r $BI/gva_course/structural_variation/data sv_tutorial cd sv_tutorial

This is Illumina mate-paired data (having a larger insert size than paired-end data) from genome re-sequencing of an E. coli clone.

File Name

Description

Sample

File Name

Description

Sample

61FTVAAXX_2_1.fastq

Paired-end Illumina, First of mate-pair, FASTQ format

Re-sequenced E. coli genome

61FTVAAXX_2_2.fastq

Paired-end Illumina, Second of mate-pair, FASTQ format

Re-sequenced E. coli genome

NC_012967.1.fasta

Reference Genome in FASTA format

E. coli B strain REL606

Map data using bowtie2

First we need to (surprise!) map the data. We'll use bowtie2 for this tutorial, but BWA would also work with the proper options.

Do not run on head node

Many of the commands past this point are computationally intensive. You should run them through an idev shell or by qsub. We recommend idev for the tutorial, but you could also qsub any command that takes more than a few seconds to complete.

Example command to start an idev shell
idev -m 60 -q development -A "UT-2015-05-18"
bowtie2-build NC_012967.1.fasta NC_012967.1 bowtie2 -p 12 -X 5000 --rf -x NC_012967.1 -1 61FTVAAXX_2_1.fastq -2 61FTVAAXX_2_2.fastq -S 61FTVAAXX.sam

Possibly unfamiliar options:

  • --rf tells bowtie2 that your read pairs are in the "reverse-forward" orientation of a mate-pair library

  • -X 5000 tells bowtie2 to not mark read pairs as discordant unless their insert size is greater than 5000 bases.

 

bowtie doesn't output discordantly mapped pairs!

Run SVDetect

The first step is to look at all mapped read pairs and whittle down the list only to those that have an unusual insert sizes (distances between the two reads in a pair). You should submit this command to the TACC queue.

BAM_preprocessingPairs.pl -p 0 61FTVAAXX.sam

As we discussed in our earlier presentation, SV are often detected by looking for variations in library insert sizes. The stdout of the pearl script will answer the question: What is the normal insert size for this library? 

SVDetect demonstrates a common strategy in some programs with complex input where instead of including a lot of options on the command line, it reads in a simple text file that sets all of the required options. Lets look at how to create a configuration file:

You'll need to substitute your own paths for /full/path/to/61FTVAAXX.ab.sam and /full/path/to/NC_012967.1.lengths on lines 7 and 8 below.

Create the file svdetect.conf with this text
<general> input_format=sam sv_type=all mates_orientation=RF read1_length=35 read2_length=35 mates_file=/full/path/to/61FTVAAXX.ab.sam cmap_file=/full/path/to/NC_012967.1.lengths num_threads=1 </general> <detection> split_mate_file=0 window_size=2000 step_length=1000 </detection> <filtering> split_link_file=0 nb_pairs_threshold=3 strand_filtering=1 </filtering> <bed> <colorcode> 255,0,0=1,4 0,255,0=5,10 0,0,255=11,100000 </colorcode> </bed>

 

You also need to create a tab-delimited file of chromosome lengths. Use the tab key rather than writing out <tab>!!

File NC_012967.1.lengths
1<tab>NC_012967<tab>4629812

You'll want to submit the first two commands to the TACC queue or do them in an idev shell. They take a while. Consult the manual for a full description of what these commands and options are doing while the commands are running.

Commands to run SNVDetect
SVDetect linking -conf svdetect.conf SVDetect filtering -conf svdetect.conf SVDetect links2SV -conf svdetect.conf

Take a look at the resulting file: 61FTVAAXX.ab.sam.links.filtered.sv.txt.

We've highlighted a few lines below:

chr_type SV_type BAL_type chromosome1 start1-end1 average_dist chromosome2 start2-end2 nb_pairs score_strand_filtering score_order_filtering score_insert_size_filtering final_score breakpoint1_start1-end1 breakpoint2_start2-end2 ... INTRA NORMAL_SENSE - chrNC_012967 599566-601025 - chrNC_012967 663036-664898 430 100% - - 1 - - ... INTRA NORMAL_SENSE - chrNC_012967 3-2025 - chrNC_012967 4627019-4628998 288 100% - - 1 - - ... INTRA REVERSE_SENSE - chrNC_012967 16999-19033 - chrNC_012967 2775082-2777014 274 100% - - 1 - -

1. This is a tandem head-to-tail duplication of the region from approximately 600000 to 663000.
2. This is just the origin of the circular chromosome, connecting its end to the beginning!
3. This is a big chromosomal inversion mediated by recombination between repeated IS elements in the genome. It would not have been detected if the insert size of the library wasn't > ~1,500 bp!

... Many of the others are due to new insertions of transposable elements.

Very, Very Advanced Exercise

  • SVDetect has a nice option to output a file that can be read by Circos to produce drawings.

Optional: Install SVDetect

We have installed SVdetect for you already as installation is a bit difficult. It should be in your $PATH under $BI/bin.

Install SVDetect scripts

Navigate to the SVDetect project page

More information:

Try to download the code yourself onto TACC.

wget http://sourceforge.net/projects/svdetect/files/SVDetect/0.70/SVDetect_r0.7m.tar.gz tar -xvzf SVDetect_r0.7m.tar.gz cd SVDetect_r0.7m

Move the Perl scripts and make them executable

cp bin/SVDetect $HOME/local/bin chmod 775 scripts/BAM_preprocessingPairs.pl cp scripts/BAM_preprocessingPairs.pl $HOME/local/bin

Install required Perl modules

SVdetect requires a few Perl modules to be installed. In the default TACC environment, you can use the cpan shell to install most well-behaved Perl modules (with the exception of some complicated ones that require other libraries to be installed or things to compile). Here's how:

Install Perl modules required for SVDetect
login1$ module load perl login1$ cpan ... cpan[4]> install Config::General ... cpan[4]> install Tie::IxHash ... cpan[4]> install Parallel::ForkManager ... cpan[4]> quit login1$