SV calling tutorial (GVA14)

Most approaches for predicting structural variants require you to have paired-end or mate-pair reads. They use the distribution of distances separating these reads to find outliers and also look at pairs with incorrect orientations.

Possible tools:

BreakDancer - hard to install prerequisites on TACC. Requires installing libgd and the notoriously difficult GD Perl module.
PEMer - hard to install prerequisites on TACC. Requires "ROOT" package.
SVDetect - good instructions, relatively hefty configuration files.

Good discussion of some of the issues of predicting structural variation:

http://www.genome.gov/Pages/Research/DER/1000GenomesProjectTutorials/StructuralVariants-JanKorbel.pdf

Optional: Install SVDetect

We have installed SVdetect for you under $BI/bin so it should be in your $PATH, so you can skip this part.

Install SVDetect scripts

Navigate to the SVDetect project page

More information:

Try to download the code yourself onto TACC.

Here's how...

wget http://sourceforge.net/projects/svdetect/files/SVDetect/0.70/SVDetect_r0.7m.tar.gz
tar -xvzf SVDetect_r0.7m.tar.gz
cd SVDetect_r0.7m

Move the Perl scripts and make them executable

cp bin/SVDetect $HOME/local/bin
chmod 775 scripts/BAM_preprocessingPairs.pl
cp scripts/BAM_preprocessingPairs.pl $HOME/local/bin

Install required Perl modules

SVdetect requires a few Perl modules to be installed. In the default TACC environment, you can use the cpan shell to install most well-behaved Perl modules (with the exception of some complicated ones that require other libraries to be installed or things to compile). Here's how:

Install Perl modules required for SVDetect

login1$ module load perl
login1$ cpan
...
cpan[4]> install Config::General
...
cpan[4]> install Tie::IxHash
...
cpan[4]> install Parallel::ForkManager
...
cpan[4]> quit
login1$

Example: E. coli genome with structural variation

Here's an E. coli genome re-sequencing sample where a key mutation producing a new structural variant was responsible for a new phenotype.

cds

cp -r $BI/gva_course/structural_variation/data structural_variation
cd structural_variation

This is Illumina mate-paired data (having a larger insert size than paired-end data) from genome re-sequencing of an E. coli clone.

File Name	Description	Sample
`61FTVAAXX_2_1.fastq`	Paired-end Illumina, First of mate-pair, FASTQ format	Re-sequenced E. coli genome
`61FTVAAXX_2_2.fastq`	Paired-end Illumina, Second of mate-pair, FASTQ format	Re-sequenced E. coli genome
`NC_012967.1.fasta`	Reference Genome in FASTA format	E. coli B strain REL606

Map data using BWA

You should submit the bwa aln and bwa sampe commands as jobs to the queue, one after the other.

bwa index NC_012967.1.fasta
bwa aln -f 61FTVAAXX_2_1.sai NC_012967.1.fasta 61FTVAAXX_2_1.fastq
bwa aln -f 61FTVAAXX_2_2.sai NC_012967.1.fasta 61FTVAAXX_2_2.fastq
bwa sampe -n 0 -N 100 -f 61FTVAAXX.sam NC_012967.1.fasta 61FTVAAXX_2_1.sai 61FTVAAXX_2_2.sai 61FTVAAXX_2_1.fastq 61FTVAAXX_2_2.fastq

Possibly unfamiliar options:

-n 0 tells bwa to report zero pairs for proper mates
-N 100 tells bwa to report at most 100 possible matches for mates with abnormal distances or orientations.

If you use bowtie (version 1) to do your mapping, you won't predict any read SVs. Why?

Answer....

bowtie doesn't map discordant pairs!

Run SVDetect

The first step is to look at all mapped read pairs and whittle down the list only to those that have an unusual insert sizes (distances between the two reads in a pair). You should submit this command to the TACC queue.

BAM_preprocessingPairs.pl -p 0 61FTVAAXX.sam

What is the normal insert size for this library? (Check stdout from the command.)

SVDetect demonstrates a common strategy in some programs with complex input where instead of including a lot of options on the command line, it reads in a simple text file that sets all of the required options.

Create a configuration file:

Create the file svdetect.conf with this text

<general>
input_format=sam
sv_type=all
mates_orientation=RF
read1_length=35
read2_length=35
mates_file=/full/path/to/61FTVAAXX.ab.sam
cmap_file=/full/path/to/NC_012967.1.lengths
num_threads=1
</general>

<detection>
split_mate_file=0
window_size=2000
step_length=1000
</detection>

<filtering>
split_link_file=0
nb_pairs_threshold=3
strand_filtering=1
</filtering>

<bed>
  <colorcode>
    255,0,0=1,4
    0,255,0=5,10
    0,0,255=11,100000
  </colorcode>
</bed>

You'll need to substitute your own paths for /full/path/to/61FTVAAXX.ab.sam and /full/path/to/NC_012967.1.lengths.

You also need to create a tab-delimited file of chromosome lengths.

File NC_012967.1.lengths

1<tab>NC_012967<tab>4629812

You'll want to submit the first two commands to the TACC queue. They take a while.

Consult the manual for a full description of what these commands and options are doing.

Commands to run SNVDetect

SVDetect linking -conf ./svdetect.conf
SVDetect filtering -conf ./svdetect.conf
SVDetect links2SV -conf ./svdetect.conf

Take a look at the resulting file: 61FTVAAXX.ab.sam.links.filtered.sv.txt.

We've highlighted a few lines below:

chr_type        SV_type BAL_type        chromosome1     start1-end1     average_dist    chromosome2     start2-end2     nb_pairs        score_strand_filtering  score_order_filtering   score_insert_size_filtering     final_score     breakpoint1_start1-end1 breakpoint2_start2-end2
...
INTRA   NORMAL_SENSE    -       chrNC_012967    599566-601002   -       chrNC_012967    662994-664998   1577    100%    -       -       1       -       -
...
INTRA   NORMAL_SENSE    -       chrNC_012967    3-2025  -       chrNC_012967    4627019-4628998 989     100%    -       -       1       -       -
...
INTRA   REVERSE_SENSE   -       chrNC_012967    16999-18903     -       chrNC_012967    2775979-2777838 873     100%    -       -       1       -       -

Any idea what sorts of mutations produced these three structural variants?

Answers...

1. This is a tandem head-to-tail duplication of the region from approximately 600000 to 663000.
2. This is just the origin of the circular chromosome, connecting it's end to the beginning!
3. This is a big chromosomal inversion mediated by recombination between repeated IS elements in the genome. It would not have been detected if the insert size of the library wasn't > ~1,500 bp!

Very, Very Advanced Exercise

SVDetect has a nice option to output a file that can be read by Circos to produce drawings.