...

Most approaches for predicting structural variants require you to have paired-end or mate-pair reads. They use the distribution of distances separating these reads to find outliers and also look at pairs with incorrect orientations. Possible As mentioned during several of the presentations, many researchers choose to ignore these types of mutations and combined with the increased difficulty of accurately identifying them, the community is less settled on the "best" way to analyze them. Here we present a tutorial on SVDetect based on the quality of its instructions, and easy of installation despite its use of relatively hefty configuration files.

Other possible tools:

BreakDancer - hard to install prerequisites on TACC. Requires installing libgd and the notoriously difficult GD Perl module.
PEMer - hard to install prerequisites on TACC. Requires "ROOT" package.
SVDetect - good instructions, relatively hefty configuration files.

Good discussion of some of the issues of predicting structural variation:

...

Code Block

language	bash
title	suggested directory set up

cds
cp -r $BI/gva_course/structural_variation/data BDIB_sv_tutorial
cd BDIB_sv_tutorial

This is Illumina mate-paired data (having a larger insert size than paired-end data) from genome re-sequencing of an E. coli clone.

...

First we need to (surprise!) map the data. We'll use bowtie2 for this tutorial, but BWA would also work with the proper optionsThis will hopefully reinforce the bowtie2 tutorial you just completed, but if you are feeling adventurous you could use BWA as optional reinforcement.

Warning

title	Do not run on head node

Many of the commands past this point are computationally intensive. You should run them through an idev shell or by qsub. We recommend idev for the tutorial, but you could also qsub any command that takes more than a few seconds to complete.Make sure you are on an idev node using the command: showq -u

idev

Code Block

title	Example command to start an idev shell

If you need a new idev node
collapse	true

idev  -m 60 -q developmentr CCBB_Bio_Summer_School_2016_Day2 -A "UT-2015-05-18" -N 2 -n 8

Code Block
bowtie2-build NC_012967.1.fasta NC_012967.1 bowtie2 -p 1248 -X 5000 --rf -x NC_012967.1 -1 61FTVAAXX_2_1.fastq -2 61FTVAAXX_2_2.fastq -S 61FTVAAXX.sam

...

--rf tells bowtie2 that your read pairs are in the "reverse-forward" orientation of a mate-pair library
-X 5000 tells bowtie2 to not mark read pairs as discordant unless their insert size is greater than 5000 bases.

You may notice that these commands complete pretty quickly. Always remember speed is not necessarily representative of how taxing something is for TACC's head node, and always try to be a good TACC citizen and do as much as you can on idev nodes or as job submissions

Expand

title	If you use version 1 of bowtie to do your mapping (or several other mapping programs), you won't predict any read SVs. Why?

bowtie doesn't output discordantly mapped pairs!

...

The first step is to look at all mapped read pairs and whittle down the list only to those that have an unusual insert sizes (distances between the two reads in a pair). You should submit this command to the TACC queue.

Code Block
BAM_preprocessingPairs.pl -p 0 61FTVAAXX.sam

As we discussed in our earlier presentation, SV are often detected by looking for variations in library insert sizes. The stdout of the pearl script will answer the questionquestions:

What is the normal insert size for this library?
What percent of reads appear to be associated with potential structural variants?
What percent of reads suggest potential novel sequence insertions or contamination with other sources of DNA?
How can you use the above answers to make characterizations about new sequencing data, or use this type of analysis to make implications about your overall sequencing?

Expand

title	Answers from the std output

-- using -1142.566-5588.410 as normal range of insert size
Approximately 20% based on:
-- 994952 mapped pairs
---- 195705 abnormal mapped pairs
Approximately 0.5% based on:
-- Total : 1000000 pairs analysed
-- 5048 pairs whose one or both reads are unmapped
Consider the following:
1. The first answer can tell you what type of library it is if you did not know ahead of time (remember paired end reads have ~500-700bp inserts on average not 1000s of bp)
2. This should help underscore that a significant portion of your total reads (and thus variation) may be in structural variants. Unfortunately, this does require generating a mate-pair library to learn.
3. Low levels of unmapped read pairs suggests that both the reference is accuate, of a high quality, and free of contamination.
  1. Note that contamination in this case refers only to other organisms, not other samples.

SVDetect demonstrates a common strategy in some programs with complex input where instead of including a lot of options on the command line, it reads in a simple text file that sets all of the required options. Lets look at how to create a configuration file:

Note
You'll need to substitute your own paths for the output of the pwd command on lines 7 and 8 below in front of "/full/path/to/" , but you will need to leave "61FTVAAXX.ab.sam and /full/path/to/" and "NC_012967.1.lengths on lines 7 and 8 below".

Code Block

title	Create the file svdetect.conf with this text
linenumbers	true

<general>
input_format=sam
sv_type=all
mates_orientation=RF
read1_length=35
read2_length=35
mates_file=/full/path/to/61FTVAAXX.ab.sam
cmap_file=/full/path/to/NC_012967.1.lengths
num_threads=148
</general>

<detection>
split_mate_file=0
window_size=2000
step_length=1000
</detection>

<filtering>
split_link_file=0
nb_pairs_threshold=3
strand_filtering=1
</filtering>

<bed>
  <colorcode>
    255,0,0=1,4
    0,255,0=5,10
    0,0,255=11,100000
  </colorcode>
</bed>

...

You also need to create a tab-delimited file of chromosome lengths. Use the tab key rather than writing out <tab>!lengths named NC_012967.1.lengths. YOU CAN NOT COPY PASTE THIS COMMAND!

Code Block

title	File NC_012967.1.lengths

1<tab>NC_012967<tab>4629812

...

  # Use the tab key rather than writing out <tab>!!

The following commands will take a while and must be completed in order, so no advantages/ability to have them run in the background. Consult the manual for a full description of what these commands and options are doing while the commands are running.

...

Take a look at the resulting file: 61FTVAAXX.ab.sam.links.filtered.sv.txt. Another downside of command line applications is that while you can print files to the screen, the formatting is not always the nicest. On the plus side in 95% of cases, you can directly copy the output from the terminal window to excel and make better sense of what the columns actually are

We've highlighted a few lines below:

...

Expand

title	Any idea what sorts of mutations produced these three structural variants?

1. This is a tandem head-to-tail duplication of the region from approximately 600000 to 663000.
2. This is just the origin of the circular chromosome, connecting its end to the beginning!
3. This is a big chromosomal inversion mediated by recombination between repeated IS elements in the genome. It would not have been detected if the insert size of the library wasn't > ~1,500 bp!

... Many of the others are due to new insertions of transposable elements.

Very, Very Advanced Exercise

...

.

Expand

title

...

Code Block

title	This is untested...

<circos>
organism_id=fish
  <colorcode>
    255,0,0=1,4
    0,255,0=5,10
    0,0,255=11,100000
  </colorcode>
</circos>

Here's how..

.

Expand

title	click here for installation instructions

Optional: Install SVDetect

We have installed SVdetect for you already as installation is a bit difficult . It should be (though still much easier than the alternatives listed in the introduction). You can verify it's location using which SVDetect in your $PATH under $BI/bin. One of the advantages (or disadvantages) of using the communal resource is that someone else can update all the necessary programs and packages for you. Alternatively, you can make a personal copy of the program yourself using the following commands. NOTE that

Install SVDetect scripts

Navigate to the SVDetect project page

More information:

Try to download Download the code yourself onto TACC.

Expand

Here's how...

Code Block

wget

http

https://sourceforge.net/projects/svdetect/files/

SVDetect/0.70/SVDetect_r0.7m.tar.gz

latest/download
tar -xvzf SVDetect_

r0

r*.

7m.

tar.gz
cd SVDetect_

r0.7m

r*

Move the Perl scripts and make them executable

Code Block
cp bin/SVDetect $HOME/local/bin chmod 775 scripts/BAM_preprocessingPairs.pl cp scripts/BAM_preprocessingPairs.pl $HOME/local/bin

Install required Perl modules

SVdetect requires a few Perl modules to be installed. In the default TACC environment, you can use the cpan shell to install most well-behaved Perl modules (with the exception of some complicated ones that require other libraries to be installed or things to compile). Here's how:

Code Block

title	Install Perl modules required for SVDetect

login1$ module load perl
login1$ cpan
...
cpan[4]> install Config::General
...
cpan[4]> install Tie::IxHash
...
cpan[4]> install Parallel::ForkManager
...
cpan[4]> quit
login1$

Versions Compared

Old Version 1

New Version Current

Key

Very, Very Advanced Exercise

Optional: Install SVDetect

Install SVDetect scripts

Install required Perl modules

Page Comparison

Versions Compared

Old Version 1

New Version Current

Key

Very, Very Advanced Exercise

Optional: Install SVDetect

Install SVDetect scripts

Install required Perl modules