Most approaches for predicting structural variants require you to have paired-end or mate-pair reads. They use the distribution of distances separating these reads to find outliers and also look at pairs with incorrect orientations.
Possible tools:
- BreakDancer - hard to install prerequisites on TACC. Requires installing libgd and the notoriously difficult GD Perl module.
- PEMer - hard to install prerequisites on TACC. Requires "ROOT" package.
- SVDetect - good instructions, relatively hefty configuration files.
Good discussion of some of the issues of predicting structural variation:
Table of Contents
Optional: Install SVDetect
We have installed SVdetect for you under $BI/bin
so it should be in your $PATH
, so you can skip this part.
Install SVDetect scripts
Navigate to the SVDetect project page
More information:
Try to download the code yourself onto TACC.
Move the Perl scripts and make them executable
cp bin/SVDetect $HOME/local/bin chmod 775 scripts/BAM_preprocessingPairs.pl cp scripts/BAM_preprocessingPairs.pl $HOME/local/bin
Install required Perl modules
SVdetect requires a few Perl modules to be installed. In the default TACC environment, you can use the cpan shell to install most well-behaved Perl modules (with the exception of some complicated ones that require other libraries to be installed or things to compile). Here's how:
login1$ module load perl login1$ cpan ... cpan[4]> install Config::General ... cpan[4]> install Tie::IxHash ... cpan[4]> install Parallel::ForkManager ... cpan[4]> quit login1$
Example: E. coli genome with structural variation
Here's an E. coli genome re-sequencing sample where a key mutation producing a new structural variant was responsible for a new phenotype.
cds cp -r $BI/gva_course/structural_variation/data structural_variation cd structural_variation
This is Illumina mate-paired data (having a larger insert size than paired-end data) from genome re-sequencing of an E. coli clone.
File Name | Description | Sample |
---|---|---|
| Paired-end Illumina, First of mate-pair, FASTQ format | Re-sequenced E. coli genome |
| Paired-end Illumina, Second of mate-pair, FASTQ format | Re-sequenced E. coli genome |
| Reference Genome in FASTA format | E. coli B strain REL606 |
Map data using BWA
You should submit the bwa aln
and bwa sampe
commands as jobs to the queue, one after the other.
bwa index NC_012967.1.fasta bwa aln -f 61FTVAAXX_2_1.sai NC_012967.1.fasta 61FTVAAXX_2_1.fastq bwa aln -f 61FTVAAXX_2_2.sai NC_012967.1.fasta 61FTVAAXX_2_2.fastq bwa sampe -n 0 -N 100 -f 61FTVAAXX.sam NC_012967.1.fasta 61FTVAAXX_2_1.sai 61FTVAAXX_2_2.sai 61FTVAAXX_2_1.fastq 61FTVAAXX_2_2.fastq
Possibly unfamiliar options:
-n 0
tells bwa to report zero pairs for proper mates-N 100
tells bwa to report at most 100 possible matches for mates with abnormal distances or orientations.
If you use bowtie (version 1) to do your mapping, you won't predict any read SVs. Why?
Run SVDetect
The first step is to look at all mapped read pairs and whittle down the list only to those that have an unusual insert sizes (distances between the two reads in a pair). You should submit this command to the TACC queue.
BAM_preprocessingPairs.pl -p 0 61FTVAAXX.sam
What is the normal insert size for this library? (Check stdout from the command.)
SVDetect demonstrates a common strategy in some programs with complex input where instead of including a lot of options on the command line, it reads in a simple text file that sets all of the required options.
Create a configuration file:
<general> input_format=sam sv_type=all mates_orientation=RF read1_length=35 read2_length=35 mates_file=/full/path/to/61FTVAAXX.ab.sam cmap_file=/full/path/to/NC_012967.1.lengths num_threads=1 </general> <detection> split_mate_file=0 window_size=2000 step_length=1000 </detection> <filtering> split_link_file=0 nb_pairs_threshold=3 strand_filtering=1 </filtering> <bed> <colorcode> 255,0,0=1,4 0,255,0=5,10 0,0,255=11,100000 </colorcode> </bed>
You'll need to substitute your own paths for /full/path/to/61FTVAAXX.ab.sam
and /full/path/to/NC_012967.1.lengths
.
You also need to create a tab-delimited file of chromosome lengths.
1<tab>NC_012967<tab>4629812
You'll want to submit the first two commands to the TACC queue. They take a while.
Consult the manual for a full description of what these commands and options are doing.
SVDetect linking -conf ./svdetect.conf SVDetect filtering -conf ./svdetect.conf SVDetect links2SV -conf ./svdetect.conf
Take a look at the resulting file: 61FTVAAXX.ab.sam.links.filtered.sv.txt
.
We've highlighted a few lines below:
chr_type SV_type BAL_type chromosome1 start1-end1 average_dist chromosome2 start2-end2 nb_pairs score_strand_filtering score_order_filtering score_insert_size_filtering final_score breakpoint1_start1-end1 breakpoint2_start2-end2 ... INTRA NORMAL_SENSE - chrNC_012967 599566-601002 - chrNC_012967 662994-664998 1577 100% - - 1 - - ... INTRA NORMAL_SENSE - chrNC_012967 3-2025 - chrNC_012967 4627019-4628998 989 100% - - 1 - - ... INTRA REVERSE_SENSE - chrNC_012967 16999-18903 - chrNC_012967 2775979-2777838 873 100% - - 1 - -
Any idea what sorts of mutations produced these three structural variants?
Very, Very Advanced Exercise
- SVDetect has a nice option to output a file that can be read by Circos to produce drawings.