Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Velvet is a De Bruijn graph assembler works fairly rapidly on short (microbial) genomes. In this tutorial we will use velvet to assemble an E. coli genome from simulated Illumina reads. Unfortunatly, at the time of this class, velvet is not available on lonestar5. While we hope and think that this may just be because lonestar5 is very new still and it just hasn't happened yet, there is obviously no guarantee that it will come online as a module or a timeframe for when that may happen. Luckily (but somewhat annoyingly) it is available as a module on stampede. As we have discussed in class, stampede compute nodes are not able to access the BioITeam repositories so you will have to make sure you have copied your files to the locations you want them before you enter an idev node.

Learning Objectives

  • Run velvet to perform de novo assembly on fragment, paired-end, and mate-paired data.
  • Use contig_stats.pl to display assembly statistics.
  • Find proteins of interest in an assembly using Blast.

...

Table of Contents

Data

First log into stampede using the same log in credentials you have been using for lonestar5. Next, let's copy the fastq read files.

 

Code Block
titleMove to scratch, copy the raw data, and change into this directory for the tutorial
cds
mkdir velvet_tutorial
cp $BI/ngs_course/velvet/data/*/* velvet_tutorial
cp $BI/bin/contig_stats.pl .  # because we are on stampede not lonestar5 we need to copy this file before we start an idev session
cd velvet_tutorial

Now we have a bunch of Illumina reads. These are simulated reads. If you'd ever like to simulate some on your own, you might try using Mason.

...

Now let's use Velvet to assemble the reads.

First, you will need to load Velvet via get an idev node and load the velvet module.

Expand
I need to see the command...
I need to see the command...

Note that since we are on stampede rather than ls5, our reservation will not work. Therefore to get an idev node just type "idev"

Code Block
titleLoad the velvet module
 module load velvet

If this didn't work, make sure you are on stampede not lonestar5 then ask for help

Using velvet consists of a sequence of two commands:

...

We'll need to create a commands file and submit it to TACC. Let's make the commands file say:

Code Block
languagebash
titleFor a "commands" file - to run four velvet assemblies in parallel. Alternatively, run 1 line at a time on an idev node. If you copy and paste, be sure that there are ONLY four lines in your file.
velveth single_out 61 -fastq single_end_100_c_50.fastq && velvetg single_out -exp_cov auto -amos_file yes
velveth pairedc20_out 61 -fastq -shortPaired paired_end_2x100_ins_3000_c_20.fastq paired_end_2x100_ins_1500_c_20.fastq paired_end_2x100_ins_400_c_20.fastq && velvetg pairedc20_out -exp_cov auto -amos_file yes
velveth pairedc25_out 61 -fastq -shortPaired paired_end_2x100_ins_3000_c_25.fastq paired_end_2x100_ins_400_c_25.fastq && velvetg pairedc25_out -exp_cov auto -amos_file yes
velveth pairedc50_out 61 -fastq -shortPaired paired_end_2x100_ins_400_c_50.fastq && velvetg pairedc50_out -exp_cov auto -amos_file yes

Now use launcher_creator.py to make a *.sge for the commands file and qsub it.

Warning
titleUse the correct "wayness"

Velvet and other assemblers usually take large amounts of RAM to complete. Running these 4 commands on a single node will use more RAM than is available on a single node so it's necessary to change the number of commands per node (wayness) from the default of 12 to 1. When you use launcher_creator.py, you set the "wayness" (number of commands per node) using the -w flag.

You can set the allotted time for this job to just 10 minutes.

launcher_creator.py -w 1 -n velvet -q normal -j commands -t 00:10:00

This will make more sense after we do the job submission tutorial. Similarly, the amount of RAM necessary is why we run them sequentially on a single idev node rather than in parallel. This should also underscore to you that you should not run this on the head node.

Expand
I need some help with my launcher_creator.py commandI need some help with my launcher_creator.py command
Code Block

If you are assembling large genomes or have high coverage depth data in the future, you will probably need to submit your jobs to the "largemem" queue.If you find yourself waiting a long time for the assembly process to run, you can also start an idev session and try running some of the velveth and velvetg commands interactively. Each one takes a few minutes to complete.

Velvet Output

Explore each output directory that was created by Velvet.

...

You can generate some summary stats and graphs about each assembly using the contig_stats.pl script that we have installed under copied from  $BI/bin. Try figuring out how to do this on your own. bin before luanching the idev session. You probably want to change into the directory of results for a specific assembly before running this command, since it generates several output files in the current working directory. You  Since you are on $SCRATCH and in a directory that is not in your path simply typing the command will not work. As such you need to explicitly tell the command line to launch the file like you have done with "commands" files in other tutorials.  You will need to copy the PNG output files back to your computer to view it using scp.

Expand
Example commandExample command
Code Block
titleNeed a hint?

Since you are moved into each of the results directories, you need to give the relevant location

Code Block
languagebash
titleNot still not sure? click here for the solution
collapsetrue
./contig_stats.pl -plot -One_pdf -tool Velvet contigs.fa

What do I do now?

...

  1. Get a better assembly: maybe add a different library size, or go into a detailed genome completion project (commonly called "finishing") using a sequence assembly editor like consed or gap4 or AMOS. (Be careful though, the amount of data in NGS data sets can be very difficult for these programs to deal with, since many were designed for Sanger sequencing reads.) You may have a lot of PCR products to make to close gaps and/or to order and orient scaffolds. consed in particular makes this pretty easy, but it may still consume a lot more time and money than the initial shotgun assembly. You can identify some misassemblies by mapping the original reads to the assembly and then viewing them in IGV to look for discordant mate pairs, for example.
  2. Look for things: If you're just after a few homologs, an operon, etc. you're probably done. Most assemblers will be able to take 2x100 data and give you full gene sequences since these are non-repetitive and so assemble well. You can turn the contigs.fa into a blast database (formatdb or makeblastdb depending on which version of blast you have) and start blasting away.

Exercise: Finding proteins in the assembly

Change into one of the Velvet result directories for the set of contigs that you want to analyze and load the blast module.

Find your favorite bacterial protein and see if it exists within the assembly.

...

Try NCBI's "Protein" database - search for Escherichia.

...

Code Block
titleCopy over a few E. coli protein sequences for the next blast
cp /corral-repl/utexas/BioITeam/ngs_course/velvet/test.fa .

Once you have some query sequences you'd like to find in your assembly, turn your assembly into a local blast database and search for them:

Code Block
titleMake a blast database of the contigs and try to find a gene (might have to load the blast module first)
makeblastdb -in contigs.fa -dbtype nucl
tblastn -query test.fa -db contigs.fa -evalue 1e-50 | more

...

Code Block
titleReverse search - all the contigs blasted against a small protein database
makeblastdb -in test.fa -dbtype prot
blastx -query contigs.fa -db test.fa -outfmt "6 qseqid sallseqid bitscore frames ppos sallacc" -evalue 1e-10 | more

Other paths from here:

  1. Predict genes/annotate the genome with de novo gene prediction tool like glimmer, maker, or one of the online gene prediction tools available at NCBI or JCVI.
  2. Use a variety of assembly evaluation tools (a rapidly growing field) - we'll talk about that more on the next page.

Advanced Exercise: A5 Haploid Microbial Genome Assembler

Another approach to try for haploid microbial genomes is the A5 pipeline.

It includes several additional quality control steps (like filtering adaptor contamination and low quality score portions of reads) that make sense when using a De Bruijn graph assembler and is optimized for haploid genomes. They're described in this paper.

Install the a5 pipeline by downloading from here and either 1) moving all of the items in the bin directory into your local bin directory or adding the location of the a5/bin directory to your $PATH.

...

Code Block
titleFor an interleaved paired end input file
a5_pipeline.pl paired_end_2x100_ins_3000_c_25.fastq output

...

  1. .

Further Reading