Long Read Genome Assembly GVA2023

Overview

One of the most common uses (if not the most common use) of long read sequencing is improving the structure of genome assemblies (both known and unknown), because longer reads can span the large repetitive regions that make genomes difficult to assemble completely. In practice, if you are going to do this for your own work, you should follow https://github.com/rrwick/Trycycler/wiki, a wiki dedicated to the assembly of genomes using both long and short reads in combination. This represents something of a departure from most other points in the course, where it is stressed that there is no "best" or "one" way to do things. In my experience, Trycycler is hands down the gold standard tool for long and mixed read genome assembly. The one thing that keeps it from being an actual exception is that, if you read through all the steps, you will see that Trycycler essentially takes the output of any and all other assembly programs that produce reasonable assemblies and combines them into a single consensus sequence with the greatest accuracy possible. Again, the most observant may notice that this is yet another tool from Ryan Wick (Porechop and Filtlong).

Learning Objectives

Here we will use the reads trimmed and/or filtered in the Long Read QC tutorial to assemble a genome using Flye (one of the best performing assemblers used in Trycycler).

Prerequisite required

This tutorial makes use of data generated in the quick Long Read QC tutorial. If you have not done that tutorial already you should do it first.

Get some data

Copy all of the barcode01 files (combined, trimmed, adapter, filtered) that you generated in the previous tutorial to a new directory named "GVA-assembly-LongRead":

mkdir $SCRATCH/GVA-assembly-LongRead
cd $SCRATCH/GVA-assembly-LongRead
cp $SCRATCH/GVA_nanopore/barcode01* .


Flye

Install

conda create --name GVA-flye -c conda-forge -c bioconda flye
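
The new environment has to be activated (conda activate GVA-flye) before flye is available in your session. As a minimal sketch, the check below reports whether a program can be found on your PATH. The helper name check_tool is made up for this example and is not part of conda or flye.

```shell
# Report whether a program is available on PATH.
# check_tool is a hypothetical helper name used only in this example.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found; try 'conda activate GVA-flye' first"
        return 1
    fi
}

# Check for flye; '|| true' keeps a script going even if it is missing.
check_tool flye || true
```

If flye is found, flye --version should print the installed version number.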


Assembling

Remember to make sure you are on an idev node

For reasons discussed numerous times throughout the course already, please be sure you are on an idev node. It is unlikely that you are currently on one, as copying files while on an idev node seems to be having problems, as discussed. Remember that the hostname and showq -u commands can be used to check whether you are on one of the login nodes or one of the compute nodes. If you need more information or help re-launching a new idev node, please see this tutorial.
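
As a rough sketch of the hostname check, the function below labels a hostname as a login node or not. It assumes that login node hostnames start with "login", which is an assumption about the cluster naming scheme rather than something guaranteed; node_kind is a made-up helper name.

```shell
# Classify a hostname as login node or (probably) compute node.
# ASSUMPTION: login node hostnames begin with "login"; adjust for your system.
node_kind() {
    case "$1" in
        login*) echo "login node" ;;
        *)      echo "probably a compute node" ;;
    esac
}

# Apply the check to the machine you are currently on.
node_kind "$(hostname)"
```

If this reports a login node, start an idev session before running the assembly commands below.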

Setting up output folders and flye commands
mkdir Assembly
for f in bar*; do echo "flye --threads 48 --out-dir Assembly/$f --nano-raw $f";done > flye.commands
chmod +x flye.commands

Use the head command to make sure that the flye.commands file has a command for each of the different read subsets you are interested in comparing, and then launch the commands using:

./flye.commands

Interpreting the data

There are several ways you can analyze the data:

  • You can watch the progress scroll by in fits and jerks, and try to scroll back and forth looking at the different summary statistics
  • You can copy paste them into something like excel to see them side by side
  • You can wait until they are done and focus on the output files:
    • assembly_info.txt
    • assembly.fasta
  • Additionally, depending on the number of contigs generated for each sample, it may be useful to determine how the contigs are related to each other visually using a program like https://rrwick.github.io/Bandage/. (Can you identify the developer?)
    • To do this you would need to install Bandage locally, transfer the assembly_graph.gfa files back to your computer, and load them into the program.

As we've begun reminding you today, idev is not the normal method of doing analysis. While the same progress output that prints to the screen is still generated, it is more beneficial to focus on the output files, as those are the tangible products that you would expect to be the most robust and systematic.
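
One way to compare the assemblies side by side is to summarize each assembly_info.txt file. The sketch below assumes the tab-separated layout Flye normally writes (sequence name in column 1, length in column 2, circular Y/N in column 4, with a header line starting with #). The function name summarize_info is made up for this example.

```shell
# Summarize one assembly_info.txt: contig count, total length, circular contigs.
# ASSUMPTION: tab-separated columns seq_name, length, cov., circ., ...
summarize_info() {
    awk -F'\t' '
        /^#/ { next }                                  # skip the header line
        { n++; total += $2; if ($4 == "Y") circ++ }    # tally each contig
        END { printf "%d\t%d\t%d\n", n, total, circ }
    ' "$1"
}

# One row per assembly directory created by the flye commands above.
printf 'assembly\tcontigs\ttotal_bp\tcircular\n'
for info in Assembly/*/assembly_info.txt; do
    [ -f "$info" ] || continue    # skip if no assembly has finished yet
    printf '%s\t%s\n' "$(basename "$(dirname "$info")")" "$(summarize_info "$info")"
done
```

The resulting table can be pasted into Excel, and should help with the first two questions below.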

Can you answer the following questions?

  1. Were the same number of contigs generated for all sets of reads?
  2. Is there a difference between the sizes of the genomes assembled?
  3. Is there any way to tell, from what you currently have, which is right?
  4. How can you figure out what these contigs are?
If you don't know the answers to these, or don't see how to get the answers to these from the files you have available, talk to your instructor.

Next steps

In the scope of this class, you could consider blasting the assembled genome to see if you can identify what organism it is, and to determine how similar your assembled genome is to something already known. Alternatively, you could filter or trim the reads differently in an attempt to improve the assemblies, or you could try some of the other assembly programs recommended as part of Trycycler.
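
If you want to try BLAST, it may be easier to start with a single contig than with a whole assembly. As a sketch, the function below prints the longest sequence from a FASTA file using awk. The name longest_contig is made up for this example, and picking the longest contig is only one reasonable starting point.

```shell
# Print the longest record in a FASTA file (header line, then the sequence on one line).
# longest_contig is a hypothetical helper name used only in this example.
longest_contig() {
    awk '
        /^>/ {                                   # a new record starts
            if (len > best) { best = len; keep = hdr "\n" seq }
            hdr = $0; seq = ""; len = 0; next
        }
        { seq = seq $0; len += length($0) }      # accumulate sequence lines
        END {
            if (len > best) keep = hdr "\n" seq  # handle the final record
            if (keep != "") print keep
        }
    ' "$1"
}
```

For example, running longest_contig on one of the assembly.fasta files in your Assembly directory and redirecting the output to a new file gives you a single sequence you can submit to BLAST.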


Return to GVA2023 course page.