As discussed throughout the class, long reads have recently become cheaper and more accurate, making them more available and more useful for NGS applications. This tutorial is designed to build on the same principles (though using different tools in some cases) to familiarize yourself with, and improve, a set of long reads.
This tutorial introduces you to long read fastq files generated from Oxford Nanopore data, and compares such data to short read Illumina data in light of what you have learned throughout the course. After completing this tutorial you should:
Depending on what optional tutorials you have worked on, you may have created several different environments which contain several of the programs used later in this tutorial. As environments are set up around programs, not around data, it should make sense that you can use the same environments that you created for short reads with long reads. This does not mean that the "best" tool for short reads is the "best" tool for long reads, or even "operational" with long reads, and vice versa. This tutorial is written assuming you create the following new environment; you can of course skip this and swap in your existing environments as needed for whatever parts of the tutorial you are interested in trying for yourself.
conda create --name GVA-ReadQC-long -c conda-forge -c bioconda fastqc seqfu cutadapt porechop filtlong |
As mentioned throughout the course, you cannot copy from the BioITeam (because it is on corral-repl) while on an idev node. Log out of your idev session, then copy the files. |
The choice between the 2 boxes is a question of style and preference; as you continue on you may find yourself favoring one over the other. |
The files downloaded are from a single sample and are the raw files provided after on-instrument basecalling on a Mk1C instrument. In the next section we will begin to interrogate them.
34. You can tell this by counting the files, for example: ls raw_reads/*.gz | wc -l
|
4,000 OR 952. A command might look like: zgrep -c "^+$" *.gz The highest numbered file has a read count of 952 while the rest have exactly 4,000 reads; this is because the instrument writes reads to fastq files in batches of 4,000, with the final file holding whatever reads are left over.
|
This may be the only place in the course where file compression options are explicitly discussed. There are multiple different ways that files can be compressed, though throughout the course we only work with gzip. This allows us to take advantage of zgrep (as above), but also the ability to quickly combine gzipped files, since concatenated gzip streams decompress as a single file. You may be able to think of uses for this with paired end reads if you have sequenced the same sample on multiple runs. Be careful when doing so, as most programs that use paired end information only do so by comparing line by line between the paired end files, not actually checking the header information for pairs. Other compression programs (zip, bzip2, xz) do not offer this functionality.
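The gzip concatenation property is easy to demonstrate with a couple of throwaway files (the file names here are just for illustration):

```shell
# Make two small gzipped files
printf 'AAA\n' | gzip > part1.txt.gz
printf 'CCC\n' | gzip > part2.txt.gz

# Concatenating the compressed files produces a single valid gzip stream
cat part1.txt.gz part2.txt.gz > combined.txt.gz

# Decompressing shows the contents of both files, in order
zcat combined.txt.gz
```

zcat prints AAA then CCC, exactly as if the uncompressed files had been concatenated and then compressed.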
In the case of long read sequencing we are working with single end reads, and while there may be reasons to keep these partial files, quality assessment is more logical if done on the entire data set. Additionally, since there are no read pairs to keep in sync, the order in which the files appear in the combined file does not actually matter.
Recall that "control + c" will stop whatever command you are currently running. This is mentioned here to highlight the importance of the ">" mark in the next code block: without the redirect, cat would print the combined reads to your screen rather than to a file. |
cd $SCRATCH/GVA_nanopore/
cat raw_reads/*.gz > barcode01.combined.fastq.gz |
132,952. A command might look like: zgrep -c "^+$" barcode01.combined.fastq.gz |
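As a sanity check, the combined count matches the per-file counts from earlier: 33 full files of 4,000 reads plus one partial file of 952 reads.

```shell
# 33 files of exactly 4,000 reads plus the final file of 952 reads
echo $((33 * 4000 + 952))
# prints: 132952
```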
Just like when you were first introduced to short read fastq files, it is very common to want to quickly get first impressions of the data you are working with. Again we will use fastqc, but also seqfu, which gives additional metrics for our file.
fastqc barcode01.combined.fastq.gz |
After transferring the barcode01.combined_fastqc.html file back to your computer, and being at least a little familiar with short read data, you may have questions such as:
If you installed seqfu in a new conda environment at the start of this tutorial you can also use it to look at some overall sequencing stats. In my own work, I find this particularly useful for comparing across samples and trying to correlate assembly or variant calling outcomes with these metrics. Luckily the command is very simple and with a single sample it does not take too long.
seqfu stats -n -v *.gz --csv |
For the combined reads from this we can see:
Given the information you have about the data set, there are a few things that can be done to improve the data set. Take a moment to consider what we saw above in the fastqc and seqfu results to consider how you might do this before continuing.
Ultimately, this is something that is not routinely done and does not appear to interfere with data analysis (citations needed). Unfortunately, our favorite trimming tool fastp does not work on nanopore data, nor do other programs like Trimmomatic.
https://github.com/rrwick/Porechop is a tool specifically developed for removing adapter sequences from nanopore libraries. Unfortunately it can be quite slow, and it has been abandoned by its creator due to the difficulty of keeping it up and running, as you can see on the GitHub page. That being said, the tool is functional and it does work. It also provides an option to discard reads that contain a different barcode sequence in the middle of the read, which seems like a safer bet. As mentioned, this command can take quite a bit of time to complete, so consider modifying the following command to work on only one of the raw sequencing files with 4,000 reads if you are just looking for an overview of the program and don't plan to run the assembly afterwards.
porechop -t 48 -i barcode01.combined.fastq.gz -o barcode01.combined.adapter-trimmed.fastq.gz --discard_middle |
From the output you can notice the following about which adapters are present:
Trimming adapters from read ends
Rapid_adapter: GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA
BC01_rev: CACAAAGACACCGACAACTTTCTT
BC01: AAGAAAGTTGTCGGTGTCTTTGTG
BC02_rev: ACAGACGACTACAAACGGAATCGA
BC02: TCGATTCCGTTTGTAGTCGTCTGT
BC01: AAGAAAGTTGTCGGTGTCTTTGTG
BC01_rev: CACAAAGACACCGACAACTTTCTT
BC02: TCGATTCCGTTTGTAGTCGTCTGT
BC02_rev: ACAGACGACTACAAACGGAATCGA |
Obviously there is some concern about the presence of both barcode 1 and barcode 2 sequences in a single sample. I don't have an explanation for how this can happen, either in relation to the library prep or to how the sequencer splits the reads based on this signal, but I suspect it may be related to the middle adapter situation. Reads with an adapter in the middle of the read are quite rare:
2,459 / 132,952 reads were discarded based on middle adapters |
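"Quite rare" can be put in numbers with a little shell arithmetic:

```shell
# Fraction of reads discarded for middle adapters,
# in hundredths of a percent (integer arithmetic)
echo $((2459 * 10000 / 132952))
# prints: 184, i.e. ~1.8% of reads
```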
In total more than 14 Mbp were removed:
132,715 / 132,952 reads had adapters trimmed from their start (13,594,659 bp removed)
19,374 / 132,952 reads had adapters trimmed from their end (708,769 bp removed) |
As mentioned in the fastqc results, the first few bases are directly related to the barcode sequence present, and as mentioned in many other tutorials, when you can get rid of these artifacts it's usually a good idea. In the initial read QC tutorial we used cutadapt to remove low quality bases from the 3' end of the read, but it can just as easily be used to remove bases from the 5' end.
cutadapt -o barcode01.combined.trimmed.fastq -u 9 barcode01.combined.fastq |
~1.2 Mbp, based on:
Total basepairs processed: 438,208,826 bp
Total written (filtered): 437,012,258 bp (99.7%) |
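The difference works out to exactly 9 bases removed from each of the 132,952 reads, which matches the -u 9 option given to cutadapt:

```shell
# Bases removed = bases processed minus bases written
echo $((438208826 - 437012258))
# prints: 1196568

# Which is exactly 9 bases from each of the 132,952 reads
echo $((132952 * 9))
# prints: 1196568
```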
As mentioned in several of the presentations, you don't need super high coverage for most applications (and when you do need high coverage, you often have to contend with error rates). Therefore it is often beneficial to use only the best of the best reads. filtlong (https://github.com/rrwick/Filtlong) is a program that allows you to set various thresholds to make these divisions. Those who are extra observant will notice this is the same developer who created porechop.
What filtering needs to be done will largely depend on the downstream applications, and a single dataset may even be filtered in different ways for different phases of the analysis (i.e. only the longest reads used to assemble, while the highest quality reads are used to polish an assembly, removing as many SNVs as possible). One of the benefits of this program, beyond its capabilities, is that the GitHub page is extremely descriptive, making it easy to understand the commands and what you need to modify. For now, focus on the example command for having no reference available: https://github.com/rrwick/Filtlong#example-commands-detailed. Since our dataset is under 500 Mbp, you should ignore the --target_bases option.
filtlong --min_length 1000 --keep_percent 90 barcode01.combined.fastq.gz | gzip > barcode01.combined.min1k.top90.fastq.gz |
83,189. seqfu also shows an increase of ~50% in average read length. |
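Note that filtlong's --keep_percent is measured in bases rather than reads (per its GitHub documentation), so the fraction of reads retained can be well below 90% once short reads are preferentially discarded:

```shell
# Percentage of reads (not bases) retained after filtering
echo $((83189 * 100 / 132952))
# prints: 62
```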
Consider trimming in different ways, or making more or less stringent filtered sets, before moving on to the next tutorials.
Return to the GVA 2023 course page