This tutorial introduces you to long read fastq files generated from Oxford Nanopore data, and compares such data to short read Illumina data in light of what you have learned throughout the course. After completing this tutorial you should:
Depending on which optional tutorials you have worked on, you may have created several different environments which contain several of the programs used later in this tutorial. As environments are built around programs, not around data, it should make sense that you can use the same environments that you created for short reads with long reads. This does not mean that the "best" tool for short reads is the "best" tool for long reads, or even "operational" with long reads, and vice versa. This tutorial is written assuming you create the following new environment; you can of course skip this and swap in your own environment as needed for whatever parts of the tutorial you are interested in trying for yourself.
conda create --name GVA-ReadQC-long -c conda-forge -c bioconda fastqc seqfu cutadapt porechop filtlong
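Once the environment has been created and activated, a quick check that each program is actually on your PATH can save confusion later. The following is only a minimal sketch: the function name check_tools is made up for illustration, and the program names are simply those listed in the conda create command above.

```shell
# Run this after activating the environment, e.g.:
#   conda activate GVA-ReadQC-long

# check_tools is a hypothetical helper name, not part of conda or any tool.
# It prints one line per program: "found" if it is on the PATH, else "missing".
check_tools() {
    for tool in "$@"; do
        if command -v "$tool" > /dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
        fi
    done
}

check_tools fastqc seqfu cutadapt porechop filtlong
```

Any program reported as missing suggests the environment is not active or the install did not complete.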
As mentioned throughout the course, you cannot copy from the BioITeam (because it is on corral-repl) while on an idev node. Log out of your idev session, then copy the files.
The choice between the 2 boxes is a question of style and preference; as you continue, you may find yourself favoring one over the other.
The downloaded files are from a single sample and are the raw files provided after on-instrument basecalling on an Mk1C instrument. In the next section we will begin to interrogate them.
34 You can tell this by
4,000 OR 952
A command might look like:
zgrep -c "^+$" *.gz
The highest numbered file has a read count of 952 while the rest have exactly 4,000 reads; this may do the following:
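As an independent cross-check on the zgrep approach, reads can also be counted as the number of lines divided by 4. This is a sketch that assumes standard 4-line fastq records (sequence and quality each on a single line); the function name count_reads and the demo file are made up for illustration.

```shell
# count_reads is a hypothetical helper: reads = (lines in the file) / 4.
# Assumes every record is exactly 4 lines (header, sequence, "+", quality).
count_reads() {
    echo $(( $(gzip -dc < "$1" | wc -l) / 4 ))
}

# A tiny, made-up two-read file to show the function in action.
demo_file=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' | gzip > "$demo_file"
count_reads "$demo_file"

# On the real data, a loop such as this would print one count per file:
#   for f in raw_reads/*.gz; do printf '%s\t%s\n' "$f" "$(count_reads "$f")"; done
```

If the two methods ever disagree for a file, that file is worth a closer look before continuing.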
This may be the only place in the course where file compression options are explicitly discussed. Files can be compressed in multiple different ways, though throughout the course we only work with gzip. This allows us to take advantage of zgrep (as above) and also lets us quickly combine gzipped files by simple concatenation. You may be able to think of uses for this with paired end reads if you have sequenced the same sample on multiple runs. Be careful when doing so: most programs that use paired end information do so only by comparing the paired end files line by line, not by actually checking the header information for pairs, so the files must be combined in the same order for both reads. Not every compression format offers this functionality; zip archives, for example, cannot be combined by simple concatenation.
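The ability to combine gzipped files can be seen on a toy example. The sketch below uses made-up file names and reads in a temporary directory, and uses gzip -dc piped to grep rather than zgrep purely for portability; it is an illustration, not part of the tutorial's data.

```shell
# Work in a throwaway directory so no real data is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Two tiny, made-up fastq files with one read each (hypothetical names).
printf '@read1\nACGT\n+\nIIII\n' | gzip > part_0.fastq.gz
printf '@read2\nTTGCA\n+\nIIIII\n' | gzip > part_1.fastq.gz

# Combine the compressed files directly, without decompressing them first.
cat part_0.fastq.gz part_1.fastq.gz > combined.fastq.gz

# Decompressing the combined file yields both reads, in the order given to cat.
gzip -dc combined.fastq.gz

# Count reads as before: lines consisting only of "+".
combined_reads=$(gzip -dc combined.fastq.gz | grep -c '^+$')
echo "reads in combined file: $combined_reads"
```

Because the reads come out in the order the files were given to cat, paired end files need to be concatenated in the same order for read 1 and read 2.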
Long read sequencing is single ended, and while there may be reasons to keep these partial files, quality assessment is more logical when done on the entire data set. Because there are no read pairs to keep in sync, the order in which the files appear in the combined file does not actually matter.
Recall that "control + c" will stop whatever command you are currently running. This is mentioned here to highlight the importance of the ">" mark in the next code block: without it, cat would print the compressed data to your terminal instead of writing it to a file.
cd $SCRATCH/GVA_nanopore/
cat raw_reads/*.gz > barcode01.combined.fastq.gz
132,952
A command might look like:
zgrep -c "^+$" barcode01.combined.fastq.gz
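As a quick sanity check, the expected total can be worked out from the per-file counts reported earlier. The variable names below are made up for illustration; the numbers are the ones from this tutorial (34 files, all but the highest numbered one holding 4,000 reads).

```shell
total_files=34            # number of raw fastq.gz files
reads_per_full_file=4000  # read count in every file except the last
reads_in_last_file=952    # read count in the highest numbered file

# Every file except the last is full.
expected_total=$(( (total_files - 1) * reads_per_full_file + reads_in_last_file ))
echo "$expected_total"    # prints 132952
```

If the count from the combined file did not match this value, the most likely explanations would be a missed input file or an interrupted cat command.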
Just like when you were first introduced to short read fastq files, it is very common to want to quickly get first impressions of the data you are working with. Again, we will use FastQC, but also seqfu, which gives additional metrics about our file.