...

Code Block
language bash
title activate conda envrionment
conda activate GVA-bowtie2-mapping bowtie2 --version
You should see version 2.4.5

Code Block

language	bash
title	Convert reference to fasta

module load bioperl
bp_seqconvert.pl --from genbank --to fasta < CP009273.1_Eco_BW25113.gbk > CP009273.1_Eco_BW25113.fasta

Warning

title	Remember to make sure you are on an idev done

For reasons discussed numerous times throughout the course already, please be sure you are on an idev done. It is unlikely that you are currently on an idev node as copying the files while on an idev node seems to be having problems as discussed. Remember the hostname command and showq -u can be used to check if you are on one of the login nodes or one of the compute nodes. If you need more information or help re-launching a new idev node, please see this tutorial.

Code Block

language	bash
title	Index the reference

mkdir bowtie2
bowtie2-build --threads 68 CP009273.1_Eco_BW25113.fasta bowtie2/CP009273.1_Eco_BW25113

The following command will take ~7 ~5 minutes to complete. Before you run the command execute 'bowtie2 -h' so while the command is running you can try to figure out what the different options are doing that we did not include in our first tutorial. Note that in this case we are passing gzipped (compressed) fastq files to bowtie2 rather than uncompressed fastq files. Storing compressed files saves a significant amount of space, and often can be worked with without needing to directly decompress them.

Code Block

language	bash
title	Map reads

gunzip *.gz
bowtie2 --very-sensitive-local -t -p 68 -x bowtie2/CP009273.1_Eco_BW25113 -1 SRR4341249_1.fastq.gz -2 SRR4341249_2.fastq.gz -S bowtie2/SRR4341249-vsl.sam --un-conc SRR4341249-unmapped-vsl.fastq

Expand

title	Click here to explain the new options.

option	effect
--very-sensitive-local	map in very sensitive local alignment mode
-t	print time info
-p 68	use 68 processors
-x bowtie2/CP009273.1_Eco_BW25113	index
-1 SRR4341249_1.fastq.gz -2 SRR4341249_2.fastq.gz	reads used for mapping
-S bowtie2/SRR4341249-vsl.sam	Sam file detailing mapping
--un-conc SRR4341249-unmapped-vsl.fastq	print reads which do not map to the genome to the file SRR4341249-unmapped-vsl.fastq

Expand

title	What percent of reads mapped?

The stdoutput of the program listed:

6566.74% 62% overall alignment rate

Expand

title	How many reads are in our unmapped fastq file?

grep -c "^+SRR4341249" SRR4341249-unmapped-vsl.1.fastqfastq

returns: 441842441835

Note that head or tail of the fastq file shows that our 3rd line is not just "+", hence why we have to add the additional sequence ID to the grep command

...

Expand

title	How many contigs were generated in each case?

2 for the plasmid and 16 for the full

Code Block

language	bash
title	Where does the answer come from?
collapse	true

grep -c "^>" unmapped*/contigs.fasta

Expand

title	Are any of the contigs the same?

Yes, both contigs detected in the plasmid mode were also detected in the full mode

Code Block

language	bash
title	Where does the answer come from?
collapse	true

grep "^>" unmapped*/contigs.fasta
# lists identical lengths and coverages for 2 plasmid contig

Expand

title	What sizes are the contigs?

5441

3170

14 others less than 500bp

Code Block

language	bash
title	Where does the answer come from?
collapse	true

grep "^>" unmapped*/contigs.fasta
# same command as above, just focusing on the length value

Expand

title	which is most likely to be our plasmid?

The contig that is 3170.

Code Block

language	bash
title	Where does the answer come from?
collapse	true

# From above, I stated this was a high copy plasmid, it has a coverage value of 12,698 compared to 98 for the larger contig

Expand

title	Is that actually our plasmid?

Yes! The actual plasmid reference locus line stats:

LOCUS GFP_Plasmid_Sequ 3115 bp DNA circular UNA 18-NOV-2013

Expand

title	Why might the sizes not agree?

My thought is that this was raw fastq files that were fed into the assembly, not trimmed files. Leads me to hypothesize that the difference in size is related to the presence of adapter sequence. Alternatively, it may be that small changes in either bowtie2 or spades versions or read trimmers have influenced what reads are considered.

Next steps

Here we have presented a proof of concept that unmapped reads can be used to find something that we actually did know was present. We also found something that was even longer that wasn't expected.

...

Expand

title	How might we go about finding out what an assembled product actually is when it truly is novel rather than a positive control?

blast. In fact I did just that and identified it as an artifact of sequencing. The contig corresponds to phiX.

Code Block

title	Steps to identify phiX
linenumbers	true

copy full 5441 bp of sequence
Go to https://blast.ncbi.nlm.nih.gov/Blast.cgi
large list of results, including vast majority listing phiX or genome assembly/scaffold

Why does seeing phiX (link to screen shot of blast results) tell me that it is an artifact? phiX is used as a loading control for illumina runs to both tell the difference between a failed run because of bad libraries and a failed run due to poor base diversity.

Expand
title How would we decide if it was real or important if we hadn't recognized it?
Depends on blast results, how high coverage is compared to genome, gene content
Expand
title In other systems what else might you find?
viruses, mobile genetic elements, evidence of microbiome, mitochondria, chloroplast, other plasmids
Expand
title How does this effect mapping?
Consider advanced read mapping with multiple references tutorial

Expand

title	Do you expect to find more novel DNA in a highly accurate reference file, or a "similar' reference file?

Similar. The fact that the reference is not as accurate will lower the alignment scores across the board, potentially dropping below thresholds to be able to anchor the match at all. Look deeper at the bowtie2 mapping command where you used --very-sensative-local mode the documentation tells you about tolerated mismatches etc. The more reads you have that don't match, the more novel DNA inserts you are likely to deal with.

...

Versions Compared

Old Version 1

New Version 2

Key

Next steps

Page Comparison

Versions Compared

Old Version 1

New Version 2

Key

Next steps