The following DNA sequencing read data files were downloaded from the NCBI Sequence Read Archive via the corresponding European Nucleotide Archive record. They come from Illumina Genome Analyzer sequencing of a paired-end library from a (haploid) E. coli clone that was isolated from a population of bacteria that had evolved for 20,000 generations in the laboratory as part of a long-term evolution experiment (Barrick et al., 2009). The reference genome is the ancestor of this E. coli population (strain REL606), so we expect the read sample to differ from this reference at positions that correspond to mutations that arose during the evolution experiment.

Transferring Data

Rather than downloading these files from the NCBI SRA or the ENA yourself, you can find them in the following directory:

Code Block
$BI/gva_course/mapping/data

...

In this case the bp_seqconvert.pl perl script is included as part of the bioperl module package. Rather than attempt to find it as part of a conda package or in some other repository, we will use the module version. If you need this script in the future outside of TACC, see https://metacpan.org/dist/BioPerl/view/bin/bp_seqconvert


Code Block
languagebash
titleLoad the bioperl module, then use the which command (recall that it reports where executable files are located) to find the script. Running the script without any options displays the help contents
module load bioperl/1.007002
which -a bp_seqconvert.pl

...




Info
titleThe information in this box is related to the path variable, perl programming libraries, having multiple copies of a script/file available in your path, and computer architecture. If you are not interested in this, you can skip this box.


Code Block
languagebash
titleOn the head node, after you have loaded the bioperl module, there are actually 2 instances of bp_seqconvert.pl available to you.
module load bioperl/1.007002
which -a bp_seqconvert.pl

If you run this on an idev node you get 1 result, related to the bioperl module, but if you run it on the head node (outside idev) you get 2 results. On the head node, 1 result points to the BioITeam directory near where you keep finding your data (/corral-repl/utexas/BioITeam/), specifically its "bin" folder, which is full of binaries and (typically small) bash/python/perl/R scripts that someone has written to help the TACC community. The other is in a folder specifically associated with the bioperl module. You can load and unload the bioperl module to see the difference.

Info
titleWhy do you get 2 different results depending on whether you are inside or outside of an idev node?

This has to do with how compute nodes are configured. On stampede2 compute nodes, /corral-repl/ and all of its subdirectories are not accessible, so even though the BioITeam directory is in your $PATH, the command line on a compute node can't access it. This is why in later tutorials you have to log out of the idev session to copy new raw data files to work with.
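One way to see this for yourself is to walk through every directory in $PATH and report whether the current node can actually read it. The following is only a minimal sketch: the function name `check_path_dirs` is made up for this example and is not part of any TACC or BioITeam tooling.

```shell
# Hypothetical helper: print each directory in $PATH and whether
# the node you are currently on can read it.
check_path_dirs() {
  old_ifs=$IFS
  IFS=:                      # split $PATH on colons
  for dir in $PATH; do
    if [ -d "$dir" ] && [ -r "$dir" ]; then
      echo "accessible:   $dir"
    else
      echo "inaccessible: $dir"
    fi
  done
  IFS=$old_ifs               # restore normal word splitting
}

check_path_dirs
```

Running it once on the head node and once inside an idev session and comparing the two outputs should show which $PATH entries the compute node cannot reach.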

If you try to run the BioITeam version of the script (/corral-repl/utexas/BioITeam/bin/bp_seqconvert.pl) from the head node without the bioperl module loaded, you get an error message similar to the following:


Code Block
module unload bioperl
bp_seqconvert.pl


No Format
Can't locate Bio/SeqIO.pm in @INC (@INC contains: /corral-repl/utexas/BioITeam//local/share/perl5 /corral-repl/utexas/BioITeam//perl5/lib/perl5/x86_64-linux-thread-multi /corral-repl/utexas/BioITeam//perl5/lib/perl5 /corral-repl/utexas/BioITeam//perl5/lib64/perl5/auto /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at /corral-repl/utexas/BioITeam/bin/bp_seqconvert.pl line 8.
BEGIN failed--compilation aborted at /corral-repl/utexas/BioITeam/bin/bp_seqconvert.pl line 8.


Info
titleDeciphering error messages

The above error message is pretty helpful, but much less so if you are not familiar with perl. As I doubt anyone in the class is more familiar with perl than I am, and I am hardly familiar with perl at all, this is a great opportunity to explain how I would tackle the error message to figure out what is going on.

  1. "compilation aborted at /corral-repl/utexas/BioITeam/bin/bp_seqconvert.pl line 8." 
    1. The last line here actually tells us that the script did not get very far, only to line 8.
    2. My experience with other programming languages tells me that the beginning of a script is all about checking that the script has access to everything it needs to do what it is intended to do, so this has me thinking some kind of package might be missing.
  2. "(@INC contains: ..."
    1. This reads like the PATH variable, but it lists locations I don't recognize as being in my path, suggesting this is not about some external binary or other program.
    2. Many of the individual paths contain "lib" in one form or another. This reinforces the idea from above that some kind of package is missing.
  3. "Can't locate Bio/SeqIO.pm in @INC"
    1. "Can't locate" reads like a plain-text way of saying something is missing. It also reads like a generic message: it is not specific to my system/environment (unlike all the listed directories) and not specific to the script I am trying to run (unlike the last line, which names the script we tried to invoke).
    2. This is what should be googled for help solving the problem.
      1. The Google results list similar error messages associated with different repositories/programs (GitHub issues), suggesting some kind of common underlying problem.
      2. The 3rd result, https://www.biostars.org/p/345331/, reads like a generic problem, and sure enough the answers detail needing to have the Bio library installed from cpan (perl's package management system).

We get this error message because, while perl is installed on stampede2, the required SeqIO.pm library is not available by default, but it is easily made available by loading the bioperl module. As you will likely only rarely need to convert sequence files between different formats, bioperl is not listed as one of the modules in the .bashrc file in your $HOME directory that you set up yesterday, but if you find yourself using the command `module load bioperl` often, you may want to add it.

After loading the bioperl module to get past the error message, run the script from the BioITeam without any arguments to get the help message:

Code Block
module load bioperl

Code Block
languagebash
titleOn the head node, after loading the bioperl module, you have access to the program in 2 different locations. 
module load bioperl
which -a bp_seqconvert.pl

How does the computer know which location to use?

  • It will use whatever location it finds earliest in the $PATH,
  • which is the same as the top line in the which -a command output,
  • which is the same as the line printed if you run the `which` command without the "-a".
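The lookup order described above can be tried out safely with two throwaway copies of a script. This is only an illustrative sketch: the temporary directories and the name `demo_script.sh` are made up for the example and have nothing to do with the real bp_seqconvert.pl locations.

```shell
# Create two hypothetical directories, each holding a script with the same name.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/first" "$demo_dir/second"

for d in first second; do
  printf '#!/bin/bash\necho "copy in %s"\n' "$d" > "$demo_dir/$d/demo_script.sh"
  chmod +x "$demo_dir/$d/demo_script.sh"
done

# Put both directories at the front of PATH; "first" comes earlier.
PATH="$demo_dir/first:$demo_dir/second:$PATH"

# List every copy found in PATH, in search order.
which -a demo_script.sh

# The bare name resolves to the earliest copy in PATH.
resolved=$(command -v demo_script.sh)
echo "bare name resolves to: $resolved"

# A full path always runs exactly the copy you name.
second_output=$("$demo_dir/second/demo_script.sh")
echo "$second_output"
```

Swapping the order of the two directories in the PATH assignment changes which copy the bare name runs, while the full-path call keeps running the same copy.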

Using just the script name by itself runs whichever copy is found first, but you can always force the computer to use a given copy by specifying the full path to the copy you want. Thus, the following 2 commands are not equivalent:

Code Block
linenumberstrue
/corral-repl/utexas/BioITeam/bin/bp_seqconvert.pl
/home1/apps/bioperl/1.007002/bin/bp_seqconvert.pl

While the commands are different, both copies can use the same bioperl library SeqIO.pm when the bioperl module is loaded and thus work. 


Convert a gbk reference to an embl reference

...

Code Block
languagebash
titleTry reading through the program help when you run the bp_seqconvert.pl without any options to see the syntax required
collapsetrue
module load bioperl
bp_seqconvert.pl --from genbank --to embl < NC_012967.1.gbk > NC_012967.1.embl
head -n 100 NC_012967.1.embl
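
Beyond looking at the first 100 lines, a quick sanity check is to confirm that both files hold the same number of sequence records. The sketch below relies on GenBank records starting with a LOCUS line and EMBL records starting with an ID line. The function names are invented for this example, and it is a rough check only, not a full validation of the conversion.

```shell
# Count records in a GenBank file (each record starts with a LOCUS line).
count_genbank_records() { grep -c '^LOCUS' "$1" || true; }

# Count records in an EMBL file (each record starts with an ID line).
count_embl_records() { grep -c '^ID ' "$1" || true; }

# Hypothetical helper: compare the record counts of the two files.
compare_record_counts() {
  gbk_n=$(count_genbank_records "$1")
  embl_n=$(count_embl_records "$2")
  if [ "$gbk_n" -eq "$embl_n" ]; then
    echo "OK: $gbk_n record(s) in both files"
  else
    echo "MISMATCH: $gbk_n GenBank vs $embl_n EMBL record(s)"
    return 1
  fi
}

# Only run the check if the files from the conversion above exist.
if [ -f NC_012967.1.gbk ] && [ -f NC_012967.1.embl ]; then
  compare_record_counts NC_012967.1.gbk NC_012967.1.embl
fi
```

A mismatch would suggest the conversion stopped early or that one of the files is not in the format its extension implies.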

...