...

So it is important to know which version of BEDTools you are using, and to read the documentation carefully to see whether anything has changed since your version. When you run the code below, you should see that the version on TACC is bedtools v2.25.0. (Log in to login5.ls5.tacc.utexas.edu first.)

Code Block
titlecopy and paste exercise files
module load bedtools
bedtools --version

...

Take a quick look at a yeast annotation file, sacCer_R64-1-1_20110208.gff, using less. (Log in to login5.ls5.tacc.utexas.edu first.)

Code Block
languagebash
titleLook at GFF annotation entries with less
mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/sacCer_R64-1-1_20110208.gff .

less sacCer_R64-1-1_20110208.gff

...

You should see something like this.

Code Block
titleHistogram of yeast annotation features
  7077 CDS
  6607 gene
   480 noncoding_exon
   383 long_terminal_repeat
   376 intron
   337 ARS
   299 tRNA
   190 region
   129 repeat_region
   102 nucleotide_match
    89 transposable_element_gene
    77 snoRNA
    50 LTR_retrotransposon
    32 telomere
    31 binding_site
    27 rRNA
    24 five_prime_UTR_intron
    21 pseudogene
    17 chromosome
    16 centromere
    15 ncRNA
     8 external_transcribed_spacer_region
     8 internal_transcribed_spacer_region
     6 snRNA
     3 gene_cassette
     2 insertion
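
A histogram like this can be produced by counting the values in column 3 (the feature type) of the GFF file. The sketch below shows one way to do it and is not necessarily the command used to produce the listing above. It runs on a tiny made-up file, demo.gff (a hypothetical name with invented entries), so it can be tried anywhere; the helper name feature_histogram is also just for illustration.

```shell
# Build a tiny, made-up GFF file (tab-separated) for illustration only.
printf '##gff-version 3\n' > demo.gff
printf 'chrI\tSGD\tgene\t335\t649\t.\t+\t.\tID=YAL069W\n' >> demo.gff
printf 'chrI\tSGD\tCDS\t335\t649\t.\t+\t0\tParent=YAL069W\n' >> demo.gff
printf 'chrI\tSGD\tgene\t538\t792\t.\t+\t.\tID=YAL068W-A\n' >> demo.gff
printf 'chrI\tSGD\tCDS\t538\t792\t.\t+\t0\tParent=YAL068W-A\n' >> demo.gff
printf 'chrI\tSGD\ttelomere\t1\t801\t.\t-\t.\tID=TEL01L\n' >> demo.gff

# Skip comment lines, keep column 3 (feature type), count each distinct
# value, and list the most frequent feature types first.
feature_histogram() {
    grep -v '^#' "$1" | cut -f 3 | sort | uniq -c | sort -k1,1nr
}

feature_histogram demo.gff
```

To try it on the real annotation, pass sacCer_R64-1-1_20110208.gff instead of demo.gff.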

...

The program reads the input file twice: once to gather all the attribute names, and a second time to write the attribute values in well-defined columns. You'll see output like this:

Code Block
languagebash
titleConvert GFF to BED with BioITeam script
----------------------------------------
Gathering all attribute names for GTF 'sc_genes.gff'...
  urlDecode = 1, tagAttr = tag
Done!
  6607 lines read
  6607 locus entries
  8 attributes found:
(Alias ID Name Note Ontology_term dbxref gene orf_classification)
----------------------------------------
Writing BED output for GTF 'sc_genes.gff'...
Done! Wrote 6607 locus entries from 6607 lines
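
The two-pass idea can be sketched with standard shell tools. This is only an illustration of the approach, not the BioITeam script itself: the file names demo.gff and demo.converted.tsv, the output column layout, and the "." placeholder for missing attributes are all assumptions made for the example.

```shell
# Tiny made-up GFF (tab-separated); column 9 holds key=value attributes.
printf 'chrI\tSGD\tgene\t335\t649\t.\t+\t.\tID=YAL069W;Name=YAL069W\n' > demo.gff
printf 'chrI\tSGD\tgene\t1807\t2169\t.\t-\t.\tID=YAL068C;gene=PAU8\n' >> demo.gff

# Pass 1: collect every attribute name that appears anywhere in column 9.
# LC_ALL=C keeps the sort order the same on every system.
attr_names=$(cut -f 9 demo.gff | tr ';' '\n' | cut -d '=' -f 1 | LC_ALL=C sort -u | tr '\n' ' ')
echo "attributes found: $attr_names"

# Pass 2: write one column per attribute name, in the same order on every
# line, using "." when a feature lacks that attribute.
awk -F '\t' -v names="$attr_names" '
BEGIN { OFS = "\t"; n = split(names, name, " ") }
{
    split("", val)                      # clear values from the previous line
    m = split($9, pair, ";")
    for (i = 1; i <= m; i++) {
        eq = index(pair[i], "=")
        if (eq > 0) val[substr(pair[i], 1, eq - 1)] = substr(pair[i], eq + 1)
    }
    line = $1 OFS ($4 - 1) OFS $5       # GFF starts are 1-based, BED starts 0-based
    for (i = 1; i <= n; i++)
        line = line OFS ((name[i] in val) ? val[name[i]] : ".")
    print line
}' demo.gff > demo.converted.tsv

cat demo.converted.tsv
```

The first pass is what makes the columns well defined: because the full list of attribute names is known before any output is written, every line gets the same columns in the same order, even when a feature carries only some of the attributes.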

To find out what the resulting columns are, look at the header line of the output BED file:

Code Block
languagebash
head -1 sc_genes.converted.bed 

...

Doesn't this look better?

Code Block
titleConverted BED attributes
chrI    334     649     YAL069W 315     +       YAL069W Dubious
chrI    537     792     YAL068W-A       255     +       YAL068W-A       Dubious
chrI    1806    2169    YAL068C 363     -       PAU8    Verified
chrI    2479    2707    YAL067W-A       228     +       YAL067W-A       Uncharacterized
chrI    7234    9016    YAL067C 1782    -       SEO1    Verified
chrI    10090   10399   YAL066W 309     +       YAL066W Dubious
chrI    11564   11951   YAL065C 387     -       YAL065C Uncharacterized
chrI    12045   12426   YAL064W-B       381     +       YAL064W-B       Uncharacterized
chrI    13362   13743   YAL064C-A       381     -       YAL064C-A       Uncharacterized
chrI    21565   21850   YAL064W 285     +       YAL064W Verified
chrI    22394   22685   YAL063C-A       291     -       YAL063C-A       Uncharacterized
chrI    23999   27968   YAL063C 3969    -       FLO9    Verified
chrI    31566   32940   YAL062W 1374    +       GDH3    Verified
chrI    33447   34701   YAL061W 1254    +       BDH2    Uncharacterized
chrI    35154   36303   YAL060W 1149    +       BDH1    Verified
chrI    36495   36918   YAL059C-A       423     -       YAL059C-A       Dubious
chrI    36508   37147   YAL059W 639     +       ECM1    Verified
chrI    37463   38972   YAL058W 1509    +       CNE1    Verified
chrI    38695   39046   YAL056C-A       351     -       YAL056C-A       Dubious
chrI    39258   41901   YAL056W 2643    +       GPB2    Verified
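
In the rows shown above, the fifth column appears to be the feature length, that is, end minus start (for example, 649 - 334 = 315), which is what you would expect for a 0-based, half-open BED interval. A quick sanity check along those lines is sketched below. It assumes that column layout and uses the first three rows of the listing saved to a scratch file with the hypothetical name demo.bed.

```shell
# First three rows of the listing above, saved to a scratch file.
printf 'chrI\t334\t649\tYAL069W\t315\t+\tYAL069W\tDubious\n' > demo.bed
printf 'chrI\t537\t792\tYAL068W-A\t255\t+\tYAL068W-A\tDubious\n' >> demo.bed
printf 'chrI\t1806\t2169\tYAL068C\t363\t-\tPAU8\tVerified\n' >> demo.bed

# Count rows where column 5 differs from end - start (column 3 - column 2).
mismatches=$(awk -F '\t' '$5 != $3 - $2 { n++ } END { print n + 0 }' demo.bed)
echo "rows where column 5 != end - start: $mismatches"
```

The same awk test can be pointed at the full converted file, skipping its header line first if it has one.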

...