Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Unfortunately, both formats are obscure and hard to work with directly. While bedtools does accept annotation files in GFF/GTF format, you will not like the results. This is because the most useful information in a GFF/GTF file is in a looslyloosely-structured attributes field.

Also unfortunately, there are a number of variations of both annotation formats However both GTF and GFF share the first 8 fields ( Tab-separated )fields:

  1. seqname - The name of the chromosome or scaffold.
  2. source - Name of the program that generated this feature, or other data source (e.g. database)
  3. feature_type - Type of the feature. Examples of common feature types include:
    • Some examples of common feature types are:
      • CDS (coding sequence), exon
      • gene, transcript
      • start_codon, stop_codon
  4. start - Start position of the feature, with sequence numbering starting at 1.
  5. end - End position of the feature, with sequence numbering starting at 1.
  6. score - A numeric value. Often but not always an integer.
  7. strand - Defined as + (forward), - (reverse), or . (no relevant strand)
  8. frame - For a CDS, one of 0, 1 or 2, specifying the reading frame of the first base; otherwise '.'

...

Code Block
languagebash
titleLook at GFF annotation entries with less
mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/sacCer_R64-1-1_20110208.gff .

# Use the less pager to look at multiple lines
less sacCer_R64-1-1_20110208.gff

# Look at just the most-important Tab-separated columns
cat sacCer_R64-1-1_20110208.gff | grep -v '#' | cut -f 1,3-5 | head -20

# Include the ugly 9th column where attributes are stored
cat sacCer_R64-1-1_20110208.gff | grep -v '#' | cut -f 1,3,9 | head

In addition to comment lines (starting with #), you can see the chrI contig names in column 1 and various feature types in column 3. You see also see tags like Name=YAL067C;gene=SEO1; among the attributes on some records, but in general the attributes column information is really ugly.

...

Code Block
languagebash
titleCreate a histogram of all the feature types in a GFF
cd $SCRATCH/core_ngs/bedtools
cat sacCer_R64-1-1_20110208.gff | grep -v '^#' | cut -f 3 | \
  sort | uniq -c | sort -k1,1nr | more

...

Code Block
titleFilter GFF gene feature with awk
cat sacCer_R64-1-1_20110208.gff | grep -v '#' | \
  awk 'BEGIN{FS=OFS="\t"}{ if($3=="gene"){print} }' \
  > sc_genes.gff
wc -l sc_genes.gff 

The line count of sc_genes.gff should be 6607 – one for each gene entry.

...