...
Unfortunately, both formats are obscure and hard to work with directly. While bedtools does accept annotation files in GFF/GTF format, you will not like the results. This is because the most useful information in a GFF/GTF file is in a loosely-structured attributes field.
Also unfortunately, there are a number of variations of both annotation formats. However, both GTF and GFF share the first 8 Tab-separated fields:
- seqname - The name of the chromosome or scaffold.
- source - Name of the program that generated this feature, or other data source (e.g. database)
- feature_type - Type of the feature. Examples of common feature types include:
  - CDS (coding sequence), exon
  - gene, transcript
  - start_codon, stop_codon
- start - Start position of the feature, with sequence numbering starting at 1.
- end - End position of the feature, with sequence numbering starting at 1.
- score - A numeric value. Often but not always an integer.
- strand - Defined as + (forward), - (reverse), or . (no relevant strand)
- frame - For a CDS, one of 0, 1 or 2, specifying the reading frame of the first base; otherwise '.'
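As a rough illustration of the field layout above, the sketch below labels each of the 8 shared fields of a single record. The record itself is an assumption made up for this example (it is not read from the annotation file), and `describe_gff_record` is a hypothetical helper name.

```bash
# Hypothetical helper: label the 8 shared GFF/GTF fields of each input record
describe_gff_record() {
  awk 'BEGIN{FS="\t"}
  {
    n = split("seqname source feature_type start end score strand frame", names, " ")
    for (i = 1; i <= n; i++) printf "%-12s %s\n", names[i], $i
    # Coordinates are 1-based and inclusive, so length = end - start + 1
    printf "%-12s %d\n", "length", $5 - $4 + 1
  }'
}

# An illustrative Tab-separated record (assumed values, for demonstration only)
record=$(printf 'chrI\tSGD\tgene\t7235\t9016\t.\t-\t.\tID=YAL067C;Name=YAL067C;gene=SEO1')

printf '%s\n' "$record" | describe_gff_record
```

Because both start and end are 1-based and the end position is part of the feature, the feature length here is 9016 - 7235 + 1 = 1782 bases.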
...
```bash
mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/sacCer_R64-1-1_20110208.gff .

# Use the less pager to look at multiple lines
less sacCer_R64-1-1_20110208.gff

# Look at just the most-important Tab-separated columns
cat sacCer_R64-1-1_20110208.gff | grep -v '#' | cut -f 1,3-5 | head -20

# Include the ugly 9th column where attributes are stored
cat sacCer_R64-1-1_20110208.gff | grep -v '#' | cut -f 1,3,9 | head
```
In addition to comment lines (starting with #), you can see the chrI contig names in column 1 and various feature types in column 3. You will also see tags like Name=YAL067C;gene=SEO1; among the attributes on some records, but in general the attributes column information is really ugly.
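Tags such as Name=YAL067C;gene=SEO1; can be pulled out of the attributes column with a little awk. This is only a minimal sketch: it assumes the semicolon-separated `tag=value` style seen in this GFF file (GTF files use a different `tag "value";` style), the attribute string is a made-up example, and `get_attr` is a hypothetical helper name.

```bash
# Hypothetical helper: print the value of one tag from attribute strings on stdin
get_attr() {
  # $1 = tag name to look for (e.g. gene, Name)
  awk -v tag="$1" '{
    n = split($0, pairs, ";")
    for (i = 1; i <= n; i++) {
      eq = index(pairs[i], "=")
      if (eq > 0 && substr(pairs[i], 1, eq - 1) == tag) print substr(pairs[i], eq + 1)
    }
  }'
}

# An illustrative attributes field (assumed, not copied from the annotation file)
attrs='ID=YAL067C;Name=YAL067C;gene=SEO1;Note=Putative%20permease'

printf '%s\n' "$attrs" | get_attr gene
printf '%s\n' "$attrs" | get_attr Name
```

On the real file, the same helper could be fed column 9 only, for example via `cut -f 9`.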
...
```bash
cd $SCRATCH/core_ngs/bedtools
cat sacCer_R64-1-1_20110208.gff | grep -v '^#' | cut -f 3 | \
  sort | uniq -c | sort -k1,1nr | more
```
...
```bash
cat sacCer_R64-1-1_20110208.gff | grep -v '#' | \
  awk 'BEGIN{FS=OFS="\t"}{ if($3=="gene"){print} }' \
  > sc_genes.gff
wc -l sc_genes.gff
```
The line count of sc_genes.gff should be 6607 – one for each gene entry.
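To try the filtering logic without the full annotation file, the same comment-stripping and `$3=="gene"` test can be run on a tiny made-up GFF. The three records below are invented for illustration, and `count_genes` is a hypothetical helper. The sketch also contrasts two grep patterns: `'^#'` removes only lines that begin with #, while `'#'` removes any line containing # anywhere, including a record with # in its attributes.

```bash
# Build a tiny GFF in a temporary file (all records are made up)
tiny_gff=$(mktemp)
printf '%s\n' '##gff-version 3' > "$tiny_gff"
printf 'chrI\tSGD\tgene\t335\t649\t.\t+\t.\tID=geneA\n' >> "$tiny_gff"
printf 'chrI\tSGD\tCDS\t335\t649\t.\t+\t0\tParent=geneA\n' >> "$tiny_gff"
printf 'chrI\tSGD\tgene\t1807\t2169\t.\t-\t.\tID=geneB;Note=see#2\n' >> "$tiny_gff"

# Hypothetical helper: count gene records after removing lines matching pattern $1
count_genes() {
  grep -v "$1" "$tiny_gff" | \
    awk 'BEGIN{FS=OFS="\t"}{ if($3=="gene"){print} }' | wc -l | tr -d ' '
}

count_genes '^#'   # anchored: only comment lines are dropped
count_genes '#'    # unanchored: the geneB record is dropped too
```

With this toy input the anchored pattern counts 2 genes and the unanchored one counts 1, so the anchored form is the safer choice whenever attributes might contain a # character.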
...