...
would return:
['187932364']
Other Resources For Searching Entrez
List Of Filters For Entrez Searches
Fetching Records Using Entrez
Given a comma-separated list of GI numbers, you can retrieve the records using Entrez eFetch.
Parsing Genbank Files
Suppose we have a GI number for a protein and we want to retrieve the corresponding Genbank file, see if it contains a CDS, and get the gene name for the protein in the CDS (if it has a gene name).
...
Warning | ||
---|---|---|
| ||
I have had trouble parsing multiple records fetched at once, using a comma-separated list, if retrieving in Genbank format. In principle, BioPython's |
Suppose we want to get the Genbank record for the protein with GI number '187932364':
Code Block | ||||
---|---|---|---|---|
| ||||
from Bio import Entrez, SeqIO
handle = Entrez.efetch(db='protein', id='187932364', rettype='gb')
record = SeqIO.read(handle, 'genbank')
handle.close()
|
Parsing Genbank Files
The record
that was parsed using SeqIO
above is a Python object. Its features
attribute is a list of BioPython SeqFeature
objects. Each feature contains a list of "qualifiers", which contain information such as the locus tag, gene (if available), db_xrefs, etc. To see the qualifiers for a CDS feature in the Genbank record, we can run this code:
Code Block | ||||
---|---|---|---|---|
| ||||
for feature in record.features:
if feature.type == 'CDS':
print feature.qualifiers |
which will print the following dictionary:
{'locus_tag': ['CLL_A2397'], 'coded_by': ['complement(NC_010674.1:2488690..2490378)'], 'transl_table': ['11'], 'note': ['identified by match to protein family HMM PF02463; match to , 'db_xref': ['GeneID:19966649'], 'gene': ['recN']}
Other Resources For Parsing Genbank Files
Great page from the Wilke Lab on parsing Genbank files
Dealing with GenBank files in Biopython by Peter Cock
An Example Of Finding RefSeq Gene By Locus Tag
This example script takes a tab-delimited file, where one column contains the locus tags, and prints the gene name in RefSeq for that locus tag (if the gene name is specified). If the gene name isn't specified, it simply prints -
.
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
#! /usr/bin/env python
import argparse
from Bio import Entrez, SeqIO
parser = argparse.ArgumentParser(description='Specify a field of NCBI locus tags, which get looked up in RefSeq for a corresponding gene name, which is appended to the line.')
parser.add_argument('-f', action='store', dest='locus_tag_col_str', required=True, help='The locus tag field in the tab-delimited input file.')
parser.add_argument('-e', action='store', dest='email_address', required=True, help='Entrez requires your email address.')
parser.add_argument('infile_name', help='Input file')
args = parser.parse_args()
Entrez.email = args.email_address
locus_tag_col = int(args.locus_tag_col_str) - 1
with open(args.infile_name, 'r') as infile:
for line in infile:
gene_name = '-'
locus_tag = line.split()[locus_tag_col]
search_term = 'refseq[FILTER] AND {}'.format(locus_tag)
handle = Entrez.esearch(db='protein', term=search_term)
results = Entrez.read(handle)
handle.close()
gene_name_assigned = False
for gi in results['IdList']:
if gene_name_assigned:
break
handle = Entrez.efetch(db='protein', id=gi, rettype='gb')
fetched_record = SeqIO.read(handle, 'genbank')
handle.close()
for feature in fetched_record.features:
if feature.type == 'CDS':
if 'gene' in feature.qualifiers:
gene_name = feature.qualifiers['gene'][0]
gene_name_assigned = True
print gene_name |