Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

would return:

['187932364'] 

Other Resources For Searching Entrez

List Of Filters For Entrez Searches

Fetching Records Using Entrez

Given a comma-separated list of GI numbers, you can retrieve the records using Entrez eFetch.

Parsing Genbank Files

Suppose we have a GI number for a protein and we want to retrieve the corresponding Genbank file, see if it contains a CDS, and get the gene name for the protein in the CDS (if it has a gene name).

 

...

Warning
titleRetrieving Multiple Genbank Records At Once

I have had trouble parsing multiple records fetched at once, using a comma-separated list, if retrieving in Genbank format. In principle, BioPython's SeqIO.parse(handle, 'genbank') function should return the records in a list, but I've run into trouble trying this. Whether it's an error in BioPython or my own mistake, if someone can clear this up, please do so and remove this warning.

Suppose we want to get the Genbank record for the protein with GI number '187932364':

Code Block
languagepy
titleRetrieve Genbank Record By GI
from Bio import Entrez, SeqIO
 
handle = Entrez.efetch(db='protein', id='187932364', rettype='gb')
record = SeqIO.read(handle, 'genbank')
handle.close()
 

 

Parsing Genbank Files

The record that was parsed using SeqIO above is a Python object. Its features attribute is a list of BioPython SeqFeature objects. Each feature contains a list of "qualifiers", which contain information such as the locus tag, gene (if available), db_xrefs, etc. To see the qualifiers for a CDS feature in the Genbank record, we can run this code:

Code Block
languagepy
titleBioPython Entrez Qualifiers For A CDS
for feature in record.features:
	if feature.type == 'CDS':
		print feature.qualifiers

which will print the following dictionary:

{'locus_tag': ['CLL_A2397'], 'coded_by': ['complement(NC_010674.1:2488690..2490378)'], 'transl_table': ['11'], 'note': ['identified by match to protein family HMM PF02463; match to , 'db_xref': ['GeneID:19966649'], 'gene': ['recN']}

Other Resources For Parsing Genbank Files

Great page from the Wilke Lab on parsing Genbank files

Dealing with GenBank files in Biopython by Peter Cock

 

An Example Of Finding RefSeq Gene By Locus Tag

This example script takes a tab-delimited file, where one column contains the locus tags, and prints the gene name in RefSeq for that locus tag (if the gene name is specified). If the gene name isn't specified, it simply prints -.

Code Block
languagepy
titlePrint Gene Name From RefSeq Locus Tag
collapsetrue
#! /usr/bin/env python
	
import argparse
	
from Bio import Entrez, SeqIO
	
parser = argparse.ArgumentParser(description='Specify a field of NCBI locus tags, which get looked up in RefSeq for a corresponding gene name, which is appended to the line.')
	
parser.add_argument('-f', action='store', dest='locus_tag_col_str', required=True, help='The locus tag field in the tab-delimited input file.')
parser.add_argument('-e', action='store', dest='email_address', required=True, help='Entrez requires your email address.')
parser.add_argument('infile_name', help='Input file')
	
args = parser.parse_args()
	
Entrez.email = args.email_address
locus_tag_col = int(args.locus_tag_col_str) - 1
	
with open(args.infile_name, 'r') as infile:
        for line in infile:
                gene_name = '-'
                locus_tag = line.split()[locus_tag_col]
                search_term = 'refseq[FILTER] AND {}'.format(locus_tag)
                handle = Entrez.esearch(db='protein', term=search_term)
                results = Entrez.read(handle)
                handle.close()
                gene_name_assigned = False
                for gi in results['IdList']:
                        if gene_name_assigned:
                                break
                        handle = Entrez.efetch(db='protein', id=gi, rettype='gb')
                        fetched_record = SeqIO.read(handle, 'genbank')
                        handle.close()
                        for feature in fetched_record.features:
                                if feature.type == 'CDS':
                                        if 'gene' in feature.qualifiers:
                                                gene_name = feature.qualifiers['gene'][0]
                                                gene_name_assigned = True
                print gene_name