NCBI Entrez Interface

NCBI makes a huge amount of data available via its Entrez interface and API. Documentation for using the API can be scarce, however, and using Entrez is frequently frustrating. Please add your tips and tricks for using Entrez here!

The current examples use BioPython as the Entrez interface. NCBI has also released command-line tools with similar functionality. If you can provide examples of how to implement similar scripts with the command-line tools, please share here!

Setup

The Entrez APIs require an email address. If you are using BioPython, set your email address before accessing Entrez like so:

BioPython: Entrez email
from Bio import Entrez
 
Entrez.email = 'username@host'  # replace with your own address; it must be a string

Searching Entrez

A Basic Search

Suppose we want to search Entrez for the locus tag "CLL_A2397".

BioPython Entrez Search
handle = Entrez.esearch(db='protein', term='CLL_A2397')
results = Entrez.read(handle)
handle.close()
 

Specify A Database

Let's specify that we want to search in RefSeq.

BioPython Entrez Search With Filter
handle = Entrez.esearch(db='protein', term='refseq[FILTER] AND CLL_A2397')
results = Entrez.read(handle)
handle.close()

Specify An Organism

Suppose we want all GI numbers for humans from RefSeq. The NCBI tax id for Homo sapiens is 9606, so we will include txid9606[Organism] in the search term.

BioPython Entrez Search By Organism
handle = Entrez.esearch(db='protein', term='refseq[FILTER] AND txid9606[Organism]')
results = Entrez.read(handle)
handle.close()
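The search terms in the examples above are plain strings joined with AND, so they can be assembled by a small helper. The following is only a sketch: build_term and its parameter names are hypothetical and not part of BioPython or Entrez, and it covers only the [FILTER] and txid...[Organism] patterns shown on this page.

```python
def build_term(text=None, taxid=None, filters=()):
    """Assemble an Entrez search term from the pieces used on this page.

    build_term is a hypothetical helper, not a BioPython function.
    """
    # Each filter name becomes e.g. 'refseq[FILTER]'.
    parts = ['{}[FILTER]'.format(name) for name in filters]
    if taxid is not None:
        # NCBI taxonomy IDs are written as txid<number>[Organism].
        parts.append('txid{}[Organism]'.format(taxid))
    if text:
        parts.append(text)
    return ' AND '.join(parts)


print(build_term(text='CLL_A2397', filters=['refseq']))
print(build_term(taxid=9606, filters=['refseq']))
```

The resulting string can be passed as the term argument of Entrez.esearch() exactly as in the examples above.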

Getting Accession Numbers

Once you have parsed the results of a search with Entrez.read() as above, BioPython gives you the results as a dictionary. To get the GIs retrieved, look under the 'IdList' key. Suppose we searched RefSeq for the locus tag 'CLL_A2397' as in an example above:

BioPython Entrez Getting GI Numbers
from Bio import Entrez
 
Entrez.email = 'username@host'  # replace with your own address; it must be a string
 
handle = Entrez.esearch(db='protein', term='refseq[FILTER] AND CLL_A2397')
results = Entrez.read(handle)
handle.close()
 
print(results['IdList'])

would print:

['187932364']
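A search with many hits needs more care: by default eSearch returns only the first 20 IDs, and the total number of hits is reported under the 'Count' key (as a string). The sketch below pages through the results using the retstart and retmax parameters. The function name search_all_ids and the page size are assumptions made for illustration, and the Entrez module is passed in as an argument so that the paging logic can be tried with a stand-in object without touching the network.

```python
def search_all_ids(entrez, db, term, page_size=500):
    """Collect every ID matching term, one page at a time.

    entrez is expected to be the Bio.Entrez module with Entrez.email set.
    search_all_ids is a hypothetical helper, not a BioPython function.
    """
    ids = []
    start = 0
    while True:
        handle = entrez.esearch(db=db, term=term,
                                retstart=start, retmax=page_size)
        results = entrez.read(handle)
        handle.close()
        page = results['IdList']
        ids.extend(page)
        start += page_size
        # 'Count' is the total number of hits and is returned as a string.
        if not page or start >= int(results['Count']):
            break
    return ids
```

With BioPython installed, a call might look like search_all_ids(Entrez, 'protein', 'refseq[FILTER] AND CLL_A2397') after importing Entrez and setting Entrez.email.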

Other Resources For Searching Entrez

List Of Filters For Entrez Searches

Fetching Records Using Entrez

Given a comma-separated list of GI numbers, you can retrieve the records using Entrez eFetch.

Retrieving Multiple Genbank Records At Once

I have had trouble parsing multiple records fetched at once, using a comma-separated list, when retrieving in Genbank format. Note that BioPython's SeqIO.parse(handle, 'genbank') function returns an iterator over the records rather than a list (wrap it in list() if you need one), and that SeqIO.read() raises an error when the handle contains more than one record. I have still run into trouble trying this. Whether it's an error in BioPython or my own mistake, if someone can clear this up, please do so and remove this warning.

Suppose we want to get the Genbank record for the protein with GI number '187932364':

Retrieve Genbank Record By GI
from Bio import Entrez, SeqIO
 
handle = Entrez.efetch(db='protein', id='187932364', rettype='gb', retmode='text')
record = SeqIO.read(handle, 'genbank')
handle.close()
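For more than one GI, the id argument takes a comma-separated string, and the response then has to be read with SeqIO.parse() instead of SeqIO.read(). The following is a rough sketch of that pattern, not a tested recipe: it has not been run against the live service here, the helper names id_batches and fetch_genbank_records are hypothetical, the batch size of 200 is an arbitrary choice, and the IDs in the demonstration line are placeholders rather than real GI numbers.

```python
def id_batches(ids, batch_size=200):
    """Yield comma-separated ID strings holding at most batch_size IDs each."""
    for start in range(0, len(ids), batch_size):
        yield ','.join(ids[start:start + batch_size])


def fetch_genbank_records(ids, batch_size=200):
    """Yield one SeqRecord per fetched record, requesting the IDs in batches.

    Assumes Entrez.email has already been set by the caller.
    """
    # Imported here so id_batches() can be tried without BioPython installed.
    from Bio import Entrez, SeqIO
    for id_string in id_batches(ids, batch_size):
        handle = Entrez.efetch(db='protein', id=id_string,
                               rettype='gb', retmode='text')
        try:
            # SeqIO.parse() returns an iterator, so loop over it.
            for record in SeqIO.parse(handle, 'genbank'):
                yield record
        finally:
            handle.close()


# Placeholder IDs, used only to show how the batching behaves.
print(list(id_batches(['id1', 'id2', 'id3'], batch_size=2)))
```

Calling list(fetch_genbank_records(results['IdList'])) would then collect the parsed records into a list, assuming the fetch itself succeeds.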

Parsing Genbank Files

The record that was parsed using SeqIO above is a BioPython SeqRecord object. Its features attribute is a list of BioPython SeqFeature objects. Each feature has a dictionary of "qualifiers", which maps names such as locus_tag, gene (if available), and db_xref to lists of values. To see the qualifiers for a CDS feature in the Genbank record, we can run this code:

BioPython Entrez Qualifiers For A CDS
for feature in record.features:
    if feature.type == 'CDS':
        print(feature.qualifiers)

which will print the following dictionary:

{'locus_tag': ['CLL_A2397'], 'coded_by': ['complement(NC_010674.1:2488690..2490378)'], 'transl_table': ['11'], 'note': ['identified by match to protein family HMM PF02463; match to ...'], 'db_xref': ['GeneID:19966649'], 'gene': ['recN']}
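Because every qualifier value is a list, and optional qualifiers such as gene may be missing entirely, it can help to wrap the lookup in a small function. This is a minimal sketch: first_qualifier is a hypothetical helper rather than part of BioPython, and the dictionary below is a shortened copy of the qualifiers shown above.

```python
def first_qualifier(qualifiers, key, default='-'):
    """Return the first value stored under key, or default if it is absent.

    first_qualifier is a hypothetical helper, not a BioPython function.
    """
    values = qualifiers.get(key)
    return values[0] if values else default


# Shortened copy of the qualifiers dictionary printed above.
qualifiers = {'locus_tag': ['CLL_A2397'], 'transl_table': ['11'], 'gene': ['recN']}

print(first_qualifier(qualifiers, 'gene'))
print(first_qualifier(qualifiers, 'product'))
```

The first call prints the gene name and the second prints the default, since this dictionary has no product qualifier.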

Other Resources For Parsing Genbank Files

Great page from the Wilke Lab on parsing Genbank files

Dealing with GenBank files in Biopython by Peter Cock

eFetch And Parsing Results With BioPython [PDF]

An Example Of Finding RefSeq Gene By Locus Tag

This example script takes a tab-delimited file, where one column contains the locus tags, and prints the gene name in RefSeq for that locus tag (if the gene name is specified). If the gene name isn't specified, it simply prints -.

Print Gene Name From RefSeq Locus Tag
#! /usr/bin/env python

import argparse

from Bio import Entrez, SeqIO

parser = argparse.ArgumentParser(description='Specify a field of NCBI locus tags, which get looked up in RefSeq for a corresponding gene name, which is printed (or - if no gene name is found).')

parser.add_argument('-f', action='store', dest='locus_tag_col_str', required=True, help='The locus tag field (counting from 1) in the tab-delimited input file.')
parser.add_argument('-e', action='store', dest='email_address', required=True, help='Entrez requires your email address.')
parser.add_argument('infile_name', help='Input file')

args = parser.parse_args()

Entrez.email = args.email_address
# Convert the 1-based field number to a 0-based list index.
locus_tag_col = int(args.locus_tag_col_str) - 1

with open(args.infile_name, 'r') as infile:
    for line in infile:
        # Skip blank lines, which have no locus tag to look up.
        if not line.strip():
            continue
        # '-' is printed when no gene name is found.
        gene_name = '-'
        # Split on tabs only, so fields containing spaces stay intact.
        locus_tag = line.rstrip('\n').split('\t')[locus_tag_col]
        search_term = 'refseq[FILTER] AND {}'.format(locus_tag)
        handle = Entrez.esearch(db='protein', term=search_term)
        results = Entrez.read(handle)
        handle.close()
        gene_name_assigned = False
        for gi in results['IdList']:
            # Stop fetching once one record has supplied a gene name.
            if gene_name_assigned:
                break
            handle = Entrez.efetch(db='protein', id=gi, rettype='gb', retmode='text')
            fetched_record = SeqIO.read(handle, 'genbank')
            handle.close()
            for feature in fetched_record.features:
                if feature.type == 'CDS':
                    if 'gene' in feature.qualifiers:
                        gene_name = feature.qualifiers['gene'][0]
                        gene_name_assigned = True
        print(gene_name)