Other Resources For Searching Entrez

Fetching Records Using Entrez

Given a comma-separated list of GI numbers, you can retrieve the records using Entrez eFetch.

Parsing Genbank Files

Suppose we have a GI number for a protein and we want to retrieve the corresponding Genbank file, see if it contains a CDS, and get the gene name for the protein in the CDS (if it has a gene name).

...

Warning

title	Retrieving Multiple Genbank Records At Once

I have had trouble parsing multiple records fetched at once, using a comma-separated list, if retrieving in Genbank format. In principle, BioPython's SeqIO.parse(handle, 'genbank') function should return the records in a list, but I've run into trouble trying this. Whether it's an error in BioPython or my own mistake, if someone can clear this up, please do so and remove this warning.

Suppose we want to get the Genbank record for the protein with GI number '187932364':

Code Block

language	py
title	Retrieve Genbank Record By GI

from Bio import Entrez, SeqIO
 
handle = Entrez.efetch(db='protein', id='187932364', rettype='gb')
record = SeqIO.read(handle, 'genbank')
handle.close()

Parsing Genbank Files

The record that was parsed using SeqIO above is a Python object. Its features attribute is a list of BioPython SeqFeature objects. Each feature contains a list of "qualifiers", which contain information such as the locus tag, gene (if available), db_xrefs, etc. To see the qualifiers for a CDS feature in the Genbank record, we can run this code:

Code Block

language	py
title	BioPython Entrez Qualifiers For A CDS

for feature in record.features:
	if feature.type == 'CDS':
		print feature.qualifiers

which will print the following dictionary:

{'locus_tag': ['CLL_A2397'], 'coded_by': ['complement(NC_010674.1:2488690..2490378)'], 'transl_table': ['11'], 'note': ['identified by match to protein family HMM PF02463; match to , 'db_xref': ['GeneID:19966649'], 'gene': ['recN']}

Other Resources For Parsing Genbank Files

Great page from the Wilke Lab on parsing Genbank files

Dealing with GenBank files in Biopython by Peter Cock

An Example Of Finding RefSeq Gene By Locus Tag

This example script takes a tab-delimited file, where one column contains the locus tags, and prints the gene name in RefSeq for that locus tag (if the gene name is specified). If the gene name isn't specified, it simply prints -.

Code Block

language	py
title	Print Gene Name From RefSeq Locus Tag
collapse	true

#! /usr/bin/env python
	
import argparse
	
from Bio import Entrez, SeqIO
	
parser = argparse.ArgumentParser(description='Specify a field of NCBI locus tags, which get looked up in RefSeq for a corresponding gene name, which is appended to the line.')
	
parser.add_argument('-f', action='store', dest='locus_tag_col_str', required=True, help='The locus tag field in the tab-delimited input file.')
parser.add_argument('-e', action='store', dest='email_address', required=True, help='Entrez requires your email address.')
parser.add_argument('infile_name', help='Input file')
	
args = parser.parse_args()
	
Entrez.email = args.email_address
locus_tag_col = int(args.locus_tag_col_str) - 1
	
with open(args.infile_name, 'r') as infile:
        for line in infile:
                gene_name = '-'
                locus_tag = line.split()[locus_tag_col]
                search_term = 'refseq[FILTER] AND {}'.format(locus_tag)
                handle = Entrez.esearch(db='protein', term=search_term)
                results = Entrez.read(handle)
                handle.close()
                gene_name_assigned = False
                for gi in results['IdList']:
                        if gene_name_assigned:
                                break
                        handle = Entrez.efetch(db='protein', id=gi, rettype='gb')
                        fetched_record = SeqIO.read(handle, 'genbank')
                        handle.close()
                        for feature in fetched_record.features:
                                if feature.type == 'CDS':
                                        if 'gene' in feature.qualifiers:
                                                gene_name = feature.qualifiers['gene'][0]
                                                gene_name_assigned = True
                print gene_name

Versions Compared

Old Version 1

New Version 2

Key

Other Resources For Searching Entrez

Fetching Records Using Entrez

Parsing Genbank Files

Other Resources For Parsing Genbank Files

An Example Of Finding RefSeq Gene By Locus Tag

Page Comparison

Versions Compared

Old Version 1

New Version 2

Key

Other Resources For Searching Entrez

Fetching Records Using Entrez

Parsing Genbank Files

Other Resources For Parsing Genbank Files

An Example Of Finding RefSeq Gene By Locus Tag