Success Without Glamour, NIH


See yesterday’s post about publications in Cell, Nature, and Science (CNS). During the discussion on Twitter Ethan Perlstein noted:

and also:

So, let’s repeat the analysis.


For this run I’ll use the NIH F32 awards as my data set. There are some duplicates, so we’ll modify yesterday’s script accordingly:

import csv, os, sys
import pickle
from Bio import Entrez

# NCBI asks for a contact address; the value is left blank here
Entrez.email = ""

nih = csv.DictReader(open("nih_F32.csv", 'rt'))
names = []

for row in nih:
    # Use rstrip to remove any trailing periods after the middle initial
    names.append(row['Contact PI / Project Leader'].rstrip('.'))

# Remove duplicate entries
names = set(names)
names = list(names)
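The de-duplication step can also be wrapped in a small function. This is only a sketch written for Python 3: `unique_pi_names` and the sample rows are hypothetical stand-ins for `nih_F32.csv`, and the extra whitespace trimming is an assumption about the data, not something the original script does.

```python
import csv
import io


def unique_pi_names(csv_file, column="Contact PI / Project Leader"):
    """Return the sorted, de-duplicated PI names from an open CSV file."""
    seen = set()
    for row in csv.DictReader(csv_file):
        # Trim stray whitespace, then any trailing period after a middle initial
        name = row[column].strip().rstrip(".")
        if name:
            seen.add(name)
    return sorted(seen)


# Hypothetical rows standing in for nih_F32.csv
sample = io.StringIO(
    "Contact PI / Project Leader,Project\n"
    '"DOE, JANE A.",Award 1\n'
    '"DOE, JANE A",Award 2\n'
    '"ROE, RICHARD",Award 3\n'
)
names_demo = unique_pi_names(sample)
print(names_demo)
```

Sorting the set gives a stable order, which makes it easier to resume an interrupted run.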

Again, I used Biopython’s Entrez interface to PubMed (Cock et al. 2009). In addition to the NLM Unique IDs for the CNS journals (Nature: 0410462, Science: 0404511, Cell: 0413066), I added the PLoS journals (with the exception of PLoS Collections),1 and will use those in the search:

glamour_NLM_ID = ['0410462','0404511','0413066']
plos_NLM_ID = ['101183755','101231360','101238921','101238922','101239074','101285081','101291488']

# Set up data structures:
# list of names
annointed_ones = []
enlightened_ones = []
# Any names which generate malformed xml go here
no_xml = []

Another change is that I store the Entrez results in a dictionary for separate analysis.

# Let's save some data in case we want it for later analysis. Create a dictionary
# containing the PMIDs and NLMUIDs for each fellow
pubs = {}

Next, I looped over each name, searching PubMed, then extracting any hits, and skipping over any names which lacked them:

for name in names:
    # Search PubMed for all hits with our author's name
    print name
    try:
        search = Entrez.esearch(db="pubmed", term=''.join([name, '[AUTH]']))
        record = Entrez.read(search)
        # Get the records
        if len(record["IdList"]) > 0:
            hits = Entrez.efetch(db="pubmed", id=record["IdList"], retmode="xml")
            hitlist = Entrez.read(hits)

            # Make a list of NLM UIDs and journal titles
            nlmIDs = [hit['MedlineCitation']['MedlineJournalInfo']['NlmUniqueID'] for hit in hitlist]
            journal_titles = [hit['MedlineCitation']['Article']['Journal']['Title'] for hit in hitlist]
            pubs[name] = {'PMID': record["IdList"], 'NLMUID': nlmIDs, 'journal_titles': journal_titles}
    except Exception:
        # Any names which generate malformed XML (or otherwise fail) go here
        no_xml.append(name)
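One fragile spot in the loop above is the pair of list comprehensions: if a single record lacks one of the nested keys, the lookup raises an error and every hit for that name is lost. A minimal sketch of a more forgiving extraction follows, assuming `hitlist` is a list of dictionary-like records nested the way the loop expects. The function name `journal_info` and the sample records are hypothetical, and the sketch targets Python 3.

```python
def journal_info(hitlist):
    """Collect NLM unique IDs and journal titles, skipping incomplete records."""
    nlm_ids, titles, skipped = [], [], 0
    for hit in hitlist:
        try:
            citation = hit['MedlineCitation']
            nlm_id = citation['MedlineJournalInfo']['NlmUniqueID']
            title = citation['Article']['Journal']['Title']
        except (KeyError, TypeError):
            # Record does not have the expected nesting; count it and move on
            skipped += 1
            continue
        nlm_ids.append(str(nlm_id))
        titles.append(str(title))
    return nlm_ids, titles, skipped


# Hypothetical records mimicking the nesting used in the loop
demo_hits = [
    {'MedlineCitation': {'MedlineJournalInfo': {'NlmUniqueID': 'ID-A'},
                         'Article': {'Journal': {'Title': 'Journal A'}}}},
    {'BookDocument': {}},  # lacks 'MedlineCitation'
]
ids, titles, skipped = journal_info(demo_hits)
print(ids, titles, skipped)
```

Returning the skipped count keeps the partial failures visible instead of silently dropping them.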

This time I write the results to disk:

# Save the pubs dictionary into a pickle file.
pickle.dump(pubs, open("nih_pubs_dict.p", "wb"))
# Write plain text dict to file:
f = open('nih_pubs_dict.txt', 'wt')
f.write(str(pubs))
f.close()

I let this run overnight, then parsed the results as follows:

# Load results. Create a dictionary containing the PMIDs and NLMUIDs
# for each fellow
pubs = pickle.load(open( "nih_pubs_dict.p", "rb" ))

for name in pubs.keys():
    # Flag fellows with at least one paper in a CNS journal
    if len([val for val in pubs[name]['NLMUID'] if val in glamour_NLM_ID]) > 0:
        # Wow, congrats!
        annointed_ones.append(name)
    # Flag fellows with at least one paper in a PLoS journal
    if len([val for val in pubs[name]['NLMUID'] if val in plos_NLM_ID]) > 0:
        # Wow, congrats!
        enlightened_ones.append(name)
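The same membership test can be written with set intersections. The sketch below is Python 3 and only illustrative: `classify`, the sample `pubs` dictionary, and the journal IDs in it are hypothetical placeholders, not real NLM Unique IDs.

```python
def classify(pubs, glamour_ids, plos_ids):
    """Return (names with a CNS paper, names with a PLoS paper)."""
    glamour_ids, plos_ids = set(glamour_ids), set(plos_ids)
    cns, plos = [], []
    for name, record in pubs.items():
        journals = set(record['NLMUID'])
        if journals & glamour_ids:   # at least one CNS journal
            cns.append(name)
        if journals & plos_ids:      # at least one PLoS journal
            plos.append(name)
    return sorted(cns), sorted(plos)


# Placeholder data: 'G1' stands for a CNS journal, 'P1' for a PLoS journal
demo_pubs = {
    'DOE, JANE': {'NLMUID': ['G1', 'X9']},
    'ROE, RICHARD': {'NLMUID': ['P1', 'G1']},
    'POE, EDGAR': {'NLMUID': ['X9']},
}
cns_names, plos_names = classify(demo_pubs, ['G1'], ['P1'])
print(cns_names, plos_names)
```

A name can land in both lists, which matches the two independent `if` tests in the script.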

Code and data available here.


I collected any names with CNS hits in the annointed_ones list and any with PLoS hits in the enlightened_ones list, which we can compare with all the names that returned results:

cns_pct = float(len(annointed_ones))/len(pubs.keys()) * 100
plos_pct = float(len(enlightened_ones))/len(pubs.keys()) * 100

print "Unique names: "
print len(names)
print "\nSuccessful queries: "
print len(pubs.keys())
print "\nSuccessful query percentage: "
print "{0:.2f}%".format(float(len(pubs.keys()))/len(names) * 100)
print "\nCNS pub percentage: "
print "{0:.2f}%".format(cns_pct)
print "\nPLoS pub percentage: "
print "{0:.2f}%".format(plos_pct)

Which returns:

Unique names: 

Successful queries: 

Successful query percentage: 

CNS pub percentage: 

PLoS pub percentage: 

So we have 38 postdoctoral fellows with CNS publications, or about 9.4% of the successful queries.
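As a rough consistency check on those two figures, one can ask which totals of successful queries are compatible with them. The sketch assumes that "about 9.4%" means the percentage rounded to one decimal place; that is my reading, not something stated above, and the search limit of 1000 is arbitrary.

```python
cns_fellows = 38      # count quoted in the text
quoted_pct = 9.4      # percentage quoted in the text, assumed rounded to one decimal

# Every total n for which 38 / n rounds to the quoted percentage
consistent_totals = [
    n for n in range(cns_fellows, 1000)
    if round(100.0 * cns_fellows / n, 1) == quoted_pct
]
print(consistent_totals)
```

Under that assumption the number of successful queries falls in a narrow range, so the two quoted figures at least agree with each other.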


This is a very rough sketch, obviously, but initially it appears that the difference isn’t as large as one might expect. Other notes:

  • Need to rerun to get the whole data set, but I don’t expect the results to be very different.
  • The script needs to be a bit more robust. Entrez is a little opaque, and errors in the small test sample were resolved by re-querying, so the script probably needs to iterate over the dictionary until all results are filled or some sane limit is reached.
  • A couple of other ideas:
    • Do people with CNS pubs get more funding? (Small sample size, though)
    • Do people at elite institutions have more CNS pubs? (But: small n, chicken & egg, etc.)
    • As far as NIH publications are concerned, JAMA should probably be included, but for that matter a comparison with impact factors or H-index might be useful. That might be an entry into the faculty and PI side of the problem, because while the idea of figuring out tenure track numbers by way of parsing hundreds of faculty web pages makes me want to go play in traffic, attempting to address the question of grants seems relatively straightforward.2
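The re-querying idea from the notes above (iterate until all results are filled or some sane limit is reached) might look something like the following. It is a Python 3 sketch, not the script that was run: `fill_missing`, its parameters, and `flaky_fetch` are hypothetical, and the fetch function is passed in so the logic can be tried without touching PubMed.

```python
import time


def fill_missing(names, fetch, results=None, max_passes=3, pause=0.0):
    """Call fetch(name) for every name not yet in results, retrying failures.

    Returns the results dictionary and the names that still failed
    after max_passes attempts.
    """
    results = {} if results is None else results
    pending = [n for n in names if n not in results]
    failed = []
    for _ in range(max_passes):
        failed = []
        for name in pending:
            try:
                results[name] = fetch(name)
            except Exception:
                failed.append(name)
        if not failed:
            break
        pending = failed
        time.sleep(pause)  # be polite to the server between passes
    return results, failed


# Simulated fetcher: one name fails once, another fails every time
calls = {}

def flaky_fetch(name):
    calls[name] = calls.get(name, 0) + 1
    if name == "ROE, RICHARD" and calls[name] < 2:
        raise IOError("simulated transient failure")
    if name == "POE, EDGAR":
        raise ValueError("simulated malformed XML")
    return {"PMID": [], "NLMUID": []}


results, still_failed = fill_missing(
    ["DOE, JANE", "ROE, RICHARD", "POE, EDGAR"], flaky_fetch)
print(sorted(results), still_failed)
```

Names left in the second return value play the role of the no_xml list, but only after they have had several chances.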

There you have it.



Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, et al. 2009. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25: 1422–1423. (Accessed January 30, 2013).
Marshall JC, Buttars P, Callahan T, Dennehy JJ, Harris DJ, Lunt B, Mika M, Shupe R. 2009. Letter to the Editors. Israel Journal of Ecology and Evolution 55: 381–392. (Accessed January 31, 2013).

  1. I’m using PLoS as a stand-in for OA journals. Including other journals, or individually open articles, would probably produce different results.
  2. TODO: read the link to Marshall et al. (2009) posted by @pleunipennings, one of whose authors, Markus Mika, happens to be down the street at UNLV. A look at the literature would be in order, ironically.