Success Without Glamour, NIH
Background
See yesterday’s post about publications in Cell, Nature, and Science (CNS). During the discussion on Twitter, Ethan Perlstein noted:

> @james_estevez @mbeisen NSF and NIH are two different tribes. Much more glamorphilia among NIH-funded academics
>
> — Ethan Perlstein (@eperlste) January 30, 2013

and also:

> @eperlste @mbeisen Sure. Data set for F32 is much larger, ~1300, so it’ll run overnight.
>
> — James Estevez (@james_estevez) January 30, 2013
So, let’s repeat the analysis.
Script
For this run I’ll use the NIH F32 awards as my data set. We have some duplicates, so we’ll modify yesterday’s script accordingly:
import csv
import pickle
from Bio import Entrez

Entrez.email = "james.estevez@gmail.com"
nih = csv.DictReader(open("nih_F32.csv", 'rt'))
names = []
for row in nih:
    # Use rstrip to remove any trailing periods after the middle initial
    names.append(row['Contact PI / Project Leader'].rstrip('.'))
# Remove duplicate entries
names = list(set(names))
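For example, with a hypothetical name, stripping the trailing period is what lets two otherwise-identical rows collapse to a single entry in the set:

'DOE, JANE A.'.rstrip('.')  # 'DOE, JANE A', now identical to the unpunctuated row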
Again, I used Biopython’s Entrez interface to PubMed (Cock et al. 2009). In addition to the NLM Unique IDs for the CNS journals (Nature: 410462, Science: 0404511, Cell: 0413066), I added the PLoS journals (with the exception of PLoS Collections),¹ and will use those in the search:
glamour_NLM_ID = ['410462', '0404511', '0413066']
plos_NLM_ID = ['101183755', '101231360', '101238921', '101238922',
               '101239074', '101285081', '101291488']
# Set up data structures: lists of names with hits in each journal set
annointed_ones = []
enlightened_ones = []
# Any names which generate malformed XML go here
no_xml = []
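As an aside, if you need to find the NLM Unique ID for some other journal, the NLM Catalog can be searched through the same Entrez interface. A minimal sketch, assuming the returned catalog UIDs match the NlmUniqueID values used above (worth spot-checking):

# Look up a journal in the NLM Catalog; the title here is just an example
handle = Entrez.esearch(db="nlmcatalog", term="PLoS Biology")
result = Entrez.read(handle)
handle.close()
print result["IdList"]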
Another change is that I store the Entrez results in a dictionary for separate analysis.
# Let's save some data in case we want it for later analysis. Create a
# dictionary containing the PMIDs and NLM UIDs for each fellow
pubs = {}
Next, I looped over each name, searching PubMed, extracting any hits, and skipping over any names that lacked them:
for name in names:
    # Search PubMed for all hits with our author's name
    print name
    try:
        search = Entrez.esearch(db="pubmed", term=''.join([name, '[AUTH]']))
        record = Entrez.read(search)
        # Get the records
        if len(record["IdList"]) > 0:
            hits = Entrez.efetch(db="pubmed", id=record["IdList"], retmode="xml")
            hitlist = Entrez.read(hits)
            hits.close()
            # Make lists of NLM UIDs and journal titles for each hit
            nlmIDs = [hit['MedlineCitation']['MedlineJournalInfo']['NlmUniqueID']
                      for hit in hitlist]
            journal_titles = [hit['MedlineCitation']['Article']['Journal']['Title']
                              for hit in hitlist]
            pubs[name] = {'PMID': record["IdList"], 'NLMUID': nlmIDs,
                          'journal_titles': journal_titles}
    except:
        no_xml.append(name)
        continue
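For reference, each entry in pubs ends up looking something like this (the values below are made up for illustration):

pubs['DOE, JANE A'] = {'PMID': ['12345678', '23456789'],
                       'NLMUID': ['410462', '101231360'],
                       'journal_titles': ['Nature', 'PLoS Biology']}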
Writing the results to disk this time:
# Save the pubs dictionary into a pickle file
pickle.dump(pubs, open("nih_pubs_dict.p", "wb"))
# Write a plain-text copy of the dict to file
f = open('nih_pubs_dict.txt', 'wt')
f.write(str(pubs))
f.close()
I let this run overnight, then parsed the results as follows:
# Load the results: a dictionary containing the PMIDs and NLM UIDs
# for each fellow
pubs = pickle.load(open("nih_pubs_dict.p", "rb"))
for name in pubs.keys():
    # Check each fellow's NLM UIDs against the CNS and PLoS journal lists
    if len([val for val in pubs[name]['NLMUID'] if val in glamour_NLM_ID]) > 0:
        # Wow, congrats!
        annointed_ones.append(name)
    if len([val for val in pubs[name]['NLMUID'] if val in plos_NLM_ID]) > 0:
        enlightened_ones.append(name)
Code and data available here.
Output
I collected any names with CNS hits in the annointed_ones list, which we can compare with all the names:
cns_pct = float(len(annointed_ones))/len(pubs.keys()) * 100
plos_pct = float(len(enlightened_ones))/len(pubs.keys()) * 100
print "Unique names: "
print len(names)
print "\nSuccessful queries: "
print len(pubs.keys())
print "\nSuccessful query percentage: "
print "{0:.2f}%".format(float(len(pubs.keys()))/len(names) * 100)
print "\nCNS pub percentage: "
print "{0:.2f}%".format(cns_pct)
print "\nPLoS pub percentage: "
print "{0:.2f}%".format(plos_pct)
Which returns:
Unique names:
887
Successful queries:
403
Successful query percentage:
45.43%
CNS pub percentage:
9.43%
PLoS pub percentage:
2.48%
So we have 38 postdoctoral fellows with CNS publications, or about 9.4% of the successful queries.
Questions
This is a very rough sketch, obviously, but initially it appears that the difference isn’t as large as one might expect. Other notes:
- Need to rerun to get the whole data set, but I don’t expect the results to be very different.
- The script needs to be a bit more robust. Entrez is a little opaque, and errors in the small test sample were resolved by re-querying, so the script probably needs to iterate over the dictionary until all results are filled in or some sane limit is reached (see the sketch after this list).
- A couple of other ideas:
  - Do people with CNS pubs get more funding? (Small sample size, though.)
  - Do people at elite institutions have more CNS pubs? (But: small n, chicken & egg, etc.)
- As far as NIH publications are concerned, JAMA should probably be included, but for that matter a comparison with impact factors or h-index might be useful. That might be an entry into the faculty and PI side of the problem: while the idea of figuring out tenure-track numbers by parsing hundreds of faculty web pages makes me want to go play in traffic, addressing the question of grants seems relatively straightforward.²
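A minimal sketch of that retry loop, assuming the same names list, pubs dictionary, and Entrez calls as above (MAX_TRIES and the back-off interval are arbitrary choices):

import time

MAX_TRIES = 5
no_hits = set()  # names with zero PubMed hits, so we don't re-query them forever
for attempt in range(MAX_TRIES):
    missing = [n for n in names if n not in pubs and n not in no_hits]
    if not missing:
        break
    for name in missing:
        try:
            search = Entrez.esearch(db="pubmed", term=''.join([name, '[AUTH]']))
            record = Entrez.read(search)
            if len(record["IdList"]) > 0:
                hits = Entrez.efetch(db="pubmed", id=record["IdList"], retmode="xml")
                hitlist = Entrez.read(hits)
                hits.close()
                nlmIDs = [hit['MedlineCitation']['MedlineJournalInfo']['NlmUniqueID']
                          for hit in hitlist]
                titles = [hit['MedlineCitation']['Article']['Journal']['Title']
                          for hit in hitlist]
                pubs[name] = {'PMID': record["IdList"], 'NLMUID': nlmIDs,
                              'journal_titles': titles}
            else:
                no_hits.add(name)
        except Exception:
            time.sleep(5)  # back off and leave the name for the next pass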
There you have it.
Follow-up
References

Cock, P. J. A., T. Antao, J. T. Chang, B. A. Chapman, C. J. Cox, A. Dalke, I. Friedberg, T. Hamelryck, F. Kauff, B. Wilczynski, and M. J. L. de Hoon. 2009. “Biopython: freely available Python tools for computational molecular biology and bioinformatics.” Bioinformatics 25 (11): 1422–1423.
1. I’m using PLoS as a stand-in for OA journals. Including other journals, or individually open articles, would probably produce different results. [return]
2. TODO: read the link to Marshall et al. (2009) posted by @pleunipennings, one of whose authors, Markus Mika, happens to be down the street at UNLV. A look at the literature would be in order, ironically. [return]