Success Without Glamour
Background
A recent discussion on Twitter, started by Michael Eisen under the hashtag #SuccessWithoutGlamour, took up the necessity of publishing in the Big Three journals (CNS: Cell, Nature, and Science). The responses were heartening, but I suspected there might be some selection bias and decided to take a marginally more quantitative look.
Data
I thought this poster by Priem and colleagues was interesting and took it as a starting point (Priem, Costello, and Dzuba 2012). I’m rather pressed for time and couldn’t try their manual approach to looking at faculty, so I narrowed the question considerably. I didn’t want to spend a bunch of time parsing HTML either, so I focused on the current NSF biology postdoctoral fellows. I exported a CSV from the NSF’s search containing 88 active or expired fellowships awarded after 21 June 2010 (around the time of PLoS One’s first impact factor).
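import csv, os, sys
from Bio import Entrez
from Bio.Entrez.Parser import NotXMLError  # caught in the search loop further down
Entrez.email = "james.estevez@gmail.com"
# Read the NSF award export and collect each fellow's name
nsf = csv.DictReader(open("Awards.csv", 'rt'))
names = []
for row in nsf:
    names.append(row['PrincipalInvestigator'])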
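A slightly more defensive way to read the names, sketched here as an illustration only: the function name `read_pi_names` and the `AwardNumber` column in the usage example are hypothetical, and only the `PrincipalInvestigator` column comes from the export described above. It trims stray whitespace and skips rows where the name is missing.

```python
import csv
import io


def read_pi_names(handle, column="PrincipalInvestigator"):
    """Return the non-empty, whitespace-trimmed values of `column`.

    `handle` is any open text file object (or io.StringIO) holding CSV data.
    """
    reader = csv.DictReader(handle)
    names = []
    for row in reader:
        # Missing cells come back as None or "", so normalise before stripping
        value = (row.get(column) or "").strip()
        if value:
            names.append(value)
    return names


# Usage with made-up rows; the real input would be open("Awards.csv", "rt")
sample = io.StringIO(
    "AwardNumber,PrincipalInvestigator\n"
    "1, Jane Doe \n"
    "2,\n"
    "3,John Roe\n"
)
example_names = read_pi_names(sample)
```

Taking a file object rather than a path keeps the function easy to try out on a few pasted rows before pointing it at the full export.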
As we see above, I used Biopython’s Entrez interface to PubMed (Cock et al. 2009). I manually located the NLM Unique IDs for the CNS journals (Nature: 410462, Science: 0404511, Cell: 0413066), and use those in the search:
glamour_NLM_ID = ['410462','0404511','0413066']
annointed_ones = []
no_xml = []
Next, I looped over each name, searching PubMed, then extracting any hits, and skipping over any names which lacked them:
for name in names:
    # Search PubMed for all hits with our author's name
    try:
        search = Entrez.esearch(db="pubmed", term=''.join([name, '[AUTH]']))
        record = Entrez.read(search)
        # Get the records
        if len(record["IdList"]) > 0:
            hits = Entrez.efetch(db="pubmed", id=record["IdList"], retmode="xml")
            hitlist = Entrez.read(hits)
            hits.close()
            # make a list of NLM UIDs
            nlmIDs = [hit['MedlineCitation']['MedlineJournalInfo']['NlmUniqueID'] for hit in hitlist]
            if len([val for val in nlmIDs if val in glamour_NLM_ID]) > 0:
                # Wow, congrats!
                journal_titles = [hit['MedlineCitation']['Article']['Journal']['Title'] for hit in hitlist]
                annointed_ones.append(name)
            else:
                continue
        else:
            pass
    except NotXMLError:
        no_xml.append(name)
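The membership test at the heart of the loop can be separated from the network calls, which makes it easy to check on its own. The following is a minimal sketch, not the code linked from this post: the names `GLAMOUR_NLM_IDS`, `glamour_hits`, and `has_glamour_pub` are hypothetical, and the ID `0000000` in the usage example is a made-up placeholder for a non-CNS journal. The three CNS IDs are copied verbatim from the list above.

```python
# NLM Unique IDs exactly as listed in the post (Nature, Science, Cell)
GLAMOUR_NLM_IDS = frozenset(['410462', '0404511', '0413066'])


def glamour_hits(nlm_ids, glamour_ids=GLAMOUR_NLM_IDS):
    """Return the entries of nlm_ids that belong to a CNS journal, in order."""
    return [nlm_id for nlm_id in nlm_ids if nlm_id in glamour_ids]


def has_glamour_pub(nlm_ids, glamour_ids=GLAMOUR_NLM_IDS):
    """True if at least one of an author's papers is in a CNS journal."""
    return any(nlm_id in glamour_ids for nlm_id in nlm_ids)


# Usage: '0000000' is a placeholder ID standing in for some other journal
example_ids = ['0000000', '0404511', '0000000']
example_flag = has_glamour_pub(example_ids)
example_matches = glamour_hits(example_ids)
```

Because the comparison is on exact strings, the IDs must match the form PubMed returns in `NlmUniqueID`, leading zeros included.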
Code and data available here.
Error checking:
>>> print len(no_xml)
0
I collected any names with hits in the annointed_ones list, which we can compare with all the names:
>>> pct = float(len(annointed_ones))/len(names) * 100
>>> pct
5.681818181818182
So we have five postdoctoral fellows with CNS publications, or about 5.7% of the 88 total.
Additional questions
This is a quick and dirty proof of concept. Obvious follow-ups:
- What about OA publications?
- The search needs to be more specific. Matching on author name alone cannot distinguish a fellow from other authors who share the same name, which matters most for common surnames.
- Faculty, of course, are a different animal entirely. You’d probably want to look at tenure-track faculty, but I can’t think of a clean and automated way to do so.
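One possible way to tighten the author search mentioned in the list above is to build a narrower query term before calling `Entrez.esearch`. This is only a sketch under stated assumptions: the function name `build_author_term` is hypothetical, the `[FAU]` (full author name) and `[DP]` (publication date) field tags and the `:3000` open-ended date range are taken from my reading of PubMed’s search syntax and should be checked against its help pages, and the default start date simply reuses the 21 June 2010 cutoff from the Data section. It narrows the search but does not solve name ambiguity.

```python
def build_author_term(full_name, start_date="2010/06/21"):
    """Build a PubMed term for one author, limited to recent publications.

    Assumes [FAU] and [DP] are valid PubMed field tags and that
    'YYYY/MM/DD:3000' denotes an open-ended date range.
    """
    # Collapse any runs of whitespace left over from the CSV export
    cleaned = " ".join(full_name.split())
    return "{0}[FAU] AND {1}:3000[DP]".format(cleaned, start_date)


# Usage with a made-up name; the result would be passed as term= to esearch
example_term = build_author_term("  Jane   Doe ")
```

Restricting by date trades recall for precision: papers a fellow published before the cutoff are excluded along with older papers by namesakes.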
And others which escape me at the moment. I may return to this in the future, at the very least to rerun this with OA journals included, or to get more detailed information on the publications in which the fellows did publish.