Success Without Glamour

Background

Michael Eisen started a discussion on Twitter under the hashtag #SuccessWithoutGlamour about whether it is necessary to publish in the Big Three journals (CNS: Cell, Nature, and Science). The responses were heartening, but I suspected there might be some selection bias and decided to take a marginally more quantitative look.

Data

I thought this poster by Priem and colleagues was interesting and took it as a starting point (Priem, Costello, and Dzuba 2012). I’m rather pressed for time, so I couldn’t apply their manual approach to faculty, and I narrowed the question considerably. I didn’t want to spend a lot of time parsing HTML either, so I focused on the current NSF biology postdoctoral fellows. I exported a CSV from the NSF’s search containing 88 active or expired fellowships awarded after 21 June 2010 (around the time of PLoS One’s first impact factor).

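import csv
from Bio import Entrez
from Bio.Entrez.Parser import NotXMLError  # caught in the search loop below

Entrez.email = "james.estevez@gmail.com"

# Collect the principal investigator name from each row of the NSF export
with open("Awards.csv", 'rt') as awards:
    names = [row['PrincipalInvestigator'] for row in csv.DictReader(awards)]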

As we see above, I used Biopython’s Entrez interface to PubMed (Cock et al. 2009). I manually located the NLM Unique IDs for the CNS journals (Nature: 0410462, Science: 0404511, Cell: 0413066), which I use in the search:

glamour_NLM_ID = ['0410462','0404511','0413066']
annointed_ones = []
no_xml = []

Next, I looped over each name, searching PubMed, then extracting any hits, and skipping over any names which lacked them:

for name in names:
    # Search PubMed for all hits with our author's name
    try:
        search = Entrez.esearch(db="pubmed", term=name + '[AUTH]')
        record = Entrez.read(search)
        search.close()
        # Fetch the full records, if there are any
        if len(record["IdList"]) > 0:
            hits = Entrez.efetch(db="pubmed", id=record["IdList"], retmode="xml")
            hitlist = Entrez.read(hits)
            hits.close()

            # Make a list of NLM UIDs, one per hit
            nlmIDs = [hit['MedlineCitation']['MedlineJournalInfo']['NlmUniqueID'] for hit in hitlist]
            if any(val in glamour_NLM_ID for val in nlmIDs):
                # Wow, congrats!
                annointed_ones.append(name)
    except NotXMLError:
        # The response could not be parsed as XML; note the name and move on
        no_xml.append(name)

Code and data available here.
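
The membership check inside the loop can also be pulled out into a small function and tried offline. The following is only a sketch: `glamour_titles` and `fake_record` are hypothetical helpers that are not part of the original script, and the sample records are plain dictionaries that mimic just the nested keys used above, not real PubMed output. The ID "0000000" and its journal title are made-up placeholders.

```python
def glamour_titles(records, glamour_ids):
    """Return the journal titles of records whose NLM Unique ID is in glamour_ids."""
    titles = []
    for rec in records:
        citation = rec["MedlineCitation"]
        if citation["MedlineJournalInfo"]["NlmUniqueID"] in glamour_ids:
            titles.append(citation["Article"]["Journal"]["Title"])
    return titles


def fake_record(nlm_id, title):
    """Build a stand-in record with only the keys the check needs (hypothetical)."""
    return {
        "MedlineCitation": {
            "MedlineJournalInfo": {"NlmUniqueID": nlm_id},
            "Article": {"Journal": {"Title": title}},
        }
    }


sample = [
    fake_record("0404511", "Science"),
    fake_record("0000000", "Some Other Journal"),  # placeholder ID, not a real journal
]
print(glamour_titles(sample, {"0404511"}))
```

Returning the matching titles rather than a bare yes or no would also make it easy to see which journal produced each hit.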

Error checking:

>>> len(no_xml)
0

I collected any names with hits in the annointed_ones list, which we can compare with all the names:

>>> pct = float(len(annointed_ones))/len(names) * 100
>>> pct
5.681818181818182

So we have five postdoctoral fellows with CNS publications, or about 5.7% of the 88.
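
With only 88 fellows, that percentage carries a fair amount of uncertainty. As a rough illustration, and not something the original analysis did, here is a sketch of a 95% Wilson score interval for a proportion. The function name `wilson_interval` and the choice of the Wilson method are my own assumptions.

```python
import math


def wilson_interval(hits, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = float(hits) / total
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    spread = z * math.sqrt(p * (1 - p) / total + z2 / (4.0 * total * total)) / denom
    return center - spread, center + spread


low, high = wilson_interval(5, 88)
print("%.1f%% to %.1f%%" % (low * 100, high * 100))
```

The interval only reflects sampling noise. It says nothing about fellows who were missed or wrongly matched by the name search.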

Additional questions

This is a quick and dirty proof of concept. Obvious follow-ups:

  • What about OA publications?
  • The search needs to be more specific, in particular to account for common surnames.
  • Faculty, of course, are a different animal entirely. You’d probably want to look at tenure-track faculty, but I can’t think of a clean and automated way to do so.

There are others which escape me at the moment. I may return to this in the future, at the very least to rerun this with OA journals included, or to get more detailed information on the journals in which the fellows did publish.
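
As a starting point for the common-surname problem, one option is to search on surname plus initials and restrict the publication dates. This is a minimal sketch under some assumptions: `build_author_query` is a hypothetical helper, it assumes the NSF export gives names as "Given [Middle] Surname", and it relies on PubMed's `[Author]` and `[dp]` (date of publication) field tags. It would mishandle multi-word surnames.

```python
def build_author_query(full_name, start_year, end_year):
    """Turn 'Jane Q. Doe' into a PubMed query on 'Doe JQ' within a year range."""
    parts = full_name.replace(".", " ").split()
    if len(parts) < 2:
        raise ValueError("expected at least a given name and a surname")
    surname = parts[-1]  # assumption: the last token is the whole surname
    initials = "".join(p[0].upper() for p in parts[:-1])
    return "{0} {1}[Author] AND {2}:{3}[dp]".format(
        surname, initials, start_year, end_year)


print(build_author_query("Jane Q. Doe", 2010, 2012))
# Doe JQ[Author] AND 2010:2012[dp]
```

The resulting string could be passed as the `term` argument of the search in place of the bare name, although initials alone will still collide for very common surnames.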

Refs

Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, et al. 2009. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics [Internet] 25:1422–1423. Available from: http://bioinformatics.oxfordjournals.org/content/25/11/1422
Priem J, Costello K, Dzuba T. 2012. Prevalence and use of Twitter among scholars. FigShare.