First post

In lieu of a “Hello World” post, let’s make a figure to illustrate the Wikipedia article on bacterial genome size. Boilerplate and libraries out first:

library(grDevices)
library(ggplot2)
library(plyr)
library(taxize)
library(knitcitations)

Get the data

We’ll need data, of course:

# Download our tables from NCBI's FTP site. Accessed Mon Jan 21 20:36:36
# PST 2013
prok <- read.table("ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt", 
    sep = "\t", comment.char = "!", header = T, stringsAsFactors = F)

# Clear missing values ('-')
prok.cut <- prok[(prok$Size..Mb. != "-") & (prok$Proteins != "-"), ]

# Set classes
prok.cut$Size..Mb. <- as.numeric(prok.cut$Size..Mb.)
prok.cut$Proteins <- as.numeric(prok.cut$Proteins)
prok.cut$Group <- as.factor(prok.cut$Group)

One should probably differentiate between bacteria and archea (Pace, 2006). I make use of the taxize by Chamberlain et. al. (2012) package we loaded earlier. I’ll also use plyr (Wickham, 2011) to put it all together:

# From which domain of life does each genome come?
groups <- levels(prok.cut$Group)
get_domain <- function(x) {
    first.hit <- classification(get_uid(x))[[1]]  # return the first hit
    kingdom <- as.character(first.hit[which(first.hit[, "Rank"] == "superkingdom"), 
        1])  # extract domain
    return(data.frame(Group = x, Domain = kingdom))
}
domains <- ldply(groups, get_domain)
foo <- prok.cut
prok.cut <- merge(prok.cut, domains, by = "Group")

Drawing the plot

I’ll use ggplot2 for the figure (Wickham, 2009). Lets export to SVG:

# Draw our plot
p <- ggplot(prok.cut, aes(Size..Mb., Proteins, color = Domain))

# Save our plot to SVG
svg(filename = "ncbi-prok-genomesize.svg", width = 14, height = 8)
p + geom_point(alpha = 0.5, size = 2) + scale_y_log10() + scale_x_log10() + 
    scale_shape(solid = FALSE) + ggtitle("The total genome size and the number of genes in bacteria and archaea.") + 
    xlab("Genome size (Megabases)") + ylab("Number of protein coding genes") + 
    scale_colour_brewer(type = "qual", palette = 3)
dev.off()
p <- ggplot(prok.cut, aes(Size..Mb., Proteins, color = Domain))

p + geom_point(alpha = 0.5, size = 2) + scale_y_log10() + scale_x_log10() + 
    scale_shape(solid = FALSE) + ggtitle("The total genome size and the number of genes in bacteria and archaea.") + 
    xlab("Genome size (Megabases)") + ylab("Number of protein coding genes") + 
    scale_colour_brewer(type = "qual", palette = 3)

Genome size vs. gene count.

There you have it. Post the final figure to Wikimedia Commons and done. Written with knitr (Xie, 2012), with knitcitations (Boettiger, 2012).

References

Boettiger C. 2012. knitcitations: Citations for knitr markdown files. Available from: http://CRAN.R-project.org/package=knitcitations
Chamberlain S, Szoecs E, Boettiger C. 2012. taxize: Taxonomic search and phylogeny retrieval. Available from: http://CRAN.R-project.org/package=taxize
Pace NR. 2006. Time for a change. Nature 441:289–289.Available from: http://ww.cafescicolorado.org/Time%20for%20a%20change.pdf
Wickham H. 2009. ggplot2: elegant graphics for data analysis. Springer New York Available from: http://had.co.nz/ggplot2/book
Wickham H. 2011. The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software [Internet] 40:1–29. Available from: http://www.jstatsoft.org/v40/i01/
Xie Y. 2012. knitr: A general-purpose package for dynamic report generation in R. Available from: http://CRAN.R-project.org/package=knitr