First post
In lieu of a “Hello World” post, let’s make a figure to illustrate the Wikipedia article on bacterial genome size. Boilerplate and libraries out first:
library(grDevices)
library(ggplot2)
library(plyr)
library(taxize)
library(knitcitations)
Get the data
We’ll need data, of course:
# Download our tables from NCBI's FTP site. Accessed Mon Jan 21 20:36:36
# PST 2013
prok <- read.table("ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt",
sep = "\t", comment.char = "!", header = T, stringsAsFactors = F)
# Clear missing values ('-')
prok.cut <- prok[(prok$Size..Mb. != "-") & (prok$Proteins != "-"), ]
# Set classes
prok.cut$Size..Mb. <- as.numeric(prok.cut$Size..Mb.)
prok.cut$Proteins <- as.numeric(prok.cut$Proteins)
prok.cut$Group <- as.factor(prok.cut$Group)
One should probably differentiate between bacteria and archea (Pace, 2006). I make use of the taxize
by Chamberlain et. al. (2012) package we loaded earlier. I’ll also use plyr
(Wickham, 2011) to put it all together:
# From which domain of life does each genome come?
groups <- levels(prok.cut$Group)
get_domain <- function(x) {
first.hit <- classification(get_uid(x))[[1]] # return the first hit
kingdom <- as.character(first.hit[which(first.hit[, "Rank"] == "superkingdom"),
1]) # extract domain
return(data.frame(Group = x, Domain = kingdom))
}
domains <- ldply(groups, get_domain)
foo <- prok.cut
prok.cut <- merge(prok.cut, domains, by = "Group")
Drawing the plot
I’ll use ggplot2 for the figure (Wickham, 2009). Lets export to SVG:
# Draw our plot
p <- ggplot(prok.cut, aes(Size..Mb., Proteins, color = Domain))
# Save our plot to SVG
svg(filename = "ncbi-prok-genomesize.svg", width = 14, height = 8)
p + geom_point(alpha = 0.5, size = 2) + scale_y_log10() + scale_x_log10() +
scale_shape(solid = FALSE) + ggtitle("The total genome size and the number of genes in bacteria and archaea.") +
xlab("Genome size (Megabases)") + ylab("Number of protein coding genes") +
scale_colour_brewer(type = "qual", palette = 3)
dev.off()
p <- ggplot(prok.cut, aes(Size..Mb., Proteins, color = Domain))
p + geom_point(alpha = 0.5, size = 2) + scale_y_log10() + scale_x_log10() +
scale_shape(solid = FALSE) + ggtitle("The total genome size and the number of genes in bacteria and archaea.") +
xlab("Genome size (Megabases)") + ylab("Number of protein coding genes") +
scale_colour_brewer(type = "qual", palette = 3)
There you have it. Post the final figure to Wikimedia Commons and done. Written with knitr (Xie, 2012), with knitcitations (Boettiger, 2012).
References
Boettiger C. 2012.
knitcitations: Citations for knitr markdown files. Available from: http://CRAN.R-project.org/package=knitcitations
Chamberlain S, Szoecs E,
Boettiger C. 2012. taxize: Taxonomic search and phylogeny retrieval. Available
from: http://CRAN.R-project.org/package=taxize
Pace NR. 2006. Time for a change. Nature 441:289–289.Available
from: http://ww.cafescicolorado.org/Time%20for%20a%20change.pdf
Wickham H. 2009. ggplot2:
elegant graphics for data analysis. Springer New York Available from: http://had.co.nz/ggplot2/book
Wickham H. 2011. The
Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical
Software [Internet] 40:1–29. Available from: http://www.jstatsoft.org/v40/i01/
Xie Y. 2012. knitr: A general-purpose package for
dynamic report generation in R. Available from: http://CRAN.R-project.org/package=knitr