Organizing a reproducible scientific project
Roger Schürch
Ever since I took over our college's graduate-level Research and Information Systems in the Life Sciences course in 2018, I have been pushing it more and more in the direction of reproducible research and Open Data. In the course, I try to instill in my students the importance of creating audit trails for their data (to avoid falling under suspicion of manipulation and fraud; think Marc Hauser, or Jonathan Pruitt) and of then having an unbroken link from their data to the finished presentation or paper.
The typical student takes this course in the first semester of their graduate career, so my class is usually composed of fresh field biologists all experiencing their first foray into the command line and scripted data analysis at the same time, which is often rather chaotic. In my experience, the students initially do not always quite get why I make them collect data with a Google Form and ODK Collect instead of Excel, or why I insist on scripting the analyses instead of clicking around in the graphical user interface of their statistical software of choice. The students do not yet realize that the same journals they are reading to prepare their literature reviews are now moving to require that data and code be provided alongside a manuscript (e.g., The American Naturalist now asks for code). It is all the more rewarding when they come back later to tell me how much my course helped them in the final stages of their graduate studies.
Since the course is helpful to some students, I thought it would be a good time to write up what is currently my "best" practice for project organization, data analysis, and publishing of data and analysis scripts. I do this in the hope that it might be useful for others as well. I also feel compelled to write this post because we have recently had a number of paper submissions in which I put some of these practices to the test. Rather than writing a general beginner's guide (e.g., A Beginner's Guide to Conducting Reproducible Research), I want to work through a specific example that people can take and expand on. I think that approach is probably most useful for readers who already have some R and RMarkdown experience.
A Toy Project
To illustrate these best practices, we will be working through a toy project. We are assuming that the project was conducted roughly following the scientific method:
- initial observation and/or literature research leads to a research question
- the research question is distilled into a testable hypothesis
- experiments or observations are conducted to collect data for or against the hypothesis
- data analysis provides a statistical test of the hypothesis
- results are communicated
- replication
Importantly, the replication in Step 6 describes a process where we or other people repeat Steps 3–5. This is different from a statistical replicate (multiple trials within the same experiment). The success of replication crucially hinges upon how well you wrote the protocol in Step 3: a well-written protocol can be used by a third party to replicate the results. In contrast, poorly written experimental protocols are at the heart of the replication crisis. Here, I want to focus on reproducibility: given the data collected in Step 3 and the analysis scripts written in Step 4, the results in Step 5 can be reproduced.
The toy project we are going to use for this blog post was published initially in Science [1]. The authors re-visited an earlier study, also published in Science [2]. The data are available from the faraway package. We shall also follow Faraway [3,4] in the analysis, and keep it simple.
The initial observations and research question come directly from Hamilton et al. [2]:
"The factors regulating numbers of species on member islands of an archipelago can be divided into those that impede or promote inter-island dispersal of individuals (for example, degree of isolation as measured by distance between islands) and those relating to successful establishment of natural populations on adjacent islands (for example, ecologic diversity as estimated by area, elevation, and other factors) (1). A major goal of evolutionary biology is to measure these factors and thus to determine whether they can be used independently or interdependently to predict population variables such as species numbers for naturally isolated areas (2)."
In other words, we are interested in whether factors such as island area, elevation, etc. have an influence on the number of species you will find on an island of an archipelago. Since we already have the data in the faraway package, we can skip Step 3. Additionally, since we are mostly following Faraway in his analysis, we can focus here specifically on how to organize the project on our computer and how to get a final analysis report (or manuscript) into a dedicated reports folder. If this is done well, sharing data and code with a data repository, as is now becoming standard, is trivial.
Project Folder Organization
We want a clean folder structure. Additionally, the organization of each of my project folders into subfolders reflects the scientific method outlined above. I have a 01_proposal folder that holds the initial ideas and research questions; this is often formalized into research proposals submitted to funding agencies. The 02_protocol folder holds more detailed protocols for each experiment. Once a project is on its way, the 03_admin folder can contain anything needed to manage the project (e.g., HR, finances). The 04_data folder holds any data generated by the project during experiments. The 05_code folder holds analysis code, in my case mostly in R. It also contains the RMarkdown files that will be rendered into the 06_reports folder to be communicated to peers. By the way, the reports need not be generated with RMarkdown; personally, I prefer Emacs / Org-mode. However, I do think that the learning curve is less steep for students with RStudio and RMarkdown. Finally, the 07_background_literature folder holds a BibTeX bibliography file that is cross-referenced from many different places (grants, protocols, manuscripts, etc.). That is why it comes last (although it could just as well go first). The file tree is depicted in the following figure:
Having a standard folder structure across all projects makes it easier for me to find my way around. Numbering the folders consistently additionally allows me to navigate to them quickly when working on the command-line.
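For a new project, the skeleton can be created in a few lines of R (a sketch; the folder names simply mirror the structure described above):

```r
## create the standard project skeleton (a sketch);
## folder names mirror the structure described above
project_folders <- c("01_proposal", "02_protocol", "03_admin",
                     "04_data", "05_code", "06_reports",
                     "07_background_literature")
for (f in project_folders) {
    dir.create(f, showWarnings = FALSE)
}
```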
The 05_code folder definitely holds the meat of all my projects. Even though I will not cover it here, I highly recommend using a version control system to keep your code folder organized and to minimize the risk of an exploding number of files (i.e., "script_final.R", "script_final1.R", "script_truefinal.R", etc.). Most of my projects only need 3–4 files. We need a main file (here called 000_main.R) that is akin to the main() function in a C program. It loads all necessary libraries, calls up all other script files, and then renders the results. For our toy project, we could even load the data right in that file, but most projects need a bit of code to load the data and ensure that all variables are in the proper format. That is why, in most cases, it is reasonable to have another file, 001_data_prep.R. Optionally, you might need a file to perform data validation (checking for outliers, obvious data entry errors, etc.). When the data are all set up, analyses are conducted in 002_data_analysis.R. Importantly, this file also prepares objects that hold the analyses' results for later use when rendering any reports, conference slides, or manuscripts.
Why do I not pack everything that is needed for a nice report into one monolithic RMarkdown file? I want to be able to re-use any objects that I create in the analysis in multiple targets. For example, I might first render a quick-and-dirty statistical analysis report for colleagues, then go to a conference, and finally publish a manuscript, first at one journal, then after initial rejection, at another journal. The four reports share the data, and probably also some of the analyses, but they look different from one another. In analogy to software development, I have found that dividing the code up into business logic (load data, do analysis) and a graphical user interface (the reports) makes projects more manageable for me. By the way, this is the same approach that ggplot2 uses, where you apply your graphical theme only once you have constructed the figure. Here we are doing something similar.
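To make the ggplot2 analogy concrete, here is a minimal sketch (using the built-in mtcars data, not our toy project): the figure object is built once, and different themes are applied afterwards, just as our analysis objects are built once and then dressed up differently for each report.

```r
library(ggplot2)

## build the "business logic" of the figure once
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point()

## apply different "looks" without touching the logic
p + theme_minimal()   ## for a quick report
p + theme_classic()   ## for a journal manuscript
```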
One Script to Rule Them All
Now let us look at all the files in detail. Here is the 000_main.R file for our toy project (you can download it here):
## 000_main.R
## author: Roger Schürch
## description: the main file to govern
##   data prep, validation, and analysis;
##   then render reports

## set working directory; should probably
## not be run if working in RStudio
setwd("../../")
knitr::opts_knit$set(root.dir = getwd())

## load libraries
## note: this fails if packages need user input,
## e.g., when compiling; for now, manually
## install packages from this list if they
## fail to install
default_w <- getOption("warn")
options(warn = -1)
required_packages <- c(
    "emmeans",  ## to extract marginal means
    "pander",   ## for nice table output
    "faraway")  ## for Galapagos plant data
currently_installed <- installed.packages()
for (i in required_packages) {
    if (!(i %in% currently_installed[, "Package"])) {
        ## change the repo to whatever you like;
        ## as a Swiss I am sticking with the ETH
        install.packages(i,
                         repos = "https://stat.ethz.ch/CRAN/")
    }
    library(i, character.only = TRUE)
}
options(warn = default_w)

## prepare data
source("./05_code/R/001_data_prep.R")

## do analysis
source("./05_code/R/002_data_analysis.R")

## tidy up old images from previous report
## rendering
old_img_path <- "./05_code/Rmarkdown/images/"
file.remove(file.path(old_img_path,
                      dir(old_img_path)))
new_img_path <- "./06_reports/stat_analysis_report/images/"
file.remove(file.path(new_img_path,
                      dir(new_img_path)))

## render a document, here the statistical analysis report
rmarkdown::render(
    "./05_code/Rmarkdown/stat_analysis_report.Rmd",
    output_dir = "./06_reports/stat_analysis_report/")

## move newly rendered images to the report folder
file.copy(from = file.path(old_img_path,
                           dir(old_img_path)),
          to = new_img_path)
I think this is mostly self-explanatory. The first two lines of code (ignoring the comments) set the working directory. I work in Emacs, and the default working directory is the location of the R file from which I start the R session. I want all paths to be relative to my project directory, not the code directory. If you look at the location of the R file, you can see that we have to move two levels up to reach the project root.
The next section of code loads libraries. If the libraries are not installed, it installs them. There is one snag though: if you are using non-standard libraries that need to be installed from source, and whose installation involves user input, the for-loop fails. If you have any suggestions on how to improve on that, please let me know.
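One small refinement I have considered (a sketch, not what the script above does): compute the set of missing packages first and install them all in a single call, which at least reduces the number of times install.packages() is invoked. It does not solve the user-input problem, though.

```r
required_packages <- c("emmeans", "pander", "faraway")

## install everything that is missing in one call,
## then attach all required packages
missing_packages <- setdiff(required_packages,
                            rownames(installed.packages()))
if (length(missing_packages) > 0) {
    install.packages(missing_packages,
                     repos = "https://stat.ethz.ch/CRAN/")
}
invisible(lapply(required_packages, library, character.only = TRUE))
```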
Next up, we source the data preparation and analysis files.
The last section deletes any images from the previously exported report, renders the document (here a statistical analysis report), and then copies any image files created from the RMarkdown to the respective subfolder in 06_reports.
The Data Analysis
Preparing the data set for this toy example is trivial. Once the faraway library is loaded in the main file, the data preparation file looks like this (find the file here):
## 001_data_prep.R
## author: Roger Schürch
## description: load the gala data set
data(gala)
The data set contains seven numeric variables, and they are all in the right format. In a real-world problem, we would now have to make sure in the same script that the data types are recognized correctly (e.g., categorical variables should be of type factor, and dates and times should be recognized as such). We might have to aggregate the data appropriately before analysis, or split the data into intention-to-treat and per-protocol data sets. We might even have to impute data. So, while the data preparation file looks deceptively simple here, the data preparation step is the most laborious task in real-world data analysis. Still, so far the simplicity of my data has allowed me to keep this all in one file. If projects became more complex, I would consider making subfolders for the different tasks. Anyhow, here data(gala) is all that is needed. From the help file we get this information about the data set:
There are 30 Galapagos islands and 7 variables in the dataset. The relationship between the number of plant species and several geographic variables is of interest. The original dataset contained several missing values which have been filled for convenience.
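In a real project, the data preparation script might contain type-coercion steps like the following (a sketch with hypothetical column names, not part of the gala data):

```r
## hypothetical raw data frame, for illustration only;
## everything arrives as character strings
raw_data <- data.frame(island = c("A", "B", "A"),
                       visit_date = c("2018-05-01", "2018-05-02",
                                      "2018-05-03"),
                       species_count = c("12", "7", "31"),
                       stringsAsFactors = FALSE)

## coerce each column to its proper type
raw_data$island <- factor(raw_data$island)
raw_data$visit_date <- as.Date(raw_data$visit_date)
raw_data$species_count <- as.integer(raw_data$species_count)
```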
In his first book, Faraway uses a Normal linear model to determine if any of the geographic variables predicts the number of plant species on the islands [3]. The second book models the count in a Poisson generalized linear model [4]. We will be using the latter. The analysis file could look like this (available here):
## 002_data_analysis.R
## author: Roger Schürch
## description: glms for the gala data

fit.galapagos.glm.1 <-
    glm(Species ~ log(Area) +
            Elevation +
            Nearest +
            Scruz +
            Adjacent,
        data = gala,
        family = "poisson")

## do we have overdispersion?
observed_variance <-
    (gala$Species - fitted(fit.galapagos.glm.1))^2
plot(log(observed_variance) ~
         log(fitted(fit.galapagos.glm.1)),
     pch = 21, bg = "black",
     xlab = expression(hat(mu)),
     ylab = expression((y - hat(mu))^2))
abline(0, 1)

## calculate the dispersion parameter
dp <- sum(residuals(fit.galapagos.glm.1,
                    type = "pearson")^2) /
    fit.galapagos.glm.1$df.residual
summary(fit.galapagos.glm.1,
        dispersion = dp)

## get coefficients and confidence intervals for the parameters
fit_coefs <-
    cbind(exp(coef(fit.galapagos.glm.1)),
          exp(confint(fit.galapagos.glm.1)))

## predict species numbers over the range of areas,
## holding all other variables at their means
xes <- seq(min(gala$Area),
           max(gala$Area),
           length.out = 10000)
new_data <-
    expand.grid(Area = xes,
                Elevation = mean(gala$Elevation),
                Nearest = mean(gala$Nearest),
                Scruz = mean(gala$Scruz),
                Adjacent = mean(gala$Adjacent))
predictions <- predict(fit.galapagos.glm.1,
                       newdata = new_data,
                       type = "link",
                       se.fit = TRUE,
                       dispersion = dp)
yes_fit <- predictions$fit
yes_se <- predictions$se.fit
yes <- fit.galapagos.glm.1$family$linkinv(yes_fit)
yes_lwr <- fit.galapagos.glm.1$family$linkinv(yes_fit -
                                              qnorm(0.975) * yes_se)
yes_upr <- fit.galapagos.glm.1$family$linkinv(yes_fit +
                                              qnorm(0.975) * yes_se)
The details of the model are not very important. We do not perform model selection. I have taken the log of area because it gives a better fit than the analysis presented in Faraway [4]. We have overdispersion, so we need to deal with that. I then use the predict() function to get my prediction and confidence intervals for area while holding every other variable at its mean.
That concludes the business logic, and we can now figure out how to present the data and the analysis.
The Representation of the Analysis in Reports
As mentioned above, I can render the results from the analysis into many different formats. For this, I write RMarkdown. Let us pretend we are writing the manuscript for the Johnson and Raven article [1] for the first time, but it could just as well be a manuscript to be submitted to a peer-reviewed journal, or a PowerPoint presentation for a conference. Compared to the previous script files, the RMarkdown file might be a little longer, so let us go through it section by section (you can download the whole thing here). The header of the RMarkdown file looks like this:
---
title: "Species Number and Endemism: The Galápagos Archipelago Revisited"
author: Johnson MP and Raven PH
date: 1973
bibliography: "../../07_background_literature/galapagos.bib"
csl: "../../07_background_literature/science.csl"
output: word_document
abstract: |
"Regression analyses to determine plant species
number are repeated ..."
---
I set up the title, the author list, and the date. I also point to a BibTeX file that holds the bibliographic information, and I choose the appropriate citation style. And of course there will be an abstract. For the purposes of this exercise, I am not going to write out the whole abstract or paper, but you get the idea: you write the prose just as you always would. Because it is a Science article, there are no headings, and instead we launch right into the prose.
For each document that you write, there is a particular look you are trying to achieve. For peer-reviewed articles, the format is dictated by the journal to which you submit. It is probably best to define that look somewhere near the top of your RMarkdown. What do we need to define? We are certainly going to present parameter estimates from our model in the text, or in a table; that means point estimates and confidence intervals. We will also want a consistent look for test statistics (e.g., t and F values) and p-values. I wrap the formatting of p-values into a handy function, and translate the journal's instructions on how many digits to give for test statistics into re-usable constants. Now, Science's instructions are not very specific, so here is just a generic version of how this might look:
## ci bounds
lower_ci_boundary <- 0.025
upper_ci_boundary <- 0.975
ci_format_string <- "%.1f (%.1f to %.1f)"

## function to format p-values
format_pvals <- function(pval, digits = 3) {
    if (any(pval > 1 | pval < 0)) {
        stop("P values cannot be greater than 1 or smaller than 0")
    }
    result <-
        ifelse(pval < 1 / 10^digits,
               paste("_p_", "<", 1 / 10^digits),
               paste("_p_", "=", round(pval, digits)))
    return(result)
}

## formatting of test statistic values
f_digits <- 2
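To illustrate how these helpers behave (the p-value function is repeated here so the chunk is self-contained; the numeric values are made up):

```r
ci_format_string <- "%.1f (%.1f to %.1f)"
## p-value formatter, as defined above
format_pvals <- function(pval, digits = 3) {
    ifelse(pval < 1 / 10^digits,
           paste("_p_", "<", 1 / 10^digits),
           paste("_p_", "=", round(pval, digits)))
}

sprintf(ci_format_string, 1.42, 1.08, 1.87)  ## "1.4 (1.1 to 1.9)"
format_pvals(0.0004)                         ## "_p_ < 0.001"
format_pvals(0.042)                          ## "_p_ = 0.042"
```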
The original article does not have any figures, which I think is quite a shame. We are definitely going to add figures. If I wanted colorful figures, I might also define a number of colors to be reused throughout the report. Here, our figures are so simple that this is not needed.
With these preliminaries out of the way, we can start writing the prose. Johnson and Raven start like this:
The major factors determining the number of plant species on the
Galápagos Archipelago were analyzed and discussed in Hamilton _et al._
@Hamilton1963. [...]
[...]
Again, we are not going to write the whole paper here, but we want to write the results. The authors show the whole data table, which is possible because of the limited number of islands. I think we can do that here as well:
pander(gala,
       split.cells = 20,
       split.table = 200,
       caption = "Table 1. List of islands and variables.")
However, instead of the other two tables in the original, we are going to present text results and make a figure. The correlation matrix presented in Table 2 is important for determining the "right" model: the high collinearity may cause issues in modeling, but it is a nuisance and may not be of interest per se. And instead of the list of variables entering the model and the R², I am more interested in how much the different variables affect the number of species. I think a figure will provide a visual that makes these relationships clear at a glance.
Let us start with the text results. From the model, we still need to make a presentable output. For example, from running the summary, we know that only log(Area) is a significant predictor of our response variable (number of species). We should probably present the magnitude of the effect; that is, we want to show how many more species there are for a one-unit increase in area. We can use the coef() and confint() functions on the model to get that. Because Poisson models use a log-link, we also need to back-transform these to the response scale. We prepared all of that in the analysis script, and here we can just format it to the specific needs. While we are at it, we also prepare the test statistic and p-values.
area_result_str <-
    do.call("sprintf",
            args = c(ci_format_string,
                     as.list(fit_coefs["log(Area)", ])))
area_F <-
    round(drop1(fit.galapagos.glm.1,
                test = "F")["log(Area)", "F value"],
          f_digits)
area_p <-
    format_pvals(drop1(fit.galapagos.glm.1,
                       test = "F")["log(Area)", "Pr(>F)"])
Then we can use it in the prose:
[...]
The results differ markedly from those of Hamilton _et al._ Once we
account for overdispersion, our Poisson GLM results indicate that only
area makes a significant contribution to the determination of species
number. For each unit increase in log(Area), we see an increase of `r
area_result_str` fold (_F_ = `r area_F`, `r area_p`; see Fig. 1).
On running the main script, this will insert a reproducible result; copy-paste errors are impossible. Now of course we also need to produce the figure we reference! We want to show the original data, and then the fit. We already prepared the fitted lines in the analysis script, so here we just put it all together:
plot(Species ~ Area,
     data = gala,
     log = "x",
     pch = 19)
lines(xes, yes)
lines(xes, yes_lwr, lty = 2)
lines(xes, yes_upr, lty = 2)
Of course there are other things we could look at, but I think this is enough to illustrate the reproducibility of this approach. I am just going to end with the title for the references and then end the exercise here.
[...]
## References and Notes
Sharing Data and Scripts
All that is needed now is running the 000_main.R file from start to finish to produce the reports. Hopefully, at some point we get a manuscript accepted at a journal, and then we need to make our data and code available to the public. Here at Virginia Tech, our library offers a data repository. In principle, all we have to do is upload a ZIP archive of our project folder (here is the ZIP from this project).
Unfortunately, in practice, we still have to make sure that our scripts work on various platforms. I like to ship the ZIP archive first to co-workers who work on different platforms than I do: I am on GNU/Linux, and lab members have macOS and Windows. Sometimes, exotic packages do not install equally well on all platforms, and some editing might be necessary. At other times, my long-term support distribution of Ubuntu is running so far behind the cutting edge of R development that the way certain packages are used has changed. In that case, decisions have to be made about which versions are acceptable for your scripts. It might be worth annotating your scripts with which version of R was used on which platform, so that consumers of the data have a chance to reproduce the analysis.
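One lightweight way to do this annotation (a sketch; the output path is just an example, and the directory check keeps the snippet safe to run anywhere) is to capture sessionInfo() at the end of the main script:

```r
## record R version, platform, and package versions
## alongside the rendered reports (example path)
out_dir <- "./06_reports"
if (dir.exists(out_dir)) {
    writeLines(capture.output(sessionInfo()),
               file.path(out_dir, "session_info.txt"))
}
```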
One last note of caution: if your data contain identifying information about undergraduate students, for example in the form of user IDs that record who collected the data, make sure to delete it before uploading. Other than that, once the project runs on multiple platforms, you can upload the ZIP archive of your project folder to the data provider of your choosing.
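For example, if the collector's user ID lived in a column of the shared data frame (a hypothetical column name, not part of the gala data), it could be dropped just before archiving:

```r
## hypothetical data frame with an identifying column
shared_data <- data.frame(species_count = c(12, 7),
                          collector_id = c("jdoe", "asmith"))

## drop identifying information before upload
shared_data$collector_id <- NULL
```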
Conclusion
In summary, if we set up a consistent folder structure for each project, and if we separate the business logic from the graphical representation of the results, sharing projects becomes a little easier. I am not claiming that this way of handling a reproducible workflow is the best possible one, but it has worked for a number of papers that we submitted recently. I am very keen to keep improving it, so if you have suggestions, please let me know.
References
- Johnson MP and Raven PH (1973). Species Number and Endemism: The Galápagos Archipelago Revisited. Science 179(4076): 893–895. doi:10.1126/science.179.4076.893
- Hamilton TH, Rubinoff I, Barth RH Jr., and Bush GL (1963). Species Abundance: Natural Regulation of Insular Variation. Science 142(3599): 1575–1577. doi:10.1126/science.142.3599.1575
- Faraway JJ (2005). Linear Models with R. Chapman & Hall/CRC.
- Faraway JJ (2006). Extending the Linear Model with R. Chapman & Hall/CRC.
P.S.:
Once you are sure that your code works, it might be a good idea to turn off warnings, as they are propagated through to the output.
P.P.S.:
This blog post is itself reproducible. It was written in Org-mode for this Hugo website, and the code blocks were tangled from the Org-mode file, then tested in R.