GitHub and Gradschool: Can You Reproduce My Results?

One of the fundamental virtues we are taught to expect from science is the reproducibility of its results. For such a vanilla concept, reproducibility is a heated topic in the scientific community, especially in the wake of the open-source-software movement. It is in this spirit that I would like to share my user experience with GitHub, a tool I have found to be immensely useful for my own data organization and analysis process.

You do not have to be a scientist to use GitHub – in fact, as a scientist and not a software developer, most of my own use of the site falls under what Professor Jennifer Bryan (UBC) calls the ‘seriously off-label’ category. Really all you need to benefit from GitHub is a messy home folder on your computer and a growing disgust with the track changes tool in Microsoft Office. Sound familiar? Great.

Let’s say you have a project you’ve been working on. This project contains all the ingredients you’d find in a typical “file salad” – Word documents, spreadsheets, maybe some code scripts, a PDF or several, and, if you’re planning on posting some component of the project on the internet, a bunch of associated images, figures, and HTML docs. Maybe you have multiple versions of these files from collaborators, and if you’re like me you struggle to remember which one is the latest. GitHub offers an end to the file salad through the magic of version control: you hand over your entire bowl of file salad to GitHub, which hosts it for you in the form of a central repository. From then on, GitHub remembers what each version of each file in the salad bowl ever looked like. If you or your collaborators want to make a change to a project, you ‘pull’ the most recent version to your local computer, make your changes, and then ‘push’ them back to the central repository on GitHub.
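If it helps to see the pull-edit-push cycle spelled out, here is a toy sketch in Python. It is not how Git or GitHub actually work under the hood (real Git records whole-project snapshots and can merge simultaneous changes). The class name `CentralRepo`, its methods, and the file name `analysis.R` are all made up for illustration. The only idea it tries to capture is that every pushed version is kept, and a pull hands you the latest one.

```python
# Toy model only: real Git stores snapshots as commits and merges
# concurrent changes; this sketch just keeps every version of each file.

class CentralRepo:
    """Hypothetical stand-in for a repository hosted on GitHub."""

    def __init__(self):
        # Maps a file name to the list of every version ever pushed.
        self.history = {}

    def push(self, filename, content):
        """Record a new version of a file; older versions are kept."""
        self.history.setdefault(filename, []).append(content)

    def pull(self, filename):
        """Return the most recent version of a file."""
        return self.history[filename][-1]

    def version(self, filename, n):
        """Return version n of a file (1 = the first one pushed)."""
        return self.history[filename][n - 1]


repo = CentralRepo()
repo.push("analysis.R", "plot(x)")    # you hand over the first draft

draft = repo.pull("analysis.R")       # a collaborator pulls the latest version
draft = draft + "\nsummary(x)"        # ...edits it on their own computer
repo.push("analysis.R", draft)        # ...and pushes it back

latest = repo.pull("analysis.R")      # the newest version, with both lines
first = repo.version("analysis.R", 1) # the first draft is still retrievable
print(latest)
print(first)
```

The point of keeping a list per file rather than overwriting is the same reason version control beats a folder of files named "final_v2_REAL": nothing is lost when someone pushes a change, so you can always look back at what an earlier version contained.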

As a graduate student, the way I use GitHub is usually in tandem with my work in R, which is to say in RStudio with RMarkdown. I load my data into R, tidy it, analyze it, visualize it, and knit up the work I’ve done in one or more RMarkdown documents. Then I push everything, code and all, to a GitHub repository until I need to edit it again. I can testify firsthand that GitHub facilitates replicability, because looking at code I wrote three months ago is often like looking at the work of a total stranger. It is endlessly useful to have my code and all its documentation in one place, in several visual formats. I can only imagine how helpful it would be to have this tool at my disposal if I were trying to reproduce someone else’s results in a new context.

To be clear, I agree that scientists need to be careful about accidentally championing replicability where we mean the reproducibility of results, and it is my current opinion that GitHub facilitates the former more than the latter. But I also think that efforts in reproducibility are enriched when researchers have access to transparent records of analyses in the form of code, if code there was. Evaluating another scientist’s code, when freely shared, ensures that any comparison of reproduced results in a new experimental context is a fair one. I think this creates an obvious place for GitHub and other open-source tools that promote transparency.

Note: This post came about as a natural extension of the workshop I gave at the 2015 Biennial Regional Animal Behavior Student Conference at UC Davis. The workshop slides can be viewed here, and their associated GitHub repo is here…why not make it your first fork on GitHub?

Myfanwy Johnston

April 7, 2015