A reproducible research workflow: GitLab, R Markdown, and Open Science Framework

Author

Gjalt-Jorn Peters

Published

April 3, 2019

Disclaimer: like style guides, this is necessarily ‘opinionated’, and perhaps less like style guides, it’s very much a work in progress. It reflects a set of best practices, but one lives, one learns. Also, please feel free to submit pull requests with suggestions!

Tools

This workflow is based on the following tools (apparently the cool kids call it a ‘stack’). Most can be switched for an alternative; these are my choices, based on Open Science principles, how the tools interact, and personal preference.

  • Open Science Framework: An open source repository for supporting all things Open Science. A freely usable installation is located at https://osf.io.
  • Git: An open source version control system, enabling minimum-effort documentation of changes over time. It can be downloaded and installed from https://git-scm.com/. A collection of directories and files that is kept track of by git is called a repository, or repo for short.
  • GitLab: An open source repository manager for git repositories with a bunch of extra features. A freely usable installation is located at https://gitlab.com.
  • R: An open source and extremely versatile and extensible statistics package. It can be downloaded from https://r-project.org.
  • RStudio: An open source R interface that makes working with R considerably more pleasant. It can be downloaded from https://rstudio.com.
  • Markdown: A convention for adding markup to plain text files, prioritising human-readability over versatility and power. Unlike HTML, Markdown is easy to learn for, um, digitally challenged people. The original specification is available at https://daringfireball.net/projects/markdown/syntax.
  • YAML: A convention for encoding data in plain text files that is designed to optimize human-readability and editability (technically a JSON superset). The specification is available from https://yaml.org/spec/.
  • R Markdown: A convention where Markdown files (i.e. plain text files) contain YAML front-matter and R chunks to create fully reproducible reports. Such R Markdown (Rmd) files can be rendered to a variety of formats, such as HTML. Introductory documents, examples, and tutorials are available from https://rmarkdown.rstudio.com/.

Preparation

  • Download the required software.
  • RStudio should find git if you have it installed. There’s an extensive tutorial at https://happygitwithr.com/ to get you started.
  • Create an account at the GitLab repo manager you will use (e.g. https://gitlab.com).
  • Create an account at the Open Science Framework implementation you will use (e.g. https://osf.io).

From here on out, I assume you have everything working.

If you’re not familiar with some or all of the technologies listed here, be prepared to learn them; this is not so much a tutorial as it is an overview.

Workflow

  1. On https://gitlab.com (or whichever GitLab repo manager you use), create a public GitLab repository and initialize it with a README.md Markdown file. Make sure that the ‘URL slug’ (the identifier of the repo that is appended to the git repo manager URL and as such becomes the identifying part of the repo’s online presence) is clear, sufficiently specific, and as short as possible. Only use lowercase letters and dashes; also see the conventions for the recommended directory structure, below. Also, note that at the time of writing this (2019-04-03), there’s a bug preventing OSF from syncing with GitLab repos that have a group or a subgroup as a parent. Therefore, create the GitLab repo in your own user account (you can always set up mirrors later if you want). Feel free to check out my GitLab repos at https://gitlab.com/matherion.

  2. In RStudio, create a new project, select version control and then git, and clone the GitLab repo you just created (using the URL of the repository). This repo will contain just the README.md file and will, for now, be otherwise empty.

  3. Edit the “.gitignore” file. This contains glob patterns (shell-style wildcards rather than full regular expressions), and any directories or files that match one of these patterns are excluded from synchronization by git (the pattern format is documented in git’s gitignore manual page, if you want to know the details).

  • In this file, add a line that contains “*\[PRIVATE\]*” (without the double quotes) to exclude all files and directories with [PRIVATE] in their name. The backslashes make git read the square brackets literally rather than as a character set.

  • Also add a line containing only “manuscripts/” so that the directory holding manuscript versions will not be published to GitLab, and a line containing only “private/” so that you have a directory for private files that you don’t want to sync. Use forward slashes here, even on Windows.

  • Add any other directories or filename patterns you want to exclude.

There are also some lines that RStudio adds, and so a minimal version could look like this:

.Rproj.user
.Rhistory
.RData
.Ruserdata
*\[PRIVATE\]*
manuscripts/
private/
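
Before committing anything sensitive, it can be worth checking that the ignore patterns behave as intended. The following is a minimal sketch, not part of the workflow itself: it builds a throwaway repository in a temporary directory and asks git which paths it would ignore. The patterns are written in gitignore glob syntax with forward slashes, and the file names are made up for illustration.

```shell
#!/bin/sh
# Sketch: check which paths git would ignore, using a throwaway repository.
set -eu

tmp=$(mktemp -d)
cd "$tmp"
git init -q .

# Assumed patterns: gitignore glob syntax, forward slashes for directories.
cat > .gitignore <<'EOF'
*\[PRIVATE\]*
manuscripts/
private/
EOF

# Hypothetical example files, only there to have something to test against.
mkdir -p manuscripts private scripts
touch manuscripts/draft.docx private/notes.txt
touch "scripts/contacts [PRIVATE].csv" scripts/analysis.Rmd

# 'git check-ignore -q' exits with 0 if the path is ignored and 1 if it is not.
for path in manuscripts/draft.docx private/notes.txt \
            "scripts/contacts [PRIVATE].csv" scripts/analysis.Rmd; do
  if git check-ignore -q "$path"; then
    echo "ignored: $path"
  else
    echo "tracked: $path"
  fi
done
```

In a real repository, the same check is simply `git check-ignore -v some/path`, which also reports the pattern that caused the match.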
  4. Create a number of directories to hold your files. I recommend the following structure:

     repo
      |-- manuscripts
      |-- methods--ethics
      |-- methods--operationalisations
      |-- methods--protocols
      |-- private
      |-- results--data-raw
      |-- results--data-processed
      |-- results--intermediate-results
      °-- scripts

Note the conventions: only lower case letters; no spaces (but dashes instead); and double dashes to separate ‘sections’ and single dashes to separate words.
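
If you prefer to create this structure from a terminal, the sketch below does so and checks each name against the conventions first. It is only an illustration: `REPO_DIR` is a hypothetical variable for pointing the script at the repository root, and the pattern is my reading of the naming rules (lowercase words, single dashes between words, double dashes between sections).

```shell
#!/bin/sh
# Sketch: create the recommended directories and check their names.
set -eu

# Repository root; defaults to the current directory.
# REPO_DIR is a hypothetical override, not something git or RStudio sets.
repo_dir="${REPO_DIR:-.}"

dirs="manuscripts methods--ethics methods--operationalisations methods--protocols private results--data-raw results--data-processed results--intermediate-results scripts"

for dir in $dirs; do
  # Lowercase words joined by single dashes, sections joined by double dashes.
  if ! printf '%s\n' "$dir" | grep -Eq '^[a-z]+(--?[a-z]+)*$'; then
    echo "name breaks the conventions: $dir" >&2
    exit 1
  fi
  mkdir -p "$repo_dir/$dir"
done
```

Git tracks files rather than directories, so an empty directory will not show up on GitLab until it contains at least one file.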

  5. Create an R Markdown file in the scripts directory. I tend to give the main script file (which is usually the only script file) the same name as the git repository itself, i.e. the URL slug you created in step 1.

  6. In this R Markdown file, add your sample size computations.

  7. Render the R Markdown file to an HTML file.

  8. In the repository’s root (the same directory where the “.gitignore” is located), create a file called “.gitlab-ci.yml” (without the double quotes). As contents, copy-paste this:

pages:
  stage: build
  image: alpine:latest
  script:
    - mkdir public
    - cp scripts/YOURFILENAMEHERE public/index.html
  artifacts:
    paths:
    - public
  only:
  - master

Replace YOURFILENAMEHERE with the name of the rendered HTML version of your main script file that was created in step 7.
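
A wrong file name in “.gitlab-ci.yml” only shows up once the CI job fails, so it can help to try the job’s two commands locally first. The sketch below does this in a scratch directory. The function name `dry_run_pages` and the file name `my-study` are hypothetical, and the commented rendering command assumes that R and the rmarkdown package are installed and that the Rmd file renders to HTML next to itself (the default for HTML output).

```shell
#!/bin/sh
# Sketch: rehearse the CI job's copy step locally before pushing.
set -eu

# Hypothetical helper: repeats 'mkdir public' and 'cp ... public/index.html'
# in a scratch directory, so the repository itself is left untouched.
dry_run_pages() {
  html_file="$1"
  if [ ! -f "$html_file" ]; then
    echo "missing: $html_file (render the Rmd file first)" >&2
    return 1
  fi
  scratch=$(mktemp -d)
  mkdir "$scratch/public"
  cp "$html_file" "$scratch/public/index.html"
  echo "ok: $scratch/public/index.html"
}

# Rendering from the command line instead of RStudio's Knit button
# (assumes R and the rmarkdown package are installed):
#   Rscript -e 'rmarkdown::render("scripts/my-study.Rmd")'
#
# Then rehearse the CI job from the repository root:
#   dry_run_pages scripts/my-study.html
```

If the function reports a missing file, the name you entered for YOURFILENAMEHERE would make the CI job fail in the same way.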

  9. Commit and push all changes to the online GitLab repository. Shortly afterwards, GitLab will show that its Continuous Integration functionality has performed the tasks in the “.gitlab-ci.yml” file. Your GitLab Pages site will then become available (this can take an hour or so).

  10. Head over to the Open Science Framework (OSF), create an OSF project, and add the contributors.

  11. On the Add-ons page, link it to the GitLab repo.

  12. On the wiki page, edit the ‘home’ wiki page and add a link to the GitLab Pages version that hosts the rendered version of your main Rmd file (and potentially link to the GitLab repo itself).

  13. On the Registrations page, add a new registration to add a preregistration. I recommend using the “OSF Preregistration”, which is quite comprehensive and works for many different types of studies. The “Open-Ended Registration” is the other extreme: it just consists of a text field. This second form is useful at later points (see step 15). Save the preregistration as a draft until you agree with all co-authors (see step 15).

  14. Add all files you have available at this point to the directory structure you created in step 4. Extend this structure as needed, adding more sibling directories and/or subdirectories. For example, add:

  • The questionnaires, stimuli, and computer task source code files you intend to use to collect data;
  • Protocols and communications with your participants (e.g. emails, recruitment texts, etc);
  • Any ethical approval documents, such as your request for ethical approval as well as the letter of approval, if you have it available already.
  15. Once you have added everything and all co-authors agree, finalize the preregistration. This will create a frozen version of the form including all files in the repository. This is important, because if you unlink GitLab from OSF at a later stage, the synchronization will break, and the ‘live’ set of files will disappear from OSF. It is therefore wise to periodically register the state of your project: for example, when you update your plans after receiving peer review comments on a registered report (before starting data collection), and at each submission of your manuscript.

Examples

For examples of repositories set up using this workflow (or earlier versions of it), see: