Reproducible Data Science
This digital book contains the material for the Graduate Special Topics course WILD 6900: Reproducible Data Science. The aim of the course is to give students practical skills to manage and process data throughout its life cycle, from the moment it is entered into a computer to the moment it is used in a publication, document, or presentation. The content is organized into the following chapters:
- Chapter 1, Project Organization
- Chapter 2, Version Control with Git
- Chapter 3, Collaborative Science with GitHub
- Chapter 4, Best Practices in the Use of Spreadsheets
- Chapter 5, Relational Databases
- Chapter 6, Basics of SQL Language
- Chapter 7, Linking Databases and R with RSQLite
- Chapter 8, Dynamic Documents with RMarkdown
- Chapter 9, Automatically Generated Websites with GitHub Pages
- Chapter 10, Introduction to R
- Chapter 11, Troubleshooting in R
- Chapter 12, Working Environments in R
- Chapter 13, Data Wrangling with tidyverse
- Chapter 14, Data Visualization with ggplot2
- Chapter 15, Dates and Times in R
- Chapter 16, Introduction to Geospatial Data in R
0.1 Software Requirements and Installation Instructions
Required software is listed below along with installation instructions for different operating systems.
0.1.1 Git
Git is a distributed version control system. It is free and open source. To install Git, follow instructions for your operating system below. Also, make sure you create a GitHub account on https://github.com/.
0.1.1.1 Windows
Download from the Git website: go to https://git-scm.com/download/win and the download will start automatically.
0.1.1.2 Mac OS
On Mavericks (10.9) or above, when you try to run a Git command from the Terminal for the first time, the installation will start automatically if you don’t already have Git installed. Type a Git command such as git --version in the Terminal, then follow the instructions in the installation wizard.
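Once the wizard finishes, the same kind of command doubles as a quick check that Git is available. A minimal sketch, assuming git --version as the command (any Git command would trigger the installer; this one is simply harmless):

```shell
# Print the installed Git version (output looks like "git version X.Y.Z").
# On macOS without Git, running this is what brings up the installer prompt.
git --version
```

If a version number is printed, Git is installed and on your PATH.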
0.1.2 Spreadsheet Editor
Most people will already have Excel installed on their computer. However, any spreadsheet editor will work for the purpose of this course. If you don’t have access to an Office license, LibreOffice and OpenOffice are free, perfectly viable alternatives to Excel. Download the installer for your operating system from the website of whichever one you choose.
0.1.3 SQLite
SQLite is a lightweight relational database management system. To install it, follow these steps:
1. Go to https://www.sqlite.org/download.html and find your operating system in the list. You are looking for a category called “Precompiled Binaries”. For example, if you are on Windows, look for “Precompiled Binaries for Windows”. From this list, choose the file whose name starts with “sqlite-tools”. The description will read something like, “A bundle of command-line tools for managing SQLite database files, including the command-line shell program, the sqldiff.exe program, and the sqlite3_analyzer.exe program”.
2. In your file explorer, create a new folder called “sqlite” (e.g., on Windows, C:)
3. Extract the .zip file you downloaded into this new folder.
4. Download SQLiteStudio (a GUI, or Graphical User Interface, that we are going to use to run our SQL commands) from https://github.com/pawelsalawa/sqlitestudio/releases. Download the file whose name starts with “Install” and choose the .exe extension if you’re working on Windows, .dmg if you’re on Mac OS, and the one without an extension if you’re on Linux.
If these instructions weren’t clear, you can find more details (with screenshots) at this link: https://www.sqlitetutorial.net/download-install-sqlite/
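To confirm that the command-line tools work after extraction, a small check along these lines may help. This is only a sketch: sqlite_check is a hypothetical helper name, not part of the course material, and it assumes a Unix-style shell (Terminal on Mac OS or Linux, Git Bash on Windows) with the sqlite folder added to your PATH.

```shell
# Hypothetical helper (not from the course material): print the SQLite
# version if the sqlite3 shell can be found, otherwise explain what is missing.
sqlite_check() {
  if command -v sqlite3 >/dev/null 2>&1; then
    # Query a throwaway in-memory database, so no file is created.
    sqlite3 :memory: "SELECT sqlite_version();"
  else
    echo "sqlite3 not found: add the sqlite folder to your PATH and try again"
  fi
}

sqlite_check
```

A printed version number means the SQLite shell is installed and reachable.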
0.1.4 R
R is a free software environment for statistical computing and graphics. Note that installing or updating R is a separate, independent process from installing or updating RStudio! If you already have R installed, make sure you have the latest available version. Follow installation or update instructions for your operating system below.
0.1.4.1 Windows
Download the latest version of R at https://cran.r-project.org/bin/windows/base/
0.1.4.2 Mac OS
Download the latest version of R at https://cran.r-project.org/bin/macosx/
0.1.4.3 Linux
These instructions are for Ubuntu 18.04. If you are running a different version of Debian/Ubuntu, there are some small adjustments to make (see below). In the command line, add the GPG Key:
Add the R repository (this is where you have to substitute the appropriate release name if you’re working with a different version of Ubuntu; you can find the complete list here: https://cloud.r-project.org/bin/linux/ubuntu/):
Update package lists:
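As a sketch, the three steps above typically look like this on Ubuntu 18.04 (bionic). The key fingerprint and the bionic-cran40 repository name are assumptions taken from CRAN's Ubuntu instructions for R 4.x; check the CRAN page linked above for the values that match your release. The final install line is also an assumption about the usual next step.

```shell
# 1. Add the GPG key that signs the CRAN Ubuntu packages
#    (fingerprint is an assumption; verify it on the CRAN Ubuntu page).
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9

# 2. Add the R repository; replace "bionic" with your release's code name.
sudo add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/'

# 3. Update package lists.
sudo apt update

# Assumed next step: install base R from the newly added repository.
sudo apt install r-base
```

Each command asks for your password because it changes system-wide package settings.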
0.1.5 RStudio
RStudio is a free Integrated Development Environment (IDE) for R. Note that installing or updating RStudio is a separate, independent process from installing or updating R! If you already have RStudio installed, make sure you have the latest available version. Otherwise, go ahead and download it from here: https://rstudio.com/products/rstudio/download/#download (choose the appropriate version for your operating system).
0.1.6 Required R Packages
Throughout the course, we will be using the following R packages: RSQLite, rmarkdown, bookdown, renv, tidyverse, lubridate, raster, and sf. All these packages are on CRAN and can be installed (along with their dependencies) by running the following code in R:
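The call in question is install.packages() with the eight package names listed above. A minimal sketch, wrapped in Rscript -e so it can be run from a terminal; the expression inside the single quotes can equally be pasted into the R console. The explicit CRAN mirror is an assumption, added because a non-interactive R session may have no default mirror set.

```shell
# Install the course packages and their dependencies from CRAN.
# The repos URL is an assumed mirror; any CRAN mirror works.
Rscript -e 'install.packages(c("RSQLite", "rmarkdown", "bookdown", "renv", "tidyverse", "lubridate", "raster", "sf"), repos = "https://cloud.r-project.org")'
```

Installation can take a while, since tidyverse alone pulls in many dependencies.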