Reproducible Data Science
2024-08-20
Overview
This digital book contains the material for the graduate course WLF 553 Reproducible Data Science, which I teach at the University of Idaho. The aim of the course is to provide students with practical skills to manage and process their data throughout the data life cycle, from the moment the data are entered into a computer to the moment they are used in a publication, report, presentation, or other document. The content is organized into the following chapters:
- Chapter 1, Project Organization
- Chapter 2, Version Control with Git
- Chapter 3, Collaborative Science with GitHub
- Chapter 4, Best Practices in the Use of Spreadsheets
- Chapter 5, Relational Databases
- Chapter 6, Basics of SQL Language
- Chapter 7, Linking Databases and R with RSQLite
- Chapter 8, Dynamic Documents with RMarkdown
- Chapter 9, Automatically Generated Websites with GitHub Pages
- Chapter 10, Introduction to R
- Chapter 11, Troubleshooting in R
- Chapter 12, Working Environments in R
- Chapter 13, Data Wrangling with tidyverse
- Chapter 14, Data Visualization with ggplot2
- Chapter 15, Dates and Times in R
- Chapter 16, Introduction to Geospatial Data in R
0.1 Software Requirements and Installation Instructions
Required software is listed below along with installation instructions for different operating systems.
0.1.1 Git
Git is a distributed version control system. It is free and open source. To install Git, follow the instructions for your operating system below. Also, make sure you create a GitHub account at https://github.com/.
0.1.1.1 Windows
Download from the Git website: go to https://git-scm.com/download/win and the download will start automatically.
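Once the installer has finished, it can be useful to confirm that Git is reachable from a terminal. The snippet below is a minimal sketch, assuming a POSIX-style shell (on Windows, the Git Bash terminal that ships with Git for Windows should work). The variable name `GIT_STATUS` is only illustrative.

```shell
# Check whether the git executable is on the PATH and report its version.
if command -v git >/dev/null 2>&1; then
  GIT_STATUS="installed: $(git --version)"
else
  GIT_STATUS="not found: install Git or open a new terminal"
fi
echo "$GIT_STATUS"
```

If the message says Git was not found right after installation, closing and reopening the terminal is usually worth trying first.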
0.1.2 Spreadsheet Editor
Most people will already have Excel installed on their computer. However, any spreadsheet editor will work for the purpose of this course. If you don’t have access to an Office license, LibreOffice and OpenOffice are free, perfectly viable alternatives to Excel. Download the installer for your operating system:
- LibreOffice: https://www.libreoffice.org/download/download/
- OpenOffice: https://www.openoffice.org/download/
0.1.3 SQLite
SQLite is a lightweight relational database management system. To install it, follow these steps:
1. Go to https://www.sqlite.org/download.html and find your operating system in the list. You are looking for a category called “Precompiled Binaries”. For example, if you are on Windows, look for “Precompiled Binaries for Windows”. From this list, choose the file whose name starts with “sqlite-tools”. The description will read something like, “A bundle of command-line tools for managing SQLite database files, including the command-line shell program, the sqldiff.exe program, and the sqlite3_analyzer.exe program”.
2. In your file explorer, create a new folder called “sqlite” (e.g., on Windows, C:)
3. Extract the .zip file you downloaded into this new folder.
4. Download SQLiteStudio (a GUI, or Graphical User Interface, that we are going to use to run our SQL commands) here: https://github.com/pawelsalawa/sqlitestudio/releases. Download the file whose name starts with “Install” and choose the .exe extension if you’re working on Windows, .dmg if you’re on Mac OS, and the one without an extension if you’re on Linux.
If these instructions weren’t clear, you can find more details (with screenshots) at this link: https://www.sqlitetutorial.net/download-install-sqlite/
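To check that the command-line tools work, one option is to run a trivial query against a temporary in-memory database. This is a minimal sketch, assuming a POSIX-style shell and that `sqlite3` is on your PATH; if it is not, run the command from inside the folder where you extracted the tools. The variable name `SQLITE_CHECK` is only illustrative.

```shell
# Ask SQLite for its own version using an in-memory database (no file is created).
if command -v sqlite3 >/dev/null 2>&1; then
  SQLITE_CHECK="$(sqlite3 :memory: 'SELECT sqlite_version();')"
  echo "SQLite version: $SQLITE_CHECK"
else
  SQLITE_CHECK="missing"
  echo "sqlite3 not found on the PATH; try running it from the sqlite folder"
fi
```

A version number in the output means the shell program can start and execute SQL.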
0.1.4 R
R is a free software environment for statistical computing and graphics. Note that installing or updating R is a separate, independent process from installing or updating RStudio! If you already have R installed, make sure you have the latest available version. Follow the installation or update instructions for your operating system below.
0.1.4.1 Windows
Download the latest version of R at https://cran.r-project.org/bin/windows/base/
0.1.4.2 Mac OS
Download the latest version of R at https://cran.r-project.org/bin/macosx/
0.1.4.3 Linux
These instructions are for Ubuntu 18.04. If you are running a different version of Debian/Ubuntu, there are some small adjustments to make (see below). In the command line, add the GPG key:
$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
Add the R repository (this is where you substitute the appropriate release name if you’re working with a different version of Ubuntu; you can find the complete list here: https://cloud.r-project.org/bin/linux/ubuntu/):
Update package lists:
Install R:
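The three steps above (add the repository, update package lists, install R) can be sketched as follows. This is an illustration under stated assumptions, not necessarily the exact commands used in the course: it assumes the release codename `bionic` for Ubuntu 18.04 and the `-cran40` repository suffix that CRAN uses for R 4.x builds, so check the list linked above for your release. The `run` helper is hypothetical: it only prints each command so you can review it, and executes the commands when `RUN=1` is set.

```shell
# Ubuntu release codename: "bionic" is 18.04; substitute your own release.
RELEASE="bionic"
# Assumption: "-cran40" is the suffix CRAN uses for R 4.x packages.
REPO="deb https://cloud.r-project.org/bin/linux/ubuntu ${RELEASE}-cran40/"

# Hypothetical helper: print the command unless RUN=1 is set, then execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "$*"
  fi
}

run sudo add-apt-repository "$REPO"   # add the R repository
run sudo apt update                   # update package lists
run sudo apt install r-base           # install R
```

Printing first is a deliberate choice here: all three commands need administrator rights, so it helps to confirm the repository line matches your Ubuntu release before executing anything.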
0.1.5 RStudio
RStudio is a free Integrated Development Environment (IDE) for R. Note that installing or updating RStudio is a separate, independent process from installing or updating R! If you already have RStudio installed, make sure you have the latest available version. Otherwise, go ahead and download it from here: https://posit.co/download/rstudio-desktop/.
University of Idaho, spicardi@uidaho.edu