Chapter 7 Interfacing Databases in R with RSQLite

7.1 The RSQLite package

RSQLite is an R package that provides an interface with SQLite such that you can interact with the database from within R. Everything we have done so far on the database can be done from R without ever opening SQLite Studio. You could actually do all of those things without even having SQLite installed on your computer, because the R package itself contains SQLite.

In turn, RSQLite relies on another R package called DBI (which stands for database interface). DBI is a package that provides generic functions for interfacing databases with R, and RSQLite makes adjustments that are specific to SQLite. So, first things first, we are going to load the DBI package:

library(DBI)

7.2 Establising a database connection

We start by connecting to a database, either an existing one or one we create new. The dbConnect function takes as input a file path for the database. If the path exists, it will connect to the existing database; if it does not, it will create it. We also specify that the type of database connection is SQLite:

dragons_db <- dbConnect(RSQLite::SQLite(), "data/dragons.db")

7.3 Sending queries to the database

Now that the connection is established, we can start sending queries to the database. Any time we perform actions that affect the database (creating or deleting tables, inserting data, etc.) we use the function dbExecute. This function takes as input a database connection and the SQL code that we want to run (as a character string). For example, we can copy-paste the code we used in Chapter 6 to create the dragons table:

dbExecute(dragons_db, "CREATE TABLE dragons (
dragon_id varchar(5) NOT NULL,
sex char(1) CHECK (sex IN ('M', 'F')),
age_class varchar(8) CHECK (age_class IN ('Juvenile', 'Subadult', 'Adult')),
species varchar(50),
PRIMARY KEY (dragon_id)
);")

Now, instead of manually importing the data by pointing and clicking, we can load the data from .csv files into R and insert it into the newly created table.

dragons <- read.csv("data/dragons.csv", 
                   stringsAsFactors = FALSE) 

names(dragons)

## [1] "dragon"    "sex"       "age_class" "species"

Note that the column names in the .csv differ slightly from the column names we assigned when creating the dragons table. This is a problem because RSQLite won’t recognize the columns and won’t be able to insert the data. We can prevent this by changing any names that do not match so that they match the database:

names(dragons)[1] <- "dragon_id"

Now we can enter the data from the .csv into the dragons table. The function dbWriteTable takes as input the database connection, the name of the table we want to fill (in quotes), and the name of the data frame we want to input. Note that I’m using append = TRUE because otherwise RSQLite will overwrite the current table with whatever is in the .csv, and any constraints we have enforced (primary key, foreign key, etc.) will be lost:

dbWriteTable(dragons_db, "dragons", dragons, append = TRUE)

Now we can send a query to the database to check that the data were inserted correctly. Because this is a query that retrieves data, not a query that modifies the database, we use dbGetQuery instead of dbExecute:

dbGetQuery(dragons_db, "SELECT * FROM dragons LIMIT 10;")

##    dragon_id  sex age_class             species update_timestamp
## 1         D1    F  Subadult     Hebridean Black             <NA>
## 2         D2 <NA>  Juvenile   Romanian Longhorn             <NA>
## 3         D3    F     Adult     Hebridean Black             <NA>
## 4         D4    F     Adult Peruvian Vipertooth             <NA>
## 5         D5 <NA>  Juvenile Ukrainian Ironbelly             <NA>
## 6         D6    F     Adult Norwegian Ridgeback             <NA>
## 7         D7    M     Adult     Hebridean Black             <NA>
## 8         D8 <NA>  Juvenile Peruvian Vipertooth             <NA>
## 9         D9    F     Adult Norwegian Ridgeback             <NA>
## 10       D10    F     Adult  Common Welsh Green             <NA>

Now we can repeat this process with the other tables:

dbExecute(dragons_db, "CREATE TABLE tags (
tag_id char(6) NOT NULL PRIMARY KEY,
brand varchar(50),
status varchar(20)
);")

tags <- read.csv("data/tags.csv")

dbWriteTable(dragons_db, "tags", tags, append = TRUE)

dbGetQuery(dragons_db, "SELECT * FROM tags LIMIT 10;")

##    tag_id              brand      status
## 1  N33529 Heatwave Telemetry Refurbished
## 2  J47978          Undertone         New
## 3  R61684 Heatwave Telemetry         New
## 4  B35524         Aerotronic         New
## 5  T55954         Aerotronic         New
## 6  O31688          Undertone         New
## 7  B61925          Undertone         New
## 8  P47677         Aerotronic         New
## 9  P35987         Aerotronic         New
## 10 I41609 Heatwave Telemetry Refurbished

dbExecute(dragons_db, "CREATE TABLE capture_sites (
site char(3) NOT NULL PRIMARY KEY,
utm_x double,
utm_y double
);")

capture_sites <- read.csv("data/capture_sites.csv")

names(capture_sites)[2:3] <- c("utm_x", "utm_y")

dbWriteTable(dragons_db, "capture_sites", capture_sites, append = TRUE)

dbGetQuery(dragons_db, "SELECT * FROM capture_sites;")

##   site  utm_x   utm_y
## 1  LOG 430659 4620909
## 2  SLC 424790 4512583
## 3  MOA 626325 4270426
## 4  STG 271064 4108535
## 5  DLT 364106 4357045

Remember how in SQLite we had to find a workaround to add auto-incremental primary keys? Well, using RSQLite to build the database instead means we do not need that trick anymore. We can just add the auto-incremental serial number to the data and append it to the table while keeping all the structure and the constraints we have enforced. Let’s try this with the captures table. First, we create the table:

dbExecute(dragons_db, "CREATE TABLE captures (
capture_id INTEGER PRIMARY KEY AUTOINCREMENT,
dragon_id varchar(5),
date text,
site char(3),
FOREIGN KEY(dragon_id) REFERENCES dragons(dragon_id)
FOREIGN KEY(site) REFERENCES capture_sites(site)
);")

Then we read in the .csv file and add a column called capture_id with the serial number we’ll use as primary key:

captures <- read.csv("data/captures.csv") 

captures$capture_id <- 1:nrow(captures)

head(captures)

##   dragon capture_date capture_site capture_id
## 1     D1   2012-08-10          MOA          1
## 2     D2   2013-07-23          MOA          2
## 3     D3   2016-10-19          SLC          3
## 4     D4   2001-01-10          LOG          4
## 5     D5   2000-09-28          MOA          5
## 6     D6   2016-02-25          LOG          6

Not only the column names need to match the names in the database table, but the order has to match too. So we re-order the columns:

captures <- captures[, c("capture_id", "dragon", "capture_date", "capture_site")]

We fix the names so they match:

names(captures)[2:4] <- c("dragon_id", "date", "site")

And we append the data frame to the database table:

dbWriteTable(dragons_db, "captures", captures, append = TRUE)

Let’s check that everything worked:

dbGetQuery(dragons_db, "SELECT * FROM captures LIMIT 10;")

##    capture_id dragon_id       date site
## 1           1        D1 2012-08-10  MOA
## 2           2        D2 2013-07-23  MOA
## 3           3        D3 2016-10-19  SLC
## 4           4        D4 2001-01-10  LOG
## 5           5        D5 2000-09-28  MOA
## 6           6        D6 2016-02-25  LOG
## 7           7        D7 2012-11-05  MOA
## 8           8        D8 2009-07-14  LOG
## 9           9        D9 2001-08-17  STG
## 10         10       D10 2017-01-17  LOG

Now we can repeat this process with the remaining tables:

dbExecute(dragons_db, "CREATE TABLE morphometrics (
measurement_id INTEGER PRIMARY KEY AUTOINCREMENT,
dragon_id varchar(5),
date text,
total_body_length_cm float,
wingspan_cm float,
tail_length_cm float,
tarsus_length_cm float,
claw_length_cm float,
FOREIGN KEY (dragon_id) REFERENCES dragons(dragon_id)
);")

# Load csv file
morphometrics <- read.csv("data/morphometrics.csv") 

# Add auto-incremental number
morphometrics$measurement_id <- 1:nrow(morphometrics)

# Re-order columns
morphometrics <- morphometrics[, c("measurement_id", "dragon", "date",
                                   "total_body_length_cm", "wingspan_cm",
                                   "tail_length_cm", "tarsus_length_cm", 
                                   "claw_length_cm")]

# Change column names to match
names(morphometrics)[2] <- "dragon_id"

# Append to database table
dbWriteTable(dragons_db, "morphometrics", morphometrics, append = TRUE)

dbGetQuery(dragons_db, "SELECT * FROM morphometrics LIMIT 10;")

##    measurement_id dragon_id       date total_body_length_cm wingspan_cm
## 1               1       D96 2012-10-10            1069.8965   1389.5527
## 2               2      D400 2003-07-03             333.4600    634.9109
## 3               3      D316 2017-09-08             866.8935   1052.3702
## 4               4      D317 2016-09-05            1146.9708   1356.8084
## 5               5      D484 2016-12-04            1032.0520   1720.8641
## 6               6      D149 2012-02-13             919.9908   1533.5991
## 7               7      D285 2016-03-23             304.8285    698.8157
## 8               8      D256 2013-09-07             358.9701    652.0053
## 9               9      D283 2007-06-21            1698.1918   1387.1943
## 10             10      D213 2001-12-12             353.6952    670.5283
##    tail_length_cm tarsus_length_cm claw_length_cm
## 1        595.2706        121.65175      15.596219
## 2        104.2241         38.10844       4.305086
## 3        373.7619         68.16869      12.719697
## 4        542.5670        172.43663      14.809363
## 5        596.4419        114.05057      11.985672
## 6        563.9201        134.18051      11.005070
## 7        115.8964         42.18657       5.041634
## 8        148.3656         43.57849       4.576640
## 9        666.4246        147.44219      13.189233
## 10       140.4342         43.60513       4.248564

dbExecute(dragons_db, "CREATE TABLE diet (
diet_id INTEGER PRIMARY KEY AUTOINCREMENT,
dragon_id varchar(5),
sample_id varchar(8),
date text,
item_id integer,
item varchar(50),
FOREIGN KEY (dragon_id) REFERENCES dragons(dragon_id)
);")

diet <- read.csv("data/diet.csv") 

diet$diet_id <- 1:nrow(diet)

diet <- diet[, c("diet_id", "dragon", "sample_id", "sample_dates", "item_id",
                 "item")]

names(diet)[c(2, 4)] <- c("dragon_id", "date")

dbWriteTable(dragons_db, "diet", diet, append = TRUE)

dbGetQuery(dragons_db, "SELECT * FROM diet LIMIT 10;")

##    diet_id dragon_id sample_id       date item_id          item
## 1        1      D221    D221-1 2017-11-22       1  Domestic cow
## 2        2      D221    D221-1 2017-11-22       2 Domestic goat
## 3        3      D221    D221-1 2017-11-22       3        Coyote
## 4        4      D221    D221-1 2017-11-22       4  Domestic cow
## 5        5      D221    D221-1 2017-11-22       5  Domestic cow
## 6        6      D221    D221-1 2017-11-22       6     Mule deer
## 7        7      D119    D119-1 2012-06-28       1     Mule deer
## 8        8      D119    D119-1 2012-06-28       2        Cougar
## 9        9      D119    D119-1 2012-06-28       3 Domestic goat
## 10      10      D119    D119-1 2012-06-28       4 Mountain goat

dbExecute(dragons_db, "CREATE TABLE deployments (
deployment_id INTEGER PRIMARY KEY AUTOINCREMENT,
dragon_id varchar(5),
tag_id char(6),
start_deployment text,
end_deployment text,
FOREIGN KEY(dragon_id) REFERENCES dragons(dragon_id)
FOREIGN KEY(tag_id) REFERENCES tags(tag_id)
);")

deployments <- read.csv("data/deployments.csv") 

deployments$deployment_id <- 1:nrow(deployments)

deployments <- deployments[, c("deployment_id", "dragon", "tag", 
                               "start_deploy", "end_deploy")]

names(deployments)[2:5] <- c("dragon_id", "tag_id", "start_deployment", "end_deployment")

dbWriteTable(dragons_db, "deployments", deployments, append = TRUE)

dbGetQuery(dragons_db, "SELECT * FROM deployments LIMIT 10;")

##    deployment_id dragon_id tag_id start_deployment end_deployment
## 1              1      D486 N33529       2006-12-20     2007-07-04
## 2              2      D393 J47978       2008-01-13     2008-10-13
## 3              3       D88 R61684       2004-09-19     2004-12-01
## 4              4      D330 B35524       2015-11-21     2016-08-02
## 5              5      D478 T55954       2002-05-07     2002-10-26
## 6              6      D300 O31688       2006-01-20     2006-08-23
## 7              7      D380 B61925       2015-01-27     2015-10-21
## 8              8      D315 P47677       2006-03-03     2006-06-06
## 9              9      D209 P35987       2007-02-26     2007-05-31
## 10            10      D357 I41609       2013-09-09     2014-03-08

dbExecute(dragons_db, "CREATE TABLE gps_data_raw (
gps_id INTEGER PRIMARY KEY AUTOINCREMENT,
tag_id char(6),
timestamp text, 
utm_x double,
utm_y double,
FOREIGN KEY(tag_id) REFERENCES tags(tag_id)
);")

gps_data_raw <- read.csv("data/telemetry_raw.csv") 

gps_data_raw$gps_id <- 1:nrow(gps_data_raw)

gps_data_raw <- gps_data_raw[, c("gps_id", "tag", "timestamp", "x", "y")]

names(gps_data_raw)[c(2, 4, 5)] <- c("tag_id", "utm_x", "utm_y")

dbWriteTable(dragons_db, "gps_data_raw", gps_data_raw, append = TRUE)

dbGetQuery(dragons_db, "SELECT * FROM gps_data_raw LIMIT 10;")

##    gps_id tag_id           timestamp    utm_x   utm_y
## 1       1 N33529 2006-12-20 00:02:03 626479.9 4270480
## 2       2 N33529 2006-12-20 03:02:01 628321.6 4271404
## 3       3 N33529 2006-12-20 06:00:45 628111.1 4271812
## 4       4 N33529 2006-12-20 08:59:56 628294.4 4271940
## 5       5 N33529 2006-12-20 12:02:14 628642.2 4273090
## 6       6 N33529 2006-12-20 15:00:12 628750.1 4273879
## 7       7 N33529 2006-12-20 18:00:29 625775.1 4272723
## 8       8 N33529 2006-12-20 20:59:54 625682.6 4272723
## 9       9 N33529 2006-12-21 00:02:12 625493.0 4272917
## 10     10 N33529 2006-12-21 03:01:18 628051.1 4272672

Now we are ready to generate the gps_data table based on information from existing tables. We can create and populate the table in 2 steps using dbExecute:

dbExecute(dragons_db, "CREATE TABLE gps_data (
loc_id INTEGER PRIMARY KEY,
tag_id char(6),
dragon_id varchar(5),
timestamp text,
utm_x double,
utm_y double,
FOREIGN KEY (tag_id) REFERENCES tags(tag_id)
FOREIGN KEY (dragon_id) REFERENCES dragons(dragon_id)
);")

dbExecute(dragons_db, "INSERT INTO gps_data (
tag_id, dragon_id, timestamp, utm_x, utm_y)
SELECT
deployments.tag_id,
deployments.dragon_id,
gps_data_raw.timestamp,
gps_data_raw.utm_x,
gps_data_raw.utm_y
FROM deployments LEFT JOIN gps_data_raw USING (tag_id)
WHERE gps_data_raw.tag_id = deployments.tag_id AND
(
    (
    (strftime(gps_data_raw.timestamp) >= strftime(deployments.start_deployment)) AND
    (strftime(gps_data_raw.timestamp) <= strftime(deployments.end_deployment))
    )
OR 
    (
    (gps_data_raw.timestamp >= deployments.start_deployment) AND
    (deployments.end_deployment IS NULL)
    )
);")

dbGetQuery(dragons_db, "SELECT * FROM gps_data LIMIT 10;")

##    loc_id tag_id dragon_id           timestamp    utm_x   utm_y
## 1       1 N33529      D486 2006-12-20 00:02:03 626479.9 4270480
## 2       2 N33529      D486 2006-12-20 03:02:01 628321.6 4271404
## 3       3 N33529      D486 2006-12-20 06:00:45 628111.1 4271812
## 4       4 N33529      D486 2006-12-20 08:59:56 628294.4 4271940
## 5       5 N33529      D486 2006-12-20 12:02:14 628642.2 4273090
## 6       6 N33529      D486 2006-12-20 15:00:12 628750.1 4273879
## 7       7 N33529      D486 2006-12-20 18:00:29 625775.1 4272723
## 8       8 N33529      D486 2006-12-20 20:59:54 625682.6 4272723
## 9       9 N33529      D486 2006-12-21 00:02:12 625493.0 4272917
## 10     10 N33529      D486 2006-12-21 03:01:18 628051.1 4272672

7.4 A note on reproducibility

When considering the utility of interfacing a database with R, evaluating the benefits in terms of reproducibility is an important criterion. As far as the code to create the database, we could have saved a SQL script in the SQLite query editor and that would have been comparable to saving it as an R script. However, by loading the data into the database in R we eliminated the need for any pointing and clicking. We were able to fully build and populate the database using a single R script. This gives a major advantage in terms of reproducibility. Moreover, using R as an interface for the database means we can seamlessly transition from querying the data to processing it and analyzing it within the same programming framework, all while taking advantages of database-specific functionalities that R does not provide on its own.