Chapter 7 Interfacing Databases in R with RSQLite
7.1 The RSQLite package
RSQLite is an R package that provides an interface with SQLite such that you can interact with the database from within R. Everything we have done so far on the database can be done from R without ever opening SQLite Studio. You could actually do all of those things without even having SQLite installed on your computer, because the R package itself contains SQLite.
In turn, RSQLite relies on another R package called DBI (which stands for database interface). DBI is a package that provides generic functions for interfacing databases with R, and RSQLite makes adjustments that are specific to SQLite. So, first things first, we are going to load the DBI package:
7.2 Establising a database connection
We start by connecting to a database, either an existing one or one we create
new. The dbConnect
function takes as input a file path for the database. If
the path exists, it will connect to the existing database; if it does not, it
will create it. We also specify that the type of database connection is SQLite:
7.3 Sending queries to the database
Now that the connection is established, we can start sending queries to the
database. Any time we perform actions that affect the database (creating or
deleting tables, inserting data, etc.) we use the function dbExecute
. This
function takes as input a database connection and the SQL code that we want to
run (as a character string). For example, we can copy-paste the code we used in
Chapter 6 to create the dragons table:
dbExecute(dragons_db, "CREATE TABLE dragons (
dragon_id varchar(5) NOT NULL,
sex char(1) CHECK (sex IN ('M', 'F')),
age_class varchar(8) CHECK (age_class IN ('Juvenile', 'Subadult', 'Adult')),
species varchar(50),
PRIMARY KEY (dragon_id)
);")
Now, instead of manually importing the data by pointing and clicking, we can load the data from .csv files into R and insert it into the newly created table.
## [1] "dragon" "sex" "age_class" "species"
Note that the column names in the .csv differ slightly from the column names we assigned when creating the dragons table. This is a problem because RSQLite won’t recognize the columns and won’t be able to insert the data. We can prevent this by changing any names that do not match so that they match the database:
Now we can enter the data from the .csv into the dragons table. The function
dbWriteTable
takes as input the database connection, the name of the table we
want to fill (in quotes), and the name of the data frame we want to input. Note
that I’m using append = TRUE
because otherwise RSQLite will overwrite the
current table with whatever is in the .csv, and any constraints we have enforced (primary key, foreign key, etc.) will be lost:
Now we can send a query to the database to check that the data were inserted
correctly. Because this is a query that retrieves data, not a query that
modifies the database, we use dbGetQuery
instead of dbExecute
:
## dragon_id sex age_class species update_timestamp
## 1 D1 F Subadult Hebridean Black <NA>
## 2 D2 <NA> Juvenile Romanian Longhorn <NA>
## 3 D3 F Adult Hebridean Black <NA>
## 4 D4 F Adult Peruvian Vipertooth <NA>
## 5 D5 <NA> Juvenile Ukrainian Ironbelly <NA>
## 6 D6 F Adult Norwegian Ridgeback <NA>
## 7 D7 M Adult Hebridean Black <NA>
## 8 D8 <NA> Juvenile Peruvian Vipertooth <NA>
## 9 D9 F Adult Norwegian Ridgeback <NA>
## 10 D10 F Adult Common Welsh Green <NA>
Now we can repeat this process with the other tables:
dbExecute(dragons_db, "CREATE TABLE tags (
tag_id char(6) NOT NULL PRIMARY KEY,
brand varchar(50),
status varchar(20)
);")
## tag_id brand status
## 1 N33529 Heatwave Telemetry Refurbished
## 2 J47978 Undertone New
## 3 R61684 Heatwave Telemetry New
## 4 B35524 Aerotronic New
## 5 T55954 Aerotronic New
## 6 O31688 Undertone New
## 7 B61925 Undertone New
## 8 P47677 Aerotronic New
## 9 P35987 Aerotronic New
## 10 I41609 Heatwave Telemetry Refurbished
dbExecute(dragons_db, "CREATE TABLE capture_sites (
site char(3) NOT NULL PRIMARY KEY,
utm_x double,
utm_y double
);")
capture_sites <- read.csv("data/capture_sites.csv")
names(capture_sites)[2:3] <- c("utm_x", "utm_y")
dbWriteTable(dragons_db, "capture_sites", capture_sites, append = TRUE)
## site utm_x utm_y
## 1 LOG 430659 4620909
## 2 SLC 424790 4512583
## 3 MOA 626325 4270426
## 4 STG 271064 4108535
## 5 DLT 364106 4357045
Remember how in SQLite we had to find a workaround to add auto-incremental primary keys? Well, using RSQLite to build the database instead means we do not need that trick anymore. We can just add the auto-incremental serial number to the data and append it to the table while keeping all the structure and the constraints we have enforced. Let’s try this with the captures table. First, we create the table:
dbExecute(dragons_db, "CREATE TABLE captures (
capture_id INTEGER PRIMARY KEY AUTOINCREMENT,
dragon_id varchar(5),
date text,
site char(3),
FOREIGN KEY(dragon_id) REFERENCES dragons(dragon_id)
FOREIGN KEY(site) REFERENCES capture_sites(site)
);")
Then we read in the .csv file and add a column called capture_id
with the
serial number we’ll use as primary key:
## dragon capture_date capture_site capture_id
## 1 D1 2012-08-10 MOA 1
## 2 D2 2013-07-23 MOA 2
## 3 D3 2016-10-19 SLC 3
## 4 D4 2001-01-10 LOG 4
## 5 D5 2000-09-28 MOA 5
## 6 D6 2016-02-25 LOG 6
Not only the column names need to match the names in the database table, but the order has to match too. So we re-order the columns:
We fix the names so they match:
And we append the data frame to the database table:
Let’s check that everything worked:
## capture_id dragon_id date site
## 1 1 D1 2012-08-10 MOA
## 2 2 D2 2013-07-23 MOA
## 3 3 D3 2016-10-19 SLC
## 4 4 D4 2001-01-10 LOG
## 5 5 D5 2000-09-28 MOA
## 6 6 D6 2016-02-25 LOG
## 7 7 D7 2012-11-05 MOA
## 8 8 D8 2009-07-14 LOG
## 9 9 D9 2001-08-17 STG
## 10 10 D10 2017-01-17 LOG
Now we can repeat this process with the remaining tables:
dbExecute(dragons_db, "CREATE TABLE morphometrics (
measurement_id INTEGER PRIMARY KEY AUTOINCREMENT,
dragon_id varchar(5),
date text,
total_body_length_cm float,
wingspan_cm float,
tail_length_cm float,
tarsus_length_cm float,
claw_length_cm float,
FOREIGN KEY (dragon_id) REFERENCES dragons(dragon_id)
);")
# Load csv file
morphometrics <- read.csv("data/morphometrics.csv")
# Add auto-incremental number
morphometrics$measurement_id <- 1:nrow(morphometrics)
# Re-order columns
morphometrics <- morphometrics[, c("measurement_id", "dragon", "date",
"total_body_length_cm", "wingspan_cm",
"tail_length_cm", "tarsus_length_cm",
"claw_length_cm")]
# Change column names to match
names(morphometrics)[2] <- "dragon_id"
# Append to database table
dbWriteTable(dragons_db, "morphometrics", morphometrics, append = TRUE)
## measurement_id dragon_id date total_body_length_cm wingspan_cm
## 1 1 D96 2012-10-10 1069.8965 1389.5527
## 2 2 D400 2003-07-03 333.4600 634.9109
## 3 3 D316 2017-09-08 866.8935 1052.3702
## 4 4 D317 2016-09-05 1146.9708 1356.8084
## 5 5 D484 2016-12-04 1032.0520 1720.8641
## 6 6 D149 2012-02-13 919.9908 1533.5991
## 7 7 D285 2016-03-23 304.8285 698.8157
## 8 8 D256 2013-09-07 358.9701 652.0053
## 9 9 D283 2007-06-21 1698.1918 1387.1943
## 10 10 D213 2001-12-12 353.6952 670.5283
## tail_length_cm tarsus_length_cm claw_length_cm
## 1 595.2706 121.65175 15.596219
## 2 104.2241 38.10844 4.305086
## 3 373.7619 68.16869 12.719697
## 4 542.5670 172.43663 14.809363
## 5 596.4419 114.05057 11.985672
## 6 563.9201 134.18051 11.005070
## 7 115.8964 42.18657 5.041634
## 8 148.3656 43.57849 4.576640
## 9 666.4246 147.44219 13.189233
## 10 140.4342 43.60513 4.248564
dbExecute(dragons_db, "CREATE TABLE diet (
diet_id INTEGER PRIMARY KEY AUTOINCREMENT,
dragon_id varchar(5),
sample_id varchar(8),
date text,
item_id integer,
item varchar(50),
FOREIGN KEY (dragon_id) REFERENCES dragons(dragon_id)
);")
diet <- read.csv("data/diet.csv")
diet$diet_id <- 1:nrow(diet)
diet <- diet[, c("diet_id", "dragon", "sample_id", "sample_dates", "item_id",
"item")]
names(diet)[c(2, 4)] <- c("dragon_id", "date")
dbWriteTable(dragons_db, "diet", diet, append = TRUE)
## diet_id dragon_id sample_id date item_id item
## 1 1 D221 D221-1 2017-11-22 1 Domestic cow
## 2 2 D221 D221-1 2017-11-22 2 Domestic goat
## 3 3 D221 D221-1 2017-11-22 3 Coyote
## 4 4 D221 D221-1 2017-11-22 4 Domestic cow
## 5 5 D221 D221-1 2017-11-22 5 Domestic cow
## 6 6 D221 D221-1 2017-11-22 6 Mule deer
## 7 7 D119 D119-1 2012-06-28 1 Mule deer
## 8 8 D119 D119-1 2012-06-28 2 Cougar
## 9 9 D119 D119-1 2012-06-28 3 Domestic goat
## 10 10 D119 D119-1 2012-06-28 4 Mountain goat
dbExecute(dragons_db, "CREATE TABLE deployments (
deployment_id INTEGER PRIMARY KEY AUTOINCREMENT,
dragon_id varchar(5),
tag_id char(6),
start_deployment text,
end_deployment text,
FOREIGN KEY(dragon_id) REFERENCES dragons(dragon_id)
FOREIGN KEY(tag_id) REFERENCES tags(tag_id)
);")
deployments <- read.csv("data/deployments.csv")
deployments$deployment_id <- 1:nrow(deployments)
deployments <- deployments[, c("deployment_id", "dragon", "tag",
"start_deploy", "end_deploy")]
names(deployments)[2:5] <- c("dragon_id", "tag_id", "start_deployment", "end_deployment")
dbWriteTable(dragons_db, "deployments", deployments, append = TRUE)
## deployment_id dragon_id tag_id start_deployment end_deployment
## 1 1 D486 N33529 2006-12-20 2007-07-04
## 2 2 D393 J47978 2008-01-13 2008-10-13
## 3 3 D88 R61684 2004-09-19 2004-12-01
## 4 4 D330 B35524 2015-11-21 2016-08-02
## 5 5 D478 T55954 2002-05-07 2002-10-26
## 6 6 D300 O31688 2006-01-20 2006-08-23
## 7 7 D380 B61925 2015-01-27 2015-10-21
## 8 8 D315 P47677 2006-03-03 2006-06-06
## 9 9 D209 P35987 2007-02-26 2007-05-31
## 10 10 D357 I41609 2013-09-09 2014-03-08
dbExecute(dragons_db, "CREATE TABLE gps_data_raw (
gps_id INTEGER PRIMARY KEY AUTOINCREMENT,
tag_id char(6),
timestamp text,
utm_x double,
utm_y double,
FOREIGN KEY(tag_id) REFERENCES tags(tag_id)
);")
gps_data_raw <- read.csv("data/telemetry_raw.csv")
gps_data_raw$gps_id <- 1:nrow(gps_data_raw)
gps_data_raw <- gps_data_raw[, c("gps_id", "tag", "timestamp", "x", "y")]
names(gps_data_raw)[c(2, 4, 5)] <- c("tag_id", "utm_x", "utm_y")
dbWriteTable(dragons_db, "gps_data_raw", gps_data_raw, append = TRUE)
## gps_id tag_id timestamp utm_x utm_y
## 1 1 N33529 2006-12-20 00:02:03 626479.9 4270480
## 2 2 N33529 2006-12-20 03:02:01 628321.6 4271404
## 3 3 N33529 2006-12-20 06:00:45 628111.1 4271812
## 4 4 N33529 2006-12-20 08:59:56 628294.4 4271940
## 5 5 N33529 2006-12-20 12:02:14 628642.2 4273090
## 6 6 N33529 2006-12-20 15:00:12 628750.1 4273879
## 7 7 N33529 2006-12-20 18:00:29 625775.1 4272723
## 8 8 N33529 2006-12-20 20:59:54 625682.6 4272723
## 9 9 N33529 2006-12-21 00:02:12 625493.0 4272917
## 10 10 N33529 2006-12-21 03:01:18 628051.1 4272672
Now we are ready to generate the gps_data
table based on information from
existing tables. We can create and populate the table in 2 steps using
dbExecute
:
dbExecute(dragons_db, "CREATE TABLE gps_data (
loc_id INTEGER PRIMARY KEY,
tag_id char(6),
dragon_id varchar(5),
timestamp text,
utm_x double,
utm_y double,
FOREIGN KEY (tag_id) REFERENCES tags(tag_id)
FOREIGN KEY (dragon_id) REFERENCES dragons(dragon_id)
);")
dbExecute(dragons_db, "INSERT INTO gps_data (
tag_id, dragon_id, timestamp, utm_x, utm_y)
SELECT
deployments.tag_id,
deployments.dragon_id,
gps_data_raw.timestamp,
gps_data_raw.utm_x,
gps_data_raw.utm_y
FROM deployments LEFT JOIN gps_data_raw USING (tag_id)
WHERE gps_data_raw.tag_id = deployments.tag_id AND
(
(
(strftime(gps_data_raw.timestamp) >= strftime(deployments.start_deployment)) AND
(strftime(gps_data_raw.timestamp) <= strftime(deployments.end_deployment))
)
OR
(
(gps_data_raw.timestamp >= deployments.start_deployment) AND
(deployments.end_deployment IS NULL)
)
);")
## loc_id tag_id dragon_id timestamp utm_x utm_y
## 1 1 N33529 D486 2006-12-20 00:02:03 626479.9 4270480
## 2 2 N33529 D486 2006-12-20 03:02:01 628321.6 4271404
## 3 3 N33529 D486 2006-12-20 06:00:45 628111.1 4271812
## 4 4 N33529 D486 2006-12-20 08:59:56 628294.4 4271940
## 5 5 N33529 D486 2006-12-20 12:02:14 628642.2 4273090
## 6 6 N33529 D486 2006-12-20 15:00:12 628750.1 4273879
## 7 7 N33529 D486 2006-12-20 18:00:29 625775.1 4272723
## 8 8 N33529 D486 2006-12-20 20:59:54 625682.6 4272723
## 9 9 N33529 D486 2006-12-21 00:02:12 625493.0 4272917
## 10 10 N33529 D486 2006-12-21 03:01:18 628051.1 4272672
7.4 A note on reproducibility
When considering the utility of interfacing a database with R, evaluating the benefits in terms of reproducibility is an important criterion. As far as the code to create the database, we could have saved a SQL script in the SQLite query editor and that would have been comparable to saving it as an R script. However, by loading the data into the database in R we eliminated the need for any pointing and clicking. We were able to fully build and populate the database using a single R script. This gives a major advantage in terms of reproducibility. Moreover, using R as an interface for the database means we can seamlessly transition from querying the data to processing it and analyzing it within the same programming framework, all while taking advantages of database-specific functionalities that R does not provide on its own.