The naturecounts
package restricts users to three types
of data filters in order not to overload the NatureCounts server.
However, it is fairly straightforward to apply data filters after the
download.
In this vignette we will go over several different types of filtering
using the tidyverse
collection of packages. In particular,
we will make extensive use of the filter()
function from
the dplyr
package. For more information about this
function, see the chapter on Data
Transformations in R for Data Science by Hadley Wickham and Garrett
Grolemund.
For each example, we will show case how to filter your data either directly downloaded as a data frame, or obtained from a SQLite database.
First let’s get a collection of chickadee data to use. We’ll collect all public data on Carolina, black-capped and mountain chickadees (see the vignette on species codes) for more information on how to find species codes).
The following examples use the “testuser” user which is not available to you. You can quickly sign up for a free account of your own to access and play around with these examples. Simply replace
testuser
with your own username.
Setup
Packages
Downloading data directly to a Data Frame
boreal_chickadee_df <- nc_data_dl(species = 14320, collection = "BBS", verbose = FALSE,
username = "testuser", info = "nc_vignette")
Downloading data to a SQLite DataBase
(chickadee.nc
)
con <- nc_data_dl(species = 14320, collection = "BBS", sql_db = "boreal_chickadee",
verbose = FALSE, username = "testuser", info = "nc_vignette")
boreal_chickadee_db <- tbl(con, "naturecounts")
Or, if starting a new session with an existing SQLite database, first create the connection to the database
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = "boreal_chickadee.nc")
boreal_chickadee_db <- tbl(con, "naturecounts")
When using a database, you can choose to collect()
your
data into a data frame (in which case it will be identical to
chickadee_df
, but if your database is large, it will be
faster to leave it as a database as long as possible.
boreal_chickadee_df <- collect(boreal_chickadee_db)
We will also use a helper function to add columns for
date
and doy
(day of year) to our data.
For data frames, you directly modify the data frame object.
boreal_chickadee_df <- format_dates(boreal_chickadee_df, overwrite = TRUE)
For databases, you’ll modify the connection, then update your table reference.
Note: This results in on-disk additions to the
SQLite database. If you do not wish to make add fields to your SQLite
database, you will have to collect()
the data into a data
frame first.
con <- format_dates(con)
boreal_chickadee_db <- tbl(con, "naturecounts")
Not all dates can be created if survey_year
,
survey_month
or survey_day
are missing.
## # A tibble: 4 × 4
## survey_year survey_month survey_day date
## <int> <int> <int> <date>
## 1 2001 NA NA NA
## 2 2007 NA NA NA
## 3 2007 NA NA NA
## 4 1987 NA NA NA
Unless stated otherwise, in the following examples, we demonstrate code that works for both data frames and databases.
Categorical filters
The simplest way to filter your data is to pick out categories to include or exclude. In each case there are three steps.
- What are the categories available?
- Filter the categories you a) wish to keep, or b) wish to exclude
- Confirm the results
These steps work for categories such as collection
,
species
, survey_year
, iba_site
,
bcr
or any variable that can be expressed as a
category.
- What categories are available?
## # Source: SQL [1 x 1]
## # Database: sqlite 3.47.1 [/home/runner/work/naturecounts/naturecounts/vignettes/boreal_chickadee.nc]
## collection
## <chr>
## 1 BBS
2a) Filter the categories you wish to keep
2b) Filter the categories you wish to exclude with !
- Confirm
## # Source: SQL [1 x 1]
## # Database: sqlite 3.47.1 [/home/runner/work/naturecounts/naturecounts/vignettes/boreal_chickadee.nc]
## collection
## <chr>
## 1 BBS
Numerical filters
Another way to filter your data is to specify numerical ranges include or exclude. In each case there are the same three steps as above.
- What is the numerical range available?
- Filter the variable to the range you a) wish to keep, or b) wish to exclude
- Confirm the results
These steps work for categories such as survey_year
,
survey_month
, survey_day
, doy
or
any variable that can be expressed as a number.
- What categories are available?
boreal_chickadee_db |>
select(survey_year) |>
summarize(min = min(survey_year, na.rm = TRUE),
max = max(survey_year, na.rm = TRUE))
## # Source: SQL [1 x 2]
## # Database: sqlite 3.47.1 [/home/runner/work/naturecounts/naturecounts/vignettes/boreal_chickadee.nc]
## min max
## <int> <int>
## 1 1966 2010
2a) Filter the variable to the range you wish to keep (note that the
,
is an implied &
)
recent <- filter(boreal_chickadee_db, survey_year >= 2005)
2b) Filter the variable to the range you wish to exclude with
!
(note that here we need to use the &
explicitly)
recent <- filter(boreal_chickadee_db, !(survey_year >= 1960 & survey_year < 2000))
- Confirm
recent |>
select(survey_year) |>
summarize(min = min(survey_year, na.rm = TRUE),
max = max(survey_year, na.rm = TRUE))
## # Source: SQL [1 x 2]
## # Database: sqlite 3.47.1 [/home/runner/work/naturecounts/naturecounts/vignettes/boreal_chickadee.nc]
## min max
## <int> <int>
## 1 2000 2010
Date filters
Filtering by dates is usually slightly more complex, but fortunately,
the filter()
function automatically detects dates which
means it’s pretty much the same as a numerical filter.
- What categories are available?
boreal_chickadee_db |>
select(date) |>
summarize(min = min(date, na.rm = TRUE),
max = max(date, na.rm = TRUE))
## # Source: SQL [1 x 2]
## # Database: sqlite 3.47.1 [/home/runner/work/naturecounts/naturecounts/vignettes/boreal_chickadee.nc]
## min max
## <chr> <chr>
## 1 1966-06-02 2010-07-07
2a) Filter the variable to the range you wish to keep (note that the
,
is an implied &
)
field_season <- filter(boreal_chickadee_db, date >= "2010-05-10", date <= "2010-09-21")
2b) Filter the variable to the range you wish to exclude with
!
(note that here we need to use the &
explicitly)
omit_field_season <- filter(boreal_chickadee_db, !(date >= "2010-05-10" & date <= "2010-09-21"))
- Confirm
field_season |>
select(date) |>
summarize(min = min(date, na.rm = TRUE),
max = max(date, na.rm = TRUE))
## # Source: SQL [1 x 2]
## # Database: sqlite 3.47.1 [/home/runner/work/naturecounts/naturecounts/vignettes/boreal_chickadee.nc]
## min max
## <chr> <chr>
## 1 2010-05-30 2010-07-07