Filtering data after download

The naturecounts package restricts users to three types of data filters in order not to overload the NatureCounts server. However, it is fairly straightforward to apply data filters after the download.

In this vignette we will go over several different types of filtering using the tidyverse collection of packages. In particular, we will make extensive use of the filter() function from the dplyr package. For more information about this function, see the chapter on Data Transformations in R for Data Science by Hadley Wickham and Garrett Grolemund.

For each example, we will show case how to filter your data either directly downloaded as a data frame, or obtained from a SQLite database.

First let’s get a collection of chickadee data to use. We’ll collect all public data on Carolina, black-capped and mountain chickadees (see the vignette on species codes) for more information on how to find species codes).

The following examples use the “testuser” user which is not available to you. You can quickly sign up for a free account of your own to access and play around with these examples. Simply replace testuser with your own username.

Setup

Packages

library(naturecounts)
library(dplyr)
library(lubridate)

Downloading data directly to a Data Frame

boreal_chickadee_df <- nc_data_dl(species = 14320, collection = "BBS", verbose = FALSE,
                                  username = "testuser", info = "nc_vignette")

Downloading data to a SQLite DataBase (chickadee.nc)

con <- nc_data_dl(species = 14320, collection = "BBS", sql_db = "boreal_chickadee", 
                  verbose = FALSE, username = "testuser", info = "nc_vignette")
boreal_chickadee_db <- tbl(con, "naturecounts")

Or, if starting a new session with an existing SQLite database, first create the connection to the database

con <- DBI::dbConnect(RSQLite::SQLite(), dbname = "boreal_chickadee.nc")
boreal_chickadee_db <- tbl(con, "naturecounts")

When using a database, you can choose to collect() your data into a data frame (in which case it will be identical to chickadee_df, but if your database is large, it will be faster to leave it as a database as long as possible.

boreal_chickadee_df <- collect(boreal_chickadee_db)

We will also use a helper function to add columns for date and doy (day of year) to our data.

For data frames, you directly modify the data frame object.

boreal_chickadee_df <- format_dates(boreal_chickadee_df, overwrite = TRUE)

For databases, you’ll modify the connection, then update your table reference.

Note: This results in on-disk additions to the SQLite database. If you do not wish to make add fields to your SQLite database, you will have to collect() the data into a data frame first.

con <- format_dates(con)
boreal_chickadee_db <- tbl(con, "naturecounts")

Not all dates can be created if survey_year, survey_month or survey_day are missing.

filter(boreal_chickadee_df, is.na(date)) |>
  select(survey_year, survey_month, survey_day, date)

## # A tibble: 4 × 4
##   survey_year survey_month survey_day date  
##         <int>        <int>      <int> <date>
## 1        2001           NA         NA NA    
## 2        2007           NA         NA NA    
## 3        2007           NA         NA NA    
## 4        1987           NA         NA NA

Unless stated otherwise, in the following examples, we demonstrate code that works for both data frames and databases.

Categorical filters

The simplest way to filter your data is to pick out categories to include or exclude. In each case there are three steps.

What are the categories available?
Filter the categories you a) wish to keep, or b) wish to exclude
Confirm the results

These steps work for categories such as collection, species, survey_year, iba_site, bcr or any variable that can be expressed as a category.

What categories are available?

pull() grabs the column of interest, unique() returns only unique values.

boreal_chickadee_db |>
  select(collection) |>
  distinct()

## # Source:   SQL [1 x 1]
## # Database: sqlite 3.47.1 [/home/runner/work/naturecounts/naturecounts/vignettes/boreal_chickadee.nc]
##   collection
##   <chr>     
## 1 BBS

2a) Filter the categories you wish to keep

alberta_atlas <- filter(boreal_chickadee_db, 
                        collection %in% c("ABATLAS1", "ABATLAS2"))

2b) Filter the categories you wish to exclude with !

alberta_atlas <- filter(boreal_chickadee_db, 
                        !collection %in% c("ABBIRDRECS", "RCBIOTABASE"))

Confirm

alberta_atlas |>
  select(collection) |>
  distinct()

## # Source:   SQL [1 x 1]
## # Database: sqlite 3.47.1 [/home/runner/work/naturecounts/naturecounts/vignettes/boreal_chickadee.nc]
##   collection
##   <chr>     
## 1 BBS

Numerical filters

Another way to filter your data is to specify numerical ranges include or exclude. In each case there are the same three steps as above.

What is the numerical range available?
Filter the variable to the range you a) wish to keep, or b) wish to exclude
Confirm the results

These steps work for categories such as survey_year, survey_month, survey_day, doy or any variable that can be expressed as a number.

What categories are available?

pull() grabs the column of interest, summary() returns a numerical summary of the values.

boreal_chickadee_db |>
  select(survey_year) |>
  summarize(min = min(survey_year, na.rm = TRUE), 
            max = max(survey_year, na.rm = TRUE))

## # Source:   SQL [1 x 2]
## # Database: sqlite 3.47.1 [/home/runner/work/naturecounts/naturecounts/vignettes/boreal_chickadee.nc]
##     min   max
##   <int> <int>
## 1  1966  2010

2a) Filter the variable to the range you wish to keep (note that the , is an implied &)

recent <- filter(boreal_chickadee_db, survey_year >= 2005)

2b) Filter the variable to the range you wish to exclude with ! (note that here we need to use the & explicitly)

recent <- filter(boreal_chickadee_db, !(survey_year >= 1960 & survey_year < 2000))

Confirm

recent |>
  select(survey_year) |>
  summarize(min = min(survey_year, na.rm = TRUE), 
            max = max(survey_year, na.rm = TRUE))

## # Source:   SQL [1 x 2]
## # Database: sqlite 3.47.1 [/home/runner/work/naturecounts/naturecounts/vignettes/boreal_chickadee.nc]
##     min   max
##   <int> <int>
## 1  2000  2010

Date filters

Filtering by dates is usually slightly more complex, but fortunately, the filter() function automatically detects dates which means it’s pretty much the same as a numerical filter.

What categories are available?

pull() grabs the column of interest, summary() returns a summary of the values.

boreal_chickadee_db |>
  select(date) |>
  summarize(min = min(date, na.rm = TRUE), 
            max = max(date, na.rm = TRUE))

## # Source:   SQL [1 x 2]
## # Database: sqlite 3.47.1 [/home/runner/work/naturecounts/naturecounts/vignettes/boreal_chickadee.nc]
##   min        max       
##   <chr>      <chr>     
## 1 1966-06-02 2010-07-07