Chapter 4 - Downloading and Filtering Data
2025-04-07
Source:vignettes/articles/1.4-DownloadData.Rmd
1.4-DownloadData.Rmd
Downloading and Filtering Data
This chapter jumps right into downloading NatureCounts data, which builds directly on the skills developed in Chapter 3.
Downloading NatureCounts data
To download data you need to sign up for a free account.
Let’s try fetching observational data across multiple collections
into a local data frame. To do so, we will use the
nc_data_dl()
download function. There is lots of data here,
so this could take a couple minutes.
First find the species id for the target species:
search_species("American Bittern")
## # A tibble: 1 × 5
## species_id scientific_name english_name french_name taxon_group
## <int> <chr> <chr> <chr> <chr>
## 1 2490 Botaurus lentiginosus American Bittern Butor d'Amérique BIRDS
Then use this id with nc_data_dl()
:
bittern <- nc_data_dl(species = 2490, username = "testuser", info = "tutorial example")
The bittern dataframe should now appear in the upper right panel of
RStudio under the Environment tab. Notice we specified the
info
parameter which is a short description of what the
data is being downloaded for. This does not need to be specified when
using the nc_count()
view function.
The code in this Chapter will not work unless you replace
"testuser"
with your actual user name. You will be prompted to enter your password.
Next we demonstrate how to download the Ontario Whip-poor-will collection (WPWI), since it is Open Access (Level 5) and contains relatively few records (n = 3012). Click here for more information on this dataset.
WPWI <- nc_data_dl(collections = "WPWI", username = "testuser", info = "tutorial example")
Our data download was more specific, and way faster! This will depend
on your research goals and data needs. Instead of specifying species id,
we filtered our data download by using the species alpha-numeric code
and the collections
argument.
To access Level 3 or 4 collections you must sign
up for a free account and request permission from
the data custodian. For a complete list of datasets and access level,
visit the NatureCounts
datasets page or use the meta_collection()
function.
Here you can browse information on each dataset prior to requesting
access.
To make a data request, use the NatureCounts Download Data
query tool. For step-by-step visual instructions, we encourage you
to watch: NatureCounts: An
Introductory Tutorial. You will receive an email confirmation when
your request has been approved, which will contain your
request_id
. This number will be used to download your newly
acquired dataset into R. There is also a build in function that allows
you to check the status of your request and retrieve your
request_id
:
nc_requests(username = "testuser")
Here is sample code to download Access Level 3 or 4 data:
my_data <- nc_data_dl(request_id = 000000, username = "USER",
info = "MY REASON")
Note: This code is not functional and is here to serve as a
structural example. You will need to insert your
request_id
, username
, and info
in
the code chunk above to make this work. If you applied filters to the
web portal data request (e.g., species, region, year), you will only
receive the subset of the dataset you requested.
Applying Filters
The filters applied in Chapter 3 when using the
nc_count()
view function are also used for the
nc_data_dl()
download function. Again, these options
include: collections
, project_id
,
species
, years
, doy
(day-of-year), region
, and site_type
. You may
also wish to specify which fields/columns to return using the
field_set
and fields
options in the
nc_data_dl()
function. For help with this feature, see the
naturecounts article ‘Selecting
columns and fields to download’.
Please review the resources provided on filter metadata prior to proceeding.
The users can specify up to 3 filter options in the download process
and should try to limit redundancies in filters. In most cases, one of
these options will be collections
, since authorization is
given independently for each dataset (i.e., collection). If the user
chooses to select species
, then only collections the user
has authorization to access will be used to retrieve
species-specific data. This differs from the nc_count()
function, which by default shows you all the records available
in each collection.
Examples
Here are a few examples for you to work through to become familiar
with the nc_data_dl()
function.
Example 1: We will use the Ontario Whip-poor-will collection (WPWI) again for this example. After reviewing the online material, here, you notice that the main survey ran from 2010 to 2012, but that additional surveys were conducted between 2009 and 2013. For your research, you only want data collected during the main survey window. Further, the survey was Ontario-wide, but most effort and records are from southern Ontario. Let’s further limit the data to those collected south of the Canadian Shield, which is approximated with BCR 13.
WPWI_filter <- nc_data_dl(collections = "WPWI", years = c(2010, 2012),region = list(bcr = "13"), username = "testuser", info = "tutorial example")
You will notice that the number of records in the filtered WPWI
download (WPWI_filter
= 754) is substantially less than the
number of records in the full dataset (WPWI
= 3012)
downloaded in Chapter 3.
Example 2: Rather than using BCR 13 to approximate the study area, you are interested in just looking at records collected on the Bruce Peninsula for the full time period. You decide to use spatial data to filter observations. Specifically, you create a bounding box, using the Within Coordinates tool to help you retrieve custom coordinates for your data query and/or download.
WPWI_bp <- nc_data_dl(collections = "WPWI",
region = list(bbox = c(left = -81.7, bottom = 44.5,
right = -80.9, top = 45.3)),
username = "testuser", info = "tutorial example")
You will notice there are now even fewer observation records
(WPWI_bp
= 72)
Reminder that, through the naturecounts R package, data is downloaded with a specific set of fields/columns by default. However, for more advanced applications, users may wish to specify which fields/columns to return.
Exercises
Now apply your newly acquired skills!
Exercise 1: You are from the Northwest Territories and
interested in learning more about breeding birds in your region. First,
you identify the NatureCounts dataset most suitable for this exercise
using the nc_count()
function. Next, you decide to focus
your download to only include Blackpoll Warbler data collected over the
past 5 years (2015-2020). How many observation records did you
download?
Answer: BBS50-CAN, 1756
Exercise 2: You are birding in the Beaverhill Lake Important Bird Area (iba) in May. You think you hear a Bobolink! You are curious if this species has been detected here in the month of May. You choose to download the records you have authorization to freely access from NatureCounts. How many observation records did you download? What year are these records from?
Answer: 3, 1988 & 1987
Answers to the exercises can be found in Chapter 7.