Chapter 5 - Data Manipulation
2025-04-07
Source:vignettes/articles/1.5-ManipData.Rmd
1.5-ManipData.Rmd
Data Manipulation
You have successfully downloaded your NatureCounts dataset and are
ready to explore and summarise the data. In this chapter we will
demonstrate how to do some basic data manipulations and summaries. The
possibilities are endless, so we try to focus on examples we think would
be most valuable to users. We intend to develop collection
and protocol_id
specific data manipulation and analysis
code in the future. If you have specific requests or would like to
contribute your existing code, please contact dethier@birdscanada.org.
The code in this Chapter will not work unless you replace
"testuser"
with your actual user name. You will be prompted to enter your password.
Basic data wrangling
Recall in Chapter 2 you installed the tidyverse package, which included dplyr for data manipulations. You are encouraged to learn more about this function by reviewing the Data Transformations chapter in “R for Data Science” by Hadley Wickham and Garrett Grolemund. We also recommend you download a copy of the RStudio Data Wrangling cheat sheet as a reference document.
We are going to apply three key dplyr functions which will give you some of the basic tools needed to solve the vast majority of your data manipulation challenges. These functions include:
-
select()
: Pick variables by their names (columns) -
filter()
: Pick observations by their values (rows) -
summarise()
: Collapse many values down to a single summary (often withgroup_by()
)
Let’s continue using the full Ontario Whip-poor-will collection for this chapter.
WPWI <- nc_data_dl(collections = "WPWI", username = "testuser", info = "tutorial example")
Select
You will notice that NatureCounts datasets have many fields (i.e.,
columns) available. Generally, you will only want a few of these fields
for your summary or analysis. Recall the complete version of the BMDE includes 265 fields. The number of variables
downloaded from the BMDE will depend on the fields_set
,
which is by default “minimum” (57 fields). You can narrow in on the
variables you are interested in with the select()
function,
which allows you to subset the dataframe based on the names of the
variables.
For example, we are going to select()
a subset of
variables from the WPWI dataset that we need for our summary:
WPWI_select <- select(WPWI, "SurveyAreaIdentifier",
"RouteIdentifier", "species_id", "latitude",
"longitude", "bcr", "survey_day", "survey_month",
"survey_year")
Filter
Often there are observational records in a database that are not
needed. We could have filtered these records out using the
nc_data_dl()
filters. However, we
can also use the dplyr::filter()
function, which allows us
to subset observations based on their row values. Observations are
selected using:
- Comparison operators:
>
,>=
,<
,<=
,!=
(not equal),==
(equal) - Logical operators:
&
(and),|
(or),!
(is not)
There are worked examples provided in the naturecounts article Filtering data after download to get you started with applying filters. These include:
- Categorical filters
- Numerical filters
- Date filters
Here we provide a few additional examples for you to work with.
First, lets apply a simple filter()
that subsets the data
based on a single survey month:
WPWI_June <- filter(WPWI_select, survey_month == 6)
Now lets try multiple survey months by adding a logical operator:
WPWI_JJ <- filter(WPWI_select, survey_month == 6 | survey_month == 7)
#Alternatively this can be written as:
WPWI_JJ <- filter(WPWI_select, survey_month %in% c(6,7))
We can continue to add to the complexity of our filter:
Summarise
Now that we have selected the columns we need for our analysis we
will demonstrate how to summarise the data. The summarise()
function is most useful if paired with the group_by()
argument, because this changes the unit of analysis from the complete
dataset to individual groups.
For example, say we want to determine how many point count stops were
on each WPWI Route. We will want to group observations by
RouteIdentifier
and summarise the number of distinct
SurveyAreaIdentifier
entries using the
n_distinct()
function and creating a new column named
Nstops
.
WPWI_Route <- group_by(WPWI_select, RouteIdentifier)
WPWI_Route <- summarise(WPWI_Route, Nstops = n_distinct(SurveyAreaIdentifier))
However, now that we’re getting on to more complex operations, we’ll
introduce the use of the pipe %>%
. Piping can be used to
chain a sequence of functions, in order to avoid creating multiple
intermediate objects which can clutter our Environment! This allows you
to pass (pipe) the output of one line as input to the next line.
Therefore, the previous code can be re-written as:
WPWI_Route <- WPWI_select %>%
group_by(RouteIdentifier) %>%
summarise(Nstops = n_distinct(SurveyAreaIdentifier))
View(WPWI_Route)
Note that there is a pipe (%>%
) between each set of
lines and that the data (WPWI_select
) is only referred to
once, at the very start. These two different codes achieve the exact
same result.
Back to our example, we might also want to know how many routes were run in each year.
WPWI_Year <- WPWI_select %>%
group_by(survey_year) %>%
summarise(Nroute = n_distinct(RouteIdentifier))
View(WPWI_Year)
Finally, lets look to see how many Observations of WPWI were made in
each year. Since each row of data represents a survey point, we want to
count the number of rows containing species_id = “7871” within each
unique year. We can do this by first removing the ‘NA’ using the
filter
function, and then use the group_by
and
summarise
to determine the ‘length’ of the dataset (i.e.,
the number of rows with WPWI sightings).
Combine datasets
There are many reasons to join two datatables. For example, if you
have species observations across several years you might want to
associate these with weather covariates to assess if they are
correlated. Or you might have observations over a specific geographic
region, which you wish to associate with land cover covariates. To do
this, you will want to join (merge) two tables. Looking at the Data
Wrangling cheat sheet, you will notice there are several
join
functions. In all instances, you will need to specify
the key variable, which is found in both tables, which
is used for binding the tables together.
Here we provide an example using metadata available to you in naturecounts.
When you pull collections
with the “minimum”
fields_set
(default) you will notice you get the species_id
column, but not alpha species code. For example, let’s pull a small
amount of data from the second Alberta Breeding Bird Atlas (ABATLAS2)
from Beaverhill Lake Important Bird Area (AB001) in 2005.
AB_Atlas<-nc_data_dl(collections="ABATLAS2", region=list(iba = "AB001"), years=2005, username = "testuser", info = "tutorial example")
Notice there are 57 variables, including is a numeric species_id, but no alpha species code.
To rectify this, there are two options:
- Set the
fields_set
argument to extended.
AB_Atlas2<-nc_data_dl(collections="ABATLAS2", region=list(iba = "AB001"), years=2005, fields_set="extended", username = "testuser", info = "tutorial example")
Notice you are now pulling 285 variables, including the CommonName. However, for this particular dataset, there is still no alpha code (SpeciesCode). Most datasets will have these in the BMDE, this one does not. Lets add them using the an auxiliary table.
- Create a species code table using the available metadata and then join this to your existing data table.
First, identify the appropriate taxonomic authority using the metadata.
authority<-meta_species_authority()
View(authority)
Second, pull the species code table, and filter for the “ABATLAS2” authority.
species<-meta_species_codes()
View(species)
species<-species %>% filter(authority=="ABATLAS2")
Third, join your tables so that it includes the desired species information.
AB_Atlas_sp<-left_join(AB_Atlas, species, by="species_id")
Now your datatable contains the species code, which is familiar to most birders. You can also try joining the Atlas data to the meta_species_taxonomy() table if you wish to capture the english common, french, or scientific names, among other fields.
Helper functions
There are a few additional helper functions built into the naturecounts R package that you may find useful. At present, they include:
-
format_dates()
:Creates and adds date and day-of-year (doy) field/columns to data -
formate_zero_fill()
: Zero-fill the species presence data by adding zero observations counts (absences) to an existing NatureCounts dataset
The zero-fill function is particularly important if you want to ensure your dataset is complete!
Exercies
Exercise 1: You are doing a research project using the fall migration monitoring data collected at Vaseux Lake Bird Observatory, British Columbia. You request the open access data from 2017-2020. After you request this subset of the collection, you need to determine the number of unique days Gray catbirds were records in each year?
Answer: 2017 = 54 2018 = 53
Exercise 2: Building off the same dataset used in Exercise 1, you now want to zero-fill the dataframe to ensure it is complete for your study on Gray catbirds. How many records (rows) are in the zero-fill Gray catbird dataframe?
Answer: 152
Answers to the exercises can be found in Chapter 7.