Chapter 5 Data Manipulation
You have successfully downloaded your NatureCounts dataset and are ready to explore and summarise the data. In this chapter we will demonstrate how to do some basic data manipulations and summaries. The possibilities are endless, so we try to focus on examples we think would be most valuable to users. We intend to develop collection
and protocol_id
specific data manipulation and analysis code in the future. If you have specific requests or would like to contribute your existing code, please contact dethier@birdscanada.org.
The code in this Chapter will not work unless you replace
"testuser"
with your actual user name. You will be prompted to enter your password.
5.1 Basic data wrangling
Recall in Chapter 2 you installed the tidyverse package, which included dplyr for data manipulations. You are encouraged to learn more about this function by reviewing the Data Transformations chapter in “R for Data Science” by Hadley Wickham and Garrett Grolemund. We also recommend you download a copy of the RStudio Data Wrangling cheat sheet as a reference document.
We are going to apply three key dplyr functions which will give you some of the basic tools needed to solve the vast majority of your data manipulation challenges. These functions include:
select()
: Pick variables by their names (columns)filter()
: Pick observations by their values (rows)summarise()
: Collapse many values down to a single summary (often withgroup_by()
)
Let’s continue using the full Ontario Whip-poor-will collection for this chapter.
<- nc_data_dl(collections = "WPWI", username = "testuser", info = "tutorial example") WPWI
5.1.1 Select
You will notice that NatureCounts datasets have many fields (i.e., columns) available. Generally, you will only want a few of these fields for your summary or analysis. Recall the complete version of the BMDE includes 265 fields. The number of variables downloaded from the BMDE will depend on the fields_set
, which is by default “minimum” (57 fields). You can narrow in on the variables you are interested in with the select()
function, which allows you to subset the dataframe based on the names of the variables.
For example, we are going to select()
a subset of variables from the WPWI dataset that we need for our summary:
<- select(WPWI, "SurveyAreaIdentifier",
WPWI_select "RouteIdentifier", "species_id", "latitude",
"longitude", "bcr", "survey_day", "survey_month",
"survey_year")
5.1.2 Filter
Often there are observational records in a database that are not needed. We could have filtered these records out using the nc_data_dl()
filters. However, we can also use the dplyr::filter()
function, which allows us to subset observations based on their row values. Observations are selected using:
- Comparison operators:
>
,>=
,<
,<=
,!=
(not equal),==
(equal) - Logical operators:
&
(and),|
(or),!
(is not)
There are worked examples provided in the naturecounts article Filtering data after download to get you started with applying filters. These include:
- Categorical filters
- Numerical filters
- Date filters
Here we provide a few additional examples for you to work with. First, lets apply a simple filter()
that subsets the data based on a single survey month:
<- filter(WPWI_select, survey_month == 6) WPWI_June
Now lets try multiple survey months by adding a logical operator:
<- filter(WPWI_select, survey_month == 6 | survey_month == 7)
WPWI_JJ
#Alternatively this can be written as:
<- filter(WPWI_select, survey_month %in% c(6,7)) WPWI_JJ
We can continue to add to the complexity of our filter:
<- filter(WPWI_select,
WPWI_multi %in% c(6,7) & bcr == 12 &
survey_month >= 2010 & survey_year <= 2012) survey_year
5.1.3 Summarise
Now that we have selected the columns we need for our analysis we will demonstrate how to summarise the data. The summarise()
function is most useful if paired with the group_by()
argument, because this changes the unit of analysis from the complete dataset to individual groups.
For example, say we want to determine how many point count stops were on each WPWI Route. We will want to group observations by RouteIdentifier
and summarise the number of district SurveyAreaIdentifier
.
<- group_by(WPWI_select, RouteIdentifier)
WPWI_Route <- summarise(WPWI_Route, Nstops = n_distinct(SurveyAreaIdentifier)) WPWI_Route
However, now that we’re getting on to more complex operations, we’ll introduce the use of the pipe %>%
. This allows you to pass (pipe) the output of one line as input to the next line. Therefore, the previous code can be re-written as:
<- WPWI_select %>%
WPWI_Route group_by(RouteIdentifier) %>%
summarise(Nstops = n_distinct(SurveyAreaIdentifier))
View(WPWI_Route)
Note that there is a pipe (%>%
) between each set of lines and that the data (WPWI_select
) is only referred to once, at the very start. These two different codes achieve the exact same result.
Back to our example, we might also want to know how many routes where run in each year.
<- WPWI_select %>%
WPWI_Year group_by(survey_year) %>%
summarise(Nroute = n_distinct(RouteIdentifier))
View(WPWI_Year)
Finally, lets look to see how many Observations of WPWI were made in each year. Since each row of data represents a survey point, we want to count the number of rows containing species_id = “7871” within each unique year. We can do this by first removing the ‘NA’ using the filter
function, and then use the group_by
and summarise
to determine the ‘length’ of the dataset (i.e., the number of rows with WPWI sightings).
<- WPWI_select %>% filter (species_id != "NA") %>%
WPWI_Obs group_by(survey_year) %>%
summarise(SumObs=length(species_id))
View(WPWI_Obs)
5.2 Combine datasets
There are many reasons to join two datatables. For example, if you have species observations across several years you might want to associate these with weather covariates to assess if they are correlated. Or you might have observations over a specific geographic region, which you wish to associate with land cover covariates. To do this, you will want to join (merge) two tables. Looking at the Data Wrangling cheat sheet, you will notice there are several join
functions. In all instances, you will need to specify the key variable, which is found in both tables, which is used for binding the tables together.
Here we provide an example using metadata available to you in naturecounts.
When you pull collections
with the “minimum” fields_set
(default) you will notice you get the species_id column, but not alpha species code. For example, let’s pull a small amount of data from the second Alberta Breeding Bird Atlas (ABATLAS2) from Beaverhill Lake Important Bird Area (AB001) in 2005.
<-nc_data_dl(collections="ABATLAS2", region=list(iba = "AB001"), years=2005, username = "testuser", info = "tutorial example") AB_Atlas
Notice there are 57 variables, including is a numeric species_id, but no alpha species code.
To rectify this, there are two options:
- Set the
fields_set
argument to extended.
<-nc_data_dl(collections="ABATLAS2", region=list(iba = "AB001"), years=2005, fields_set="extended", username = "testuser", info = "tutorial example") AB_Atlas2
Notice you are now pulling 285 variables, including the CommonName. However, for this particular dataset, there is still no alpha code (SpeciesCode). Most datasets will have these in the BMDE, this one does not. Lets add them using the an auxiliary table.
- Create a species code table using the available metadata and then join this to your existing data table.
First, identify the appropriate taxonomic authority using the metadata.
<-meta_species_authority()
authorityView(authority)
Second, pull the species code table, and filter for the “ABATLAS2” authority.
<-meta_species_codes()
speciesView(species)
<-species %>% filter(authority=="ABATLAS2") species
Third, join your tables so that it includes the desired species information.
<-left_join(AB_Atlas, species, by="species_id") AB_Atlas_sp
Now your datatable contains the species code, which is familar to most birders. You can also try joining the Atlas data to the meta_species_taxonomy() table if you wish to capture the english common, french, or scientific names, amoung other fields.
5.3 Helper functions
There are a few additional helper functions built into the naturecounts R package that you may find useful. At present, they include:
format_dates()
:Creates and adds date and day-of-year (doy) field/columns to dataformate_zero_fill()
: Zero-fill the species presence data by adding zero observations counts (absences) to an existing NatureCounts dataset
The zero-fill function is particularly important if you want to ensure your dataset is complete!
5.4 Exercies
Exercise 1: You are doing a research project using the fall migration monitoring data collected at Vaseux Lake Bird Observatory, British Columbia. You request the open access data from 2017-2020. After you request this subset of the collection, you need to determine the number of unique days Gray catbirds were records in each year?
Answer: 2017 = 54 2018 = 53
Exercise 2: Building off the same dataset used in Exercise 1, you now want to zero-fill the dataframe to ensure it is complete for your study on Gray catbirds. How many records (rows) are in the zero-fill Gray catbird dataframe?
Answer: 152
Answers to the exercises can be found in Chapter 7.