Movebank Part I: database summary and downloading a study
Caryn Johansen
The next biological data source that we are going to investigate is Movebank. They describe themselves best:
We help animal tracking researchers to manage, share, protect, analyze, and archive their data. The animal tracking data in Movebank belongs to researchers all over the world who choose whether and how to share their data with the public.
Movebank is hosted by the Max Planck Institute for Ornithology, and, anecdotally, many of the data sets available are of bird species. The general model of Movebank is that the researchers who submit their data retain ownership and can choose whether to make their data publicly available. The owners of the data also specify the terms of use.
This is a little different from some other databases I’ve seen, but the data involved can often represent years of work in the field. It also made accessing and interacting with the data on a bulk scale difficult, next to impossible, but we’ll get to that.
The data in Movebank is cool. I was inspired by the book Where Animals Go, which one of my advisers in graduate school gave to me as a parting gift. It’s an incredible book and I was struck by how beautiful and informative the tracking data for animals was. It’s an intimate look at how animals behave, and presents some cool opportunities for data visualization, statistical analysis, and modeling.
The breakdown of accessing and analyzing Movebank data will take place in several parts. Part I, this post, will discuss how to access and download data from Movebank, and show what type of data variables you can expect to find. Later parts will discuss parsing date and time data, and increase the complexity of data analysis and modeling.
Become a Movebank user
In order to download data from Movebank directly, you must be a user. Becoming a user is free, and you can sign up here.
So that you can follow along with this series of posts, we will make the data available for the rest of the tutorial.
The data we’re using falls under the general Movebank Terms of Use, which you can read here. We are not using this data for any commercial use, and only intend it to show what Movebank has to offer, and to show some cool R libraries for data analysis and visualizations.
Accessing Movebank Data
Installing move
Note: If you just want to download the data and don’t care about how we accessed Movebank, skip to the summarizing data section. We just use move to download data, and not for analysis.
To access data on Movebank, you can interact with the Movebank Tracking Data Map, which is quite fun.
And there is also an R package that will interact with the website called, appropriately, move.
However, it has a dependency that was not straightforward for me to download. There is an R package called rgdal, which requires the software GDAL, Geospatial Data Abstraction Library.
To install this, I used Homebrew, which is an incredible tool for installing software, if you are not familiar with it.
I used the Homebrew formula for GDAL. Type in your terminal (not in R):
brew install gdal
This worked well for me, and after this I was able to install rgdal and move without a problem:
install.packages("rgdal")
install.packages("move")
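If you rerun a script like this often, you may not want to reinstall the packages every time. Here is a minimal sketch using only base R. Note that missing_packages() is a helper name made up for this post, not part of any package, and the install line is left commented out on purpose.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("rgdal", "move"))
if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to actually install
}
```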
Post any issues you have installing these libraries in the comments below and we’ll try to help you out!
If everything has been installed correctly - load the move library:
library(move)
Getting to know move
As Ciera did with neomata and taxize, I’m going to take a few moments to show what I learned while getting familiar with the library move.
This is a pretty big part of becoming good at certain functions and libraries - to read the manuals, do any provided or discovered tutorials, and to just play with the library.
move provides the ability to access Movebank, which is useful, and it also uses other libraries and GDAL to analyze the GPS data, which is new to me.
Here’s a link to the manual.
Create an object to store your Movebank login information.
# fill in your own username and password here (as quoted strings)
loginStored <- movebankLogin(username = "<username>", password = "<password>")
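If you would rather not type your password into a script that might end up in version control, one option is to read it from environment variables. This is only a sketch: the helper name and the variable names MOVEBANK_USER and MOVEBANK_PASS are made up for this post, and the move call is left as a comment because it needs the package installed.

```r
# Hypothetical helper: read Movebank credentials from environment variables.
# The variable names MOVEBANK_USER / MOVEBANK_PASS are made up for this sketch.
read_movebank_credentials <- function() {
  user <- Sys.getenv("MOVEBANK_USER", unset = "")
  pass <- Sys.getenv("MOVEBANK_PASS", unset = "")
  if (!nzchar(user) || !nzchar(pass)) {
    stop("Set MOVEBANK_USER and MOVEBANK_PASS first, e.g. in ~/.Renviron")
  }
  list(username = user, password = pass)
}

# Usage (needs the move package):
# creds <- read_movebank_credentials()
# loginStored <- move::movebankLogin(username = creds$username,
#                                    password = creds$password)
```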
The organization of the data in Movebank is centralized around each study, and researchers can upload their data from various types of sensors. We are going to be focusing on GPS sensors, but there are also accelerometers, radio transmitters, Argos Doppler, barometers, bird rings (I don’t know what those are but they sound cool), solar geolocators, and good ol’ natural mark recording with human eyeballs.
It’s useful to know the scope of the database you are searching.
I initially thought I could get clever and get all the animals found in Movebank by getting all the study ids (with a combination of the functions getMovebankStudies() and getMovebankID()), but you cannot get around their data sharing policy that way.
For any study you want to download, you must first go online and agree to the terms for using that data before completing the download here in R.
But we can get basic information on all of the studies that allow some kind of access to their data using the getMovebank() function:
all_studies <- getMovebank(entity_type = "study", login=loginStored)
dim(all_studies)
## [1] 2413 28
head(all_studies$name)
## [1] Army Ant colony movement Touchton
## [2] Christmas Island Red Crabs
## [3] Sooty Falcon Javed UAE
## [4] bird
## [5] Dalmatian Pelicans Ortaç Onmuş Turkey
## [6] Taita Hills birds project, Kenya
## 2350 Levels: _dcd_test_dup_location_1sec_offset ...
Hmmm, looking at some of these study names, the authors of those studies don’t seem to have practiced any type of repeating, standard naming procedure. That would make it difficult to parse out all animal names - and so could be a project for a later date.
Something we can do is map where these studies generally take place.
library(ggplot2)
world <- map_data("world")
ggplot() +
  geom_polygon(data = world,
               fill = "grey38",
               aes(x = long, y = lat, group = group)) +
  geom_point(data = all_studies,
             aes(x = main_location_long, y = main_location_lat,
                 color = as.factor(there_are_data_which_i_cannot_see))) +
  scale_color_discrete(guide = guide_legend(title = "Data Restrictions"))
## Warning: Removed 3 rows containing missing values (geom_point).
The color of these points indicates whether there are any restrictions on the availability of the data. Anything labeled false has all of its data available. There are 435 studies that have all of their data available for download (after agreeing to the general Movebank Terms of Use).
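A count like this can be obtained by tabulating the restriction column. Here is a sketch on a toy stand-in for all_studies. Only the column name comes from the real table, and I am assuming the flag is stored as the strings "true" and "false", which may not match what you get back.

```r
# Toy stand-in for all_studies; only the column name comes from the real table.
toy_studies <- data.frame(
  name = c("study A", "study B", "study C", "study D"),
  there_are_data_which_i_cannot_see = c("false", "true", "true", "false"),
  stringsAsFactors = FALSE
)

# Number of studies per restriction flag
table(toy_studies$there_are_data_which_i_cannot_see)

# Keep only the studies whose data are fully visible
# (assumes the flag is the string "false", not a logical FALSE)
is_open <- toy_studies$there_are_data_which_i_cannot_see == "false"
open_studies <- toy_studies[is_open, ]
nrow(open_studies)
```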
We can also examine the different types of sensors used in Movebank:
getMovebank("tag_type", loginStored)
## description external_id id is_location_sensor
## 1 NA bird-ring 397 true
## 2 NA gps 653 true
## 3 NA radio-transmitter 673 true
## 4 NA argos-doppler-shift 82798 true
## 5 NA natural-mark 2365682 true
## 6 NA acceleration 2365683 false
## 7 NA solar-geolocator 3886361 true
## 8 NA accessory-measurements 7842954 false
## 9 NA solar-geolocator-raw 9301403 false
## 10 NA barometer 77740391 false
## 11 NA magnetometer 77740402 false
## name
## 1 Bird Ring
## 2 GPS
## 3 Radio Transmitter
## 4 Argos Doppler Shift
## 5 Natural Mark
## 6 Acceleration
## 7 Solar Geolocator
## 8 Accessory Measurements
## 9 Solar Geolocator Raw
## 10 Barometer
## 11 Magnetometer
This returns a table with the different types of sensors, and a numeric tag assigned to each sensor. Unfortunately, at this time I don’t see how to filter the studies by sensor type without first selecting a specific study. Which is a bummer, because I’m pretty curious about these bird-ring sensors, and would like to filter the studies based on that!
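Once a single study has been selected, though, the ids in this table can be used to label its records by sensor. A sketch with toy data: the tag_types values are copied from the table above, toy_animals is made up, and the sensor_type_id column name is borrowed from the per-study output shown later in this post.

```r
# Values copied from the tag_type table printed above
tag_types <- data.frame(
  id = c(397, 653, 673),
  name = c("Bird Ring", "GPS", "Radio Transmitter"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the per-study animal table (made-up rows)
toy_animals <- data.frame(
  local_identifier = c("A1", "A2", "A3"),
  sensor_type_id = c(653, 653, 397),
  stringsAsFactors = FALSE
)

# Look up the sensor name for each record by matching ids
idx <- match(toy_animals$sensor_type_id, tag_types$id)
toy_animals$sensor_name <- tag_types$name[idx]

# Keep only the bird-ring records
bird_ring <- toy_animals[toy_animals$sensor_name == "Bird Ring", ]
bird_ring
```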
Besides filtering the downloaded database above, we can also use the searchMovebankStudies() function to search for terms in study names.
searchMovebankStudies(x="coyote", login=loginStored)
## [1] "Site fidelity in cougars and coyotes, Utah/Idaho USA (data from Mahoney et al. 2016)"
head(searchMovebankStudies(x="oose", loginStored))
## [1] "ABoVE: NPS Moose in the Upper Koyokuk Alaska"
## [2] "ABoVE: Peters Hebblewhite Alberta-BC Moose"
## [3] "Barnacle goose (Greenland) Larry Griffin/David Cabot"
## [4] "Barnacle goose (Svalbard) Larry Griffin"
## [5] "Bean Goose Anser fabalis Finnmark."
## [6] "Bean Goose QIA KoEco China 2015 FAO"
head(searchMovebankStudies(x="bird", login = loginStored))
## [1] "2018 Magnificent Frigatebird - Cayman Islands "
## [2] "African riparian birds Kenya"
## [3] "AKseabird"
## [4] "All CTT birds deployed BOEM"
## [5] "Antbirds, Panama"
## [6] "Arctic breeding shorebirds; Rausch; various Canadian arctic locations"
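The other route mentioned above, filtering the downloaded table yourself, can be done with base R’s grepl() on the study names. A small sketch on a handful of names copied from the outputs in this post (with the real table you would pass all_studies$name instead):

```r
# A few study names copied from the output above
study_names <- c(
  "Army Ant colony movement Touchton",
  "Christmas Island Red Crabs",
  "Barnacle goose (Svalbard) Larry Griffin",
  "ABoVE: NPS Moose in the Upper Koyokuk Alaska"
)

# Case-insensitive match on a substring of the name
hits <- study_names[grepl("oose", study_names, ignore.case = TRUE)]
hits
```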
So many studies to choose from!
Get the study ID to see how many animals are in the study.
ID <- getMovebankID(study = "Black-backed jackal, Etosha National Park, Namibia", login=loginStored)
head(getMovebankAnimals(study = ID, login = loginStored))
## individual_id tag_id sensor_type_id tag_local_identifier
## 1 346406120 304884475 653 AU200
## 2 346406121 304884486 653 AU205
## 3 346406122 304884490 653 AU206
## 4 346406123 304884493 653 AU215
## 5 346406124 304884496 653 AU210
## 6 346406125 304884499 653 AU211
## deployment_id comments death_comments earliest_date_born
## 1 346406151 NA NA NA
## 2 346406161 NA NA NA
## 3 346406150 NA NA NA
## 4 346406167 NA NA NA
## 5 346406158 NA NA NA
## 6 346406156 NA NA NA
## exact_date_of_birth latest_date_born local_identifier nick_name ring_id
## 1 NA NA CM05 NA NA
## 2 NA NA CM08 NA NA
## 3 NA NA CM09 NA NA
## 4 NA NA CM10 NA NA
## 5 NA NA CM11 NA NA
## 6 NA NA CM15 NA NA
## sex taxon_canonical_name animalName
## 1 m NA CM05_346406151
## 2 m NA CM08_346406161
## 3 m NA CM09_346406150
## 4 m NA CM10_346406167
## 5 m NA CM11_346406158
## 6 m NA CM15_346406156
I like this study because it has many individuals, and their sex is identified. Many studies on Movebank cover only a small number of animals, one or two, and information about the animals is not always available. Also, I’ve always kind of liked jackals, in a romantic, scrappy under-dog way. I really like coyotes, and they kind of seem like coyotes to me. Good enough! Let’s download.
# be patient
jackals <- getMovebankData(study = "Black-backed jackal, Etosha National Park, Namibia", login = loginStored, removeDuplicatedTimestamps = TRUE)
If you’re following along, be aware that this was a pretty slow step for me.
Also, there is a warning about the option removeDuplicatedTimestamps = TRUE, but the data download fails if this is set to FALSE.
This step returned a move object which seems to have a lot of excess information. Right now, all I want is information about the individuals they retrieved data from, the location data over time, and the information to connect them.
idData <- jackals@idData
jackal_movement <- jackals@data
jackal_id_movement <- jackals@trackId
jackal_movement$local_identifier <- jackal_id_movement
str(jackal_movement)
## 'data.frame': 130686 obs. of 10 variables:
## $ sensor_type_id : int 653 653 653 653 653 653 653 653 653 653 ...
## $ location_lat : num -19.1 -19.1 -19.1 -19.1 -19 ...
## $ location_long : num 15.8 15.8 15.8 15.8 15.8 ...
## $ timestamp : POSIXct, format: "2009-02-07 00:01:20" "2009-02-07 01:00:38" ...
## $ update_ts : Factor w/ 1 level "2017-09-05 17:41:27.589": 1 1 1 1 1 1 1 1 1 1 ...
## $ deployment_id : int 346406151 346406151 346406151 346406151 346406151 346406151 346406151 346406151 346406151 346406151 ...
## $ event_id : num 3.65e+09 3.65e+09 3.65e+09 3.65e+09 3.65e+09 ...
## $ tag_id : int 304884475 304884475 304884475 304884475 304884475 304884475 304884475 304884475 304884475 304884475 ...
## $ sensor_type : Factor w/ 1 level "GPS": 1 1 1 1 1 1 1 1 1 1 ...
## $ local_identifier: Factor w/ 22 levels "CM05","CM08",..: 1 1 1 1 1 1 1 1 1 1 ...
This has created a data frame that has 130,686 rows, where each row contains the GPS location of an individual at a point in time.
Overall, I found interacting with the move library to be kind of frustrating.
I really want to search through what a database has available and download a data frame without any bells and whistles. If you know what data set you want, going to the website and downloading the data through their data browsers gives you a flat comma separated file that is much easier to work with.
Summarizing and cleaning data
Download the dataset I just created as a comma separated file here: black-backed-jackal-Namibia.csv
Create the jackal_movement data frame by reading the dataset into R:
jackal_movement <- read.table(file = "black-backed-jackal-Namibia.csv", header = TRUE, sep = ",")
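One thing to watch for after a round trip through CSV: depending on your R version and settings, a timestamp column may come back as text (or a factor) rather than POSIXct. Here is a sketch on a two-row toy file that mimics the layout of the jackal CSV. The UTC time zone is an assumption for this sketch, so check the study’s documentation before relying on it.

```r
# Write a two-row toy file that mimics the layout of the jackal CSV
toy_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "location_lat,location_long,timestamp,local_identifier",
  "-19.1,15.8,2009-02-07 00:01:20,CM05",
  "-19.1,15.8,2009-02-07 01:00:38,CM05"
), toy_csv)

toy <- read.csv(toy_csv, stringsAsFactors = FALSE)
class(toy$timestamp)  # text at this point, not POSIXct

# Convert explicitly; the UTC time zone is an assumption for this sketch
toy$timestamp <- as.POSIXct(toy$timestamp,
                            format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
class(toy$timestamp)
```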
Once you have acquired the data, the next step is to get a sense of what you can expect from the data, and if there are any missing values or weird-looking outlier data.
First, I usually like to get an idea of what types of values my data frame currently contains, and if I need to transform those values at all.
str(jackal_movement)
## 'data.frame': 130686 obs. of 10 variables:
## $ sensor_type_id : int 653 653 653 653 653 653 653 653 653 653 ...
## $ location_lat : num -19.1 -19.1 -19.1 -19.1 -19 ...
## $ location_long : num 15.8 15.8 15.8 15.8 15.8 ...
## $ timestamp : POSIXct, format: "2009-02-07 00:01:20" "2009-02-07 01:00:38" ...
## $ update_ts : Factor w/ 1 level "2017-09-05 17:41:27.589": 1 1 1 1 1 1 1 1 1 1 ...
## $ deployment_id : int 346406151 346406151 346406151 346406151 346406151 346406151 346406151 346406151 346406151 346406151 ...
## $ event_id : num 3.65e+09 3.65e+09 3.65e+09 3.65e+09 3.65e+09 ...
## $ tag_id : int 304884475 304884475 304884475 304884475 304884475 304884475 304884475 304884475 304884475 304884475 ...
## $ sensor_type : Factor w/ 1 level "GPS": 1 1 1 1 1 1 1 1 1 1 ...
## $ local_identifier: Factor w/ 22 levels "CM05","CM08",..: 1 1 1 1 1 1 1 1 1 1 ...
My data frame is a mix of integers, numeric values, factors, and POSIXct values, which we will explain in more detail in a later post.
It’s really important to check for missing values, and to get a sense of the range of values in your data set. As Ciera introduced in the previous posts, the skimr package is really wonderful for this.
library(skimr)
skim(jackal_movement)
## Skim summary statistics
## n obs: 130686
## n variables: 10
##
## Variable type: factor
## variable missing complete n n_unique
## local_identifier 0 130686 130686 22
## sensor_type 0 130686 130686 1
## update_ts 0 130686 130686 1
## top_counts ordered
## CM0: 11827, CM1: 11508, CM4: 11358, CM1: 11148 FALSE
## GPS: 130686, NA: 0 FALSE
## 201: 130686, NA: 0 FALSE
##
## Variable type: integer
## variable missing complete n mean sd p0
## deployment_id 0 130686 130686 3.5e+08 5.42 3.5e+08
## sensor_type_id 0 130686 130686 653 0 653
## tag_id 0 130686 130686 3e+08 23.66 3e+08
## p25 median p75 p100 hist
## 3.5e+08 3.5e+08 3.5e+08 3.5e+08 ▅▂▃▇▆▆▇▂
## 653 653 653 653 ▁▁▁▇▁▁▁▁
## 3e+08 3e+08 3e+08 3e+08 ▃▇▃▇▃▆▁▅
##
## Variable type: numeric
## variable missing complete n mean sd p0
## event_id 0 130686 130686 3.7e+09 37726.81 3.7e+09
## location_lat 0 130686 130686 -19.11 0.066 -19.31
## location_long 0 130686 130686 15.87 0.12 15.46
## p25 median p75 p100 hist
## 3.7e+09 3.7e+09 3.7e+09 3.7e+09 ▇▇▇▇▇▇▇▇
## -19.16 -19.12 -19.06 -18.81 ▁▂▇▆▅▁▁▁
## 15.78 15.89 15.96 16.09 ▁▁▁▃▂▇▂▃
##
## Variable type: POSIXct
## variable missing complete n min max median
## timestamp 0 130686 130686 2009-02-07 2011-01-16 2009-10-11
## n_unique
## 112242
skim() automatically splits up my data by variable type and summarizes each type, and packs a ton of information into the summary.
What I can quickly see is that I have no missing values for any of my variables - this is good! But you’re not done quite yet.
Reading in data can make quite a few assumptions about what is and what is not labeled a missing value.
Here, a missing value is an NA, but what if someone spelled out not available and put that into the data set? Or wrote missing, or just left it blank? We’ll check for these situations in a quick and dirty way.
table(jackal_movement$local_identifier)
##
## CM05 CM08 CM09 CM10 CM11 CM15 CM18 CM20 CM23 CM26 CM33 CM36
## 1873 3374 11827 1235 11508 11148 884 712 1481 10795 5522 3128
## CM40 CM42 CM44 CM47 CM62 CM69 CM70 CM72 CM83 CM95
## 7311 2662 9914 11358 9532 10025 3982 2875 8101 1439
I use table() to see which levels a factor has, and it’s helpful to see how the data are distributed across those levels. Here, we can see that all of the local identifiers follow a consistent and predictable naming scheme, that there are no obviously mislabeled levels, and that the data is not evenly distributed between the individual jackals. Some jackals had their GPS trackers on for much longer than others.
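Eyeballing a table works for 22 levels, but for bigger data sets the same check can be automated. A rough sketch on toy data: the list of "missing-like" strings is my own guess at common spellings and should be extended for your data, and count_na_like() is a made-up helper name.

```r
# Toy data with a few "disguised" missing values
toy <- data.frame(
  local_identifier = c("CM05", "missing", "CM08", "", "not available"),
  sensor_type = c("GPS", "GPS", "GPS", "GPS", "GPS"),
  stringsAsFactors = FALSE
)

# Strings that should be treated as missing; extend as needed
na_like <- c("", "na", "n/a", "missing", "not available")

# Hypothetical helper: count suspicious entries in one column,
# ignoring case and surrounding whitespace
count_na_like <- function(x) {
  sum(tolower(trimws(as.character(x))) %in% na_like)
}

na_like_counts <- vapply(toy, count_na_like, integer(1))
na_like_counts
```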
I’m not going to check the other factors with table(), because I can see that they only have one value and that it is complete.
I can also use the skim() results to check on the integer and numeric values, and to get a sense of their distributions.
Later, we’re mainly going to be using local identifier, event id, and the latitude and longitude information.
Deployment id and tag id seem to be associated with a specific GPS device, and provide information that is redundant with the local identifier.
The sensor type is a single value, which probably means GPS in this situation. (Many data sets on Movebank have different types of sensors, so that is something to pay attention to if you are downloading other data sets.)
The event id is a sequential number, where each row of my data set has a unique event id. This will be incredibly useful later.
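The uniqueness claim is easy to verify rather than assume. A sketch on a toy column with made-up ids; with the real data you would pass jackal_movement$event_id instead. The ids are stored as numeric rather than integer because values this large do not fit in R’s 32-bit integer type.

```r
# Toy event ids (made-up values in the same range as the real ones)
toy <- data.frame(event_id = c(3650000001, 3650000002, 3650000003))

# anyDuplicated() returns 0 when every value is unique
ids_unique <- anyDuplicated(toy$event_id) == 0
ids_unique
```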
I can use range() to verify what skim() appears to be showing for latitude and longitude - that there do not appear (at this level of analysis) to be weird outliers.
range(jackal_movement$location_lat)
range(jackal_movement$location_long)
## [1] -19.30973 -18.80921
## [1] 15.45767 16.08849
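For data sets where the ranges are less tidy, a slightly stricter check is to flag any coordinates that fall outside the valid bounds for latitude (-90 to 90) and longitude (-180 to 180). A sketch on toy data, with one deliberately impossible latitude:

```r
# Toy coordinates; the third row has an impossible latitude on purpose
toy <- data.frame(
  location_lat  = c(-19.31, -18.81, 95.0),
  location_long = c(15.46, 16.09, 15.9)
)

# TRUE for rows whose coordinates are outside the valid ranges
bad <- toy$location_lat < -90 | toy$location_lat > 90 |
  toy$location_long < -180 | toy$location_long > 180

toy[bad, ]
```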
It should be said that Movebank likely has researchers clean their data before uploading it, which made this part pretty easy. It’s never safe to assume that your data is clean! Always check it out.
Wrap Up
To wrap up this post, I’m going to do some pretty simple visualizations - because knowing you have a clean data set is pretty satisfying but not that visually compelling.
library(tidyverse)
library(ggmap)
jackal_box <- make_bbox(lon = location_long, lat = location_lat, data=jackal_movement, f= 0.1)
jackal_map <- get_map(location = jackal_box, source = 'google', maptype = 'terrain')
ggmap(jackal_map) + geom_point(data=jackal_movement, aes(x=location_long, y=location_lat))
ggmap(jackal_map) + geom_point(data=jackal_movement, aes(x=location_long, y=location_lat, color=local_identifier), alpha = 0.5)
In the next post, I will parse out the timestamp data and the fun will start with some analysis!
Session Info:
sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: OS X El Capitan 10.11.6
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
## other attached packages:
## [1] maps_3.2.0 ggplot2_2.2.1 move_3.0.2 rgdal_1.2-18 raster_2.6-7 sp_1.2-7
## [7] geosphere_1.5-7
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.16 knitr_1.20 xml2_1.2.0 magrittr_1.5 munsell_0.4.3
## [6] colorspace_1.3-2 lattice_0.20-35 R6_2.2.2 rlang_0.2.0 plyr_1.8.4
## [11] stringr_1.3.0 httr_1.3.1 tools_3.4.0 parallel_3.4.0 grid_3.4.0
## [16] gtable_0.2.0 htmltools_0.3.6 lazyeval_0.2.1 yaml_2.1.18 digest_0.6.15
## [21] rprojroot_1.3-2 tibble_1.4.2 curl_3.1 evaluate_0.10.1 rmarkdown_1.9
## [26] labeling_0.3 stringi_1.1.7 pillar_1.2.1 compiler_3.4.0 scales_0.5.0
## [31] backports_1.1.2