Dinosaurs of America
Zoe Liu
In this post, we will be examining data from the Paleobiology Database (PBDB). Specifically, we will be looking at dinosaur and plant fossils found in the United States of America, observing the areas where different fossils are found, and checking whether their locations are correlated. This notebook also aims to be a guide to exploratory analysis and data cleaning with the Paleobiology Database.
What is the Paleobiology Database?
PBDB is a public database of paleontological data that anyone can use, maintained by an international non-governmental group of paleontologists. One of its main features is its navigator, which allows a user to sort data by geological time, taxa, authorizer, stratigraphy, and more. PBDB is run by the Department of Geoscience at the University of Wisconsin-Madison. The project team consists of Shanan Peters, Michael McClennan, and John Czaplewski.
How do you access the data?
PBDB is free to use and has no requirements for access. After sorting through the PBDB navigator and finding the dataset you want to download, click on the button to the left called “save map data”. A window will appear, giving you two choices. You can either download the data as a CSV, JSON, TSV, or RIS file, or you can obtain a URL that can be called from external scripts written in languages such as R or Python. If you choose to download the data as a file, it can be used for analysis right away. Accessing the data by making HTTP requests takes a little more work. This tutorial will teach you how to obtain the data using the URL, and will require installation of Python and Jupyter. Download instructions can be found here for Python and here for Jupyter. In addition, documentation for the data service (including data recorded in the file and instructions on usage) can be found here.
Why PBDB?
The Paleobiology Database has an extensive dataset of different types of plant and animal fossils, and its navigator is visually stunning and well-designed. What drew me to PBDB was its large collection of dinosaur fossils, which is a topic I’ve always wanted to learn more about. I’ve heard it said that 3rd graders and scientists know more about dinosaurs than anyone else in the world. In this notebook, I will attempt to reach their level of dinosaur mastery by analyzing the PBDB dinosaur dataset.
Following along with the tutorial
Now we will get started with the coding! In order to follow along you can
- just continue reading
- clone the repository onto your computer
- explore using mybinder. Please be patient while the environment builds (about 10 min); this tutorial is pretty large.
Part I: Dinosaurs
First, let’s gather the data we want to further examine. We will do this by making an HTTP request with the URL corresponding to the location of the data we want to look at.
In order to obtain the URL:
- Access the PBDB navigator
- Choose the specimen you want to examine
- Resize the map to get your desired coordinates
By running the cell below, we will store the response to the HTTP request, parse its JSON body into a Python object, and write that object to the file “dino_NA.json”.
import requests
import json
URL = "https://paleobiodb.org/data1.2/occs/list.json?lngmin=-142.2070&lngmax=-40.9570&latmin=23.8054&latmax=53.0676&base_id=52775&show=coords,attr,loc,prot,time,strat,stratext,lith,lithext,geo,rem,ent,entname,crmod&datainfo"
r = requests.get(url=URL) # store the response to the HTTP request in r
data = r.json() # parse the JSON body of the response into a Python dict
with open("dino_NA.json", "w") as write_file:
    json.dump(data, write_file) # write the data to the file dino_NA.json
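If you would rather not copy the URL out of the navigator by hand, it can also be assembled from its parts. The sketch below is only an illustration: `build_pbdb_url` is a hypothetical helper of my own, not part of any PBDB library, and it only covers the parameters that appear in the URL above.

```python
from urllib.parse import urlencode

# Endpoint copied from the URL used above.
BASE = "https://paleobiodb.org/data1.2/occs/list.json"

def build_pbdb_url(lngmin, lngmax, latmin, latmax, base_id, show):
    """Hypothetical helper: build an occurrence-list URL from a bounding box and a taxon id."""
    params = {
        "lngmin": lngmin,
        "lngmax": lngmax,
        "latmin": latmin,
        "latmax": latmax,
        "base_id": base_id,
        "show": ",".join(show),
    }
    # safe="," keeps the commas of the show list readable instead of encoding them as %2C
    return BASE + "?" + urlencode(params, safe=",") + "&datainfo"

url = build_pbdb_url(-142.2070, -40.9570, 23.8054, 53.0676, 52775,
                     ["coords", "loc", "time"])
print(url)
```

Changing the bounding box or the list of `show` blocks then only means changing the arguments, not editing a long string.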
Now, we have to normalize the data by converting the structured json file into a flat table. We do this by importing the pandas library, which has a data structure called a DataFrame where we can store our data.
import pandas as pd
from pandas import json_normalize # pandas 1.0+; the pandas.io.json import path is deprecated
df = pd.DataFrame.from_dict(json_normalize(data))
df
access_time | data_license | data_provider | data_source | data_url | documentation_url | elapsed_time | license_url | parameters.base_id | parameters.latmax | parameters.latmin | parameters.lngmax | parameters.lngmin | parameters.show | parameters.taxon_status | parameters.timerule | records | title | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Tue 2019-04-02 19:50:55 GMT | Creative Commons CC-BY | The Paleobiology Database | The Paleobiology Database | http://paleobiodb.org/data1.2/occs/list.json?l... | http://paleobiodb.org/data1.2/occs/list_doc.html | 1.75 | http://creativecommons.org/licenses/by/4.0/ | 52775 | 53.0676 | 23.8054 | -40.957 | -142.207 | coords,attr,loc,prot,time,strat,stratext,lith,... | all | major | [{'oid': 'occ:139242', 'eid': 'rei:24752', 'ci... | PBDB Data Service |
Scrolling through this flattened table, we see that the column “records” has a nested list contained within it. Let’s see what it shows:
dino_df = pd.DataFrame.from_dict(json_normalize(data, ["records"]))
dino_df.head()
ath | ati | cc2 | cid | cny | cxi | dcr | dmd | eag | eid | ... | srb | sro | srs | ssc | stp | szn | tdf | tec | tid | tna | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | M. Carrano | prs:14 | CA | col:11890 | NaN | 39 | 2011-05-13 07:45:44 | 2011-05-12 16:46:47 | 83.5 | rei:24752 | ... | 4.135 | bottom to top | Dinosaur Park | bed | Alberta | NaN | subjective synonym of | NaN | txn:53194 | Gorgosaurus libratus |
1 | M. Carrano | prs:14 | CA | col:11892 | NaN | 39 | 2001-09-18 13:58:56 | 2013-05-10 07:22:24 | 83.5 | NaN | ... | NaN | NaN | NaN | bed | Alberta | NaN | NaN | NaN | txn:38755 | Hadrosauridae |
2 | M. Carrano | prs:14 | CA | col:11893 | NaN | 39 | 2010-07-27 08:36:19 | 2010-07-27 10:37:07 | 83.5 | rei:23301 | ... | NaN | NaN | NaN | bed | Alberta | NaN | NaN | NaN | txn:53194 | Gorgosaurus libratus |
3 | M. Carrano | prs:14 | CA | col:11894 | NaN | 39 | 2001-09-18 14:04:56 | 2006-04-21 12:43:50 | 83.5 | NaN | ... | 4.03 | NaN | Dinosaur Park | bed | Alberta | NaN | NaN | NaN | txn:63911 | Centrosaurus apertus |
4 | M. Carrano | prs:14 | CA | col:11895 | NaN | 39 | 2010-07-27 08:39:37 | 2010-07-27 10:39:58 | 83.5 | rei:23302 | ... | NaN | NaN | NaN | bed | Alberta | NaN | NaN | NaN | txn:53194 | Gorgosaurus libratus |
5 rows × 61 columns
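To see what the second argument to json_normalize does in isolation, here is a small sketch on a made-up payload that only mimics the shape of the PBDB response (the two records are invented for illustration). It assumes pandas 1.0 or later, where `pd.json_normalize` is available at the top level.

```python
import pandas as pd

# A made-up payload shaped like the PBDB response: metadata plus a nested list of records.
payload = {
    "title": "PBDB Data Service",
    "records": [
        {"oid": "occ:1", "tna": "Hadrosauridae"},
        {"oid": "occ:2", "tna": "Theropoda"},
    ],
}

# Without a record path, the whole list ends up inside a single cell.
flat = pd.json_normalize(payload)

# With record_path, every element of the nested list becomes its own row.
rows = pd.json_normalize(payload, record_path="records")

print(flat.shape)
print(rows.shape)
print(rows)
```

This is why the first table above had a single row, while the second one has one row per fossil occurrence.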
dino_df.shape
(10976, 61)
We see that this table, which has the raw data we want concerning dinosaurs, has 10976 entries and 61 features in total. We can start cleaning our data now by renaming unintuitive column names, converting data values into something more workable (e.g. the time and date format of the “dcr” and “dmd” columns), and examining missing values.
def df_rename(df):
    # rename PBDB's three-letter column codes in place to more readable names
    df.rename({"ath": "authorizer",
"ati": "authorizer_no",
"cc2": "country",
"cid": "collection_no",
"cny": "county",
"cxi": "cx_int_no",
"dcr": "created",
"dmd": "modified",
"eag": "max_ma",
"eid": "reid_no",
"smb": "member",
"ssc": "stratscale",
"stp": "state",
"tdf": "diference",
"tid": "accepted_no",
"slb": "local_bed",
"slo": "local_order",
"sls": "local_section",
"env": "environment",
"ent": "enterer",
"ggc": "geog_comments",
"gsc": "geog_scale",
"gcm": "geology_comments",
"ff1": "fossils_from_1",
"ff2": "fossils_from_2",
"idn": "identified_name",
"la1": "lith_adj_1",
"la2": "lith_adj_2",
"lag": "min_ma",
"iid": "identified_no",
"ldc": "lith_descript",
"lf1": "lithification_1",
"lf2": "lithification_2",
"lm1": "minor_lithology_1",
"lm2": "minor_lithology_2",
"mdf": "modifier",
"lt1": "lithology_1",
"lt2": "lithology_2",
"ocm": "occurance_comments",
"oei": "early_interval",
"oli": "late_interval",
"ptd": "protected",
"scm": "strat_comments",
"srs": "regional_section",
"sfm": "formation",
"sgr": "stratgroup",
"tna": "accepted_name"}, inplace=True, axis="columns")
Here we have defined a function that renames these specific columns in any dataframe that has them. Because data requested from PBDB with the same options comes back with these same columns, this may be handy if we choose to look at more data from PBDB.
df_rename(dino_df)
dino_df.head()
authorizer | authorizer_no | country | collection_no | county | cx_int_no | created | modified | max_ma | reid_no | ... | srb | sro | regional_section | stratscale | state | szn | diference | tec | accepted_no | accepted_name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | M. Carrano | prs:14 | CA | col:11890 | NaN | 39 | 2011-05-13 07:45:44 | 2011-05-12 16:46:47 | 83.5 | rei:24752 | ... | 4.135 | bottom to top | Dinosaur Park | bed | Alberta | NaN | subjective synonym of | NaN | txn:53194 | Gorgosaurus libratus |
1 | M. Carrano | prs:14 | CA | col:11892 | NaN | 39 | 2001-09-18 13:58:56 | 2013-05-10 07:22:24 | 83.5 | NaN | ... | NaN | NaN | NaN | bed | Alberta | NaN | NaN | NaN | txn:38755 | Hadrosauridae |
2 | M. Carrano | prs:14 | CA | col:11893 | NaN | 39 | 2010-07-27 08:36:19 | 2010-07-27 10:37:07 | 83.5 | rei:23301 | ... | NaN | NaN | NaN | bed | Alberta | NaN | NaN | NaN | txn:53194 | Gorgosaurus libratus |
3 | M. Carrano | prs:14 | CA | col:11894 | NaN | 39 | 2001-09-18 14:04:56 | 2006-04-21 12:43:50 | 83.5 | NaN | ... | 4.03 | NaN | Dinosaur Park | bed | Alberta | NaN | NaN | NaN | txn:63911 | Centrosaurus apertus |
4 | M. Carrano | prs:14 | CA | col:11895 | NaN | 39 | 2010-07-27 08:39:37 | 2010-07-27 10:39:58 | 83.5 | rei:23302 | ... | NaN | NaN | NaN | bed | Alberta | NaN | NaN | NaN | txn:53194 | Gorgosaurus libratus |
5 rows × 61 columns
dino_df.columns
Index(['authorizer', 'authorizer_no', 'country', 'collection_no', 'county',
'cx_int_no', 'created', 'modified', 'max_ma', 'reid_no', 'eni',
'enterer', 'environment', 'fossils_from_1', 'fossils_from_2', 'flg',
'geology_comments', 'geog_comments', 'geog_scale', 'identified_name',
'idr', 'identified_no', 'lith_adj_1', 'lith_adj_2', 'min_ma', 'lat',
'lith_descript', 'lithification_1', 'lithification_2',
'minor_lithology_1', 'minor_lithology_2', 'lng', 'lithology_1',
'lithology_2', 'modifier', 'mdi', 'occurance_comments',
'early_interval', 'oid', 'late_interval', 'prc', 'protected', 'rid',
'rnk', 'strat_comments', 'formation', 'stratgroup', 'local_bed',
'local_order', 'local_section', 'member', 'srb', 'sro',
'regional_section', 'stratscale', 'state', 'szn', 'diference', 'tec',
'accepted_no', 'accepted_name'],
dtype='object')
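The “created” and “modified” columns are still plain strings at this point. One way to make them workable is `pd.to_datetime`; the sketch below runs on a couple of sample values copied from the table above rather than on the full dataframe, and the `created_year` column is just an illustrative name.

```python
import pandas as pd

# Two sample rows using timestamp strings in the same format as the PBDB columns.
sample = pd.DataFrame({
    "created": ["2011-05-13 07:45:44", "2001-09-18 13:58:56"],
    "modified": ["2011-05-12 16:46:47", "2013-05-10 07:22:24"],
})

# Parse the strings into real datetime values, column by column.
for col in ["created", "modified"]:
    sample[col] = pd.to_datetime(sample[col], format="%Y-%m-%d %H:%M:%S")

# Once parsed, date parts are available through the .dt accessor.
sample["created_year"] = sample["created"].dt.year
print(sample.dtypes)
print(sample)
```

The same loop should work on dino_df after renaming, assuming every value in those columns follows this format.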
I’ve tried my best to rename columns using the PBDB API documentation and the csv file I downloaded. However, I couldn’t find names for every column, and some were nearly indistinguishable from others. Columns such as “reid_no” and “authorizer_no” contain identifiers that aren’t of particular interest here. That being said, let’s take a look at some interesting data and see what we find.
interesting = dino_df[["accepted_name", "state", "lat", "lng", "environment", "created", "country", "early_interval", "late_interval"]]
interesting.head()
accepted_name | state | lat | lng | environment | created | country | early_interval | late_interval | |
---|---|---|---|---|---|---|---|---|---|
0 | Gorgosaurus libratus | Alberta | 50.740726 | -111.528732 | "channel" | 2011-05-13 07:45:44 | CA | Late Campanian | NaN |
1 | Hadrosauridae | Alberta | 50.753296 | -111.461914 | terrestrial indet. | 2001-09-18 13:58:56 | CA | Late Campanian | NaN |
2 | Gorgosaurus libratus | Alberta | 50.737015 | -111.549347 | "channel" | 2010-07-27 08:36:19 | CA | Late Campanian | NaN |
3 | Centrosaurus apertus | Alberta | 50.737297 | -111.528931 | terrestrial indet. | 2001-09-18 14:04:56 | CA | Late Campanian | NaN |
4 | Gorgosaurus libratus | Alberta | 50.723866 | -111.564636 | channel lag | 2010-07-27 08:39:37 | CA | Late Campanian | NaN |
Some things of note:
1) There are two columns for name, accepted_name and identified_name. I chose accepted_name because it had fewer missing values.
2) The late_interval column seems to be missing a lot of values. Unless otherwise specified, we will date the dinosaurs as belonging to the early_interval.
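For reference, this is roughly how missing values can be compared between columns. The sketch uses a tiny invented dataframe, so the counts it prints say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Invented toy data with a few gaps in it.
toy = pd.DataFrame({
    "accepted_name": ["Theropoda", "Hadrosauridae", None],
    "late_interval": [np.nan, "Tithonian", np.nan],
})

# Number of missing values per column, largest first.
missing = toy.isna().sum().sort_values(ascending=False)

# Share of missing values per column (between 0 and 1).
share = toy.isna().mean()

print(missing)
print(share)
```

Running the same two lines on dino_df would give the per-column counts behind the two notes above.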
interesting["country"].unique()
array(['CA', 'US', 'MX', 'UZ', 'BM', 'BS', 'MN'], dtype=object)
It looks like we have dinosaurs found in countries like Canada, Mexico, and Bermuda. Let’s narrow our search to the US, which is what we’re interested in.
interesting = interesting[interesting["country"] == "US"]
print(interesting.shape)
interesting.head()
(9094, 9)
accepted_name | state | lat | lng | environment | created | country | early_interval | late_interval | |
---|---|---|---|---|---|---|---|---|---|
19 | Theropoda | Connecticut | 41.566666 | -72.633331 | terrestrial indet. | 2011-07-28 02:09:51 | US | Hettangian | Sinemurian |
20 | Camarasaurus grandis | Colorado | 39.068802 | -108.699989 | fluvial-lacustrine indet. | 2017-11-02 14:56:21 | US | Kimmeridgian | Tithonian |
21 | Camarasaurus supremus | Colorado | 39.111668 | -108.717499 | fluvial-lacustrine indet. | 2001-09-19 09:11:44 | US | Kimmeridgian | NaN |
22 | Ankylosaurus magniventris | Montana | 47.637699 | -106.569901 | terrestrial indet. | 2001-09-19 10:03:19 | US | Maastrichtian | NaN |
23 | Titanosauriformes | Oklahoma | 34.180000 | -96.278053 | coastal indet. | 2005-08-25 14:56:00 | US | Late Aptian | Early Albian |
Now we only have 9094 entries.
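A side note on filtering: depending on your pandas version, adding a new column to a filtered dataframe can trigger a SettingWithCopyWarning. A common way around it, sketched here on invented toy data, is to take an explicit copy right after filtering.

```python
import pandas as pd

# Invented toy data standing in for the full table.
toy = pd.DataFrame({
    "country": ["CA", "US", "US"],
    "early_interval": ["Campanian", "Hettangian", "Lancian"],
})

# .copy() makes the filtered result an independent dataframe,
# so adding columns to it cannot affect (or warn about) the original.
us_only = toy[toy["country"] == "US"].copy()
us_only["interval"] = us_only["early_interval"]

print(us_only)
print(toy.columns.tolist())
```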
Going Back in Time
Now let’s take a look at the time component of the data.
interesting["early_interval"].unique()
array(['Hettangian', 'Kimmeridgian', 'Maastrichtian', 'Late Aptian',
'Late Campanian', 'Late Albian', 'Rhaetian', 'Sinemurian',
'Campanian', 'Late Maastrichtian', 'Hemingfordian', 'Orellan',
'Langhian', 'Tortonian', 'Early Coniacian', 'Early Eocene',
'Late Eocene', 'Late Kimmeridgian', 'Bridgerian', 'Harrisonian',
'Late Santonian', 'Piacenzian', 'Aptian', 'Early Tithonian',
'Clarendonian', 'Norian', 'Middle Campanian', 'Lancian',
'Judithian', 'Late Pleistocene', 'Middle Cenomanian',
'Hemphillian', 'Early Campanian', 'Albian', 'Pliensbachian',
'Irvingtonian', 'Wasatchian', 'Early Maastrichtian', 'Santonian',
'Barremian', 'Early Aptian', 'Late Hemphillian', 'Blancan',
'Late Uintan', 'Middle Pleistocene', 'Middle Coniacian',
'Early Cenomanian', 'Late Oxfordian', 'Late Coniacian',
'Late Clarendonian', 'early Early Hemphillian', 'Turonian',
'Oligocene', 'Miocene', 'Late Callovian', 'Cenomanian',
'Tithonian', 'Early Kimmeridgian', 'Late Cretaceous',
'Middle Santonian', 'Valanginian', 'Zanclean', 'Early Jurassic',
'Early Hettangian', 'Late Triassic', 'Middle Turonian',
'Early Cretaceous', 'Early Santonian', 'Middle Tithonian',
'Middle Albian', 'Early Albian', 'Serravallian', 'Lacian',
'Duchesnean', 'Jurassic', 'Rancholabrean', 'Edmontonian',
'Bartonian', 'Priabonian', 'Early Callovian', 'Carnian',
'Middle Callovian', 'Bathonian', 'Late Bajocian', 'Rupelian',
'Chattian', 'Oxfordian', 'Late Jurassic', 'Aquitanian', 'Holocene',
'Late Cenomanian', 'Arikareean', 'Cretaceous', 'Early Uintan',
'Eocene', 'Messinian', 'Late Miocene', 'Coniacian',
'Late Barremian', 'Pleistocene', 'Lysitean', 'Uintan',
'Early Barstovian', 'Barstovian', 'Chadronian', 'Whitneyan',
'Puercan', 'Middle Eocene', 'Late Pliocene', 'Pliocene',
'Late Turonian', 'Clarkforkian', 'Late Chadronian', 'Burdigalian',
'Tiffanian', 'Early Barremian', 'Early Sinemurian',
'Late Paleocene', 'Early Oligocene', 'Early Pliocene', 'Bajocian',
'Early Miocene', 'Middle Miocene', 'Early Hemingfordian',
'Thanetian', 'Late Oligocene', 'Calabrian', 'Early Pleistocene',
'Early Clarendonian', 'Late Berriasian', 'Berriasian',
'Torrejonian', 'Toarcian', 'Smithian', 'Geringian'], dtype=object)
That’s a lot of intervals! Our data is provided at an extremely fine time scale, which is often a good thing. However, this may make it hard for us to visualize our data. Instead, let’s make a separate column, “interval”, that rounds everything from the early interval level up to the nearest period. Below, you see lists containing every sub-category of each period found in our data. We use these lists to change the fine interval data to coarser, easier to work with period data.
interesting["interval"] = interesting["early_interval"]
def make_stages(stages, period):
    return [stage + " " + age for stage in stages for age in period]
stages = ["Early", "Middle", "Late"]
Cretaceous = ["Maastrichtian", "Campanian", "Santonian", "Coniacian", "Turonian", "Cenomanian", "Albian", "Aptian", "Barremian",
"Hauterivian", "Valanginian", "Berriasian", "Cretaceous", "Lancian", "Judithian", "Edmontonian"]
Cretaceous += make_stages(stages, Cretaceous)
Jurassic = ["Hettangian", "Sinemurian", "Pliensbachian", "Toarcian", "Aalenian", "Bajocian", "Bathonian", "Callovian",
"Oxfordian", "Kimmeridgian", "Tithonian", "Jurassic"]
Jurassic += make_stages(stages, Jurassic)
Triassic = ["Olenekian", "Anisian", "Ladinian", "Carnian", "Norian", "Rhaetian", "Induan", "Triassic", "Smithian", "Alaunian"]
Triassic += make_stages(stages, Triassic)
Neogene = ["Aquitanian", "Burdigalian", "Langhian", "Serravallian", "Tortonian", "Messinian", "Zanclean", "Neogene", "Miocene",
"Pliocene", "Hemingfordian", "Piacenzian", "Clarendonian", "Hemphillian", "Blancan", "Arikareean", "Barstovian"]
Neogene += make_stages(stages, Neogene)
Paleogene = ["Danian", "Selandian", "Thanetian", "Ypresian", "Lutetian", "Bartonian", "Priabonian", "Rupelian", "Chattian",
"Paleogene", "Oligocene", "Eocene", "Paleocene", "Orellan", "Bridgerian", "Harrisonian", "Wasatchian",
"Uintan", "Duchesnean", "Chadronian", "Whitneyan", "Puercan", "Clarkforkian", "Tiffanian", "Torrejonian",
"Geringian", "Lysitean"]
Paleogene += make_stages(stages, Paleogene)
Quaternary = ["Holocene", "Pleistocene", "Quaternary", "Irvingtonian", "Rancholabrean", "Calabrian"]
Quaternary += make_stages(stages, Quaternary)
#here I replace a mistaken entry
interesting["interval"] = interesting["interval"].replace("early Early Hemphillian", "Early Hemphillian")
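To make the list-building step concrete, here is what make_stages returns for a shortened input. The two-prefix, two-age lists are only for illustration; the function body is the same as the one above.

```python
def make_stages(stages, period):
    # Pair every prefix ("Early", "Late", ...) with every age name.
    return [stage + " " + age for stage in stages for age in period]

combined = make_stages(["Early", "Late"], ["Aptian", "Albian"])
print(combined)
# ['Early Aptian', 'Early Albian', 'Late Aptian', 'Late Albian']
```

Because each period list also contains the period's own name, the same call produces entries such as "Late Cretaceous" and "Early Jurassic", which do appear in the data.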
This function replaces every item of “start” found in the “interval” column of the dataframe “df” with the word in “end”.
def change_geog(start, end, df):
    df["interval"] = df['interval'].replace(start, end)
change_geog(Cretaceous, "Cretaceous", interesting)
change_geog(Jurassic, "Jurassic", interesting)
change_geog(Triassic, "Triassic", interesting)
change_geog(Neogene, "Neogene", interesting)
change_geog(Paleogene, "Paleogene", interesting)
change_geog(Quaternary, "Quaternary", interesting)
#One of the early intervals was labeled "Lacian". However, my search for Lacian resulted in nothing, so I went with the
#late interval of Alaunian instead.
interesting["interval"] = interesting["interval"].replace("Lacian", "Triassic")
interesting.head()
accepted_name | state | lat | lng | environment | created | country | early_interval | late_interval | interval | |
---|---|---|---|---|---|---|---|---|---|---|
19 | Theropoda | Connecticut | 41.566666 | -72.633331 | terrestrial indet. | 2011-07-28 02:09:51 | US | Hettangian | Sinemurian | Jurassic |
20 | Camarasaurus grandis | Colorado | 39.068802 | -108.699989 | fluvial-lacustrine indet. | 2017-11-02 14:56:21 | US | Kimmeridgian | Tithonian | Jurassic |
21 | Camarasaurus supremus | Colorado | 39.111668 | -108.717499 | fluvial-lacustrine indet. | 2001-09-19 09:11:44 | US | Kimmeridgian | NaN | Jurassic |
22 | Ankylosaurus magniventris | Montana | 47.637699 | -106.569901 | terrestrial indet. | 2001-09-19 10:03:19 | US | Maastrichtian | NaN | Cretaceous |
23 | Titanosauriformes | Oklahoma | 34.180000 | -96.278053 | coastal indet. | 2005-08-25 14:56:00 | US | Late Aptian | Early Albian | Cretaceous |
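One thing the replace-based approach does not tell us is whether any interval name slipped through unmapped. A possible alternative, sketched here with shortened illustrative lists rather than the full ones, is to build a single lookup dictionary and use `Series.map`, which leaves unknown names as NaN so they are easy to list.

```python
import pandas as pd

# Shortened, illustrative lists; the real ones would be the full period lists.
periods = {
    "Cretaceous": ["Maastrichtian", "Late Campanian"],
    "Jurassic": ["Kimmeridgian", "Hettangian"],
}

# Invert the lists into one dictionary: interval name -> period name.
lookup = {name: period for period, names in periods.items() for name in names}

early = pd.Series(["Hettangian", "Late Campanian", "Lacian"])

# map() returns NaN for any name that is missing from the dictionary.
interval = early.map(lookup)

# List the names that still need a decision.
unmapped = sorted(early[interval.isna()].unique())
print(interval.tolist())
print(unmapped)
```

With this layout, a name like "Lacian" shows up in `unmapped` instead of silently staying in the column.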
print(interesting["interval"].unique())
len(interesting["early_interval"].unique())
['Jurassic' 'Cretaceous' 'Triassic' 'Neogene' 'Paleogene' 'Quaternary']
135
Now we have another column that makes our data a little more coarse but ultimately easier to work with. Originally, there were 135 distinct periods, epochs, and stages! Now, we’ve combined our data into 6 geologic periods that we can analyze more easily. Our data dates back to the Triassic period (which began 251.9 million years ago) and contains at least one data point for every period since then, up to the Quaternary period (which began about 2.6 million years ago and continues today).
interesting["interval"].value_counts()
Cretaceous 3592
Quaternary 2668
Jurassic 1544
Neogene 787
Triassic 308
Paleogene 195
Name: interval, dtype: int64
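As a cross-check on the name-based mapping, the periods could also be derived from the numeric “max_ma” column (the oldest age bound, in millions of years). The sketch below is an assumption-laden alternative, not what this notebook does: the boundary values are approximate figures from the international time scale that I have typed in by hand, and the ages are sample numbers rather than rows from the dataset.

```python
import pandas as pd

# Approximate period boundaries in millions of years ago (Ma); treat these as assumptions.
edges = [0, 2.58, 23.03, 66.0, 145.0, 201.3, 251.9]
labels = ["Quaternary", "Neogene", "Paleogene", "Cretaceous", "Jurassic", "Triassic"]

# Sample maximum ages, standing in for the max_ma column.
max_ma = pd.Series([83.5, 157.3, 0.126, 20.44])

# Bins are right-inclusive, so an interval that starts exactly on a boundary
# (for example 66.0) is assigned to the younger period it belongs to.
period = pd.cut(max_ma, bins=edges, labels=labels)
print(period.tolist())
```

Ages older than the last edge come back as NaN, which would flag anything that predates the Triassic.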
Now let’s visualize this data. To use the seaborn and matplotlib libraries, you first have to install them from the command line by running “pip install seaborn” and “pip install matplotlib”. Then, it’s just a quick import and they’re ready for use.
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(12,6)) #create a figure that is 12 x 6
ax = sns.countplot(x='interval', data=interesting, order=interesting['interval'].value_counts().index.tolist())
plt.title("Dinosaur Fossil Count by Geologic Period")
plt.show()
“Jurassic Park” should have been called “Cretaceous Park”! Dinosaurs lived during the Mesozoic era, which is composed of three periods: the Triassic, the Jurassic, and the Cretaceous. It is likely that more fossils date to the Cretaceous because it is the closest of the three to the present day. However, it is puzzling that so many fossils date to the Quaternary period, given that dinosaurs lived many millions of years before the Quaternary. Let’s attempt to answer this question by looking at the location of dinosaur fossils.
Location of Dinosaur Fossils (and also their descendants - Birds!)
Now, let’s look at the spread of dinosaur fossils across the states.
print(interesting["state"].unique())
print(len(interesting["state"].unique()))
interesting["state"].value_counts()
['Connecticut' 'Colorado' 'Montana' 'Oklahoma' 'New Mexico' 'Arizona'
'Wyoming' 'Utah' 'Florida' 'California' 'Kansas' 'Nebraska' 'Texas'
'South Dakota' 'New Jersey' 'North Dakota' 'Georgia' 'Mississippi'
'Nevada' 'Indiana' 'Maryland' 'Massachusetts' 'Missouri' 'Tennessee'
'Delaware' 'Alabama' 'Arkansas' 'Pennsylvania' 'District of Columbia'
'Idaho' 'Louisiana' 'North Carolina' 'South Carolina' 'Virginia'
'New York' 'Arizona/Utah' 'Michigan' 'Oregon' 'Washington' 'Maine'
'West Virginia' 'Alberta' 'Illinois' 'Ohio']
44
California 1467
Wyoming 1072
Montana 903
Florida 885
New Mexico 882
Utah 642
Colorado 550
Texas 401
Massachusetts 322
Virginia 254
North Dakota 250
South Dakota 153
Arizona 145
Oregon 144
North Carolina 141
New Jersey 131
Connecticut 107
Maryland 73
Alabama 57
Kansas 56
Idaho 55
Georgia 54
Oklahoma 53
Nebraska 42
Ohio 41
Pennsylvania 38
Tennessee 34
Illinois 31
Washington 19
Nevada 18
Maine 15
Mississippi 12
Delaware 11
South Carolina 9
Arkansas 5
Arizona/Utah 4
Michigan 4
Missouri 4
District of Columbia 3
Louisiana 3
Indiana 1
New York 1
Alberta 1
West Virginia 1
Name: state, dtype: int64
Here we have a mis-entry. Alberta is a province of Canada, so we’ll drop this row.
interesting[interesting["state"] == "Alberta"]
accepted_name | state | lat | lng | environment | created | country | early_interval | late_interval | interval | |
---|---|---|---|---|---|---|---|---|---|---|
7013 | Albertadromeus syntarsus | Alberta | 49.179443 | -110.682777 | terrestrial indet. | 2013-05-08 21:18:11 | US | Campanian | NaN | Cretaceous |
interesting = interesting[interesting.state != "Alberta"]
interesting.shape #as expected, we have one less entry
(9093, 10)
plt.figure(figsize=(12,6))
ax = sns.countplot(x='state', data=interesting, order=interesting['state'].value_counts().index.tolist())
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.title("Dinosaur Fossil Count by State")
plt.show()
According to this plot, California has the largest quantity of found dinosaur fossils. This finding is a bit puzzling. During the Mesozoic era, when most dinosaurs lived, California was still covered by the ocean. Let’s look at some of the fossils found in California to see what we find.
interesting[interesting["state"] == "California"].head()
accepted_name | state | lat | lng | environment | created | country | early_interval | late_interval | interval | |
---|---|---|---|---|---|---|---|---|---|---|
45 | Hadrogyps aigialeus | California | 35.299999 | -118.500000 | shoreface | 2002-02-18 12:54:21 | US | Langhian | NaN | Neogene |
46 | Miomancalla wetmorei | California | 33.561943 | -117.712219 | basinal (siliciclastic) | 2002-02-19 12:56:07 | US | Tortonian | NaN | Neogene |
188 | Saurolophinae | California | 37.466667 | -121.216667 | marine indet. | 2005-10-12 10:57:20 | US | Maastrichtian | NaN | Cretaceous |
218 | Falconiformes | California | 33.228889 | -116.260277 | fluvial indet. | 2002-10-16 16:21:34 | US | Irvingtonian | NaN | Quaternary |
264 | Anatidae | California | 37.056667 | -120.195831 | lacustrine indet. | 2002-11-20 12:50:20 | US | Irvingtonian | NaN | Quaternary |
Interestingly, four of our first five entries are fossils dated to the Neogene and Quaternary periods, millions of years after dinosaurs roamed the earth. Miomancalla wetmorei, shown in the second row, is a species of flightless auk, a penguin-like bird (picture shown below). According to this page of fossilworks.org, a site that describes entries in PBDB, this specimen is a fossilized limb of an auk that lived roughly 7 to 11 million years ago.
It turns out that our data not only contains dinosaur fossils, but also the fossils of their descendants! Because I took all the results from the Dinosauria clade in the PBDB navigator, Aves, or birds, were also included due to their descent from theropod dinosaurs (a group characterized by three-toed feet and hollow bones).
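If you only want the classic dinosaurs, one rough heuristic is to keep the rows whose period falls in the Mesozoic. This is an approximation on my part, not a taxonomic filter: birds that lived during the Mesozoic would still be included. The rows below are a hand-picked toy sample.

```python
import pandas as pd

MESOZOIC = ["Triassic", "Jurassic", "Cretaceous"]

# Hand-picked toy rows mixing dinosaurs and later birds.
toy = pd.DataFrame({
    "accepted_name": ["Hadrosauridae", "Miomancalla wetmorei", "Anatidae", "Camarasaurus"],
    "interval": ["Cretaceous", "Neogene", "Quaternary", "Jurassic"],
})

# Keep only the rows dated to a Mesozoic period.
mesozoic_only = toy[toy["interval"].isin(MESOZOIC)]
print(mesozoic_only)
```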
Now let’s look at the environments that the fossils were found in. It looks like a vast majority were found in an indeterminate terrestrial environment.
print(interesting["environment"].unique())
print("")
print("Total unique environment types: ", len(interesting["environment"].unique()))
['terrestrial indet.' 'fluvial-lacustrine indet.' 'coastal indet.'
'"channel"' 'fluvial indet.' 'coarse channel fill' 'wet floodplain'
'"floodplain"' 'shoreface' 'basinal (siliciclastic)' 'marine indet.'
'pond' 'crevasse splay' 'lacustrine indet.' 'lacustrine - small'
'fine channel fill' 'lagoonal' 'levee' 'estuary/bay'
'marginal marine indet.' 'deltaic indet.' 'dune'
'shallow subtidal indet.' 'mire/swamp' 'alluvial fan'
'lagoonal/restricted shallow subtidal' 'paralic indet.' 'channel lag'
'transition zone/lower shoreface' 'offshore' 'fissure fill' 'interdune'
'eolian indet.' 'karst indet.' 'dry floodplain' 'delta plain'
'interdistributary bay' 'spring' 'fluvial-deltaic indet.'
'lacustrine - large' 'peritidal' 'carbonate indet.' 'offshore shelf'
'cave' 'sinkhole' 'glacial' 'lacustrine delta plain' 'foreshore'
'basinal (siliceous)' 'tar' 'deep subtidal shelf' 'offshore indet.'
'deep-water indet.' nan 'open shallow subtidal' 'lacustrine delta front'
'lacustrine deltaic indet.']
Total unique environment types: 57
plt.figure(figsize=(12,6))
ax = sns.countplot(x='environment', data=interesting, order = interesting['environment'].value_counts().index.tolist()[:20])
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.title("Dinosaur Fossil Count by Environment Found (Top 20)")
plt.show()
The overwhelming majority of environment types is terrestrial indeterminate, which covers land-based discoveries whose environment could not be determined more specifically. The second most common type of environment is caves, which preserve fossils beautifully by trapping sediment that is washed or blown in by waves or wind.
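To put a number on “overwhelming majority”, value_counts can report shares instead of raw counts. The sketch below uses four invented values, so the percentages are not those of the real data.

```python
import pandas as pd

# Invented sample of environment labels, including one missing value.
env = pd.Series(["terrestrial indet.", "terrestrial indet.", "cave", None])

# normalize=True returns fractions; dropna=False keeps missing values in the total.
share = env.value_counts(normalize=True, dropna=False)
print(share)
```

Applied to the “environment” column, the first entry of the result would be the fraction of fossils labeled terrestrial indet.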
print(interesting["accepted_name"].unique())
print("")
print("Total unique fossil types: ", len(interesting["accepted_name"].unique()))
['Theropoda' 'Camarasaurus grandis' 'Camarasaurus supremus' ... 'Corvidae'
'Rhinorex condrupus' 'Macroelongatoolithus']
Total unique fossil types: 1532
What is our most common dinosaur fossil?
plt.figure(figsize=(12,6))
ax = sns.countplot(x='accepted_name', data=interesting, order = interesting['accepted_name'].value_counts().index.tolist()[:20])
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.title("Dinosaur Fossils by Count (Top 20)")
plt.show()
Our top fossil in terms of count is Hadrosauridae, a family commonly characterized by their duck-billed snouts. They are descendants of the Upper Jurassic/Lower Cretaceous iguanodontian dinosaurs and are most commonly found in the Late Cretaceous. Although hadrosaurids could grow very large (28 feet long), they were herbivores, with duck bills that evolved for grinding plants.
Now let’s look at the distributions of the most common dinosaurs across the geological periods using a facetgrid. To keep our visualization clutter-free, we’ll limit our dinosaurs to the top 10 most common.
top10 = interesting['accepted_name'].value_counts().index.tolist()[:10]
top_dinos = interesting[interesting["accepted_name"].isin(top10)]
top_dinos
accepted_name | state | lat | lng | environment | created | country | early_interval | late_interval | interval | |
---|---|---|---|---|---|---|---|---|---|---|
19 | Theropoda | Connecticut | 41.566666 | -72.633331 | terrestrial indet. | 2011-07-28 02:09:51 | US | Hettangian | Sinemurian | Jurassic |
26 | Hadrosauridae | Montana | 47.695831 | -106.227776 | "channel" | 2002-01-15 15:16:43 | US | Maastrichtian | NaN | Cretaceous |
36 | Theropoda | Colorado | 38.278610 | -108.930832 | terrestrial indet. | 2002-02-05 14:36:05 | US | Rhaetian | Hettangian | Triassic |
41 | Dinosauria | Utah | 39.267223 | -111.254166 | fluvial-lacustrine indet. | 2002-02-16 15:34:49 | US | Late Maastrichtian | Tertiary | Cretaceous |
58 | Dinosauria | Wyoming | 42.152302 | -105.916199 | pond | 2002-03-01 16:16:35 | US | Kimmeridgian | Tithonian | Jurassic |
59 | Camarasaurus | Wyoming | 42.152302 | -105.916199 | pond | 2002-03-01 16:16:35 | US | Kimmeridgian | Tithonian | Jurassic |
62 | Ceratopsidae | Wyoming | 43.349400 | -104.482002 | "channel" | 2002-07-10 20:49:32 | US | Maastrichtian | NaN | Cretaceous |
63 | Hadrosauridae | Wyoming | 43.349400 | -104.482002 | "channel" | 2002-07-10 20:49:32 | US | Maastrichtian | NaN | Cretaceous |
68 | Camarasaurus | Wyoming | 42.017776 | -106.048615 | coarse channel fill | 2005-04-06 14:02:26 | US | Kimmeridgian | Tithonian | Jurassic |
79 | Theropoda | Montana | 48.633301 | -113.750000 | "floodplain" | 2002-07-10 20:49:32 | US | Campanian | NaN | Cretaceous |
100 | Camarasaurus | Wyoming | 43.630001 | -108.199997 | crevasse splay | 2014-02-24 13:38:15 | US | Late Kimmeridgian | Tithonian | Jurassic |
110 | Theropoda | Montana | 48.479202 | -112.701118 | fluvial indet. | 2002-07-10 20:49:32 | US | Campanian | NaN | Cretaceous |
112 | Theropoda | Montana | 48.910831 | -112.640831 | "floodplain" | 2002-07-10 20:49:32 | US | Late Campanian | NaN | Cretaceous |
114 | Theropoda | Montana | 48.966599 | -112.650002 | lacustrine - small | 2002-07-10 20:49:32 | US | Campanian | NaN | Cretaceous |
117 | Theropoda | Texas | 29.138056 | -103.196945 | fine channel fill | 2002-07-10 20:49:32 | US | Late Campanian | NaN | Cretaceous |
118 | Hadrosauridae | Texas | 29.138056 | -103.196945 | fine channel fill | 2002-07-10 20:49:32 | US | Late Campanian | NaN | Cretaceous |
147 | Hadrosauridae | Montana | 48.966599 | -112.650002 | lacustrine - small | 2002-07-10 20:49:32 | US | Campanian | NaN | Cretaceous |
158 | Theropoda | Montana | 46.465099 | -109.297203 | levee | 2002-08-02 11:17:12 | US | Middle Campanian | Late Campanian | Cretaceous |
163 | Theropoda | Montana | 46.200001 | -109.900002 | "channel" | 2002-08-02 11:19:03 | US | Middle Campanian | Late Campanian | Cretaceous |
177 | Tyrannosauridae | Montana | 47.633331 | -107.383331 | coarse channel fill | 2002-08-03 16:09:51 | US | Lancian | NaN | Cretaceous |
180 | Ceratopsidae | Montana | 47.633331 | -107.383331 | coarse channel fill | 2002-08-03 16:09:51 | US | Lancian | NaN | Cretaceous |
184 | Theropoda | New Jersey | 40.299999 | -74.300003 | estuary/bay | 2002-08-20 14:43:52 | US | Judithian | NaN | Cretaceous |
186 | Hadrosauridae | New Jersey | 40.299999 | -74.300003 | estuary/bay | 2002-08-20 14:43:52 | US | Judithian | NaN | Cretaceous |
198 | Tyrannosauridae | Arizona | 31.666668 | -110.766670 | fluvial indet. | 2017-01-10 14:02:08 | US | Late Campanian | NaN | Cretaceous |
199 | Hadrosauridae | Arizona | 31.666668 | -110.766670 | fluvial indet. | 2013-01-07 01:40:40 | US | Late Campanian | NaN | Cretaceous |
200 | Hadrosauridae | Arizona | 31.666668 | -110.766670 | fluvial indet. | 2002-08-22 13:34:09 | US | Late Campanian | NaN | Cretaceous |
204 | Hadrosauridae | Texas | 29.200001 | -103.550003 | crevasse splay | 2002-08-27 14:16:03 | US | Late Campanian | NaN | Cretaceous |
214 | Theropoda | Montana | 48.650002 | -112.966667 | "floodplain" | 2002-09-19 16:48:38 | US | Middle Campanian | NaN | Cretaceous |
215 | Tyrannosauridae | Montana | 48.650002 | -112.966667 | "floodplain" | 2002-09-19 16:48:38 | US | Middle Campanian | NaN | Cretaceous |
216 | Hadrosauridae | Montana | 48.650002 | -112.966667 | "floodplain" | 2002-09-19 16:48:38 | US | Middle Campanian | NaN | Cretaceous |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
10710 | Tyrannosauridae | New Mexico | 36.343887 | -108.101944 | terrestrial indet. | 2018-07-09 16:19:27 | US | Maastrichtian | NaN | Cretaceous |
10713 | Theropoda | New Mexico | 36.343887 | -108.101944 | terrestrial indet. | 2018-07-09 16:19:27 | US | Maastrichtian | NaN | Cretaceous |
10844 | Theropoda | New Mexico | 31.795000 | -106.544167 | marginal marine indet. | 2018-07-23 13:45:00 | US | Late Albian | NaN | Cretaceous |
10845 | Dinosauria | New Mexico | 31.795000 | -106.543053 | marginal marine indet. | 2018-07-23 13:52:03 | US | Late Albian | NaN | Cretaceous |
10847 | Dinosauria | New Mexico | 31.783890 | -106.533333 | terrestrial indet. | 2018-07-23 14:00:21 | US | Late Albian | NaN | Cretaceous |
10857 | Grallator (Eubrontes) | Utah | 38.576462 | -109.518654 | terrestrial indet. | 2018-07-23 14:53:46 | US | Pliensbachian | Toarcian | Jurassic |
10858 | Grallator (Eubrontes) | Utah | 40.442963 | -109.286041 | eolian indet. | 2018-07-24 16:29:30 | US | Pliensbachian | Toarcian | Jurassic |
10859 | Grallator (Eubrontes) | Utah | 40.442963 | -109.286041 | eolian indet. | 2018-07-24 16:30:04 | US | Pliensbachian | Toarcian | Jurassic |
10860 | Grallator (Eubrontes) | Utah | 40.442963 | -109.286041 | eolian indet. | 2018-07-24 16:30:43 | US | Pliensbachian | Toarcian | Jurassic |
10862 | Grallator (Eubrontes) | Utah | 40.442963 | -109.286041 | eolian indet. | 2018-07-24 16:32:44 | US | Pliensbachian | Toarcian | Jurassic |
10865 | Grallator (Eubrontes) | Utah | 40.442963 | -109.286041 | eolian indet. | 2018-07-24 16:33:59 | US | Pliensbachian | Toarcian | Jurassic |
10869 | Grallator (Eubrontes) | Utah | 39.094193 | -109.124847 | terrestrial indet. | 2018-07-24 17:12:33 | US | Rhaetian | NaN | Triassic |
10870 | Grallator (Eubrontes) | Idaho | 42.094444 | -111.262222 | eolian indet. | 2018-07-25 13:47:07 | US | Pliensbachian | Toarcian | Jurassic |
10872 | Theropoda | Utah | 37.217190 | -111.531242 | fluvial indet. | 2018-07-25 14:07:58 | US | Middle Campanian | NaN | Cretaceous |
10884 | Theropoda | Maryland | 39.070869 | -76.868477 | pond | 2018-08-29 06:19:10 | US | Late Aptian | NaN | Cretaceous |
10896 | Theropoda | Texas | 33.107777 | -101.449997 | "channel" | 2018-09-20 07:58:42 | US | Norian | NaN | Triassic |
10897 | Ceratopsidae | New Mexico | 36.186100 | -107.889198 | terrestrial indet. | 2018-09-20 13:28:13 | US | Late Campanian | NaN | Cretaceous |
10898 | Dinosauria | New Mexico | 36.186100 | -107.889198 | terrestrial indet. | 2018-09-20 13:28:13 | US | Late Campanian | NaN | Cretaceous |
10899 | Theropoda | New Mexico | 36.314800 | -108.084198 | terrestrial indet. | 2018-09-20 13:32:14 | US | Late Campanian | NaN | Cretaceous |
10900 | Dinosauria | New Mexico | 36.314800 | -108.084198 | terrestrial indet. | 2018-09-20 13:32:14 | US | Late Campanian | NaN | Cretaceous |
10901 | Ceratopsidae | New Mexico | 36.314800 | -108.084198 | terrestrial indet. | 2018-09-20 13:32:14 | US | Late Campanian | NaN | Cretaceous |
10902 | Hadrosauridae | New Mexico | 36.314800 | -108.084198 | terrestrial indet. | 2018-09-20 13:32:14 | US | Late Campanian | NaN | Cretaceous |
10903 | Dinosauria | New Mexico | 36.314800 | -108.030701 | terrestrial indet. | 2018-09-20 13:49:23 | US | Late Campanian | NaN | Cretaceous |
10905 | Ceratopsidae | New Mexico | 36.086109 | -108.008331 | terrestrial indet. | 2018-09-20 13:54:50 | US | Late Campanian | NaN | Cretaceous |
10906 | Dinosauria | New Mexico | 36.086109 | -108.008331 | terrestrial indet. | 2018-09-20 13:55:45 | US | Late Campanian | NaN | Cretaceous |
10907 | Ceratopsidae | New Mexico | 36.186100 | -107.889198 | terrestrial indet. | 2018-09-20 14:10:57 | US | Late Campanian | NaN | Cretaceous |
10908 | Hadrosauridae | New Mexico | 36.186100 | -107.889198 | terrestrial indet. | 2018-09-20 14:10:57 | US | Late Campanian | NaN | Cretaceous |
10909 | Hadrosauridae | New Mexico | 36.183899 | -107.887901 | terrestrial indet. | 2018-09-20 14:13:11 | US | Late Campanian | NaN | Cretaceous |
10969 | Aves | Utah | 38.151001 | -109.598000 | fluvial-lacustrine indet. | 2019-03-15 18:33:33 | US | Cenomanian | Turonian | Cretaceous |
10975 | Tyrannosauridae | Texas | 29.344444 | -103.591667 | terrestrial indet. | 2019-04-01 20:38:00 | US | Early Campanian | NaN | Cretaceous |
1859 rows × 10 columns
g = sns.FacetGrid(top_dinos, col="interval", col_wrap=2, aspect=2)
g = g.map(sns.countplot, "accepted_name", order=interesting['accepted_name'].value_counts().index.tolist()[:10])
g.set_xticklabels(rotation=90)  # rotate the tick labels on every facet
plt.show()
It appears that the Neogene, Paleogene, and Quaternary periods contain only Aves (bird) fossils. This makes sense, as we would expect no non-avian dinosaur fossils after the end of the Cretaceous period. So, as the data shows, Aves fossils are among our top 10 most common “dinosaur” fossils because of their prevalence in the Neogene, Paleogene, and Quaternary periods.
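One way to check this kind of claim directly is to cross-tabulate taxon against period. The sketch below uses a small made-up frame with the same `accepted_name` and `interval` columns as `top_dinos`; the rows are invented for illustration and are not PBDB data.

```python
import pandas as pd

# Invented rows for illustration only; column names mirror top_dinos.
toy = pd.DataFrame({
    "accepted_name": ["Aves", "Aves", "Theropoda", "Hadrosauridae", "Aves"],
    "interval": ["Neogene", "Quaternary", "Cretaceous", "Cretaceous", "Cretaceous"],
})

# Rows are taxa, columns are periods, cells are occurrence counts.
counts = pd.crosstab(toy["accepted_name"], toy["interval"])

# Keep only the Cenozoic periods that actually occur in the data.
cenozoic = [p for p in ["Paleogene", "Neogene", "Quaternary"] if p in counts.columns]

# Taxa with at least one occurrence in a Cenozoic period.
in_cenozoic = counts[counts[cenozoic].sum(axis=1) > 0].index.tolist()

print(counts)
print(in_cenozoic)
```

Running the same two steps on `top_dinos` should list every taxon that survives into the Cenozoic facets of the plot.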
This concludes our exploratory analysis and data cleaning of our dinosaur data.
Part II: Plant Fossils
Now, let’s perform the same data cleaning and exploratory analysis on the plant fossils of the United States! First, build the request URL and make the HTTP request.
plant_URL = "https://paleobiodb.org/data1.2/occs/list.json?lngmin=-146.9531&lngmax=-45.7031&latmin=22.6748&latmax=50.4575&base_id=54311&show=coords,attr,loc,prot,time,strat,stratext,lith,lithext,geo,rem,ent,entname,crmod&datainfo"
plant_r = requests.get(url=plant_URL)  # store the HTTP response in plant_r
plant_data = plant_r.json()  # parse the JSON response body into a Python dict
with open("plant_NA.json", "w") as write_file:
    json.dump(plant_data, write_file)  # writes the file plant_NA.json
plant_df = pd.DataFrame.from_dict(json_normalize(plant_data))
plant_df
access_time | data_license | data_provider | data_source | data_url | documentation_url | elapsed_time | license_url | parameters.base_id | parameters.latmax | parameters.latmin | parameters.lngmax | parameters.lngmin | parameters.show | parameters.taxon_status | parameters.timerule | records | title | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Tue 2019-04-02 17:57:22 GMT | Creative Commons CC-BY | The Paleobiology Database | The Paleobiology Database | http://paleobiodb.org/data1.2/occs/list.json?l... | http://paleobiodb.org/data1.2/occs/list_doc.html | 2.64 | http://creativecommons.org/licenses/by/4.0/ | 54311 | 50.4575 | 22.6748 | -45.7031 | -146.9531 | coords,attr,loc,prot,time,strat,stratext,lith,... | all | major | [{'oid': 'occ:3285', 'cid': 'col:324', 'idn': ... | PBDB Data Service |
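The long query string in `plant_URL` is easier to read and edit when it is assembled from a dictionary. Here is a minimal sketch using only the standard library. The parameter names and values are copied from the URL above, except that the `show` list is shortened here for brevity.

```python
from urllib.parse import urlencode

BASE = "https://paleobiodb.org/data1.2/occs/list.json"

# Same bounding box and base_id as plant_URL; `show` is abbreviated.
params = {
    "lngmin": -146.9531,
    "lngmax": -45.7031,
    "latmin": 22.6748,
    "latmax": 50.4575,
    "base_id": 54311,
    "show": ",".join(["coords", "loc", "time"]),
}

# safe="," keeps the commas in `show` readable instead of percent-encoding them.
# `datainfo` is a bare flag with no value, so it is appended by hand.
url = BASE + "?" + urlencode(params, safe=",") + "&datainfo"
print(url)
```

Passing the resulting string to `requests.get` should behave like the hard-coded URL, and changing the bounding box or the taxon then only means editing one dictionary entry.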
plant_df = pd.DataFrame.from_dict(json_normalize(plant_data, ["records"]))
plant_df.head()
ath | ati | cc2 | cid | cny | cxi | dcr | dmd | eag | eid | ... | srb | sro | srs | ssc | stp | szn | tdf | tec | tid | tna | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | J. Sepkoski | prs:48 | US | col:324 | NaN | 21 | 1998-11-20 07:59:51 | 2017-01-07 23:13:35 | 460.9 | NaN | ... | NaN | NaN | NaN | NaN | New York | NaN | species not entered | passive margin | txn:5083 | Tetradium |
1 | J. Sepkoski | prs:48 | US | col:335 | NaN | 30 | 1998-11-20 07:59:51 | 2017-01-07 21:38:25 | 470.0 | NaN | ... | NaN | NaN | NaN | NaN | New York | NaN | NaN | passive margin | txn:5083 | Tetradium |
2 | J. Sepkoski | prs:48 | US | col:368 | NaN | 29 | 1998-11-20 07:59:51 | 2017-01-07 16:00:10 | 449.5 | NaN | ... | NaN | NaN | NaN | NaN | Ohio | NaN | NaN | NaN | txn:5083 | Tetradium |
3 | P. Wagner | prs:7 | US | col:374 | Franklin | 29 | 2005-05-03 16:55:33 | 2005-05-03 18:55:33 | 449.5 | rei:12727 | ... | NaN | NaN | NaN | bed | Indiana | NaN | subjective synonym of | NaN | txn:327177 | Tetradium huronense |
4 | J. Sepkoski | prs:48 | US | col:392 | NaN | 29 | 1998-11-20 07:59:51 | 2017-01-02 04:05:42 | 449.5 | NaN | ... | NaN | NaN | NaN | formation | Ohio | NaN | NaN | foreland basin | txn:5083 | Tetradium |
5 rows × 61 columns
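The two `json_normalize` calls give very different frames because the response is a dictionary of metadata with the occurrences nested under the `records` key. Normalizing the whole response yields a single row of metadata, while naming `records` as the record path yields one row per occurrence. Here is a toy sketch with a made-up response that only imitates that shape. It assumes a pandas version that exposes `pd.json_normalize` at the top level.

```python
import pandas as pd

# Made-up response imitating the PBDB layout: metadata plus a "records" list.
response = {
    "title": "PBDB Data Service",
    "records": [
        {"oid": "occ:1", "tna": "Tetradium", "stp": "Ohio"},
        {"oid": "occ:2", "tna": "Stacheia", "stp": "Indiana"},
    ],
}

# Whole response: one row, with the entire records list stored in a single cell.
meta = pd.json_normalize(response)

# Record path: one row per occurrence, one column per field.
records = pd.json_normalize(response, record_path="records")

print(meta.shape)
print(records.shape)
```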
Using the function we defined above, we now rename the columns of our plant dataframe.
df_rename(plant_df)
print(plant_df.shape)
plant_df.head()
(18081, 61)
authorizer | authorizer_no | country | collection_no | county | cx_int_no | created | modified | max_ma | reid_no | ... | srb | sro | regional_section | stratscale | state | szn | diference | tec | accepted_no | accepted_name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | J. Sepkoski | prs:48 | US | col:324 | NaN | 21 | 1998-11-20 07:59:51 | 2017-01-07 23:13:35 | 460.9 | NaN | ... | NaN | NaN | NaN | NaN | New York | NaN | species not entered | passive margin | txn:5083 | Tetradium |
1 | J. Sepkoski | prs:48 | US | col:335 | NaN | 30 | 1998-11-20 07:59:51 | 2017-01-07 21:38:25 | 470.0 | NaN | ... | NaN | NaN | NaN | NaN | New York | NaN | NaN | passive margin | txn:5083 | Tetradium |
2 | J. Sepkoski | prs:48 | US | col:368 | NaN | 29 | 1998-11-20 07:59:51 | 2017-01-07 16:00:10 | 449.5 | NaN | ... | NaN | NaN | NaN | NaN | Ohio | NaN | NaN | NaN | txn:5083 | Tetradium |
3 | P. Wagner | prs:7 | US | col:374 | Franklin | 29 | 2005-05-03 16:55:33 | 2005-05-03 18:55:33 | 449.5 | rei:12727 | ... | NaN | NaN | NaN | bed | Indiana | NaN | subjective synonym of | NaN | txn:327177 | Tetradium huronense |
4 | J. Sepkoski | prs:48 | US | col:392 | NaN | 29 | 1998-11-20 07:59:51 | 2017-01-02 04:05:42 | 449.5 | NaN | ... | NaN | NaN | NaN | formation | Ohio | NaN | NaN | foreland basin | txn:5083 | Tetradium |
5 rows × 61 columns
plant_df.columns
Index(['authorizer', 'authorizer_no', 'country', 'collection_no', 'county',
'cx_int_no', 'created', 'modified', 'max_ma', 'reid_no', 'eni',
'enterer', 'environment', 'fossils_from_1', 'fossils_from_2', 'flg',
'geology_comments', 'geog_comments', 'geog_scale', 'identified_name',
'idr', 'identified_no', 'lith_adj_1', 'lith_adj_2', 'min_ma', 'lat',
'lith_descript', 'lithification_1', 'lithification_2',
'minor_lithology_1', 'minor_lithology_2', 'lng', 'lithology_1',
'lithology_2', 'modifier', 'mdi', 'occurance_comments',
'early_interval', 'oid', 'late_interval', 'prc', 'protected', 'rid',
'rnk', 'strat_comments', 'formation', 'stratgroup', 'local_bed',
'local_order', 'local_section', 'member', 'srb', 'sro',
'regional_section', 'stratscale', 'state', 'szn', 'diference', 'tec',
'accepted_no', 'accepted_name'],
dtype='object')
plants = plant_df[["accepted_name", "state", "lat", "lng", "environment", "created", "country", "early_interval", "late_interval"]]
plants.head()
accepted_name | state | lat | lng | environment | created | country | early_interval | late_interval | |
---|---|---|---|---|---|---|---|---|---|
0 | Tetradium | New York | 43.212776 | -75.456108 | shallow subtidal indet. | 1998-11-20 07:59:51 | US | Blackriveran | NaN |
1 | Tetradium | New York | 43.212776 | -75.456108 | deep subtidal shelf | 1998-11-20 07:59:51 | US | Middle Ordovician | NaN |
2 | Tetradium | Ohio | 39.445278 | -83.828613 | offshore | 1998-11-20 07:59:51 | US | Richmondian | NaN |
3 | Tetradium huronense | Indiana | 39.423058 | -85.012779 | shallow subtidal indet. | 2005-05-03 16:55:33 | US | Richmondian | NaN |
4 | Tetradium | Ohio | 39.000000 | -84.000000 | offshore ramp | 1998-11-20 07:59:51 | US | Richmond | Ashgill |
plants["country"].unique()
array(['US', 'MX', 'CA', 'BS'], dtype=object)
As above, we get rid of all the data from outside the US. We see that there are 16,242 entries.
plants = plants[plants["country"] == "US"].copy()  # copy so that adding columns later does not trigger SettingWithCopyWarning
print(plants.shape)
plants.head()
(16242, 9)
accepted_name | state | lat | lng | environment | created | country | early_interval | late_interval | |
---|---|---|---|---|---|---|---|---|---|
0 | Tetradium | New York | 43.212776 | -75.456108 | shallow subtidal indet. | 1998-11-20 07:59:51 | US | Blackriveran | NaN |
1 | Tetradium | New York | 43.212776 | -75.456108 | deep subtidal shelf | 1998-11-20 07:59:51 | US | Middle Ordovician | NaN |
2 | Tetradium | Ohio | 39.445278 | -83.828613 | offshore | 1998-11-20 07:59:51 | US | Richmondian | NaN |
3 | Tetradium huronense | Indiana | 39.423058 | -85.012779 | shallow subtidal indet. | 2005-05-03 16:55:33 | US | Richmondian | NaN |
4 | Tetradium | Ohio | 39.000000 | -84.000000 | offshore ramp | 1998-11-20 07:59:51 | US | Richmond | Ashgill |
plants["interval"] = plants["early_interval"]
plants
accepted_name | state | lat | lng | environment | created | country | early_interval | late_interval | interval | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Tetradium | New York | 43.212776 | -75.456108 | shallow subtidal indet. | 1998-11-20 07:59:51 | US | Blackriveran | NaN | Blackriveran |
1 | Tetradium | New York | 43.212776 | -75.456108 | deep subtidal shelf | 1998-11-20 07:59:51 | US | Middle Ordovician | NaN | Middle Ordovician |
2 | Tetradium | Ohio | 39.445278 | -83.828613 | offshore | 1998-11-20 07:59:51 | US | Richmondian | NaN | Richmondian |
3 | Tetradium huronense | Indiana | 39.423058 | -85.012779 | shallow subtidal indet. | 2005-05-03 16:55:33 | US | Richmondian | NaN | Richmondian |
4 | Tetradium | Ohio | 39.000000 | -84.000000 | offshore ramp | 1998-11-20 07:59:51 | US | Richmond | Ashgill | Richmond |
5 | Tetradium | Ohio | 39.000000 | -84.000000 | deep subtidal ramp | 1998-11-20 07:59:51 | US | Richmondian | Ashgill | Richmondian |
6 | Tetradium | Ohio | 39.000000 | -84.000000 | shallow subtidal indet. | 1998-11-20 07:59:51 | US | Richmondian | Ashgill | Richmondian |
7 | Tetradium | Ohio | 39.000000 | -84.000000 | deep subtidal ramp | 1998-11-20 07:59:51 | US | Richmondian | NaN | Richmondian |
8 | Stacheia | Indiana | 39.500000 | -86.500000 | delta plain | 1999-01-14 15:00:00 | US | Chadian | Arundian | Chadian |
9 | Stacheia | Indiana | 39.750000 | -86.666664 | interdistributary bay | 1999-01-14 15:00:00 | US | Chadian | Arundian | Chadian |
10 | Fourstonella | Iowa | 41.333332 | -92.166664 | offshore shelf | 1999-07-29 06:10:37 | US | Brigantian | NaN | Brigantian |
11 | Tetradium | Tennessee | 35.500000 | -86.500000 | NaN | 1999-09-23 10:03:46 | US | Blackriveran | NaN | Blackriveran |
12 | Tetradium | Tennessee | 35.500000 | -86.500000 | shallow subtidal indet. | 1999-09-24 09:00:01 | US | Blackriveran | NaN | Blackriveran |
13 | Solenopora | Tennessee | 35.500000 | -86.500000 | carbonate indet. | 1999-09-24 09:40:52 | US | Blackriveran | NaN | Blackriveran |
14 | Tetradium | Tennessee | 35.500000 | -86.500000 | carbonate indet. | 1999-09-24 09:40:52 | US | Blackriveran | NaN | Blackriveran |
15 | Solenopora | Tennessee | 35.500000 | -86.500000 | shallow subtidal indet. | 1999-09-24 09:48:44 | US | Rocklandian | NaN | Rocklandian |
16 | Tetradium | Tennessee | 35.500000 | -86.500000 | shallow subtidal indet. | 1999-09-24 09:48:44 | US | Rocklandian | NaN | Rocklandian |
17 | Tetradium | Tennessee | 35.500000 | -86.500000 | shallow subtidal indet. | 1999-09-24 09:48:44 | US | Rocklandian | NaN | Rocklandian |
18 | Tetradium | Tennessee | 35.500000 | -86.500000 | shallow subtidal indet. | 1999-09-27 07:45:51 | US | Rocklandian | NaN | Rocklandian |
19 | Solenopora | Tennessee | 35.500000 | -86.500000 | shallow subtidal indet. | 1999-09-27 08:27:28 | US | Shermanian | NaN | Shermanian |
20 | Tetradium | Tennessee | 35.500000 | -86.500000 | shallow subtidal indet. | 1999-09-27 08:27:28 | US | Shermanian | NaN | Shermanian |
21 | Solenopora | Tennessee | 35.500000 | -86.500000 | sand shoal | 1999-09-27 08:37:42 | US | Shermanian | NaN | Shermanian |
22 | Tetradium | Tennessee | 35.500000 | -86.500000 | peritidal | 1999-09-27 09:35:02 | US | Shermanian | NaN | Shermanian |
23 | Tetradium | Tennessee | 35.500000 | -86.500000 | peritidal | 1999-09-27 09:35:02 | US | Shermanian | NaN | Shermanian |
24 | Tetradium | Tennessee | 35.500000 | -86.500000 | peritidal | 1999-09-27 09:40:41 | US | Shermanian | NaN | Shermanian |
25 | Solenopora | Tennessee | 35.500000 | -86.500000 | shallow subtidal indet. | 1999-09-27 09:57:06 | US | Shermanian | NaN | Shermanian |
26 | Tetradium | Tennessee | 35.500000 | -86.500000 | shallow subtidal indet. | 1999-09-27 09:57:06 | US | Shermanian | NaN | Shermanian |
27 | Solenopora | Tennessee | 35.500000 | -86.500000 | NaN | 1999-09-28 08:53:38 | US | Shermanian | NaN | Shermanian |
28 | Tetradium | Tennessee | 35.500000 | -86.500000 | shallow subtidal indet. | 1999-09-28 09:25:33 | US | Franklinian | Edenian | Franklinian |
29 | Solenopora | Tennessee | 35.500000 | -86.500000 | shallow subtidal indet. | 1999-09-28 10:01:31 | US | Maysvillian | NaN | Maysvillian |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
18032 | Ceanothus precuneatus | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:46:15 | US | Hemingfordian | Barstovian | Hemingfordian |
18033 | Colubrina lanceolata | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:46:15 | US | Hemingfordian | Barstovian | Hemingfordian |
18034 | Condalia mohavensis | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:46:15 | US | Hemingfordian | Barstovian | Hemingfordian |
18035 | Karwinskia californica | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:46:15 | US | Hemingfordian | Barstovian | Hemingfordian |
18036 | Rhamnus precalifornica | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:47:01 | US | Hemingfordian | Barstovian | Hemingfordian |
18037 | Fremontia lobata | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:47:01 | US | Hemingfordian | Barstovian | Hemingfordian |
18038 | Arbutus mohavensis | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:47:01 | US | Hemingfordian | Barstovian | Hemingfordian |
18039 | Arbutus prexalapensis | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:47:01 | US | Hemingfordian | Barstovian | Hemingfordian |
18040 | Arctostaphylos mohavensis | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:47:01 | US | Hemingfordian | Barstovian | Hemingfordian |
18041 | Bumelia florissanti | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:47:01 | US | Hemingfordian | Barstovian | Hemingfordian |
18042 | Forestiera | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:47:01 | US | Hemingfordian | Barstovian | Hemingfordian |
18043 | Fraxinus | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:47:01 | US | Hemingfordian | Barstovian | Hemingfordian |
18044 | Viburnum | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:47:01 | US | Hemingfordian | Barstovian | Hemingfordian |
18045 | Phyllites | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:47:01 | US | Hemingfordian | Barstovian | Hemingfordian |
18046 | Phyllites | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:47:24 | US | Hemingfordian | Barstovian | Hemingfordian |
18047 | Phyllites | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:47:24 | US | Hemingfordian | Barstovian | Hemingfordian |
18048 | Phyllites | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:47:24 | US | Hemingfordian | Barstovian | Hemingfordian |
18049 | Acer | Oregon | 44.301109 | -120.832222 | lacustrine indet. | 2018-06-04 01:16:17 | US | Early Oligocene | NaN | Early Oligocene |
18050 | Quercus | Oregon | 44.301109 | -120.832222 | lacustrine indet. | 2018-06-04 01:16:17 | US | Early Oligocene | NaN | Early Oligocene |
18051 | Fagus | Oregon | 44.301109 | -120.832222 | lacustrine indet. | 2018-06-04 01:16:17 | US | Early Oligocene | NaN | Early Oligocene |
18052 | Alnus | Oregon | 44.301109 | -120.832222 | lacustrine indet. | 2018-06-04 01:16:17 | US | Early Oligocene | NaN | Early Oligocene |
18053 | Embryophyta | California | 34.889999 | -119.160004 | marine indet. | 2018-06-22 20:20:49 | US | Eocene | NaN | Eocene |
18054 | Tracheophyta | Utah | 37.294247 | -111.896385 | fluvial indet. | 2018-07-25 14:16:17 | US | Middle Campanian | NaN | Middle Campanian |
18058 | Solenopora | Nevada | 39.611668 | -117.491943 | shallow subtidal indet. | 2018-08-27 07:03:13 | US | Ladinian | NaN | Ladinian |
18059 | Rhodophyceae | Nevada | 39.615555 | -117.504166 | offshore indet. | 2018-08-27 07:03:33 | US | Julian | NaN | Julian |
18060 | Lithophyllum | California | 33.555607 | -117.737595 | shallow subtidal indet. | 2018-09-05 12:56:39 | US | Early Miocene | NaN | Early Miocene |
18061 | Plantae | California | 36.926311 | -118.088516 | marine indet. | 2018-09-06 13:21:13 | US | Mississippian | NaN | Mississippian |
18062 | Paraphyllanthoxylon | Utah | 38.033333 | -110.133331 | terrestrial indet. | 2018-09-26 21:49:43 | US | Turonian | NaN | Turonian |
18079 | Ficus | California | 33.588631 | -117.705620 | shallow subtidal indet. | 2019-02-15 19:35:27 | US | Early Miocene | NaN | Early Miocene |
18080 | Angiospermae | Texas | 29.286388 | -103.550278 | terrestrial indet. | 2019-03-15 14:11:50 | US | Maastrichtian | NaN | Maastrichtian |
16242 rows × 10 columns
plants["interval"].unique()
array(['Blackriveran', 'Middle Ordovician', 'Richmondian', 'Richmond',
'Chadian', 'Brigantian', 'Rocklandian', 'Shermanian',
'Franklinian', 'Maysvillian', 'Chesterian', 'Meramecian',
'Kinderhookian', 'Osagean', 'Frasnian', 'Mississippian',
'Givetian', 'Emsian', 'Late Emsian', 'Early Devonian',
'Late Triassic', 'Westphalian D', 'Desmoinesian', 'Carboniferous',
'Pennsylvanian', 'Oligocene', 'Miocene', 'Wasatchian', 'Tertiary',
'Clarkforkian', 'Missourian', 'Permian', 'Sakmarian', 'Artinskian',
'Asselian', 'Rotliegendes', 'Late Pennsylvanian', 'Bashkirian',
'Podolskian', 'Westphalian', 'Atokan', 'Morrowan',
'Early Pennsylvanian', 'Virgilian', 'Lostcabinian', 'Tiffanian',
'Torrejonian', 'Puercan', 'Graybullian', 'Early Eocene',
'Lutetian', 'Kimmeridgian', 'Tournaisian', 'Late Albian',
'Late Devonian', 'Middle Devonian', 'Famennian', 'Late Frasnian',
'Early Famennian', 'Late Givetian', 'Early Frasnian',
'Early Cenomanian', 'Early Albian', 'Serpukhovian', 'Visean',
'Late Aptian', 'Middle Famennian', 'Middle Eocene', 'Stephanian',
'Early Miocene', 'Ypresian', 'Eocene', 'Priabonian',
'Late Oligocene', 'Moscovian', 'Late Miocene', 'Late Famennian',
'Wordian', 'Carnian', 'Ashgillian', 'Late Cretaceous', 'Leonard',
'Wolfcamp', 'Capitanian', 'Chadronian', 'Uintan', 'Late Jurassic',
'Middle Jurassic', 'Lancian', 'Orellan', 'Late Maastrichtian',
'Late Campanian', 'Rhuddanian', 'Ludlow', 'Lochkovian',
'Early Campanian', 'Bridgerian', 'Late Hemphillian', 'Roadian',
'Judithian', 'Burdigalian', 'Serravallian', 'Late Santonian',
'Late Turonian', 'Cretaceous', 'Eifelian', 'Late Paleocene',
'Campanian', 'Paleocene', 'Middle Albian', 'Albian',
'Early Cretaceous', 'Cenomanian', 'Late Cenomanian', 'Edmontonian',
'Barremian', 'Santonian', 'Maastrichtian', 'Middle Campanian',
'Caradoc', 'Coniacian', 'Caradocian', 'Ashbyan', 'Kasimovian',
'Turonian', 'Early Paleocene', 'Late Pleistocene', 'Chatfieldian',
'Thanetian', 'Gzhelian', 'Llanvirn', 'Aegean', 'Late Eocene',
'Barstovian', 'Middle Miocene', 'Early Oligocene', 'Pliocene',
'Aptian', 'Early Clarendonian', 'Rupelian', 'Late Clarendonian',
'Norian', 'Kashirian', 'Late Pliensbachian', 'Late Uintan',
'Pliensbachian', 'Changhsingian', 'Early Jurassic',
'Middle Turonian', 'Irvingtonian', 'Hettangian', 'Bartonian',
'Early Pleistocene', 'Zanclean', 'Chazy', 'Holocene', 'Namurian',
'Westphalian B', 'Kungurian', 'Rancholabrean', 'Early Aptian',
'Hemphillian', 'Asbian', 'Delamaran', 'Wenlock', 'Rhaetian',
'Mohawkian', 'Late Mississippian', 'Spathian', 'Marjumian',
'Ladinian', 'Early Santonian', 'Middle Cenomanian',
'Late Oxfordian', 'Middle Pennsylvanian', 'Piacenzian',
'Early Maastrichtian', 'Clarendonian', 'Chazyan', 'Sinemurian',
'Langhian', 'Tortonian', 'Messinian', 'Calabrian', 'Whiterockian',
'Early Hettangian', 'Hemingfordian', 'Julian'], dtype=object)
We can use the function we defined in the Dinosaur section to perform a coarse data transformation. This is one example of how functions can be useful!
change_geog(Cretaceous, "Cretaceous", plants)
change_geog(Jurassic, "Jurassic", plants)
change_geog(Triassic, "Triassic", plants)
change_geog(Neogene, "Neogene", plants)
change_geog(Paleogene, "Paleogene", plants)
change_geog(Quaternary, "Quaternary", plants)
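`change_geog` itself is defined in the dinosaur section. For readers who skipped ahead, the sketch below shows one plausible shape for such a helper. `relabel_intervals` and the two short stage lists are hypothetical stand-ins, not the post's actual code or its full lists.

```python
import pandas as pd

def relabel_intervals(stages, period, df):
    """Overwrite df["interval"] with `period` wherever it names one of `stages`."""
    df.loc[df["interval"].isin(stages), "interval"] = period

# Hypothetical, abbreviated stage lists for illustration.
cretaceous_stages = ["Campanian", "Maastrichtian", "Turonian"]
neogene_stages = ["Miocene", "Early Miocene", "Pliocene"]

toy = pd.DataFrame({"interval": ["Turonian", "Early Miocene", "Blackriveran"]})
relabel_intervals(cretaceous_stages, "Cretaceous", toy)
relabel_intervals(neogene_stages, "Neogene", toy)
print(toy["interval"].tolist())
```

Stages that appear in none of the lists, such as the Ordovician `Blackriveran`, keep their original label under this approach, so rows older than the Triassic remain distinguishable afterwards.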
Again, we assign every interval to its geologic period. However, we find that much of the plant data dates back to the Paleozoic era, earlier than any of our dinosaur data. Thus, we will keep only the data from the Mesozoic and Cenozoic, shown below.
dino_plants = plants[plants["interval"].isin(["Cretaceous", "Jurassic", "Triassic", "Neogene", "Paleogene", "Quaternary"])]
dino_plants
accepted_name | state | lat | lng | environment | created | country | early_interval | late_interval | interval | |
---|---|---|---|---|---|---|---|---|---|---|
96 | Neocalamites | New Mexico | 35.200001 | -105.783333 | fluvial indet. | 2001-06-08 06:53:35 | US | Late Triassic | NaN | Triassic |
97 | Brachyphyllum | New Mexico | 35.200001 | -105.783333 | fluvial indet. | 2001-06-08 06:53:35 | US | Late Triassic | NaN | Triassic |
98 | Masculostrobus | New Mexico | 35.200001 | -105.783333 | fluvial indet. | 2001-06-08 06:53:35 | US | Late Triassic | NaN | Triassic |
99 | Samaropsis | New Mexico | 35.200001 | -105.783333 | fluvial indet. | 2001-06-08 06:46:34 | US | Late Triassic | NaN | Triassic |
100 | Samaropsis | New Mexico | 35.200001 | -105.783333 | fluvial indet. | 2001-06-08 06:46:34 | US | Late Triassic | NaN | Triassic |
101 | Samaropsis | New Mexico | 35.200001 | -105.783333 | fluvial indet. | 2001-06-08 06:46:34 | US | Late Triassic | NaN | Triassic |
102 | Samaropsis | New Mexico | 35.200001 | -105.783333 | fluvial indet. | 2001-06-08 06:46:34 | US | Late Triassic | NaN | Triassic |
103 | Samaropsis | New Mexico | 35.200001 | -105.783333 | fluvial indet. | 2001-06-08 06:46:34 | US | Late Triassic | NaN | Triassic |
468 | Abies | Montana | 45.150002 | -113.116669 | "floodplain" | 2001-06-24 15:34:18 | US | Oligocene | Miocene | Paleogene |
469 | Sequoia | Montana | 45.150002 | -113.116669 | "floodplain" | 2001-06-24 15:34:18 | US | Oligocene | Miocene | Paleogene |
470 | Acer | Montana | 45.150002 | -113.116669 | "floodplain" | 2001-06-24 15:34:18 | US | Oligocene | Miocene | Paleogene |
471 | Acer | Montana | 45.150002 | -113.116669 | "floodplain" | 2001-06-24 15:34:18 | US | Oligocene | Miocene | Paleogene |
472 | Cercidiphyllum | Montana | 45.150002 | -113.116669 | "floodplain" | 2001-06-24 15:34:18 | US | Oligocene | Miocene | Paleogene |
473 | Cercis | Montana | 45.150002 | -113.116669 | "floodplain" | 2001-06-24 15:34:18 | US | Oligocene | Miocene | Paleogene |
474 | Corylus insignis | Montana | 45.150002 | -113.116669 | "floodplain" | 2001-06-24 15:34:18 | US | Oligocene | Miocene | Paleogene |
475 | Koelreuteria | Montana | 45.150002 | -113.116669 | "floodplain" | 2001-06-24 15:34:18 | US | Oligocene | Miocene | Paleogene |
476 | Juniperus | Montana | 45.000000 | -113.216667 | lacustrine - large | 2001-06-24 15:34:18 | US | Miocene | NaN | Neogene |
477 | Mimosites | Montana | 45.000000 | -113.216667 | lacustrine - large | 2001-06-24 15:34:18 | US | Miocene | NaN | Neogene |
478 | Cercocarpus | Montana | 45.000000 | -113.216667 | lacustrine - large | 2001-06-24 15:34:18 | US | Miocene | NaN | Neogene |
479 | Crataegus | Montana | 45.000000 | -113.216667 | lacustrine - large | 2001-06-24 15:34:18 | US | Miocene | NaN | Neogene |
480 | Dalbergia | Montana | 45.000000 | -113.216667 | lacustrine - large | 2001-06-24 15:34:18 | US | Miocene | NaN | Neogene |
481 | Ulmus | Montana | 45.000000 | -113.216667 | lacustrine - large | 2001-06-24 15:34:18 | US | Miocene | NaN | Neogene |
482 | Metasequoia occidentalis | Montana | 45.000000 | -113.216667 | lacustrine - large | 2001-06-24 15:34:18 | US | Miocene | NaN | Neogene |
495 | Acrovena laevis | North Dakota | 45.500000 | -103.250000 | "channel" | 2001-06-25 08:48:30 | US | Wasatchian | NaN | Paleogene |
496 | Betula hesterna | North Dakota | 45.500000 | -103.250000 | "channel" | 2001-06-25 08:48:30 | US | Wasatchian | NaN | Paleogene |
497 | Carpolithes bryangosus | North Dakota | 45.500000 | -103.250000 | "channel" | 2001-06-25 08:48:30 | US | Wasatchian | NaN | Paleogene |
498 | Cyperacites | North Dakota | 45.500000 | -103.250000 | "channel" | 2001-06-25 08:48:30 | US | Wasatchian | NaN | Paleogene |
499 | Equisetum magnum | North Dakota | 45.500000 | -103.250000 | "channel" | 2001-06-25 08:48:30 | US | Wasatchian | NaN | Paleogene |
500 | Glyptostrobus europaeus | North Dakota | 45.500000 | -103.250000 | "channel" | 2001-06-25 08:48:30 | US | Wasatchian | NaN | Paleogene |
501 | Platycarya americana | North Dakota | 45.500000 | -103.250000 | "channel" | 2006-12-15 09:45:24 | US | Wasatchian | NaN | Paleogene |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
18030 | Dodonaea californica | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:46:15 | US | Hemingfordian | Barstovian | Neogene |
18031 | Ceanothus precrassifolius | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:46:15 | US | Hemingfordian | Barstovian | Neogene |
18032 | Ceanothus precuneatus | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:46:15 | US | Hemingfordian | Barstovian | Neogene |
18033 | Colubrina lanceolata | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:46:15 | US | Hemingfordian | Barstovian | Neogene |
18034 | Condalia mohavensis | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:46:15 | US | Hemingfordian | Barstovian | Neogene |
18035 | Karwinskia californica | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:46:15 | US | Hemingfordian | Barstovian | Neogene |
18036 | Rhamnus precalifornica | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:47:01 | US | Hemingfordian | Barstovian | Neogene |
18037 | Fremontia lobata | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:47:01 | US | Hemingfordian | Barstovian | Neogene |
18038 | Arbutus mohavensis | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:47:01 | US | Hemingfordian | Barstovian | Neogene |
18039 | Arbutus prexalapensis | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:47:01 | US | Hemingfordian | Barstovian | Neogene |
18040 | Arctostaphylos mohavensis | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:47:01 | US | Hemingfordian | Barstovian | Neogene |
18041 | Bumelia florissanti | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:47:01 | US | Hemingfordian | Barstovian | Neogene |
18042 | Forestiera | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:47:01 | US | Hemingfordian | Barstovian | Neogene |
18043 | Fraxinus | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:47:01 | US | Hemingfordian | Barstovian | Neogene |
18044 | Viburnum | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:47:01 | US | Hemingfordian | Barstovian | Neogene |
18045 | Phyllites | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:47:01 | US | Hemingfordian | Barstovian | Neogene |
18046 | Phyllites | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:47:24 | US | Hemingfordian | Barstovian | Neogene |
18047 | Phyllites | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:47:24 | US | Hemingfordian | Barstovian | Neogene |
18048 | Phyllites | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:47:24 | US | Hemingfordian | Barstovian | Neogene |
18049 | Acer | Oregon | 44.301109 | -120.832222 | lacustrine indet. | 2018-06-04 01:16:17 | US | Early Oligocene | NaN | Paleogene |
18050 | Quercus | Oregon | 44.301109 | -120.832222 | lacustrine indet. | 2018-06-04 01:16:17 | US | Early Oligocene | NaN | Paleogene |
18051 | Fagus | Oregon | 44.301109 | -120.832222 | lacustrine indet. | 2018-06-04 01:16:17 | US | Early Oligocene | NaN | Paleogene |
18052 | Alnus | Oregon | 44.301109 | -120.832222 | lacustrine indet. | 2018-06-04 01:16:17 | US | Early Oligocene | NaN | Paleogene |
18053 | Embryophyta | California | 34.889999 | -119.160004 | marine indet. | 2018-06-22 20:20:49 | US | Eocene | NaN | Paleogene |
18054 | Tracheophyta | Utah | 37.294247 | -111.896385 | fluvial indet. | 2018-07-25 14:16:17 | US | Middle Campanian | NaN | Cretaceous |
18058 | Solenopora | Nevada | 39.611668 | -117.491943 | shallow subtidal indet. | 2018-08-27 07:03:13 | US | Ladinian | NaN | Triassic |
18060 | Lithophyllum | California | 33.555607 | -117.737595 | shallow subtidal indet. | 2018-09-05 12:56:39 | US | Early Miocene | NaN | Neogene |
18062 | Paraphyllanthoxylon | Utah | 38.033333 | -110.133331 | terrestrial indet. | 2018-09-26 21:49:43 | US | Turonian | NaN | Cretaceous |
18079 | Ficus | California | 33.588631 | -117.705620 | shallow subtidal indet. | 2019-02-15 19:35:27 | US | Early Miocene | NaN | Neogene |
18080 | Angiospermae | Texas | 29.286388 | -103.550278 | terrestrial indet. | 2019-03-15 14:11:50 | US | Maastrichtian | NaN | Cretaceous |
9943 rows × 10 columns
dino_plants.shape
(9943, 10)
It looks like a majority of our US plant data coincides with the eras of our dinosaur data. Above, we found that the total number of entries in our US data was 16122. After filtering out everything in the Paleozoic era, we are left with 9943 entries, which matches the shape shown above.
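A quick way to sanity-check counts like these is to compare the length of the frame before and after the filter. The sketch below is only an illustration on a tiny made-up frame: the name `us_plants`, its rows, and the choice to filter on the `interval` column are assumptions, not the notebook's actual code.

```python
import pandas as pd

# Toy stand-in for the US plant data (made-up rows, not PBDB records).
us_plants = pd.DataFrame({
    "accepted_name": ["Acer", "Quercus", "Lepidodendron", "Ficus"],
    "interval": ["Paleogene", "Neogene", "Carboniferous", "Cretaceous"],
})

# Periods that make up the Paleozoic era.
paleozoic = ["Cambrian", "Ordovician", "Silurian",
             "Devonian", "Carboniferous", "Permian"]

# Keep only the rows whose period is not in the Paleozoic.
kept = us_plants[~us_plants["interval"].isin(paleozoic)]

# Report how many rows the filter removed.
n_before, n_after = len(us_plants), len(kept)
print(f"{n_before} rows before, {n_after} after, {n_before - n_after} dropped")
```

Printing both numbers next to each other makes it easy to spot a mismatch between the prose and the DataFrame output.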
plt.figure(figsize=(12,6))
ax = sns.countplot(x='interval', data=dino_plants, order=dino_plants['interval'].value_counts().index.tolist())
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.title("Plant Fossils by Era")
plt.show()
In contrast to our dinosaur fossils by era, most of our plants are from the Paleogene, which is where we had the fewest dinosaurs. In general, the further back we go, the fewer plant fossils we find. This is plausible, since we would expect dinosaur bones to fossilize more readily than plant tissue.
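To put the two distributions on the same scale, one option is to compare the share of records per era rather than raw counts. This is a minimal sketch on made-up labels; `dino_intervals` and `plant_intervals` are hypothetical stand-ins for the `interval` columns of the cleaned dinosaur and plant frames.

```python
import pandas as pd

# Made-up era labels standing in for the two cleaned `interval` columns.
dino_intervals = pd.Series(["Cretaceous", "Cretaceous", "Jurassic", "Neogene"])
plant_intervals = pd.Series(["Paleogene", "Paleogene", "Cretaceous", "Neogene"])

# Fraction of records per era for each group, aligned on the era name.
# Eras missing from one group become 0 instead of NaN.
shares = pd.DataFrame({
    "dinosaurs": dino_intervals.value_counts(normalize=True),
    "plants": plant_intervals.value_counts(normalize=True),
}).fillna(0.0)

print(shares)
```

Because each column sums to 1, the table shows where each group is concentrated even when one dataset is much larger than the other.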
plt.figure(figsize=(12,6))
ax = sns.countplot(x='accepted_name', data=dino_plants, order=dino_plants['accepted_name'].value_counts().index.tolist()[:20])
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.title("Plant Fossils by Count (Top 20)")
plt.show()
We see that Magnoliopsida is the most common plant fossil found in our data. After a quick Google search, we find that Magnoliopsida is a class of flowering plants that includes the common magnolia. Magnolias are an ancient lineage, and they are sometimes described as living fossils because they have changed so little over millions of years.
Now, let’s look at the distribution of plants across the eras.
top10_plants = dino_plants['accepted_name'].value_counts().index.tolist()[:10]
top_plants = dino_plants[dino_plants["accepted_name"].isin(top10_plants)]
top_plants
| | accepted_name | state | lat | lng | environment | created | country | early_interval | late_interval | interval |
---|---|---|---|---|---|---|---|---|---|---|
470 | Acer | Montana | 45.150002 | -113.116669 | "floodplain" | 2001-06-24 15:34:18 | US | Oligocene | Miocene | Paleogene |
471 | Acer | Montana | 45.150002 | -113.116669 | "floodplain" | 2001-06-24 15:34:18 | US | Oligocene | Miocene | Paleogene |
472 | Cercidiphyllum | Montana | 45.150002 | -113.116669 | "floodplain" | 2001-06-24 15:34:18 | US | Oligocene | Miocene | Paleogene |
482 | Metasequoia occidentalis | Montana | 45.000000 | -113.216667 | lacustrine - large | 2001-06-24 15:34:18 | US | Miocene | NaN | Neogene |
513 | Acer | North Dakota | 46.883331 | -102.783333 | "channel" | 2001-06-25 09:42:07 | US | Clarkforkian | NaN | Paleogene |
535 | Metasequoia occidentalis | North Dakota | 46.983334 | -102.050003 | "channel" | 2001-06-25 10:56:12 | US | Clarkforkian | NaN | Paleogene |
553 | Metasequoia occidentalis | North Dakota | 47.306389 | -102.112503 | "channel" | 2001-06-25 11:36:56 | US | Clarkforkian | NaN | Paleogene |
565 | Metasequoia occidentalis | North Dakota | 47.000000 | -101.550003 | "channel" | 2001-06-25 12:10:52 | US | Clarkforkian | NaN | Paleogene |
832 | Metasequoia occidentalis | North Dakota | 47.000000 | -101.550003 | "channel" | 2001-07-08 17:54:30 | US | Wasatchian | NaN | Paleogene |
836 | Metasequoia occidentalis | North Dakota | 46.750000 | -102.750000 | "channel" | 2001-07-08 18:04:24 | US | Wasatchian | NaN | Paleogene |
841 | Metasequoia occidentalis | North Dakota | 47.133331 | -101.816666 | "channel" | 2001-07-08 18:11:26 | US | Clarkforkian | NaN | Paleogene |
846 | Metasequoia occidentalis | North Dakota | 46.971668 | -102.084442 | "channel" | 2001-07-08 18:20:46 | US | Clarkforkian | NaN | Paleogene |
850 | Acer | North Dakota | 46.916668 | -102.583336 | "channel" | 2001-07-08 21:24:05 | US | Clarkforkian | NaN | Paleogene |
855 | Metasequoia occidentalis | North Dakota | 46.916668 | -102.583336 | "channel" | 2001-07-08 21:24:05 | US | Clarkforkian | NaN | Paleogene |
863 | Metasequoia occidentalis | North Dakota | 47.000000 | -102.666664 | "channel" | 2001-07-08 21:29:27 | US | Clarkforkian | NaN | Paleogene |
882 | Acer | North Dakota | 46.966667 | -102.666664 | "channel" | 2001-07-08 21:56:59 | US | Clarkforkian | NaN | Paleogene |
886 | Metasequoia occidentalis | North Dakota | 46.966667 | -102.666664 | "channel" | 2001-07-08 21:56:59 | US | Clarkforkian | NaN | Paleogene |
891 | Metasequoia occidentalis | North Dakota | 48.479721 | -102.722221 | "channel" | 2001-07-08 22:05:41 | US | Clarkforkian | NaN | Paleogene |
905 | Metasequoia occidentalis | North Dakota | 47.816666 | -102.766670 | "channel" | 2001-07-08 22:35:02 | US | Clarkforkian | NaN | Paleogene |
914 | Metasequoia occidentalis | North Dakota | 47.366669 | -103.016670 | "channel" | 2001-07-08 22:45:12 | US | Clarkforkian | NaN | Paleogene |
917 | Metasequoia occidentalis | North Dakota | 47.377499 | -103.257500 | "channel" | 2001-07-08 22:54:21 | US | Clarkforkian | NaN | Paleogene |
1075 | Quercus | Montana | 45.150002 | -113.116669 | "floodplain" | 2001-07-23 11:12:44 | US | Oligocene | Miocene | Paleogene |
1076 | Quercus | Montana | 45.150002 | -113.116669 | "floodplain" | 2001-07-23 11:12:44 | US | Oligocene | Miocene | Paleogene |
1077 | Quercus | Montana | 45.150002 | -113.116669 | "floodplain" | 2001-07-23 11:12:44 | US | Oligocene | Miocene | Paleogene |
1079 | Salix | Montana | 45.150002 | -113.116669 | "floodplain" | 2001-07-23 11:12:44 | US | Oligocene | Miocene | Paleogene |
1754 | Metasequoia occidentalis | Wyoming | 45.000000 | -109.000000 | NaN | 2001-08-14 07:28:06 | US | Tiffanian | NaN | Paleogene |
1758 | Ficus | Wyoming | 45.000000 | -109.000000 | NaN | 2001-08-14 07:33:56 | US | Puercan | NaN | Paleogene |
1759 | Metasequoia occidentalis | Wyoming | 45.000000 | -109.000000 | NaN | 2001-08-14 07:33:56 | US | Puercan | NaN | Paleogene |
1762 | Ficus | Wyoming | 45.000000 | -109.000000 | NaN | 2001-08-14 07:36:25 | US | Puercan | NaN | Paleogene |
1763 | Metasequoia occidentalis | Wyoming | 45.000000 | -109.000000 | NaN | 2001-08-14 07:36:25 | US | Puercan | NaN | Paleogene |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
17340 | Magnoliopsida | North Dakota | 46.418560 | -103.976707 | fluvial indet. | 2012-06-22 03:05:17 | US | Early Paleocene | NaN | Paleogene |
17343 | Magnoliopsida | North Dakota | 46.418560 | -103.976707 | fluvial indet. | 2012-06-22 03:05:17 | US | Early Paleocene | NaN | Paleogene |
17346 | Magnoliopsida | North Dakota | 46.418560 | -103.976707 | fluvial indet. | 2012-06-22 03:05:17 | US | Early Paleocene | NaN | Paleogene |
17347 | Magnoliopsida | North Dakota | 46.418560 | -103.976707 | fluvial indet. | 2012-06-22 03:05:17 | US | Early Paleocene | NaN | Paleogene |
17348 | Magnoliopsida | North Dakota | 46.418560 | -103.976707 | fluvial indet. | 2012-06-22 03:05:17 | US | Early Paleocene | NaN | Paleogene |
17349 | Magnoliopsida | North Dakota | 46.418560 | -103.976707 | fluvial indet. | 2012-06-22 03:06:36 | US | Early Paleocene | NaN | Paleogene |
17351 | Magnoliopsida | North Dakota | 46.418560 | -103.976707 | fluvial indet. | 2012-06-22 03:06:36 | US | Early Paleocene | NaN | Paleogene |
17743 | Quercus | South Carolina | 33.214443 | -80.448059 | fluvial indet. | 2013-10-05 11:34:55 | US | Irvingtonian | NaN | Quaternary |
17759 | Quercus | Florida | 27.336000 | -82.514702 | lagoonal | 2014-03-17 19:08:09 | US | Piacenzian | NaN | Neogene |
17792 | Dryophyllum | Montana | 45.888802 | -104.552696 | fine channel fill | 2015-08-31 14:14:45 | US | Late Maastrichtian | NaN | Cretaceous |
17837 | Acer | Virginia | 38.163334 | -76.831390 | marine indet. | 2017-07-20 15:47:00 | US | Langhian | NaN | Neogene |
17845 | Quercus | Virginia | 38.163334 | -76.831390 | marine indet. | 2017-07-20 15:47:00 | US | Langhian | NaN | Neogene |
17856 | Salix | Virginia | 38.163334 | -76.831390 | marine indet. | 2017-07-20 15:47:00 | US | Langhian | NaN | Neogene |
17860 | Acer | Virginia | 38.163334 | -76.831390 | marine indet. | 2017-07-20 15:52:45 | US | Langhian | NaN | Neogene |
17870 | Quercus | Virginia | 38.163334 | -76.831390 | marine indet. | 2017-07-20 15:52:46 | US | Langhian | NaN | Neogene |
17882 | Salix | Virginia | 38.163334 | -76.831390 | marine indet. | 2017-07-20 15:52:46 | US | Langhian | NaN | Neogene |
17886 | Acer | Virginia | 38.163334 | -76.831390 | marine indet. | 2017-07-20 15:58:39 | US | Serravallian | NaN | Neogene |
17898 | Quercus | Virginia | 38.163334 | -76.831390 | marine indet. | 2017-07-20 15:58:39 | US | Serravallian | NaN | Neogene |
17906 | Salix | Virginia | 38.163334 | -76.831390 | marine indet. | 2017-07-20 15:58:39 | US | Serravallian | NaN | Neogene |
17920 | Quercus | Virginia | 38.163334 | -76.831390 | marine indet. | 2017-07-20 16:02:25 | US | Tortonian | NaN | Neogene |
17927 | Salix | Virginia | 38.163334 | -76.831390 | marine indet. | 2017-07-20 16:02:25 | US | Tortonian | NaN | Neogene |
17930 | Acer | Virginia | 38.163334 | -76.831390 | marine indet. | 2017-07-20 16:04:50 | US | Messinian | NaN | Neogene |
17935 | Quercus | Virginia | 38.163334 | -76.831390 | marine indet. | 2017-07-20 16:04:50 | US | Messinian | NaN | Neogene |
17946 | Quercus | Virginia | 38.163334 | -76.831390 | fluvial indet. | 2017-07-20 16:07:43 | US | Calabrian | NaN | Quaternary |
17954 | Salix | Virginia | 38.163334 | -76.831390 | fluvial indet. | 2017-07-20 16:07:43 | US | Calabrian | NaN | Quaternary |
17959 | Dryophyllum | Montana | 45.889168 | -104.549721 | channel lag | 2017-09-26 05:10:55 | US | Late Maastrichtian | NaN | Cretaceous |
18001 | Ficus | California | 35.299999 | -118.500000 | terrestrial indet. | 2018-04-27 17:43:32 | US | Hemingfordian | Barstovian | Neogene |
18049 | Acer | Oregon | 44.301109 | -120.832222 | lacustrine indet. | 2018-06-04 01:16:17 | US | Early Oligocene | NaN | Paleogene |
18050 | Quercus | Oregon | 44.301109 | -120.832222 | lacustrine indet. | 2018-06-04 01:16:17 | US | Early Oligocene | NaN | Paleogene |
18079 | Ficus | California | 33.588631 | -117.705620 | shallow subtidal indet. | 2019-02-15 19:35:27 | US | Early Miocene | NaN | Neogene |
2140 rows × 10 columns
g = sns.FacetGrid(top_plants, col="interval", col_wrap=2, aspect=2)
# Count each of the top 10 plants within every era panel
g = g.map(sns.countplot, "accepted_name", order=top10_plants)
# Rotate the grid's own tick labels; `ax` belongs to the earlier bar chart
g.set_xticklabels(rotation=90)
plt.show()
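The same distribution can also be read as a table of counts, which is handy when the panels are hard to compare by eye. The sketch below uses `pd.crosstab` on a small made-up frame. The rows are invented for illustration and only the column names follow the notebook.

```python
import pandas as pd

# Toy stand-in for top_plants (made-up rows).
top_plants = pd.DataFrame({
    "accepted_name": ["Acer", "Acer", "Quercus", "Salix", "Quercus"],
    "interval": ["Paleogene", "Neogene", "Neogene", "Neogene", "Quaternary"],
})

# Rows are plant names, columns are eras, cells are record counts.
counts = pd.crosstab(top_plants["accepted_name"], top_plants["interval"])

print(counts)
```

Each column of `counts` corresponds to one panel of the facet grid, and combinations that never occur show up as 0.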
In Conclusion
In parts I and II, we imported data from the PBDB navigator and loaded it into a Jupyter notebook. We cleaned the data by renaming columns, selecting the columns of interest, and making the time scale of our entries coarser and easier to work with in the future. While we cleaned our data, we also performed some elementary exploratory analysis, such as looking at the most common type of fossil specimen, the state with the most fossils, and the dinosaur count by era. We discovered that the data we’re using contains not only dinosaur fossils but also fossils of their descendants, including some that are still living to this day!
Here are some key take-aways from this part of our exploration:
- Defining functions is key to making data cleaning easier. By defining our function change_geog(), we were able to easily rename our time scales in both our dinosaur data and our plant data.
- Exploratory analysis is crucial to understanding what kind of data you’re working with. After I initially chose to work with the Dinosauria on the PBDB navigator, I assumed I would be receiving traditional dinosaur fossils from the Mesozoic era. Exploratory analysis helped me discover that there were data points that occurred later in the geologic time scale.
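To illustrate the first takeaway, here is a minimal sketch of what a reusable time-scale helper could look like. It is not the notebook's change_geog(). The name `coarsen_interval`, the `PERIOD_OF` lookup, and the "Unknown" fallback are all hypothetical, and the lookup only lists a handful of the interval names that appear in the tables above.

```python
# Hypothetical lookup from fine-grained interval names to coarser periods.
# Only a few of the intervals seen in the tables above are listed here.
PERIOD_OF = {
    "Clarkforkian": "Paleogene",
    "Wasatchian": "Paleogene",
    "Early Oligocene": "Paleogene",
    "Early Miocene": "Neogene",
    "Langhian": "Neogene",
    "Late Maastrichtian": "Cretaceous",
    "Turonian": "Cretaceous",
}


def coarsen_interval(early_interval, default="Unknown"):
    """Return the coarse period for a fine-grained interval name."""
    return PERIOD_OF.get(early_interval, default)


print(coarsen_interval("Langhian"))
print(coarsen_interval("Not an interval"))
```

Because the logic lives in one function, the same call can be reused on any frame with an `early_interval` column, for example `df["interval"] = df["early_interval"].apply(coarsen_interval)`.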
Now that we’ve cleaned our data and understand what we’re working with, we’re ready to perform further analysis and visualization! Let’s export our cleaned data as an HDF5 file for further use.