taxadb
What’s in a name?

!!Caution!! The information about taxadb in this post appears to be out of date and needs updating.
Setting the scene
During a recent project using the Global Invasive Species Database (GISD), I encountered several issues that are common when working with taxonomic databases. Having searched the GISD and imported the data into R, I noticed that some species had missing values (“NA”) for one or more variables. With fewer than 10 species affected, the missing values were not difficult to determine manually, but what if I had been processing a list of several hundred species? How would I approach the problem of taxonomic verification in a time-saving, reproducible manner?
The second issue was that species in my list of invasive herpetofauna had undergone taxonomic reassignment, either at species, genus or even family level. How should a biologist deal with this issue when communicating the results of an analysis or data visualisation such as mine? These issues are the topic of this post.
Reality check
In an ideal world, the GISD (or whichever database you prefer) is updated as soon as taxonomic changes are accepted, and the search results can be treated as authoritative. Realistically, for any database with a vast number of taxa to consider, it is unreasonable to expect the taxonomy of all species to be up-to-date and error-free. As I discovered in my invasive herpetofauna investigation, for some species the GISD was simultaneously ahead of and behind the other databases regarding taxonomic assignments and classifications.
For example, for Norops grahami the GISD appears to accept the generic reassignment from Anolis, while the Reptile Database notes the change on its N. grahami page but has not yet reassigned the species to Norops in its database. For the same species though, the Reptile Database has the family assignment specified as Dactyloidae whereas the GISD still assigns it to the Polychrotidae. So the GISD is ahead for the genus, but behind for the family.
Deciding what to do in situations like this is always tricky, but a simple solution is to present these uncertainties to the readers of your research/communication. Thankfully there is an R package that can be used to query the world’s major taxonomic databases from the comfort of your command line.
Introducing {taxadb}
We load the taxadb package and the tidyverse for managing the post-query manipulations of the dataframes. I used the kableExtra package to format the presentation of my tabulated taxonomic data.
library(taxadb)
library(tidyverse)
library(kableExtra)
{taxadb}
The taxadb package installs a taxonomic database of your choice on your workstation. This local database is installed from taxadb, which is periodically updated from the relevant online database APIs. For a thorough understanding of the data sources used by taxadb, I encourage you to read the documentation found at the rOpenSci taxadb page. The TL;DR is that you should not simply merge information from two different taxonomic data sources, as explained in this paragraph:
“Please Note: taxadb advises against uncritically combining data from multiple providers. The same name is frequently used by different providers to mean different things – some providers consider two names synonyms that other providers consider distinct species. It is crucial to recognize that taxonomic name providers represent independent taxonomic theories, and not merely additional observations of the same immutable reality (Franz & Sterner (2018)). You cannot just merge two databases of taxonomic names like you can two databases of, say, plant traits to get a bigger and more complete sample, because the former can contain meaningful contradictions.”
The data sources used by taxadb are updated on an annual basis, and this can be checked using the available_versions function. The first element "2019" tells us the last time the data sources were updated by taxadb. The "dwc" indicates that all data sources are formatted according to the Darwin Core standard.
available_versions()
## [1] "2019" "dwc"
A quick comparison of herpetofaunal species in GISD and ITIS
In a related project, I queried the GISD database to ascertain the names
of all herpetofaunal species that have established non-native or
invasive populations to date. Let’s import the file downloaded from the
GISD and compare it to the herpetofauna listed in the ITIS database.
First we import the .csv datafile and do some data preparation. Second we create a local ITIS database using td_create. Third we extract all amphibian and reptile species from the ITIS database. Lastly we join the information in the two tibbles using a left_join, where we tell the function that Species in GISD is the same variable as scientificName in ITIS. After joining the two tibbles, I decided to select just the variables that I am interested in.
# Import the GISD export and split the species name into its components
GISD_herp <- read_delim("amrep_gisd_Feb2020.csv",
                        trim_ws = TRUE,
                        delim = ";") %>%
  select(-X8) %>%
  separate(Species,
           c("Genus",
             "Specific_Epithet",
             "Infraspecific_Epithet"),
           sep = " ",
           remove = FALSE)

# Create the local ITIS database
td_create("itis")

# Extract all amphibians and reptiles from ITIS
database <- filter_rank(c("Amphibia", "Reptilia"), "class")

# Join GISD to ITIS on the species name and keep the variables of interest
db_check <- GISD_herp %>%
  left_join(database, by = c("Species" = "scientificName")) %>%
  select(species_GISD = Species,
         vernacularName_ITIS = vernacularName,
         order_GISD = Order,
         order_ITIS = order,
         family_GISD = Family,
         family_ITIS = family,
         taxonomicStatus_ITIS = taxonomicStatus,
         acceptedNameUsageID)

# Show ten example rows and highlight the two that need a closer look
db_check %>% select(-vernacularName_ITIS) %>%
  slice(c(1,5,6,16,20,23,28,30,31,43)) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped",
                                      "hover")) %>%
  column_spec(column = 1,
              italic = TRUE) %>%
  row_spec(row = c(5,9),
           background = "Dodgerblue",
           color = "white")
| species_GISD | order_GISD | order_ITIS | family_GISD | family_ITIS | taxonomicStatus_ITIS | acceptedNameUsageID |
|---|---|---|---|---|---|---|
| Anolis aeneus | Squamata | Squamata | Polychrotidae | Dactyloidae | accepted | ITIS:1056079 |
| Anolis equestris | NA | Squamata | Polychrotidae | Dactyloidae | accepted | ITIS:173891 |
| Anolis extremus | NA | Squamata | Polychrotidae | Dactyloidae | accepted | ITIS:1056181 |
| Boiga irregularis | Squamata | Squamata | Colubridae | Colubridae | accepted | ITIS:174206 |
| Elaphe guttata | Squamata | Squamata | Colubridae | Colubridae | synonym | ITIS:1081818 |
| Eleutherodactylus planirostris | Anura | Anura | Leptodactylidae | Eleutherodactylidae | accepted | ITIS:173568 |
| Lithobates catesbeianus | Anura | Anura | Ranidae | Ranidae | accepted | ITIS:775084 |
| Natrix maura | Squamata | Squamata | Colubridae | Colubridae | accepted | ITIS:700797 |
| Norops grahami | Squamata | NA | Polychrotidae | NA | NA | NA |
| Xenopus laevis | Anura | Anura | Pipidae | Pipidae | accepted | ITIS:173549 |
[Aside: Unfortunately, taxonomic data and their dataframes don’t make for very pretty data visualisations. I apologise for the ‘wall of text’ feeling in this post, but I hope that you find this worked example of taxadb worth the eye-strain.]
The table above shows just 10 of the 43 species but gives you a feel for the information I have extracted from ITIS. The two rows highlighted blue indicate examples of species that need additional consideration. Elaphe guttata is listed as a synonym, so we need to find the new, accepted name for the species, and Norops grahami is the species I mentioned earlier and appears to be missing from the ITIS database.
We are interested in the species in the GISD that have no match in the ITIS database. Species that have no match will have missing values for all the *_ITIS variables, so I chose taxonomicStatus_ITIS as the variable to filter on. The output shows us that there are four species in the GISD that have no match in the ITIS database.
db_check %>%
  filter(is.na(taxonomicStatus_ITIS)) %>%
  select(-vernacularName_ITIS,
         -acceptedNameUsageID) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped",
                                      "hover")) %>%
  column_spec(column = 1,
              italic = TRUE,
              width = "6cm")
| species_GISD | order_GISD | order_ITIS | family_GISD | family_ITIS | taxonomicStatus_ITIS |
|---|---|---|---|---|---|
| Anolis wattsi | NA | NA | Polychrotidae | NA | NA |
| Boa constrictor imperator | Squamata | NA | Boidae | NA | NA |
| Norops grahami | Squamata | NA | Polychrotidae | NA | NA |
| Trachemys scripta elegans | Testudines | NA | Emydidae | NA | NA |
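The same list of unmatched names can probably be obtained more directly with an anti-join, which keeps only the rows of the first table that have no partner in the second. The sketch below is a minimal illustration using two tiny stand-in tibbles (gisd_toy and itis_toy are hypothetical names and contain only a few species taken from the tables in this post); it does not query taxadb itself.

```r
library(dplyr)

# Toy stand-in for the GISD species list (hypothetical subset)
gisd_toy <- tibble(
  Species = c("Boiga irregularis", "Norops grahami", "Xenopus laevis")
)

# Toy stand-in for the ITIS table (hypothetical subset)
itis_toy <- tibble(
  scientificName  = c("Boiga irregularis", "Xenopus laevis"),
  taxonomicStatus = c("accepted", "accepted")
)

# anti_join() returns the GISD rows whose name has no match in ITIS
unmatched <- gisd_toy %>%
  anti_join(itis_toy, by = c("Species" = "scientificName"))

unmatched
```

Compared with filtering a left join on a missing *_ITIS column, this approach does not depend on choosing a column that is never NA for matched rows.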
I am also interested in names that are synonyms and not currently the accepted scientific name for the species. Any species identified by a synonym in the GISD merits further investigation to determine if the most recent taxonomic assignment gave due consideration to the status of the alien/invasive population.
Using the code below, I compare the name from our GISD list with the accepted name in the ITIS database. Comparing the two names shows the level at which the reassignment occurred. A reassignment at genus level might need less attention than a change in the specific epithet. The work doesn’t end when you identify differences using this comparison, but it does give you a useful pointer.
db_check %>%
  filter(taxonomicStatus_ITIS == "synonym") %>%
  select(-vernacularName_ITIS) %>%
  left_join(database, by = "acceptedNameUsageID") %>%
  filter(taxonomicStatus == "accepted") %>%
  select(species_GISD,
         acceptedName_ITIS = scientificName,
         acceptedNameUsageID) %>%
  mutate(acceptedNameUsageID = cell_spec(acceptedNameUsageID,
                                         "html",
                                         background = "Lightgreen",
                                         color = "white",
                                         bold = TRUE)) %>%
  kable("html",
        escape = FALSE) %>%
  kable_styling(bootstrap_options = c("striped",
                                      "hover")) %>%
  column_spec(column = c(1,2),
              italic = TRUE)
| species_GISD | acceptedName_ITIS | acceptedNameUsageID |
|---|---|---|
| Chamaeleo jacksonii | Trioceros jacksonii | ITIS:1055685 |
| Elaphe guttata | Pantherophis guttatus | ITIS:1081818 |
| Litoria aurea | Ranoidea aurea | ITIS:1099285 |
| Norops sagrei | Anolis sagrei | ITIS:173903 |
| Ramphotyphlops braminus | Indotyphlops braminus | ITIS:1116297 |
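The level at which each reassignment occurred can be made explicit by splitting both names into genus and epithet and comparing the parts. The following is a rough sketch that re-types the five name pairs from the table above into a toy tibble (synonyms_toy and the *_old/*_new column names are my own, for illustration only) and assumes every name is a plain two-word binomial.

```r
library(dplyr)
library(tidyr)

# The five GISD name / accepted ITIS name pairs from the table above
synonyms_toy <- tibble(
  species_GISD      = c("Chamaeleo jacksonii", "Elaphe guttata",
                        "Litoria aurea", "Norops sagrei",
                        "Ramphotyphlops braminus"),
  acceptedName_ITIS = c("Trioceros jacksonii", "Pantherophis guttatus",
                        "Ranoidea aurea", "Anolis sagrei",
                        "Indotyphlops braminus")
)

# Split each binomial and flag which part differs between the two names
reassignment <- synonyms_toy %>%
  separate(species_GISD, c("genus_old", "epithet_old"),
           sep = " ", remove = FALSE) %>%
  separate(acceptedName_ITIS, c("genus_new", "epithet_new"),
           sep = " ", remove = FALSE) %>%
  mutate(genus_changed   = genus_old != genus_new,
         epithet_changed = epithet_old != epithet_new) %>%
  select(species_GISD, acceptedName_ITIS, genus_changed, epithet_changed)

reassignment
```

Note that a literal string comparison also flags epithets that only changed their ending to agree with the new genus (guttata versus guttatus), so a flagged epithet still needs a manual look before you treat it as a different species-level name.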
The acceptedNameUsageID number is a way for us to back-reference the species to the ITIS database and extract all the synonyms for a single species. Below I demonstrate the process for Elaphe guttata (Eastern Corn Snake), which shows us that the ITIS database recognises Pantherophis guttatus as the accepted name and the three other classifications as synonyms. You can also see that the acceptedNameUsageID is the same for all names of this species while the taxonID is different for each.
database %>%
  filter(acceptedNameUsageID == "ITIS:1081818") %>%
  select(scientificName,
         taxonRank,
         taxonomicStatus,
         acceptedNameUsageID,
         taxonID) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped",
                                      "hover")) %>%
  column_spec(column = 1,
              italic = TRUE,
              width = "5cm")
| scientificName | taxonRank | taxonomicStatus | acceptedNameUsageID | taxonID |
|---|---|---|---|---|
| Elaphe guttata | species | synonym | ITIS:1081818 | ITIS:174175 |
| Elaphe guttata guttata | subspecies | synonym | ITIS:1081818 | ITIS:174176 |
| Coluber guttatus | species | synonym | ITIS:1081818 | ITIS:209204 |
| Pantherophis guttatus | species | accepted | ITIS:1081818 | ITIS:1081818 |
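In the output above, the accepted name is the one row whose taxonID equals its acceptedNameUsageID. Assuming that pattern holds more generally (I have only checked it against this table), it offers a quick way to pick out the accepted record and count its synonyms. The sketch below re-types the four rows into a toy tibble called names_toy, a hypothetical name, rather than querying the database.

```r
library(dplyr)

# The four ITIS records for the corn snake, copied from the table above
names_toy <- tibble(
  scientificName      = c("Elaphe guttata", "Elaphe guttata guttata",
                          "Coluber guttatus", "Pantherophis guttatus"),
  taxonomicStatus     = c("synonym", "synonym", "synonym", "accepted"),
  acceptedNameUsageID = "ITIS:1081818",
  taxonID             = c("ITIS:174175", "ITIS:174176",
                          "ITIS:209204", "ITIS:1081818")
)

# The accepted record points at itself: taxonID == acceptedNameUsageID
accepted <- names_toy %>%
  filter(taxonID == acceptedNameUsageID)

# Count how many synonyms share each accepted ID
synonym_counts <- names_toy %>%
  group_by(acceptedNameUsageID) %>%
  summarise(n_synonyms = sum(taxonomicStatus == "synonym"))

accepted
synonym_counts
```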
Out with the old…
The last demo I want to do with taxadb is compare a list of species names from a reptile survey in 2006 with the current taxonomy of the species. I know for a fact that several names have changed over the past 14 years, so let’s see how easy it is to determine the new accepted species names.
data_2006 <- read_lines("rep_survey_2006.txt")

species_2006 <- filter_name(data_2006) %>%
  select(reptiles_2006 = input,
         acceptedNameUsageID,
         Genus_ITIS = genus,
         specificEpithet_ITIS = specificEpithet,
         taxonomicStatus_ITIS = taxonomicStatus,
         vernacularName_ITIS = vernacularName)
The output of our query provides food for thought. Five of the 20 species find no matches in the ITIS database. Eleven names are still the accepted name for the species concerned and two names are synonyms. We could use our code from above to get the accepted name, but we see that the filter_name function returns a genus and specificEpithet variable from the ITIS database. In the case of synonyms, these two variables hold the accepted name for the species.
species_2006 %>%
  filter(taxonomicStatus_ITIS == "synonym") %>%
  select(-vernacularName_ITIS) %>%
  mutate(Genus_ITIS = cell_spec(Genus_ITIS,
                                "html",
                                background = "darkviolet",
                                color = "white",
                                bold = TRUE,
                                italic = TRUE),
         specificEpithet_ITIS = cell_spec(specificEpithet_ITIS,
                                          "html",
                                          background = "pink",
                                          color = "white",
                                          bold = TRUE,
                                          italic = TRUE)) %>%
  kable("html",
        escape = FALSE) %>%
  kable_styling(bootstrap_options = c("striped",
                                      "hover")) %>%
  column_spec(column = 1,
              italic = TRUE)
| reptiles_2006 | acceptedNameUsageID | Genus_ITIS | specificEpithet_ITIS | taxonomicStatus_ITIS |
|---|---|---|---|---|
| Lamprophis capensis | ITIS:1082810 | Boaedon | capensis | synonym |
| Typhlops bibronii | ITIS:1116090 | Afrotyphlops | bibronii | synonym |
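Because the genus and specific epithet columns hold the accepted name for synonyms, the updated binomial can be assembled by pasting the two together. A minimal sketch, using a toy tibble (synonyms_2006 is a hypothetical name) that re-types the two rows from the table above:

```r
library(dplyr)

# The two synonym rows from the 2006 survey, copied from the table above
synonyms_2006 <- tibble(
  reptiles_2006        = c("Lamprophis capensis", "Typhlops bibronii"),
  Genus_ITIS           = c("Boaedon", "Afrotyphlops"),
  specificEpithet_ITIS = c("capensis", "bibronii")
)

# Combine genus and epithet into the currently accepted binomial
renamed <- synonyms_2006 %>%
  mutate(acceptedName_ITIS = paste(Genus_ITIS, specificEpithet_ITIS)) %>%
  select(reptiles_2006, acceptedName_ITIS)

renamed
```

This produces a simple old-name to new-name lookup that could be joined back onto the original survey data.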
Two of the seven species for which we have no genus or specific epithet matches do have ITIS ID numbers, but they have missing taxonomic hierarchy information. While we could use their ID numbers to get further information, here I will use the fuzzy_filter function to search for matching names at different hierarchies. For Agama aculeata distanti, Agama atra atra and Bitis arietans arietans, we see that taxonomic information is present when we search at the species level but missing at the subspecies level. For Agama atra, the ITIS database recognises no subspecies, so our trinomial designation finds no ID match.
# Rebuild a binomial for every name that returned no genus from ITIS
input <- species_2006 %>%
  filter(is.na(Genus_ITIS)) %>%
  separate(reptiles_2006, c("Genus", "Specific_Epithet", "Subspecific_Epithet"),
           sep = " ",
           remove = FALSE) %>%
  unite(c(Genus, Specific_Epithet),
        col = "binomial",
        sep = " ",
        remove = FALSE)

# Search ITIS for any name that contains one of the binomials
fuzzy_filter(c(input$binomial), match = "contains") %>%
  select(taxonID,
         scientificName,
         taxonRank,
         taxonomicStatus,
         class) %>%
  mutate(class = if_else(class == "Reptilia",
                         cell_spec(class, "html",
                                   background = "Dodgerblue",
                                   color = "white",
                                   bold = TRUE),
                         class),
         taxonRank = if_else(taxonRank == "subspecies",
                             cell_spec(taxonRank, "html",
                                       background = "green",
                                       color = "white",
                                       bold = TRUE),
                             taxonRank)) %>%
  arrange(scientificName) %>%
  kable("html",
        escape = FALSE) %>%
  kable_styling(bootstrap_options = c("striped",
                                      "hover")) %>%
  column_spec(column = 2,
              italic = TRUE)
| taxonID | scientificName | taxonRank | taxonomicStatus | class |
|---|---|---|---|---|
| ITIS:1055456 | Agama aculeata | species | accepted | Reptilia |
| ITIS:1056979 | Agama aculeata aculeata | subspecies | accepted | NA |
| ITIS:1056978 | Agama aculeata distanti | subspecies | accepted | NA |
| ITIS:1055460 | Agama atra | species | accepted | Reptilia |
| ITIS:634949 | Bitis arietans | species | accepted | Reptilia |
| ITIS:635232 | Bitis arietans arietans | subspecies | accepted | NA |
| ITIS:635233 | Bitis arietans somalica | subspecies | accepted | NA |
Using the input object created above we could search for fuzzy matches of the genus name or specific epithet to get a more detailed understanding of the taxonomic situation in each case. It is interesting to see the outputs; just change input$binomial to input$Genus etc. You will see that matches from any class are returned, so you can improve the returned table using a call such as filter(class == "Reptilia"), but remember that you will also lose all rows where class is “NA” this way.
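One way to narrow the results to reptiles without discarding the records that lack a class is to keep the NA rows explicitly. The sketch below uses a small toy tibble (matches_toy is a hypothetical name); three rows are copied from the table above and the fourth, "Hypothetica agama" in class Insecta, is an invented placeholder standing in for a non-reptile fuzzy match.

```r
library(dplyr)

# Toy fuzzy-match result; the last row is an invented non-reptile placeholder
matches_toy <- tibble(
  scientificName = c("Agama aculeata", "Agama aculeata distanti",
                     "Bitis arietans", "Hypothetica agama"),
  class          = c("Reptilia", NA, "Reptilia", "Insecta")
)

# A plain equality filter silently drops rows where class is NA
reptiles_only <- matches_toy %>%
  filter(class == "Reptilia")

# Keeping NA explicitly retains the subspecies records with no class
reptiles_or_unknown <- matches_toy %>%
  filter(class == "Reptilia" | is.na(class))

reptiles_only
reptiles_or_unknown
```

The second filter still removes the non-reptile match but keeps the subspecies row, which can then be checked by hand.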
So much more than a name…
Taxonomic assignment within the ‘Tree of Life’ is a never-ending process of hypothesis generation and revision. The technology for genomic sequencing and analysis has been available for more than 40 years, and yet phylogenetic revisions of reptiles and amphibians are still being published annually. Processing these revisions places a heavy burden on database managers, who carry out an often thankless task with great dedication. The net result is that each database varies from the others in unpredictable ways. If this post achieves anything, I hope it gives you a deep appreciation for the fact that we are learning new facts about the interrelatedness of all organisms with every phylogenetic analysis conducted. Secondly, I want to highlight the incredible work being done by all taxonomic database managers in their efforts to curate the relevant taxonomic changes (read: taxonomic hypotheses) on an ongoing basis. Their work makes my biological research infinitely easier. A huge thank you to you all!
The last and very big “Thank you” goes to the developers of the taxadb package: Carl Boettiger (Author, maintainer); Kari Norman (Author); Jorrit Poelen (Author); Scott Chamberlain (Author); Noam Ross (Contributor). I am always blown away by the R community and its collaborative, open-source practices. The rOpenSci project is a brilliant example of this philosophy. Thank you so much!
Appendix
The files and code used in this blog post can be found in this GitHub repository. If you have any comments, feedback or cool taxadb tips, message me on Twitter or via email through the links below.