taxadb

What’s in a name?

!!Caution!! The information about taxadb in this post appears to be out of date and needs updating.

Setting the scene

During a recent project using the Global Invasive Species Database (GISD), I encountered several issues that are common when working with taxonomic databases. Having searched the GISD and imported the data into R, I noticed that some species had missing values (“NA”) for one or more variables. With fewer than 10 species affected, the missing values were not difficult to fill in manually, but what if I had been processing a list of several hundred species? How would I approach the problem of taxonomic verification in a time-saving, reproducible manner?

The second issue was that species in my list of invasive herpetofauna had undergone taxonomic reassignment, either at species, genus or even family level. How should a biologist deal with this issue when communicating the results of an analysis or data visualisation such as mine? These issues are the topic of this post.

Reality check

In an ideal world, the GISD (or whichever database you prefer) is updated as soon as taxonomic changes are accepted, and the search results can be treated as authoritative. Realistically, for any database with a vast number of taxa to consider, it is unreasonable to expect the taxonomy of all species to be up-to-date and error-free. As I discovered in my invasive herpetofauna investigation, for some species the GISD was simultaneously ahead of and behind the other databases regarding taxonomic assignments and classifications.

For example, for Norops grahami the GISD appears to accept the generic reassignment from Anolis, while the Reptile Database notes the change on its N. grahami page but has not yet reassigned the species to Norops in its database. For the same species, though, the Reptile Database gives the family as Dactyloidae whereas the GISD still assigns it to the Polychrotidae. So the GISD is ahead for the genus, but behind for the family.

Deciding what to do in situations like this is always tricky, but a simple solution is to present these uncertainties to the readers of your research/communication. Thankfully there is an R package that can be used to query the world’s major taxonomic databases from the comfort of your command line.

Introducing {taxadb}

We load the taxadb package and the tidyverse for managing the post-query manipulations of the dataframes. I used the kableExtra package to format the presentation of my tabulated taxonomic data.

library(taxadb)
library(tidyverse)
library(kableExtra)

{taxadb}

The taxadb package installs a taxonomic database of your choice on your workstation. This local database is installed from taxadb, which is periodically updated from the relevant online database APIs. For a thorough understanding of the data sources used by taxadb, I encourage you to read the documentation found at the rOpenSci taxadb page. The TL;DR is that you should not simply merge information from two different taxonomic data sources, as explained in this paragraph:

Please Note: taxadb advises against uncritically combining data from multiple providers. The same name is frequently used by different providers to mean different things – some providers consider two names synonyms that other providers consider distinct species. It is crucial to recognize that taxonomic name providers represent independent taxonomic theories, and not merely additional observations of the same immutable reality (Franz & Sterner (2018)). You cannot just merge two databases of taxonomic names like you can two databases of, say, plant traits to get a bigger and more complete sample, because the former can contain meaningful contradictions.

The data sources used by taxadb are updated on an annual basis, and this can be checked using the available_versions function. The first element "2019" tells us the last time the data sources were updated by taxadb. The "dwc" indicates that all data sources are formatted according to the Darwin Core standard.

available_versions()
## [1] "2019" "dwc"

A quick comparison of herpetofaunal species in GISD and ITIS

In a related project, I queried the GISD database to ascertain the names of all herpetofaunal species that have established non-native or invasive populations to date. Let’s import the file downloaded from the GISD and compare it to the herpetofauna listed in the ITIS database. First we import the .csv data file and do some data preparation.

Second we create a local ITIS database using td_create. Third we extract all amphibian and reptile species from the ITIS database. Lastly we join the information in the two tibbles using a left_join where we tell the function that Species in GISD is the same variable as scientificName in ITIS. After joining the two tibbles, I decided to select just the variables that I am interested in.

GISD_herp <- read_delim("amrep_gisd_Feb2020.csv", 
                         trim_ws = TRUE, 
                         delim = ";") %>%
              select(-X8) %>% 
              separate(Species, 
                       c("Genus", 
                         "Specific_Epithet", 
                         "Infraspecific_Epithet"), 
                       sep = " ", 
                       remove = FALSE) 

td_create("itis")

database <- filter_rank(c("Amphibia", "Reptilia"), "class")

db_check <- GISD_herp %>% 
                left_join(database, by = c("Species" = "scientificName")) %>%
                select(species_GISD = Species,
                       vernacularName_ITIS = vernacularName,
                       order_GISD = Order, 
                       order_ITIS = order,
                       family_GISD = Family, 
                       family_ITIS = family,
                       taxonomicStatus_ITIS = taxonomicStatus,
                       acceptedNameUsageID) 
                
db_check %>% select(-vernacularName_ITIS) %>%
             slice(c(1,5,6,16,20,23,28,30,31,43)) %>% 
             kable() %>% 
             kable_styling(bootstrap_options = c("striped", 
                                                 "hover")) %>% 
             column_spec(column = 1, 
                         italic = TRUE) %>% 
             row_spec(row = c(5,9), 
                      background = "Dodgerblue", 
                      color = "white")

| species_GISD | order_GISD | order_ITIS | family_GISD | family_ITIS | taxonomicStatus_ITIS | acceptedNameUsageID |
|---|---|---|---|---|---|---|
| Anolis aeneus | Squamata | Squamata | Polychrotidae | Dactyloidae | accepted | ITIS:1056079 |
| Anolis equestris | NA | Squamata | Polychrotidae | Dactyloidae | accepted | ITIS:173891 |
| Anolis extremus | NA | Squamata | Polychrotidae | Dactyloidae | accepted | ITIS:1056181 |
| Boiga irregularis | Squamata | Squamata | Colubridae | Colubridae | accepted | ITIS:174206 |
| Elaphe guttata | Squamata | Squamata | Colubridae | Colubridae | synonym | ITIS:1081818 |
| Eleutherodactylus planirostris | Anura | Anura | Leptodactylidae | Eleutherodactylidae | accepted | ITIS:173568 |
| Lithobates catesbeianus | Anura | Anura | Ranidae | Ranidae | accepted | ITIS:775084 |
| Natrix maura | Squamata | Squamata | Colubridae | Colubridae | accepted | ITIS:700797 |
| Norops grahami | Squamata | NA | Polychrotidae | NA | NA | NA |
| Xenopus laevis | Anura | Anura | Pipidae | Pipidae | accepted | ITIS:173549 |

[Aside: Unfortunately, taxonomic data and their dataframes don’t make for very pretty data visualisations. I apologise for the ‘wall of text’ feeling in this post, but I hope that you find this worked example of taxadb worth the eye-strain.]

The table above shows just 10 of the 43 species but gives you a feel for the information I have extracted from ITIS. The two rows highlighted blue indicate examples of species that need additional consideration. Elaphe guttata is listed as a synonym, so we need to find the new, accepted name for the species, and Norops grahami is the species I mentioned earlier and appears to be missing from the ITIS database.

We are interested in the species in the GISD that have no match in the ITIS database. Species that have no match will have missing values for all the *_ITIS variables, so filtering on any one of them will find these species; I chose taxonomicStatus_ITIS. The output shows us that there are four species in the GISD that have no match in the ITIS database.

db_check %>%
  filter(is.na(taxonomicStatus_ITIS)) %>% 
  select(-vernacularName_ITIS, 
         -acceptedNameUsageID) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", 
                                      "hover")) %>% 
  column_spec(column = 1, 
              italic = TRUE, 
              width = "6cm")

| species_GISD | order_GISD | order_ITIS | family_GISD | family_ITIS | taxonomicStatus_ITIS |
|---|---|---|---|---|---|
| Anolis wattsi | NA | NA | Polychrotidae | NA | NA |
| Boa constrictor imperator | Squamata | NA | Boidae | NA | NA |
| Norops grahami | Squamata | NA | Polychrotidae | NA | NA |
| Trachemys scripta elegans | Testudines | NA | Emydidae | NA | NA |
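To make the join-then-filter logic easier to see without installing a local database, here is a minimal sketch in base R. The two small data frames are toy stand-ins for the GISD file and the ITIS table (only a handful of values copied from the tables above), and base R's merge() is swapped in for left_join() so the snippet has no package dependencies.

```r
# Toy stand-ins for the two tables; only a few values from the post are used.
gisd <- data.frame(
  Species = c("Boiga irregularis", "Norops grahami", "Xenopus laevis"),
  Family  = c("Colubridae", "Polychrotidae", "Pipidae"),
  stringsAsFactors = FALSE
)
itis <- data.frame(
  scientificName  = c("Boiga irregularis", "Xenopus laevis"),
  family          = c("Colubridae", "Pipidae"),
  taxonomicStatus = c("accepted", "accepted"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every GISD row, which makes this a left join.
joined <- merge(gisd, itis,
                by.x = "Species", by.y = "scientificName",
                all.x = TRUE)

# Names with no ITIS match carry NA in every column that came from ITIS,
# so testing any one of those columns is enough to find them.
unmatched <- joined$Species[is.na(joined$taxonomicStatus)]
print(unmatched)
```

The same idea carries over to the tidyverse version used in this post: the left join never drops a GISD species, so an NA in an ITIS column is a reliable flag for "not found under this exact name".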

I am also interested in names that are synonyms and not currently the accepted scientific name for the species. Any species identified by a synonym in the GISD merits further investigation to determine if the most recent taxonomic assignment gave due consideration to the status of the alien/invasive population.

Using the code below, I compare the name from our GISD list with the accepted name in the ITIS database. The code allows us to determine the level at which the reassignment occurred. A reassignment at genus level might mean that less attention is needed as compared to a change in the specific epithet. The work doesn’t end when you identify differences using this comparison, but it does give you a useful pointer.

db_check %>%
  filter(taxonomicStatus_ITIS == "synonym") %>% 
  select(-vernacularName_ITIS) %>% 
  left_join(database, by = "acceptedNameUsageID") %>% 
  filter(taxonomicStatus == "accepted") %>% 
  select(species_GISD,
         acceptedName_ITIS = scientificName,
         acceptedNameUsageID) %>% 
  mutate(acceptedNameUsageID = cell_spec(acceptedNameUsageID,
                                         "html",
                                         background = "Lightgreen",
                                         color = "white",
                                         bold = TRUE)) %>% 
  kable("html",
        escape = FALSE) %>% 
  kable_styling(bootstrap_options = c("striped", 
                                      "hover")) %>% 
  column_spec(column = c(1,2), 
              italic = TRUE)

| species_GISD | acceptedName_ITIS | acceptedNameUsageID |
|---|---|---|
| Chamaeleo jacksonii | Trioceros jacksonii | ITIS:1055685 |
| Elaphe guttata | Pantherophis guttatus | ITIS:1081818 |
| Litoria aurea | Ranoidea aurea | ITIS:1099285 |
| Norops sagrei | Anolis sagrei | ITIS:173903 |
| Ramphotyphlops braminus | Indotyphlops braminus | ITIS:1116297 |
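If you would rather have the level of reassignment flagged for you than read it off the table, something like the following rough sketch could work. It uses three name pairs copied from the table above; word_n() is a hypothetical helper written for this illustration, not a taxadb function.

```r
# Name pairs copied from the synonym table above.
pairs <- data.frame(
  species_GISD      = c("Chamaeleo jacksonii", "Elaphe guttata", "Norops sagrei"),
  acceptedName_ITIS = c("Trioceros jacksonii", "Pantherophis guttatus", "Anolis sagrei"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the n-th space-separated word of each name.
word_n <- function(x, n) {
  vapply(strsplit(x, " ", fixed = TRUE), function(p) p[n], character(1))
}

genus_changed   <- word_n(pairs$species_GISD, 1) != word_n(pairs$acceptedName_ITIS, 1)
epithet_changed <- word_n(pairs$species_GISD, 2) != word_n(pairs$acceptedName_ITIS, 2)

# Label each pair by the part(s) of the binomial that differ.
pairs$change <- ifelse(genus_changed & epithet_changed, "genus and epithet",
                ifelse(genus_changed, "genus only",
                ifelse(epithet_changed, "epithet only", "none")))
print(pairs)
```

A plain string comparison is deliberately naive. It flags Elaphe guttata to Pantherophis guttatus as an epithet change, although the different ending appears to be nothing more than the epithet agreeing in gender with the new genus. Treat the label as a prompt to look closer, not as a verdict.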

The acceptedNameUsageID number is a way for us to cross-reference the species back to the ITIS database and extract all the synonyms for a single species. Below I demonstrate the process for Elaphe guttata (Eastern Corn Snake), which shows us that the ITIS database recognises Pantherophis guttatus as the accepted name and the three other names as synonyms. You can also see that the acceptedNameUsageID is the same for all names of this species while the taxonID is different for each.

database %>%   
  filter(acceptedNameUsageID == "ITIS:1081818") %>% 
  select(scientificName,
         taxonRank,
         taxonomicStatus,
         acceptedNameUsageID,
         taxonID) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped",
                                      "hover")) %>% 
  column_spec(column = 1, 
              italic = TRUE, 
              width = "5cm")

| scientificName | taxonRank | taxonomicStatus | acceptedNameUsageID | taxonID |
|---|---|---|---|---|
| Elaphe guttata | species | synonym | ITIS:1081818 | ITIS:174175 |
| Elaphe guttata guttata | subspecies | synonym | ITIS:1081818 | ITIS:174176 |
| Coluber guttatus | species | synonym | ITIS:1081818 | ITIS:209204 |
| Pantherophis guttatus | species | accepted | ITIS:1081818 | ITIS:1081818 |
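The relationship between the two ID columns can be shown with a small base R sketch that rebuilds the table above by hand. It assumes, as the table suggests, that the accepted record is the one whose taxonID equals its acceptedNameUsageID; check that this holds before relying on it for another provider.

```r
# The four ITIS records for the corn snake, copied from the table above.
names_tbl <- data.frame(
  scientificName      = c("Elaphe guttata", "Elaphe guttata guttata",
                          "Coluber guttatus", "Pantherophis guttatus"),
  taxonomicStatus     = c("synonym", "synonym", "synonym", "accepted"),
  acceptedNameUsageID = rep("ITIS:1081818", 4),
  taxonID             = c("ITIS:174175", "ITIS:174176",
                          "ITIS:209204", "ITIS:1081818"),
  stringsAsFactors = FALSE
)

# Assumption: a record is the accepted one when its own ID is the accepted ID.
is_accepted   <- names_tbl$taxonID == names_tbl$acceptedNameUsageID
accepted_name <- names_tbl$scientificName[is_accepted]
synonyms      <- names_tbl$scientificName[!is_accepted]

# Sanity check: the ID-based test agrees with the taxonomicStatus column.
agrees <- all(is_accepted == (names_tbl$taxonomicStatus == "accepted"))

print(accepted_name)
print(synonyms)
```

Comparing the ID-based flag against taxonomicStatus is a cheap consistency check; a mismatch would suggest a record whose status and IDs disagree.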

Out with the old…

The last demo I want to do with taxadb is compare a list of species names from a reptile survey in 2006 with the current taxonomy of the species. I know for a fact that several names have changed over the past 14 years, so let’s see how easy it is to determine the new accepted species names.

data_2006 <- read_lines("rep_survey_2006.txt")

species_2006 <- filter_name(data_2006) %>% 
                select(reptiles_2006 = input,
                       acceptedNameUsageID,
                       Genus_ITIS = genus,
                       specificEpithet_ITIS = specificEpithet,
                       taxonomicStatus_ITIS = taxonomicStatus,
                       vernacularName_ITIS = vernacularName)

The output of our query provides food for thought. Five of the 20 species find no matches in the ITIS database. Eleven names are still the accepted name for the species concerned and two names are synonyms. We could use our code from above to get the accepted name, but we see that the filter_name function returns a genus and specificEpithet variable from the ITIS database. In the case of synonyms, these two variables hold the accepted name for the species.

species_2006 %>% 
  filter(taxonomicStatus_ITIS == "synonym") %>% 
  select(-vernacularName_ITIS) %>% 
  mutate(Genus_ITIS = cell_spec(Genus_ITIS,
                                "html", 
                                background = "darkviolet", 
                                color = "white", 
                                bold = TRUE,
                                italic = TRUE),
         specificEpithet_ITIS = cell_spec(specificEpithet_ITIS,
                                          "html", 
                                          background = "pink", 
                                          color = "white", 
                                          bold = TRUE,
                                          italic = TRUE)) %>%
  kable("html", 
        escape = FALSE) %>% 
  kable_styling(bootstrap_options = c("striped", 
                                      "hover")) %>% 
  column_spec(column = 1, 
              italic = TRUE)

| reptiles_2006 | acceptedNameUsageID | Genus_ITIS | specificEpithet_ITIS | taxonomicStatus_ITIS |
|---|---|---|---|---|
| Lamprophis capensis | ITIS:1082810 | Boaedon | capensis | synonym |
| Typhlops bibronii | ITIS:1116090 | Afrotyphlops | bibronii | synonym |
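Since the genus and specificEpithet columns hold the accepted name for synonyms, the accepted binomial can be rebuilt by pasting the two together. Here is a minimal base R sketch using the two rows above; the acceptedName column is a name introduced for this example.

```r
# The two synonym rows from the table above, rebuilt by hand.
synonyms_2006 <- data.frame(
  reptiles_2006        = c("Lamprophis capensis", "Typhlops bibronii"),
  Genus_ITIS           = c("Boaedon", "Afrotyphlops"),
  specificEpithet_ITIS = c("capensis", "bibronii"),
  taxonomicStatus_ITIS = c("synonym", "synonym"),
  stringsAsFactors = FALSE
)

# paste() joins genus and epithet with a single space by default.
synonyms_2006$acceptedName <- paste(synonyms_2006$Genus_ITIS,
                                    synonyms_2006$specificEpithet_ITIS)

print(synonyms_2006[, c("reptiles_2006", "acceptedName")])
```

Note that paste() turns a missing value into the literal string "NA", so this shortcut is only safe on rows where both columns are filled in.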

Two of the seven species for which we have no genus or specific epithet matches do have ITIS ID numbers but they have missing taxonomic hierarchy information. While we could use their ID numbers to get further information, here I will use the fuzzy_filter function to search for matching names at different hierarchies. For Agama aculeata distanti and Bitis arietans arietans, we see that the taxonomic information is present when we search at the species level but is missing at the subspecies level. For Agama atra atra, the ITIS database recognises no subspecies of Agama atra, so our trinomial designation finds no ID match.

input <- species_2006 %>%  
          filter(is.na(Genus_ITIS)) %>% 
          separate(reptiles_2006, c("Genus", "Specific_Epithet", "Subspecific_Epithet"),
                               sep = " ", 
                               remove = FALSE) %>% 
          unite(c(Genus, Specific_Epithet), 
                col = "binomial",
                sep = " ",
                remove = FALSE)

fuzzy_filter(c(input$binomial), match = "contains") %>% 
  select(taxonID,
         scientificName,
         taxonRank,
         taxonomicStatus,
         class) %>%
  mutate(class = if_else(class == "Reptilia",
                          cell_spec(class, "html", 
                                    background = "Dodgerblue", 
                                    color = "white", 
                                    bold = TRUE),
                          class),
          taxonRank = if_else(taxonRank == "subspecies",
                                cell_spec(taxonRank, "html", 
                                          background = "green", 
                                          color = "white", 
                                          bold = TRUE),
                                taxonRank)) %>% 
  arrange(scientificName) %>% 
  kable("html", 
        escape = FALSE) %>% 
  kable_styling(bootstrap_options = c("striped", 
                                      "hover")) %>% 
  column_spec(column = 2, 
              italic = TRUE)

| taxonID | scientificName | taxonRank | taxonomicStatus | class |
|---|---|---|---|---|
| ITIS:1055456 | Agama aculeata | species | accepted | Reptilia |
| ITIS:1056979 | Agama aculeata aculeata | subspecies | accepted | NA |
| ITIS:1056978 | Agama aculeata distanti | subspecies | accepted | NA |
| ITIS:1055460 | Agama atra | species | accepted | Reptilia |
| ITIS:634949 | Bitis arietans | species | accepted | Reptilia |
| ITIS:635232 | Bitis arietans arietans | subspecies | accepted | NA |
| ITIS:635233 | Bitis arietans somalica | subspecies | accepted | NA |

Using the input object created above, we could search for fuzzy matches of the genus name or specific epithet to get a more detailed understanding of the taxonomic situation in each case. To see those outputs, just change input$binomial to input$Genus, and so on. You will see that matches from any class are returned, so you can tidy the returned table using a call such as filter(class == "Reptilia"), but remember that this also drops every row where class is “NA”, which here includes the subspecies records.
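The effect of missing class values on that filter is easy to miss, so here is a small base R sketch of it. The first four rows come from the fuzzy-match table above; the last row is a made-up non-reptile hit, added only so that there is something to filter out. Base R's subset() is used because, like dplyr's filter(), it treats an NA condition as "drop this row".

```r
# Four rows from the fuzzy-match table plus one hypothetical non-reptile hit.
hits <- data.frame(
  scientificName = c("Agama aculeata", "Agama aculeata distanti",
                     "Bitis arietans", "Bitis arietans arietans",
                     "Hypothetica exempli"),   # invented name, illustration only
  class          = c("Reptilia", NA, "Reptilia", NA, "Insecta"),
  stringsAsFactors = FALSE
)

# Strict filter: rows with a missing class are silently dropped.
strict <- subset(hits, class == "Reptilia")

# NA-tolerant filter: keep reptiles and rows whose class is unknown.
tolerant <- subset(hits, is.na(class) | class == "Reptilia")

# Reduce each name to its binomial (first two words) for species-level lookups.
binomial <- vapply(strsplit(hits$scientificName, " ", fixed = TRUE),
                   function(p) paste(p[1:2], collapse = " "),
                   character(1))

print(strict$scientificName)
print(tolerant$scientificName)
print(unique(binomial))
```

The tolerant version keeps the subspecies records at the cost of also keeping any non-reptile with a missing class, so the output may still need a manual check.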

So much more than a name…

Taxonomic assignment within the ‘Tree of Life’ is a neverending process of hypothesis generation and revision. The technology for genomic sequencing and analysis has been available for more than 40 years, and yet phylogenetic revisions of reptiles and amphibians are being published annually. Processing these revisions places a heavy burden on database managers and they do an often thankless task with great dedication. The nett result is that each database varies from others in unpredictable ways. If this post achieves anything, I hope it gives you a deep appreciation for the fact that we are learning new facts about the interrelatedness of all organisms with every phylogenetic analysis conducted. Secondly, I want to highlight the incredible work being done by all taxonomic database managers in their efforts to curate the relevant taxonomic changes (read: taxonomic hypotheses) on an ongoing basis. Their work makes my biological research infinitely easier. A huge thank you to you all!

The last and very big “Thank you” goes to the developers of the taxadb package - Carl Boettiger (Author, maintainer); Kari Norman (Author); Jorrit Poelen (Author); Scott Chamberlain (Author); Noam Ross (Contributor). I am always blown away by the R community and its collaborative, open-source practices. The rOpenSci project is a brilliant example of this philosophy. Thank you so much!

Appendix

The files and code used in this blog post can be found in this GitHub repository. If you have any comments, feedback or cool taxadb tips - message me on Twitter or via email through the links below.

Gavin Masterson, PhD

My interests include snakes, lizards, guitar, swimming and exploring either the outdoors or data.
