pegax: Fast Taxonomic Name Parsing in R
rOpenSci / UC Berkeley
Nearly all biologists work with taxonomic names; taxonomic names are messy. Parsing taxonomic names into their components is a first step.
The Global Names project (http://globalnames.org/) is a huge advance. However, it is available via a web service (dependent on internet speed) and in programming languages (Java/Go) that don’t play nice with R.
Many biologists work in the R language. We need really fast, dependency-free R tooling to work with taxonomic names.
Learning from Global Names, pegax (https://github.com/ropenscilabs/pegax) is a new R package that implements a Parsing Expression Grammar (PEG) for taxonomic names.
Parsing Expression Grammars, or PEGs, are a formal way to describe rules for recognizing strings in text. For example, match one or more letters:
plus< alpha >
Combine rules to form a grammar, e.g., must match name, one comma, a space, then match numbers.
must< name, one< ',' >, space, numbers, eof >
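To make the sequencing idea concrete, here is a minimal base-R sketch of PEG-style matching. It is not how pegax works internally (pegax uses PEGTL in C++); the helpers one(), plus(), seq_of(), eof() and matches() are hypothetical names chosen to mirror the rules above, and space is simplified to a single literal space character.

```r
# Each parser takes a string `s` and a 1-based position `i`, and returns the
# position just after the match, or NULL if it fails (illustrative only).
char_at <- function(s, i) substr(s, i, i)

# Match exactly one given character.
one <- function(ch) function(s, i) {
  if (i <= nchar(s) && char_at(s, i) == ch) i + 1L else NULL
}

# Match one or more characters belonging to a regex character class.
plus <- function(pattern) function(s, i) {
  start <- i
  while (i <= nchar(s) && grepl(pattern, char_at(s, i))) i <- i + 1L
  if (i > start) i else NULL
}

# Match each parser in order; fail as soon as one fails.
seq_of <- function(...) {
  parsers <- list(...)
  function(s, i) {
    for (p in parsers) {
      i <- p(s, i)
      if (is.null(i)) return(NULL)
    }
    i
  }
}

# Succeed only at the end of the input.
eof <- function(s, i) if (i > nchar(s)) i else NULL

name    <- plus("[[:alpha:]]")
numbers <- plus("[[:digit:]]")
space   <- one(" ")

# Analogue of: must< name, one< ',' >, space, numbers, eof >
authority <- seq_of(name, one(","), space, numbers, eof)

matches <- function(s) !is.null(authority(s, 1L))

matches("Linnaeus, 1758")  # name, comma, space, numbers, end of input
matches("Linnaeus 1758")   # fails at the comma rule
```

Because every rule either consumes input or fails outright, the grammar reads top to bottom in the same order as the string it describes.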
Get it on GitHub: https://github.com/ropenscilabs/pegax
The PEG work is done in C++ using PEGTL (https://github.com/taocpp/PEGTL) via the piton (https://github.com/Ironholds/piton) R package.
library(pegax)
# Author name
pgx_authority_names("Linnaeus, 1758")
#> [1] "Linnaeus"
# Author year
pgx_authority_years("Linnaeus, 1758")
#> [1] "1758"
# Parse ranks
pgx_ranks("Fagus sylvatica subsp. orientalis")
#> [1] "subsp."
# Many ranks at once
pgx_ranks(c("Helianthus annuus var. annuus",
            "Helianthus annuus ssp. annuus",
            "Caulerpa cupressoides forma nuda"))
#> [1] "var." "ssp." "forma."
# Scientific names w/o authorities
nms <- c("Fagus sylvatica subsp. orientalis",
         "Potamogeton iilinoensis var. ventanicola",
         "Callideriphus flavicollis morph. reductus",
         "Chlorocyperus glaber form. fascicula",
         "Sphaerotheca fuliginea f. dahliae")
dplyr::bind_rows(lapply(nms, pgx_sciname))
#> genus          epithet      rank    infraspecific
#> Fagus          sylvatica    subsp.  orientalis
#> Potamogeton    iilinoensis  var.    ventanicola
#> Callideriphus  flavicollis  morph.  reductus
#> Chlorocyperus  glaber       form.   fascicula
#> Sphaerotheca   fuliginea    f.      dahliae
library(microbenchmark)
library(charlatan)
per <- charlatan::PersonProvider$new()
date <- charlatan::DateTimeProvider$new()
# 10,000 name/year strings, e.g., Lueilwitz, 1945
x <- replicate(10^4, paste(
  sample(per$person$last_names, 1),
  date$year(),
  sep = ", "))
# years
system.time(pgx_authority_years(x)) # 0.025 sec
# names
system.time(pgx_authority_names(x)) # 0.023 sec
contact: scott@ropensci.org
icons from FontAwesome v5.0.13 fontawesome.com