pegax: Fast Taxonomic Name Parsing in R

Scott Chamberlain
rOpenSci / UC Berkeley


Nearly all biologists work with taxonomic names, and taxonomic names are messy. Parsing names into their components is a first step in working with them.

The Global Names project (http://globalnames.org/) is a huge advance. However, Global Names is available via a web service (dependent on internet speed) and in programming languages (Java/Go) that don’t play nicely with R.

Many biologists work in the R language. We need really fast, dependency-free R tooling to work with taxonomic names.

Learning from Global Names, pegax (https://github.com/ropenscilabs/pegax) is a new R package that implements a Parsing Expression Grammar (PEG) for taxonomic names.


Parsing Expression Grammars

Parsing Expression Grammars, or PEGs, are a formal way to describe how to recognize strings in text. For example, match one or more letters:

plus< alpha >

Rules combine to form a grammar, e.g., the input must match a name, one comma, a space, numbers, and then the end of the input:

must< name, one< ',' >, space, numbers, eof >
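How rules like these compose can be sketched in plain C++ without any library. This is a minimal illustration, not PEGTL and not pegax code: the namespace peg_sketch, the helpers letters_consumed and is_record, and the definitions of name, numbers, and space are all assumptions made for the example. The sketch uses seq where the rule above uses must; in PEGTL, must also turns a failed match into a hard error, which is left out here.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hand-rolled stand-ins for PEG rules. This is NOT PEGTL or pegax code.
namespace peg_sketch {

// Every rule exposes match(): try to consume input at pos, advance pos on
// success, and report whether the rule matched.
struct alpha {  // exactly one ASCII letter
    static bool match(const std::string& in, std::size_t& pos) {
        if (pos < in.size() && std::isalpha(static_cast<unsigned char>(in[pos]))) {
            ++pos;
            return true;
        }
        return false;
    }
};

struct digit {  // exactly one ASCII digit
    static bool match(const std::string& in, std::size_t& pos) {
        if (pos < in.size() && std::isdigit(static_cast<unsigned char>(in[pos]))) {
            ++pos;
            return true;
        }
        return false;
    }
};

template <char C>
struct one {  // exactly the character C
    static bool match(const std::string& in, std::size_t& pos) {
        if (pos < in.size() && in[pos] == C) {
            ++pos;
            return true;
        }
        return false;
    }
};

struct eof {  // succeeds only at the end of the input, consumes nothing
    static bool match(const std::string& in, std::size_t& pos) {
        return pos == in.size();
    }
};

template <typename Rule>
struct plus {  // one or more repetitions of Rule
    static bool match(const std::string& in, std::size_t& pos) {
        if (!Rule::match(in, pos)) return false;
        while (Rule::match(in, pos)) {
        }
        return true;
    }
};

template <typename... Rules>
struct seq;  // all rules in order; restores pos if any rule fails

template <>
struct seq<> {
    static bool match(const std::string&, std::size_t&) { return true; }
};

template <typename First, typename... Rest>
struct seq<First, Rest...> {
    static bool match(const std::string& in, std::size_t& pos) {
        const std::size_t start = pos;
        if (First::match(in, pos) && seq<Rest...>::match(in, pos)) return true;
        pos = start;
        return false;
    }
};

// Assumed definitions for the names used in the grammar above.
using name = plus<alpha>;
using numbers = plus<digit>;
using space = one<' '>;
using record = seq<name, one<','>, space, numbers, eof>;

// Hypothetical helper: how many leading characters plus<alpha> consumes.
inline std::size_t letters_consumed(const std::string& text) {
    std::size_t pos = 0;
    return plus<alpha>::match(text, pos) ? pos : 0;
}

// Hypothetical helper: does the whole input match the record grammar?
inline bool is_record(const std::string& text) {
    std::size_t pos = 0;
    return record::match(text, pos);
}

}  // namespace peg_sketch
```

With these definitions, peg_sketch::is_record("Linnaeus, 1758") succeeds, while "Linnaeus 1758" fails at the comma rule and "Linnaeus, 1758 " fails at eof. The sketch handles ASCII letters and digits only.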


pegax package

Get it on GitHub: https://github.com/ropenscilabs/pegax

The PEG work is done in C++ using PEGTL (https://github.com/taocpp/PEGTL) via the piton (https://github.com/Ironholds/piton) R package.

Future work

  • Unicode support
  • Full name parsing (names, ranks, annotations, authorities)
  • R DSL to make custom PEGs for taxonomy


Use cases

  • Researchers and others use the package directly to parse their own sets of names
  • R developers import it to leverage name parsing for domain-specific use cases
  • If/when a DSL for custom name parsing exists: cover all corner cases


contact: scott@ropensci.org

icons from FontAwesome v5.0.13 fontawesome.com