Biodiversity & Taxonomy Software Tools @ rOpenSci

Scott Chamberlain (@sckottie)

UC Berkeley / rOpenSci

Broad areas of packages

  • Taxonomy

  • Occurrence data

  • Environmental data


Questions addressed w/ our software

Use cases: taxize

  • classify species invasive or not
  • software uses taxize to check user-supplied names against ITIS
  • check names against TPL, EOL, COL, IUCN, uBIO
  • get name data for NCBI sequence data
  • validate genus names for a food web
  • compiled dataset of tropical forest tree species names checked w/ TNRS
  • add taxonomic classification data to meta-analysis dataset

Use cases: rgbif

  • occurrence records to construct niche models
  • collect occurrence records for catfishes in a Brazilian river
  • occurrence records of Acacia species in Australia through time
  • assessing niche expansion of invasive plants with occurrence records
  • small note in manuscript about a species being in a study area

Use cases: rfishbase

  • collect fish life history traits
  • extract fecundity data for four fish species
  • group species into trophic guilds using trophic position
  • fetch salinity associated traits for many fish species
  • acquire depth ranges for many species to determine a phylogenetic signal

Use cases: rentrez

  • search PubMed for mentions of phrases
  • demonstrate rentrez use to search NCBI for articles in institutional repositories
  • fetch NCBI taxonomic information for sequence data
  • use NCBI's Gene Expression Omnibus service
  • extract citations (presumably from PubMed) using rentrez

Use cases: spocc

  • use GBIF data to explore genome size variation against many variables
  • use GBIF data to construct species range and niche centroids
  • use GBIF, VertNet, BISON, Ecoengine, iNaturalist data to construct species niche models
  • use VertNet and iNaturalist data to identify most vulnerable populations for snakebites
  • use GBIF data via zoon in malariaAtlas R pkg
  • use GBIF and iDigBio data to construct future species ranges

Use cases: rnoaa

  • fetch sea surface temperature (SST): check if latitude/SST explains variation in body size
  • use many variables from ISD to predict airplane flight time
  • use climate data to identify opportunities for stream restoration
  • estimate interannual climatic variability in urban areas
  • government report on precipitation

Follow @rocitations for use cases

Citations spreadsheet: ropenscilabs/ropensci_citations => citations.tsv

Use cases spreadsheet: ropenscilabs/ropensci_citations => use_cases.tsv


  • Taxonomic data: e.g., Poa annua ssp. annua

  • There are lots of taxonomic data sources ...

  • Data sources have diff. data ...

  • & diff. IDs for the same names ...

  • Taxonomy data often not the goal ...

  • But rather a necessary piece of research ...
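The "different IDs for the same names" problem can be sketched in a few lines. This is a minimal illustration, not how taxize or any rOpenSci package works: the source labels, the IDs, and the helper names `normalize` and `ids_by_name` are all hypothetical, and the matching rule (whitespace and capitalization only) is deliberately naive.

```python
from collections import defaultdict

# Hypothetical records: (data source, source-specific ID, name as given).
# The sources and IDs are placeholders, not real database identifiers.
RECORDS = [
    ("source_a", "A-001", "Poa annua"),
    ("source_b", "B-917", "poa  annua"),
    ("source_c", "C-42", "Poa annua ssp. annua"),
]


def normalize(name: str) -> str:
    """Collapse whitespace; capitalize the genus, lowercase the rest."""
    parts = name.split()
    if not parts:
        return ""
    return " ".join([parts[0].capitalize()] + [p.lower() for p in parts[1:]])


def ids_by_name(records):
    """Group each source's ID under the normalized name it refers to."""
    grouped = defaultdict(dict)
    for source, source_id, name in records:
        grouped[normalize(name)][source] = source_id
    return dict(grouped)


crosswalk = ids_by_name(RECORDS)
for name, ids in crosswalk.items():
    print(name, ids)
```

The subspecies record stays separate from the species record here; deciding whether the two should be merged is exactly the kind of judgment that differs between data sources.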


  • taxa - Taxonomic classes and taxonomically aware data manipulation

  • taxize - taxonomy data from many data sources

  • taxizedb - taxize, but with local SQL databases

  • rentrez - NCBI data, including taxonomy data

  • worrms - WoRMS marine data, including taxonomy

  • ritis - USGS's ITIS taxonomy

  • ... many others

R taxonomy task view:

Occurrence data

  • Occurrence data: e.g.:

    key          scientificName             decimalLatitude  decimalLongitude
    1986611609   Encelia californica Nutt.  32.66866         -117.0856

  • Many data sources ...

  • Data from collected/observed specimens ...

  • Most feed (all or part of their data) into GBIF ...

Occurrence data

  • rgbif - occurrence data from GBIF's > 1 billion records

  • rbison - occurrence data from USGS's BISON

  • rvertnet - occurrence data from VertNet (vertebrates)

  • rebird & auk - occurrence data from eBird

  • spocc - occurrence data from many sources, single interface

  • finch - parse GBIF bulk data (Darwin Core)

  • CoordinateCleaner & scrubr - clean occurrence data
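To give a sense of what "clean occurrence data" can mean, here is a minimal sketch of a few common coordinate checks. It is not the CoordinateCleaner or scrubr API; the function name `is_plausible` and the three checks chosen are assumptions for illustration. The first record is the example row from the earlier slide; the others are made up to trigger each check.

```python
def is_plausible(record: dict) -> bool:
    """Return True if the record's coordinates pass basic sanity checks."""
    lat = record.get("decimalLatitude")
    lon = record.get("decimalLongitude")
    # Missing coordinates cannot be mapped at all.
    if lat is None or lon is None:
        return False
    # Coordinates must fall inside the valid latitude/longitude ranges.
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return False
    # (0, 0) is a common placeholder for "unknown location".
    if lat == 0 and lon == 0:
        return False
    return True


records = [
    # Example row from the slide above.
    {"key": 1986611609, "scientificName": "Encelia californica Nutt.",
     "decimalLatitude": 32.66866, "decimalLongitude": -117.0856},
    # Made-up rows, one per failure mode.
    {"key": 1, "scientificName": "Example species",
     "decimalLatitude": None, "decimalLongitude": 10.0},
    {"key": 2, "scientificName": "Example species",
     "decimalLatitude": 0.0, "decimalLongitude": 0.0},
    {"key": 3, "scientificName": "Example species",
     "decimalLatitude": 95.0, "decimalLongitude": 10.0},
]

clean = [r for r in records if is_plausible(r)]
print([r["key"] for r in clean])
```

Real cleaning goes well beyond this (country centroids, institution locations, duplicates, outliers), which is part of why it is a hard problem.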

Environmental data

  • Environmental data: e.g.:

    station   variable           value
    12334     precipitation_mm   34

  • Tons of data sources ...

  • Hierarchy of data: raw to various levels of cleaned/etc. ...

  • Types: precipitation, temperature, wind speed, humidity, snowfall, etc.
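The station/variable/value layout above is a "long" format, and one small step up the raw-to-cleaned hierarchy is summarising it per station. The sketch below assumes that layout; the helper name `summarize`, the choice of the mean as the summary, and every row except the first are made up for illustration and do not reflect any particular package.

```python
from collections import defaultdict
from statistics import mean

# (station, variable, value) rows in long format.
observations = [
    ("12334", "precipitation_mm", 34.0),  # row shown on the slide
    ("12334", "precipitation_mm", 10.0),  # made-up extra reading
    ("12334", "temperature_c", 18.5),     # made-up extra variable
]


def summarize(observations):
    """Return {station: {variable: mean value}} from long-format rows."""
    values = defaultdict(list)
    for station, variable, value in observations:
        values[(station, variable)].append(value)
    wide = defaultdict(dict)
    for (station, variable), vals in values.items():
        wide[station][variable] = mean(vals)
    return dict(wide)


summary = summarize(observations)
print(summary)
```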

Environmental data

  • rnoaa - download climate data (prec, temp, wind, storms, snow, etc.) from NOAA

  • rerddap - NOAA ERDDAP servers

  • riem - Iowa Environment Mesonet data

  • weathercan - Environment & Climate Change Canada

  • bomrang - Australian gov't Bureau of Meteorology

  • GSODR - Global Surface Summary of the Day

  • nasapower - NASA POWER data

  • hydroscoper - Greek national hydrological & meteorological data

future work / hard problems

taxonomy tools: in the works

  • taxonomic name parsing: we need fast & platform-independent name parsing (gnparser/pegax)

  • taxizedb - hard to a) make similar interface to SQL DBs as web services & b) simplify varied database installs

    • see also: taxadb & Kari Norman's part
  • taxview - summarise and visualize data sets WRT taxonomy
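To show what name parsing involves, here is a deliberately naive sketch that splits a scientific name into parts. It is not gnparser or pegax, and the function name `parse_name`, the rank-marker list, and the token rules are all assumptions; real names (hybrids, multiple authors, parentheses, years) break rules this simple, which is why dedicated parsers are needed.

```python
# Rank markers recognised by this sketch; real parsers handle many more.
RANK_MARKERS = {"ssp.", "subsp.", "var.", "f."}


def parse_name(name: str) -> dict:
    """Split a name into genus, species, infraspecific part, and authorship."""
    tokens = name.split()
    parsed = {"genus": None, "species": None, "rank": None,
              "infraspecies": None, "authorship": None}
    if not tokens:
        return parsed
    parsed["genus"] = tokens[0]
    i = 1
    # A lowercase token that is not a rank marker is taken as the epithet.
    if i < len(tokens) and tokens[i].islower() and tokens[i] not in RANK_MARKERS:
        parsed["species"] = tokens[i]
        i += 1
    # A rank marker must be followed by the infraspecific epithet.
    if i + 1 < len(tokens) and tokens[i] in RANK_MARKERS:
        parsed["rank"] = tokens[i]
        parsed["infraspecies"] = tokens[i + 1]
        i += 2
    # Whatever remains is treated as authorship.
    parsed["authorship"] = " ".join(tokens[i:]) or None
    return parsed


# The two names that appear on earlier slides.
print(parse_name("Poa annua ssp. annua"))
print(parse_name("Encelia californica Nutt."))
```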

occurrence tools: in the works

  • taking the pain out of GBIF downloads: ropensci/rgbif#266: queuing tool for GBIF downloads - would love any feedback

  • auk integration into spocc

  • occurrence cleaning in R: hard problem! A few efforts:

where to find out more

talk to us

discussion forum: