Open Science / Research w/ R featuring rOpenSci

Scott Chamberlain (@sckottie/@ropensci)

UC Berkeley / rOpenSci

Helmsley Foundation



open science/research

Open science as a lego set

open science may be hard to do

but - you can work on different components

and - individual components are worth learning

Open Data

(at least within your organization)

funders/journals often requiring this anyway

future self will thank you

Versioning: code/data/text

failure proofs your work

experiment freely!

makes collaboration easier

Do all work programmatically


Key to reproducibility

The most important person who wants to reproduce your work is you!

Do all work programmatically

you and yourself

- one week from now

- two months from now

- & so on

An example to shoot for

BAAD blog post 

important (higher level) scientific programming languages

R language

  • used widely in biology, psychology, medicine, etc.

  • rapidly growing user base, companies surrounding it

  • includes all tools for open science workflow

  • though work to be done ...

Open science ecosystem



rOpenSci Does


rOpenSci Staff

  • ~5 full time

  • leadership team

  • advisory board

Community stats

  • ~ 400 code contributors

  • ~ 490 GitHub repositories (most are R packages)

  • ~ 45,000 commits

  • ~ 160 published R packages on CRAN (another ~100 not on CRAN)

rOpenSci Unconference

Nominations (including self) close Mar. 8th

the research workflow

Data acquisition    

data manipulation/analysis/viz    




rOpenSci Tools

rOpenSci Software: some of the benefits

  • reduce redundant small software efforts

  • funnel effort into sustainable, well-maintained software (see lack of support for software MAINTENANCE in academia)

  • bring maintainers into a community

  • give otherwise isolated projects a louder voice

  • hopefully we make each piece of software more sustainable

but, software sustainability is hard

each panel is a package, each dot a person

rOpenSci software is used

within companies
in fun side projects
and more

here are some of the academic research uses

... usually found in methods section of papers

use case 1

Claypool, K., & Patel, C. J. (2018). A transcript-wide association study in physical activity intervention implicates molecular pathways in chronic disease. bioRxiv.

We used the rentrez R package to execute the query on GEO [Gene Expression Omnibus] ...

use case 2

Emer, C., et al. (2018). Seed-dispersal interactions in fragmented landscapes - a metanetwork approach. Ecology Letters 

We compiled 16 studies of BSD [bird seed dispersal]-interactions in fragments of the SE Brazilian Atlantic Forest ... We updated species names with taxize package (Chamberlain & Szocs 2013).

use case 3

Harsch, M. A., & HilleRisLambers, J. (2016). Climate Warming and Seasonal Precipitation Change Interact to Limit Species Distribution Shifts across Western North America. PLOS ONE. 

To fill in missing elevation records and correct elevation records ... we estimated altitude ... using the GNsrtm3 function within the geonames package ...

rOpenSci *omics Tools


  • taxa - Taxonomic classes and taxonomically aware data manipulation

  • taxize - Taxonomic "toolbelt" - work w/ taxonomy web APIs

  • taxizedb - taxize, but with local SQL databases

  • rentrez - NCBI's Entrez services

  • biomartr - Biomart R client

  • genbankr - Parse GenBank files into useful objects

  • rsnps - SNPs data retrieval

(although most omics R packages are in Bioconductor,
rOpenSci is open to submissions!)

Taxonomic IDs

always try to move from:

  • taxonomic name -- to

  • taxonomic ID -- to

  • whatever other data
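The name -> ID -> data pattern can be sketched in a few lines. This is a minimal illustration in Python, not how taxize or any other package named here does it: the in-memory lookup table stands in for a taxonomy web service, and `resolve_id`, `fetch_data`, and the placeholder data values are hypothetical. The two IDs are the NCBI Taxonomy IDs for those species.

```python
# Lookup table standing in for a taxonomy web service (assumption).
# 9606 and 10090 are the NCBI Taxonomy IDs for these two species.
NAME_TO_ID = {
    "homo sapiens": 9606,
    "mus musculus": 10090,
}

# Downstream data is keyed by ID, never by name (placeholder values).
DATA_BY_ID = {
    9606: {"common_name": "human"},
    10090: {"common_name": "house mouse"},
}


def resolve_id(name):
    """Normalize whitespace and case, then map the name to its ID."""
    key = " ".join(name.split()).lower()
    if key not in NAME_TO_ID:
        raise KeyError(f"no taxonomic ID found for name: {name!r}")
    return NAME_TO_ID[key]


def fetch_data(name):
    """name -> ID -> whatever other data."""
    return DATA_BY_ID[resolve_id(name)]
```

Messy name spellings are dealt with once, at the name -> ID step; everything after that joins on a stable identifier.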

Genomic Data Retrieval - biomartr Interfaces to:

Spatial tools


Geospatial: conversion between data/spatial data formats - geojsonio

  • geojson_list - convert to GeoJSON as R list

  • geojson_json - convert to GeoJSON as JSON

  • geojson_read/geojson_write - read/write GeoJSON

from most R object types + many spatial data formats
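To make the target format concrete, here is a rough Python sketch of turning tabular records into GeoJSON, first as a native structure and then as a JSON string (loosely mirroring the geojson_list / geojson_json split). It is not geojsonio's implementation; `to_geojson` and the column names are assumptions for illustration. Note that GeoJSON orders coordinates as longitude, then latitude.

```python
import json


def to_geojson(records, lon_key="longitude", lat_key="latitude"):
    """Build a GeoJSON FeatureCollection (as a dict) from point records."""
    features = []
    for rec in records:
        # Everything that is not a coordinate becomes a feature property.
        props = {k: v for k, v in rec.items() if k not in (lon_key, lat_key)}
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "Point",
                # GeoJSON coordinate order is [longitude, latitude].
                "coordinates": [rec[lon_key], rec[lat_key]],
            },
            "properties": props,
        })
    return {"type": "FeatureCollection", "features": features}


records = [{"name": "site A", "latitude": 37.87, "longitude": -122.27}]
as_dict = to_geojson(records)   # native structure
as_json = json.dumps(as_dict)   # serialized JSON string
```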

geojson workflow

we're trying for a GeoJSON workflow in R, w/o heavy dependencies like GDAL/GEOS - get in touch if you have any interest

Climate data tools

Climate data

NOAA climate data - rnoaa


  • Severe weather data

  • Sea ice data

  • NOAA buoy data

  • Tornadoes

  • HOMR - Historical Observing Metadata Repository

  • Storm data

  • GHCND FTP data

  • Global Ensemble Forecast System (GEFS) data

  • Extended Reconstructed Sea Surface Temperature (ERSST) data

  • Argo buoys data

  • NOAA CO-OPS - tides and currents data

  • NOAA Climate Prediction Center (CPC)

  • Africa Rainfall Climatology version 2

Wrapping web APIs

    Wrapping web APIs: high level concepts

    • Each pkg is a snowflake: every web API is different

    • Try to cater to both beginners and power users

    • Fail fast and fail well: APIs may not do it for you

    • Pass on curl options! empower your users to:

      • investigate http request problems

      • set proxy options (IT often blocks certain sites/ports)

      • and more
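One way to "pass on" HTTP options is for the wrapper to accept arbitrary extra arguments and forward them untouched to the HTTP layer. The sketch below is Python for illustration only; `get_records`, the example.org URL, and the option names are hypothetical, and `fake_transport` stands in for a real HTTP client so that the example runs without a network.

```python
def get_records(query, transport, **http_options):
    """Hypothetical API wrapper.

    The wrapper builds the request itself, but anything in http_options
    (timeouts, proxies, verbosity, ...) is handed to the transport as-is.
    """
    url = "https://api.example.org/records"  # placeholder URL
    return transport(url, params={"q": query}, **http_options)


def fake_transport(url, params, **options):
    """Stand-in for an HTTP client: echoes back what it was given."""
    return {"url": url, "params": params, "options": options}


result = get_records(
    "Puma concolor",
    fake_transport,
    timeout=5,
    proxies={"https": "http://proxy.example.org:8080"},
    verbose=True,
)
```

The wrapper never needs to know which options exist, so users can debug requests or route through a proxy without waiting for the package to add a parameter.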

    Defensive programming

    • Fail fast

    • Defend against many things

    • Give users good errors

    Check out my defensive programming chapter
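A small sketch of what failing fast with good errors might look like: validate inputs before any request is sent, and say exactly what was wrong. Python is used for illustration; `search`, `check_limit`, and the 1 to 1000 range are made-up examples, not taken from any package or chapter mentioned here.

```python
def check_limit(limit):
    """Reject bad 'limit' values before any network call happens."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(limit, bool) or not isinstance(limit, int):
        raise TypeError(
            f"'limit' must be an integer, got {type(limit).__name__}"
        )
    if not 1 <= limit <= 1000:
        raise ValueError(f"'limit' must be between 1 and 1000, got {limit}")
    return limit


def search(name, limit=10):
    """Hypothetical search function: validate first, request second."""
    if not isinstance(name, str) or not name.strip():
        raise ValueError("'name' must be a non-empty string")
    limit = check_limit(limit)
    # Only after validation would a real client send the request.
    return {"name": name.strip(), "limit": limit}
```

The API on the other end may accept a bad value silently or return an opaque error, so the client is the best place to explain the problem.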

    Example pkg wrapping web API

    ritis: client for ITIS taxonomic data

    ritis: notes/thoughts

    • imports: solrium, crul, jsonlite, data.table, tibble

    • package API: fxns for REST API and Solr API

    • a possible downside of this package: a lot of functions

    • return tibbles from all functions

    • but raw JSON/XML output for those that want it

    • Solr queries handled by solrium package

    Combining many sources into one package

    Many into one considerations

    • Is it really a good idea?

    • Inputs:

      • What parameters can be unified across sources?

      • Allow users to fiddle with source-specific options

      • Fail consistently across sources if possible

    • Outputs:

      • What, if any, outputs can be combined?

    Many into one e.g.: spocc

    • All 10 sources share common input: taxonomic names

    • Pagination is similar-ish across sources (requires some source-specific variable mapping)

    • Geospatial search: accept WKT and bounding boxes, then map to what each source requires

    • Most can toggle whether to return records that have coordinates or not

    • Outputs: combine the minimum set of similar fields
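The many-into-one idea can be sketched with two made-up sources: one common set of inputs (name, limit, start), a per-source adapter that maps those onto whatever the source calls them, and an output reduced to the fields both sources share. This is Python for illustration, not spocc's code; every function, parameter, field name, and coordinate below is hypothetical.

```python
# Two fake sources with different parameter names, paging schemes,
# and output field names.
def source_a(scientificName, limit, offset):
    rows = [{"scientificName": scientificName,
             "decimalLongitude": -120.5, "decimalLatitude": 38.1}]
    return rows[offset:offset + limit]


def source_b(taxon, per_page, page):
    rows = [{"taxon": taxon, "lng": 2.3, "lat": 48.9}]
    return rows if page == 1 else []


# Adapters: common inputs in, minimum common fields out.
def query_a(name, limit, start):
    rows = source_a(scientificName=name, limit=limit, offset=start)
    return [{"name": r["scientificName"],
             "longitude": r["decimalLongitude"],
             "latitude": r["decimalLatitude"]} for r in rows]


def query_b(name, limit, start):
    # This source pages by page number rather than by record offset.
    rows = source_b(taxon=name, per_page=limit, page=start // limit + 1)
    return [{"name": r["taxon"],
             "longitude": r["lng"],
             "latitude": r["lat"]} for r in rows]


ADAPTERS = {"a": query_a, "b": query_b}


def occ(name, sources=("a", "b"), limit=10, start=0):
    """Query every requested source and combine the shared fields."""
    unknown = sorted(set(sources) - set(ADAPTERS))
    if unknown:
        # Fail the same way no matter which source is misspelled.
        raise ValueError(f"unknown source(s): {', '.join(unknown)}")
    combined = []
    for src in sources:
        for row in ADAPTERS[src](name, limit, start):
            combined.append({**row, "source": src})
    return combined
```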

    Software Review

    rOpenSci Software Review

    • R package maintainer submits to ropensci/onboarding

    • Editors determine whether the package is a fit

    • Editors assign reviewers

    • Reviewers have ~ 3 weeks

    • Reviewers and maintainer go back and forth refining pkg

    • After approval, pkg moved to rOpenSci

    • A number of e.g.'s of pkgs from government agencies (including Canada)

    rOpenSci Software Review

    • Completely open source tools

    • Free to run

    • All reviews/conversations in the open

    • Reviews are/can be linked to code changes

    • Paired with journal submission: JOSS and MEE

    rOpenSci Onboarding

    not sure?

    pre-submission inquiry!

    check out prior pre-submission inquiries

    Bioconductor Does Open Review too!

    talk to us

    what would you like to see?

    what open data is too hard to get?

    discussion forum:

    submit a package/review a package:

    Made w/: reveal.js v3.2.0

    Some Styling: Bootstrap v3.3.5

    Icons by: FontAwesome v4.4.0