Scott Chamberlain

@recology_ & @ropensci


Science is often not reproducible or repeatable, even within the same lab group over time.

Increasing attention in the press...

...more in the press...

...and a recent initiative to address the problem...

What do we need?

Open Science

Open data + code

Source: Wolkovich et al. GCB 2012.

But, scientists don't want to share :(


But, this =>

Source: PLOS, 2007

R + Open Science

Open Science needs open source tools

Source: Revolution Analytics, 2010, Nature editorial, 2012

The old way...

The R way, e.g.,...

Make an API call

library(RCurl); library(RJSONIO)
dat <- fromJSON(getURL(""))

Manipulate the data

library(plyr); library(reshape2)
dat_df <- ldply(dat, function(x) data.frame(x[names(x) %in% c("name", "watchers_count", 
    "forks", "open_issues")]))
dat_melt <- melt(dat_df)
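To make the manipulation step concrete without a live API call, here is a minimal base-R sketch of the same idea: keep a few fields from each parsed repository record, then reshape to long format. The two records (`pkgA`, `pkgB`) and their values are made up for illustration, and `stack()` stands in for `melt()`.

```r
# Made-up records shaped like parsed GitHub repo entries (illustration only)
repos <- list(
  list(name = "pkgA", watchers_count = 10, forks = 2, open_issues = 1, language = "R"),
  list(name = "pkgB", watchers_count = 25, forks = 7, open_issues = 4, language = "R")
)

keep    <- c("name", "watchers_count", "forks", "open_issues")
measure <- setdiff(keep, "name")

# One row per repository, keeping only the fields of interest
dat_df <- do.call(rbind, lapply(repos, function(x) {
  as.data.frame(x[names(x) %in% keep], stringsAsFactors = FALSE)
}))

# Long format: one row per (repository, variable) pair
stacked  <- stack(dat_df[, measure])
dat_long <- data.frame(
  name     = rep(dat_df$name, times = length(measure)),
  variable = stacked$ind,
  value    = stacked$values,
  stringsAsFactors = FALSE
)
dat_long
```

The long table has the same `name`, `variable`, `value` layout that the model and plot steps expect.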

Run some statistical model

lm(value ~ variable, data = dat_melt)

Visualize results

library(ggplot2)
ggplot(dat_melt, aes(name, value, colour = variable)) + geom_point() + coord_flip()

This is reproducible, repeatable, and can serve as an analytic workflow.

Why R...

Wrapping all science APIs

Development team

Carl Boettiger

Karthik Ram

Scott Chamberlain

Edmund Hart

Advisory team

Duncan Temple Lang

Hadley Wickham

JJ Allaire

Matt Jones

rOpenSci packages

Data packages

Literature packages

Hybrid packages

Some stats on rOpenSci packages

Data via GitHub API, run in R function via

Data sources (click to go)

Literature sources

Coming when APIs ready/we get around to it...


Umbrella packages

Abstract away the detail of each package to search across data sources on:

Getting data from the web into R

Authentication: R and APIs - API keys

API keys can be stored in a user's .Rprofile file

Click here for help on .Rprofile files

options(MendeleyKey = "uf5daib7wyil7ag5buc")
options(MendeleyPrivateKey = "faj2os5dyd7jop2fok6")
options(PlosApiKey = "ef3vip9yak7od3hud4g")
options(SpringerMetdataKey = "ri9hi7woc6jax4vaf8w")

Note: These keys aren't real.
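A minimal sketch of how a wrapper function might read such a key back at call time. The helper name `get_api_key` is hypothetical (it is not part of any rOpenSci package); it only relies on base R's `getOption()`.

```r
# Hypothetical helper: fetch an API key that was set with options(),
# and fail with a readable message if it is missing.
get_api_key <- function(name) {
  key <- getOption(name)
  if (is.null(key) || !nzchar(key)) {
    stop("No API key found for '", name,
         "'. Set it in your .Rprofile with options(", name, " = \"...\").",
         call. = FALSE)
  }
  key
}

# Same fake key as on the slide; normally this line lives in .Rprofile
options(PlosApiKey = "ef3vip9yak7od3hud4g")

key <- get_api_key("PlosApiKey")
key
```

Keeping keys in options rather than in scripts means shared code never contains credentials.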

Wrapping APIs in R

in the browser...

Wrapping APIs in R

same URL in RCurl::getURL...

# Call the github json data within R using the RCurl pkg
getURL("")  # or getForm()/postForm()
[1] "{\"has_downloads\":true,\"full_name\":\"hadley/ggplot2\",\"owner\":{\"gravatar_id\":\"7ba164f40a50bc23dbb2aa825fb7bc16\",\"login\":\"hadley\",\"avatar_url\":\"\",\"url\":\"\",\"id\":4196},\"forks_count\":53,\"homepage\":\"\",\"svn_url\":\"\",\"mirror_url\":null,\"git_url\":\"git://\",\"pushed_at\":\"2012-08-17T20:49:44Z\",\"network_count\":53,\"forks\":53,\"has_wiki\":true,\"language\":\"R\",\"created_at\":\"2008-05-25T01:21:32Z\",\"watchers\":392,\"watchers_count\":392,\"description\":\"An implementation of the Grammar of Graphics in R\",\"html_url\":\"\",\"clone_url\":\"\",\"open_issues\":106,\"open_issues_count\":106,\"has_issues\":true,\"size\":1722,\"fork\":false,\"updated_at\":\"2012-08-21T16:29:13Z\",\"ssh_url\":\"\",\"name\":\"ggplot2\",\"url\":\"\",\"private\":false,\"id\":19438,\"master_branch\":\"master\"}"

Wrapping APIs in R

same URL in RCurl::getURL...parsed

# And parse the results to more R friendly list

[1] TRUE

[1] "hadley/ggplot2"
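A small offline sketch of the parsing step, assuming the jsonlite package in place of RJSONIO and using a short excerpt of the response text shown earlier rather than a live request.

```r
library(jsonlite)

# Excerpt of the JSON text returned for hadley/ggplot2 (fields from the response above)
raw_json <- '{"has_downloads":true,"full_name":"hadley/ggplot2","forks":53,"language":"R"}'

# A JSON object becomes a named R list
repo <- fromJSON(raw_json)

repo$has_downloads
repo$full_name
```

Once parsed, fields are reached by name instead of by searching the raw string.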

Wrapping APIs in R

same URL in httr::GET...

# Or use httr package by Hadley Wickham
library(httr)
tt <- GET("")
content(tt)  # content auto-detects data type, and parses
[1] TRUE

[1] "hadley/ggplot2"

Wrapping APIs in R using httr

httr from Hadley Wickham

library(httr); library(XML); library(plyr); library(RCurl)
occurrencecount <- function(scientificname = NULL, coordinatestatus = NULL, 
    url = "", 
    curl = getCurlHandle()) 
{
    # The compact fxn is a great way to gather parameters - removes all NULL
    querystr <- compact(list(scientificname = scientificname, coordinatestatus = coordinatestatus))
    # Send the request; httr builds the query string from the list
    temp <- GET(url, query = querystr)
    out <- content(temp)$doc$children$gbifResponse
    # Pull the match count out of the gbif:summary node
    as.numeric(xmlGetAttr(getNodeSet(out, "//gbif:summary")[[1]], "totalMatched"))
}
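To show what the parameter-gathering step does, here is a base-R sketch of dropping NULL arguments before building a query. `drop_null` and `build_query` are hypothetical names used only for illustration; `drop_null` plays the role that `compact` plays in the function.

```r
# Remove NULL entries from a list (base-R stand-in for compact())
drop_null <- function(x) Filter(Negate(is.null), x)

# Hypothetical helper: collect only the parameters the caller supplied
build_query <- function(scientificname = NULL, coordinatestatus = NULL) {
  drop_null(list(scientificname = scientificname,
                 coordinatestatus = coordinatestatus))
}

q_one  <- build_query(scientificname = "Abies concolor")
q_both <- build_query(scientificname = "Abies concolor", coordinatestatus = TRUE)

names(q_one)
names(q_both)
```

Only supplied arguments reach the query string, so optional API parameters can default to NULL without extra branching.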

Wrapping APIs in R using httr

The output from the API that needs to be parsed

	[1] ""

	[1] "1.0"
	  <gbif:parameter name="request" value="count"/>
	  <gbif:parameter name="service" value="occurrence"/>
	  <gbif:parameter name="scientificname" value="Abies concolor"/>
	  <gbif:parameter name="coordinatestatus" value="true"/>
	  <gbif:summary totalMatched="597"/>
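A sketch of pulling the count out of a response like this with the XML package, run on an inline copy of the lines above. The wrapping root element and the namespace URI `http://example.org/gbif` are placeholders assumed here so the snippet parses on its own; the real response declares its own namespace.

```r
library(XML)

# Inline stand-in for the API response; root element and namespace URI are placeholders
xml_txt <- '<gbif:gbifResponse xmlns:gbif="http://example.org/gbif">
  <gbif:parameter name="request" value="count"/>
  <gbif:parameter name="service" value="occurrence"/>
  <gbif:parameter name="scientificname" value="Abies concolor"/>
  <gbif:parameter name="coordinatestatus" value="true"/>
  <gbif:summary totalMatched="597"/>
</gbif:gbifResponse>'

doc <- xmlParse(xml_txt, asText = TRUE)

# XPath lookup with an explicit prefix-to-URI mapping
ns    <- c(gbif = "http://example.org/gbif")
node  <- getNodeSet(doc, "//gbif:summary", namespaces = ns)[[1]]
total <- as.numeric(xmlGetAttr(node, "totalMatched"))
total
```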

Wrapping APIs in R using httr

Run the occurrencecount function, search for white fir (Abies concolor)

occurrencecount(scientificname = "Abies concolor", coordinatestatus = TRUE)
[1] 597

Examples from some of our packages

Public Library of Science full text - rplos

Public Library of Science

plot_throughtime(list("reproducible science"), 500)

Public Library of Science uses rplos!

Managing bibliography - RMendeley


Manage libraries and measure impact of research

groupDocInfo(mc, 530031, 4344945792)
[1] "SUMMARY: Modern biological experiments create vast amounts of data which are geographically distributed. These datasets consist of petabytes of raw data and billions of documents. Yet to the best of our knowledge, a search engine technology that searches and cross-links all different data types in life sciences does not exist.....

      forename        surname
   "Dominic S" 	"L\xfctjohann" 
# ....

Accessing data behind papers - rdryad


# Get URL for a specific dataset
dryaddat <- download_url("10255/dryad.1759")

# Download the file from the Dryad servers
file <- dryad_getfile(dryaddat)

# Just first four columns
head(file[, 1:4])
  year nest.identity season clutch.size
1 2001             1      0           6
2 2001             1      0           6
3 2001             1      0           6
4 2001             1      0           6
5 2001             1      0           6
6 2001             1      0           6

Tracking altmetrics - raltmet

Tracks altmetrics across sources such as GitHub, Total Impact, CitedIn, CiteULike, and Stack Overflow.

GitHub(userorg = "ropensci", repo = "rmendeley")
totimp(id = "10.5061/dryad.8671")
stackexchange(ids = 16632)

Definitely check out and here to learn more about #altmetrics

Mapping biodiversity data - rgbif

Global Biodiversity Information Facility

distribution <- occurrencelist(sciname = "Danaus plexippus", coordinatestatus = TRUE, maxresults = 1000, latlongdf = TRUE)

Sharing unpublished data - (rfigshare)

Using Figshare's new API, it will soon be possible to share figures, data, and any other object generated in R directly to one's figshare account.

> figshare(data)
# code isn't ready yet but once it is, it will return a persistent identifier

A multi-institution consortium to build infrastructure for open science


DataONE creates all the necessary components to support persistent and secure access to earth observation data.

DataONE's upcoming R package will allow users to submit and access data to/from member nodes directly from the console.

Git + Science

Rapid peer-to-peer sharing of code is great for science

R packages early in development can easily be tested, rapidly deployed from GitHub using devtools and revised before submitting to a persistent repository such as CRAN.

install_github("RMendeley", "ropensci")

R + collaborative writing

knitr + Markdown

Xie Y (2012). knitr: A general-purpose package for dynamic report generation in R.
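A minimal sketch of the knitr + Markdown idea: a small document with one R chunk is written to a temporary file and knit to plain Markdown, with the chunk's result embedded in the output. The document text and file names are made up for illustration.

```r
library(knitr)

fence <- paste(rep("`", 3), collapse = "")  # three backticks, built to avoid nesting fences

rmd <- tempfile(fileext = ".Rmd")
md  <- tempfile(fileext = ".md")

# A tiny source document: prose plus one executable chunk
writeLines(c(
  "# A tiny executable document",
  "",
  "The mean of 1 to 9 is computed when the document is knit:",
  "",
  paste0(fence, "{r}"),
  "mean(1:9)",
  fence
), rmd)

# Run the chunk and write Markdown containing both code and result
knit(rmd, output = md, quiet = TRUE)

md_lines <- readLines(md)
cat(md_lines, sep = "\n")
```

Because the numbers in the output come from running the code, the text and the analysis cannot drift apart.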

knitr + Markdown + GitHub

GitHub automatically renders Markdown and even provides syntax highlighting

knitr + Markdown + GitHub = executable paper

knitr + Markdown + GitHub = pre-publication review

Incorporate citations with R + Markdown


citet(c(Halpern2006 = "10.1111/j.1461-0248.2005.00827.x"))
# then cite in your markdown file

# or read citations from a bibtex file which can be automatically generated and updated from services like Mendeley
bib <- read.bibtex("example.bib")
# then cite inline
citet(bib[["knitr"]])

- knitcitations by Carl Boettiger @ GitHub
- tutorial

Please contact us if you have feedback or ideas for collaborations.

All ropensci code is on GitHub.

We're also on Twitter (@ropensci) and G+.

rOpenSci tutorials

If you like, play with examples at: