Scott Chamberlain

@recology_ & @ropensci

http://bit.ly/rvantalk

Science is often not reproducible or repeatable, even within the same lab group over time.

Increasing attention in the press...

...more in the press...

...and a recent initiative to address the problem...

What do we need?

Open Science

Open data + code

But, scientists don't want to share :(

Source: http://bit.ly/Tprrx8

But, this =>

Source: PLOS, 2007

R `+` Open Science

Open Science needs open source tools

The old way...

The R way, e.g.,...

Make an API call

library(RCurl); library(RJSONIO)
dat <- fromJSON(getURL("https://api.github.com/users/hadley/repos"))

Manipulate the data

library(plyr); library(reshape2)
dat_df <- ldply(dat, function(x) as.data.frame(x[names(x) %in% c("name", "watchers_count", 
    "forks", "open_issues")]))
dat_melt <- melt(dat_df)

Run some statistical model

lm(value ~ variable, data = dat_melt)

Visualize results

library(ggplot2)
ggplot(dat_melt, aes(name, value, colour = variable)) + geom_point() + coord_flip()

This is reproducible, repeatable, and can serve as an analytic workflow.
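The manipulate and model steps can also be tried offline. The sketch below uses base R only and a toy data frame whose package names and counts are invented for illustration (they are not real GitHub statistics); the reshaping is a base R stand-in for reshape2::melt, not the code used in the talk.

```r
# Toy stand-in for the parsed API response; these values are invented
# for illustration and are not real GitHub statistics.
dat_df <- data.frame(
  name           = c("pkgA", "pkgB", "pkgC"),
  watchers_count = c(10, 25, 40),
  forks          = c(2, 5, 9),
  open_issues    = c(1, 3, 4)
)

# Manipulate: reshape from wide to long (base R analogue of reshape2::melt)
measures <- c("watchers_count", "forks", "open_issues")
dat_long <- data.frame(
  name     = rep(dat_df$name, times = length(measures)),
  variable = factor(rep(measures, each = nrow(dat_df)), levels = measures),
  value    = unlist(dat_df[measures], use.names = FALSE)
)

# Model: same formula as in the workflow above
fit <- lm(value ~ variable, data = dat_long)
coef(fit)
```

The resulting long data frame has the same name, variable, and value columns that the ggplot call in the visualization step expects.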

Why R...

Most obvious reason --> many people in science already use R.

"Bring it to where the people are"

A massive number of packages (~4,000) allows much of the scientific workflow to take place in R, including:

getting data ~ social, financial, environmental, etc.
manipulating data ~ plyr, data.table, etc.
visualizing data ~ ggplot2
analyzing data ~ base stats, lme4, etc.
writing up results ~ LaTeX/Sweave, markdown, knitr

Wrapping all science APIs

Development team

Carl Boettiger

Karthik Ram

Scott Chamberlain

Edmund Hart

Advisory team

Duncan Temple Lang

Hadley Wickham

JJ Allaire

Bertram Ludascher

Matt Jones

rOpenSci packages

Data packages

Literature packages

Hybrid packages

rOpenSci packages

http://ropensci.org/packages/index.html

rOpenSci packages

https://github.com/ropensci

Some stats on rOpenSci packages

Data via GitHub API, run in R function via OpenCPU.org

Data sources (click to go)

Literature sources

Coming when APIs ready/we get around to it...

eLife

Umbrella packages

Abstract away the detail of each package to search across data sources on:

taxonomic data
metadata from datasets and journals
full text from journals

Getting data from the web into R

Web scraping html, xml, etc.
Reading json, csv, txt, etc.
Hitting an Application Programming Interface (API)

This is our preferred and most commonly used method
Some require an API key, some do not

Authentication: R and APIs - API keys

API keys can be stored in a user's .Rprofile file

Click here for help on .Rprofile files

 
options(MendeleyKey = "uf5daib7wyil7ag5buc")
options(MendeleyPrivateKey = "faj2os5dyd7jop2fok6")
options(PlosApiKey = "ef3vip9yak7od3hud4g")
options(SpringerMetdataKey = "ri9hi7woc6jax4vaf8w")

Note: These keys aren't real.
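Once a key is set with options(), a wrapper function can read it back with getOption(). A minimal sketch, assuming a hypothetical helper named get_api_key (it is not part of any rOpenSci package) and a placeholder key value:

```r
# Hypothetical helper (an assumption, not an rOpenSci function): look up an
# API key stored via options() and stop with a readable message if it is missing.
get_api_key <- function(name) {
  key <- getOption(name)
  if (is.null(key) || !nzchar(key)) {
    stop("No API key found in option '", name,
         "'. Set it in your .Rprofile with options(", name, " = \"...\").",
         call. = FALSE)
  }
  key
}

# Usage, with a placeholder value rather than a real key
options(PlosApiKey = "not-a-real-key")
get_api_key("PlosApiKey")
```

Reading the key inside the function keeps it out of scripts that get shared or committed to GitHub.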

Wrapping APIs in R

in the browser...

Wrapping APIs in R

same URL in RCurl::getURL...

# Call the github json data within R using the RCurl pkg
library(RCurl)

getURL("https://api.github.com/repos/hadley/ggplot2")  # or getForm()/postForm()

[1] "{\"has_downloads\":true,\"full_name\":\"hadley/ggplot2\",\"owner\":{\"gravatar_id\":\"7ba164f40a50bc23dbb2aa825fb7bc16\",\"login\":\"hadley\",\"avatar_url\":\"https://secure.gravatar.com/avatar/7ba164f40a50bc23dbb2aa825fb7bc16?d=https://a248.e.akamai.net/assets.github.com%2Fimages%2Fgravatars%2Fgravatar-140.png\",\"url\":\"https://api.github.com/users/hadley\",\"id\":4196},\"forks_count\":53,\"homepage\":\"http://had.co.nz/ggplot2\",\"svn_url\":\"https://github.com/hadley/ggplot2\",\"mirror_url\":null,\"git_url\":\"git://github.com/hadley/ggplot2.git\",\"pushed_at\":\"2012-08-17T20:49:44Z\",\"network_count\":53,\"forks\":53,\"has_wiki\":true,\"language\":\"R\",\"created_at\":\"2008-05-25T01:21:32Z\",\"watchers\":392,\"watchers_count\":392,\"description\":\"An implementation of the Grammar of Graphics in R\",\"html_url\":\"https://github.com/hadley/ggplot2\",\"clone_url\":\"https://github.com/hadley/ggplot2.git\",\"open_issues\":106,\"open_issues_count\":106,\"has_issues\":true,\"size\":1722,\"fork\":false,\"updated_at\":\"2012-08-21T16:29:13Z\",\"ssh_url\":\"git@github.com:hadley/ggplot2.git\",\"name\":\"ggplot2\",\"url\":\"https://api.github.com/repos/hadley/ggplot2\",\"private\":false,\"id\":19438,\"master_branch\":\"master\"}"

Wrapping APIs in R

same URL in RCurl::getURL...parsed

# And parse the results to more R friendly list
library(RJSONIO)

fromJSON(getURL("https://api.github.com/repos/hadley/ggplot2"))

$has_downloads
[1] TRUE

$full_name
[1] "hadley/ggplot2"
....

Wrapping APIs in R

same URL in httr::GET...

# Or use httr package by Hadley Wickham
library(httr)
tt <- GET("https://api.github.com/repos/hadley/ggplot2")
content(tt)  # content auto-detects data type, and parses

$has_downloads
[1] TRUE

$full_name
[1] "hadley/ggplot2"
....

Wrapping APIs in R using `httr`

httr from Hadley Wickham

occurrencecount <- function(scientificname = NULL, coordinatestatus = NULL,
    url = "http://data.gbif.org/ws/rest/occurrence/count",
    curl = getCurlHandle())
{
    # compact() (from plyr) is a great way to gather parameters: it removes all NULL elements
    querystr <- compact(list(
        scientificname = scientificname, coordinatestatus = coordinatestatus
    ))

    # Send the request, then pull the parsed XML response out of the result
    temp <- GET(url, query = querystr)
    out <- content(temp)$doc$children$gbifResponse

    # The match count is stored in the totalMatched attribute of <gbif:summary>
    as.numeric(xmlGetAttr(getNodeSet(out, "//gbif:summary")[[1]], "totalMatched"))
}
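To see what dropping NULL parameters does to the request, here is a minimal base R sketch. The helpers drop_null and build_query are hypothetical names introduced for illustration; they mimic the idea behind compact() and a query string, and are not how httr builds requests internally.

```r
# Base R analogue of compact(): keep only the non-NULL list elements
drop_null <- function(x) Filter(Negate(is.null), x)

# Hypothetical helper: turn the remaining parameters into a URL query string
build_query <- function(params) {
  params <- drop_null(params)
  if (length(params) == 0) return("")
  encoded <- vapply(
    params,
    function(p) utils::URLencode(as.character(p), reserved = TRUE),
    character(1)
  )
  paste(paste0(names(params), "=", encoded), collapse = "&")
}

# coordinatestatus is NULL, so only scientificname reaches the query
build_query(list(scientificname = "Abies concolor", coordinatestatus = NULL))
# [1] "scientificname=Abies%20concolor"
```

Because unset arguments default to NULL and are dropped, one wrapper function can expose many optional API parameters without sending empty ones.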

Wrapping APIs in R using `httr`

The output from the API that needs to be parsed

	$doc
	$file
	[1] ""

	$version
	[1] "1.0"
		
	</gbif:statements>
	  <gbif:stylesheet>http://data.gbif.org/ws/rest/occurrence/stylesheet</gbif:stylesheet>
	  <gbif:parameter name="request" value="count"/>
	  <gbif:parameter name="service" value="occurrence"/>
	  <gbif:parameter name="scientificname" value="Abies concolor"/>
	  <gbif:parameter name="coordinatestatus" value="true"/>
	  <gbif:summary totalMatched="597"/>
	 </gbif:header>
	</gbif:gbifResponse>

Wrapping APIs in R using `httr`

Run the occurrencecount function, search for white fir (Abies concolor)

library(XML)	
library(httr)
library(plyr)

occurrencecount(scientificname = "Abies concolor", coordinatestatus = TRUE)

[1] 597

Examples from some of our packages

Public Library of Science full text - `rplos`

Public Library of Science

library(rplos)
plot_throughtime(list("reproducible science"), 500)

Public Library of Science uses `rplos`!

Managing bibliography - `RMendeley`

Mendeley

Manage libraries and measure impact of research

groupDocInfo(mc, 530031, 4344945792)

$abstract
[1] "SUMMARY: Modern biological experiments create vast amounts of data which are geographically distributed. These datasets consist of petabytes of raw data and billions of documents. Yet to the best of our knowledge, a search engine technology that searches and cross-links all different data types in life sciences does not exist.....

$authors
$authors[[1]]
      forename        surname
   "Dominic S" 	"Lütjohann" 
# ....

Accessing data behind papers - `rdryad`

Dryad

# Get URL for a specific dataset
dryaddat <- download_url("10255/dryad.1759")

# Download the file from the Dryad servers
file <- dryad_getfile(dryaddat)

# Just first four columns
head(file[, 1:4])

  year nest.identity season clutch.size
1 2001             1      0           6
2 2001             1      0           6
3 2001             1      0           6
4 2001             1      0           6
5 2001             1      0           6
6 2001             1      0           6

Tracking altmetrics - `raltmet`

Tracks altmetrics across various sources such as GitHub, Total Impact, CitedIn, CiteULike, and Stack Overflow.

GitHub(userorg = "ropensci", repo = "rmendeley")

totimp(id = "10.5061/dryad.8671")

stackexchange(ids = 16632)

Definitely check out Total-Impact.org to learn more about #altmetrics

Mapping biodiversity data - `rgbif`

Global Biodiversity Information Facility

distribution <- occurrencelist(sciname = "Danaus plexippus", coordinatestatus = TRUE, maxresults = 1000, latlongdf = TRUE)

Sharing unpublished data - `(rfigshare)`

Using Figshare's new API, it will soon be possible to share figures, data, and any other object generated in `R` directly to one's figshare account.

> figshare(data)
# code isn't ready yet but once it is, it will return a persistent identifier

A multi-institution consortium to build infrastructure for open science

DataONE

DataONE creates all the necessary components to support persistent and secure access to earth observation data.

DataONE's upcoming R package will allow users to submit and access data to/from member nodes directly from the console.

Git + Science

Rapid peer-to-peer sharing of code is great for science

R packages early in development can be installed straight from GitHub using `devtools`, tested, and revised before being submitted to a persistent repository such as CRAN.

library(devtools)
# Recent devtools releases take a single "user/repo" string
install_github("ropensci/RMendeley")

R + collaborative writing

`knitr` + Markdown

Xie Y (2012). knitr: A general-purpose package for dynamic report generation in R.

`knitr` + Markdown + GitHub

GitHub automatically renders Markdown and even provides syntax highlighting

`knitr` + Markdown + GitHub = executable paper

`knitr` + Markdown + GitHub = pre-publication review

Incorporate citations with R + Markdown

`knitcitations`

citet(c(Halpern2006 = "10.1111/j.1461-0248.2005.00827.x"))
# then cite in your markdown file
citet("Halpern2006")


# or read citations from a bibtex file which can be automatically generated and updated from services like Mendeley

bib <- read.bibtex("example.bib")
# then cite inline
citet(bib[["knitr"]])

- tutorial

ropensci.org

Please contact us if you have feedback or ideas for collaborations.

All ropensci code is on GitHub.

We're also on Twitter (@ropensci) and G+.

rOpenSci tutorials

http://ropensci.org/tutorials/

If you like, play with examples at: http://bit.ly/OoCUg3

Scott Chamberlain

@recology_ & @ropensci

http://bit.ly/rvantalk
