rOpenSci

Building tools and community for open and reproducible research


Scott Chamberlain (@ropensci)

UC Berkeley / rOpenSci

Supported by:
Alfred P. Sloan Foundation

License: CC-BY 3.0 - You are free to copy, share, adapt, or remix, photograph, film, or broadcast, blog, live-blog, or post video of this presentation, provided that you attribute the work to its author and respect the rights and licenses associated with its components.

Source: Giulia Forsythe

These data are hard to get

We need...



Cultural barriers



    Lack of incentives (carrots)

    Lack of pressure (sticks)

    Getting scooped

    Takes too much time!

Instructions for preparation of the Biographical Sketch have been revised to rename the "Publications" section to "Products" and amend terminology and instructions accordingly. This change makes clear that products may include, but are not limited to, publications, data sets, software, patents, and copyrights.




Issuance of a new NSF Proposal & Award Policies and Procedures Guide (October 4th)

The Data Policy states the ‘minimal dataset’ consists “of the dataset used to reach the conclusions drawn in the manuscript with related metadata and methods, and any additional data required to replicate the reported study findings in their entirety. This does not mean that authors must submit all data collected as part of the research, but that they must provide the data that are relevant to the specific analysis presented in the paper.”


PLOS Editorial and Publishing Policies

 Why Open?



Most research publicly funded



 Why Open?



To increase the pace of science



 Why Open?

Sharing data increases citations

Piwowar et al. 2007

Openness facilitates reproducibility

 Why Reproducible?



For yourself!



 Why Reproducible?


To avoid mistakes



  

Reinhart-Rogoff
NYTimes

      

Amgen Reproducibility Study


Reuters --  The Economist

What's needed?


toolset

 A reproducible workflow

These!

There's a learning curve, but...


Workflows side by side

  

Our workflow now

  • Browser
  • Excel
  • SAS
  • SigmaPlot
  • Word
  • Endnote
  

What it could be

  
Our workflow now:

   Cost = $$$$$$$

Open? = Nope

Reproducible? = Nope

What it could be:

   Cost = 0

Open? = Yes!

Reproducible? = Yep

A perfect combination



     

Data is increasingly on the web



API: Application Programming Interface





  



Reproducibly plug data from the web into research workflows
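As a rough illustration of what plugging web data into a scripted workflow can look like, here is a minimal Python sketch (the demos in this talk use R packages instead). The endpoint `https://api.example.org/occurrences`, the parameter names, and the JSON layout are all hypothetical, and the response is canned so the sketch runs offline.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://api.example.org/occurrences"


def build_query_url(base_url, **params):
    """Build a request URL with sorted parameters, so the same query
    always produces the same string and can be recorded in a script."""
    return base_url + "?" + urlencode(sorted(params.items()))


def parse_records(payload):
    """Extract (species, year) pairs from a JSON payload.
    The 'results' list and its field names are assumed, not a real API."""
    data = json.loads(payload)
    return [(rec["species"], rec["year"]) for rec in data["results"]]


# Canned response standing in for what a real API call would return.
SAMPLE_PAYLOAD = json.dumps({
    "results": [
        {"species": "Puma concolor", "year": 2012},
        {"species": "Puma concolor", "year": 2013},
    ]
})

url = build_query_url(BASE_URL, species="Puma concolor", limit=2)
records = parse_records(SAMPLE_PAYLOAD)
```

Because the query is written down as code rather than clicked through in a browser, anyone can rerun it and check that they get the same records.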






How rOpenSci got started



Formed from an ad hoc conversation on Twitter. Now a worldwide community of researchers.

http://ropensci.org/community

R has a lot of packages
(mostly data manipulation/visualization and statistics)





Data acquisition    

data manipulation/analysis/viz    

writing    

publish




rOpenSci packages

ropensci.org/packages

 Biology


       

 Literature


     

 Publishing


 History


 

 Archeology


 

 Altmetrics


Some numbers on the amount of data

Source      Value        Type
GBIF        420222471    Species occurrence records
BHL         155891133    Names
BHL         43968949     Pages
DataCite    3618096      Data records
eBird       2923886      Observations
NPN         2537095      Phenology records
Neotoma     2200221      Data records
OpenSNP     2140939      SNPs
COL         1352112      Taxonomic names
ITIS        624282       Taxonomic names
eBird       205970       Checklist records
BHL         139561       Items
PLOS        136330       Articles
BHL         77258        Titles
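To get a feel for the scale per provider, the figures above can be tallied with a few lines of code. This is a minimal Python sketch; the rows are copied from the slide, and the function name `totals_by_source` is just an illustrative choice.

```python
from collections import defaultdict

# Rows copied from the slide: (source, value, type).
ROWS = [
    ("GBIF", 420222471, "Species occurrence records"),
    ("BHL", 155891133, "Names"),
    ("BHL", 43968949, "Pages"),
    ("DataCite", 3618096, "Data records"),
    ("eBird", 2923886, "Observations"),
    ("NPN", 2537095, "Phenology records"),
    ("Neotoma", 2200221, "Data records"),
    ("OpenSNP", 2140939, "SNPs"),
    ("COL", 1352112, "Taxonomic names"),
    ("ITIS", 624282, "Taxonomic names"),
    ("eBird", 205970, "Checklist records"),
    ("BHL", 139561, "Items"),
    ("PLOS", 136330, "Articles"),
    ("BHL", 77258, "Titles"),
]


def totals_by_source(rows):
    """Sum the record counts for each source across all record types."""
    totals = defaultdict(int)
    for source, value, _record_type in rows:
        totals[source] += value
    return dict(totals)


totals = totals_by_source(ROWS)

# Rank sources from largest to smallest total.
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
for source, total in ranked:
    print(f"{source:<10}{total:>14,}")
```

The per-source totals mix different kinds of records (names, pages, observations), so they are only a rough indication of volume, not a like-for-like comparison.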

Code demos



    The diversity of projects

    How easy it is

    Imagine these examples at scale




 Biology

rentrez - an R client for ENTREZ


David Winter

Post-doc at Arizona State University

taxize
all things taxonomy



Ben Marwick, Ed, Scott Chamberlain, Karthik Ram

and many more...

Unified species occurrence data - spocc

Various plotting options

              
           



 Literature




 Publishing

carl

EML


EML provides a common structure for data, to better enable ecologists to document, share, and interpret ecological data



The EML standard enables data integration at the machine level (with little or no human intervention).
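To make "machine level" concrete, here is a minimal Python sketch that reads a title, a creator, and column names out of structured metadata without a human in the loop. The document below is hand-written and heavily simplified: real EML files use XML namespaces and carry many more fields, so treat the layout as an assumption for illustration only.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified EML-like document (no namespaces, few fields).
SAMPLE_EML = """<eml>
  <dataset>
    <title>Example plant survey</title>
    <creator>
      <individualName><surName>Doe</surName></individualName>
    </creator>
    <dataTable>
      <attributeList>
        <attribute><attributeName>site</attributeName></attribute>
        <attribute><attributeName>height_cm</attributeName></attribute>
      </attributeList>
    </dataTable>
  </dataset>
</eml>"""


def summarize(eml_text):
    """Pull the dataset title, creator surname, and column names
    from the simplified metadata document above."""
    root = ET.fromstring(eml_text)
    dataset = root.find("dataset")
    return {
        "title": dataset.findtext("title"),
        "creator": dataset.findtext("creator/individualName/surName"),
        "columns": [
            attribute.findtext("attributeName")
            for attribute in dataset.iter("attribute")
        ],
    }


summary = summarize(SAMPLE_EML)
```

Because every dataset described this way exposes the same fields in the same places, a script can combine many datasets without anyone reading each one by hand.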


Read more about EML

Publishing outlets



  • Figshare

  • Zenodo

  • DataONE

  • Various journals

  • More...




 History


History - mostly by Lincoln Mullen at George Mason University



 Archeology




 Altmetrics

article-level metrics

There are a lot of altmetrics out there

Canonical altmetrics document

Altmetrics data are often open






Data acquisition    

data manipulation/analysis/viz    

writing    

publish

Docker

rocker

Community building

Why community building?

Software robustness through time

network

Other reasons

    Training

    Software quality

The rOpenSci Community

http://ropensci.org/community

Community stats



  • 71 contributors

  • 148 GitHub repositories

  • > 9,000 commits over ~3 years

  • a few packages with ~900 commits

  • ~66 R packages

Domains



  • Ecology/Biology



  • Altmetrics

  • History

  • Archeology

  • More...

Training







rOpenSci Ambassador Program



Give talks


Code demos

rOpenSci brings to the table:



    Bridge between scientists and data providers

    Visibility

    Consistency

    Quality

    Conversation/community

    Stability

github.com/ropengov



github.com/ropenhealth



In closing...


This is all about improving research


rOpenSci is just one vehicle with which to help


rOpenSci on the web: ropensci.org



This talk on the web: recology.info/talks/sfustats







Made w/ reveal.js


Icons by: FontAwesome v4.2.0