Building tools and community for open and reproducible research
Scott Chamberlain (@ropensci)
UC Berkeley / rOpenSci
License: CC-BY 3.0 - You are free to copy, share, adapt, or remix, photograph, film, or broadcast, blog, live-blog, or post video of this presentation, provided that you attribute the work to its author and respect the rights and licenses associated with its components.
These data are hard to get
Cultural barriers
Lack of incentives (carrots)
Lack of pressure (sticks)
Getting scooped
Takes too much time!
The Data Policy states that the ‘minimal dataset’ consists “of the dataset used to reach the conclusions drawn in the manuscript with related metadata and methods, and any additional data required to replicate the reported study findings in their entirety. This does not mean that authors must submit all data collected as part of the research, but that they must provide the data that are relevant to the specific analysis presented in the paper.”
PLOS Editorial and Publishing Policies
Why Open?
Most research publicly funded
To increase the pace of science
Openness facilitates reproducibility
Why Reproducible?
For yourself!
To avoid mistakes
These!
There's a learning curve, but...
Workflows side by side
| | Our workflow now | What it could be |
| --- | --- | --- |
| Tools | Browser, Excel, SAS, SigmaPlot, Word, EndNote | |
| Cost | $$$$$$$ | 0 |
| Open? | Nope | Yes! |
| Reproducible? | Nope | Yep |
A perfect combination
Data is increasingly on the web
API: Application Programming Interface
Reproducibly plug data from the web into research workflows
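As a minimal sketch of what "plugging web data into a workflow" can look like: the request is built in code (so the exact query is recorded) and the JSON response is parsed into plain records. The talk itself is about R; Python's standard library is used here only to keep the example self-contained. The endpoint URL, the parameter names, and the response layout are all hypothetical, and the response is a canned string rather than a live call.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://api.example.org/v1/occurrences"


def build_query_url(base_url, **params):
    """Build a request URL; sorting the parameters makes the same query
    always produce the same string, which helps when logging it."""
    return base_url + "?" + urlencode(sorted(params.items()))


def parse_records(payload):
    """Extract (species, year) pairs from a JSON response body."""
    data = json.loads(payload)
    return [(r["species"], r["year"]) for r in data["results"]]


url = build_query_url(BASE_URL, species="Puma concolor", limit=2)

# Canned response standing in for what a real service might return
# (assumed layout, not any particular provider's format).
sample_payload = (
    '{"results": ['
    '{"species": "Puma concolor", "year": 2012}, '
    '{"species": "Puma concolor", "year": 2013}]}'
)
records = parse_records(sample_payload)

print(url)
print(records)
```

Because the query lives in a script rather than in a sequence of browser clicks, anyone can rerun it later and get the data the same way.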
R has a lot of packages
(mostly data manipulation/visualization and statistics)
Data acquisition
data manipulation/analysis/viz
writing
publish
rOpenSci packages
- Biology
- Literature
- Publishing
- History
- Archeology
- Altmetrics
Some numbers on the amount of data
| Source | Value | Type |
| --- | --- | --- |
| GBIF | 420222471 | Species occurrence records |
| BHL | 155891133 | Names |
| BHL | 43968949 | Pages |
| DataCite | 3618096 | Data records |
| eBird | 2923886 | Observations |
| NPN | 2537095 | Phenology records |
| Neotoma | 2200221 | Data records |
| OpenSNP | 2140939 | SNPs |
| COL | 1352112 | Taxonomic names |
| ITIS | 624282 | Taxonomic names |
| eBird | 205970 | Checklist records |
| BHL | 139561 | Items |
| PLOS | 136330 | Articles |
| BHL | 77258 | Titles |
Code demos
The diversity of projects
How easy it is
Imagine these examples at scale
rentrez - an R client for ENTREZ
David Winter
Post-doc at Arizona State University
taxize
all things taxonomy
and many more...
Unified species occurrence data - spocc
EML
EML (Ecological Metadata Language) provides a common structure for data, to better enable ecologists to document, share, and interpret ecological data.
The EML standard enables data integration at the machine level (with little or no human intervention).
Read more about EML
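To illustrate what "integration at the machine level" means in practice, here is a rough sketch that reads a small XML metadata record and pulls out the dataset title and column names without a human opening the file. EML is XML-based, but the fragment below is a heavily simplified, hypothetical record and not a schema-valid EML document; the function name `summarize` is also just illustrative. Python is used only to keep the sketch self-contained.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical metadata fragment (NOT schema-valid EML).
SAMPLE_METADATA = """
<dataset>
  <title>Example bird survey</title>
  <creator><individualName><surName>Doe</surName></individualName></creator>
  <attributeList>
    <attribute><attributeName>site</attributeName></attribute>
    <attribute><attributeName>count</attributeName></attribute>
  </attributeList>
</dataset>
"""


def summarize(xml_text):
    """Return the title, creator surnames, and column names of a record."""
    root = ET.fromstring(xml_text.strip())
    return {
        "title": root.findtext("title"),
        "creators": [e.text for e in root.iter("surName")],
        "attributes": [e.text for e in root.iter("attributeName")],
    }


summary = summarize(SAMPLE_METADATA)
print(summary)
```

Once every dataset describes itself in the same structure, a script like this can be run over many records to find, for example, all datasets that share a column.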
Publishing outlets
Figshare
Zenodo
DataONE
Various journals
More...
History - mostly by
Lincoln Mullen at George Mason University
Altmetrics data are often open
Data acquisition
data manipulation/analysis/viz
writing
publish
Other reasons
Training
Software quality
Domains
Ecology/Biology
Altmetrics
History
Archeology
More...
rOpenSci Ambassador Program
Give talks
Code demos
rOpenSci brings to the table:
Bridge between scientists and data providers
Visibility
Consistency
Quality
Conversation/community
Stability
In closing...
This is all about improving research
rOpenSci is just one vehicle with which to help