Open and reproducible science with R
Scott Chamberlain
UC Berkeley / rOpenSci
open science is badly needed
Retractions
science should be reproducible!
but doing it for real is another matter
Emergent findings
open data can make a new finding possible
Cultural barriers
Lack of incentives (carrots)
Lack of pressure (sticks)
Getting scooped
Takes too much time!
Open science as a lego set
open science may be hard to do
but - you can work on different components
and - individual components are useful on their own
you don't need to do it all at once
Open science components
Open Data
make your data open
funders and journals often require this anyway
your future self will thank you
Open science components
Open Data: Venues
- Include data with publications
- Data specific repositories
- Code sharing sites: e.g., GitHub
- so-called Institutional Repositories (IRs)
Open science components
Open Access
make your papers open
funders often require this anyway
talk to your librarians!
Open science components
Open Access: Preprints
Preprints increasingly allowed by publishers
a growing number of preprint outlets
talk to your librarians!
*: think twice maybe
Open science components
Open Access: Green OA
Often allowed to post your "author's copy" on your website, etc.
the internet will surface it
Open science components
Versioning
Including basically all research components:
- Code
- Data
- Metadata
- Text: manuscripts
Open science components
Why use Versioning?
- failure-proofs your work
- allows you to experiment freely!
git and R help
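Versioning can be driven from R itself. The following is a minimal sketch, assuming the git2r package is installed; the project folder, file name, author details, and commit message are hypothetical placeholders, not part of the talk.

```r
# Create a small project in a temporary folder (hypothetical example project).
project <- file.path(tempdir(), "my-analysis")
dir.create(project, showWarnings = FALSE)
writeLines("summary(mtcars)", file.path(project, "analysis.R"))

# ASSUMPTION: git2r is installed; the git steps are skipped otherwise.
if (requireNamespace("git2r", quietly = TRUE)) {
  repo <- git2r::init(project)                      # start a repository
  git2r::config(repo,
                user.name  = "Example Author",      # placeholder identity
                user.email = "author@example.org")
  git2r::add(repo, "analysis.R")                    # stage the script
  git2r::commit(repo, message = "Add first version of the analysis script")
  print(git2r::commits(repo))                       # show the recorded history
}
```

Each later change to analysis.R can be staged and committed the same way, so earlier versions stay recoverable.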
Open science components
Versioning: Git
Resources
Open science components
Do all work programmatically
Key to reproducibility:
The most important person who wants to reproduce your work is you!
Open science components
Do all work programmatically
you and yourself
- one week from now
- two months from now
- & so on
Open science components
Do all work programmatically
allows others to:
- contribute to your work
- check your work
- build on top of your work
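As an illustration of doing all work programmatically, here is a minimal sketch of a fully scripted analysis in base R. The built-in mtcars dataset stands in for real project data, and the output file name is a hypothetical choice.

```r
# Every step from data to saved result lives in code, so it can be re-run
# next week, in two months, or by someone else.

set.seed(42)  # fix the random seed so the resampling step is repeatable

# 1. Load data (a built-in dataset here; in practice, read.csv() on a raw file)
cars <- mtcars

# 2. Manipulate: mean fuel economy per cylinder count
mpg_by_cyl <- aggregate(mpg ~ cyl, data = cars, FUN = mean)

# 3. Analyze: bootstrap a 95% interval for the overall mean mpg
boot_means <- replicate(1000, mean(sample(cars$mpg, replace = TRUE)))
boot_ci <- quantile(boot_means, c(0.025, 0.975))

# 4. Save the result to a file (hypothetical file name, temporary folder)
out_file <- file.path(tempdir(), "mpg_by_cyl.csv")
write.csv(mpg_by_cyl, out_file, row.names = FALSE)

# 5. Record the software environment alongside the results
print(sessionInfo())
```

Because nothing is done by hand, others can check the work by running the same script and comparing outputs.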
scientific programming languages
scientific programming languages are the canvas on which to do science
important scientific programming languages
R language
R homepage
used widely in biology, psychology, medicine, etc.
rapidly growing user base, companies surrounding it
includes all tools for open science workflow
salaries for R skills are high (1, 2)
What's the most important thing about R with respect to open and reproducible science?
R itself -> you're programming!
Workflows
A script (i.e., a .R file)
Script + Text = Markdown/LaTeX (e.g., journal article)
Any files + Dropbox
Any files + versioning (git)
Any files + versioning (git) + Pandoc
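The "Script + Text" workflow can be sketched as follows: an R Markdown file mixes prose with code, and rendering it runs the code and weaves the results into the document. This is a minimal sketch; the report title, chunk name, and file name are hypothetical, and rendering is only attempted if the rmarkdown package and Pandoc happen to be available.

```r
# Build the chunk fence from a single backtick so this example can itself
# be shown inside a fenced block.
fence <- strrep("`", 3)

# Contents of a small R Markdown document: YAML header, prose with inline
# R code, and one code chunk that draws a plot.
rmd_lines <- c(
  "---",
  "title: \"Example report\"",
  "output: html_document",
  "---",
  "",
  "The mean fuel economy in mtcars is `r round(mean(mtcars$mpg), 1)` mpg.",
  "",
  paste0(fence, "{r mpg-plot}"),
  "plot(mtcars$wt, mtcars$mpg)",
  fence
)

rmd_file <- file.path(tempdir(), "example-report.Rmd")
writeLines(rmd_lines, rmd_file)

# ASSUMPTION: rmarkdown and Pandoc are installed; otherwise skip rendering.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  html_file <- rmarkdown::render(rmd_file, quiet = TRUE)
  message("Rendered report: ", html_file)
}
```

The .Rmd source is plain text, so it also works well with git in the versioned workflows listed above.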
What to aim for
Do as much as possible in code
Version control all products
Combine text and code together
Share/open up your work
Open science ecosystem
the research workflow
Data acquisition
data manipulation/analysis/viz
writing
publish
rOpenSci makes data driven stories easier to tell
here are some stories ...
use case 1
Lovelace, R., Goodman, A., Aldred, R., Berkoff, N., Abbas, A., & Woodcock, J. (2015). The Propensity to Cycle Tool: An open source online system for sustainable transport planning. arXiv preprint
http://pct.bike
Wrap Up
Open science is essential
Open science tools are useful on their own
rOpenSci: one of the tool makers
Challenges going forward
Largely cultural - will slowly change
Wrap Up
rOpenSci is a community project
Let us know what you need
Help us make better tools
rOpenSci Tools
Data Publication | Data Access | Literature | Altmetrics | Scalable & Reproducible Computing | Databases | Data Visualization | Image Processing | Data Tools | Taxonomy | HTTP tools | Geospatial | Data Analysis
rOpenSci Literature Tools
Public Library of Science
Using rplos, we can access metadata and full text for any PLOS article
Install rplos like:
install.packages("rplos")
example demo
Exercise
- Create a .Rmd file
- Use rplos to get full text for 1000 articles
  - calculate number of authors per article
  - make a simple plot of authors per article
- Render the .Rmd to .html
- Send the .Rmd version to your partner via email
- Render the .Rmd file you received
- Does your .html look the same?
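One possible starting point for the author-count part of the exercise is sketched below. It is a simplification, not the exercise solution: it uses search metadata from rplos::searchplos() rather than full text, and it assumes the author field comes back as a single string with names separated by ";". The helper count_authors() is a hypothetical name introduced here. The network call only runs in an interactive session with rplos installed.

```r
# Hypothetical helper: count authors in strings like "A. Smith; B. Jones".
# ASSUMPTION: author names are separated by ";" in the returned metadata.
count_authors <- function(author_strings, sep = ";") {
  parts <- strsplit(as.character(author_strings), sep, fixed = TRUE)
  vapply(parts, function(p) sum(!is.na(p) & nzchar(trimws(p))), integer(1))
}

# Offline check of the helper on made-up author strings
example_counts <- count_authors(c("A. Smith; B. Jones; C. Wu", "D. Lee", NA))
print(example_counts)

# Query PLOS only when run interactively, since this needs network access.
if (interactive() && requireNamespace("rplos", quietly = TRUE)) {
  res <- rplos::searchplos(
    q     = "*:*",                # match all documents
    fl    = c("id", "author"),    # fields to return: DOI and authors
    fq    = "doc_type:full",      # whole articles, not article sections
    limit = 1000
  )
  n_authors <- count_authors(res$data$author)
  hist(n_authors,
       main = "Authors per PLOS article",
       xlab = "Number of authors")
}
```

Placing this code in a chunk of the .Rmd file lets the rendered .html show both the code and the plot.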
rOpenSci Tools
Data Publication | Data Access | Literature | Altmetrics | Scalable & Reproducible Computing | Databases | Data Visualization | Image Processing | Data Tools | Taxonomy | HTTP tools | Geospatial | Data Analysis
rOpenSci Geospatial Tools
using openadds: get addresses for Lane County
using leaflet: visualize locations on map
Exercise
- Create a .Rmd file
- Use openadds to get a data.frame of addresses, then leaflet to visualize the map
- Render the .Rmd to .html
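The mapping half of this exercise can be sketched as below. The data.frame is a made-up stand-in for what openadds would return (its column names and coordinates are assumptions for illustration only), and the map is built only if the leaflet package is installed.

```r
# Hypothetical stand-in for an address table; real data would come from
# openadds. Column names (address, lon, lat) are assumed, values are made up.
addresses <- data.frame(
  address = c("Example address 1", "Example address 2", "Example address 3"),
  lon     = c(-123.09, -123.02, -123.11),
  lat     = c(44.05, 44.05, 44.06),
  stringsAsFactors = FALSE
)

# ASSUMPTION: leaflet is installed; otherwise the map step is skipped.
if (requireNamespace("leaflet", quietly = TRUE)) {
  map <- leaflet::leaflet(addresses)       # start a map widget from the data
  map <- leaflet::addTiles(map)            # add default OpenStreetMap tiles
  map <- leaflet::addMarkers(map,
                             lng   = ~lon,      # formula refers to columns
                             lat   = ~lat,
                             popup = ~address)  # shown when a marker is clicked
  if (interactive()) print(map)            # display in the viewer or browser
}
```

Inside an .Rmd chunk, ending the chunk with the map object embeds the interactive map in the rendered .html.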