Open and reproducible research with R (and web scraping!)
Scott Chamberlain
UC Berkeley / rOpenSci
open research is badly needed
Retractions
research should be reproducible!
but doing it for real is another issue
Emergent findings
open data can make a new finding possible
Cultural barriers
Lack of incentives (carrots)
Lack of pressure (sticks)
Getting scooped
Takes too much time!
Open research as a lego set
open research may be hard to do
but - you can work on different components
and - individual components are useful on their own
you don't need to do it all at once
Open research components
Open Data
make your data open
funders/journals often require this anyway
your future self will thank you
Open research components
Open Data: Venues
- Include data with publications
- Data specific repositories
- Code sharing sites: e.g., GitHub
- Institutional Repositories (IRs): e.g., UO's Scholars Bank
Open research components
Open Access
make your papers open
funders often require this anyway
talk to your librarians!
Open research components
Open Access: Preprints
Preprints increasingly allowed by publishers
more and more preprint outlets
talk to your librarians!
Open research components
Open Access: Green OA
You're allowed to put your "author's copy" on your website, etc.
the internet will surface it
Open research components
Versioning
Open research components
Versioning
Version basically all research components:
- Code
- Data
- Metadata
- Text: manuscripts
Open research components
Why use Versioning?
- failure-proofs your work
- allows you to experiment freely!
git and R can help
Open research components
Versioning: Git
Resources
Open research components
Do all work programmatically
Key to reproducibility:
The most important person who wants to reproduce your work is you!
Open research components
Do all work programmatically
you and your future self:
- one week from now
- two months from now
- & so on
Open research components
Do all work programmatically
allows others to:
- contribute to your work
- check your work
- build on top of your work
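For instance, a minimal sketch of a fully scripted analysis, from raw data to outputs (file names and columns here are hypothetical):

# a fully scripted pipeline: no manual spreadsheet edits, no copy-paste steps
dat <- read.csv("data/raw.csv")        # hypothetical raw data file
dat$ratio <- dat$mass / dat$length     # hypothetical columns
png("figs/ratio-hist.png")             # the figure regenerates on every run
hist(dat$ratio, main = "Mass/length ratio")
dev.off()
write.csv(dat, "data/processed.csv", row.names = FALSE)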
research programming languages
are the canvas on which to do research
important research programming languages
R language
R homepage
used widely in biology, psychology, medicine, etc.
rapidly growing user base, companies surrounding it
includes all tools for open research workflow
salaries for R skills are among the highest (1, 2)
What's the most important thing about R with respect to open/reproducible research?
R itself -> you're programming!
Open research ecosystem
the research workflow
Data acquisition
data manipulation/analysis/viz
writing
publish
Wrap Up
Open research is essential
Open research tools are useful on their own
rOpenSci: one of the tool makers
Challenges going forward
Largely cultural - will slowly change
Wrap Up
rOpenSci is a community project
Let us know what you need
Help us make better tools
Web Data: Types
Scraping HTML
Downloading files
APIs
down the list: increasing organization and longevity
up the list: increasing complexity for the user (mostly)
Web Data: Use Cases
Scraping: specialized, one-time problems
Downloading: when you want all the data
APIs: stable, medium-sized data
Approaching a website with respect to data
Look at menu, header, footer
Look for key words: "API", "Data", "Developers", etc.
Contact a technical person (or anyone there)
Browser developer tools are your friends
Scraping as a last resort
Scraping: brief intro
scraping (n.): extracting useful bits out of a pile of (typically) HTML (my working definition for today)
Scraping Exercise
- Install and load the rvest library
- Fetch the contents of the URL (in the answer below)
- Pull out the results table
url <- "http://www.goducks.com/cumestats.aspx?path=wsoc"
x <- xml2::read_html(url)
rvest::html_table(x)[[6]]
Scraping: beware
Look on "terms" or related page for any legalese
Probably all good when simply exploring
Take precautions when doing lots of scraping
Look around: StackOverflow, googling, talking to friends, etc.
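A minimal sketch of one such precaution: pause between requests so you don't hammer the server (the URLs are placeholders):

urls <- paste0("http://example.com/results?page=", 1:3)  # placeholder URLs
pages <- lapply(urls, function(u) {
  Sys.sleep(1)  # wait a second between requests; be polite to the server
  xml2::read_html(u)
})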
Downloading files: intro
(includes web scraping: you have to download a page to scrape it)
Can be easy - sometimes hard
Protocols: HTTP, FTP, etc. (mostly HTTP)
Authentication sometimes
File sizes can be very large
Does UO have a firewall?
Downloading files in R
lots of ways to download files
many pkgs/fxns for certain file types also download the file for you
recommendation: curl::curl_download, which replaces download.file in base R
think about where you're putting files!
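A quick sketch of the two side by side (the URL is a placeholder):

url <- "http://example.com/data.csv"  # placeholder URL

# base R
download.file(url, destfile = "data.csv")

# curl: same job, but it downloads to a temp file and only moves it to
# destfile on success, so a failed request doesn't leave a partial file
curl::curl_download(url, destfile = "data.csv")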
Downloading Exercise - 1
the file: data.csv (URL in the answer below)
- Download the file using curl::curl_download
- Read the file in however you like
url <- "https://raw.githubusercontent.com/sckott/soylocs/gh-pages/data.csv"
x <- curl::curl_download(url, (f <- tempfile(fileext = ".csv")))
readr::read_csv(x)
Downloading files: misc. topics
Compressed files
Downloading Exercise - 2
Answer:
url <- "https://nces.ed.gov/pubs2006/data/ALS2002_ASCII.zip"
x <- curl::curl_download(url, (f <- tempfile(fileext = ".zip")))
mydir <- file.path(tempdir(), "mydir")
unzip(f, exdir = mydir)
readr::read_tsv(list.files(mydir, full.names=TRUE))
APIs!
API = Application Programming Interface
Rules for how machines/software talk to one another
There can be an API for: an Android phone, the Linux operating system, an R package, a web service
APIs for data on the web
Web APIs
A server in the cloud
A database with data
Rules about how to talk to that database
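A minimal example of those three pieces from the R side, using httpbin.org as a stand-in API (crul is rOpenSci's HTTP client; httr works similarly):

# send an HTTP GET, get structured (JSON) data back, parse it
con <- crul::HttpClient$new(url = "https://httpbin.org")
res <- con$get("get")
jsonlite::fromJSON(res$parse("UTF-8"))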
Standardized APIs?
most APIs follow no standard, though loosely follow REST
but some standardized frameworks exist
databases: Solr is common
JSON API is the closest thing to a standard for REST
Lack of Standardization in APIs Makes Consuming them Tough
APIs Exercise - 2
Data source: https://usedgov.github.io/api/crdc.html
- Make a request for data
- Parse the data to a list, with resources as a data.frame
- Pull out all the data for the TOT_ENR_F and TOT_ENR_M fields
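One possible answer, sketched with crul + jsonlite; the endpoint path and query parameters below are assumptions, so check the API docs at the data source URL for the real values:

library(crul)
library(jsonlite)

# NOTE: path and query params are assumptions -- consult the CRDC API docs
con <- crul::HttpClient$new(url = "https://api.ed.gov")
res <- con$get(path = "data/crdc-school-2013", query = list(per_page = 100))
res$raise_for_status()

# simplifyVector = TRUE collapses the `resources` JSON array to a data.frame
out <- jsonlite::fromJSON(res$parse("UTF-8"), simplifyVector = TRUE)
df <- out$resources

# pull out the two total-enrollment fields
df[, c("TOT_ENR_F", "TOT_ENR_M")]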
APIs Exercise - 3
Data source: https://usedgov.github.io/api/crdc.html
- Make a request for data with headers
- Get the request headers
- Get the response headers
- Pick a response header and look it up (Google, Wikipedia) to find out what it is
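A sketch of inspecting both sides of the exchange with crul (httpbin.org is used here just to keep the example self-contained):

# send a custom request header, then look at what went out and came back
con <- crul::HttpClient$new(
  url = "https://httpbin.org",
  headers = list(`User-Agent` = "my-research-script/0.1")
)
res <- con$get("get")
res$request_headers   # headers we sent
res$response_headers  # headers the server sent back (e.g., content-type)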
APIs Exercise - 4
Data source: https://httpbin.org/
- Look up curl options (see curl::curl_options()) and check a few out to see what they do
- Make a few requests for data with a different curl option specified for each
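A sketch of trying a couple of curl options (the option names are real curl options; the values are arbitrary choices):

# see what's available
head(names(curl::curl_options()))

# one request with a short timeout, another with verbose output
con1 <- crul::HttpClient$new(url = "https://httpbin.org",
  opts = list(timeout_ms = 5000))
con1$get("get")$status_code

con2 <- crul::HttpClient$new(url = "https://httpbin.org",
  opts = list(verbose = TRUE))
con2$get("get")$status_code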
Look for API Wrappers
Don't reinvent the wheel
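For example (an illustration, not from the slides): rOpenSci's rcrossref wraps the Crossref API, so you never write the HTTP calls yourself:

library(rcrossref)
# search the Crossref scholarly metadata API through the wrapper
cr_works(query = "reproducibility", limit = 5)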