Open and reproducible research with R (and web scraping!)

Scott Chamberlain

UC Berkeley / rOpenSci

Helmsley Foundation


open science/research

open research is badly needed


research should be reproducible!

but doing it for real is another matter

100 psychology studies

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science 349(6251): aac4716.

Emergent findings

open data can make a new finding possible







Cultural barriers

    Lack of incentives (carrots)

    Lack of pressure (sticks)

    Getting scooped

    Takes too much time!

Open research as a lego set

open research may be hard to do

but - you can work on different components

and - individual components are useful on their own

you don't need to do it all at once

Open research components

Open Data

make your data open

funders/journals often require this anyway

future self will thank you

Open research components

Open Data: Venues

  • Include data with publications
  • Data specific repositories
  • Code sharing sites: e.g., GitHub
  • Institutional Repositories (IRs), e.g., UO's Scholars Bank

Open research components

Open Access

make your papers open

funders often require this anyway

talk to your librarians!

Open research components

Open Access: Preprints

Preprints increasingly allowed by publishers

a growing number of preprint outlets

SocArXiv, PsyArXiv

talk to your librarians!

Open research components

Open Access: Green OA

Allowed to put your "author's copy" on your website, etc.

the internet will surface it

Open research components

Versioning

Including basically all research components:

  • Code
  • Data
  • Metadata
  • Text: manuscripts

Open research components

Why use Versioning?

  • failure-proofs your work
  • allows you to experiment freely!

git and R help with this
Open research components

Versioning: Git
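
A minimal sketch of versioning from R, using the git2r package (my choice for illustration; any git client works, and the directory and file names are hypothetical):

library(git2r)

dir.create("myproject")
repo <- init("myproject")                         # create a new git repository
writeLines("x <- 1", file.path("myproject", "analysis.R"))
add(repo, "analysis.R")                           # stage the file
commit(repo, "add first analysis script")         # snapshot your work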

Open research components

Do all work programmatically

Open research components

Do all work programmatically

Key to reproducibility:

The most important person who wants to reproduce your work is you!

Open research components

Do all work programmatically

you and your future self:

- one week from now

- two months from now

- & so on

Open research components

Do all work programmatically

allows others to:

- contribute to your work

- check your work

- build on top of your work
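
For instance, a fully scripted workflow might look like this sketch (the file paths and the `value` column are hypothetical):

library(readr)
library(dplyr)
library(ggplot2)

raw <- read_csv("data/raw.csv")                 # acquisition
clean <- filter(raw, !is.na(value))             # manipulation
p <- ggplot(clean, aes(value)) + geom_histogram(bins = 30)   # visualization
ggsave("figs/value_hist.png", p)                # rerun the script, regenerate everything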

research programming languages

the canvas on which to do research

important research programming languages

R language

R homepage  

  • used widely in biology, psychology, medicine, etc.

  • rapidly growing user base, companies surrounding it

  • includes all tools for open research workflow

  • salaries for R skills are up there (1, 2)

Open/Rep. Research w/ R

What's the most important thing about R for open/reproducible research?

R itself -> you're programming!

Open research ecosystem



rOpenSci does:


the research workflow

Data acquisition    

data manipulation/analysis/viz    




Wrap Up

  • Open research is essential

  • Open research tools are useful on their own

  • rOpenSci: one of the tool makers

  • Challenges going forward

    • Largely cultural - will slowly change

Wrap Up

  • rOpenSci is a community project

  • Let us know what you need

  • Help us make better tools


let's switch gears ...

Web Data

Web Data: Types

  • Scraping html

  • Download files

    • HTTP

    • FTP

    • etc.

  • APIs

going down the list: increasing organization and longevity

going up the list: increasing complexity for the user (mostly)

Web Data: Use Cases

  • Scraping: specialized, one time problems

  • Download: when you want all the data

  • APIs: stable, medium sized data

Approaching a website for data

  • Look at menu, header, footer

  • Look for key words: "API", "Data", "Developers", etc.

  • Contact a technical person (or anyone)

  • Browser developer tools are your friends

  • Scraping as a last resort

Scraping: brief intro

scraping (n.): extracting useful bits out of a pile of (typically) HTML (my working definition for today)

see Wikipedia's def.

Scraping Exercise

navigate to:

  1. Install and load rvest library
  2. Fetch the contents of the URL above
  3. Pull out the results table
library(rvest)                    # step 1 (install.packages("rvest") first if needed)
url <- ""                         # the URL from the exercise (elided here)
x <- xml2::read_html(url)         # step 2: fetch the page contents
rvest::html_table(x)[[1]]         # step 3: pull out the results table

Scraping: beware

  • Look at the "terms" or a related page for any legalese

  • Probably all good when simply exploring

  • Take precautions when doing lots of scraping

  • Look around: StackOverflow, just googling, talk to friends, etc.

Downloading files: intro

  • (web scraping is included here: you have to download a page before you can scrape it)

  • Can be easy - sometimes hard

  • Protocols: HTTP, FTP, etc. (mostly HTTP)

  • Authentication sometimes

  • File sizes can be very large

  • Does UO have a firewall?

Downloading files in R

  • lots of ways to download files

  • many pkgs/fxns for certain file types also download the file for you

  • recommendation: curl::curl_download, a drop-in replacement for base R's download.file

  • think about where you're putting files!

Downloading Exercise - 1

the file: (URL elided)

  1. Download the file using curl::curl_download
  2. Read the file in however you like
url <- ""                                                   # the file URL (elided here)
f <- curl::curl_download(url, tempfile(fileext = ".csv"))   # step 1: download
readr::read_csv(f)                                          # step 2: read it in however you like

Downloading files: misc. topics

Downloading Exercise - 2

Compressed files

US Dept. Education Academic Libraries Survey 2012

  1. Download the .zip file Academic Libraries Survey 2012
  2. Uncompress the file
  3. Read in the .txt file

Downloading Exercise - 2

Compressed files


url <- ""                                                   # the .zip URL (elided here)
f <- curl::curl_download(url, tempfile(fileext = ".zip"))   # step 1: download
mydir <- file.path(tempdir(), "mydir")
unzip(f, exdir = mydir)                                     # step 2: uncompress
readr::read_tsv(list.files(mydir, pattern = "\\.txt$", full.names = TRUE))  # step 3: read the .txt


APIs: intro

  • API = Application Programming Interface

  • Rules about how machines/software talk to one another

  • Can be an API for: Android phone, Linux operating system, an R package, a web service

APIs for data on the web

  • Web APIs

    • A server in the cloud

    • Database w/ data

    • Rules about how to talk to that database

Standardized APIs?

most APIs follow no strict standard, though many loosely follow REST

but some standardized frameworks exist

Lack of Standardization in APIs

Makes Consuming them Tough

APIs Exercise - 1

Data source:

Make a request for data - any request

goal: make sure everything's working as expected
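
A minimal sketch with the httr package (the endpoint URL is hypothetical, since the data source link is elided here):

library(httr)

res <- GET("https://api.example.com/data")   # swap in the real data source URL
status_code(res)                             # expect 200 if everything's working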

APIs Exercise - 2

Data source:

  1. Make a request for data
  2. Parse data to a list with resources as a data.frame
  3. Pull out all the data for the TOT_ENR_F and TOT_ENR_M fields
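
One way to approach it, assuming a JSON response with a "resources" element holding a data.frame (the endpoint and response structure are assumptions; adapt to the real data source):

library(httr)
library(jsonlite)

res <- GET("https://api.example.com/data")                      # hypothetical endpoint
out <- fromJSON(content(res, as = "text", encoding = "UTF-8"))  # step 2: parse to a list
df <- out$resources                                             # assumed location of the data.frame
df[, c("TOT_ENR_F", "TOT_ENR_M")]                               # step 3: pull out the two fields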

APIs Exercise - 3

Data source:

  1. Make a request for data w/ headers
  2. Get the request headers
  3. Get the response headers
  4. Pick a response header and google it/wikipedia it, find out what it is
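
A sketch with httr (the endpoint and the request header are hypothetical):

library(httr)

res <- GET("https://api.example.com/data",        # hypothetical endpoint
           add_headers(`X-My-Header` = "hello"))  # step 1: request w/ a custom header
res$request$headers                               # step 2: the request headers
headers(res)                                      # step 3: the response headers
# step 4: look one up, e.g., "Content-Type" or "ETag"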

APIs Exercise - 4

Data source:

  1. Look up curl options (see curl::curl_options()), check a few out to see what they do
  2. Make a few requests for data w/ a different curl option specified for each
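
A sketch using the curl package directly (the URL is hypothetical; verbose and timeout are just two options to try):

library(curl)

curl_options()                                 # step 1: browse the available options
h <- new_handle(verbose = TRUE, timeout = 10)  # set a couple of options on a handle
res <- curl_fetch_memory("https://api.example.com/data", handle = h)
res$status_code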

Look for API Wrappers

Don't reinvent the wheel

Check CRAN, GitHub, etc.

Resources: software

Made w/: reveal.js v3.2.0

Some Styling: Bootstrap v3.3.5

Icons by: FontAwesome v4.4.0