Open and reproducible research with R (and web scraping!)


Scott Chamberlain

UC Berkeley / rOpenSci
Helmsley Foundation

LICENSE: CC-BY 4.0




open science/research


open research is badly needed

Retractions


research should be reproducible!


but doing it for real is another issue

100 psychology studies

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science 

Emergent findings




open data can make a new finding possible

Barriers



    Technical

    Cultural

Cultural barriers



    Lack of incentives (carrots)

    Lack of pressure (sticks)

    Getting scooped

    Takes too much time!

Open science as a lego set


Open research as a lego set


open research may be hard to do


but - you can work on different components


and - individual components are useful on their own

you don't need to do it all at once

Open research components

Open Data


make your data open


funders/journals often require this anyway


future self will thank you

Open research components

Open Data: Venues


  • Include data with publications
  • Data specific repositories
  • Code sharing sites: e.g., GitHub
  • Institutional Repositories (IRs), e.g., UO's Scholars' Bank

Open research components

Open Access


make your papers open


funders often require this anyway


talk to your librarians!

Open research components

Open Access: Preprints


Preprints increasingly allowed by publishers


a growing number of preprint outlets

SocArXiv, PsyArXiv


talk to your librarians!

Open research components

Open Access: Green OA


You are often allowed to post your "author's copy" on your website, etc.


the internet will surface it

Open research components

Versioning


Open research components

Versioning


Including basically all research components:

  • Code
  • Data
  • Metadata
  • Text: manuscripts

Open research components

Why use Versioning?


  • failure-proofs your work
  • allows you to experiment freely!

git and R help 
Open research components

Versioning: Git
Resources


Open research components

Do all work programmatically



from geeksaresexy.net/2012/01/05/geeks-vs-non-geeks-picture
Open research components

Do all work programmatically


Key to reproducibility:


The most important person who wants to reproduce your work is you!

Open research components

Do all work programmatically



you and yourself

- one week from now

- two months from now

- & so on

Open research components

Do all work programmatically



allows others to:

- contribute to your work

- check your work

- build on top of your work

research programming languages





are the canvas on which to do research

important research programming languages





R language


R homepage  

  • used widely in biology, psychology, medicine, etc.

  • rapidly growing user base, companies surrounding it

  • includes all tools for open research workflow

  • salaries for R skills are high (1, 2)

Open/Rep. Research w/ R

What's the most important thing about R with respect to open/reproducible research?


R itself -> you're programming!

Open research ecosystem

rOpenSci

ropensci.org  

rOpenSci does:



           

the research workflow



Data acquisition    

data manipulation/analysis/viz    

writing    

publish

Wrap Up


  • Open research is essential

  • Open research tools are useful on their own

  • rOpenSci: one of the tool makers


  • Challenges going forward

    • Largely cultural - will slowly change

Wrap Up


  • rOpenSci is a community project

  • Let us know what you need

  • Help us make better tools

Questions?

let's switch gears ...


Web Data



Web Data: Types


  • Scraping html

  • Download files

    • HTTP

    • FTP

    • etc.

  • APIs


going down: increasing organization and longevity

going up: increasing complexity for user (mostly)

Web Data: Use Cases


  • Scraping: specialized, one time problems

  • Download: when you want all the data

  • APIs: stable, medium sized data

Approaching a website wrt data

  • Look at menu, header, footer

  • Look for key words: "API", "Data", "Developers", etc.

  • Contact a technical person (or anyone)

  • Browser developer tools are your friends

  • Scraping as a last resort

Scraping: brief intro


n, def: extract useful bits out of a pile of (typically) html (me, today)


see Wikipedia's def.
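
To make the definition concrete, here is a minimal sketch that scrapes a tiny, made-up HTML page held in a string, so it runs without a network connection. The page contents and the player/goals values are invented for illustration; the functions used (xml2::read_html, rvest::html_element, rvest::html_text, rvest::html_table) are real, and html_element assumes rvest 1.0 or later.

```r
# Toy page: the HTML and its values are made up for illustration
html <- '<html><body>
  <h1>Season results</h1>
  <table>
    <tr><th>player</th><th>goals</th></tr>
    <tr><td>A. Smith</td><td>3</td></tr>
    <tr><td>B. Jones</td><td>1</td></tr>
  </table>
</body></html>'

# parse the HTML string into a document we can query
page <- xml2::read_html(html)

# pull a single element by CSS selector and extract its text
heading <- rvest::html_text(rvest::html_element(page, "h1"))

# html_table() converts every <table> in the document; keep the first one
results <- rvest::html_table(page)[[1]]
```

The same calls work on a real page: pass a URL to xml2::read_html instead of a string, then pick the table you need by position.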

Scraping Exercise


navigate to: http://www.goducks.com/cumestats.aspx?path=wsoc


  1. Install and load rvest library
  2. Fetch the contents of the URL above
  3. Pull out the results table
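# fetch the page and parse the HTML
url <- "http://www.goducks.com/cumestats.aspx?path=wsoc"
x <- xml2::read_html(url)
# html_table() returns every table on the page; the results table is the 6th
rvest::html_table(x)[[6]]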

Scraping: beware

  • Check the "terms" or related page for any legalese

  • Probably all good when simply exploring

  • Take precautions when doing lots of scraping

  • Look around: StackOverflow, just googling, talk to friends, etc.
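
One possible precaution when scraping many pages is to pause between requests and to keep going when a single page fails. The sketch below is an illustration, not an established recipe: fetch_politely is a hypothetical helper name, and the 2 second default delay is an arbitrary assumption you should adapt to the site's terms.

```r
# Hypothetical helper: fetch several URLs one at a time, pausing between them
fetch_politely <- function(urls, fetch = xml2::read_html, delay = 2) {
  pages <- vector("list", length(urls))
  names(pages) <- urls
  for (i in seq_along(urls)) {
    if (i > 1) Sys.sleep(delay)  # pause before every request except the first
    # a failed request yields NULL instead of stopping the whole loop
    page <- tryCatch(fetch(urls[[i]]), error = function(e) NULL)
    pages[i] <- list(page)  # assigning list(NULL) keeps the slot in place
  }
  pages
}
```

The fetch argument defaults to xml2::read_html, but any function that takes a URL can be swapped in, which also makes the helper easy to try out without touching a real site.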

Downloading files: intro


  • (includes web scraping: you have to download a page before you can scrape it)

  • Can be easy - sometimes hard

  • Protocols: HTTP, FTP, etc. (mostly HTTP)

  • Authentication sometimes

  • File sizes can be very large

  • Does UO have a firewall?

Downloading files in R


  • lots of ways to download files

  • many pkgs/fxns for certain file types also download the file for you

  • recommendation: curl::curl_download - replaces download.file in base R

  • think about where you're putting files!
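
As one way to think about where files go, the sketch below saves each download into a known directory and skips the download when the file is already there. download_once and the default "data" directory are hypothetical names chosen for illustration; curl::curl_download is the real function recommended above. It assumes the last part of the URL is a usable file name, which will not hold for URLs with query strings.

```r
# Hypothetical helper: download a file into `dir` unless it already exists
download_once <- function(url, dir = "data") {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(dir, basename(url))  # file name taken from the URL
  if (!file.exists(dest)) {
    curl::curl_download(url, dest)
  }
  dest  # always return the local path, downloaded now or earlier
}
```

Keeping downloads in a project directory rather than a temporary file means the raw data survives between R sessions, and rerunning a script does not hit the server again.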

Downloading Exercise - 1


the file: https://raw.githubusercontent.com/sckott/soylocs/gh-pages/data.csv


  1. Download the file using curl::curl_download
  2. Read the file in however you like
url <- "https://raw.githubusercontent.com/sckott/soylocs/gh-pages/data.csv"
# download to a temporary file; curl_download() returns the destination path
f <- curl::curl_download(url, tempfile(fileext = ".csv"))
readr::read_csv(f)

Downloading files: misc. topics


Downloading Exercise - 2


Compressed files


US Dept. of Education Academic Libraries Survey 2002

  1. Download the .zip file Academic Libraries Survey 2002
  2. Uncompress the file
  3. Read in the .txt file

Downloading Exercise - 2


Compressed files


Answer

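url <- "https://nces.ed.gov/pubs2006/data/ALS2002_ASCII.zip"
# download the archive to a temporary .zip file
f <- curl::curl_download(url, tempfile(fileext = ".zip"))
# extract into a dedicated directory so the contents are easy to find
mydir <- file.path(tempdir(), "mydir")
unzip(f, exdir = mydir)
# read only the extracted .txt file
txt <- list.files(mydir, pattern = "\\.txt$", full.names = TRUE, ignore.case = TRUE)
readr::read_tsv(txt)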

APIs!


  • API = Application Programming Interface

  • Rules about how machines/software talk to one another

  • Can be an API for: Android phone, Linux operating system, an R package, a web service

APIs for data on the web


  • Web APIs

    • A server in cloud

    • Database w/ data

    • Rules about how to talk to that database

Standardized APIs?

most APIs follow no formal standard, though many loosely follow REST


but some standardized frameworks exist


Lack of Standardization in APIs Makes Consuming them Tough

APIs Exercise - 1


Data source: https://usedgov.github.io/api/crdc.html



Make a request for data - any request


goal: make sure everything's working as expected

APIs Exercise - 2


Data source: https://usedgov.github.io/api/crdc.html

  1. Make a request for data
  2. Parse data to a list with resources as a data.frame
  3. Pull out all the data for the TOT_ENR_F and TOT_ENR_M fields
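
Steps 2 and 3 can be sketched offline with a made-up response body. The JSON below is an assumption about the general shape (a top-level "resources" array of records); the real API's response may differ, and SCH_NAME and all values are invented for illustration. With a real response from curl::curl_fetch_memory, the JSON text would come from rawToChar(res$content).

```r
# Made-up response body; field values and SCH_NAME are hypothetical
json <- '{
  "resources": [
    {"SCH_NAME": "School A", "TOT_ENR_F": 210, "TOT_ENR_M": 198},
    {"SCH_NAME": "School B", "TOT_ENR_F": 154, "TOT_ENR_M": 160}
  ]
}'

# fromJSON() returns a list; an array of records is simplified to a data.frame
parsed <- jsonlite::fromJSON(json)

# keep only the two enrollment columns
enrollment <- parsed$resources[, c("TOT_ENR_F", "TOT_ENR_M")]
```

jsonlite simplifies arrays of similar objects into a data.frame by default, which is why parsed$resources can be subset by column name.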

APIs Exercise - 3


Data source: https://usedgov.github.io/api/crdc.html

  1. Make a request for data w/ headers
  2. Get the request headers
  3. Get the response headers
  4. Pick a response header and google it/wikipedia it, find out what it is
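
A rough sketch of working with headers using the curl package follows. Request headers are set on a handle before the request is made; response headers come back as raw bytes that curl::parse_headers and curl::parse_headers_list can split up. To stay runnable offline, the response headers here are a made-up stand-in, and the commented-out request uses httpbin.org rather than the exercise's data source. One way to see the request headers a server received is an echo endpoint such as https://httpbin.org/headers.

```r
# request headers are attached to a handle before the request is made
h <- curl::new_handle()
curl::handle_setheaders(h, "Accept" = "application/json")

# a real request would look like this (needs network access):
# res <- curl::curl_fetch_memory("https://httpbin.org/get", handle = h)

# made-up stand-in for res$headers, which curl returns as raw bytes
raw_headers <- charToRaw(paste0(
  "HTTP/1.1 200 OK\r\n",
  "content-type: application/json\r\n",
  "content-length: 42\r\n\r\n"
))

# one string per header line, including the status line
header_lines <- curl::parse_headers(raw_headers)

# named list of header values, convenient for looking up a single header
header_list <- curl::parse_headers_list(raw_headers)
content_type <- header_list[["content-type"]]
```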

APIs Exercise - 4


Data source: https://httpbin.org/

  1. Look up curl options (see curl::curl_options()), check a few out to see what they do
  2. Make a few requests for data w/ a different curl option specified for each

Look for API Wrappers



Don't reinvent the wheel



Check CRAN, GitHub, etc.

Resources: software


scotttalks.info/uofo17



Made w/: reveal.js v3.2.0


Some Styling: Bootstrap v3.3.5


Icons by: FontAwesome v4.4.0