Part 3; Tools for Reproducible (Open) Science

2020-07-06 10:10:58

Resources

Get these slides at:

https://github.com/uashogeschoolutrecht/work_flows (source code)
https://uashogeschoolutrecht.github.io/ (slides only)

Tools to enable Open Science (1/3)

1 - Data Science programming languages

Tools to enable Open Science (2/3)

2 - Data Science infrastructure & software

Tools to enable Open Science (3/3)

3 - Data Science learning tools

Exploring the tools to do (Open) Data Science at HU

A little introduction to data formats and shapes
Introducing the tools; Github + RStudio + HU ResearchDrive
Introducing RStudio and the {tidyverse}
Getting materials from Github LIVE DEMO
Getting access to HU ResearchDrive LIVE DEMO
(Getting access to HU ResearchDrive from within RStudio LIVE DEMO)

Data-formats - Non-Proprietary

The file format's source code is open and maintained by an open-source community or a core development team.

.netCDF (Geo, proteomics, array-oriented scientific data)
.xml / .mzXML (Markup language, human and machine readable, metadata + data together)
.txt / .csv (flat text file, usually tab, comma or semicolon (;) separated)
.json (text format that is completely language independent)

Will remain readable, even if the format becomes obsolete

When storing a curated dataset for sharing or archiving it is always better to choose a non-proprietary format
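As a small illustration of the point above, the sketch below writes a data frame to a plain-text .csv file and reads it back using only base R. The data frame `measurements` and its values are made up for this example and are not part of the original slides.

```r
# Small example data frame (made-up values, for illustration only).
measurements <- data.frame(
  sample_id = c("a", "b", "c"),
  value     = c(1.5, 2.25, 3),
  stringsAsFactors = FALSE
)

# Write the data to a plain-text .csv file in a temporary location.
csv_path <- tempfile(fileext = ".csv")
write.csv(measurements, csv_path, row.names = FALSE)

# The file is ordinary text, so it can be inspected without special software.
readLines(csv_path)

# Read the file back into R and compare it with the original data frame.
restored <- read.csv(csv_path, stringsAsFactors = FALSE)
all.equal(measurements, restored)
```

Because the file is plain text, the same data can be opened in a text editor, a spreadsheet program or another programming language.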

Data shape

Look at these two tables, what do you notice?

## # A tibble: 3 x 4
##   country      year type          count
##   <chr>       <int> <chr>         <int>
## 1 Afghanistan  1999 cases           745
## 2 Afghanistan  1999 population 19987071
## 3 Afghanistan  2000 cases          2666

## # A tibble: 3 x 3
##   country      year rate           
##   <chr>       <int> <chr>          
## 1 Afghanistan  1999 745/19987071   
## 2 Afghanistan  2000 2666/20595360  
## 3 Brazil       1999 37737/172006362

Both tables are built-in datasets from the {tidyr} package, which belongs to the {tidyverse} suite of Data Science R packages

The `{tidyverse}`

Suite of R-packages for Data Science and functional programming
https://www.tidyverse.org/
Connect to many other tools in R

From: Storybench

Tidy data

Each variable goes in its own column
Each observation goes in its own row
Each cell contains only one value

From: “R for Data Science”, Grolemund and Wickham
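To see the second rule in action, consider the first table from the "Data shape" slide, which appears to be the built-in `table2` dataset from {tidyr} (an assumption based on the printed output). There, one observation (a country in a given year) is spread over two rows. A possible way to bring each observation onto a single row is `pivot_wider()`; this is a sketch, not part of the original slides.

```r
library(tidyr)

# table2 ships with {tidyr}: every country/year combination occupies two rows,
# one for "cases" and one for "population".
table2_tidy <- pivot_wider(
  table2,
  names_from = type,    # the values in "type" become the new column names
  values_from = count   # the values in "count" fill those new columns
)
table2_tidy
```

After this step each row holds one country-year observation, and `cases` and `population` each have their own column.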

Are these dataframes tidy?

## [[1]]
## # A tibble: 3 x 4
##   country     century year  rate           
##   <chr>       <chr>   <chr> <chr>          
## 1 Afghanistan 19      99    745/19987071   
## 2 Afghanistan 20      00    2666/20595360  
## 3 Brazil      19      99    37737/172006362
## 
## [[2]]
## # A tibble: 3 x 3
##   country     `1999` `2000`
##   <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

What steps would we need to tidy them?

`pivot_longer()`

library(tidyr)

# Move the year column names into a "year" column and their values into "cases"
table4a_tidy <- table4a %>% 
  pivot_longer(
    cols = `1999`:`2000`, 
    names_to = "year",
    values_to = "cases"
  )
table4a_tidy

## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766
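Note that `year` is a character column in the output above, because it was built from column names. If a numeric year is preferred, one option (a sketch, assuming tidyr 1.1.0 or later) is the `names_transform` argument of `pivot_longer()`. The object name `table4a_tidy_int` is chosen here for illustration only.

```r
library(tidyr)

# Same reshaping as above, but convert the new "year" column to integer
# while pivoting, instead of leaving it as text.
table4a_tidy_int <- pivot_longer(
  table4a,
  cols = `1999`:`2000`,
  names_to = "year",
  values_to = "cases",
  names_transform = list(year = as.integer)
)
table4a_tidy_int
```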

`separate()`

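library(tidyr)

# Split "rate" at the slash into separate "cases" and "population" columns
table5_tidy <- table5 %>%
  separate(
    col = rate,
    into = c("cases", "population"),
    sep = "/",
    remove = TRUE
  )
table5_tidy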

## # A tibble: 6 x 5
##   country     century year  cases  population
##   <chr>       <chr>   <chr> <chr>  <chr>     
## 1 Afghanistan 19      99    745    19987071  
## 2 Afghanistan 20      00    2666   20595360  
## 3 Brazil      19      99    37737  172006362 
## 4 Brazil      20      00    80488  174504898 
## 5 China       19      99    212258 1272915272
## 6 China       20      00    213766 1280428583
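The result above still stores the year in two columns (`century` and `year`) and keeps the counts as text. One possible way to finish the job, offered as a sketch and not as the slides' own solution, is to glue the two year parts together with `unite()` and let `separate()` convert the new columns with `convert = TRUE`. The object name `table5_full` is introduced here for illustration.

```r
library(tidyr)

table5_full <- table5 %>%
  # Paste "century" and "year" together without a separator, e.g. "19" + "99".
  unite(col = "year", century, year, sep = "") %>%
  # Split "rate" at the slash and convert the pieces to integers.
  separate(
    col = rate,
    into = c("cases", "population"),
    sep = "/",
    convert = TRUE
  )

# unite() returns text, so convert the combined year to an integer as well.
table5_full$year <- as.integer(table5_full$year)
table5_full
```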

The git/Github.com workflow; separating data, compute infrastructure and code

Just for kicks, a graph

Spiegelhalter, 2020, “The Art of Statistics”

Getting the code for all webinars into your RStudio environment; introducing the jargon

Get an RStudio installation or account (via me)
Clone the repo to your RStudio Env.
Install any code dependencies in your Env.
Run the code, and adapt if you want
Work on the code
Create a commit
Create a pull request
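The clone and commit steps in the list above are normally done through the RStudio interface, but they can also be scripted. The sketch below wraps the git command-line tool from R, assuming git is installed and available on the PATH. The helper names `run_git()`, `clone_repo()` and `commit_all()` are hypothetical and exist only for this illustration.

```r
# Hypothetical helper: run one git command inside `dir`, stop if it fails.
run_git <- function(args, dir = ".") {
  status <- system2("git", c("-C", shQuote(dir), args))
  if (status != 0) {
    stop("git ", paste(args, collapse = " "), " failed", call. = FALSE)
  }
  invisible(TRUE)
}

# Clone a repository (URL or local path) into the folder `dest`.
clone_repo <- function(url, dest) {
  status <- system2("git", c("clone", shQuote(url), shQuote(dest)))
  if (status != 0) stop("git clone failed", call. = FALSE)
  invisible(dest)
}

# Stage every change in the repository and record them as one commit.
commit_all <- function(repo_dir, message) {
  run_git(c("add", "--all"), repo_dir)
  run_git(c("commit", "-m", shQuote(message)), repo_dir)
}

# Example usage (commented out because it needs network access):
# clone_repo("https://github.com/uashogeschoolutrecht/work_flows", "work_flows")
# ... edit files in the "work_flows" folder ...
# commit_all("work_flows", "Describe what you changed")
```

A pull request is not a git command; it is created on the Github website after the commits have been pushed.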

LIVE DEMO

Github user-account

https://github.com
You can create personal and private repos
Adding a README.md to each repo is a good idea
The HU Github Data Science repos: https://github.com/uashogeschoolutrecht

RStudio

Integrated Development Environment (IDE) for R (and Python, Stan, C++, D3, SQL)
Favorite IDE for using R
Many integrated productivity tools (auto-completion, syntax highlighting, code formatting, git integration, build tools)
Send me an email if you want to use R/RStudio yourself!

Getting Github-repo content in RStudio

Copy the URL of the Github repo to the clipboard
Open a new RStudio Project
Choose the ‘Version Control’ option
Paste the URL from the clipboard into the URL field
Let the clone finish
Start using the code!
My code will work from a cloned github repo in an RStudio Project because of the {here} package!

Stop using setwd()!!!
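A minimal sketch of what the {here} package does instead of `setwd()`: it builds file paths relative to the project root, so the same code runs on any machine that has the cloned project. The folder name `data` and the file name `my_data.csv` are hypothetical and used for illustration only.

```r
library(here)

# Build a path relative to the project root instead of calling setwd().
# "data" and "my_data.csv" are made-up names for this example.
data_path <- here("data", "my_data.csv")
data_path

# The path can then be passed to any reading function, for example:
# my_data <- read.csv(data_path)
```

Because no absolute directory is hard-coded, nobody has to edit paths after cloning the repo.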

LIVE DEMO

HU ResearchDrive

Service brought to HU by SURF
Application: https://bibliotheek.hu.nl/onderzoekers/datamanagement/
Access through the web interface and other software
SFTP software CyberDuck (you need admin rights)
Rclone (command-line interface)

Which tool for what?

Access HU-ResearchDrive from RStudio

To make this work you will need a WebDAV token from HU ResearchDrive

Profile -> Security -> Create App

LIVE DEMO

Thank you for your attention!

Enjoy the summer & StaRt leaRning:

https://r4ds.had.co.nz/

https://rstudio.com/resources/webinars/

Find all the slides on: https://uashogeschoolutrecht.github.io/