2020-07-06 10:10:58

Resources

Tools to enable Open Science (1/3)

1 - Data Science programming languages

Tools to enable Open Science (2/3)

2 - Data Science infrastructure & software

Tools to enable Open Science (3/3)

3 - Data Science learning tools

Exploring the tools to do (Open) Data Science at HU

  • A little introduction to data formats and shapes
  • Introducing the tools: GitHub + RStudio + HU ResearchDrive
  • Introducing RStudio and the {tidyverse}
  • Getting materials from GitHub LIVE DEMO
  • Getting access to HU ResearchDrive LIVE DEMO
  • (Getting access to HU ResearchDrive from within RStudio LIVE DEMO)

Data-formats - Non-Proprietary

The file format's source code is open and maintained by an open-source community or a core development team.

  • .netCDF (Geo, proteomics, array-oriented scientific data)
  • .xml / .mzXML (Markup language, human and machine readable, metadata + data together)
  • .txt / .csv (flat text file, usually tab-, comma- or semicolon (;)-separated)
  • .json (text format that is completely language independent)

Files will remain readable, even if the format becomes obsolete

When storing a curated dataset for sharing or archiving it is always better to choose a non-proprietary format
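
As a minimal sketch of what storing data in a non-proprietary format looks like in practice, the snippet below writes a small data frame to a .csv file and reads it back with base R only. The data frame `cases` and the temporary file are illustrative assumptions; the values are taken from the {tidyr} example tables used later in this deck.

```r
# Small example data frame (illustrative; values from the tidyr example tables)
cases <- data.frame(
  country = c("Afghanistan", "Brazil", "China"),
  year    = c(1999L, 1999L, 1999L),
  cases   = c(745L, 37737L, 212258L)
)

# Write to a plain-text, comma-separated file in a temporary location
csv_file <- tempfile(fileext = ".csv")
write.csv(cases, csv_file, row.names = FALSE)

# Any text editor or programming language can read this file; here base R
cases_read <- read.csv(csv_file, stringsAsFactors = FALSE)

# Check that the round trip preserved the data
all.equal(cases, cases_read)
```

Because the file is plain text, it can still be opened long after the software that wrote it is gone.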

Data shape

Look at these two tables, what do you notice?

## # A tibble: 3 x 4
##   country      year type          count
##   <chr>       <int> <chr>         <int>
## 1 Afghanistan  1999 cases           745
## 2 Afghanistan  1999 population 19987071
## 3 Afghanistan  2000 cases          2666
## # A tibble: 3 x 3
##   country      year rate           
##   <chr>       <int> <chr>          
## 1 Afghanistan  1999 745/19987071   
## 2 Afghanistan  2000 2666/20595360  
## 3 Brazil       1999 37737/172006362

Both tables are built-in datasets from the {tidyr} package, which belongs to the {tidyverse} suite of Data Science R packages

The {tidyverse}

Tidy data

Are these dataframes tidy?

## [[1]]
## # A tibble: 3 x 4
##   country     century year  rate           
##   <chr>       <chr>   <chr> <chr>          
## 1 Afghanistan 19      99    745/19987071   
## 2 Afghanistan 20      00    2666/20595360  
## 3 Brazil      19      99    37737/172006362
## 
## [[2]]
## # A tibble: 3 x 3
##   country     `1999` `2000`
##   <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

What steps would we need to tidy them?

pivot_longer()

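library(tidyr)  # provides table4a, pivot_longer() and the %>% pipe

# Gather the two year columns into one "year" and one "cases" column
table4a_tidy <- table4a %>% 
  pivot_longer(
    cols = `1999`:`2000`, 
    names_to = "year",
    values_to = "cases"
  )
table4a_tidy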
## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766
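
The first table under "Data shape" has the opposite problem: two variables (cases and population) share one column. A possible sketch of the inverse operation, pivot_wider(), is shown below. It assumes that table is `table2` from {tidyr}, which matches the column names shown; the name `table2_tidy` is only illustrative.

```r
library(tidyr)

# table2 stores the variable name in `type` and its value in `count`
table2_tidy <- pivot_wider(
  table2,
  names_from  = type,   # entries of `type` become the new column names
  values_from = count   # entries of `count` fill those new columns
)
table2_tidy
```

Each country-year combination now occupies a single row, with one column per variable.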

separate()

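library(tidyr)  # provides table5, separate() and the %>% pipe

# Split "rate" at its non-alphanumeric separator ("/") into two columns
table5_tidy <- table5 %>%
  separate(
    col = rate,
    into = c("cases", "population"),
    remove = TRUE
  )
table5_tidy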
## # A tibble: 6 x 5
##   country     century year  cases  population
##   <chr>       <chr>   <chr> <chr>  <chr>     
## 1 Afghanistan 19      99    745    19987071  
## 2 Afghanistan 20      00    2666   20595360  
## 3 Brazil      19      99    37737  172006362 
## 4 Brazil      20      00    80488  174504898 
## 5 China       19      99    212258 1272915272
## 6 China       20      00    213766 1280428583
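
In the output above, cases and population are still character columns, and the year is still split over two columns. One way to finish the job, sketched here under the assumption that a single integer year column is wanted, is to add `convert = TRUE` and combine the pieces with unite(). The names `table5_full` and `full_year` are illustrative.

```r
library(tidyr)

table5_full <- table5 %>%
  # split on the slash and let tidyr convert both parts to integers
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE) %>%
  # paste century and year together without a separator, e.g. "19" + "99"
  unite("full_year", century, year, sep = "")

# unite() returns a character column, so convert it explicitly
table5_full$full_year <- as.integer(table5_full$full_year)
table5_full
```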

The git/GitHub.com workflow: separating data, compute infrastructure and code

Just for kicks, a graph

Getting the code for all webinars into your RStudio environment; introducing the jargon

  • Get an RStudio installation or account (via me)
  • Clone the repo to your RStudio Env.
  • Install any code dependencies in your Env.
  • Run the code, and adapt if you want
  • Work on the code
  • Create a commit
  • Create a pull request

LIVE DEMO

GitHub user account

RStudio

  • Integrated Development Environment (IDE) for R (and Python, Stan, C++, D3, SQL)
  • Favorite IDE for using R
  • Many integrated productivity tools (auto-completion, syntax highlighting, code-formatting, git-integrations, building tools)
  • Send me an email if you want to use R/RStudio yourself!

Getting GitHub repo content into RStudio

  • Copy the URL of the GitHub repo to the clipboard
  • Open a new RStudio Project
  • Choose the ‘Version Control’ option
  • Paste the URL from the clipboard into the URL field
  • Let the clone finish
  • Start using the code!
  • My code will work from a cloned GitHub repo in an RStudio Project because of the {here} package!

Stop using setwd()!!!
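
A minimal sketch of what {here} does instead of setwd(): it builds file paths relative to the project root, so the same code runs on any machine that cloned the repo. The folder and file name ("data", "example.csv") are hypothetical.

```r
library(here)

# here() starts from the project root (for example the folder holding the
# .Rproj file or the .git directory), not from the current working directory
data_path <- here("data", "example.csv")  # hypothetical file
data_path

# The same line then works unchanged for every user of the repo:
# my_data <- read.csv(data_path)
```

A hard-coded setwd("C:/Users/me/...") path, by contrast, only exists on one computer.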

LIVE DEMO

HU ResearchDrive

Which tool for what?

Access HU-ResearchDrive from RStudio

To make this work you will need a WebDAV token from HU ResearchDrive

Profile -> Security -> Create App
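
Once a token exists, a download over WebDAV could look roughly like the sketch below, which uses the {httr} package. This is an assumption-laden illustration, not the method shown in the live demo: the helper names `webdav_url()` and `download_from_webdav()`, the environment variable names and the remote path are all hypothetical, and the actual WebDAV base URL must be taken from your own ResearchDrive account.

```r
library(httr)

# Join a WebDAV base URL and a remote file path with exactly one slash
webdav_url <- function(base_url, remote_path) {
  paste0(sub("/+$", "", base_url), "/", sub("^/+", "", remote_path))
}

# Download one file over WebDAV using basic authentication (hypothetical helper)
download_from_webdav <- function(base_url, remote_path, user, token, dest) {
  response <- GET(
    webdav_url(base_url, remote_path),
    authenticate(user, token),          # the token is used as the password
    write_disk(dest, overwrite = TRUE)  # write the response body to disk
  )
  stop_for_status(response)             # raise an error on HTTP 4xx/5xx
  invisible(dest)
}

# Hypothetical usage; credentials are read from environment variables
# so that they never end up in a script or on GitHub:
# download_from_webdav(
#   base_url    = Sys.getenv("RD_WEBDAV_URL"),
#   remote_path = "project/data/example.csv",
#   user        = Sys.getenv("RD_USER"),
#   token       = Sys.getenv("RD_TOKEN"),
#   dest        = "example.csv"
# )
```

Keeping the token out of the code matters here because the code itself is shared through a public GitHub repo.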

Live Demo

Thank you for your attention!