Get these slides at:
- https://github.com/uashogeschoolutrecht/work_flows (source code)
- https://uashogeschoolutrecht.github.io/ (slides only)
2020-07-06 10:10:58
1 - Data Science programming languages
2 - Data Science infrastructure & software
3 - Data Science learning tools
{tidyverse}
The file format's source code is open and maintained by an open source community or a core development team.
- .netCDF (geo, proteomics, array-oriented scientific data)
- .xml / .mzXML (markup language, human and machine readable, metadata + data together)
- .txt / .csv (flat text file, usually tab, comma or semicolon (;) separated)
- .json (text format that is completely language independent)
Will remain readable, even if the format becomes obsolete.
When storing a curated dataset for sharing or archiving it is always better to choose a non-proprietary format
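As a minimal sketch of that advice, a small curated table can be written to a plain .csv file and read back with base R. The data frame reuses a few values from the example tables in these slides; the object names and the temporary file are only illustrative.

```r
# Illustrative data frame; values taken from the example tables in these slides
curated <- data.frame(
  country = c("Afghanistan", "Afghanistan", "Brazil"),
  year    = c(1999L, 2000L, 1999L),
  cases   = c(745L, 2666L, 37737L),
  stringsAsFactors = FALSE
)

# A temporary file stands in for a real archive location (assumption)
csv_file <- tempfile(fileext = ".csv")

# Write a flat, comma separated text file without row names
write.csv(curated, csv_file, row.names = FALSE)

# Read it back; any tool that understands plain text can do the same
restored <- read.csv(csv_file, stringsAsFactors = FALSE)

# Check that nothing was lost in the round trip
all.equal(curated, restored)
```

Because the file is plain text, it can still be opened in any editor if R or the original software is no longer available.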
Look at these two tables, what do you notice?
## # A tibble: 3 x 4
##   country      year type          count
##   <chr>       <int> <chr>         <int>
## 1 Afghanistan  1999 cases           745
## 2 Afghanistan  1999 population 19987071
## 3 Afghanistan  2000 cases          2666

## # A tibble: 3 x 3
##   country      year rate
##   <chr>       <int> <chr>
## 1 Afghanistan  1999 745/19987071
## 2 Afghanistan  2000 2666/20595360
## 3 Brazil       1999 37737/172006362
Both tables are built-in datasets from the {tidyr} package, which belongs to the {tidyverse} suite of Data Science R packages
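The first table stores two variables (cases and population) in a single column. One possible way to tidy it, offered as a sketch and not as the approach used in the talk, is pivot_wider() from {tidyr}. This assumes the first table is tidyr's built-in table2; the name table2_tidy is illustrative.

```r
library(tidyr)  # provides table2, pivot_wider() and the %>% pipe

# Spread the 'type' column into one column per variable
table2_tidy <- table2 %>%
  pivot_wider(
    names_from  = type,   # new column names: cases, population
    values_from = count   # values that fill those columns
  )

table2_tidy
```

Each row now describes one country in one year, with cases and population as separate columns.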
{tidyverse}
From: Storybench
## [[1]]
## # A tibble: 3 x 4
##   country     century year  rate
##   <chr>       <chr>   <chr> <chr>
## 1 Afghanistan 19      99    745/19987071
## 2 Afghanistan 20      00    2666/20595360
## 3 Brazil      19      99    37737/172006362
##
## [[2]]
## # A tibble: 3 x 3
##   country     `1999` `2000`
##   <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
What steps would we need to tidy them?
pivot_longer()
# Gather the year columns into a single 'year' column,
# with the cell values collected in 'cases'
table4a %>%
  pivot_longer(
    cols = `1999`:`2000`,
    names_to = "year",
    values_to = "cases"
  ) -> table4a_tidy

table4a_tidy

## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766
separate()
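# Split 'rate' into two columns; by default separate() splits
# at any non-alphanumeric character, here the "/"
table5 %>%
  separate(
    col = rate,
    into = c("cases", "population"),
    remove = TRUE
  ) -> table5_tidy

table5_tidy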
## # A tibble: 6 x 5
##   country     century year  cases  population
##   <chr>       <chr>   <chr> <chr>  <chr>
## 1 Afghanistan 19      99    745    19987071
## 2 Afghanistan 20      00    2666   20595360
## 3 Brazil      19      99    37737  172006362
## 4 Brazil      20      00    80488  174504898
## 5 China       19      99    212258 1272915272
## 6 China       20      00    213766 1280428583
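In the output above, cases and population are still character columns, and the year is split over two columns. A possible follow-up, sketched here under the assumption that integer columns and a single four-digit year are wanted, uses convert = TRUE and unite(). The names table5_typed and full_year are illustrative.

```r
library(tidyr)  # provides table5, separate(), unite() and the %>% pipe

table5_typed <- table5 %>%
  # Split rate at the slash and let tidyr convert the new columns to integer
  separate(
    col = rate,
    into = c("cases", "population"),
    sep = "/",
    convert = TRUE
  ) %>%
  # Glue century and year together, e.g. "19" and "99" become "1999"
  unite(col = "full_year", century, year, sep = "")

# unite() returns a character column, so convert the year explicitly
table5_typed$full_year <- as.integer(table5_typed$full_year)

table5_typed
```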
LIVE DEMO
{here} package!
Stop using setwd()!!!
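A minimal sketch of what that looks like in practice: here() from the {here} package builds file paths relative to the project root, so scripts keep working when the project folder is moved or shared. The folder and file names below are hypothetical.

```r
library(here)

# Instead of calling setwd() with a path that only exists on one machine,
# build the path from the project root (folder and file names are hypothetical)
data_file <- here("data", "measurements.csv")

# The result is a full path that ends in data/measurements.csv
data_file

# Pass it to any reader once the file exists, for example:
# measurements <- read.csv(data_file)
```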
LIVE DEMO
To make this work you will need a WebDAV token from HU ResearchDrive
Profile -> Security -> Create App
Live Demo
Enjoy the summer & StaRt leaRning:
https://rstudio.com/resources/webinars/
Find all the slides on: https://uashogeschoolutrecht.github.io/