2020-06-23 09:40:17

Part 2: Contents

  1. Guerrilla Analytics principles
  2. Files and folders / Project structure
  3. Data-formats & Data shapes / Tidy data
  4. Encoding variables & Exploratory Data Analysis

The slides, data and source code are on Github: https://github.com/uashogeschoolutrecht/work_flows

Do you recognize this?

The Guerrilla Analytics Principles

  1. Space is cheap, confusion is expensive
  2. Simple, visual project structures and conventions
  3. Automate with program code
  4. Link stored data to data in the analytics environment to data in work products
  5. Version control changes to data and analytics code
  6. Consolidate team knowledge
  7. Use code that runs from start to finish

From the Guerrilla Analytics book by Enda Ridge.

P1: Space is cheap, confusion is expensive

  • Keep your files, you never know when you need them
  • Store data in online-cloud storage (HU Research Drive)
  • Protect yourself: do not click on attachments in spiffy-looking emails; cybercriminals are getting smarter every day
  • Create md5sums for important (source) data-files
  • Agree on a system, share it, use it, stick to it

P2: Simple, visual project structures and conventions

  • Create a separate folder for each analytics project (in RStudio -> RStudio Project)
  • Do not deeply nest folders (max 2-3 levels)
  • Keep information about the data, close to the data
  • Store each dataset in its own subfolder
  • Do not change file names or move them (in a code project)
  • Do not manually edit data source files
  • In code, use relative paths (see the sketch after this list)
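
A minimal sketch of the last point (not taken from the repository): the {here} package builds paths relative to the project root, so the same code runs on any machine that clones the project. The file name is the one shown in the data-raw tree further on.

library(here)   # builds paths relative to the RStudio Project root
library(readr)

# relative path: works wherever the project is cloned
cases <- read_csv(here("data-raw", "D010", "2020-06-19_covid_ecdc_cases_geography.csv"))

# absolute path: breaks as soon as the project moves or someone else runs it
# cases <- read_csv("D:/r_projects/work_flows/data-raw/D010/2020-06-19_covid_ecdc_cases_geography.csv")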

Better not!

## D:/r_projects/work_flows/wrong_structure
## +-- Applications
## +-- Data files 001
## |   +-- experiment_1.txt
## |   \-- Final Results
## |       \-- experiment_1_results_final.txt
## +-- Data files 001 (Copy)
## +-- Manuscripts
## |   +-- teunis_et al , 2020_v01 - Copy.docx
## |   +-- teunis_et al , 2020_v02.docx
## |   \-- teunis_et al , 2020_v03_final_final.docx
## +-- Project Documentation
## |   \-- applications
## |       +-- Application final prject x.docx
## |       \-- application_final_project y.docx
## \-- Volunteer responses
##     +-- Patient 2.xlsx
##     +-- patient_1.xlsx
##     \-- Patient_3.xlsx

How to organize data files

## D:/r_projects/work_flows/data-raw
## +-- D010
## |   +-- 2020-06-19_covid_ecdc_cases_geography.csv
## |   +-- 2020-06-19_md5sums_covid_ecdc_cases_geography.md5
## |   +-- README.txt
## |   +-- supporting
## |   |   +-- covid_ecdc_cases_geography.R
## |   |   \-- md5sums.R
## |   \-- v01
## |       +-- 2020-05-31_covid_ecdc_cases_geography.csv
## |       \-- 2020-05-31_md5sums_covid_ecdc_cases_geography.md5
## \-- D020
##     +-- messy_excel.xlsx
##     \-- README.txt

Data integrity

MD5 sums are

  • A (practically) unique code to identify a file (this file -> 10890fd8e80fd72a9140c72d00996948)
  • Used to verify the integrity or the version of a file
  • Easily generated on Windows, MacOS and Linux, or from within e.g. R/Python/Bash
  • Also used for safety: checking an md5sum verifies that downloaded code or data has not changed (e.g. Anaconda installers)
  • One of many types of hash functions; MD5 and SHA-256 are widely used for data and software

In webinar 3, I will show you how to generate these yourself (from Windows and from within R). Look at the file “./data-raw/D010/supporting/md5sums.R” if you can’t wait.
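
Until then, a minimal sketch of the idea using base R’s tools::md5sum() (the actual md5sums.R in the repository may look different):

library(tools)

data_file <- "data-raw/D010/2020-06-19_covid_ecdc_cases_geography.csv"

# generate the checksum and store it next to the data file
checksum <- md5sum(data_file)
writeLines(paste(checksum, basename(data_file)),
           "data-raw/D010/2020-06-19_md5sums_covid_ecdc_cases_geography.md5")

# later, or on another machine: verify that the file has not changed
stopifnot(md5sum(data_file) == checksum)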

Sharing data

  • Remove sensitive data from each file by pseudonymizing, anonymizing, or removing it
  • Removing or encoding sensitive data can be done from within R (see the sketch after this list)
  • Agree on a file naming convention within a team, before the work starts
  • Agree on where data is stored and who has access
  • Suppress the impulse to store multiple copies of the data in different locations
  • If you send data files, send the md5sums along
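
A minimal sketch of the “from within R” point above; the file and column names are hypothetical:

library(dplyr)
library(readr)

volunteers <- read_csv("data-raw/D030/volunteer_responses.csv")   # hypothetical file

shareable <- volunteers %>%
  select(-name, -email) %>%                                   # drop directly identifying columns
  mutate(participant_id = sprintf("P%03d", row_number())) %>%  # replace them with a pseudonymous ID
  relocate(participant_id)

write_csv(shareable, "data/volunteer_responses_anonymous.csv")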

Receiving data

  • Never change a filename (as inconvenient as that may be)
  • Put a new dataset (one or multiple files) in its own numbered folder
  • Write a README.txt describing the data and store it in the same folder where the dataset lives
  • A new version of the ‘same’ dataset goes into the original folder; the old version moves to a versioned subfolder (e.g. v01)

P3: Automation

  • Do everything programmatically (in code) for reasons of reproducibility
  • Store clean curated datasets in the “data” folder, with md5sums and a README.txt (a minimal sketch follows this list)
  • Use literate programming (RMarkdown or Jupyter Notebook) for full analysis
  • Store scripts in a “./code” or “./inst” folder
  • Store functions in R in a “./R” folder
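
The sketch referred to above: a small curation script (it could live in “./code”) that reads a raw file from “./data-raw”, applies a few placeholder cleaning steps, and writes the curated result plus an md5sum to “./data”. The output file names are made up here.

library(here)
library(readxl)
library(readr)
library(dplyr)

raw <- read_excel(here("data-raw", "D020", "messy_excel.xlsx"))

clean <- raw %>%
  rename_with(tolower) %>%   # placeholder cleaning step: tidy up the column names
  distinct()                 # placeholder cleaning step: drop duplicate rows

write_csv(clean, here("data", "D020_clean.csv"))
writeLines(tools::md5sum(here("data", "D020_clean.csv")),
           here("data", "D020_clean.md5"))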

This requires a (COVID-19) example

  • Imagine we want daily reports on the number of COVID-19 cases and deaths
  • We want to be able to dynamically report data for different countries and dates to compare situations in the World
  • The data is available (for manual and automated download) from the European Centre for Disease Prevention and Control (ECDC)
  • The analysis can be coded completely from beginning to end to produce the information we need (a download sketch follows this list)
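
A sketch of what the automated download could look like; the URL below is the CSV endpoint the ECDC advertised in mid-2020 and may have changed since.

library(readr)

ecdc_url <- "https://opendata.ecdc.europa.eu/covid19/casedistribution/csv"  # may be outdated
covid <- read_csv(ecdc_url)

# stamp the download with today's date, following the file naming convention used earlier
write_csv(covid,
          paste0("data-raw/D010/", Sys.Date(), "_covid_ecdc_cases_geography.csv"))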

A more extensive reproducible research COVID-19 example for NL

The results of the analysis: deaths

The results of the analysis: cases

Let’s take a look at the source file

The source file is an RMarkdown file that downloads the data and generates an HTML report including two figures.

Parameterization

  • This Rmd is parameterized on date and country
  • The script automatically includes the parameters in the title of the report and the captions of the figures
  • The ‘rendered’ date is automatically set, for versioning
  • Parameterization can be used to automate reporting for many values of the parameters (a sketch follows this list)
  • Further automation is now easy (although the ECDC currently has technical problems making the latest data available for download - and they do not use md5sums!!)
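
A sketch of what a parameterized header and a render call can look like; the actual RMarkdown file in the repository may use different parameter names.

# YAML header of the (hypothetical) covid_report.Rmd:
# ---
# title: "COVID-19 cases and deaths for `r params$country`"
# date: "`r Sys.Date()`"
# output: html_document
# params:
#   country: "Netherlands"
#   date: "2020-06-19"
# ---

# render the same report for another country and date
rmarkdown::render("covid_report.Rmd",
                  params = list(country = "Italy", date = "2020-06-19"),
                  output_file = "covid_report_italy.html")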

P4: Link stored data, to data in the analytics environment, to data in work products

P5: Version control for data and code - Git/Github

  • When you do data analysis, you should use code (Webinar 1)
  • When you write code, you should use Git, preferably in combination with Github
  • Hence: When you do data analysis, you should use Git & Github
  • Git/Github is ‘track-changes for code’ (a small {usethis} sketch follows below)
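
The sketch mentioned above: from within R, the {usethis} package offers one convenient route to put an RStudio Project under version control (the plain git command line works just as well).

library(usethis)

use_git()      # initialise a local git repository and commit the current files
use_github()   # create a matching repository on GitHub and push to it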

Imagine working on a script together with a colleague in Groningen. You email her your code and your data. She makes adjustments and sends the code back to you. You test the code and change something, and the code breaks… You are lost on what she changed and what you changed…

“Learning Git can be challenging …, but it pays off in the long run. Eventually you will always break working code, multiple times” - Jenny Bryan

Git/Github.com: Track changes for code

P6: Consolidate team knowledge

P7: Prefer analytics code that runs from start to finish

  • Create work products in RMarkdown or Jupyter notebooks (I will show these in Webinar 3)
  • In R, create an R-package
  • Write functions that isolate code and can be reused
  • Use iteration to prevent repetition (see the sketch after this list)
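
The sketch referred to in the last two points: isolate a task in a function, then iterate over inputs with {purrr} instead of copy-pasting the same block per country. The column names are assumptions and may not match the real ECDC table.

library(dplyr)
library(purrr)

# `covid` is the ECDC table from the download example; `country` and `cases`
# are assumed column names
summarise_country <- function(data, country_name) {
  data %>%
    filter(country == country_name) %>%
    summarise(country = country_name,
              total_cases = sum(cases, na.rm = TRUE))
}

map_dfr(c("Netherlands", "Italy", "Germany"),
        ~ summarise_country(covid, .x))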

Some pointers to help you (and others) use code for data analysis

  • If you want to use programming, you need to be consistent
  • A couple of seemingly unimportant things become vital
  • Practice makes perfect

“Ten minutes of R a day, keeps Excel away”

File names and file formats

  • Never use !@#$%^&*()+=:;"'|{}[]\<>?/~ in a file name
  • Use snake_case or CamelCase
  • Apply this also to file headers (column names); see the sketch after this list
  • Do not use spaces in names; use underscores instead (" " = soft space / "_" = hard space)
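
The sketch referred to above: headers with spaces and symbols can be repaired in code, for instance with janitor::clean_names(), illustrated here on the {palmerpenguins} raw data that returns later in these slides.

library(janitor)
library(palmerpenguins)
library(dplyr)   # for the pipe

penguins_raw %>%
  clean_names() %>%
  names() %>%
  head()
# "Sample Number" becomes "sample_number", "Culmen Length (mm)" becomes "culmen_length_mm"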

A how-not-to example

What is wrong with this file name and its headers? Can you spot another problem with the data sheet?

Data-formats - Non-Proprietary

The file format specification is open and maintained by an open-source community or a core development team.

  • .netCDF (Geo, proteomics, array-oriented scientific data)
  • .xml / mzXML (Markup language, human and machine readable, metadata + data together)
  • .txt / .csv (flat text file, usually tab, comma or semicolon (;) separated)
  • .json (text format that is completely language independent)

Such files will remain readable, even if the format becomes obsolete.

When storing a curated dataset for sharing or archiving, it is always better to choose a non-proprietary format.
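
For example (a minimal sketch; `clean_data` stands for any curated data frame):

library(readr)
library(jsonlite)

write_csv(clean_data, "data/clean_data.csv")     # flat text, comma separated
write_json(clean_data, "data/clean_data.json")   # language-independent JSON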

Data shape

Look at these two tables, what do you notice?

## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583
## # A tibble: 6 x 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

Both tables are built-in datasets from the {tidyr} package, which belongs to the {tidyverse} suite of Data Science R packages.
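
They look like tidyr’s built-in table2 and table3; either one can be reshaped into a single tidy table (one row per country-year, one column per variable) in a line or two:

library(tidyr)

# spread the type/count pairs into separate cases and population columns
table2 %>%
  pivot_wider(names_from = type, values_from = count)

# split the "cases/population" rate string into two numeric columns
table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)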

Tidy data

A penguin wrap up

palmerpenguins

## # A tibble: 344 x 17
##    studyName `Sample Number` Species Region Island Stage `Individual ID`
##    <chr>               <dbl> <chr>   <chr>  <chr>  <chr> <chr>          
##  1 PAL0708                 1 Adelie~ Anvers Torge~ Adul~ N1A1           
##  2 PAL0708                 2 Adelie~ Anvers Torge~ Adul~ N1A2           
##  3 PAL0708                 3 Adelie~ Anvers Torge~ Adul~ N2A1           
##  4 PAL0708                 4 Adelie~ Anvers Torge~ Adul~ N2A2           
##  5 PAL0708                 5 Adelie~ Anvers Torge~ Adul~ N3A1           
##  6 PAL0708                 6 Adelie~ Anvers Torge~ Adul~ N3A2           
##  7 PAL0708                 7 Adelie~ Anvers Torge~ Adul~ N4A1           
##  8 PAL0708                 8 Adelie~ Anvers Torge~ Adul~ N4A2           
##  9 PAL0708                 9 Adelie~ Anvers Torge~ Adul~ N5A1           
## 10 PAL0708                10 Adelie~ Anvers Torge~ Adul~ N5A2           
## # ... with 334 more rows, and 10 more variables: `Clutch Completion` <chr>,
## #   `Date Egg` <date>, `Culmen Length (mm)` <dbl>, `Culmen Depth (mm)` <dbl>,
## #   `Flipper Length (mm)` <dbl>, `Body Mass (g)` <dbl>, Sex <chr>, `Delta 15 N
## #   (o/oo)` <dbl>, `Delta 13 C (o/oo)` <dbl>, Comments <chr>

Exploratory Data Analysis - missingness
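
One way to get such a missingness overview (a sketch, not necessarily the approach used for the original figure) is visdat::vis_miss(), applied here to the raw penguin data, which appears to match the data_penguins object used in these slides:

library(visdat)
library(palmerpenguins)

# plot the share of missing values per variable and per row
vis_miss(penguins_raw)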

Factor levels

data_penguins %>%
  ggplot(aes(x = Sex, y = `Flipper Length (mm)`)) +
  geom_point(aes(colour = Species), position = "jitter", show.legend = FALSE)

unique(data_penguins$Sex) ## these values are also called the factor levels
## [1] "MALE"   "FEMALE" NA       "."

Variable encodings

  • Use explicit encoding: male/female instead of 0/1
  • Encodings can always be altered programmatically
  • Be consistent (see next graph)
  • Write a code journal that explains encodings, including units and levels
  • Use factors if a variable has a set of discrete possible outcomes: sex, species, marital_status, etc.
  • Use an ordered factor if there is a hierarchy in the factor levels: year, month, number_of_legs (see the sketch after this list)
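
The sketch referred to above, in base R:

# explicit encoding with a factor: the set of allowed levels is fixed
sex <- factor(c("male", "female", "female", "male"),
              levels = c("female", "male"))
levels(sex)

# ordered factor: the levels have a hierarchy, so comparisons make sense
month <- factor(c("Jan", "Mar", "Feb"),
                levels = month.abb, ordered = TRUE)
month[1] < month[2]   # TRUE: Jan comes before Mar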

Just for kicks, one more graph

Thank you for your attention!