2020-06-23 09:40:17

Part 2: Contents

  1. Guerrilla Analytics principles
  2. Files and folders / Project structure
  3. Data-formats & Data shapes / Tidy data
  4. Encoding variables & Exploratory Data Analysis

The slides, data and source code are on Github: https://github.com/uashogeschoolutrecht/work_flows

Do you recognize this?

The Guerrilla Analytics Principles

  1. Space is cheap, confusion is expensive
  2. Simple, visual project structures and conventions
  3. Automate with program code
  4. Link stored data to data in the analytics environment to data in work products
  5. Version control changes to data and analytics code
  6. Consolidate team knowledge
  7. Use code that runs from start to finish

From the Guerrilla Analytics book by Enda Ridge.

P1: Space is cheap, confusion is expensive

  • Keep your files, you never know when you need them
  • Store data in online-cloud storage (HU Research Drive)
  • Protect yourself: do not click on attachments in spiffy-looking emails; cybercriminals are getting smarter every day
  • Create md5sums for important (source) data-files
  • Agree on a system, share it, use it, stick to it

P2: Simple, visual project structures and conventions

  • Create a separate folder for each analytics project (in RStudio -> RStudio Project)
  • Do not deeply nest folders (max 2-3 levels)
  • Keep information about the data, close to the data
  • Store each dataset in its own subfolder
  • Do not change file names or move them (in a code project)
  • Do not manually edit data source files
  • In code, use relative paths (see the sketch after this list)
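
A minimal sketch of the last point (not taken from the repository): the {here} package builds paths relative to the project root, so the same code runs on any machine that clones the project. The file name is the one shown in the data-raw tree further on.

library(here)   # builds paths relative to the RStudio Project root
library(readr)

# relative path: works wherever the project is cloned
cases <- read_csv(here("data-raw", "D010", "2020-06-19_covid_ecdc_cases_geography.csv"))

# absolute path: breaks as soon as the project moves or someone else runs it
# cases <- read_csv("D:/r_projects/work_flows/data-raw/D010/2020-06-19_covid_ecdc_cases_geography.csv")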

Better not!

## D:/r_projects/work_flows/wrong_structure
## +-- Applications
## +-- Data files 001
## |   +-- experiment_1.txt
## |   \-- Final Results
## |       \-- experiment_1_results_final.txt
## +-- Data files 001 (Copy)
## +-- Manuscripts
## |   +-- teunis_et al , 2020_v01 - Copy.docx
## |   +-- teunis_et al , 2020_v02.docx
## |   \-- teunis_et al , 2020_v03_final_final.docx
## +-- Project Documentation
## |   \-- applications
## |       +-- Application final prject x.docx
## |       \-- application_final_project y.docx
## \-- Volunteer responses
##     +-- Patient 2.xlsx
##     +-- patient_1.xlsx
##     \-- Patient_3.xlsx

How to organize data files

## D:/r_projects/work_flows/data-raw
## +-- D010
## |   +-- 2020-06-19_covid_ecdc_cases_geography.csv
## |   +-- 2020-06-19_md5sums_covid_ecdc_cases_geography.md5
## |   +-- README.txt
## |   +-- supporting
## |   |   +-- covid_ecdc_cases_geography.R
## |   |   \-- md5sums.R
## |   \-- v01
## |       +-- 2020-05-31_covid_ecdc_cases_geography.csv
## |       \-- 2020-05-31_md5sums_covid_ecdc_cases_geography.md5
## \-- D020
##     +-- messy_excel.xlsx
##     \-- README.txt

Data integrity

MD5 sums are

  • A (practically) unique code to identify a file (this file -> 10890fd8e80fd72a9140c72d00996948)
  • Used to verify the integrity or the version of a file
  • Easily generated on Windows, MacOS and Linux, or from within e.g. R/Python/Bash
  • Also used for safety: checking an md5sum verifies that downloaded code or data has not changed (e.g. Anaconda installers)
  • One of many types of hash functions; MD5 and SHA-256 are widely used for data and software

In webinar 3, I will show you how to generate these yourself (from Windows and from within R). Look at the file “./data-raw/D010/supporting/md5sums.R” if you can’t wait.
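
Until then, a minimal sketch of the idea using base R’s tools::md5sum() (the actual md5sums.R in the repository may look different):

library(tools)

data_file <- "data-raw/D010/2020-06-19_covid_ecdc_cases_geography.csv"

# generate the checksum and store it next to the data file
checksum <- md5sum(data_file)
writeLines(paste(checksum, basename(data_file)),
           "data-raw/D010/2020-06-19_md5sums_covid_ecdc_cases_geography.md5")

# later, or on another machine: verify that the file has not changed
stopifnot(md5sum(data_file) == checksum)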

Sharing data

  • Remove sensitive data from each file by pseudonymizing, anonymizing, or removing it
  • Removing or encoding sensitive data can be done from within R (see the sketch after this list)
  • Agree on a file naming convention within a team, before the work starts
  • Agree on where data is stored and who has access
  • Suppress the impulse to store multiple copies of the data in different locations
  • If you send data files, send the md5sums along
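
A minimal sketch of the “from within R” point above; the file and column names are hypothetical:

library(dplyr)
library(readr)

volunteers <- read_csv("data-raw/D030/volunteer_responses.csv")   # hypothetical file

shareable <- volunteers %>%
  select(-name, -email) %>%                                   # drop directly identifying columns
  mutate(participant_id = sprintf("P%03d", row_number())) %>%  # replace them with a pseudonymous ID
  relocate(participant_id)

write_csv(shareable, "data/volunteer_responses_anonymous.csv")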

Receiving data

  • Never change a filename (as inconvenient as that may be)
  • Put a new dataset (one or multiple files) in its own numbered folder
  • Write a README.txt describing the data and store it in the same folder where the dataset lives
  • A new version of the ‘same’ dataset goes into the original folder; the old version moves to a versioned subfolder (e.g. v01)

P3: Automation

  • Do everything programmatically (in code) for reasons of reproducibility
  • Store clean curated datasets in the “data” folder, with md5sums and a README.txt (a minimal sketch follows this list)
  • Use literate programming (RMarkdown or Jupyter Notebook) for full analysis
  • Store scripts in a “./code” or “./inst” folder
  • Store functions in R in a “./R” folder
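
The sketch referred to above: a small curation script (it could live in “./code”) that reads a raw file from “./data-raw”, applies a few placeholder cleaning steps, and writes the curated result plus an md5sum to “./data”. The output file names are made up here.

library(here)
library(readxl)
library(readr)
library(dplyr)

raw <- read_excel(here("data-raw", "D020", "messy_excel.xlsx"))

clean <- raw %>%
  rename_with(tolower) %>%   # placeholder cleaning step: tidy up the column names
  distinct()                 # placeholder cleaning step: drop duplicate rows

write_csv(clean, here("data", "D020_clean.csv"))
writeLines(tools::md5sum(here("data", "D020_clean.csv")),
           here("data", "D020_clean.md5"))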

This requires a (COVID-19) example

  • Imagine we want daily reports on the number of COVID-19 cases and deaths
  • We want to be able to dynamically report data for different countries and dates to compare situations in the World
  • The data is available (for manual and automated download) from the European Centre for Disease Prevention and Control (ECDC)
  • The analysis can be coded completely from beginning to end to produce the information we need (a download sketch follows this list)
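
A sketch of what the automated download could look like; the URL below is the CSV endpoint the ECDC advertised in mid-2020 and may have changed since.

library(readr)

ecdc_url <- "https://opendata.ecdc.europa.eu/covid19/casedistribution/csv"  # may be outdated
covid <- read_csv(ecdc_url)

# stamp the download with today's date, following the file naming convention used earlier
write_csv(covid,
          paste0("data-raw/D010/", Sys.Date(), "_covid_ecdc_cases_geography.csv"))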

A more extensive reproducible research COVID-19 example for NL

The results of the analysis: deaths

The results of the analysis: cases

Let’s take a look at the source file

The source file is an RMarkdown file that downloads the data and generates an HTML report including two figures.

Parameterization

  • This Rmd is parameterized on date and country
  • The script automatically includes the parameters in the title of the report and the captions of the figures
  • The ‘rendered’ date is automatically set, for versioning
  • Parameterization can be used to automate reporting for many values of the parameters (a sketch follows this list)
  • Further automation is now easy (although the ECDC currently has technical problems making the latest data available for download - and they do not use md5sums!!)
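
A sketch of what a parameterized header and a render call can look like; the actual RMarkdown file in the repository may use different parameter names.

# YAML header of the (hypothetical) covid_report.Rmd:
# ---
# title: "COVID-19 cases and deaths for `r params$country`"
# date: "`r Sys.Date()`"
# output: html_document
# params:
#   country: "Netherlands"
#   date: "2020-06-19"
# ---

# render the same report for another country and date
rmarkdown::render("covid_report.Rmd",
                  params = list(country = "Italy", date = "2020-06-19"),
                  output_file = "covid_report_italy.html")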

P4: Link stored data, to data in the analytics environment, to data in work products

P5: Version control for data and code - Git/Github

  • When you do data analysis, you should use code (Webinar 1)
  • When you write code, you should use Git, preferably in combination with Github
  • Hence: When you do data analysis, you should use Git & Github
  • Git/Github is ‘track-changes for code’ (a small {usethis} sketch follows below)
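
The sketch mentioned above: from within R, the {usethis} package offers one convenient route to put an RStudio Project under version control (the plain git command line works just as well).

library(usethis)

use_git()      # initialise a local git repository and commit the current files
use_github()   # create a matching repository on GitHub and push to it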

Imagine working on a script together with a colleague in Groningen. You email her your code and your data. She makes adjustments and sends the code back to you. You test the code and change something, and the code breaks… You are lost on what she changed and what you changed…

“Learning Git can be challenging …, but it pays off in the long run. Eventually you will always break working code, multiple times” - Jenny Bryan

Git/Github.com: Track changes for code

P6: Consolidate team knowledge

P7: Prefer analytics code that runs from start to finish

  • Create work products in RMarkdown or Jupyter notebooks (I will show these in Webinar 3)
  • In R, create an R-package
  • Write functions that isolate code and can be reused
  • Use iteration to prevent repetition (see the sketch after this list)
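
The sketch referred to in the last two points: isolate a task in a function, then iterate over inputs with {purrr} instead of copy-pasting the same block per country. The column names are assumptions and may not match the real ECDC table.

library(dplyr)
library(purrr)

# `covid` is the ECDC table from the download example; `country` and `cases`
# are assumed column names
summarise_country <- function(data, country_name) {
  data %>%
    filter(country == country_name) %>%
    summarise(country = country_name,
              total_cases = sum(cases, na.rm = TRUE))
}

map_dfr(c("Netherlands", "Italy", "Germany"),
        ~ summarise_country(covid, .x))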

Some pointers to help you (and others) use code for data analysis

  • If you want to use programming, you need to be consistent
  • A couple of seemingly unimportant things become vital
  • Practice makes perfect

“Ten minutes of R a day, keeps Excel away”

File names and file formats

  • Never use !@#$%^&*()+=:;"'|{}[]\<>?/~ in a file name
  • Use snake_case or CamelCase
  • Apply this also to file headers (column names); see the sketch after this list
  • Do not use spaces in names; use underscores instead (" " = soft space / "_" = hard space)
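
The sketch referred to above: headers with spaces and symbols can be repaired in code, for instance with janitor::clean_names(), illustrated here on the {palmerpenguins} raw data that returns later in these slides.

library(janitor)
library(palmerpenguins)
library(dplyr)   # for the pipe

penguins_raw %>%
  clean_names() %>%
  names() %>%
  head()
# "Sample Number" becomes "sample_number", "Culmen Length (mm)" becomes "culmen_length_mm"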

A how-not-to example

What is wrong with this file name and its headers? Can you spot another problem with the data sheet?

Data-formats - Non-Proprietary

The file format specification is open and maintained by an open-source community or a core development team.

  • .netCDF (Geo, proteomics, array-oriented scientific data)
  • .xml / mzXML (Markup language, human and machine readable, metadata + data together)
  • .txt / .csv (flat text file, usually tab, comma or semicolon (;) separated)
  • .json (text format that is completely language independent)

Such files will remain readable, even if the format becomes obsolete.

When storing a curated dataset for sharing or archiving, it is always better to choose a non-proprietary format.
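
For example (a minimal sketch; `clean_data` stands for any curated data frame):

library(readr)
library(jsonlite)

write_csv(clean_data, "data/clean_data.csv")     # flat text, comma separated
write_json(clean_data, "data/clean_data.json")   # language-independent JSON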

Data shape

Look at these two tables, what do you notice?

## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583
## # A tibble: 6 x 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

Both tables are built-in datasets from the {tidyr} package, which belongs to the {tidyverse} suite of Data Science R packages.
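
They look like tidyr’s built-in table2 and table3; either one can be reshaped into a single tidy table (one row per country-year, one column per variable) in a line or two:

library(tidyr)

# spread the type/count pairs into separate cases and population columns
table2 %>%
  pivot_wider(names_from = type, values_from = count)

# split the "cases/population" rate string into two numeric columns
table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)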

Tidy data

A penguin wrap up

palmerpenguins

## # A tibble: 344 x 17
##    studyName `Sample Number` Species Region Island Stage `Individual ID`
##    <chr>               <dbl> <chr>   <chr>  <chr>  <chr> <chr>          
##  1 PAL0708                 1 Adelie~ Anvers Torge~ Adul~ N1A1           
##  2 PAL0708                 2 Adelie~ Anvers Torge~ Adul~ N1A2           
##  3 PAL0708                 3 Adelie~ Anvers Torge~ Adul~ N2A1           
##  4 PAL0708                 4 Adelie~ Anvers Torge~ Adul~ N2A2           
##  5 PAL0708                 5 Adelie~ Anvers Torge~ Adul~ N3A1           
##  6 PAL0708                 6 Adelie~ Anvers Torge~ Adul~ N3A2           
##  7 PAL0708                 7 Adelie~ Anvers Torge~ Adul~ N4A1           
##  8 PAL0708                 8 Adelie~ Anvers Torge~ Adul~ N4A2           
##  9 PAL0708                 9 Adelie~ Anvers Torge~ Adul~ N5A1           
## 10 PAL0708                10 Adelie~ Anvers Torge~ Adul~ N5A2           
## # ... with 334 more rows, and 10 more variables: `Clutch Completion` <chr>,
## #   `Date Egg` <date>, `Culmen Length (mm)` <dbl>, `Culmen Depth (mm)` <dbl>,
## #   `Flipper Length (mm)` <dbl>, `Body Mass (g)` <dbl>, Sex <chr>, `Delta 15 N
## #   (o/oo)` <dbl>, `Delta 13 C (o/oo)` <dbl>, Comments <chr>

Exploratory Data Analysis - missingness
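
One way to get such a missingness overview (a sketch, not necessarily the approach used for the original figure) is visdat::vis_miss(), applied here to the raw penguin data, which appears to match the data_penguins object used in these slides:

library(visdat)
library(palmerpenguins)

# plot the share of missing values per variable and per row
vis_miss(penguins_raw)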

Factor levels

data_penguins %>%
  ggplot(aes(x = Sex, y = `Flipper Length (mm)`)) +
  geom_point(aes(colour = Species), position = "jitter", show.legend = FALSE)

unique(data_penguins$Sex) ## these values are also called the factor levels
## [1] "MALE"   "FEMALE" NA       "."

Variable encodings

  • Use explicit encoding: male/female instead of 0/1
  • Encodings can always be altered programmatically
  • Be consistent (see next graph)
  • Write a code journal that explains encodings, including units and levels
  • Use factors if a variable has a set of discrete possible outcomes: sex, species, marital_status, etc.
  • Use an ordered factor if there is a hierarchy in the factor levels: year, month, number_of_legs (see the sketch after this list)
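
The sketch referred to above, in base R:

# explicit encoding with a factor: the set of allowed levels is fixed
sex <- factor(c("male", "female", "female", "male"),
              levels = c("female", "male"))
levels(sex)

# ordered factor: the levels have a hierarchy, so comparisons make sense
month <- factor(c("Jan", "Mar", "Feb"),
                levels = month.abb, ordered = TRUE)
month[1] < month[2]   # TRUE: Jan comes before Mar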

Just for kicks, one more graph

Thank you for your attention!