
Contents

This is part 1 of a series of three webinars:

  • Part 1: Introducing Reproducible (Open) Science (June 11th, 2020)
  • Part 2: Managing your project files and data with ‘Guerilla Analytics’ (June 23rd, 2020)
  • Part 3: Reproducible (Open) Science - Tools (July 6th, 2020)

The complete source code for the webinars, together with all dependent data and files, can be found at Github.com/uashogeschoolutrecht.

In part 3, I will show you how to use this Github resource for your own work.

Part 1: Introducing Reproducible (Open) Science

  1. When things go wrong
  2. Why Reproducible (Open) Science?
  3. The need for learning programming
  4. An example of Reproducible (Open) Science

\(Reproducible\ (Open)\ Science = Reproducible\ Research + Open\ Science\)

Is (hydroxy)chloroquine really an option for treating COVID-19?

But how are we really doing with (hydroxy)chloroquine as a treatment for COVID-19?

What was the reason for retracting this paper?

“Our independent peer reviewers informed us that Surgisphere would not transfer the full dataset, client contracts, and the full ISO audit report to their servers for analysis as such transfer would violate client agreements and confidentiality requirements”

  • Company Surgisphere (‘data owner’) did not share raw data
  • At the time of publication, the (raw) data and analysis (code) were not included in the manuscript
  • The authors initiated the retraction

https://www.sciencemag.org/news/2020/06/two-elite-medical-journals-retract-coronavirus-papers-over-data-integrity-questions

Why is this a problem?

  • Scientific conclusions get picked up by the media; retracting statements is difficult
  • The credibility of the journal, the researchers and the affiliated institutions is at stake (people got sacked over this!)
  • Clinical studies of (hydroxy)chloroquine were halted because of this paper
  • The credibility of the company Surgisphere is at stake (they should have prevented this…)
  • The credibility of Science as a whole is at stake (‘in the eye of the beholder’)

The Lancet does not adhere to Reproducible (Open) Science

Had the Lancet adopted the Reproducible (Open) Science framework:

  • There would have been no publication, so no retraction necessary
  • The manuscript of this paper would not even have made it through the first round of checks
  • All data, code, methods and conclusions would have been submitted
  • This would have enabled a complete and thorough peer-review process that includes replication of (part of) the data analysis of the study
  • Focus should be on the data and methods, not on the academic narratives and results …
  • In physics and bioinformatics this is already common practice

Data, methods and logic

Brown, Kaiser & Allison, PNAS, 2018

"…in science, three things matter:

  1. the data,
  2. the methods used to collect the data […], and
  3. the logic connecting the data and methods to conclusions,

everything else is a distraction."

Gollums lurking about

“In one case, a group accidentally used reverse-coded variables, making their conclusions the opposite of what the data supported.”

“In another case, authors received an incomplete dataset because entire categories of data were missed; when corrected, the qualitative conclusions did not change, but the quantitative conclusions changed by a factor of >7.”

Brown, Kaiser & Allison, PNAS, 2018

Why do we need Reproducible (Open) Science?

  • To assess validity of science and methods we need access to data, methods and conclusions
  • To learn from choices other researchers made
  • To learn from omissions, mistakes or errors
  • To prevent publication bias (also negative results will be available in reproducible research)
  • To be able to re-use and/or synthesize data (from many and diverse sources)
  • To have access to it all!

Nature Collection on this topic

The GUI problem

How would you ‘describe’ the steps of an analysis or creation of a graph when you use GUI* based software?

“You can only do this using code, so it is (basically) impossible in a GUI”**

*Graphical User Interface (GUI)…is a form of user interface that allows users to interact with electronic devices through graphical icons and audio indicators such as primary notation, instead of text-based user interfaces, typed command labels or text navigation…

**The file “./Rmd/steps_to_graph_from_excel_file.html” shows you how to do this using the programming language R. In webinar part 3, we will revisit this example.
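As a flavour of what such a scripted workflow looks like, here is a minimal sketch (the file name and variable names are hypothetical; the real worked example lives in the file above):

library(readxl)   # read data from an Excel file
library(ggplot2)  # build the graph with code

# Every step from raw Excel file to figure is recorded as code, so it
# can be rerun and reviewed by anyone.
data <- read_excel("data/measurements.xlsx")
ggplot(data, aes(x = dose, y = response)) +
  geom_point() +
  geom_smooth(method = "lm")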

Programming is essential for Reproducible (Open) Science

  • Only programming an analysis (or creation of a graph) records every step
  • The script(s) function as a (data) analysis journal
  • Code is the logic that connects the data and methods to conclusions
  • Learning to use a programming language takes time but pays off in the long run (for all of science)

(Literate) programming is a way to connect narratives to data, methods and results

To replicate a scientific study we need at least:

  • Scientific context, research questions and state of the art [P]
  • (Experimental) model or characteristics of population or matter studied [P]
  • Data that was generated and corresponding meta data [D, C]
  • Exact (experimental) design of the study [P, D, C]
  • Exploratory data analysis of the data [P, C]
  • Exact methods that were used to conduct any formal inference [P, C]
  • Model diagnostics [C]
  • Interpretations of the (statistical) model results/model fitting process [P, C]
  • Conclusions and academic scoping of the results [P, C]
  • Access to all of the above [OAcc, OSrc]

\(P = Publication\), \(D = Data\), \(C = Code\), \(OAcc = Open\ Access\), \(OSrc = Open\ Source\)

A short example of Reproducible (Open) Science

Assume we have the following question: “Which of 4 types of chairs takes the least effort to arise from when seated in?” We have the following setup:

  • 4 different types of chairs
  • 9 different subjects (probably somewhat aged)
  • Each subject is required to provide a score (from 6 to 20, 6 being very lightly strenuous, 20 being extremely strenuous) when arising from each of the 4 chairs. There is some ‘wash-out’ time in between the trials. The chair order is randomised.

To analyze this experiment statistically, the model would need to include: the rating score as the measured (or dependent) variable, the type of chair as the experimental factor, and the subject as the blocking factor.

Mixed effects models

A typical analysis method for this type of randomized block design is a so-called ‘multi-level’ (also called ‘mixed-effects’ or ‘hierarchical’) model, an analysis method much used in clinical and biological scientific practice.

You could also use a one-way ANOVA, but the sketch below illustrates why this is not a good idea.
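A minimal sketch of that naive approach (not part of the original analysis): a one-way ANOVA treats all 36 scores as independent and ignores that every subject rated all four chairs, so between-subject variability ends up in the residual error term.

library(nlme) # provides the ergoStool data

# Naive one-way ANOVA: Subject is ignored, so subject-to-subject
# differences inflate the residual error and weaken the comparison
# of chair types.
summary(aov(effort ~ Type, data = ergoStool))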

What do we minimally need to replicate the science of this experiment?

The data of the experiment

Wretenberg, Arborelius & Lindberg, 1993

library(nlme)   # provides the ergoStool data
library(tibble) # provides as_tibble()
library(dplyr)  # provides the %>% pipe
ergoStool %>% as_tibble()
## # A tibble: 36 x 3
##    effort Type  Subject
##     <dbl> <fct> <ord>  
##  1     12 T1    1      
##  2     15 T2    1      
##  3     12 T3    1      
##  4     10 T4    1      
##  5     10 T1    2      
##  6     14 T2    2      
##  7     13 T3    2      
##  8     12 T4    2      
##  9      7 T1    3      
## 10     14 T2    3      
## # ... with 26 more rows

An exploratory graph

Mind the variability per subject: what do you see? (A sketch of how such a graph could be drawn follows the questions below.)

  • Can you say something about within-subject variability (note ‘Mister Blue’)?
  • Can you say something about between-subject variability (note ‘Mister Green’ vs ‘Mister Black’)?
  • Which chair type takes, on average, the biggest effort to arise from?
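One way such an exploratory graph could be drawn (a minimal sketch; the original figure is not reproduced here):

library(nlme)    # provides the ergoStool data
library(ggplot2)

# One line per subject across the four chair types makes both the
# within-subject and the between-subject variability visible.
ggplot(ergoStool, aes(x = Type, y = effort, group = Subject, colour = Subject)) +
  geom_point() +
  geom_line() +
  labs(x = "Chair type", y = "Effort (Borg scale score)")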

The statistical questions

  1. Which chair type takes, on average, the biggest effort to arise from? (ANOVA / MEM, fixed effects)
  2. Do individual (within-subject) differences play a role in assigning an average score to a chair type? (MEM, random effects)
  3. Does variability between subjects play a role in determining the ‘best’ chair type? (ANOVA / MEM, confidence intervals)

The statistical model

Statistical models (in R) can be specified by a model formula. The left side of the formula is the dependent variable; the right side contains the ‘predictors’. Here we include a fixed and a random term in the model (as is common for mixed-effects models).

library(nlme)
ergo_model <- lme(
  data = ergoStool, # the data to be used for the model
  fixed = effort ~ Type, # the dependent and fixed effects variables
  random = ~1 | Subject # random intercepts for Subject variable
)

The lme() function is part of the {nlme} package for mixed-effects modelling in R.

Example reproduced from: Pinheiro and Bates, 2000, Mixed-Effects Models in S and S-PLUS, Springer, New York.

The statistical results

              Value      Std.Error  DF  t-value    p-value
(Intercept)   8.5555556  0.5760123  24  14.853079  0.0000000
TypeT2        3.8888889  0.5186838  24   7.497610  0.0000001
TypeT3        2.2222222  0.5186838  24   4.284348  0.0002563
TypeT4        0.6666667  0.5186838  24   1.285305  0.2109512
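This table of fixed effects can be extracted directly from the fitted model; summary() on an lme object contains a tTable component:

# Fixed-effects table (Value, Std.Error, DF, t-value, p-value) of the
# fitted mixed-effects model.
summary(ergo_model)$tTable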

Model diagnostics

  • Checking the diagnostics of a fitted model is the most important step in a statistical analysis
  • In most scientific papers the details are lacking
  • Did the authors omit to perform this step? Or did they not report it?
  • If you do not want to include it in your paper, put it in an appendix!

A residual plot shows the ‘residual’ error (‘unexplained variance’) that remains after fitting the model. Under the normality assumption, the standardized residuals should:

  1. Be normally distributed around 0
  2. Display no obvious ‘patterns’
  3. Display overall equal ‘spread’ above and below 0 (the ‘equal variance’ assumption)

Residual plot

plot(ergo_model) ## type = 'pearson' (standardized residuals)
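The normality assumption itself can be inspected with a Q-Q plot of the standardized residuals (an addition to the original slides):

# Q-Q plot of the standardized ('pearson') residuals; the points should
# roughly follow a straight line under the normality assumption.
qqnorm(ergo_model, ~ resid(., type = "p"))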

The conclusions in a plot
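The original figure is not reproduced here; as a hedged stand-in, the conclusions can be summarized via the 95% confidence intervals of the fixed effects:

# 95% confidence intervals for the fixed effects; chair types whose
# intervals exclude 0 differ from the reference type (T1).
intervals(ergo_model, which = "fixed")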

And the most important part…

Practice what you preach

If you want to reproduce, add on to, falsify or apply your own ideas to this example, you can find the code (and data) on Github.com.

In webinar 3, I will show you how to actually run, use and organize code like this!

Thank you for your attention!