Jan (JP) de Ruiter, Bielefeld University (email@example.com)
Klaus Reinhold, Bielefeld University (firstname.lastname@example.org)
Daniël Lakens, Eindhoven University of Technology, NL (D.Lakens@tue.nl)
Marco Mertens, Bielefeld University (email@example.com)
Presentations and Supplementary Material from the Workshop on Bayesian and Frequentist Statistics at the ZiF, Bielefeld, January 28th and 29th, 2016
Inference using orthodox hypothesis testing and Bayes factors is compared and contrasted in five case studies based on real research. The first study illustrates that the methods will often agree, both in motivating researchers to conclude that H1 is supported better than H0 and, the other way round, that H0 is better supported than H1. The next four, however, show that the methods will also often disagree. Specifically, it is shown that a high-powered non-significant result is consistent with no evidence worth mentioning for H0 over H1, which a Bayes factor can show; and conversely that a low-powered non-significant result is consistent with substantial evidence for H0 over H1, again indicated by Bayesian analyses. The fourth study illustrates that a high-powered significant result may not amount to any evidence for H1 over H0, matching the Bayesian conclusion. Finally, the fifth study illustrates that different theories can be evidentially supported to different degrees by the same data, a fact that p-values cannot reflect but Bayes factors can. It is argued that, where the two approaches disagree, the appropriate conclusions match the Bayesian inferences and not the orthodox ones.
"Information theory based inference - if we don't do it right we make it even worse."
Statistical inference based on Null-Hypothesis Significance Testing (NHST) has dominated research about animal behaviour and ecology for several decades. Recently, however, drawing inference based on an information criterion has gained some popularity in this research field. Researchers using the method nevertheless frequently mix it with NHST, namely by selecting the 'best' model and then testing its significance. Besides the clear theoretical reasons why doing so is grossly inappropriate, simulations show that it can drastically inflate type I error rates. Hence, mixing the two approaches has the potential to amplify the current replicability crisis in the life sciences.
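The inflation described above can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the simulation from the talk: the response and all candidate predictors are pure noise, each candidate model is a simple linear regression with one predictor, the 'best' model is chosen by AIC, and its slope is then tested with a Fisher z approximation. The function names, the sample size of 50, and the choice of five candidate predictors are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def p_value_from_r(r, n):
    """Two-sided p-value for H0: rho = 0 (Fisher z approximation)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def aic_simple_regression(r, n):
    """Gaussian AIC of y ~ x, up to a constant shared by all candidates.

    The residual sum of squares is proportional to (1 - r^2); the model
    has 3 parameters (intercept, slope, residual variance).
    """
    return n * math.log(1.0 - r * r) + 2 * 3


def simulate(n_sims=2000, n=50, n_predictors=5, alpha=0.05, seed=1):
    """Return (error rate after AIC selection, error rate of one fixed test)."""
    rng = random.Random(seed)
    hits_selected = 0
    hits_single = 0
    for _ in range(n_sims):
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
        rs = []
        for _ in range(n_predictors):
            x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # unrelated to y
            rs.append(pearson_r(x, y))
        # Step 1: pick the 'best' model by AIC.
        best_r = min(rs, key=lambda r: aic_simple_regression(r, n))
        # Step 2: test the selected model as if it had been chosen in advance.
        hits_selected += p_value_from_r(best_r, n) < alpha
        # Reference: a single test on a predictor fixed before seeing the data.
        hits_single += p_value_from_r(rs[0], n) < alpha
    return hits_selected / n_sims, hits_single / n_sims


if __name__ == "__main__":
    selected, single = simulate()
    print(f"type I error after AIC selection: {selected:.3f}")
    print(f"type I error of one planned test: {single:.3f}")
```

Because all candidate models in this sketch have the same number of parameters, AIC selection reduces to picking the predictor with the largest absolute correlation, so the follow-up test behaves like an uncorrected best-of-five comparison and its error rate should come out well above the nominal 5%.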
The practical advantages of Bayesian inference are demonstrated through two concrete examples. In the first example, we wish to learn whether or not a criminal is intellectually disabled --- this is a problem of parameter estimation. In the second example, we wish to quantify support in favor of a null hypothesis, and track this support as the data accumulate --- this is a problem of hypothesis testing. The Bayesian framework unifies both problems within a coherent predictive framework, where parameters and models that predicted the data successfully will receive a boost in plausibility, whereas parameters and models that predicted poorly suffer a decline. Our examples demonstrate how Bayesian analyses can be more informative, more elegant, and more flexible than the orthodox methodology that remains dominant within the field of psychology.
For a long time there have been arguments about the best way to do statistics. Two arguments that have been conflated are the Bayesian-Frequentist debate and criticisms of hypothesis tests. I will try to disentangle them, and show (for example) how Bayesians can be as bad as frequentists at misusing p-values. I will argue that we have different statistical tools that can be used to ask different questions, and that the Bayesian-Frequentist arguments are less important in practice.
Should one, then, be a Bayesian or a Frequentist? I will take a pragmatic stance, and suggest that ultimately it is a subjective choice. Which probably indicates where my sympathies can be found.
Every researcher knows the "p=.08" dilemma. One of the first intuitions in such a situation is to increase the sample size in order to push the p-value into a decisive region. In classical NHST, however, this is not allowed, as it inflates the Type-I error rate (unless proper sequential designs are used; Lakens, 2014). Nevertheless, optionally increasing the sample size is one of the most common "questionable research practices" (John, Loewenstein, & Prelec, 2012). In my talk, I will present the "Sequential Bayes Factor" (SBF) design, which allows unlimited multiple testing for the presence or absence of an effect, even after each participant. Sampling is stopped as soon as a pre-defined evidential threshold for H1 support or for H0 support is exceeded. Compared to an optimal NHST design, this leads on average to 50-70% smaller sample sizes, while having the same error rates. Furthermore, in contrast to NHST, its success does not depend on a priori guesses of the true effect size. Finally, I give a quick overview of a priori Bayes factor power analyses (BFPA), which make it possible to estimate the expected sample size (given an assumed true effect size) and to set a reasonable maximal sample size for the sequential procedure.
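The contrast between optional stopping with p-values and a sequential Bayes factor rule can be sketched in a few lines. This is an illustration under simplifying assumptions, not the procedure from the talk: observations are normal with known standard deviation 1, the test is a two-sided z-test, and the Bayes factor compares H0: mu = 0 against H1: mu ~ Normal(0, tau^2). The function names, the threshold of 10, the prior scale tau = 1, and the sample size limits are all illustrative choices.

```python
import math
import random
from statistics import NormalDist


def bf10_known_variance(z, n, tau):
    """Bayes factor for H1: mu ~ N(0, tau^2) over H0: mu = 0.

    Assumes unit-variance normal data; z = sqrt(n) * sample mean.
    """
    shrink = n * tau * tau / (1.0 + n * tau * tau)
    return math.sqrt(1.0 / (1.0 + n * tau * tau)) * math.exp(0.5 * z * z * shrink)


def simulate_designs(n_sims=2000, mu=0.0, n_min=10, n_max=200,
                     alpha=0.05, bf_threshold=10.0, tau=1.0, seed=1):
    """Compare three ways of analysing the same stream of observations."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    peeking = fixed = sbf_h1 = sbf_h0 = 0
    sbf_n_sum = 0
    for _ in range(n_sims):
        total = 0.0
        peeked_significant = False
        sbf_decision, sbf_n = None, n_max
        for n in range(1, n_max + 1):
            total += rng.gauss(mu, 1.0)
            if n < n_min:
                continue
            z = total / math.sqrt(n)
            # (a) Optional stopping: declare success at the first p < alpha.
            if abs(z) > z_crit:
                peeked_significant = True
            # (b) Sequential Bayes factor: stop at the first threshold crossing.
            if sbf_decision is None:
                bf = bf10_known_variance(z, n, tau)
                if bf >= bf_threshold:
                    sbf_decision, sbf_n = "H1", n
                elif bf <= 1.0 / bf_threshold:
                    sbf_decision, sbf_n = "H0", n
        peeking += peeked_significant
        fixed += abs(z) > z_crit  # (c) single planned test at n_max
        sbf_h1 += sbf_decision == "H1"
        sbf_h0 += sbf_decision == "H0"
        sbf_n_sum += sbf_n
    return {
        "peeking_p": peeking / n_sims,   # rate of 'significant' with peeking
        "fixed_n_p": fixed / n_sims,     # rate of 'significant' at fixed n
        "sbf_h1": sbf_h1 / n_sims,       # rate of stopping with support for H1
        "sbf_h0": sbf_h0 / n_sims,       # rate of stopping with support for H0
        "sbf_mean_n": sbf_n_sum / n_sims,
    }


if __name__ == "__main__":
    # With mu = 0 every 'significant' or 'H1' outcome is a false positive.
    for name, value in simulate_designs(mu=0.0).items():
        print(f"{name}: {value:.3f}")
```

Running the sketch with a non-zero `mu` (for example `simulate_designs(mu=0.5)`) gives a rough feel for how quickly the Bayes factor rule stops when an effect is present. The numbers depend on the assumed prior scale and thresholds, so they should not be read as a reproduction of the sample-size savings quoted in the abstract.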
As scientists, we are interested in the relation of theory to data in order to test theories. Scientists per se are irrelevant. Bayes can deal with this, because the personal probabilities regarding theories can be cleanly put in their place. And we are left with the evidence that data provide for different theories.
In orthodoxy, people are encouraged to say "*I* predicted this in advance...", "*I* planned to stop running subjects when ...", "*I* would never have drawn a conclusion about an opposite relation no matter how far the data would have gone in the opposite direction..."
But there is no scientific reason to be interested in saying in a paper "*I* predicted this..." People are just black boxes. What we are interested in is the predictions of theories and the rational basis of those predictions. Unknown black boxes don't count as a rational basis. Arbitrary personal decisions don't count as reasons.
Could frequentist statistics be divorced from particular scientists, and conclusions be made conditional not on what one particular scientist happened to do or believe, but on the range of possibilities - in order to try to make itself more objective? (Mayo tried and did not succeed, because her severity is monotonically related to the p-value.) But in doing so, if they did it properly, would they just become Bayesian?
Does it really matter, as long as we are doing our analyses properly?
JP de Ruiter
Seriously, what do we have to do if we want to collect evidence for two variables being equal (the null hypothesis)? Suppose we do a study and want to argue that men and women are equally competent at some task: how would we do that within the classical paradigm? Obviously, we can't say "p > 0.05, hence there is no difference". A power analysis just tells us about the long-term frequency of making a type II error using our design, but nothing about the actual data at hand.
What do we do in the situation that we have a very large N, and a p-value between 0.01 and 0.05? We can say it's "significant" but at the same time, we know that under these circumstances, our data actually provide evidence for the null. Which of these two facts do we report, and how?
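The situation in this question can be made concrete with a small calculation. The sketch below rests on assumptions that are not part of the question itself: a two-sided z-test on unit-variance normal data, and a Bayes factor comparing H0: mu = 0 against H1: mu ~ Normal(0, tau^2) with tau = 1. The function names and the chosen z value of 2.2 are illustrative. It holds the test statistic fixed, so the p-value stays the same, and shows how the Bayes factor changes as N grows.

```python
import math
from statistics import NormalDist


def two_sided_p(z):
    """Two-sided p-value of a z statistic."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def bf10_known_variance(z, n, tau=1.0):
    """Bayes factor for H1: mu ~ N(0, tau^2) over H0: mu = 0.

    Assumes unit-variance normal data; z = sqrt(n) * sample mean.
    """
    shrink = n * tau * tau / (1.0 + n * tau * tau)
    return math.sqrt(1.0 / (1.0 + n * tau * tau)) * math.exp(0.5 * z * z * shrink)


if __name__ == "__main__":
    z = 2.2  # same test statistic, hence same p-value, at every sample size
    print(f"p = {two_sided_p(z):.4f}")
    for n in (20, 200, 2000, 100000):
        bf10 = bf10_known_variance(z, n)
        print(f"N = {n:>6}: BF10 = {bf10:.4f}, BF01 = {1.0 / bf10:.2f}")
```

Under these assumptions the same "significant" z value weakly favours H1 at small N and favours H0 at very large N, because a fixed z at large N corresponds to an observed effect very close to zero, which H0 predicts better than a diffuse H1. How strong the shift is depends on the assumed prior scale `tau`.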
It is well known that p-values and CIs are grossly misinterpreted by most researchers (e.g., Hoekstra, Morey, Rouder, & Wagenmakers, 2014; Morey, Hoekstra, Rouder, Lee, & Wagenmakers, 2015; Oakes, 1986), and decades of education in frequentist statistics could not change this. How could you ensure that future scientists have a correct understanding of these central frequentist concepts? (Or wouldn't it be better to switch to Bayesian stats from the start, which provide the answers that researchers usually really want to know?)
Many argue that one cause of the current replication crisis is the focus on p-values. What would be a frequentist counter-measure to increase replicability in the future? (Of course there are many new developments such as pre-registrations, open data, etc., which work towards more replicability. I am in particular interested in changes in the way frequentist statistics are done.)
What do you perceive to be the most prominent advantages of a Bayesian analysis?