[ Note: this is a first draft, a preprint of a blog post so to speak :-) ]
A recent 72-author preprint proposed to recalibrate when we award the qualitative label ‘significant’ in research in psychology (and other fields) such that more evidence is required before that label is used. In other words, the paper proposes that researchers have to be a bit more certain of their case before proclaiming that they have found a new effect.
The paper met with resistance. Any proposal for change usually does, but what’s interesting is that in this case, the resistance came in part from researchers involved in Open Science (the umbrella term for the movement to mature science through openness, collaboration and accountability). Since these researchers often fight for improved research practices ‘at all costs’, this resistance seems odd.
Thus ensued the Alpha Wars.
The Alpha Wars
The term Alpha Wars was coined by Simine Vazire, at least as far as I know, and the term seems appropriate. For some reason, people tend to get quite riled up about methodology and statistics. Daniel Lakens is coordinating a reply:
After 36 hours we are at 130+ people chiming in, 31 pages, some structure emerging, and I’m learning a lot and having fun! https://t.co/uCao0LPaOX
— Daniël Lakens (@lakens) July 27, 2017
And like Simine, many authors (e.g. Eric-Jan Wagenmakers, whose post includes an excellent example involving two trolls) and others (e.g. Ulrich Schimmack) blog happily about it.
All these blog posts share a focus on the statistics and methodology of the matter. I expect that, when it comes to the statistics and methodology, most people involved in the Alpha Wars agree regardless of their side, or at least would agree if they hadn’t picked sides already. These are all smart people, and with statistics and methodology, there exist few absolutes. There’s often no right or wrong, just (often flawed) tools that we try to use as well as possible to figure out how reality looks & works.
Interestingly, the discussion barely touches on the implications of the paper beyond the methodological and statistical ones. Those wider implications are the topic of this blog post.
For readers who are unfamiliar with statistics, I’ll try to briefly summarize what exactly the paper proposes. I’m reasoning from psychology here, so your mileage may vary if you read this from another field.
The goal of research is, more or less, to figure out how reality looks & works, and to that end, the vast majority of researchers apply something called Null Hypothesis Significance Testing (NHST). This boils down to the following.
Basics of NHST, p values, and significance
You start by collecting data: you take a sample of whatever you wish to study (e.g. humans), and then you measure and sometimes manipulate one or more variables. This yields a dataset, usually lots and lots of numbers. You can aggregate these numbers to yield averages (means) and statistics representing, for example, associations. However, every value you compute from this sample is influenced in part by coincidence and error, so if you find that two means differ, or that two variables seem associated, this may be due completely to chance, and in that case, concluding that that pattern also exists in the population that you’re actually interested in would be wrong.
NHST provides a framework that enables statements about the population anyway. When applying NHST, a researcher first makes an assumption about the population value (for example, 0): the null hypothesis. The researcher then uses software (nowadays) to construct the sampling distribution: the distribution of values that samples of a given size could yield if the null hypothesis accurately described the population. This sampling distribution makes it possible to compute the probability of obtaining any given sample value, under that assumption. The researcher then uses it to compute the probability of obtaining a value at least as extreme as the one found in their own sample. This probability is called the p value.
The lower the p value, the harder it becomes to maintain that the null hypothesis accurately describes the population. For example, if you obtain a p value of .0001, the sample value you computed it from is very unlikely (but possible) under the sampling distribution constructed using the null hypothesis: if the null hypothesis were true, such an extreme sample value would occur in only one in every 10,000 samples. In NHST, a researcher therefore sets a threshold value, rejecting the null hypothesis (as a reasonable description of the population, i.e. of reality) when obtaining a p value lower than that threshold. This threshold value is called alpha, and p values lower than alpha are called significant.
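For readers who like to see this in code, here is a minimal sketch in R of the logic described above. The data are simulated, and the sample size and ‘true’ effect are numbers I made up purely for illustration.

```r
# A minimal sketch of the NHST logic above; the data are simulated and the
# 'true' effect of 0.3 is made up purely for illustration.
set.seed(42)                                  # make the example reproducible
sample_data <- rnorm(50, mean = 0.3, sd = 1)  # one sample of 50 observations

# t.test() uses the sampling distribution (here, a t distribution) to compute
# the p value: the probability of a sample value at least this extreme,
# assuming the null hypothesis (population mean = 0) is true.
result <- t.test(sample_data, mu = 0)
result$p.value

# The NHST decision rule: reject the null hypothesis when p is below alpha.
alpha <- .05
result$p.value < alpha   # TRUE would be labelled 'significant' under the current convention
```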
The current dogma holds that the default value of alpha is .05, and therefore, p values under .05 are normally called significant, not only by researchers, but also by many media outlets, who have by now learned that ‘significance’ is what one should look for if one wants to know that an effect ‘really exists’.
The “Redefine Statistical Significance” paper
The disputed paper does two things. First, it proposes to no longer call p values significant when they are lower than .05, instead reserving that label for p values under .005. Second, it proposes to relabel p values under .05 but over .005 as ‘suggestive’. To reject a null hypothesis, researchers would therefore need stronger evidence, which in practice means larger samples (more data points). This has two implications. First, researchers who conduct ‘underpowered’ research would less often be able to claim to have found something new. Second, sample sizes would start increasing.
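To get a feeling for what this means in practice, here is a rough sketch in R. The effect size, the desired power, and the two-group design are assumptions I picked purely for illustration; they are not figures from the paper.

```r
# A rough sketch of the two implications; the effect size (delta = 0.4 SD),
# the desired power (80%), and the two-sample design are assumptions chosen
# purely for illustration, not figures from the paper.

# Required sample size per group to detect the same effect with 80% power:
power.t.test(delta = 0.4, sd = 1, sig.level = .05,  power = .80)$n  # roughly 100 per group
power.t.test(delta = 0.4, sd = 1, sig.level = .005, power = .80)$n  # roughly 170 per group

# The proposed relabelling of p values:
label_p <- function(p) {
  if (p < .005) "significant" else if (p < .05) "suggestive" else "not significant"
}
label_p(.03)  # "suggestive" under the proposal; "significant" under the current convention
```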
As I said, the statistics and methodology of this paper have already been widely discussed, and the response Daniel Lakens is coordinating will probably compile a lot of that discussion. However, most of what I’ve seen in these discussions misses one perspective that I think is crucial: science is done by humans, in a world inhabited by other humans.
Humans as a factor in science
Most researchers in psychology were once students of psychology. One of the least popular topics in psychology is statistics and methodology. Although the requirement to do a PhD somewhat functions as a selection mechanism, most researchers in psychology still don’t like methodology and statistics. Many consider it a necessary evil. Unlike many Open Science proponents, most researchers are not constantly looking for ways to change (more accurately, improve) the way they work. For example, I switched to R a while ago - and even when I persuade colleagues that R is basically better than SPSS in almost every respect, most people are reluctant to start playing around. I used the words playing around deliberately: most researchers in psychology don’t like statistics, so why would they consider it fun to learn new powerful ways to do statistics and data visualization?
An example. I wrote this paper about why everybody should stop using Coefficient (“Cronbach’s”) Alpha (another Alpha, don’t worry about it). A colleague of mine once called me and said “Hey, you wrote that Alpha paper some time back, right? Could you send it to me? I’m doing a study, but my Coefficient Alpha is very low, so I’d like to cite your paper.”
This is how many researchers use statistics and methodology: as tools, not necessarily to figure out how reality looks & works, but to publish papers, ideally ‘make a splash’. They can only partly be blamed for this: most researchers, especially in social psychology, are raised on a diet of underpowered studies, learning that the objective is to design elegant studies and obtain sensational results. Couple this with the publish or perish culture where incentive structures remain highly dysfunctional (e.g. publications in “high impact factor journals” are rewarded with grant and career opportunities), and suddenly figuring out reality takes second place.
Another way in which humans play a role in the scientific endeavor is through media exposure. Journalists love juicy psychological science stories, and universities love to be in the news. In fact, so do many researchers. However, the news is no place for nuance, and therefore, researchers are pushed by their ego, journalists, funders and universities (i.e. their employers) to try and sell their research. Yet, as most Alpha War combatants will agree, single studies rarely provide strong evidence for anything, and discussing such studies in the media is therefore often best avoided.
Such a reluctant attitude is incompatible with the interests of media, universities, funders and, often, researchers themselves. The fact that the media have by now learned that ‘significance’ is some kind of quality label doesn’t help. The concept of significance is just complicated enough to signal to the audience that the journalist has applied some filter to the content they’re presenting. They clearly know their business: if it’s significant, that surely means something!
One related problem is that while, from a methodological and statistical point of view, science is cumulative (and any single study has very, very limited value), this is not how single studies are treated in the rest of the world. Ever. Press releases aren’t limited to meta-analyses and other data syntheses. Many researchers happily speak to a newspaper when they did a study that yielded interesting results, and media happily cover the results of that study as if they carry much more meaning than can be justified from a statistical point of view.
So, science, at least in psychology, is done by people who conduct single studies, mostly have very little affinity with methodology and statistics, and operate in an environment where selling sensationalist results from single studies is rewarded.
The merits of shifting Significance
I think most critiques of the “Redefine Statistical Significance” paper miss these points. Many arguments in the Alpha Wars revolve around whether lowering alpha represents an improved way of doing science. The way such a redefinition would shake up the entire landscape, basically providing a ‘soft reset’ of the way much of science is done and communicated, seems to be ignored.
Journals, editors, reviewers, and researchers will now have to choose how p values should be labelled. Shifting alpha away from the arbitrary value of .05 emphasizes that arbitrariness. Given the Open Science movement, it is unlikely that .005 will simply become a new threshold that is clung to as stubbornly as the .05 threshold is now. The introduction of the ‘suggestive’ label for p values between .005 and .05 is, in this respect, a brilliant move, especially coupled with the explicit recommendation that suggestive results should not suddenly become less publishable.
If we embrace .005 as the threshold for significance, journals will immediately have to reconsider their publication practices. This may finally help journals take the step towards publishing all research, regardless of outcomes (something we already do at Health Psychology Bulletin, by the way). After all, they have to make some choice anyway; this recalibration can catalyze that change.
Researchers who have spent most of their career ‘striving for significance’ are more likely to start collecting larger samples, which would be a great development. Also, study results that make it to the media will be much less likely to be wrong.
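That last claim can be illustrated with a back-of-the-envelope calculation in the spirit of the paper’s argument. The prior probability that a tested effect is real and the assumed power below are, again, numbers I picked purely for illustration.

```r
# Back-of-the-envelope illustration of why 'significant' findings would be wrong
# less often under alpha = .005. The prior probability that a tested effect is
# real (10%) and the power (80%) are assumptions picked purely for illustration.
false_positive_share <- function(alpha, power = .80, prior = .10) {
  false_pos <- alpha * (1 - prior)  # true null hypotheses that still cross the threshold
  true_pos  <- power * prior        # real effects that cross the threshold
  false_pos / (false_pos + true_pos)
}
false_positive_share(alpha = .05)   # about 36% of 'significant' results would be false positives
false_positive_share(alpha = .005)  # drops to roughly 5%
```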
The brilliance of this proposal is that it, so to speak, subverts the system from within. Even though the paper’s authors probably all agree that in many situations, NHST is misapplied, and that it often shouldn’t be used in the first place, they managed to formulate a solution that works from within the present dysfunctional context. As they say, this is easy to implement, and doesn’t require new skills of the researchers who will have to use it.
Many responses to this proposal have argued that the solution is not simply shifting alpha, but instead statistics and methodology have to be applied properly. Researchers should use Bayesian approaches more, or set a custom alpha value for each study depending on what exactly they’re studying, or simply power their studies properly.
While all this is true, these suggestions ignore the wider context and the inconvenient truth that many researchers don’t want to learn more statistics; they just want to do research and obtain sensational results.
Appropriate humility
The New Alpha demands a much more humble attitude from the researcher. Researchers can keep doing the same studies, and can even keep powering their studies for alpha = .05 - but their descriptions of their results would be much more accurate. This would prevent horrible articles like this one (based on this PLOS ONE paper with 20 participants). The researchers would have indicated that their results are suggestive and deserving of more research; they would then have done that additional research, and in a follow-up paper based on a more powerful study, either refuted or confirmed the initial results. If this second paper were covered in the media, the likelihood of spreading misinformation, and eventually eroding trust in psychological science, would be much lower.
So, if nothing else, the New Alpha would foster appropriate humility. It would, hopefully, prevent reading too much into results from single studies. In that sense, it can be seen as a heuristic instrument that helps ameliorate the biases that scientists, being human, suffer from.