Why one-sided tests in psychology are practically indefensible

Foundation of Statistics & Methodology

March 25, 2016

This post is a response to a post by Daniel Lakens, “One-sided tests: Efficient and Underused”, whom I greatly respect and, apparently up until now, always vehemently agreed with. So this post is partly an opportunity for him and others to explain where I’m wrong, so dear reader, if you would take this time to point that out, I would be most grateful. Alternatively, telling me I’m right is also very much appreciated of course :-) In any case, if you haven’t done so yet, please read Daniel’s post first (also, see below this post for an update with more links and the origin of this discussion).

Daniel discussed three arguments against one-sided testing, which I’ll try to summarise by quoting him:

  1. “First of all, a researcher should have a hypothesis where the expected effect lies in a specific direction. Importantly, the question is not whether a result in the opposite direction is possible, but whether it _supports_ your hypothesis.”;

  2. “A second concern raised against one-sided tests is that surprising findings in the opposite direction might be meaningful, and should not be ignored. I agree, but this is not an argument against one-sided testing. The goal in null-hypothesis significance testing is, not surprisingly, to test a hypothesis. But we are not in the business of testing a hypothesis we fabricated after looking at the data.”;

  3. “A third concern is that a one-sided test leads to weaker evidence (e.g., Schulz & Grimes, 2005). This is trivially true: Any change to the design of a study that requires a smaller sample size reduces the strength of the evidence you collect, since the evidence is inherently tied to the total number of observations. Other techniques to design more efficient studies (e.g., sequential analyses, Lakens, 2014) also lead to lower samples sizes, and thus less evidence. The response to this concern is straightforward: If you desire a specific level of evidence, design a study that provides this desired level of evidence.”.

I’ll start at the bottom and work my way up (it’s just like real life). This third point I partly agree with: if you want a certain level of evidence, just design your study for it. However (and of course there always is one): given the current state of psychology research, I adhere to the saying “een studie is geen studie”, or “one study is no study”. Until we start designing and conducting studies with massively larger sample sizes, and/or stop computing and reporting many, many p-values without correcting for the inflated type 1 error rate, conclusions on the basis of any single study will remain (very) tentative. Of course, ideally, everybody does heavily powered research. In fact, I’d say that in an ideal world people power their studies at all. In an ideal world, our funders (in the Netherlands, NWO and ZonMW) would be willing to fund properly powered research, but nobody’s really happy to hear that instead of running 3x2x20 participants, they will need 3x2x140 participants if they really want to draw strong conclusions. So weak conclusions are drawn, and we need meta-analysis to get to the strong conclusions. Combined with the fact that most researchers still don’t publish their data and metadata along with their papers, and often do not report what you need to compute effect sizes, I think arguing against p-values makes sense. Therefore, I think we should be reluctant to give any advice that may be used as ammunition to support one-sided testing.
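To make the sample size point a bit more concrete, here is a minimal Python sketch of a normal-approximation power calculation for comparing two groups. Everything in it is an assumption made for illustration: the standardized effect size `d = 0.5`, the reading of the 20 and 140 as participants per group, and the function names `approx_power` and `n_per_group_needed`, which are hypothetical helpers and not part of any package. It also shows the efficiency gain that Daniel’s third point refers to: the same approximation asks for fewer participants when the test is one-sided.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def approx_power(d, n_per_group, alpha=0.05, one_sided=False):
    """Approximate power of a two-group z-test for a standardized difference d."""
    tail_alpha = alpha if one_sided else alpha / 2.0
    z_crit = STD_NORMAL.inv_cdf(1.0 - tail_alpha)
    # Expected value of the z statistic when the true effect is d
    ncp = d * math.sqrt(n_per_group / 2.0)
    power = 1.0 - STD_NORMAL.cdf(z_crit - ncp)
    if not one_sided:
        # A two-sided test can also reject in the other tail
        power += STD_NORMAL.cdf(-z_crit - ncp)
    return power


def n_per_group_needed(d, power=0.80, alpha=0.05, one_sided=False):
    """Approximate participants per group needed to reach the requested power."""
    tail_alpha = alpha if one_sided else alpha / 2.0
    z_alpha = STD_NORMAL.inv_cdf(1.0 - tail_alpha)
    z_beta = STD_NORMAL.inv_cdf(power)
    return math.ceil(2.0 * ((z_alpha + z_beta) / d) ** 2)


# Illustrative values only: d = 0.5 is an assumed effect size
for n in (20, 140):
    print(n, round(approx_power(0.5, n), 3))
print(n_per_group_needed(0.5), n_per_group_needed(0.5, one_sided=True))
```

This is a rough approximation (it uses the normal rather than the t distribution and ignores the factorial structure of a 3x2 design), so a real power analysis should rely on dedicated software.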

Daniel’s second and first point, I’ll deal with in one go, and this is the bit that I feel least sure about. My idea about one-sided testing was always that it was not possible for the following reasons:

  1. Hypothesis testing is a Null Hypothesis Significance Testing (NHST) endeavour.

  2. NHST lets you compute a p-value on the basis of a number of assumptions. Those assumptions are:

     1. We know the sampling distribution of the parameter of interest, for example Student's t;

     2. We know how 'wide' that distribution is because we know the standard error of that parameter;

     3. We have no idea where the distribution is 'centered', i.e. what the true effect size in the population is, so we assume a Null Hypothesis which states: the effect size in the population is zero and all deviations from zero are due to chance.

When we do a t-test or a z-test (unless you have really, really tiny samples, they’re practically equivalent), this distribution is (approximately) normal, centered at zero, with a standard deviation equal to the standard error. When you use NHST, you test under the assumption that the null hypothesis is true. And the null hypothesis doesn’t care.
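The remark that t-tests and z-tests only differ for tiny samples can be checked with a small simulation. The sketch below is an illustration under assumed settings, not a definitive analysis: it draws samples from a population where the null hypothesis is true, computes a one-sample t statistic, and then (deliberately) judges it against the normal critical value of 1.96. The function name `rejection_rate_using_z`, the sample sizes, the number of simulations and the seed are all arbitrary choices made here.

```python
import math
import random


def rejection_rate_using_z(n, sims=4000, z_crit=1.96, seed=1):
    """Share of null samples whose t statistic exceeds the normal cutoff."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(sims):
        # The null hypothesis is true: the population mean is exactly zero
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(sample) / n
        variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
        standard_error = math.sqrt(variance / n)
        if abs(mean / standard_error) > z_crit:
            rejections += 1
    return rejections / sims


# Compare a tiny sample with a moderately sized one
for n in (5, 100):
    print(n, rejection_rate_using_z(n))
```

With a tiny sample, the normal cutoff rejects a true null noticeably more often than the nominal 5%; with a larger sample the rate comes out close to 5%, which is the sense in which the two tests become equivalent.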

The null hypothesis doesn’t say that an association, if it is found, will be positive or negative. The null hypothesis says “There is no association, let alone a positive one. Or a negative one. There just isn’t any. Seriously. Just believe me.”

If you conduct a one-sided test, you change the null hypothesis. In its heart. It starts caring. The null hypothesis no longer goes like “There is no effect whatsoever no matter what you say. I’m not listening. La-die-da-die-da-die-da.” No, suddenly the null hypothesis has a preference: it goes like “Yeah, there definitely is an effect in the direction you don’t expect. Totally. Or not. No effect is also completely possible. Which it is, I don’t care. As long as it’s not the one you expect.”

And this, I think, is the problem. Employing NHST means you test under the assumption that the null hypothesis (no difference) is true. One-sided testing changes this: you no longer assume the null hypothesis is true; instead you assume that either the null hypothesis is true, or there is an effect of any possible size in the opposite direction of your hypothesized effect.
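For readers who want to see what the two kinds of p-value look like in practice, here is a minimal sketch that computes both from the same zero-centered normal distribution. It assumes a z-test with a known standard error; the function name `p_values` and the example numbers are made up for illustration.

```python
from statistics import NormalDist


def p_values(estimate, standard_error):
    """Two-sided and one-sided p-values for a z-test of 'effect = 0'."""
    z = estimate / standard_error
    std_normal = NormalDist()  # null distribution: centered at zero
    upper_tail = 1.0 - std_normal.cdf(z)  # P(Z >= z) under the null
    lower_tail = std_normal.cdf(z)        # P(Z <= z) under the null
    return {
        "z": z,
        "two_sided": 2.0 * min(upper_tail, lower_tail),
        "one_sided_positive": upper_tail,  # hypothesis: effect > 0
        "one_sided_negative": lower_tail,  # hypothesis: effect < 0
    }


# Hypothetical estimate of 0.4 with a standard error of 0.2
print(p_values(0.4, 0.2))
# Same size of effect, but in the direction opposite to a 'positive' hypothesis
print(p_values(-0.4, 0.2))
```

When the estimate falls in the hypothesized direction, the one-sided p-value is half the two-sided one; when it falls in the opposite direction, the one-sided p-value is large no matter how extreme the estimate is.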

This assumption, I assume (heh), changes the distribution of expected values you should use. After all, the normal (or t-) distribution is based on the assumption that there exists no effect, and that any deviations from zero, in either direction, are exclusively the consequence of error. That’s why the thing is symmetrical. If we used a distribution to test our null hypothesis that instead reflected the assumption that “any effect may exist as far as we know as long as it’s opposite to the tested hypothesis”, it would, I expect, be skewed: right-skewed if the alternative hypothesis assumes a positive deviation from zero, left-skewed if the alternative hypothesis assumes a negative deviation from zero.

So, by conducting a one-sided test, you’re kind of cheating. You pretend to test under the assumption of the null hypothesis of “no association”, but in fact you change the null hypothesis from one based on a point estimate of zero to one based on an infinitely wide interval from -infinity (or infinity) and ending at zero. You’re violating the principle that in NHST, you’re testing under the assumption that the null hypothesis is true, and the null hypothesis is agnostic as to a direction of any effect that may exist. Assuming a null hypothesis with personal preferences changes it from a null hypothesis to a some-thing hypothesis, and that means you shouldn’t use the normal (or t-) distribution, but some other exotic distribution that I’m sure really clever people have discovered and given a fancy name.

So, this is why I don’t understand how one-sided testing is allowed in any situation where a counter-hypothetical effect is possible. Unless you use a distribution that is consistent with your fundamentally different null hypothesis: a distribution that is not based on the assumption that the effect is zero, but on the assumption that the effect may be anything in the interval from -infinity to zero (or zero to infinity), and I’ve never seen anybody do that. Everybody just uses the regular distributions, which are based on a ‘symmetrical agnostic null hypothesis plus symmetrical deviations due to error’, which is not what you assume when you compute a one-sided p-value.
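One way to probe this worry numerically is to ask how often a one-sided test based on the regular zero-centered distribution would reject when the true effect is zero or lies in the opposite direction. The sketch below assumes a z-test with a known standard error and expresses the true effect in standard error units; the function name `one_sided_rejection_probability` is a hypothetical helper introduced here.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def one_sided_rejection_probability(true_effect_in_se_units, alpha=0.05):
    """P(reject) for a one-sided z-test that expects a positive effect."""
    # Cutoff taken from the regular, zero-centered normal distribution
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha)
    # The z statistic is Normal(true effect, 1); we reject when it exceeds z_crit
    return 1.0 - STD_NORMAL.cdf(z_crit - true_effect_in_se_units)


# True effects of zero and of increasing size in the 'wrong' direction
for effect in (0.0, -0.5, -1.0, -2.0):
    print(effect, round(one_sided_rejection_probability(effect), 4))
```

Under these assumptions the rejection probability is highest when the true effect is exactly zero and shrinks as the true effect moves further in the opposite direction. That is the usual textbook reason given for computing the one-sided p-value from the zero-centered distribution; whether it answers the objection above is exactly the question this post asks.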

And I sincerely hope somebody can explain to me why I’m wrong. Or, alternatively, that this ends the discussion and we can move on to more important things, such as writing a script that makes power analyses for more complex situations than bivariate tests accessible to the general audience of psychological researchers :-)

UPDATE: while reading up on this discussion on Twitter, I discovered another explanation, by Alex Etz, which I think says roughly what I’m trying to say here, but in far fewer words and probably more accurately (albeit in a more technical and abstract way).

Also, for the sake of completeness and to read up on this: I noticed Daniel’s blog post through Matti Heino’s Twitter message:

And here’s @Lakens in 2016: https://t.co/edTtQ9ztSg

— matti heino (@Heinonmatti) March 23, 2016

By reading up myself, I discovered that this is, apparently, a one-year-old discussion. Story of my life. Well, better late than never :-)