Statistics is mainly about using the observed data to make conclusions about the "real state of affairs", underlying that data. A classical and most widely spread technique for making these conclusions is based on significance testing. In simple terms, the idea of significance testing is to ask the question: "if the real state of affairs were X, how probable would it be for us to obtain the data D we are currently observing?". If the answer is "rather unprobable" (e.g. p < 0.05), the common decision is to reject the proposition X in favor of the alternative "not X". Otherwise the researcher claims to "see no reason to reject X".
The logic behind that reasoning seems quite solid from the first glance, yet it is well known to be faulty. Naturally, the fact that the likelihood of the data P(D | X) is low need not imply that the underlying hypothesis is wrong - it might very well be the case that the data by itself is already rare enough to make this value low. The only correct way of making sound judgments is to consider the a-posteriori probability of the hypothesis P(X | D) instead. However, the latter can be quite inconvenient to compute. Besides, the wild popularity of significance tests and p-values seems to indicate that the issue is not at all that serious. Really, P(X | D) looks so similar to P(D | X), who cares?
The book "What If There Were No Significance Tests?", which I stumbled upon recently while browsing a stray library shelf, makes it clear that this issue is not a joke. It is a collection of chapters written by renowned statisticians (most of which some-why work in the field of psychology), that quite convincingly condemns the widespread overuse of p-values and the related significance-based hypothesis testing in favor of other approaches. The main point is nailed quite precisely in the very first essay by Jacob Cohen, which I strongly advise you to read right now in order to get rid of any illusions you might still have regarding significance testing. And when you're done with that, you can continue reading this post.
In the following I shall provide my personal summary of the marvelous "Member of Congress" example from J.Cohen's essay. So far it is the best illustration I know of, about why exactly it is dangerous to use significance tests blindly.
Improbable does not mean impossible
Consider the following situation. We have observed a person which we know to be a Member of the US Congress. We are interested in testing the hypothesis, that this person is an American citizen. To apply the significance testing methodology, we proceed by estimating the p-value:
P(Congressman | American) ~ 535/300 000 000.
This is clearly below the popular 0.05 threshold. As a result, we are forced to reject the null-hypothesis and conclude that the person is not an American citizen. Bummer.
What is the problem here? Well, one thing is worth noting - while the probability for an American to be a congressman is low, it is even lower (precisely, zero), for a non-American. So maybe we would have been better off if we expanded the procedure above to the following "maximum-likelihood"-style reasoning:
Considering that the likelihood P(Congressman | American) is greater than the likelihood P(Congressman | non-American), we must conclude that the person in front of us is an American rather than not.
Did we just solve the problem? Is it enough to consider "p-values both ways" to clear things up? No!
Maximum likelihood does not work
Let us now consider a reversed situation. We are faced with a person, which, we know, is an American. We are interested in the hypothesis that he is a congressman. Compute the two likelihoods:
P(American | Congressman) = 1
P(American | not Congressman) ~ 300 000 000 / 6 700 000 000
Observing that the first likelihood is greater than the second we are forced to conclude that the person in front of us is indeed a congressman. Bummer, again!
Only by multiplying the likelihood with the marginal probability P(Congressman) could we have obtained the correct decision. Which is, to say, we must have been estimating the probabilities the other way around from the start.
To summarize, be wary of these pitfalls. I would not agree with the strong negative opinion of the authors of the book, though. After all, a lot of stuff is quite fruitfully done nowadays using p-values only. However, each time you use them, do it sensibly and keep in mind the following two aspects:
- If your p-value is low, can this be solely due to low marginal probability of the data? What is the "reversed" p-value? What is the power of your test?
- If you suspect that your hypotheses might be subject to a highly non-uniform prior probabilities, do not use bare p-values. You must consider the prior!