• Posted by Swen, 13.02.2010

    The recent blog post "Machine Learning in Mordor" is an excellent illustration of the inherent bias in error estimation methods. Let us consider a standard classification task for simplicity. Good machine learning practice is then to first find a classifier that is optimal or near-optimal on the training set. Recall that for a fixed classifier the training error is a Monte-Carlo estimate of the generalisation error - the average error the classifier makes when the data is drawn independently from some unknown but fixed distribution. However, as soon as we choose the classifier so as to minimise the training error, the estimate becomes biased: the training error underestimates the generalisation error.

    To avoid this, the data is usually split into two separate sets: training data and validation data. If we choose the classifier according to the training data and later estimate the generalisation error on the validation data, the bias disappears and we should be in good shape. Moreover, for certain classification methods, such as Support Vector Machines, it is possible to prove that the training error and the generalisation error are closely connected. That is, we can prove bounds of the form

    GeneralisationError <= TrainingError + F(n)

    where F(n) is a bias term that depends on the size of the training sample. Although both of these arguments are formally correct, they are void in certain practical settings.
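
    To see the selection effect in numbers, here is a minimal Python simulation (an artificial setup chosen purely for illustration): the labels are pure noise and the candidate "classifiers" are random guessers, so every one of them has a true error of 50%. The classifier picked for its low training error looks much better than 50% on the training set, while its error on a held-out validation set remains unbiased.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    n_train, n_valid, n_classifiers, n_trials = 15, 5, 100, 2000
    # Labels are pure noise, so every classifier has a true error of 0.5.

    train_err_winner, valid_err_winner = [], []
    for _ in range(n_trials):
        train_labels = rng.integers(0, 2, n_train)
        valid_labels = rng.integers(0, 2, n_valid)
        # Each "classifier" is just a fixed random prediction vector.
        train_preds = rng.integers(0, 2, (n_classifiers, n_train))
        valid_preds = rng.integers(0, 2, (n_classifiers, n_valid))

        train_errors = (train_preds != train_labels).mean(axis=1)
        best = train_errors.argmin()          # chosen on the training set
        train_err_winner.append(train_errors[best])
        valid_err_winner.append((valid_preds[best] != valid_labels).mean())

    print("training error of the chosen classifier:  ", np.mean(train_err_winner))   # well below 0.5
    print("validation error of the chosen classifier:", np.mean(valid_err_winner))   # close to 0.5
    ```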

    For instance, the student Bughrorc, who split the data into a 15-element training set and a 5-element validation set, did not do anything wrong, and yet his classification algorithm is likely to perform worse than the method proposed by Aghargh.

    One could argue that Bughrorc made a mistake by using the validation set to choose between several competing classifiers, and thus, by the same reasoning as above, his estimate of the generalisation error is again an underestimate. However, this is only a partial explanation.

    Imagine a standard no-free-lunch setting, where the labels are assigned to the data randomly so that 80% of the spells are good and 20% of the spells are bad. Moreover, assume that Bughrorc fits only one classification algorithm on the training set and later uses the validation set for estimating the generalisation error. As the sample is small, the validation error might again be zero. Is the estimate now free of bias, and is Bughrorc safe?

    Actually not. As weird as it seems, statistical estimates depend on your intentions and potential behaviour. If Bughrorc sees a small error estimate and accepts it as correct, that does not mean the estimate is unbiased, even though the validation itself was properly done. To verify whether the estimate is biased or not, Bughrorc has to explore his deeper thoughts. If he can solemnly promise that a larger validation error would not have made him drop the experiment, then the validation error is indeed unbiased (but wrong). However, if a larger validation error would have stopped him from using the classifier, the same result is biased (and wrong) - whenever Bughrorc accepts the validation error, it is necessarily small, and thus the validation error underestimates the generalisation error.
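
    A tiny simulation shows this conditional bias directly. Assume, purely for illustration, that Bughrorc's classifier actually errs on 20% of spells and that he keeps it only if all 5 validation spells come out right:

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    true_error, n_valid, n_trials = 0.20, 5, 100_000
    accept_threshold = 0.0   # keep the classifier only if the validation run is spotless

    valid_errors = rng.binomial(n_valid, true_error, n_trials) / n_valid
    accepted = valid_errors <= accept_threshold

    print("fraction of runs accepted:      ", accepted.mean())                # about 0.8**5 = 0.33
    print("mean validation error, all runs:", valid_errors.mean())            # ~0.20, unbiased
    print("mean validation error, accepted:", valid_errors[accepted].mean())  # 0.0, a gross underestimate
    ```

    Over all runs the validation error is an unbiased estimate of 20%; over the runs Bughrorc actually accepts, it is always zero.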

    As most of us are sane enough not to use a classifier with an estimated classification error near 100% or even near 50%, the validation error is biased for all practical purposes.

    Surprisingly enough, this bias can be estimated. Let us first fix a conservative point estimate of the validation error that is smaller than the actual generalisation error in at most 5% of cases, provided the validation samples are drawn independently from the actual distribution. For example, assume that we are willing to accept a 20% classification error and the classification problem is easy - for a randomly drawn sample, the validation error is below 20% with probability 95%. Now, even though we shall tend to unfairly reject the cases where the validation error overestimates the generalisation error (thus biasing our estimate towards underestimation), the fraction of accepted cases that are underestimates is at most 5/95 ≈ 5.3%. Consequently, the estimate is a safe overestimate with confidence 94.7%.

    Figure: Easy machine learning problem

    On the other hand, if the classification problem is difficult - say, the validation error is below 20% with probability only 4% - then the situation is completely different. Since the fraction of cases where the validation underestimates the error (up to 5%) is larger than the fraction of cases in which we accept the estimate (4%), all accepted estimates might be underestimates. Consequently, the estimate is a safe overestimate with confidence 0%.

    Figure: Difficult machine learning problem
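
    The two confidence figures above follow from a simple conditional-probability bound: if the point estimate underestimates the true error with probability at most 5%, and we accept it with probability p, then among the accepted cases the fraction of underestimates is at most 0.05/p. A minimal sketch:

    ```python
    def overestimate_confidence(p_underestimate, p_accept):
        """Lower bound on the confidence that an accepted estimate is a safe overestimate,
        using P(underestimate | accepted) <= P(underestimate) / P(accepted)."""
        return 1.0 - min(p_underestimate / p_accept, 1.0)

    print(overestimate_confidence(0.05, 0.95))  # easy problem:      0.947...
    print(overestimate_confidence(0.05, 0.04))  # difficult problem: 0.0
    ```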

    As a final remark, note that if the validation set is small and the classification task is difficult, then the validation error falls below the acceptance threshold with significant probability. For the no-free-lunch Mordor case, the acceptance probability is 80% for a single validation point, and thus we underestimate the error in 80% of possible cases. If the validation set is large, then such a small validation error is achieved with insignificant probability, and the question of whether we should or should not continue hardly arises at all. In other words, the smaller our acceptable error rate, the larger the validation set must be to compensate for the rejection bias.
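
    To get a feel for how quickly a larger validation set suppresses this effect, one can compute the acceptance probability as a binomial tail. The numbers below assume, as in the Mordor story, a true error of 20% and an accept-only-if-spotless policy:

    ```python
    from math import comb

    def acceptance_probability(n, true_error, max_errors):
        """P(at most max_errors mistakes on an n-point validation set)."""
        return sum(comb(n, k) * true_error**k * (1 - true_error)**(n - k)
                   for k in range(max_errors + 1))

    for n in (1, 5, 20, 100):
        print(n, acceptance_probability(n, true_error=0.20, max_errors=0))
    # 1    0.8        (the 80% acceptance from the single-point case)
    # 5    0.33
    # 20   0.012
    # 100  2e-10      (a deceptively small validation error practically never survives)
    ```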


  • Posted by Konstantin, 12.02.2010

    Statistics is mainly about using observed data to draw conclusions about the "real state of affairs" underlying that data. The classical and most widespread technique for drawing these conclusions is significance testing. In simple terms, the idea of significance testing is to ask the question: "if the real state of affairs were X, how probable would it be for us to obtain the data D we are currently observing?". If the answer is "rather improbable" (e.g. p < 0.05), the common decision is to reject the proposition X in favor of the alternative "not X". Otherwise, the researcher claims to "see no reason to reject X".

    The logic behind this reasoning seems quite solid at first glance, yet it is well known to be faulty. Naturally, the fact that the likelihood of the data P(D | X) is low need not imply that the underlying hypothesis is wrong - it might very well be the case that the data by itself is already rare enough to make this value low. The only correct way of making sound judgments is to consider the a-posteriori probability of the hypothesis P(X | D) instead. However, the latter can be quite inconvenient to compute. Besides, the wild popularity of significance tests and p-values seems to indicate that the issue is not at all that serious. Really, P(X | D) looks so similar to P(D | X), who cares?


    The book "What If There Were No Significance Tests?", which I stumbled upon recently while browsing a stray library shelf, makes it clear that this issue is not a joke. It is a collection of chapters written by renowned statisticians (most of which some-why work in the field of psychology), that quite convincingly condemns the widespread overuse of p-values and the related significance-based hypothesis testing in favor of other approaches. The main point is nailed quite precisely in the very first essay by Jacob Cohen, which I strongly advise you to read right now in order to get rid of any illusions you might still have regarding significance testing. And when you're done with that, you can continue reading this post.

    In the following I shall provide my personal summary of the marvelous "Member of Congress" example from J. Cohen's essay. It is the best illustration I know of why exactly it is dangerous to use significance tests blindly.

    Improbable does not mean impossible

    Consider the following situation. We have observed a person whom we know to be a Member of the US Congress. We are interested in testing the hypothesis that this person is an American citizen. To apply the significance testing methodology, we proceed by estimating the p-value:

    P(Congressman | American) ~ 535/300 000 000.

    This is clearly below the popular 0.05 threshold. As a result, we are forced to reject the null-hypothesis and conclude that the person is not an American citizen. Bummer.

    What is the problem here? Well, one thing is worth noting - while the probability for an American to be a congressman is low, it is even lower (precisely, zero) for a non-American. So maybe we would have been better off had we expanded the procedure above into the following "maximum-likelihood"-style reasoning:

    Considering that the likelihood P(Congressman | American) is greater than the likelihood P(Congressman | non-American), we must conclude that the person in front of us is an American rather than not.

    Did we just solve the problem? Is it enough to consider "p-values both ways" to clear things up? No!

    Maximum likelihood does not work

    Let us now consider the reversed situation. We are faced with a person who, we know, is an American. We are interested in the hypothesis that he is a congressman. Compute the two likelihoods:

    P(American | Congressman) = 1

    P(American | not Congressman) ~ 300 000 000 / 6 700 000 000

    Observing that the first likelihood is greater than the second, we are forced to conclude that the person in front of us is indeed a congressman. Bummer, again!

    Only by multiplying each likelihood with the corresponding marginal probability, P(Congressman) or P(not Congressman), could we have obtained the correct decision. Which is to say, we should have been estimating the probabilities the other way around from the start.
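
    For concreteness, here is the whole example redone with the prior included, using the same round numbers (535 congressmen, 300 million Americans, 6.7 billion people):

    ```python
    n_congress, n_americans, n_world = 535, 300_000_000, 6_700_000_000

    prior_congress = n_congress / n_world   # P(Congressman) ~ 8e-8
    lik_congress = 1.0                      # P(American | Congressman)
    lik_not = n_americans / n_world         # P(American | not Congressman) ~ 0.045

    # Maximum likelihood compares the likelihoods alone and picks "Congressman":
    print(lik_congress > lik_not)           # True - the wrong conclusion

    # Multiplying each likelihood by its prior flips the decision:
    post_congress = lik_congress * prior_congress
    post_not = lik_not * (1 - prior_congress)
    print(post_congress > post_not)                      # False
    print(post_congress / (post_congress + post_not))    # P(Congressman | American) ~ 1.8e-06
    ```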

    To summarize, be wary of these pitfalls. I would not agree with the strong negative opinion of the authors of the book, though. After all, a lot of stuff is quite fruitfully done nowadays using p-values only. However, each time you use them, do it sensibly and keep in mind the following two aspects:

    1. If your p-value is low, can this be solely due to the low marginal probability of the data? What is the "reversed" p-value? What is the power of your test?
    2. If you suspect that your hypotheses might be subject to highly non-uniform prior probabilities, do not use bare p-values. You must consider the prior!
