Four Years Remaining

The Prior Confusion

Posted by Konstantin 28.01.2009
Imagine that you have just derived a novel IQ test. You have established that for a given person the test produces a normally-distributed unbiased estimate of her IQ with variance 10². That is, if, for example, a person has true IQ=120, the test will result in a value from a N(120,10²) distribution. Also, from your previous experiments you know that among all the people, IQ has a N(110,15²) distribution.

On a sunny Monday morning you went out onto the street and asked the first passer-by (whose name turned out to be John) to take your test. The resulting score was t=125. The question is: what can you now conclude about the true IQ of that person (assuming, of course, that there is such a thing as a "true IQ")? There are at least two reasonable approaches to this problem.
1. You could apply the method of maximum likelihood. Here's John, standing beside you, and you know his true IQ must be some real number a. The test produced an unbiased estimate of a equal to 125. The likelihood of the data (i.e. the probability of obtaining a test score of 125 for a given a) is therefore:
  
  $P[T=125|A=a]=\frac{1}{\sqrt{2\pi 10^2}}\exp\left(-\frac{1}{2}\frac{(125-a)^2}{10^2}\right)$
  
  The maximum likelihood method suggests picking the value of a that maximizes the above expression. Finding the maximum is rather easy here and it turns out to be at a=125, which is pretty natural. You thus conclude that the best you can say about John's true IQ is that it is approximately 125.
2. An alternative way of thinking is to use the method of maximum a-posteriori probability, where instead of maximizing likelihood $P[T=125|A=a]$ , you maximize the a-posteriori probability $P[A=a|T=125]$ . The corresponding expression is:
  
  $\begin{aligned} P[A=a|T=125] &\propto P[T=125|A=a]\cdot P[A=a] \\ &= \frac{1}{\sqrt{2\pi 10^2}}\exp\left(-\frac{1}{2}\frac{(125-a)^2}{10^2}\right)\cdot \frac{1}{\sqrt{2\pi 15^2}}\exp\left(-\frac{1}{2}\frac{(110-a)^2}{15^2}\right) \end{aligned}$
  
  Finding the required maximum is easy again, and the solution turns out to be a=120.38. Therefore, by this logic, John's IQ should be considered to be somewhat lower than what the test indicates.
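Both maxima can be checked numerically. The following is a minimal sketch using only the numbers given above; the helper names (`log_normal_pdf`, `argmax_on_grid` and so on) are made up for this illustration. It maximizes the two expressions on a grid and compares the second result with the precision-weighted average that the product of two Gaussians is known to peak at.

```python
import math

PRIOR_MEAN, PRIOR_SD = 110.0, 15.0  # population IQ distribution, N(110, 15^2)
TEST_SD = 10.0                      # standard deviation of the test error
SCORE = 125.0                       # John's observed score


def log_normal_pdf(x, mean, sd):
    """Log-density of N(mean, sd^2) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * sd**2) - 0.5 * ((x - mean) / sd) ** 2


def log_likelihood(a, score=SCORE):
    """log P[T=score | A=a]: the quantity maximized by approach 1."""
    return log_normal_pdf(score, a, TEST_SD)


def log_posterior_unnormalized(a, score=SCORE):
    """log of P[T=score | A=a] * P[A=a]: the quantity maximized by approach 2."""
    return log_likelihood(a, score) + log_normal_pdf(a, PRIOR_MEAN, PRIOR_SD)


def argmax_on_grid(f, lo=50.0, hi=200.0, step=0.01):
    """Return the grid point in [lo, hi] at which f is largest."""
    n = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n + 1)]
    return max(grid, key=f)


mle_estimate = argmax_on_grid(log_likelihood)
map_estimate = argmax_on_grid(log_posterior_unnormalized)

# Closed form: average of score and prior mean, weighted by their precisions.
w_test, w_prior = 1 / TEST_SD**2, 1 / PRIOR_SD**2
map_closed_form = (w_test * SCORE + w_prior * PRIOR_MEAN) / (w_test + w_prior)

print(f"MLE:               {mle_estimate:.2f}")     # 125.00
print(f"MAP (grid):        {map_estimate:.2f}")     # 120.38
print(f"MAP (closed form): {map_closed_form:.2f}")  # 120.38
```

The grid search is deliberately naive; it is only meant to confirm the two numbers without relying on any calculus.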
Which of the two approaches is better? It might seem utterly unbelievable, but the estimate provided by the second method is, in fact, closer to the truth. The straightforward "125" proposed by the first method is biased, in the sense that on average (over all the people who happen to score 125) this estimate is slightly exaggerated. Think how especially unintuitive this result is from the point of view of John himself. Clearly, his own "true IQ" is something fixed. Why on Earth should he consider "other people" and take into account the overall IQ distribution just to interpret his own result obtained from an unbiased test?
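One way to make this claim less mysterious is a small simulation. The sketch below assumes exactly the setup of the post (people drawn from N(110,15²), one test each with N(0,10²) error); the variable names and the choice of a 124 to 126 score window are mine. It compares the mean squared error of reporting the raw score with that of reporting the shrunken estimate, and looks at the average true IQ of the simulated people who scored close to 125.

```python
import random

random.seed(0)  # fixed seed so that the run is reproducible

PRIOR_MEAN, PRIOR_SD = 110.0, 15.0  # population IQ distribution
TEST_SD = 10.0                      # standard deviation of the test error
N_PEOPLE = 200_000                  # number of simulated passers-by

# Weight given to the score by the second approach (ratio of precisions).
w = (1 / TEST_SD**2) / (1 / TEST_SD**2 + 1 / PRIOR_SD**2)


def shrunken_estimate(score):
    """Estimate of approach 2: pull the score towards the population mean."""
    return w * score + (1 - w) * PRIOR_MEAN


sq_err_raw = 0.0
sq_err_shrunken = 0.0
true_iqs_near_125 = []

for _ in range(N_PEOPLE):
    true_iq = random.gauss(PRIOR_MEAN, PRIOR_SD)
    score = random.gauss(true_iq, TEST_SD)
    sq_err_raw += (score - true_iq) ** 2
    sq_err_shrunken += (shrunken_estimate(score) - true_iq) ** 2
    if 124.0 <= score <= 126.0:  # people who scored roughly what John did
        true_iqs_near_125.append(true_iq)

mse_raw = sq_err_raw / N_PEOPLE
mse_shrunken = sq_err_shrunken / N_PEOPLE
mean_true_iq_near_125 = sum(true_iqs_near_125) / len(true_iqs_near_125)

print(f"MSE of raw score:         {mse_raw:.1f}")
print(f"MSE of shrunken estimate: {mse_shrunken:.1f}")
print(f"Mean true IQ of people scoring about 125: {mean_true_iq_near_125:.1f}")
```

In theory the two errors should come out near 100 and 69, and the last number near 120.4 rather than 125. Note that for any single person with a fixed true IQ the test is still unbiased; the exaggeration only shows up when averaging over the people who share a given score.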

To confuse things even further, let us say that John was unhappy with the result and came back to you to take the test a second time. Although it is probably impossible to perform any real IQ test twice and get independent results, let us imagine that your test can indeed be repeated. The second test, again, resulted in a score of 125. What IQ estimate would you suggest now? On one hand, John himself came to you and this time you could regard his IQ as a "real" constant, right? But on the other hand, John is just a person randomly picked from the street, who happened to take your test twice. Go figure.
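For reference, here is what each approach gives mechanically, under the assumption made above that the two scores are independent given John's true IQ. The likelihood of two scores of 125 is still maximized at a=125. The posterior picks up a second likelihood factor, and its maximum is again a precision-weighted average, now with the test counted twice:

$\hat{a}_{MAP} = \frac{2\cdot 125/10^2 + 110/15^2}{2/10^2 + 1/15^2} = \frac{2690}{22} \approx 122.27$

The same formula with a single score gives $\frac{125/10^2 + 110/15^2}{1/10^2 + 1/15^2} = \frac{1565}{13} \approx 120.38$, so the second score moves the estimate closer to 125 but not all the way. Which of these numbers John should be told is exactly the question posed above.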

PS: Some additional remarks are appropriate here:
- Although I'm not a fan of The Great Frequentist-Bayesian War, I cannot but note that the answer is probably easier if John is a Bayesian at heart, because in this case it is natural for him to regard "unknown constants" as probability distributions and consider prior information in making inferences.
- If it is hard for you to accept the logic in the presented situation (as it is for me), some reflection on the similar, but less complicated false positive paradox might help to relieve your mind.
- In general, the correct way to obtain the true unbiased estimate is to compute the mean over the posterior distribution:
  $E[a|T=125] = \int a \,\mathrm{d}P[a|T=125]$
  
  In our case, however, the posterior is symmetric and therefore the mean coincides with the maximum. Computing the mean by direct integration would be much more complicated.
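The coincidence of mean and maximum can also be checked by brute force. The sketch below is only an illustration: it replaces the integral by a Riemann sum on a fine grid (the grid range 0 to 250 and the step are arbitrary choices of mine), drops the normalizing constants because they cancel in the ratio, and also reports the posterior variance.

```python
import math

PRIOR_MEAN, PRIOR_SD = 110.0, 15.0  # population IQ distribution
TEST_SD = 10.0                      # standard deviation of the test error
SCORE = 125.0                       # John's observed score


def unnormalized_posterior(a):
    """Likelihood times prior, without the constant factors."""
    likelihood = math.exp(-0.5 * ((SCORE - a) / TEST_SD) ** 2)
    prior = math.exp(-0.5 * ((a - PRIOR_MEAN) / PRIOR_SD) ** 2)
    return likelihood * prior


# Riemann sum over a grid wide enough that the tails are negligible.
STEP = 0.01
grid = [i * STEP for i in range(25001)]  # 0.00, 0.01, ..., 250.00
weights = [unnormalized_posterior(a) for a in grid]
total = sum(weights)

posterior_mean = sum(a * w for a, w in zip(grid, weights)) / total
posterior_var = sum((a - posterior_mean) ** 2 * w
                    for a, w in zip(grid, weights)) / total
posterior_mode = max(grid, key=unnormalized_posterior)

print(f"posterior mean:     {posterior_mean:.2f}")  # 120.38
print(f"posterior mode:     {posterior_mode:.2f}")  # 120.38
print(f"posterior variance: {posterior_var:.2f}")   # 69.23
```

The variance agrees with the closed form 1/(1/10² + 1/15²) = 900/13, which is smaller than both the test variance and the population variance.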
Posted by Konstantin @ 11:20 pm

Tags: Bayes, Paradox, Probability theory, Puzzle, Statistics
5 Comments
1. swen on 29.01.2009 at 17:52 (Reply)
  
  Kostja, again you confuse statistics and Bayesian statistics:
  
  Statistics is all about producing inference algorithms that work well on average. A statistical test is unbiased if (asymptotically) the expected value of the point estimator coincides with (converges to) the true value of the parameter. Here the distribution over which the average is computed is fixed ahead and is consistent with the parameter. In the IQ test case, mathematical statistics studies the behaviour of the MLE estimator over the exhaustive set of questionnaire answers that are generated by a hypothetical never-changing person with IQ 125.
  
  Bayesian statistics is in principle subjective. It could not care less about the average-case behaviour of an estimation algorithm. The question is posed as follows: we have the data (a single questionnaire answer) and prior knowledge about IQ tests and IQ in general; what is the best answer I can give for this concrete case? In the scope of this question it is completely irrelevant how this algorithm behaves on average, over an exhaustive list of all questionnaire answers generated by a never-changing person with IQ 125.
  
  The average-case behaviour is completely uninteresting, since I will make the decision whether to hire John only once, and I could not care less whether my decision is good on average as long as I get the optimal decision for John. As long as you really believe in the information you used to derive the MAP estimate, you cannot do better if you want to behave rationally.
  
  However, the MAP estimate does not tell you the whole story; you should also consider the shape of the posterior and find confidence intervals. Now, differently from ordinary statistics, these really mean what they are supposed to say: with 95% probability (as my measure of uncertainty) the correct answer for John is in the fixed range.
  
  Classical confidence intervals from mathematical statistics would say something completely different and actually irrelevant. If we fix the method of drawing the confidence intervals, then on average, over an exhaustive listing of all questionnaire results completed by a hypothetical never-changing person with fixed IQ, the true value of the fixed parameter would be in the range in 95% of cases. Although this is a reasonable goal, it actually says nothing about the particular result obtained by John. Assuming that John has an IQ, the fixed questionnaire and the fixed method for computing confidence intervals provide a fixed result. The true IQ is either in the computed (fixed) range or not. Although we do not know the result and will never learn it, there is (formally) no uncertainty left, and thus the classical guarantee is rendered meaningless without further assumptions.
  
  It is just as if you have thrown a coin and it has landed on tails: the probability of obtaining tails in a single throw is then meaningless, since the coin has already landed on tails. Obviously, nothing changes if the person who throws the coin is blind.
  1. Konstantin on 30.01.2009 at 01:18 (Reply)
    
    Why do I have a feeling that you are replying to a different post now? It is certainly not (yet) the post for waging anti-Frequentist wars you like so much. :p
    
    You seem to be arguing with something, but I don't understand what claim you are trying to disprove.
    Did you want to say that "MAP estimate is always more rational than the unbiased estimate"?
    Firstly, this is just not the point of the post. I do use MAP there!
    Secondly, I wouldn't be so certain about that, because this clearly depends on the circumstances and your definition of rationality (e.g. a loss function). You don't know why I'm asking John to fill in the test, so don't imagine things. Why are you so sure that I won't perform the estimation multiple times? I didn't write anything like that. Moreover, the IQ test was probably designed to be used multiple times! Or, maybe it is all a game where I pay John proportionally to the square of the estimation error and I wish to minimize losses?
    I do agree that the mean is a somewhat complicated object, but it is nonetheless quite often the "best" possible point estimate, even in a one-shot experiment, simply because quadratic loss is most often assumed as a default definition of a rational choice.
    Did you also want to say that bias in the estimate does not matter? This is a weird opinion. Clearly, if your algorithm always returns "true value + 10", you better go fix it.
    
    And please, don't go into your hate and misunderstanding of confidence intervals yet. For example, your favourite argument:
    
    "It is just as if you have thrown a coin and it has landed on tails: the probability of obtaining tails in a single throw is then meaningless, since the coin has already landed on tails. Obviously, nothing changes if the person who throws the coin is blind."
    
    is flawed. Even when the coin has already landed, it still makes sense to ask what is the probability p for this coin to land tails, because this p is not a description of the event, it is a model parameter, i.e. something "built into" the coin that you can never really observe, no matter whether you throw the coin or not. And it is this p that you build confidence intervals for, not the cointoss event probability. In statistics, you always estimate parameters, never "probability". But that should come in a separate post too, perhaps.
    1. swen on 30.01.2009 at 16:37 (Reply)
      
      "Secondly, I wouldn’t be so certain about that, because this clearly depends on the circumstances and your definition of rationality (e.g. a loss function). You don’t know why I’m asking John to fill in the test, so don’t imagine things. Why are you so sure that I won’t perform the estimation multiple times? I didn’t write anything like that. Moreover, the IQ test was probably designed to be used multiple times!"
      
      Here is your first mistake. Classical statistics is all about generic algorithms, i.e., algorithms that are used many times. The Bayesian approach is to infer a single posterior distribution or a specific decision (whether to hire John or not). In that frame, questions about the average behaviour of the decision procedure do not make sense. They are out of scope. Hence, questions about the bias of point estimates like MAP are irrelevant and meaningless, since you never reach the same decision with the same knowledge (you cannot step into the same river twice).
      
      "Secondly, I wouldn’t be so certain about that, because this clearly depends on the circumstances and your definition of rationality (e.g. a loss function)."
      
      Again, you make a mistake. By rationality I most certainly do not mean minimisation under some loss function. I just assume that you can quantify uncertainty with real numbers and your assignments are coherent:
      1) internally consistent
      2) externally consistent
      The latter means that uncertainties must satisfy the standard Kolmogorov axioms and the Bayes rule. All such uncertainty assignments are coherent and equally applicable. The exact circumstances (your subjective belief) determine which of those equally applicable assignments you should choose.
      
      Now, if you accept this, the question of bias becomes irrelevant. If you have correctly formulated what you believe about John and IQ tests, then the posterior distribution is the only coherent inference you can draw.
      
      If you want to develop an objective Bayesian theory, then you might consider asymptotic consistency. That is, you might want to prove that different people with different beliefs finally reach similar conclusions given enough evidence (enough IQ tests performed on John). Such results do exist, but again there will be no notion of bias, since the "bias" comes from your prior belief and is actually not bias but usage of your prior knowledge.
      
      "Did you also want to say that bias in the estimate does not matter? This is a weird opinion. Clearly, if your algorithm always returns 'true value + 10', you better go fix it."
      
      Again, you miss the point. If I consider only this one experiment and really do not care how this procedure works on 1000 people, since I hire only 10 people during my life, the question is meaningless. Assuming that for John there is a true IQ value, the only non-biased procedure is to return this true value.
      Also, if I, given my beliefs, reach true value + 10, the method is not biased; rather, my prior belief is biased. Consequently, the only way to assure non-biased results is to somehow magically get non-biased beliefs about all the facts in the world, which is of course impossible.
      
      "And please, don’t go into your hate and misunderstanding of confidence intervals yet. For example, your favourite argument: 'It is just as if you have thrown a coin and it has landed on tails: the probability of obtaining tails in a single throw is then meaningless, since the coin has already landed on tails. Obviously, nothing changes if the person who throws the coin is blind.' is flawed. Even when the coin has already landed, it still makes sense to ask what is the probability p for this coin to land tails, because this p is not a description of the event..."
      
      What I meant was that knowledge of the probability of getting tails is meaningless after you have thrown a coin and never intend to throw it again. The probability of tails is useful only if you plan to throw the coin. Whether this probability is an inherent property of the coin or not is irrelevant. If the coin is never thrown again, then estimating the probability is meaningless, since you can never use this knowledge.
      
      What I meant about confidence intervals is a trivial observation. When the true value of the parameter, the data and the deterministic inference algorithm for the confidence interval are all fixed, then the true value either lies in the resulting unique confidence interval or it does not. In other words, the coin has already landed and the 95% probability over all possible data sets becomes irrelevant.
      
      Although we do not know whether the confidence interval contains the true parameter value (we are blind coin-tossers), there is no probability left and thus 95% confidence is meaningless. If you have obtained tails, it is really irrelevant whether the probability of tails is 95% or 0.01%, since the event has already happened.
2. Konstantin on 30.01.2009 at 17:25 (Reply)
  
  Flood, yeee!
  
  "Here is your first mistake. Classical statistics is all about generic algorithms, i.e., algorithms that are used many times."
  
  Please, do not present your opinion as some absolute truth. Classical statistics is not all about generic algorithms. That's my opinion, now. Moreover, whether this opinion is true or false does not change anything in the discussion.
  
  "Again, you make a mistake. By rationality I most certainly do not mean minimisation under some loss function. I just assume that you can quantify uncertainty with real numbers and your assignments are coherent..."
  
  Stop the demagogy about blabla-consistent-Kolmogorov. Here's John. He took the test. He wants to know one number. What number should I give him? If your answer is "anything you wish depending on your belief" then it is a very useless answer indeed.
  
  Note that in the presented case the "prior" distribution is not really a matter of your prior belief, at least not something you can discuss or argue about. You confirmed by experiment that N(110,15^2) is the correct distribution, so it's given to you by the boss and thus fixed.
  
  "since I hire only 10 people during my life."
  
  You developed a test that will be used many times. You are selling it. The client requests that the test would produce one number per person examined. That's it. No philosophy.
  
  "If the coin is never thrown again, then estimating the probability is meaningless, since you can never use this knowledge."
  
  Indeed, there is no point in inferring anything, if you will never use the knowledge. This trivial fact is not specific to statistics nor confidence intervals, nor anything you discussed here. It's just demagogy.
  
  "In other words, the coin has already landed and the 95% probability over all possible data sets becomes irrelevant."
  
  For a frequentist the parameter is always fixed and in general you never know its exact value. Contrary to what you claim, throwing or not throwing a coin does not change anything in principle (i.e. the "95%" probability does not become irrelevant). Formally, what you know about an unobserved variable is described by a probability measure on the space of its possible values. A 95% interval is an interval with probability measure 0.95, no matter whether you are a Bayesian or a Frequentist. You'll just use slightly different words to interpret this probability measure, but both interpretations are in fact completely meaningful.
  
  I will post a longer explanation especially for you. The point is that the Bayesian/Frequentist topic is not related to this post in any way.
3. Four Years Remaining » Blog Archive » The Difficulties of Self-Identification on 07.03.2017 at 16:36
  
  […] since the "Prior Confusion" post I was planning to formulate one of its paragraphs as the following abstract puzzle, but […]