• Posted by Konstantin 28.01.2009 5 Comments

    Imagine that you have just derived a novel IQ test. You have established that for a given person the test produces a normally-distributed unbiased estimate of her IQ with variance 102. That is, if, for example, a person has true IQ=120, the test will result in a value from a N(120,102) distribution. Also, from your previous experiments you know that among all the people, IQ has a N(110,152) distribution.

    One a sunny Monday morning you went out on a street and requested the first bypasser (whose name turned out to be John) to take your test. The resulting score was t=125. The question is: what can you conclude now about the true IQ of that person (assuming, of course, that there is such a thing as a "true IQ"). There are at least two reasonable approaches to this problem.

    1. You could apply the method of maximum likelihood. Here's John, standing beside you, and you know his true IQ must be some real number a. The test produced an unbiased estimate of a equal to 125. The likelihood of the data (i.e. the probability of obtaining a test score of 125 for a given a) is therefore:

          \[P[T=125|A=a]=\frac{1}{\sqrt{2\pi 10^2}}\exp\left(-\frac{1}{2}\frac{(125-a)^2}{10^2}\right)\]

      The maximum likelihood method suggests picking the value of a that maximizes the above expression. Finding the maximum is rather easy here and it turns out to be at a=125, which is pretty natural. You thus conclude that the best what you can say about John's true IQ is that it is approximately 125.

    2. An alternative way of thinking is to use the method of maximum a-posteriori probability, where instead of maximizing likelihood P[T=125|A=a], you maximize the a-posteriori probability P[A=a|T=125]. The corresponding expression is:

       \begin{multiline} P[A=a|T=125] \sim P[T=125|A=a]\cdot P[A=a] = \\ = \frac{1}{\sqrt{2\pi 10^2}}\exp\left(-\frac{1}{2}\frac{(125-a)^2}{10^2}\right)\cdot \frac{1}{\sqrt{2\pi 15^2}}\exp\left(-\frac{1}{2}\frac{(110-a)^2}{15^2}\right) \end{multiline}

      Finding the required maximum is easy again, and the solution turns out to be a=120.38. Therefore, by this logic, John's IQ should be considered to be somewhat lower than what the test indicates.

    Which of the two approaches is better? It might seem utterly unbelievable, but the estimate provided by the second method is, in fact, closer to the truth. The straightforward "125", proposed to by the first method is biased, in the sense that on average this estimate is slightly exaggerated. Think how especially unintuitive this result is from the point of view of John himself. Clearly, his own "true IQ" is something fixed. Why on Earth should he consider "other people" and take into account the overall IQ distribution just to interpret his own result obtained from an unbiased test?

    To finally confuse things, let us say that John got unhappy with the result and returned to you to perform a second test. Although it is probably impossible to perform any real IQ test twice and get independent results, let us imagine that your test can indeed be repeated. The second test, again, resulted in a score of 125. What IQ estimate would you suggest now? On one hand, John himself came to you and this time you could regard his IQ as a "real" constant, right? But on the other hand, John is just a person randomly picked from the street, who happened to take your test twice. Go figure.

    PS: Some additional remarks are appropriate here:

    • Although I'm not a fan of The Great Frequentist-Bayesian War, I cannot but note that the answer is probably easier if John is a Bayesian at heart, because in this case it is natural for him to regard "unknown constants" as probability distributions and consider prior information in making inferences.
    • If it is hard for you to accept the logic in the presented situation (as it is for me), some reflection on the similar, but less complicated false positive paradox might help to relieve your mind.
    • In general, the correct way to obtain the true unbiased estimate is to compute the mean over the posterior distribution:

          \[E[a|T=125] = \int a \mathrm{dP}[a|T=125]\]

      In our case, however, the posterior is symmetric and therefore the mean coincides with the maximum. Computing the mean by direct integration would be much more complicated.

    Tags: , , , ,

  • Posted by Margus 27.01.2009 4 Comments

    Two hardened criminals are taken to interrogation in separate cells.
    They are offered the usual deal:
      If neither confesses, both get one year probation.
      If both confess, both do 5 years in jail.
      If one confesses, he goes free but the other does 10 years hard time.

    Here's what actually goes through their minds:
    "Okay, if neither of us confesses, we have to go back to the real world. But its so hard there!
    But if I confess, he will kill me when he gets out.. so thats bad...
    If both of us confess, then we can just get back to jail and continue our lives!"

    Note that it shares some similarities with the original...

    Tags: ,

  • Posted by Konstantin 22.01.2009 2 Comments

    I bet most of you haven't heard of MDX, because it seems to be a technology that is difficult to stumble upon. It is not covered in university courses, not mentioned in the headlines of PC-magazine articles, and the corresponding google search returns about 20000 results, which makes it thousands (and even tens of millions) times less popular than most other related keywords. This is completely unfair. I believe MDX is compulsory knowledge for everyone who is educated in databases and data analysis enough to appreciate the virtues of both SQL queries and Excel-like spreadsheet processing. A famous quote by Alan Perlis says: "A language that doesn't affect the way you think about programming, is not worth knowing". In these terms, MDX is a language well worth knowing.

    MDX stands for Multidimensional Expressions. It is a language for specifying computations and performing queries on a multidimensional database, which effectively provides the computing power of spreadsheets in the form of a query language. Although it is not possible to explain all of MDX in a blog post, I'll try to show the gist of it in a couple of examples. I hope it will raise at least some interest in those, who are patient enough to read to the end. To understand MDX you need some experience with spreadsheets, so I'll assume you do and use the analogies with Excel.

    Multidimensional data. You use spreadsheets to work with tabular data, i.e. something like that:

    Tartu Tallinn
    Jan 21 -1.0 °C -2.1 °C
    Jan 22 -0.2 °C -0.2 °C
    Jan 23 0.3 °C -0.2 °C
    Average temperature

    This is a two-dimensional grid of numbers. Each point in the grid can be indexed by a tuple (Date, City). A multidimensional database is essentially the same grid but without the limitation of two dimensions. For instance, if there are several methods of measuring temperature, you can add a new dimension and index each cell by a tuple (Date, City, MeasurementMethod). And if you wish to keep both the average temperature and humidity, you just add a fourth dimension MeasureType to the database and store both, etc. Once your database becomes multidimensional it becomes impossible to display all of it as a two-dimensional table, so you will have to explore it by slices. For example, if the database did indeed contain 4 dimensions (Date, City, MeasurementMethod, MeasureType) then the above table would be a slice of it for (MeasureType = Temperature, MeasurementMethod = Usual). The MDX query corresponding to the slice would look as follows:

        { City.Tartu, City.Tallinn } on columns,
        { Date.Jan.21, Date.Jan.22, Date.Jan.23 } on rows
      where (MeasureType.Temperature, MeasurementMethod.Usual)

    Cell calculations. The second thing you use spreadsheets for is to compute new cell values from the existing ones. For example, you might wish to augment the table above with a column showing the temperature difference between Tartu and Tallinn and a row showing the average temperature over three days:

    Tartu Tallinn Difference
    Jan 21 -1.0 °C -2.1 °C 1.1 °C
    Jan 22 -0.2 °C -0.2 °C 0.0 °C
    Jan 23 0.3 °C -0.2 °C 0.5 °C
    Average -0.3 °C -0.83 °C 0.53 °C
    Average temperature

    To do that in Excel you would have to enter a formula for each cell in the new column and row. In MDX you analogously write:

      create member City.Difference as (City.Tartu - City.Tallinn)
      create member Date.Jan.Average as 
             (Avg({ Date.Jan.21, Date.Jan.22, Date.Jan.23 }))

    Note that once you have defined the new members, they apply to any slice of your data. I.e., you have just defined the way to compute City.Difference and Date.Jan.Average for all existing measure types and measurement methods. Moreover, many of the useful aggregate computations are already predefined for you. For example, the MDX server would implicitly understand that Date.Jan denotes be the aggregation (e.g. average) of values over all the days of January, and Date is the aggregation of values over all the dates ever. Similarly, City denotes the aggregation over all cities. Thus, selecting the average temperature over all cities in January is a matter of requesting

      where (Date.Jan, City, MeasureType.Temperature, 

    Filters, orders and more. You often need to query the data for things like "cities with the highest average temperature in January", or request to "order days by the temperature difference between Tartu and Tallinn". This is where both the power and complexity of MDX becomes visible. The first query would look as follows:

        Filter(City.Members, Date.Jan == Max(Date.Jan)) on columns
      where (MeasureType.Temperature, MeasurementMethod.Usual)

    Notice how much more expressive this is in comparison to the equivalent query you would have to come up with in SQL. Besides slicing, filtering and ordering, an MDX server can support lots of various generic data processing and analysis functions (here, the exact capabilities depend on the vendor). Thus, when properly tuned, even the tasks such as "selecting differentially expressed genes that play a significant role in a linear model for predicting cancer stage from microarray expression data" could become a matter of a single concise query.

    Pretty cool, don't you think?

    Tags: , , , , ,

  • Posted by Konstantin 15.01.2009 21 Comments

    Consider the following excerpt from a recent article in the British Medical Journal:


    Mike has only two children, and they are called Pat and Alex, which could equally be boys’ or girls’ names. In fact, Pat is a girl. What is the probability that Alex is a boy?

    a 50%
    b Slightly less than 50%
    c Slightly more than 50%
    d Between 60% and 70%
    e Between 40% and 30%

    d—Although this could be about the relative popularity of ambiguous names for boys and girls or about subtle imbalances in the sex ratio, it is not meant to be. The clue to the correct answer is in thinking about what we do not know about the family and what we do know already, and applying this to the expected probabilities of girls and boys.

    We do not know if Pat was born first or second. We do know that there are only two children and that Pat is a girl. I am assuming that in the population, 50% of children are girls.

    The birth order and relative frequency of two child families are: boy, boy (25%), girl, girl (25%), boy, girl (25%) girl, boy (25%). We know Mike’s family does not have two boys, since Pat is a girl, so we are only left with three choices for families with at least one girl. Two of these pairs have a boy and one does not. Hence the probability that Alex is a boy is two thirds or 66.7%.

    If we had been told that Pat was born first then the probability that Alex is a boy drops to 50%.

    The well-known "Boy or Girl" paradox that is referenced in the fragment above is probably as old as the probability theory itself. And it is therefore quite amusing to see an incorrect explanation for it presented in a serious journal article. You are welcome to figure out the mistake yourself.

    For completeness sake, here is my favourite way of presenting this puzzle:

    In the following, let Mike be a randomly chosen father of two kids.

    1. Mike has two kids, one of them is a boy. What is the probability that the other one is a girl?
    2. Mike has two kids, the elder one is a boy. What is the probability that the other one is a girl?
    3. Mike has two kids. One of them is a boy named John. What is the probability that the other one is a girl?
    4. I came to visit Mike. One of his two kids, a boy, opened the door to me. What is the probability that Mike's other child is a girl?
    5. I have a brother. What is the probability that I am a girl?
    6. I have a brother named John. What is the probability that I am a girl?

    You can assume that boys and girls are equiprobable, the births of two kids are independent events, a randomly chosen boy will be named John with probability p, and that a family may have two kids with the same name.

    If you haven't tried solving these yet, give it a try. I'm pretty sure you won't do all 6 questions correctly on the first shot.

    Tags: , , ,

  • Posted by Konstantin 01.01.2009 2 Comments
    PhD comic: New Year's Resolutions
    1. Submit at least 2 papers.
    2. Attend at least 2 significant conferences (ICML and ISMB, hopefully).
    3. Make at least 2 important acquaintances.
    4. Participate as a reviewer at least twice.
    5. Write at least 2 pieces of software (useful to at least 2 people each).
    6. Read at least 2 non-technical books and at least 2 technical books.
    7. Complete at least 2 of these multiple side-projects that eagerly wait for completion.
    8. Supervise at least 22 students to successful defence.
    9. Keep this blog running (26 blog entries or so).
    10. Keep in touch with friends and relatives, have a fair dose of fun, sports, music and all the rest that belongs here (you go windsurfing, you call me, OK?).

    Did I miss something?