• Posted by Konstantin 22.10.2008 No Comments

    A beloved child has many names, and so does the field of pattern analysis. Some people call it soft computing, others refer to it as machine learning or data-driven statistics. Although in practice all those terms denote the same approach to data analysis, there is always a certain bias hidden behind each name that relates it to its history and origins. I'm not sure whether I understand these biases correctly, but here's how I would define the terms:

    Pattern Analysis
    This is probably the most general term, referring to any kind of data analysis where the goal is to find "something interesting" in a given dataset. We typically know what kind of things (patterns) we consider interesting (this might be an association rule, a frequent subsequence, a classifier, etc.), and the task is to detect instances of such patterns in the data. Conceptually somewhat opposite to pattern analysis is statistical hypothesis testing, where the task is to test a given pattern for interestingness. In this sense pattern analysis is very close to the statistical problem of multiple hypothesis testing.
    The simplest formalization of this generic notion that I've seen is the one in this exposition by Tijl De Bie. Let D denote the dataset, let \Pi be the pattern space (e.g. "the space of all sequences", or "the space of all classifiers", etc.), and for each \pi\in\Pi let there be a pattern function p_\pi(\cdot), such that p_\pi(D) measures the "interestingness" of the pattern \pi with respect to the data D. Then the general task of pattern analysis is just the following maximization problem:

    \max_{\pi\in\Pi} p_\pi(D)
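
    To make this concrete, here is a minimal sketch in Python. The dataset, the pattern space and the interestingness function below are hypothetical toy choices of mine (frequent itemsets scored by their support), not something taken from the exposition above; the point is merely that the whole task boils down to a maximization over \Pi.

      from itertools import combinations

      # A toy dataset D: a list of "transactions" (sets of items).
      D = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]

      # A toy pattern space Pi: all itemsets of size 1 or 2 over the observed items.
      items = sorted(set().union(*D))
      Pi = [frozenset(c) for k in (1, 2) for c in combinations(items, k)]

      # A toy pattern function p_pi(D): the support (relative frequency) of the itemset.
      def interestingness(pattern, data):
          return sum(pattern <= transaction for transaction in data) / len(data)

      # The general pattern analysis task: maximize interestingness over the pattern space.
      best = max(Pi, key=lambda pattern: interestingness(pattern, D))
      print(best, interestingness(best, D))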

    Machine Learning
    Machine learning can be regarded as a specific type of pattern analysis, where the central interest lies in finding classifiers or regression functions. The dataset in this case consists of a number of "instances" x_i \in X, and the task is to find a function f: X \to Y that maps these instances to corresponding "outputs" \hat{y}_i = f(x_i); what exactly is required of this mapping depends on the setting.
    In the case of supervised machine learning, a "true" output y_i is provided with each instance x_i, and the resulting function should aim to match this output on the given instances (i.e. f(x_i) \approx y_i) as well as generalize to unseen ones. Statisticians call this task regression analysis. A classical example is the problem of detecting incoming spam mail by learning a classification rule from previously labeled messages.
    The task of unsupervised machine learning is to find a classifier f without any prespecified labeling of data. Typically one searches for a function that maps similar inputs to similar outputs. If the set of outputs Y is small and discrete, the task is referred to as clustering and somewhat resembles the problem of quantization. If the outputs are continuous, the task can often be related to either factor analysis or noise reduction.
    Finally, the problem of semi-supervised machine learning is to find a classifier for a dataset where some instances have labels and some don't. This can sometimes be regarded as a task of missing value imputation.
    The typical approach to all types of machine learning tasks is to define a loss measure L(f), which estimates how poorly a given classifier f fits the data, and then to search for a classifier that minimizes this measure, i.e.:

    \min_f L(f)

    This defines machine learning as a certain kind of pattern analysis.
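
    As a toy illustration of this view, here is a minimal sketch in Python with made-up data and a deliberately simple family of classifiers (one-parameter thresholds); it is not any particular algorithm, just the \min_f L(f) recipe spelled out literally.

      import numpy as np

      # Toy supervised data: instances x_i with "true" outputs y_i (say, spam = 1, ham = 0).
      x = np.array([0.1, 0.4, 0.5, 1.2, 1.5, 2.0, 2.2, 3.0])
      y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

      # A simple parameterized family of classifiers: f_t(x) = [x > t].
      def f(t, x):
          return (x > t).astype(int)

      # The loss measure L(f): here, the fraction of misclassified training instances.
      def loss(t):
          return np.mean(f(t, x) != y)

      # Machine learning as pattern analysis: search the (discretized) space of
      # classifiers for the one minimizing the loss.
      candidates = np.linspace(x.min(), x.max(), 200)
      t_best = min(candidates, key=loss)
      print(t_best, loss(t_best))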

    Soft Computing
    Whereas the previous terms denote purely mathematical approaches, for which the exact computational method of obtaining a solution is somewhat irrelevant, this term, on the contrary, denotes a set of metaheuristic computational techniques for obtaining such solutions. In the narrowest sense, by soft computing one might collectively denote the areas of evolutionary algorithms, fuzzy logic and neural networks. This gives soft computing a somewhat ad-hoc flavour: "If you don't know how to solve it, simply put it into a neural network, neural networks just work".

    Data Mining
    This might be just my opinion, but I define data mining as any kind of pattern analysis that is applied in an offline setting. Moreover, I'd say that, ideally, true data mining happens when you attempt to search for patterns in data that was not initially meant for that specific analysis, thus contradicting the classical statistical ideology of "specify your hypothesis first, collect the data later".

    Knowledge Discovery in Databases
    As far as I understand it, KDD is just a YABA and marketingspeak for "data mining", somewhat in the same manner as OLAP and Business Intelligence are marketingspeak for "descriptive statistics".

    Data-Driven Statistics
    In the narrowest sense, data-driven statistics denotes all kinds of nonparametric statistical approaches: the methods where one manages to perform data analysis without specifying any parameterized distributions for the inputs or the like. Typical examples would be the various resampling and randomization techniques. In the broad sense, again, this might very much be the same as pattern analysis in general.
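
    As an illustration of the resampling flavour, here is a minimal bootstrap sketch in Python; the data is made up and the choice of the median as the statistic is mine, but it shows how a confidence interval can be obtained without any parametric assumptions.

      import numpy as np

      rng = np.random.default_rng(0)

      # Some made-up measurements, with no distributional assumptions attached.
      data = np.array([3.1, 2.7, 4.5, 3.9, 5.2, 2.9, 4.1, 3.6, 4.8, 3.3])

      # Bootstrap: resample the data with replacement many times and look at the
      # spread of the statistic of interest (here, the median).
      medians = [np.median(rng.choice(data, size=len(data), replace=True))
                 for _ in range(10000)]
      low, high = np.percentile(medians, [2.5, 97.5])
      print("median:", np.median(data), "95% bootstrap interval:", (low, high))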

    Any other popular terms for the same thing I've forgotten here?


  • Posted by Konstantin 16.10.2008 2 Comments

    Dan once said that I should produce more Weltschmerz in this blog. However, I find this area extremely complex to write about, because it is really difficult to be even marginally constructive or at least analytic in this field. Anyway, here's an attempt to touch on the classics.

    Today we had a "CS institute day" - a local-scale PR-event aiming at introducing the inner workings of the computer science institute to the students and answering whatever questions they might have about their present or future studies. As it often is the case with such events, the majority of the attendees were not the students but rather the members of the faculty who were either willing to help answer potential questions or just curious about the event. Not more than a dozen students attended. This is unfortunate, as it displays a significant lack of interest of the students towards their studies. There are several reasons for that, the most prominent being perhaps the following two issues:

    • The first "problem" is that studies are free (for most people) here and many students regard them as a nuisance rather than as a way to learn necessary skills. It's often about "I go to university because of some stupid tradition" rather than "I study because I really need knowledge and skills". I heard that the attitude is different in those universities where students pay for their studies. Of course, I'm not promoting the idea of paid studies, but I think that the students would benefit if they could at least mentally put themselves in a situation where the university is something expensive and optional (rather than free and necessary). I'm not sure how to do that, though: whenever the faculty attempts to arrange a motivational event, noone attends.
    • The second "problem" lies in the bloated market demand for IT specialists of any level. Nowadays one can easily get a disproportionately well-paid code-monkey position without any education. This will change, however. As the price level rises, Estonia quickly loses its appeal as an IT outsource country. The internal market for IT solutions does exist, but the pool of available developers grows faster than demand. Three years ago the local large IT companies were literally fighting to recruit as many IT students of any age as possible. Today they are probably still glad to get new people, but it's not as critical. Tomorrow they won't have free projects to assign to arbitrary new people. Finally, the day after tomorrow we are hopefully going to see real competition in IT-skills, which will be the normal situation for an industry. It is high time for a lazy student to think about that seriously now.


  • Posted by Konstantin 10.10.2008 4 Comments

    Note: This is a brief transcript of actions related to my recent hard disk crash recovery. I'm publishing it here in the hope it might be of use to someone someday, just as all those similar posts out there are every so often of great use to me. So if you, dear reader, are not really interested in replacing the hard disk of your DELL PC, you'd better save your time and skip to the next post.

    A couple of days ago I accidentally managed to test what happens if you throw a running 3-year-old DELL XPS M1210 laptop from a height of about 1.3 meters onto a solid floor. The result, fortunately, was radically different from what one might expect after observing that painful inelastic crash. After switching the screen off for a period of 5 seconds or so (the reasons behind this behavior being a complete mystery to me), the laptop happily resumed as if nothing had happened. A later examination revealed that the only damaged part was the hard disk, on which a number of files had become unreadable.

    I presumed that it must have been the result of a strong head crash. Common knowledge suggests that a head-crashed hard disk is best replaced, even if it continues functioning, because new head crashes are much more probable after the first one has happened. Thus, I went to the shop and got myself a new average-grade 160 GB SATA hard disk (these are surprisingly cheap nowadays, btw). I also bought a small external case to put my old drive into, so that I could continue using it as an external storage medium.

    In principle, the whole procedure could have ended here: I could have taken the old hard disk out, put the new one in, installed the OS onto the new disk from a CD and copied my files from the old disk with the help of the external case. However, there was one thing I didn't like about this approach. If I were to restart with a blank OS, I'd have to bother installing the proper drivers as well as several pieces of software that were shipped with the factory preinstallation of Windows XP (in particular, that enormously useless yet really impressive Logitech Video Effects webcam feature). I knew, however, that there was a way to restore the disk to the factory setting so I went on to figure out how to do it for a new hard disk. In hindsight, the whole procedure is actually rather straightforward, at least if you appreciate the virtues of the dd command.

    DELL System Restore Partition

    It seems that most DELL laptops nowadays come with a "DELL System Restore" feature, which means that there is a hidden partition on the hard disk that stores the factory-installed Ghost image of the system drive. If at boot time you press Ctrl+F11, you get the opportunity to write this image to disk thus destroying all your data but bringing the computer to its retail state.

    The hidden partition is labeled DellRestore and besides the image file it contains a bootable DOS operating system (yes, real DOS!), the program for restoring the Ghost image (restore.exe) and an autoexec.bat script that invokes this program on boot. So, if I'm not mistaken, by pressing Ctrl+F11 you just direct the laptop to boot the DellRestore partition.

    In theory, if you mirror the partition structure of your old hard drive on the new one and copy the data of the DellRestore partition, it should be possible to simply put in the new drive and press Ctrl+F11 for a proper restoration. However, my new drive was larger in capacity than the old one so I didn't want to mirror the partition structure exactly and I ended up with a somewhat more complicated procedure. The following is an approximate list of steps I went through. Don't try repeating them unless you really understand everything, though. You are also strongly advised to visit this site first.

    The Walkthrough

    1. I needed two bootable CDs here: one Knoppix Live CD and one DOS bootable CD.
    2. First, keep the old (damaged) hard disk in the laptop, put the new disk into the external case, but don't plug it in yet.
    3. Boot the Knoppix CD; when it starts, switch to the command line (Ctrl+Alt+F1).
    4. Ensure that the laptop's hard disk is seen as /dev/sda and examine its partition table:
      # cfdisk /dev/sda
    5. It should have 3 partitions: DellUtility (sda1), the main NTFS partition (sda2) and DellRestore (sda3). Write down their sizes and quit cfdisk.
    6. Plug the new disk (the one in the external case) into the USB port. Ensure that the OS detects it: type
      # dmesg
      and examine the output.
    7. Create the partition table on the new disk. Type
      # cfdisk /dev/sdb
      and create a partition table for the new disk that mimics the one of the old disk. In my case I created the first DellUtility partition of the same size and type as sda1. I then added one NTFS partition (for the main Windows OS) and a small Linux partition (just for fun), choosing the sizes of both to my liking. Finally, the fourth partition was created with the same size and type as sda3. See the figure below.

      [Figure: New Drive's Partition Table]

    8. Copy the data of sda1 to sdb1 and sda3 to sdb4.
      # dd if=/dev/sda1 of=/dev/sdb1 bs=1024
      # dd if=/dev/sda3 of=/dev/sdb4 bs=1024
    9. Copy the master boot record of the old drive to the new one.
      # dd if=/dev/sda of=/dev/sdb bs=446 count=1
    10. Now shut down Knoppix:
      # shutdown -h now
      remove the old disk from the laptop, take the new disk from the case and put it into the laptop. Now boot the DOS CD.
    11. When in DOS, go to the subdirectory bat\ on drive C: (this is the DellRestore partition):
      > cd c:\bat
    12. Execute restore.bat
      > restore.bat
      This should initiate the restore process and write the factory image to your second partition.
    13. After the restore process completed, I for some reason still had the problem that the system would not boot. I booted Knoppix, opened the partition table (cfdisk /dev/sda) and discovered that the bootable flag of the second partition had disappeared during the restoration process. Making the partition bootable again and restarting fixed the problem.

    That's it.


  • Posted by Konstantin 08.10.2008 No Comments

    I received a number of "why" and "how" questions regarding the pri.ee domain name of this site and I thought the answers are worth a post. The technically savvy audience can safely skip it, though.

    The pri.ee subdomain is reserved by EENet for private individuals who have an Estonian ID code. The registration is free of charge and very simple: you just fill in a short form and wait a day or two until your application is processed. As a result you end up with a simple, affiliation-free way of designating your site. Of course, it does not have the bling of a www.your-name.com, but I find it quite appropriate for an aspiring blog (and besides, I'm just too greedy and lazy to bother paying for the privilege of a flashy name for my homepage).

    Now on to the "how" part. The only potentially tricky issue of the registration process is the need to fill in the "Name servers" field. Why do you need that and why can't you just directly provide the IP address of the server where you host your site? Well, if you could register the specific IP of your server with EENet, you would have to to contact EENet every time your hosting provider changed, right? In addition, you would need to bother EENet about any subdomain (i.e. <whatever>.yourname.pri.ee) you might be willing to add in the future. Certainly not the most convenient option. Therefore, instead of providing an IP address directly, you specify a reference to an intermediate server, which will perform the mapping of your domain name (and any subdomains) to IP addresses. That's how the internet domain naming system actually works.

    So which name server should you choose? Most reasonable hosting providers (that is, the ones that allow you to host arbitrary domains) let you use their name servers for mapping your domain name. The exact server names depend on the provider, and you should consult the documentation. For example, if you were hosting your site at 110mb.com (which is here just an arbitrarily chosen example of a reasonable free web host I'm aware of), the corresponding name servers would be ns1.110mb.com and ns2.110mb.com.

    However, using the name server of your provider is, to my mind, not the best option. In most cases the provider will not allow you to add subdomains and if you change your hosting you'll probably lose access to the name server, too. Thus, a smarter choice would be to manage your domain names yourself using an independent name server. Luckily enough, there are several name servers out there that you can use completely free of charge (or for a symbolic donation): EveryDNS and EditDNS are two examples of such services that I know of.

    After you register an account with, say, EveryDNS, you can specify the EveryDNS nameservers (ns1.everydns.net, ..., ns4.everydns.net) in the pri.ee domain registration form. You are now free to configure arbitrary address records for yourname.pri.ee or <whatever>.yourname.pri.ee to your liking.
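
    Once the records are in place, a quick way to check that the name resolves as intended (yourname.pri.ee is, naturally, just a placeholder here) is a two-line Python script:

      import socket

      # Resolve the (hypothetical) domain name to whatever IP address
      # the name servers currently map it to.
      print(socket.gethostbyname("yourname.pri.ee"))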

    To summarize, here is how one can get a reasonable website with a pri.ee domain name for free:

    1. Register with a reasonable web hosting provider
      • 110mb.com is one simple free option (with the exception that they charge a one-time $10 fee if you need MySQL)
      • other options
    2. Register a DNS account
    3. Fill out this form.
      • If you chose EveryDNS in step 2, state ns1.everydns.net, ns2.everydns.net, ns3.everydns.net, ns4.everydns.net as your name servers.
      • Wait for a day or two.
    4. Suppose you applied for yourname.pri.ee (the domain is still free, by the way!), then:
      • Add this domain in your hosting's control panel and upload your website.
      • Add this domain to your DNS account.
        • You can add an A ("address") record mapping yourname.pri.ee to an IP address.
        • Alternatively, you can add an NS ("name server") record delegating yourname.pri.ee further to ns1.110mb.com (or whatever name server your hoster provides).
    5. Profit!


  • Posted by Konstantin 04.10.2008 No Comments

    Probability theory is often used as a sound mathematical foundation to formalize and derive solutions to real-life problems in fields such as game theory, decision theory or theoretical economics. However, it often turns out that the somewhat simplistic "traditional" probabilistic approach is insufficient to formalize the real world, and this results in a large number of rather curious paradoxes.

    One of my favourite examples is Ellsberg's paradox, which goes as follows. Imagine that you are presented with an urn containing 3 white balls and 5 other balls, each of which can be gray or black (but you don't know how many of these 5 are gray and how many are black). You will now draw one ball from the urn at random, and you have to choose one of the two gambles:

    1A)
    You win if you draw a white ball.
    1B)
    You win if you draw a black ball.

    Which one would you prefer to play? Next, imagine the same urn with the same balls, but the following choice of gambles:

    2A)
    You win if you draw either a white or a gray ball.
    2B)
    You win if you draw either a black or a gray ball.

    The paradox lies in the fact that most people would strictly prefer 1A to 1B, and 2B to 2A, which seems illogical. Indeed, let the number of white balls be W=3, the number of gray balls be G, and the number of black balls be B. If you prefer 1A to 1B, you seem to be presuming that W > B. But then W+G > B+G, and you should necessarily also prefer 2A to 2B.

    What is the problem here? Of course, the problem lies in the uncertainty behind the number of black balls. We know absolutely nothing about it, and we have to guess. Putting it in Bayesian terms, in order to make a decision we have to specify our prior belief about the probability that there are 0, 1, 2, 3, 4 or 5 black balls in the urn. The classical way of modeling "complete uncertainty" would be to state that all the options are equiprobable. In this case the probability of having more black balls in the urn than white balls is only 2/6 (this can happen when there are 4 or 5 black balls, each option having probability 1/6), and it is therefore reasonable to bet on the whites. We should therefore prefer 1A to 1B and 2A to 2B.
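
    A few lines of Python make this uniform-prior reasoning explicit (this is just the computation described above, nothing more): blacks outnumber whites with probability 2/6, and the expected winning chances favour 1A over 1B and 2A over 2B.

      from fractions import Fraction

      W, TOTAL = 3, 8                                   # 3 white balls, 8 balls in total
      priors = {b: Fraction(1, 6) for b in range(6)}    # uniform prior over 0..5 black balls

      # Probability that black balls outnumber the white ones.
      print(sum(p for b, p in priors.items() if b > W))                    # 2/6 = 1/3

      # Expected probability of winning each gamble under the uniform prior.
      win_1A = Fraction(W, TOTAL)
      win_1B = sum(p * Fraction(b, TOTAL) for b, p in priors.items())
      win_2A = sum(p * Fraction(TOTAL - b, TOTAL) for b, p in priors.items())
      win_2B = Fraction(5, TOTAL)
      print(win_1A, win_1B, win_2A, win_2B)             # 3/8 > 5/16 and 11/16 > 5/8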

    Real life, however, demonstrates that the above logic does not adequately describe how most people decide in practice. The reason is that we would not assign equal probabilities to the presumed numbers of black balls in the urn. Moreover, in the two situations our prior beliefs would differ, and there is a good reason for that.

    If the whole game were real, there would be someone who had proposed it to us in the first place. This someone was also responsible for the number of black balls in the urn. If we knew who this person was, we could base our prior belief on our knowledge of that person's motives and character traits. Is he a kindhearted person who wants us to win, or is he an evil adversary who would do everything to make us lose? In our situation we don't know, and we have to guess. Would it be natural to presume the uncertainty to be a kindhearted friend? No, for at least the following reasons:

    • If the initiator of the game is not a complete idiot, he would aim at gaining something from it, or why would he arrange the game in the first place?
    • If we bet on the kindness of the opponent we can lose a lot when mistaken. If, on the contrary, we presume the opponent to be evil rather than kind, we are choosing a more robust solution: it will also work fine for the kind opponent.

    Therefore, it is natural to regard any such game as being played against an adversary and skew our prior beliefs to the safer, more robust side. The statement of the problem does not clearly require the adversary to select the same number of black balls for the two situations. Thus, depending on the setting, the safe side may be different. Now it becomes clear why in the first case it is reasonable to presume that the number of black balls is probably less than the number of white balls: this is the only way the adversary can make our life more difficult. In the second case, the adversary would prefer the contrary: a larger number of black balls. Therefore, we would be better off reversing our preferences. This, it seems to me, explains the above paradox and also nicely illustrates how the popular way of modeling total uncertainty using a uniform prior irrespective of the context fails to consider the real-life common sense biases.

    A somewhat strange issue remains, however. If you now rephrase the original problem more precisely and specify that the number of black balls is uniformly distributed, many people will still intuitively tend to prefer 2B over 2A. One reason for that is philosophical: we might believe that the game with a uniform prior on the black balls is so unrealistic that we shall never really have the chance to take a decision in such a setting. Thus, there is nothing wrong in providing a "wrong" answer for this case, and it is still reasonable to prefer the "wrong" decision because in practice it is more robust. Secondly, I think most people never really grasp the notion of true uniform randomness. Intuitively, the odds are always against us.

    Appendix

    There are still a couple of subtleties behind Ellsberg's problem, which might be of limited interest to you, but I find the discussion somewhat incomplete without them. Read on if you really want to get bored.

    Namely, what if we specifically stress that you have to play both games, and both of them with the same urn? Note that in this case the paradox is not that obvious any more: you will probably think twice before betting on white and black simultaneously. In fact, you'd probably base your decision on whether you wish to win at least one of the games or rather both of them. Secondly, what if we say that you play both games simultaneously by picking just one ball? This would provide an additional twist, as we shall see in a moment.

    I. Two independent games

    So first of all, consider the setting where you have one urn, and you play two games by drawing two balls with replacement. Consider two goals: winning at least one of the two games, and winning both.

    I-a) Winning at least one game

    To understand the problem, we compute the probabilities of winning gambles 1A, 1B, 2A and 2B for different numbers of black balls, and then the probabilities of winning at least one of the two games for our different choices:

    Black balls | Probability of winning a gamble | Probability of winning one of two gambles
                |  1A    1B    2A    2B           | 1A or 2A  1A or 2B  1B or 2A  1B or 2B
         0      | 3/8   0/8   8/8   5/8           |    1       39/64       1       40/64
         1      | 3/8   1/8   7/8   5/8           |  59/64     39/64     57/64     43/64
         2      | 3/8   2/8   6/8   5/8           |  54/64     39/64     52/64     46/64
         3      | 3/8   3/8   5/8   5/8           |  49/64     39/64     39/64     49/64
         4      | 3/8   4/8   4/8   5/8           |  44/64     39/64     38/64     52/64
         5      | 3/8   5/8   3/8   5/8           |  39/64     39/64     39/64     55/64
    Probabilities of winning various gambles for different numbers of black balls

    Now the problem can be regarded as a classical game between us and "the odds": we want to maximize our probabilities by choosing the gamble correctly, and "the odds" wants to minimize our chances by providing us with a bad number of black balls. The game presented above has no Nash equilibrium in pure strategies, but it seems that the choice of 3 black balls is the worst for us on average. And if we follow this pessimistic assumption, we see that the correct choice would be to pick consistently either both "A" or both "B" gambles (a further look suggests that both "A"-s is probably the better choice of the two).

    I-b) Winning two games
    Next, assume that we really need to win both of the games. The following table summarizes our options:

    Black balls | Probability of winning a gamble | Probability of winning two gambles
                |  1A    1B    2A    2B           | 1A and 2A  1A and 2B  1B and 2A  1B and 2B
         0      | 3/8   0/8   8/8   5/8           |   24/64      15/64       0          0
         1      | 3/8   1/8   7/8   5/8           |   21/64      15/64      7/64       5/64
         2      | 3/8   2/8   6/8   5/8           |   18/64      15/64     12/64      10/64
         3      | 3/8   3/8   5/8   5/8           |   15/64      15/64     15/64      15/64
         4      | 3/8   4/8   4/8   5/8           |   12/64      15/64     16/64      20/64
         5      | 3/8   5/8   3/8   5/8           |    9/64      15/64     15/64      25/64
    Probabilities of winning various gambles for different numbers of black balls

    This game actually has a Nash equilibrium, realized when we select options 1A and 2B. Remarkably, it corresponds exactly to the claim of the paradox: when we need to win both games and are pessimistic about the odds, we should prefer the options with the least amount of uncertainty.
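
    For what it's worth, the "winning both games" numbers above are easy to recompute with a few lines of Python (two independent draws with replacement, so the probabilities simply multiply):

      from fractions import Fraction
      from itertools import product

      def win_prob(gamble, blacks):
          # Probability of winning a single gamble when there are `blacks` black balls.
          white, total = 3, 8
          gray = total - white - blacks
          return Fraction({"1A": white, "1B": blacks,
                           "2A": white + gray, "2B": blacks + gray}[gamble], total)

      # Winning both of two independent games: the probabilities multiply.
      for first, second in product(["1A", "1B"], ["2A", "2B"]):
          row = [win_prob(first, b) * win_prob(second, b) for b in range(6)]
          print(first, "and", second, [f"{int(p * 64)}/64" for p in row])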

    II. Two dependent games
    Finally, what if both games are played simultaneously by taking just one ball from the urn. In this case we also have two versions: aiming to win at least one, or aiming to win both games.

    II-a) Winning at least one game
    The solution here is to choose either the 1A-2B or the 1B-2A version, which always guarantees exactly one win. Indeed, if you pick a white ball, you win 1A, and otherwise you win 2B. The game matrix is the following:

    Black balls | Probability of winning a gamble | Probability of winning one of two gambles
                |  1A    1B    2A    2B           | 1A or 2A  1A or 2B  1B or 2A  1B or 2B
         0      | 3/8   0/8   8/8   5/8           |    1         1         1        5/8
         1      | 3/8   1/8   7/8   5/8           |   7/8        1         1        5/8
         2      | 3/8   2/8   6/8   5/8           |   6/8        1         1        5/8
         3      | 3/8   3/8   5/8   5/8           |   5/8        1         1        5/8
         4      | 3/8   4/8   4/8   5/8           |   4/8        1         1        5/8
         5      | 3/8   5/8   3/8   5/8           |   3/8        1         1        5/8
    Probabilities of winning various gambles for different numbers of black balls

    II-b) Winning both games
    The game matrix looks as follows:

    Black balls | Probability of winning a gamble | Probability of winning two gambles
                |  1A    1B    2A    2B           | 1A and 2A  1A and 2B  1B and 2A  1B and 2B
         0      | 3/8   0/8   8/8   5/8           |    3/8         0          0         0
         1      | 3/8   1/8   7/8   5/8           |    3/8         0          0        1/8
         2      | 3/8   2/8   6/8   5/8           |    3/8         0          0        2/8
         3      | 3/8   3/8   5/8   5/8           |    3/8         0          0        3/8
         4      | 3/8   4/8   4/8   5/8           |    3/8         0          0        4/8
         5      | 3/8   5/8   3/8   5/8           |    3/8         0          0        5/8
    Probabilities of winning various gambles for different numbers of black balls

    The situation here is the opposite of the previous one: if you win 1A, you necessarily lose 2B, so here you have to bet on both "A"-s to achieve a Nash equilibrium.

    Summary
    If you managed to read to this point, then I hope you've got the main idea, but let me summarize it once more: the main "problem" with Ellsberg's paradox (as well as a number of other similar paradoxes) may be partly due to the fact that pure "uniform-prior" probability theory is not the correct way to approach game-theoretical problems, as it tends to hide from view a number of aspects that we, as humans, usually handle nearly subconsciously.


  • Posted by Konstantin 27.09.2008 No Comments
    Python 2.6 is out!

    "Python" by xkcd, abridged version. Full version here.


  • Posted by Konstantin 25.09.2008 13 Comments

    Money

    Money is something we take for granted, and we usually don't think too much about what it actually is. As a result, when it comes up as a discussion topic, one may easily discover that nearly everyone has a personal twist on the understanding of what exactly "money" is and how this notion should be interpreted. At least that is the case within the circle of people who are far from economics; and computer science people most often are. Last week this fact received yet another confirmation, when Swen came up with some fiscal policy-related problem, which didn't seem like a problem to me at all. After a long round of discussion we found out that when we said "money" we meant different things. So I thought I'd put down my understanding of this notion in words, so that next time I could point here and say: look, that's what I mean.

    Of course, I don't pretend to be even a bit authoritative. After all, I don't know anything about economics (although I did take the elementary course, as any computer science student should). But I do find my understanding quite consistent with observation and common knowledge.

    The Ideal Case

    It's easy to start by imagining a country, call it "Veggieland", where people are completely inactive. They don't eat, work or play. They don't shop, produce or consume. They just lie around like vegetables, enjoy the sun and somehow manage to stay alive nonetheless. It is clear that in this world no money is needed: there is simply nothing to do with it.

    Next, imagine that suddenly one veggielander Adam started longing for a glass of water, but he had no idea where to get it. At the same time another veggielander Betty saw Adam's problem and said: "Hey Adam, I can give you a glass of water, but you have to scratch my back for that", because Betty really liked when someone scratched her back. They exchanged services and were happy. Still no money was needed in Veggieland.

    After that day Adam discovered that he was really good at scratching Betty's back, and he even liked it. So next time he came to Betty and proposed to scratch her back even though he didn't want water. What he asked for in return was a promise that she would get him a glass of water the next time he asked. After a month or so of scratching Betty's back he had collected 100 water glass promises, each written in Betty's hand on a nice pink piece of paper, and he found out that this was much more water than he would need until the end of his life. But he had to do something with these promises, because he couldn't "unscratch" Betty's back, right? So he went to that other guy Carl and said: "Hey, Betty has promised me a glass of water, and I can pass this promise on to you if you give me that shiny glass bead that you have, because I adore glass beads". Carl always trusted Betty's promises, so he agreed. Now Adam had a bead and 99 water promises, and Carl had one water promise. Carl then went to Betty and showed her the pink piece of paper; Betty exchanged it for a glass of water and burned it. Now there were just 99 promises left in circulation in Veggieland.

    A month passed, and the veggielanders understood that they could actually pass Betty's promises around, and give out their own promises in exchange for services. They wrote promises on pink pieces of paper and picked the "water glass promise" as a universal measure: now Daffy was providing floozies in exchange for 2 water glass promises, and Ewan was selling tiddlybums at a price of 3 water glasses per portion. The system worked, because it satisfied two simple conditions:

    1. Firstly, at any given moment the amount of work still owed was equal to the number of these pink paper promises in circulation.
    2. Secondly, the amount of promised work was always feasible.

    These conditions were easy to satisfy in Veggieland, because veggielanders were all honest and friendly people. They decided that any time someone fulfilled one of their own promises, they should destroy one pink paper, and any time someone gave a promise, they should write a new pink paper, but no one should give a promise they couldn't keep.

    As a result the people of Veggieland ended up with a perfect monetary system, driving their economy and allowing citizens to trade services and goods. The analogy between this system and the real monetary systems around us is straightforward, yet it is much easier to understand its properties, and thus the properties of real-world money. The following are the important observations:

    • Money itself is just an abstract promise; it has no form or inherent value. As long as conditions 1 and 2 are satisfied, there is really no difference whether the function of money is performed by pink papers, glass beads, or even word-of-mouth promises. Money is not a resource; don't be misled by the gold-standard-era idea that money equals gold. The point of the gold standard was just that, at the time, gold could somehow satisfy conditions 1 and 2 better than printed money and checks. Otherwise, there is no real need to "back" money with goods, except to provide people with the illusion that money is indeed worth its promise.
    • The amount of money in circulation does not equal the amount of natural resources of the land. It is equal to the amount of economic activity, i.e., the number of unfulfilled promises. If someone discovers a gold mine, or simply brings a lot of gold into the country, this will begin to have an impact only when people start actually demanding this resource and giving out promises in exchange.
    • Who makes money? Well, ideally, each person that gives a reasonable promise in exchange for a real service increases the amount of money in circulation, and each person that fulfills a promise decreases it. Of course, real people are not always honest, and therefore real monetary systems are a crude approximation of this ideal, with banks being responsible for the dynamic process of money emission.
    • The choice of currency does make a difference. If someone comes to Veggieland with a bunch of euros (which are, in a sense, promises given by the Europeans), the veggielanders won't be inclined to exchange them for their own promises, because although the newcomer can easily make use of the veggielanders' services, the contrary is not true: veggielanders just never leave their country! A corollary of this observation is that the larger the geographical span of a particular currency, the further away it will be from the ideal trade unit.

    The Real Life

    Now let's look at how a monetary system is implemented in real countries. There are two main aspects to this implementation.

    The first issue is the choice of, and social support for, the base monetary unit. Each country typically figures out its favourite name and denomination for the unit and calls it its "currency". A social contract is then made among the citizens that all promises in the country are to be measured in this unit, and that everyone should accept this unit in return for services. Note that this social contract guarantees the absolute liquidity of the currency and sets it strictly apart from anything else you might imagine to be a "close analogue" of money - be it gold, government bonds or securities. The latter are just not liquid enough to work as a general promise-container.

    The second issue is the process of emitting money into circulation and removing it again. This certainly cannot be entrusted to individuals, as it was in Veggieland; a banking system serves the purpose instead. The idea is that a single central trusted party is chosen, which emits and discards money. A person comes to the bank asking for a loan; the bank confirms that the person is trustworthy enough to keep his promises and gives him some money: money has just been emitted into circulation. Note that in our digital world most money is emitted in digital form; no paper money needs to be printed for that. In other words, the number you see on your account is actually larger than the amount of paper money stored "behind" it. This is very natural, because otherwise the bank would have to keep paper-money printing and destroying machines on its premises, which would be dangerous.

    Most real monetary systems are just a bit more complicated, having a two-level banking system. The central bank crudely regulates the money supply by emitting paper money, giving loans to commercial banks and instructing them, via the reserve requirements, as to how much "more" money they may emit. The commercial banks then provide the actual banking services to the people. With this distributed system people have a choice of several banks to trust their finances to, not just one. And if it turns out that one bank emits too much (i.e. gives too many loans), so that the emitted promises cannot be met, only the people trusting this bank (i.e. holding "this bank's" money) will suffer.

    Final Questions

    Despite the fact that the above theory seems consistent to me, it is not without complications. As I believe you're already tired of reading this post, I'll avoid spilling too much further thought upon you and just briefly list the three points I find most confusing.

    • Despite being an abstract trade unit, money actually has value: it can earn interest. How does this interest, "earned from nowhere", correspond to money emission? Why should owning lots of abstract promises necessarily generate more abstract promises?
    • Internet banks take a fee for transfers, which makes digital money somewhat less liquid than paper money. I find it incorrect. Why should I pay to pay? I understand that banks need resources to support their services, but the amount of resources should not be proportional to the number of operations performed.
    • If all the customers of a bank one day come to claim their savings in cash, the bank won't have enough cash and it will be perceived as a collapse for the bank. Considering the fact that digital money is, in fact, still real money, I see this as a certain drawback of the current two-level banking system. Or is it a necessary condition to distribute trust among the commercial banks?

    If you think I'm completely wrong with all that, you are welcome to state your opinion in the comments.


  • Posted by Konstantin 19.09.2008 No Comments

    In yesterday's early-early-morning discussion with Meelis, we had an idea that it would be nice to have some kind of an interactive Javascript-based version of the Patmatch-like binding site visualization. I found the idea relevant and easy to implement, so I couldn't resist hacking up a small example.

    [Interactive demo: binding site visualization with an adjustable match threshold (80%)]






  • Posted by Konstantin 17.09.2008 2 Comments

    Suppose the other day you came up with a fresh and brilliant idea: to make an automated system that would track the social and economic situation in various countries and predict when and where a new war is going to happen. So you collected historical data on numerous incidents in the world, fed that data into your favourite machine learning model, played around with the parameters, and voilà - you've got something that seems to work. There is just one thing left to do - assess whether the system you created is any good. At the very least, you need to measure its precision: the probability that a positive prediction will in fact be true.

    But how do you measure precision in this case? The straightforward way would be to wait a couple of years and compare the numbers of actual and predicted wars on the planet - this would provide an approximate estimate of the true generalization ability of your system. However, this is not a very convenient option, so you'd better figure out how to assess your new system using only the existing data. And this is when things can get confusing (at least if you think long enough). The most popular solutions here are, of course, holdout testing and cross-validation, but their interpretation is often overlooked, so I believe they are worth thinking about for a minute.

    Holdout

    Conceptually the simplest way is holdout validation: you randomly split your data into a training set of size n and a test set of size m, use the former for training and estimate the performance on the latter. The two things to note here are the following:

    • Due to the finite size of the test set you will only get an approximate idea of the true performance. The rule of thumb is that the performance you measure has an error of the order of 1/sqrt(m); you get this number if you compute the 99% confidence interval under the normality assumption (the small simulation sketched after this list illustrates it).
    • You are actually measuring the performance of the classifier constructed on your training set, and not of the one that you are ultimately interested in. This introduces another bias into your measurement. Most often this bias is pessimistic (i.e. the performance of a classifier trained on a smaller set of data is usually worse than that of the classifier trained on the full dataset); however, if the dataset contains outliers this need not be the case. I am not aware of any results describing the magnitude of this bias, but I presume that for small datasets it depends on the proportion of data used for training.
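
    Here is a minimal simulation sketch of the first point (the "true" precision of 0.8 and the test set sizes are made up): the 99% error margin of the measured precision shrinks roughly as 1/sqrt(m).

      import numpy as np

      rng = np.random.default_rng(1)
      p_true = 0.8          # hypothetical true precision of the classifier

      for m in (100, 400, 1600, 6400):
          # Each holdout experiment: m test predictions, each one correct with probability p_true.
          measured = rng.binomial(m, p_true, size=20000) / m
          margin_99 = 2.576 * measured.std()     # 99% interval half-width under normality
          print(f"m={m:5d}   99% margin ~ {margin_99:.4f}   1/sqrt(m) = {1 / np.sqrt(m):.4f}")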

    Cross-validation

    The idea of cross-validation is to repeat the holdout experiment described above k times and average the results.

    • If each iteration of training resulted in the same classifier, you could say that the resulting measured performance has an error of the order of 1/sqrt(km). However, each iteration of cross-validation results in a different classifier being built on the selected training set. As a result, you are testing k different classifiers, each with a precision of about 1/sqrt(m), taking their average performance, and hoping to obtain a better estimate for the performance of your "main" classifier. This hope is justified, because you believe that all of the k classifiers are similar to the "main" one. But what if that is not true? Can it happen that the variance introduced by the differences among the k classifiers is significant enough to matter? It is, after all, possible, if the data is very scarce. But then again, cross-validation is used precisely when the data is scarce.
    • Similarly to the case with holdout, the cross-validation estimate will most often be pessimistically biased, and it seems that no one really knows how large the bias is going to be.

    Double Cross-validation

    Finally, things get more complicated when you have some "hyper" parameters in your model that you need to tune (such as the λ tradeoff parameter for regularized models). If you simply estimate the performance for a range of parameter values and then pick the one reporting the best value, you are introducing an optimistic bias, because you have, in effect, used the same data both for selecting the model and for testing it. Therefore, it would be fair to "re-test" your algorithm on some additional data. It is not uncommon to see cases when one round of cross-validation is used to select the model parameters, and another round is used to estimate the performance (that is, the training set is in turn split into training and test sets for model selection). But now things get strange.
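
    For concreteness, here is a minimal sketch of this double cross-validation scheme on artificial data; the ridge-regression model, the λ grid and all the sizes are placeholders of mine, and the point is only the structure: the inner round picks λ, the outer round estimates the performance of the whole select-then-train procedure.

      import numpy as np

      rng = np.random.default_rng(2)

      # Artificial regression data (purely illustrative).
      n, d = 100, 10
      X = rng.normal(size=(n, d))
      y = X @ rng.normal(size=d) + rng.normal(scale=0.5, size=n)

      def ridge_fit(X, y, lam):
          # Closed-form ridge regression: w = (X'X + lam*I)^(-1) X'y
          return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

      def mse(w, X, y):
          return np.mean((X @ w - y) ** 2)

      def cv_error(X, y, lam, k=5):
          # Plain k-fold cross-validation error for a fixed hyperparameter lam.
          folds = np.array_split(rng.permutation(len(y)), k)
          errs = []
          for test_idx in folds:
              train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
              w = ridge_fit(X[train_idx], y[train_idx], lam)
              errs.append(mse(w, X[test_idx], y[test_idx]))
          return np.mean(errs)

      # Double ("nested") cross-validation: the inner round selects lambda,
      # the outer round estimates the performance of the whole procedure.
      lambdas = [0.01, 0.1, 1.0, 10.0]
      outer_errs = []
      for test_idx in np.array_split(rng.permutation(n), 5):
          train_idx = np.setdiff1d(np.arange(n), test_idx)
          best_lam = min(lambdas, key=lambda lam: cv_error(X[train_idx], y[train_idx], lam))
          w = ridge_fit(X[train_idx], y[train_idx], best_lam)
          outer_errs.append(mse(w, X[test_idx], y[test_idx]))

      print("doubly-cross-validated MSE estimate:", np.mean(outer_errs))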

    • If on each iteration of "outer" cross validation, we re-select the "high-impact" model parameters using "inner" cross-validation, aren't we introducing just too much variability into the k models of the outer round? What performance are we measuring? Is it optimistically or pessimistically biased now? In general, how close should this doubly-cross-validated performance measure to the real model performance? How should this number be interpreted?

    If a few minutes of thought help you answer these questions, I believe those would be minutes well spent.


  • Posted by Konstantin 11.09.2008 7 Comments

    The internet nowadays is a data miner's paradise, providing unlimited ground for novel ideas and experiments. With just a brief look around one will easily find well-structured data about anything ranging from macroeconomic and business indicators to networks and genes.

    However, among this wild variety of datasets there are a few, which seem to possess an especially high wow-factor, either in terms of their fame, size, content, or the amount of potentially interesting information that is still waiting to be uncovered from them. Here is a short list of my text- and graph-mining favourites.

    1. Enron Emails. (download)
      Need a test set for your brand new text mining algorithm? Take this massive set of about 500 000 email messages from about 150 employees of the Enron Corporation, that were made public during the investigation of the famous fraud incident in 2004.
    2. AOL Search Logs. (download)
      A set of about 20 000 000 search queries made by about 500 000 users, released by AOL in 2006. The data contains an (anonymous) user id, the search keywords and the rank of the result link that was clicked. NB: Besides being a nice research resource, it's a comprehensive collection of (probably outdated) porn links.
    3. OpenCyc Common Knowledge. (download)
      A large database of seemingly arbitrary common knowledge facts, such as "girls' moccasin" is a "moccasin used by female children", which is, in other words, "(ArtifactTypeForUserTypeFn Moccasin FemaleChild)". Comes with a reasoning engine and some serious ambitions.
    4. RealityMining: People Tracks. (on request)
      A continuous activity log of 100 people over a period of 9 months, recorded via a Symbian phone application. Includes call logs, Bluetooth devices in proximity, cell tower IDs, application usage, and phone status. If you ever want to construct a model of human social behaviour (or are just interested in the emerging stalker technologies), this is certainly a place to look at. In contrast to some other entries in this list, the download is just a 38 megabyte file.
    5. DMOZ100k06: Internet Documents and Tags. (download)
      This is a sample of 100 000 webpages from the Mozilla Open Directory project together with their metadata and tags. A must see for those interested in bookmarks, tagging and how it relates to PageRank. Somewhat similar information could probably also be mined from CiteULike data.
    6. SwetoDblp: Ontology of CS Publications. (download)
      Interested in very large graphs and ontologies? Why not take the Computer Science Digital Bibliography and Library Project's data, converted to RDF. Contains about 11 000 000 triples, relating authors, universities and publications.
    7. Wikipedia3: Wikipedia in RDF. (download)
      This is just a conversion of English Wikipedia links and category relations into RDF, which produces a massive 47 000 000 triple dataset, available as a single download. The maintainers of the project, however, promise to develop it further and add more interesting relations to it in the future.
    8. Overstock Product Data. (download)
      Information on more than a million various products and their review ratings, available in a convenient tabular format. I'm not really sure about what can be done with it, but the sheer size of this data requires mentioning it here.
    9. Ensembl: Genomes of 40+ organisms. (download)
      This dataset is of a slightly different flavour than all of the above, and comes from a completely different weight category, but nonetheless it is purely textual and I thought it would be unfair not to mention it. In fact, this is probably one of the most expensive, promising and yet least understood textual datasets in the world. And as long as no one really knows what to do with it, bioinformatics will prosper.

    I still lack one entry to make it a "top 10", so your suggestions are welcome.

    Update:

    1. The Netflix Challenge Dataset.(register)
      A data mining challenge that raised the awareness of the commercial importance of machine learning as a field to a new level. Netflix promises $1 000 000 for the first one to beat their movie rating prediction algorithm by just 10%. The contest will soon celebrate its second year but no one has yet claimed the grand prize, and the stage is open.

    Update from 2015, when the world of data is miles away from what it was in 2008: List of awesome datasets.
