• Posted by Konstantin 27.09.2008 No Comments
    Python 2.6 is out!

    "Python" by xkcd, abridged version. Full version here.


  • Posted by Konstantin 25.09.2008 13 Comments

    Money

    Money is something we take for granted, and we usually don't think much about what it actually is. As a result, when it comes up as a discussion topic, one may easily discover that nearly everyone has a personal twist on what exactly "money" is and how the notion should be interpreted. At least that is the case among people who are far from economics, and computer science people most often are. Last week this fact got another confirmation, when Swen came up with a fiscal policy-related problem that didn't seem like a problem to me at all. After a long round of discussion we found out that when we said "money" we meant different things. So I thought I'd put my understanding of this notion into words, so that next time I could point here and say: look, that's what I mean.

    Of course, I don't pretend to be even a bit authoritative. After all, I don't know anything about economics (although I did take the elementary course, as any computer science student should). But I do find my understanding quite consistent with observation and common knowledge.

    The Ideal Case

    It's easy to start by imagining a country, call it "Veggieland", where people are completely inactive. They don't eat, work or play. They don't shop, produce or consume. They just lie around like vegetables, enjoy the sun and somehow manage to stay alive nonetheless. It is clear that in this world no money is needed: there is simply nothing to do with it.

    Next, imagine that suddenly one veggielander, Adam, started longing for a glass of water, but he had no idea where to get it. At the same time another veggielander, Betty, saw Adam's problem and said: "Hey Adam, I can give you a glass of water, but you have to scratch my back for that", because Betty really liked it when someone scratched her back. They exchanged services and were happy. Still no money was needed in Veggieland.

    After that day Adam discovered that he was really good at scratching Betty's back, and he even liked it. So the next time he came to Betty and offered to scratch her back even though he didn't want water. What he asked in return was a promise that she would get him a glass of water the next time he asked. After a month or so of scratching Betty's back he had collected 100 water glass promises, each written in Betty's hand on a nice pink piece of paper, and he found out that this was much more water than he would need until the end of his life. But he had to do something with these promises, because he couldn't "unscratch" Betty's back, right? So he went to that other guy Carl and said: "Hey, Betty has promised me a glass of water, and I can pass this promise to you if you give me that shiny glass bead that you have, because I adore glass beads". Carl always trusted Betty's promises, so he agreed. Now Adam had a bead and 99 water promises, and Carl had one water promise. Carl then went to Betty and showed her the pink piece of paper; Betty exchanged it for a glass of water and burned it. Now there were just 99 promises left in circulation in Veggieland.

    A month passed, and the veggielanders understood that they could actually pass Betty's promises around and give out their own promises in exchange for services. They wrote promises on pink pieces of paper and picked the "water glass promise" as a universal measure: now Daffy was providing floozies in exchange for 2 water glass promises, and Ewan was selling tiddlybums at a price of 3 water glass promises per portion. The system worked because it satisfied two simple conditions:

    1. Firstly, at any given moment the amount of undone work was equal to the number of pink paper promises in circulation.
    2. Secondly, the amount of promised work was always feasible.

    These conditions were easy to satisfy in Veggieland, because veggielanders were all honest and friendly people. They decided that any time someone fulfilled a promise, he or she should destroy one pink paper, and any time someone gave a promise, he or she should write a new pink paper, but no one should give a promise they couldn't keep.
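
The bookkeeping rules of Veggieland can be sketched in a few lines of Python. This is only a toy model: the class name `PromiseLedger`, its methods and the `capacity` argument (how much work an issuer can realistically deliver) are invented for illustration and are not part of the story above.

```python
class PromiseLedger:
    """Toy model of Veggieland's pink papers (all names are illustrative)."""

    def __init__(self):
        self.in_circulation = 0   # pink papers currently held by anyone
        self.outstanding = {}     # issuer -> promised but not yet done work

    def issue(self, issuer, capacity):
        """Issuer writes one new paper; refuse if the promise is infeasible."""
        owed = self.outstanding.get(issuer, 0)
        if owed + 1 > capacity:   # condition 2: promised work must be feasible
            raise ValueError(f"{issuer} cannot keep another promise")
        self.outstanding[issuer] = owed + 1
        self.in_circulation += 1

    def redeem(self, issuer):
        """Issuer does the promised work and burns one paper."""
        if self.outstanding.get(issuer, 0) == 0:
            raise ValueError(f"{issuer} owes nothing")
        self.outstanding[issuer] -= 1
        self.in_circulation -= 1

    def is_consistent(self):
        """Condition 1: undone work equals the papers in circulation."""
        return sum(self.outstanding.values()) == self.in_circulation


ledger = PromiseLedger()
for _ in range(100):
    ledger.issue("Betty", capacity=100)   # Adam earns 100 water promises
ledger.redeem("Betty")                    # Carl cashes one in, Betty burns it
print(ledger.in_circulation)              # 99
```

Passing a paper from Adam to Carl changes nothing in the ledger, which is the point: only issuing and redeeming affect the amount of money.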

    As a result the people of Veggieland ended up with a perfect monetary system, driving their economy and allowing citizens to trade services and goods. The analogy between this system and real monetary systems is straightforward, yet its properties, and thus the properties of real-world money, are much easier to understand. The following are the important observations:

    • Money itself is just an abstract promise; it has no form or inherent value. As long as conditions 1 and 2 are satisfied, it makes no difference whether the function of money is performed using pink papers, glass beads, or even word-of-mouth promises. Money is not a resource: don't be misled by the idea from the gold standard times that money equals gold. The point of the gold standard was just that, at the time, gold could for some reason satisfy conditions 1 and 2 better than printed money and checks. Otherwise, you don't really need to "back" money with goods for any purpose other than to give people the illusion that money is indeed worth its promise.
    • The amount of money in circulation does not equal the amount of natural resources of the land. It is equal to the amount of economic activity, i.e., the number of unfulfilled promises. If someone discovers a gold mine, or simply brings a lot of gold into the country, this will begin to have an impact only when people start actually demanding this resource and giving out promises in exchange.
    • Who makes money? Well, ideally, each person who gives a reasonable promise in exchange for a real service increases the amount of money in circulation, and each person who fulfills a promise decreases it. Of course, real people are not honest, and therefore real monetary systems are a crude approximation, with banks being in charge of this dynamic money emission process.
    • The choice of currency does make a difference. If someone comes to Veggieland with a bunch of euros (which are, in a sense, promises given by the Europeans), the veggielanders won't be inclined to exchange them for their own promises, because although the newcomer can easily make use of the veggielanders' services, the converse is not true: veggielanders just never leave their country! A corollary of this observation is that the larger the geographical span of a particular currency, the further from the ideal trade unit it will be.

    The Real Life

    Now let's look at how a monetary system is implemented in real countries. There are two main aspects to this implementation.

    The first is the choice of, and social support for, the base monetary unit. Each country typically figures out its favourite name and denomination for the unit and calls it its "currency". A social contract is then made among the citizens that all promises in the country are to be measured in this unit, and that everyone should accept this unit in return for services. Note that this social contract guarantees the absolute liquidity of the currency and sets it strictly apart from anything else you might imagine to be a "close analogue" of money, be it gold, government bonds or securities. The latter are just not liquid enough to work as a general promise-container.

    The second is the process of emitting money into circulation and withdrawing it. This certainly cannot be entrusted to the people, as it was in Veggieland. A banking system serves the purpose. The idea is that a single central trusted party is chosen, which emits and discards money. A person comes to the bank asking for a loan, the bank confirms that the person is trustworthy enough to keep his promises and gives him some money: money has just been emitted into circulation. Note that in our digital world most money is emitted in digital form, and no paper money needs to be printed for that. In other words, the number you see on your account is actually larger than the amount of paper money stored "behind" it. This is very natural, because otherwise the bank would have to keep machines for printing and destroying paper money, which would be dangerous.

    Most real monetary systems are just a bit more complicated, having a two-level banking system. The central bank crudely regulates the money supply by emitting paper money, giving loans to commercial banks and instructing them, via the reserve requirements, as to how much "more" money they may emit. The commercial banks then provide actual banking services to the people. With this distributed system people have a choice of several banks to trust their finances to, not just one. And if it turns out that one bank emits too much (i.e. gives too many loans), so that the emitted promises cannot be met, only the people trusting this bank (i.e. holding "this bank's" money) will suffer.
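
To get a feeling for how a reserve requirement caps the "extra" money, here is a sketch of the simplified textbook multiplier model, which is an assumption brought in for illustration and not something derived above: every loan is re-deposited in full, and the bank keeps a fixed fraction of each deposit in reserve. The function name and parameters are hypothetical.

```python
def max_money_supply(base_money, reserve_ratio, rounds=1000):
    """Total deposits created when each deposit is re-lent minus the reserve.

    Textbook simplification: all lent money returns to the banks as deposits.
    """
    total = 0.0
    deposit = base_money
    for _ in range(rounds):
        total += deposit
        deposit *= (1 - reserve_ratio)   # the part that is lent out again
    return total


# With 100 units of base money and a 10% reserve requirement the sum
# approaches the geometric-series limit base_money / reserve_ratio.
print(round(max_money_supply(100, 0.10)))   # 1000
```

The lower the reserve ratio, the more promises a commercial bank may emit on top of the same base, which is one way to read the "crude regulation" mentioned above.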

    Final Questions

    Although the above theory seems consistent to me, it is not without complications. As I believe you are already tired of reading this post, I'll avoid spilling too much further thought upon you and just briefly list the three points I find most confusing.

    • Despite being an abstract trade unit, money actually has value: it can bring interest. How does this interest, "earned from nowhere", correspond to money emission? Why should owning lots of abstract promises necessarily generate more abstract promises?
    • Internet banks take a fee for transfers, which makes digital money somewhat less liquid than paper money. I find this wrong. Why should I pay to pay? I understand that banks need resources to support their services, but the amount of resources should not be proportional to the number of operations performed.
    • If all the customers of a bank come one day to claim their savings in cash, the bank won't have enough cash, and this will be perceived as a collapse of the bank. Considering that digital money is, in fact, still real money, I see this as a certain drawback of the current two-level banking system. Or is it a necessary condition for distributing trust among the commercial banks?

    If you think I'm completely wrong with all that, you are welcome to state your opinion in the comments.


  • Posted by Konstantin 19.09.2008 No Comments

    In yesterday's early-early-morning discussion with Meelis we had the idea that it would be nice to have some kind of interactive Javascript-based version of the Patmatch-like binding site visualization. I found the idea relevant and easy to implement, so I couldn't resist hacking up a small example.


  • Posted by Konstantin 17.09.2008 2 Comments

    Suppose the other day you came up with a fresh and brilliant idea: to make an automated system that would track the social and economic situation in various countries and predict when and where a new war is going to happen. So you collected historical data on numerous incidents in the world, fed that data into your favourite machine learning model, played around with the parameters, and voilà - you've got something that seems to work. There is just one thing left to do - assess whether the system you created is any good. At the very least, you need to measure its precision: the probability that a positive prediction will in fact be true.

    But how do you measure precision in this case? The straightforward way would be to wait a couple of years and compare the numbers of actual and predicted wars on the planet - this would provide an approximate estimate of the true generalization ability of your system. However, this is not a very convenient option, so you had better figure out how to assess your new system using only the existing data. And this is when things can get confusing (at least if you think long enough). The most popular solutions here are, of course, holdout testing and cross-validation, but their interpretation is often overlooked, so I believe they are worth thinking about for a minute.

    Holdout

    Conceptually the simplest way is holdout validation: you randomly split your data into a training set of size n and a test set of size m, use the former for training and estimate performance on the latter. The two things to note here are the following:

    • Due to the finite size of the test set you will only get an approximate idea of the true performance. The rule of thumb is that the performance you measure has an error of the order of 1/sqrt(m). You get this number if you compute the 95% confidence interval under the normality assumption.
    • You are actually measuring the performance of the classifier constructed on your training set, and not the one that you are interested in. This introduces a bias into your measurement. Most often this bias is pessimistic (i.e. the performance of a classifier trained on a smaller set of data is usually worse than the performance of the classifier trained on the full dataset); however, if the dataset contains outliers this need not be the case. I am not aware of any results describing the magnitude of this bias, but I presume that for small datasets it depends on the proportion of data used for training.
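
The 1/sqrt(m) rule of thumb can be checked with a short calculation. The sketch below assumes the usual normal approximation to a proportion measured on m independent test examples; the function name, the worst-case value p = 0.5 and the multiplier z = 1.96 (the conventional 95% level) are choices made for this illustration.

```python
import math


def holdout_error_margin(m, p=0.5, z=1.96):
    """Half-width of the normal-approximation confidence interval.

    m: test set size, p: assumed true proportion (0.5 is the worst case),
    z: normal quantile (1.96 corresponds to a 95% interval).
    """
    return z * math.sqrt(p * (1 - p) / m)


for m in (100, 1000, 10000):
    # The margin stays just below the rule-of-thumb value 1/sqrt(m).
    print(m, holdout_error_margin(m), 1 / math.sqrt(m))
```

In the worst case the margin is 0.98/sqrt(m), so with a test set of 100 examples a measured precision of 0.80 really means "somewhere between roughly 0.70 and 0.90".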

    Cross-validation

    The idea of cross-validation is to repeat the holdout experiment described above k times and average the results.

    • If on each iteration your training resulted in the same classifier, you could say that the resulting measured performance has an error of the order of 1/sqrt(km). However, each iteration of cross-validation results in a different classifier being built on the selected training set. As a result, you are testing k different classifiers, each measured with a precision of about 1/sqrt(m), taking their average performance, and hoping to obtain a better estimate of the performance of your "main" classifier. This hope is justified if you believe that all of the k classifiers are similar to the "main" one. But what if this is not true? Can it happen that the variance introduced by the differences between the k classifiers is significant enough to matter? It is, after all, possible if the data is very scarce. But then again, cross-validation is used precisely when the data is scarce.
    • Similarly to the case with holdout, the cross-validation estimate will most often be pessimistically biased, and it seems that no one really knows how large the bias is going to be.
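
The procedure is short enough to write out in full. The following is a minimal sketch in plain Python; the toy nearest-centroid classifier, the use of accuracy rather than precision as the score, and all function names are assumptions made only to keep the example self-contained.

```python
import random


def train_centroids(points):
    """Toy classifier: remember the mean x of each class."""
    sums, counts = {}, {}
    for x, label in points:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, x):
    """Assign x to the class with the closest centroid."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


def accuracy(centroids, points):
    hits = sum(predict(centroids, x) == label for x, label in points)
    return hits / len(points)


def cross_validate(points, k, seed=0):
    """Return the mean score over k folds and the k per-fold scores."""
    data = points[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]      # k disjoint test sets
    scores = []
    for i, test in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        scores.append(accuracy(train_centroids(train), test))  # a new model per fold
    return sum(scores) / k, scores


# Two well-separated classes on a line: x in 0..9 versus x in 20..29.
data = [(float(x), 0) for x in range(10)] + [(float(x), 1) for x in range(20, 30)]
mean_score, fold_scores = cross_validate(data, k=5)
print(mean_score, fold_scores)
```

Note that `fold_scores` holds the results of five different classifiers; looking at their spread, and not just at the mean, gives a rough idea of how much the models built on different training sets differ.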

    Double Cross-validation

    Finally, things get more complicated when you have some "hyper" parameters in your model that you need to tune (such as the λ tradeoff parameter for regularized models). If you simply estimate the performance for a range of parameter values and then pick the one reporting the best value, you are introducing an optimistic bias, because you have in fact tested your algorithm on the same data that was used for training. Therefore, it would be fair to "re-test" your algorithm on some additional data. It is not uncommon to see cases where one round of cross-validation is used to select the model parameters, and another round is used to estimate the performance (that is, you split the training set in turn into training and testing sets for model selection). But now things get strange.

    • If on each iteration of the "outer" cross-validation we re-select the "high-impact" model parameters using "inner" cross-validation, aren't we introducing just too much variability into the k models of the outer round? What performance are we measuring? Is it optimistically or pessimistically biased now? In general, how close should this doubly-cross-validated performance measure be to the real model performance? How should this number be interpreted?
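
For concreteness, here is what the doubly-cross-validated procedure looks like as code. It is a sketch under stated assumptions: a toy one-dimensional k-nearest-neighbours classifier stands in for the model, its number of neighbours plays the role of the hyperparameter, accuracy is used as the score, and every function name is hypothetical.

```python
import random
from collections import Counter


def knn_predict(train, x, n_neighbors):
    """Majority label among the n_neighbors training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:n_neighbors]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def knn_accuracy(train, test, n_neighbors):
    hits = sum(knn_predict(train, x, n_neighbors) == y for x, y in test)
    return hits / len(test)


def make_folds(points, k, seed):
    data = points[:]
    random.Random(seed).shuffle(data)
    return [data[i::k] for i in range(k)]


def cv_score(points, k, n_neighbors, seed):
    """Plain (inner) cross-validation score for one hyperparameter value."""
    folds = make_folds(points, k, seed)
    scores = []
    for i, test in enumerate(folds):
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        scores.append(knn_accuracy(train, test, n_neighbors))
    return sum(scores) / k


def nested_cv(points, outer_k, inner_k, candidates, seed=0):
    """Outer loop estimates performance; inner loop re-selects the parameter."""
    folds = make_folds(points, outer_k, seed)
    outer_scores, chosen = [], []
    for i, test in enumerate(folds):
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        # Model selection sees only the outer training part.
        best = max(candidates, key=lambda c: cv_score(train, inner_k, c, seed + 1))
        chosen.append(best)
        outer_scores.append(knn_accuracy(train, test, best))
    return sum(outer_scores) / outer_k, chosen


data = [(float(x), 0) for x in range(10)] + [(float(x), 1) for x in range(20, 30)]
score, chosen = nested_cv(data, outer_k=5, inner_k=4, candidates=[1, 3])
print(score, chosen)
```

The list `chosen` makes the worry above tangible: on real data the selected parameter may differ from one outer fold to the next, so the averaged score describes the whole "select, then train" procedure rather than any single model.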

    If a few minutes of thought help you answer these questions, I believe those will be minutes well spent.


  • Posted by Konstantin 11.09.2008 7 Comments

    The internet nowadays is a data miner's paradise, providing unlimited ground for novel ideas and experiments. With just a brief look around one will easily find well-structured data about anything ranging from macroeconomic and business indicators to networks and genes.

    However, among this wild variety of datasets there are a few that seem to possess an especially high wow-factor, either in terms of their fame, size, content, or the amount of potentially interesting information that is still waiting to be uncovered from them. Here is a short list of my text- and graph-mining favourites.

    1. Enron Emails. (download)
      Need a test set for your brand new text mining algorithm? Take this massive set of about 500 000 email messages from about 150 employees of the Enron Corporation, which were made public during the investigation of the famous fraud incident in 2004.
    2. AOL Search Logs. (download)
      A set of about 20 000 000 search queries, made by about 650 000 users, released by AOL in 2006. The data contains an (anonymous) user id, the search keywords and the rank of the result link that was clicked. NB: Besides being a nice research resource, it's a comprehensive collection of (probably outdated) porn links.
    3. OpenCyc Common Knowledge. (download)
      A large database of seemingly arbitrary common knowledge facts, such as "girls' moccasin" is a "moccasin used by female children", which is, in other words, "(ArtifactTypeForUserTypeFn Moccasin FemaleChild)". Comes with a reasoning engine and some serious ambitions.
    4. RealityMining: People Tracks. (on request)
      A continuous activity log of 100 people over a period of 9 months, recorded via a Symbian phone application. Includes call logs, Bluetooth devices in proximity, cell tower IDs, application usage, and phone status. If you ever want to construct a model of human social behaviour (or are just interested in the emerging stalker technologies), this is certainly a place to look. In contrast to some other entries in this list, the download is just a 38 megabyte file.
    5. DMOZ100k06: Internet Documents and Tags. (download)
      This is a sample of 100 000 webpages from the Mozilla Open Directory project together with their metadata and tags. A must see for those interested in bookmarks, tagging and how it relates to PageRank. Somewhat similar information could probably also be mined from CiteULike data.
    6. SwetoDblp: Ontology of CS Publications. (download)
      Interested in very large graphs and ontologies? Why not take the Computer Science Digital Bibliography and Library Project's data, converted to RDF? It contains about 11 000 000 triples, relating authors, universities and publications.
    7. Wikipedia3: Wikipedia in RDF. (download)
      This is just a conversion of English Wikipedia links and category relations into RDF, which produces a massive 47 000 000 triple dataset, available as a single download. The maintainers of the project, however, promise to develop it further and add more interesting relations to it in the future.
    8. Overstock Product Data. (download)
      Information on more than a million various products and their review ratings, available in a convenient tabular format. I'm not really sure about what can be done with it, but the sheer size of this data requires mentioning it here.
    9. Ensembl: Genomes of 40+ organisms. (download)
      This dataset is of a slightly different flavour than all of the above, and comes from a completely different weight category, but nonetheless it is purely textual and I thought it would be unfair not to mention it. In fact, this is probably one of the most expensive, most promising and yet least understood textual datasets in the world. And as long as no one really knows what to do with it, bioinformatics will prosper.

    I still lack one entry to make it a "top 10", so your suggestions are welcome.

    Update:

    1. The Netflix Challenge Dataset. (register)
      A data mining challenge that raised the awareness of the commercial importance of machine learning as a field to a new level. Netflix promises $1 000 000 for the first one to beat their movie rating prediction algorithm by just 10%. The contest will soon celebrate its second year but no one has yet claimed the grand prize, and the stage is open.

    Update from 2015, when the world of data is miles away from what it was in 2008: List of awesome datasets.


  • Posted by Konstantin 09.09.2008 3 Comments

    Every time you visit this page, a piece of Javascript code will run within your browser, render a small part of the picture below (which is, for the sake of beauty and simplicity, a fragment of the Mandelbrot fractal) and submit the resulting pixels to the server. After 100 visits the whole picture will be complete (and the rendering restarts). If I hadn't told you that, you wouldn't have the slightest chance of noticing how this page steals your CPU cycles, and that is why one might refer to such practice as parasitic or leech computing.

    The Mandelbrot fractal

    In this simple example, I am probably not winning much by outsourcing the rendering procedure. The computation of each pixel requires about 800 arithmetic operations on average, and this is comparable to the overhead imposed by the need to communicate the results back to the server via HTTP. However, if I chose to render somewhat larger chunks of the image at higher precision, the gains would be much more significant. Additionally, the script could be written so that it would keep running continuously for as long as you stay on the page, thus sacrificing the user experience somewhat, yet blatantly robbing you of CPU power.
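
The work done per visit can be sketched as follows. The real page runs Javascript in the browser; this Python version is only an illustration of the idea, and the row-per-visit split, the viewport and all names are assumptions rather than a description of the actual script.

```python
def escape_time(c, max_iter=100):
    """Iterations of z -> z*z + c before |z| exceeds 2 (max_iter if never)."""
    z = 0j
    for n in range(max_iter):
        if abs(z) > 2.0:
            return n
        z = z * z + c
    return max_iter


def render_chunk(chunk, n_chunks=100, width=100, height=100, max_iter=100):
    """Compute the pixels of the rows assigned to one visit.

    Row y belongs to chunk y % n_chunks, so n_chunks visits cover the image.
    The viewport [-2, 1] x [-1.5, 1.5] is an arbitrary choice for this sketch.
    """
    pixels = []
    for y in range(chunk, height, n_chunks):
        for x in range(width):
            c = complex(-2.0 + 3.0 * x / (width - 1),
                        -1.5 + 3.0 * y / (height - 1))
            pixels.append((x, y, escape_time(c, max_iter)))
    return pixels


# One "visit" renders one chunk; the result would then be sent to the server.
chunk_result = render_chunk(chunk=0, width=8, height=100)
print(len(chunk_result))   # 8 pixels: a single row of the 8 x 100 image
```

Because every pixel is computed independently of the others, the chunks can be handed out to visitors in any order and simply pasted together on the server.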

    It seems that this approach to distributed computing has not reached the masses yet. I believe, however, that we are going to see the spread of such parasitic code someday, because it is the second easiest way to monetize website traffic. Indeed, we are already used to watching ads in return for free service. Moreover, quite a lot of the ads are rather heavy Flash applications that spend your CPU cycles with the sole purpose of annoying you. Now, if someone replaced that annoying Flashing banner with a script that computed something useful behind the scenes, you wouldn't be too disappointed, would you? And that someone could then sell his website traffic not in terms of "banner displays", but in terms of "CPU seconds". Or, well, he could sell both.

    Of course, not every distributed computation can be easily implemented within such an environment. Firstly, it should be possible to divide the problem into a large number of independent parts: this is precisely the case when you need to compute the values of a certain function f for a large number of parameters. The Mandelbrot example above fits this description. Here is one other similar problem. Less obviously, various other tasks could be fit within the framework with the help of the Map-Reduce trick.

    Secondly, the computation of each value f(x) should be reasonably complex, preferably superlinear in the input size n, e.g. Ω(n^2) or worse. Otherwise, the overhead of sending the inputs (which is O(n)) would offset the benefits too much.

    Thirdly, the description of the function f should be reasonably compact, otherwise the overhead of transferring it to each visitor would be too costly. Note, however, that this issue slightly depends on the kind of traffic being leeched upon: if a website has a small number of dedicated users, each user would only need to download the function definition once and refer to the cached version on his subsequent visits to the site.

    Finally, the function f, as well as its inputs and outputs, must be public. This restriction severely limits the use of the approach. For example, although numerous data analysis tasks could satisfy the above conditions, in many practical contexts the data is private and it is thus not possible to openly distribute it to arbitrary visitors of an arbitrary website.

    Besides the theoretical difficulties, there are some technical issues that need to be solved before the whole thing can work, such as the security aspects (you can't trust the results!), implementation (Linear Algebra libraries for Javascript or Flash, please?), ethical concerns and some more.

    Nonetheless, the whole thing still looks rather promising to me, and is at least as worthy of academic and industrial attention, as are all of these overhyped Grid, P2P and SOA technologies around.

    PS: By the way, I find the topic well suited for a proper student project/thesis.



  • Posted by Konstantin 03.09.2008 4 Comments
    Google Chrome logo

    Geeks all over the world have just gained a new hot topic to flame or panic about. Web designers now have to verify their applications and websites against yet another browser. Developers learned about that new open-source embeddable Javascript engine. All the normal people will have a choice of yet another, hopefully well-made, browser to work with. Thus groweth the Church of Google.

    What is in it for Google? Apart from the obvious increase in the user base and the potential to "suggest" (that is, advertise) websites right in the address bar, wide distribution of Chrome should somewhat increase the amount of user-generated traffic flowing into Google servers. Indeed, the default configuration of Chrome, equipped with that marvelous auto-suggestion feature, seems to start sending stuff out as soon as you type your first character into the address line.

    Although the term "privacy violation" is the first one to pop up, let's put that aside for now. The really interesting question concerns the nature of this constant influx of half-typed URLs and "search terms", annotated with timestamps and host IPs. Firstly, it most certainly contains additional value over whatever is already indexed on the web: global events, current social trends, new websites, ideas and random creative thoughts leave a mark on your address line. It therefore makes sense to look into this data and search for patterns. Secondly, the volume of this data stream is probably quite large whilst the noise is significant, hence it does not pay off to store it anywhere. And that's where we reach a nice observation: for many purposes you don't need to store this data.

    If the bandwidth of the stream is constantly high, you can afford to throw it away. If at any moment you need data, just turn on the sniffer ("put your bucket in"), and you'll collect megabytes of interesting stuff in a matter of seconds, if not less. A simplistic version of this kind of "stream analysis" looks as follows: you ask Google "what are people reading right now", it listens to the stream for a second and responds with something meaningful. I believe much cooler things can be thought of. Anyway, the important point is that no "global" indexing or data collection is needed to perform this service, just a finger on the pulse of the web.
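
A rough sketch of such a "bucket in the stream" is given below. It makes no claim about how Google actually processes its traffic: the stream is modelled as a plain Python iterator of query strings, and the function names and the injectable `clock` parameter are invented for this example.

```python
import time
from collections import Counter


def sample_stream(stream, seconds, clock=time.monotonic):
    """Count the items seen during a short window; nothing else is kept."""
    deadline = clock() + seconds
    counts = Counter()
    for item in stream:
        if clock() >= deadline:
            break              # window closed: stop listening
        counts[item] += 1
    return counts


def top_terms(stream, seconds, n=3, clock=time.monotonic):
    """Answer "what do people type right now" from one short sample."""
    return sample_stream(stream, seconds, clock).most_common(n)


# Usage with a simulated stream and a fake clock that advances by one
# "second" on every call, so that the example is deterministic.
ticks = iter(range(100))
queries = iter(["python 2.6", "chrome", "python 2.6", "python 2.6", "xkcd"])
print(top_terms(queries, seconds=4, n=1, clock=lambda: next(ticks)))
```

Only the counts from the current window are held in memory; once the answer is returned, everything that passed through the bucket can be forgotten.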


  • Posted by Konstantin 01.09.2008 10 Comments

    There is no easy way to explain why a person who has spent 11 years at school, 4 years on a bachelor's and 3 years on a master's degree would then decide to enter PhD studies, risking yet another 4 years of his life, if not more. Unlike the case with the bachelor's or master's, there does not seem to be any social pressure encouraging one to get a PhD. Neither is there much economic motivation, because getting a PhD does not guarantee higher pay. Finally, although the PhD degree is indeed a damn cool thing to have, it is doubtful whether people possessing it enjoy life more than all the rest do.

    Nonetheless, a PhD is a required attribute of anyone aspiring to an academic career, and is regarded as a qualitatively higher step on the education ladder. So the question is what makes this qualitative difference, and what issues one should focus on most during the 4 years. Here's my initial guess:

    Your doctoral studies were not in vain if:

    1. You can generate a publishable paper in a month, a really good paper in 2-3 months,
    2. You can write a convincing grant/project proposal and you know when and where to submit it,
    3. You have realistic but useful ideas for future work and research,
    4. You know how to supervise/direct/collaborate with others and be actually useful at it,
    5. You know the most important people in your field, and they know you,
    6. You are good at lecturing or other kinds of oral presentation,
    7. You know what to do after you defend.

    I'm sure there's something missing, but this list has some aims complicated enough already. Hopefully, this blog will help me with points 3, 5 and 7 of the above agenda, as well as keep reminding me of the fact that I don't want to lose my 4 years for nothing. It's day 1 today. 1460 days left. Yay!
