• Posted by Konstantin 22.01.2009 2 Comments

    I bet most of you haven't heard of MDX, because it seems to be a technology that is difficult to stumble upon. It is not covered in university courses, not mentioned in the headlines of PC-magazine articles, and the corresponding google search returns about 20000 results, which makes it thousands (and even tens of millions) times less popular than most other related keywords. This is completely unfair. I believe MDX is compulsory knowledge for everyone who is educated in databases and data analysis enough to appreciate the virtues of both SQL queries and Excel-like spreadsheet processing. A famous quote by Alan Perlis says: "A language that doesn't affect the way you think about programming, is not worth knowing". In these terms, MDX is a language well worth knowing.

    MDX stands for Multidimensional Expressions. It is a language for specifying computations and performing queries on a multidimensional database, which effectively provides the computing power of spreadsheets in the form of a query language. Although it is not possible to explain all of MDX in a blog post, I'll try to show the gist of it in a couple of examples. I hope it will raise at least some interest in those, who are patient enough to read to the end. To understand MDX you need some experience with spreadsheets, so I'll assume you do and use the analogies with Excel.

    Multidimensional data. You use spreadsheets to work with tabular data, i.e. something like that:

    Tartu Tallinn
    Jan 21 -1.0 °C -2.1 °C
    Jan 22 -0.2 °C -0.2 °C
    Jan 23 0.3 °C -0.2 °C
    Average temperature

    This is a two-dimensional grid of numbers. Each point in the grid can be indexed by a tuple (Date, City). A multidimensional database is essentially the same grid but without the limitation of two dimensions. For instance, if there are several methods of measuring temperature, you can add a new dimension and index each cell by a tuple (Date, City, MeasurementMethod). And if you wish to keep both the average temperature and humidity, you just add a fourth dimension MeasureType to the database and store both, etc. Once your database becomes multidimensional it becomes impossible to display all of it as a two-dimensional table, so you will have to explore it by slices. For example, if the database did indeed contain 4 dimensions (Date, City, MeasurementMethod, MeasureType) then the above table would be a slice of it for (MeasureType = Temperature, MeasurementMethod = Usual). The MDX query corresponding to the slice would look as follows:

        { City.Tartu, City.Tallinn } on columns,
        { Date.Jan.21, Date.Jan.22, Date.Jan.23 } on rows
      where (MeasureType.Temperature, MeasurementMethod.Usual)

    Cell calculations. The second thing you use spreadsheets for is to compute new cell values from the existing ones. For example, you might wish to augment the table above with a column showing the temperature difference between Tartu and Tallinn and a row showing the average temperature over three days:

    Tartu Tallinn Difference
    Jan 21 -1.0 °C -2.1 °C 1.1 °C
    Jan 22 -0.2 °C -0.2 °C 0.0 °C
    Jan 23 0.3 °C -0.2 °C 0.5 °C
    Average -0.3 °C -0.83 °C 0.53 °C
    Average temperature

    To do that in Excel you would have to enter a formula for each cell in the new column and row. In MDX you analogously write:

      create member City.Difference as (City.Tartu - City.Tallinn)
      create member Date.Jan.Average as 
             (Avg({ Date.Jan.21, Date.Jan.22, Date.Jan.23 }))

    Note that once you have defined the new members, they apply to any slice of your data. I.e., you have just defined the way to compute City.Difference and Date.Jan.Average for all existing measure types and measurement methods. Moreover, many of the useful aggregate computations are already predefined for you. For example, the MDX server would implicitly understand that Date.Jan denotes be the aggregation (e.g. average) of values over all the days of January, and Date is the aggregation of values over all the dates ever. Similarly, City denotes the aggregation over all cities. Thus, selecting the average temperature over all cities in January is a matter of requesting

      where (Date.Jan, City, MeasureType.Temperature, 

    Filters, orders and more. You often need to query the data for things like "cities with the highest average temperature in January", or request to "order days by the temperature difference between Tartu and Tallinn". This is where both the power and complexity of MDX becomes visible. The first query would look as follows:

        Filter(City.Members, Date.Jan == Max(Date.Jan)) on columns
      where (MeasureType.Temperature, MeasurementMethod.Usual)

    Notice how much more expressive this is in comparison to the equivalent query you would have to come up with in SQL. Besides slicing, filtering and ordering, an MDX server can support lots of various generic data processing and analysis functions (here, the exact capabilities depend on the vendor). Thus, when properly tuned, even the tasks such as "selecting differentially expressed genes that play a significant role in a linear model for predicting cancer stage from microarray expression data" could become a matter of a single concise query.

    Pretty cool, don't you think?

    Tags: , , , , ,

  • Posted by Konstantin 13.11.2008 1 Comment

    It is good to attend lectures from time to time, even if these are on the topics that you "already know". Because, once in a while, you get to find out something new about things that are so "simple" and "familiar", that you thought there was nothing new you could possibly discover about them. So, today I've had an accidental revelation regarding a simple yet amusing property of the maximum likelihood estimation method. I think I should have been told that years ago, right on the first statistics course.

    Maximum Likelihood

    Maximum likelihood estimation is one of the simplest and most popular methods for fitting mathematical models to given data. It is most easily illustrated by an example. Suppose you are given a list of numbers (x_1, x_2, \dots, x_n), which, you believe, are all instances of a normally distributed random variable with mean \mu and variance \sigma^2. The problem is that the value of \mu is unknown to you and you would like to estimate it from the data. How do you do it? The maximum likelihood method suggests you to search for a value of \mu, for which the probability (or probability density) of obtaining exactly this set of numbers would be maximal. That is:

    \hat\mu = \mathrm{argmax}_{\mu} P[x_1, x_2, \dots, x_n\,|\,\mu]

    In most cases this turns out to be quite a reasonable suggestion, which leads to good estimates (although, there are some rare exceptions). Here, for example, it turns out that to estimate the mean you should simply compute the average of the given numbers:

    \hat\mu = \frac{1}{n}\sum_i x_i

    This is more-or-less all what a typical textbook has to say about maximal likelihood, but there's more to it.

    Counting Lightbulbs

    Consider the following example. Suppose you are willing to measure the average lifetime of a common lightbulb (that is, the amount of time it will glow until burning out). For that you buy 1000 lightbulbs in a general store, switch all of them on and wait for a year, keeping track of the timepoints at which the  individual lightbulbs burn out. Suppose the year has passed and you managed to record the lifetime of just 50 "dead" lightbulbs (the remaining 950 keep on glowing even after the whole year of uninterrupted work). What do you do now? Do you wait for other lightbulbs to burn out? But this can take ages! Do you just discard the 950 lightbulbs and only consider the 50 for which you have "available data"? This is also not good, because, clearly, the knowledge that 95% of lightbulbs live longer than a year is important. Indeed, from this fact alone you could infer that the average lifetime must be between 18 and 21 years with 95% confidence (an interested reader is invited to verify this claim). So what do you do?

    Assume that the lifetime of a lightbulb can be modeled using exponential distribution with mean \lambda (this is a natural assumption because exponential distribution is well-suited for describing the lifetime of electric equipment). Then the probability (probability density, in fact) that lightbulb i burns out after t_i years can be expressed as:

    P[X=t_i\,|\,\lambda] = \frac{1}{\lambda}e^{-\frac{t_i}{\lambda}}.

    Naturally, the likelihood of lightbulbs 1,2,3,...,50 burning out after t_1,t_2,\dots,t_{50} years correspondingly, is just the product of the corresponding expressions:


    Now what about the lightbulbs that are still alive? The likelihood that a lightbulb will glow longer than a year can be easily expressed as:

    P[X > 1\,|\,\lambda] = e^{-\frac{1}{\lambda}}.

    Therefore, the total likelihood of having 50 lightbulbs burn out at time moments t_1,t_2,\dots,t_{50} and 950 lightbulbs keep working longer than a year is the following product:

    P[X_1 = t_1, X_2 = t_2, \dots, X_{50} = t_{50}, X_{51} > 1, \dots, X_{1000} > 1|\,\lambda] =
    = \left(\prod_{i=1}^{50}P[X_i = t_i\,|\,\lambda]\right) \cdot \left(P[X > 1 \,|\,\lambda]\right)^{950} =
    = \left(\prod_{i=1}^{50}\frac{1}{\lambda}e^{-\frac{t_i}{\lambda}}\right)\cdot e^{-\frac{950}{\lambda}}.

    Finding the estimate of \lambda that maximizes this expression is now a simple technical detail of minor importance. What is important is the ease with which we managed to operate the "uncertain measurements" for the 950 lightbulbs.

    Final Remarks

    Note that the idea can be extended to nearly arbitrary kinds of uncertainty. For example, if, for some lightbulb we knew that its lifetime was a number between a and b, we would include this piece of knowledge by using the term P[a < X < b] in the likelihood function. This inherent flexibility of the maximum likelihood method looks like a feature quite useful for the analysis of noisy data.

    Surprisingly, however, it does not seem to be applied much in practice. The reason for that could lie in the fact that many nontrivial constraints on the data will make the likelihood function highly nonconvex and hard to optimize. Consider, for example, the probability density

    P[X = x_1 \vee X = x_2 \vee X = x_3 \,|\,\lambda] = P[x_1\,|\,\lambda] + P[x_2\,|\,\lambda] + P[x_3\,|\,\lambda]

    for a normally distributed variable X. Depending on the choice of x_1, x_2, and x_3, the likelihood function will have either three different peaks at \lambda = x_1, x_2, x_3, two peaks or just one. Thus, finding an exact optimum is much harder than in the typical textbook case. But on the other hand, there are various cases where the likelihood function is nicely convex (consider the interval constraint mentioned above) and besides there are numerous less efficient algorithms around anyway. So maybe the trick is not used much, simply because it's not in the textbook?

    Tags: , ,

  • Posted by Konstantin 22.10.2008 No Comments

    A beloved child has many names, and so does the field of pattern analysis. Some people call it soft computing, others refer to it as machine learning or data-driven statistics. Although in practice all those terms denote the same approach to data analysis, there is always a certain bias hidden behind each name that relates it to its history and origins. I'm not sure whether I understand these biases correctly, but here's how I would define the terms:

    Pattern Analysis
    This is probably the most general term that refers to a kind of data analysis, where the goal is to find "something interesting" in a given dataset. We typically know what kind of things (patterns) we consider interesting (this might be an association rule, a frequent subsequence, a classifier, etc), and the task is to detect the instances of this kind of patterns. Conceptually somewhat opposite to pattern analysis would be statistical hypothesis testing, where the task is to test a given pattern for interestingness. In this sense pattern analysis is very close to the multiple hypothesis testing statistical problem.
    The simplest way to formalize this generic notion I've seen in this exposition by Tilj de Bie. Let D denote the dataset, let \Pi be the pattern space (e.g. "the space of all sequences", or "the space of all classifiers", etc), and let for each \pi\in\Pi there be a pattern function p_\pi(\cdot), so that p_\pi(D) measures the "interestingness" of this pattern with respect to the data D. Then the general task of pattern analysis is just the following maximization problem:

    \max_{\pi\in\Pi} p_\pi(D)

    Machine Learning
    Machine learning can be regarded as a specific type of pattern analysis, where the central interest lies in finding classifiers or regression functions. The dataset in this case consists of a number of "instances" x_i \in X, and the task is to find a function f: X \to Y that maps these instances to corresponding "outputs" \hat{y}_i = f(x_i) in various ways.
    In the case of supervised machine learning, a "true" output y_i is provided with each instance x_i, and the resulting function should aim to match this output for given instances (i.e. f(x_i) \approx y_i) as well as generalize to the unseen ones. Statisticians call this task regression analysis. A classical example is the problem of detecting incoming spam mail by learning the classification rule from previously labeled messages.
    The task of unsupervised machine learning is to find a classifier f without any prespecified labeling of data. Typically one searches for a function that maps similar inputs to similar outputs. If the set of outputs Y is small and discrete, the task is referred to as clustering and somewhat resembles the problem of quantization. If the outputs are continuous, the task can often be related to either factor analysis or noise reduction.
    Finally, the problem of semi-supervised machine learning is to find a classifier for a dataset where some instances have labels and some don't. This can sometimes be regarded as task of missing value imputation.
    The typical approach to all types of machine learning task usually employs the definition of a loss measure L(f), that estimates the goodness of fit for a given classifier f, and then searching for a classifier that optimizes this measure, i.e.:

    \max_f L(f)

    This defines machine learning as a certain kind of pattern analysis.

    Soft Computing
    If the previous terms denote purely mathematical approaches, for which the exact computational method of obtaining a solution is somewhat irrelevant, then this term, on the contrary, denotes a set of certain metaheuristical computational techniques for obtaining these solutions. In the narrowest sense, by soft computing one might collectively denote the areas of evolutionary algorithms, fuzzy logic and neural networks. This gives soft computing a somewhat ad-hockish flavour: "If you don't know how to solve it, simply put it into a neural network, neural networks just work".

    Data Mining
    This might be just my opinion, but I define data mining as any kind of pattern analysis, that is applied in an offline setting. Moreover, I'd say that, ideally, true data mining happens when you attempt to search for patterns in data that was not initially meant for that specific analysis, thus contradicting the classical statistical ideology "specify your hypothesis first, collect the data later".

    Knowledge Discovery from Databases
    As far as I understand it, KDD is just a YABA and marketingspeak for "data mining", somewhat in the same manner as OLAP and Business Intelligence are marketingspeak for "descriptive statistics".

    Data-Driven Statistics
    In the narrowest sense, data-driven statistics denotes all kinds of nonparametric statistical approaches. These are those methods, where one manages to perform data analysis without specifying any parameterized distributions for the inputs or the like. Typical examples would be the various resampling and randomization techniques. In the broad sense, again, this might very much be the same as pattern analysis in general.

    Any other popular terms for the same thing I've forgotten here?

    Tags: , , , , ,

  • Posted by Konstantin 11.09.2008 7 Comments

    The internet nowadays is a data miner's paradise, providing unlimited ground for novel ideas and experiments. With just a brief look around one will easily find well-structured data about anything ranging from macroeconomic and business indicators to networks and genes.

    However, among this wild variety of datasets there are a few, which seem to possess an especially high wow-factor, either in terms of their fame, size, content, or the amount of potentially interesting information that is still waiting to be uncovered from them. Here is a short list of my text- and graph-mining favourites.

    1. Enron Emails. (download)
      Need a test set for your brand new text mining algorithm? Take this massive set of about 500 000 email messages from about 150 employees of the Enron Corporation, that were made public during the investigation of the famous fraud incident in 2004.
    2. AOL Search Logs. (download)
      A set of about 20 000 000 search queries, made by about 500 000 users, released by the AOL in 2006. The data contains an (anonymous) user id, search keywords and the rank of the the results link, that was clicked. NB: Besides being a nice research resource, it's a comprehensive collection of (probably outdated) porn links.
    3. OpenCyc Common Knowledge. (download)
      A large database of seemingly arbitrary common knowledge facts, such as "girls' moccasin" is a "moccasin used by female children", which is, in other words, "(ArtifactTypeForUserTypeFn Moccasin FemaleChild)". Comes with a reasoning engine and some serious ambitions.
    4. RealityMining: People Tracks. (on request)
      A continuous activity log of 100 people over the period of 9 months, recorded via a Symbian phone application. Includes call logs, Bluetooth devices in proximity, cell tower IDs, application usage, and phone status. If you ever want to construct a model of human social behaviour (or just interested in the emerging stalker technologies), this is certainly a place to look at. In contrast to some other entries in this list, the download is just a 38 megabyte file.
    5. DMOZ100k06: Internet Documents and Tags. (download)
      This is a sample of 100 000 webpages from the Mozilla Open Directory project together with their metadata and tags. A must see for those interested in bookmarks, tagging and how it relates to PageRank. Somewhat similar information could probably also be mined from CiteULike data.
    6. SwetoDblp: Ontology of CS Publications. (download)
      Interested in very large graphs and ontologies? Why not take the Computer Science Digital Bibliography and Library Project's data, converted to RDF. Contains about 11 000 000 triples, relating authors, universities and publications.
    7. Wikipedia3: Wikipedia in RDF. (download)
      This is just a conversion of English Wikipedia links and category relations into RDF, which produces a massive 47 000 000 triple dataset, available as a single download. The maintainers of the project, however, promise to develop it further and add more interesting relations to it in the future.
    8. Overstock Product Data. (download)
      Information on more than a million various products and their review ratings, available in a convenient tabular format. I'm not really sure about what can be done with it, but the sheer size of this data requires mentioning it here.
    9. Ensembl: Genomes of 40+ organisms. (download)
      This dataset is of a slightly different flavour than all of the above, and comes from a completely different weight category, but nonetheless it is purely textual and I thought it would be unfair not to mention it. In fact, this is probably one of the most expensive, promising and yet the least understood textual datasets in the world. And as long as noone really knows what to do with it, bioinformatics will prosper.

    I still lack one entry to make it a "top 10", so your suggestions are welcome.


    1. The Netflix Challenge Dataset.(register)
      A data mining challenge that raised the awareness of the commercial importance of machine learning as a field to a new level. Netflix promises $1 000 000 for the first one to beat their movie rating prediction algorithm by just 10%. The contest will soon celebrate its second year but no one has yet claimed the grand prize, and the stage is open.

    Update from 2015, when the world of data is miles away from what it was in 2008: List of awesome datasets.

    Tags: , , , ,

  • Posted by Konstantin 03.09.2008 4 Comments
    Google Chrome logo

    Geeks all over the world have just gained a new hot topic to flame or panic about. Web designers now have to verify their applications and websites against a yet another browser. Developers learned about that new open-source embeddable Javascript engine. All the normal people will have a choice of a yet another, hopefully well-made, browser to work with. Thus groweth the Church of Google.

    What is it to Google? Apart from the obvious increase in the user base and the potential to advertise suggest websites right in the address bar, wide distribution of Chrome should somewhat increase the amount of user-generated traffic flowing into Google servers. Indeed, the default configuration of Chrome, equipped with that marvelous auto-suggestion feature, seems to start sending stuff out as soon as you type your first character into the address line.

    Although the term "privacy violation" is the first one to pop out, let's keep that aside for now. The really interesting question concerns the nature of this constant influx of half-typed URLs and "search terms", annotated with timestamps and host IPs. Firstly, it most certainly contains additional value over whatever is already indexed in the web: global events, current social trends, new websites, ideas and random creative thoughts leave a mark on your address line. It therefore makes sense to look into this data and search for patterns. Secondly, the volume of this data stream is probably quite large whilst the noise is significant, hence it does not pay off to store it somewhere. And that's where we reach a nice observation: for many purposes you don't need to store this data.

    If the bandwidth of the stream is constantly high, you can afford to throw it away. If at any moment you should need data, just turn on the sniffer "put your bucket in", and you'll collect megabytes of interesting stuff in a matter of seconds, if not less. A simplistic version of such kind of "stream analysis" looks as follows: you ask Google "what do people read right now", it then listens to the stream for a second and responds with something meaningful, and I believe much cooler things can be thought of. Anyway, the important point is that no "global" indexing or data collection is needed to perform this service, just a thumb on the pulse of the web.

    Tags: , , ,