I bet most of you haven't heard of MDX, because it is a technology that is difficult to stumble upon. It is not covered in university courses, it is not mentioned in the headlines of PC-magazine articles, and the corresponding Google search returns about 20000 results, which makes it thousands (or even *tens of millions*) of times less popular than most related keywords. This is completely unfair. I believe MDX is compulsory knowledge for everyone who knows enough about databases and data analysis to appreciate the virtues of both SQL queries and Excel-like spreadsheet processing. A famous quote by Alan Perlis says: "A language that doesn't affect the way you think about programming is not worth knowing". In these terms, MDX is a language well worth knowing.

MDX stands for Multidimensional Expressions. It is a language for specifying computations and performing queries on a multidimensional database, which effectively provides the computing power of spreadsheets in the form of a query language. Although it is not possible to explain all of MDX in a blog post, I'll try to show the gist of it in a couple of examples. I hope it will raise at least some interest in those who are patient enough to read to the end. To understand MDX you need some experience with spreadsheets, so I'll assume you have it and use analogies with Excel.

**Multidimensional data.** You use spreadsheets to work with tabular data, i.e. something like this:

| | Tartu | Tallinn |
|---|---|---|
| Jan 21 | -1.0 °C | -2.1 °C |
| Jan 22 | -0.2 °C | -0.2 °C |
| Jan 23 | 0.3 °C | -0.2 °C |

This is a two-dimensional grid of numbers. Each point in the grid can be indexed by a tuple *(Date, City)*. A *multidimensional database* is essentially the same grid but without the limitation of two dimensions. For instance, if there are several methods of measuring temperature, you can add a new dimension and index each cell by a tuple *(Date, City, MeasurementMethod)*. And if you wish to keep both the average temperature and humidity, you just add a fourth dimension *MeasureType* to the database and store both, etc. Once your database becomes multidimensional it becomes impossible to display all of it as a two-dimensional table, so you will have to explore it by *slices*. For example, if the database did indeed contain 4 dimensions *(Date, City, MeasurementMethod, MeasureType)* then the above table would be a slice of it for *(MeasureType = Temperature, MeasurementMethod = Usual)*. The MDX query corresponding to the slice would look as follows:

    select
      { City.Tartu, City.Tallinn } on columns,
      { Date.Jan.21, Date.Jan.22, Date.Jan.23 } on rows
    where (MeasureType.Temperature, MeasurementMethod.Usual)
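If MDX syntax is new to you, the idea of slicing can be mimicked in plain Python. This is only a sketch of the semantics, not of how an MDX server works: the `cube` dictionary and the `take_slice` helper are names I made up for illustration, and the single humidity cell is a placeholder value rather than real data.

```python
from typing import Dict, List, Tuple

Cell = Tuple[str, str, str, str]  # (date, city, method, measure_type)

# Toy cube: temperatures are taken from the table above.
cube: Dict[Cell, float] = {
    ("Jan 21", "Tartu", "Usual", "Temperature"): -1.0,
    ("Jan 21", "Tallinn", "Usual", "Temperature"): -2.1,
    ("Jan 22", "Tartu", "Usual", "Temperature"): -0.2,
    ("Jan 22", "Tallinn", "Usual", "Temperature"): -0.2,
    ("Jan 23", "Tartu", "Usual", "Temperature"): 0.3,
    ("Jan 23", "Tallinn", "Usual", "Temperature"): -0.2,
    # Placeholder cell outside the slice (hypothetical, not real data).
    ("Jan 21", "Tartu", "Usual", "Humidity"): 80.0,
}


def take_slice(cube: Dict[Cell, float], rows: List[str], columns: List[str],
               method: str, measure_type: str) -> Dict[str, Dict[str, float]]:
    """Fix two dimensions and return the remaining 2-D grid as {date: {city: value}}."""
    return {
        date: {city: cube[(date, city, method, measure_type)] for city in columns}
        for date in rows
    }


table = take_slice(
    cube,
    rows=["Jan 21", "Jan 22", "Jan 23"],   # "on rows"
    columns=["Tartu", "Tallinn"],          # "on columns"
    method="Usual",                        # the "where" part
    measure_type="Temperature",
)
for date, row in table.items():
    print(date, row)
```

The `where` clause of the query corresponds to the two fixed arguments, while the `on columns` and `on rows` sets decide which cells of the remaining grid are shown.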

**Cell calculations.** The second thing you use spreadsheets for is to compute new cell values from the existing ones. For example, you might wish to augment the table above with a column showing the temperature difference between Tartu and Tallinn and a row showing the average temperature over three days:

| | Tartu | Tallinn | Difference |
|---|---|---|---|
| Jan 21 | -1.0 °C | -2.1 °C | 1.1 °C |
| Jan 22 | -0.2 °C | -0.2 °C | 0.0 °C |
| Jan 23 | 0.3 °C | -0.2 °C | 0.5 °C |
| Average | -0.3 °C | -0.83 °C | 0.53 °C |

To do that in Excel you would have to enter a formula for each cell in the new column and row. In MDX you analogously write:

    create member City.Difference as (City.Tartu - City.Tallinn)
    create member Date.Jan.Average as (Avg({ Date.Jan.21, Date.Jan.22, Date.Jan.23 }))

Note that once you have defined the new members, they apply to *any slice* of your data. I.e., you have just defined the way to compute *City.Difference* and *Date.Jan.Average* for *all* existing measure types and measurement methods. Moreover, many of the useful aggregate computations are already predefined for you. For example, the MDX server would implicitly understand that *Date.Jan* denotes the aggregation (e.g. average) of values over all the days of January, and *Date* is the aggregation of values over all the dates ever. Similarly, *City* denotes the aggregation over all cities. Thus, selecting the average temperature over all cities in January is a matter of requesting

    select
    where (Date.Jan, City, MeasureType.Temperature, MeasurementMethod.Usual)
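To make the "defined once, applies to any slice" point concrete, here is a small Python sketch of the same calculations. It is an analogy under simplifying assumptions, not MDX: the function names are mine, the grid holds only the three days from the table above rather than all of January, and the aggregation is hard-wired to the average.

```python
from statistics import mean

DATES = ["Jan 21", "Jan 22", "Jan 23"]

# One 2-D slice of the cube, indexed by (date, city); values from the table above.
temperature = {
    ("Jan 21", "Tartu"): -1.0, ("Jan 21", "Tallinn"): -2.1,
    ("Jan 22", "Tartu"): -0.2, ("Jan 22", "Tallinn"): -0.2,
    ("Jan 23", "Tartu"): 0.3, ("Jan 23", "Tallinn"): -0.2,
}


def difference(grid, date):
    """Analogue of City.Difference: Tartu minus Tallinn for one date."""
    return grid[(date, "Tartu")] - grid[(date, "Tallinn")]


def average(grid, city, dates=DATES):
    """Analogue of Date.Jan.Average: mean over the listed dates for one city."""
    return mean(grid[(date, city)] for date in dates)


def overall(grid):
    """Analogue of (Date.Jan, City): aggregate over every date and every city."""
    return mean(list(grid.values()))


print(round(difference(temperature, "Jan 21"), 2))  # 1.1
print(round(average(temperature, "Tallinn"), 2))    # -0.83
print(round(overall(temperature), 2))               # -0.57
```

Because each function takes the grid as an argument, the same definitions work unchanged on a humidity grid or on data from another measurement method, which is the property the calculated members have in MDX.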

**Filters, orders and more.** You often need to query the data for things like "cities with the highest average temperature in January", or request to "order days by the temperature difference between Tartu and Tallinn". This is where both the power and the complexity of MDX become visible. The first query would look as follows:

    select
      Filter(City.Members, Date.Jan == Max(Date.Jan)) on columns
    where (MeasureType.Temperature, MeasurementMethod.Usual)

Notice how much more expressive this is in comparison to the equivalent query you would have to come up with in SQL. Besides slicing, filtering and ordering, an MDX server can support many generic data processing and analysis functions (here, the exact capabilities depend on the vendor). Thus, when properly tuned, even tasks such as "selecting differentially expressed genes that play a significant role in a linear model for predicting cancer stage from microarray expression data" could become a matter of a single concise query.
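For readers who want to check what the two example requests from the start of this section actually compute, here is a rough Python equivalent. It is a sketch only: the helper names are hypothetical, the data is limited to the three days from the table above, and a real MDX server would evaluate this very differently.

```python
from statistics import mean

# (date, city) -> temperature, values from the table above.
temperature = {
    ("Jan 21", "Tartu"): -1.0, ("Jan 21", "Tallinn"): -2.1,
    ("Jan 22", "Tartu"): -0.2, ("Jan 22", "Tallinn"): -0.2,
    ("Jan 23", "Tartu"): 0.3, ("Jan 23", "Tallinn"): -0.2,
}
cities = sorted({city for _, city in temperature})
dates = sorted({date for date, _ in temperature})


def january_average(city):
    """Average over all available January days for one city."""
    return mean(v for (_, c), v in temperature.items() if c == city)


def difference(date):
    """Tartu minus Tallinn on a given date."""
    return temperature[(date, "Tartu")] - temperature[(date, "Tallinn")]


# "Cities with the highest average temperature in January":
# keep the cities whose average equals the maximum over all cities.
best = max(january_average(city) for city in cities)
warmest = [city for city in cities if january_average(city) == best]

# "Order days by the temperature difference between Tartu and Tallinn".
ordered_days = sorted(dates, key=difference)

print(warmest)       # ['Tartu']
print(ordered_days)  # ['Jan 22', 'Jan 23', 'Jan 21']
```

The filter mirrors the shape of the `Filter(set, condition)` call in the query: a set of members and a condition evaluated per member.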

Pretty cool, don't you think?

Now that I think of it, most of my programs that do some parameter estimation use consecutively nested hashmaps keyed by the conditions. So to estimate p(a|b, c, d, e) you do

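    for my $item (@set) {
        # Assumes each $item is a hash ref holding the values of a..e.
        my ($va, $vb, $vc, $vd, $ve) = @{$item}{qw(a b c d e)};
        $mle->{specific}{$vb}{$vc}{$vd}{$ve}{$va}++;  # count(a, b, c, d, e)
        $mle->{general}{$vb}{$vc}{$vd}{$ve}++;        # count(b, c, d, e)
    }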

Now if I understand correctly that looks like filling 2 multidimensional tables. My actual question is: when dealing with really large data -- i.e. when the estimation doesn't fit into memory -- would it be logical and/or efficient to replace filling the hashes with filling an MDX database? Is there an MDXlite? 🙂

There is no MDXLite (yet, I hope), but there's Mondrian, an open-source, reasonably lightweight (for a J2EE application, at least) MDX server, which basically translates MDX queries into SQL. It also knows how to exploit pre-aggregated data a bit.

I guess that the high-end expensive MDX servers à la MSSQL should at least try to do their best to work as efficiently as possible on datasets of any size.

But in any case, MDX is just a language for describing what you need done; there is nothing inherently efficient in it.

For example, I guess your example would translate to something like

It's now up to the server to translate it to something that executes quickly and uses preaggregations properly.