Four Years Remaining

Un-ordering Data

Posted by Konstantin 11.10.2009

I've recently stumbled upon a simple observation, which does not seem to be common knowledge and yet looks quite enlightening. Namely: polynomials provide an excellent way of modeling data in an order-agnostic manner.

Situations where you need to represent data in an order-agnostic way are actually fairly common. Suppose that you are given a traditional sample $x_1, x_2, \dots, x_n$ and are faced with the task of devising a generic function of the sample, which may depend only on the values in the sample, but not on the ordering of these values. Alternatively, you might need to prove that a given statistic is invariant under all permutations of the sample. Finally, you might simply wish to have a convenient mapping for your feature vectors that would lose the ordering information, but nothing else.

The most common way of addressing this problem is sorting the sample and working with the order statistics $x_{(1)}, x_{(2)}, \dots, x_{(n)}$ instead of the original values. This is not always convenient. Firstly, the mapping of the original sample to the corresponding vector of order statistics (i.e. the sorting operation) is quite complicated to express mathematically. Secondly, the condition that the vector of order statistics is always sorted is not very pleasant to work with. A much better idea is to represent your data as a polynomial of the form

$p_x(z) = (z+x_1)(z+x_2)\dots(z+x_n)\,.$

This will immediately provide you with a marvellous tool: two such polynomials $p_x$ and $p_y$ are equal if and only if they have the same roots, counted with multiplicity. Since the roots of $p_x$ are $-x_1,\dots,-x_n$, this means, in our case, that the samples $x_1,\dots,x_n$ and $y_1,\dots,y_n$ are equal up to a reordering.
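A quick way to convince yourself of this is to expand the product numerically. The sketch below is only an illustration: the helper name `poly_coeffs` is my own, and it sticks to integer samples so that the comparison is exact. The coefficients it produces are the elementary symmetric polynomials of the sample (the sum, the sum of pairwise products, and so on up to the full product).

```python
from itertools import permutations


def poly_coeffs(sample):
    """Coefficients of (z + x_1)...(z + x_n), highest degree first."""
    coeffs = [1]  # start from the constant polynomial 1
    for x in sample:
        # Multiply the current polynomial by (z + x).
        times_z = coeffs + [0]                   # current polynomial times z
        times_x = [0] + [x * c for c in coeffs]  # current polynomial times x
        coeffs = [s + t for s, t in zip(times_z, times_x)]
    return coeffs


sample = [3, 1, 2]
reference = poly_coeffs(sample)

# Every reordering of the sample yields the same coefficient list.
assert all(poly_coeffs(list(p)) == reference for p in permutations(sample))

# Samples that differ as multisets yield different polynomials.
assert poly_coeffs([1, 1, 2]) != poly_coeffs([1, 2, 2])
```

With floating-point samples the coefficients may differ in the last few bits depending on the multiplication order, so a tolerance-based comparison would be needed there.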

Now in order to actually represent the polynomial we can either directly compute its coefficients

$p_x(z) = z^n + a_1z^{n-1} + \dots + a_n\,,$

or calculate its values at any $n$ different points (e.g. at $0,1,\dots,n-1$); $n$ values suffice because the polynomial is monic of degree $n$. In either case we end up with the same amount of data as we had originally (i.e. $n$ values), but the new representation is order-agnostic and has, arguably, much nicer properties than the order statistics vector.
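The evaluation-based representation can be sketched in a couple of lines. The function name `poly_values` and the choice of evaluation points $0,1,\dots,n-1$ are assumptions made for this illustration; any $n$ distinct points would do.

```python
from math import prod


def poly_values(sample):
    """Values of p_x(z) = (z + x_1)...(z + x_n) at z = 0, 1, ..., n-1."""
    n = len(sample)
    return [prod(z + x for x in sample) for z in range(n)]


# The feature vector does not depend on the order of the sample.
assert poly_values([3, 1, 2]) == poly_values([2, 3, 1])
print(poly_values([3, 1, 2]))  # [6, 24, 60]
```

Note that the first entry is simply the product of all sample values, and that the entries grow quickly with $n$, which is worth keeping in mind before feeding them to anything sensitive to scale.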

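It is not without its own problems, of course. Firstly, the straightforward computation takes $\Theta(n^2)$ time, and even FFT-based fast multiplication only brings this down to $O(n \log^2 n)$, which is still slower than sorting. Secondly, not every polynomial has $n$ real-valued roots, so not every coefficient vector corresponds to an actual real-valued sample. And thirdly, the interpretation of the new "feature vector" is not necessarily intuitive or meaningful. Yet nonetheless, it's a trick to consider.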
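The second caveat is easiest to see for $n = 2$, where $z^2 + a_1 z + a_2 = (z+x_1)(z+x_2)$ gives $a_1 = x_1 + x_2$ and $a_2 = x_1 x_2$. The sketch below (the function name `sample_from_coeffs` is hypothetical) inverts this mapping with the quadratic formula and returns `None` when no real-valued sample exists.

```python
from math import sqrt


def sample_from_coeffs(a1, a2):
    """Recover a real sample (x1, x2) from z^2 + a1*z + a2, if one exists."""
    disc = a1 * a1 - 4 * a2
    if disc < 0:
        # Complex roots: no real-valued sample maps to these coefficients.
        return None
    r = sqrt(disc)
    # Returned in increasing order, i.e. as order statistics.
    return ((a1 - r) / 2, (a1 + r) / 2)


print(sample_from_coeffs(5, 6))  # (2.0, 3.0), since (z + 2)(z + 3) = z^2 + 5z + 6
print(sample_from_coeffs(0, 1))  # None, since z^2 + 1 has no real roots
```

So the set of valid coefficient vectors is itself constrained (here by $a_1^2 \ge 4a_2$), just as the vector of order statistics is constrained to be sorted.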

Tags: Data analysis, Hacks, Mathematics, Statistics, Theory