Four Years Remaining

What is the Covariance Matrix?

Posted by Konstantin 23.11.2016 16 Comments
Basic linear algebra, introductory statistics and some familiarity with core machine learning concepts (such as PCA and linear models) are the prerequisites of this post. Otherwise it will probably make no sense. An abridged version of this text is also posted on Quora.

Most textbooks on statistics cover covariance right in their first chapters. It is defined as a useful "measure of dependency" between two random variables:

$\mathrm{cov}(X,Y) = E[(X - E[X])(Y - E[Y])].$

The textbook would usually provide some intuition on why it is defined as it is, prove a couple of properties, such as bilinearity, define the covariance matrix for multiple variables as ${\bf\Sigma}_{i,j} = \mathrm{cov}(X_i, X_j)$ , and stop there. Later on the covariance matrix would pop up here and there in seeminly random ways. In one place you would have to take its inverse, in another - compute the eigenvectors, or multiply a vector by it, or do something else for no apparent reason apart from "that's the solution we came up with by solving an optimization task".

In reality, though, there are some very good and quite intuitive reasons for why the covariance matrix appears in various techniques in one or another way. This post aims to show that, illustrating some curious corners of linear algebra in the process.

Meet the Normal Distribution

The best way to truly understand the covariance matrix is to forget the textbook definitions completely and depart from a different point instead. Namely, from the the definition of the multivariate Gaussian distribution:

We say that the vector $\bf x$ has a normal (or Gaussian) distribution with mean $\bf \mu$ and covariance $\bf \Sigma$ if:

$\Pr({\bf x}) =|2\pi{\bf\Sigma}|^{-1/2} \exp\left(-\frac{1}{2}({\bf x} - {\bf\mu})^T{\bf\Sigma}^{-1}({\bf x} - {\bf \mu})\right).$

To simplify the math a bit, we will limit ourselves to the centered distribution (i.e. ${\bf\mu} = {\bf 0}$ ) and refrain from writing out the normalizing constant $|2\pi{\bf\Sigma}|^{-1/2}$ . Now, the definition of the (centered) multivariate Gaussian looks as follows:

$\Pr({\bf x}) \propto \exp\left(-0.5{\bf x}^T{\bf\Sigma}^{-1}{\bf x}\right).$

Much simpler, isn't it? Finally, let us define the covariance matrix as nothing else but the parameter of the Gaussian distribution. That's it. You will see where it will lead us in a moment.

Transforming the Symmetric Gaussian

Consider a symmetric Gaussian distribution, i.e. the one with ${\bf \Sigma = \bf I}$ (the identity matrix). Let us take a sample from it, which will of course be a symmetric, round cloud of points:

We know from above that the likelihood of each point in this sample is

(1) $P({\bf x}) \propto \exp(-0.5 {\bf x}^T {\bf x}).$

Now let us apply a linear transformation ${\bf A}$ to the points, i.e. let ${\bf y} ={\bf Ax}$ . Suppose that, for the sake of this example, ${\bf A}$ scales the vertical axis by 0.5 and then rotates everything by 30 degrees. We will get the following new cloud of points ${\bf y}$ :

What is the distribution of ${\bf y}$ ? Just substitute ${\bf x}={\bf A}^{-1}{\bf y}$ into (1), to get:

(2) $\begin{align*} P({\bf y}) &\propto \exp(-0.5 ({\bf A}^{-1}{\bf y})^T({\bf A}^{-1}{\bf y}))\\ &=\exp(-0.5{\bf y}^T({\bf AA}^T)^{-1}{\bf y}) \end{align*}$

This is exactly the Gaussian distribution with covariance ${\bf \Sigma} = {\bf AA}^T$ . The logic works both ways: if we have a Gaussian distribution with covariance $\bf \Sigma$ , we can regard it as a distribution which was obtained by transforming the symmetric Gaussian by some ${\bf A}$ , and we are given ${\bf AA}^T$ .

More generally, if we have any data, then, when we compute its covariance to be $\bf\Sigma$ , we can say that if our data were Gaussian, then it could have been obtained from a symmetric cloud using some transformation $\bf A$ , and we just estimated the matrix ${\bf AA}^T$ , corresponding to this transformation.

Note that we do not know the actual $\bf A$ , and it is mathematically totally fair. There can be many different transformations of the symmetric Gaussian which result in the same distribution shape. For example, if $\bf A$ is just a rotation by some angle, the transformation does not affect the shape of the distribution at all. Correspondingly, ${\bf AA}^T = {\bf I}$ for all rotation matrices. When we see a unit covariance matrix we really do not know, whether it is the “originally symmetric” distribution, or a “rotated symmetric distribution”. And we should not really care - those two are identical.

There is a theorem in linear algebra, which says that any symmetric matrix $\bf \Sigma$ can be represented as:

(3) ${\bf \Sigma} = {\bf VDV}^T,$

where ${\bf V}$ is orthogonal (i.e. a rotation) and ${\bf D}$ is diagonal (i.e. a coordinate-wise scaling). If we rewrite it slightly, we will get:

(4) ${\bf \Sigma} = ({\bf VD}^{1/2})({\bf VD}^{1/2})^T = {\bf AA}^T,$

where ${\bf A} = {\bf VD}^{1/2}$ . This, in simple words, means that any covariance matrix $\bf \Sigma$ could have been the result of transforming the data using a coordinate-wise scaling ${\bf D}^{1/2}$ followed by a rotation $\bf V$ . Just like in our example with $\bf x$ and $\bf y$ above.

Principal Component Analysis

Given the above intuition, PCA already becomes a very obvious technique. Suppose we are given some data. Let us assume (or “pretend”) it came from a normal distribution, and let us ask the following questions:
1. What could have been the rotation $\bf V$ and scaling ${\bf D}^{1/2}$ , which produced our data from a symmetric cloud?
2. What were the original, “symmetric-cloud” coordinates $\bf x$ before this transformation was applied.
3. Which original coordinates were scaled the most by $\bf D$ and thus contribute most to the spread of the data now. Can we only leave those and throw the rest out?
All of those questions can be answered in a straightforward manner if we just decompose $\bf \Sigma$ into $\bf V$ and $\bf D$ according to (3). But (3) is exactly the eigenvalue decomposition of $\bf\Sigma$ . I’ll leave you to think for just a bit and you’ll see how this observation lets you derive everything there is about PCA and more.

The Metric Tensor

Bear me for just a bit more. One way to summarize the observations above is to say that we can (and should) regard ${\bf\Sigma}^{-1}$ as a metric tensor. A metric tensor is just a fancy formal name for a matrix, which summarizes the deformation of space. However, rather than claiming that it in some sense determines a particular transformation $\bf A$ (which it does not, as we saw), we shall say that it affects the way we compute angles and distances in our transformed space.

Namely, let us redefine, for any two vectors $\bf v$ and $\bf w$ , their inner product as:

(5) $\langle {\bf v}, {\bf w}\rangle_{\Sigma^{-1}} = {\bf v}^T{\bf \Sigma}^{-1}{\bf w}.$

To stay consistent we will also need to redefine the norm of any vector as

(6) $|{\bf v}|_{\Sigma^{-1}} = \sqrt{{\bf v}^T{\bf \Sigma}^{-1}{\bf v}},$

and the distance between any two vectors as

(7) $|{\bf v}-{\bf w}|_{\Sigma^{-1}} = \sqrt{({\bf v}-{\bf w})^T{\bf \Sigma}^{-1}({\bf v}-{\bf w})}.$

Those definitions now describe a kind of a “skewed world” of points. For example, a unit circle (a set of points with “skewed distance” 1 to the center) in this world might look as follows:

And here is an example of two vectors, which are considered “orthogonal”, a.k.a. “perpendicular” in this strange world:

Although it may look weird at first, note that the new inner product we defined is actually just the dot product of the “untransformed” originals of the vectors:

(8) ${\bf v}^T{\bf \Sigma}^{-1}{\bf w} = {\bf v}^T({\bf AA}^T)^{-1}{\bf w}=({\bf A}^{-1}{\bf v})^T({\bf A}^{-1}{\bf w}),$

The following illustration might shed light on what is actually happening in this $\Sigma$ -“skewed” world. Somehow “deep down inside”, the ellipse thinks of itself as a circle and the two vectors behave as if they were (2,2) and (-2,2).

Getting back to our example with the transformed points, we could now say that the point-cloud $\bf y$ is actually a perfectly round and symmetric cloud “deep down inside”, it just happens to live in a skewed space. The deformation of this space is described by the tensor ${\bf\Sigma}^{-1}$ (which is, as we know, equal to $({\bf AA}^T)^{-1}$ . The PCA now becomes a method for analyzing the deformation of space, how cool is that.

The Dual Space

We are not done yet. There’s one interesting property of “skewed” spaces worth knowing about. Namely, the elements of their dual space have a particular form. No worries, I’ll explain in a second.

Let us forget the whole skewed space story for a moment, and get back to the usual inner product ${\bf w}^T{\bf v}$ . Think of this inner product as a function $f_{\bf w}({\bf v})$ , which takes a vector $\bf v$ and maps it to a real number, the dot product of $\bf v$ and $\bf w$ . Regard the $\bf w$ here as the parameter (“weight vector”) of the function. If you have done any machine learning at all, you have certainly come across such linear functionals over and over, sometimes in disguise. Now, the set of all possible linear functionals $f_{\bf w}$ is known as the dual space to your “data space”.

Note that each linear functional is determined uniquely by the parameter vector $\bf w$ , which has the same dimensionality as $\bf v$ , so apparently the dual space is in some sense equivalent to your data space - just the interpretation is different. An element $\bf v$ of your “data space” denotes, well, a data point. An element $\bf w$ of the dual space denotes a function $f_{\bf w}$ , which projects your data points on the direction $\bf w$ (recall that if $\bf w$ is unit-length, ${\bf w}^T{\bf v}$ is exactly the length of the perpendicular projection of $\bf v$ upon the direction $\bf w$ ). So, in some sense, if $\bf v$ -s are “vectors”, $\bf w$ -s are “directions, perpendicular to these vectors”. Another way to understand the difference is to note that if, say, the elements of your data points numerically correspond to amounts in kilograms, the elements of $\bf w$ would have to correspond to “units per kilogram”. Still with me?

Let us now get back to the skewed space. If $\bf v$ are elements of a skewed Euclidean space with the metric tensor ${\bf\Sigma}^{-1}$ , is a function $f_{\bf w}({\bf v}) = {\bf w}^T{\bf v}$ an element of a dual space? Yes, it is, because, after all, it is a linear functional. However, the parameterization of this function is inconvenient, because, due to the skewed tensor, we cannot interpret it as projecting vectors upon $\bf w$ nor can we say that $\bf w$ is an “orthogonal direction” (to a separating hyperplane of a classifier, for example). Because, remember, in the skewed space it is not true that orthogonal vectors satisfy ${\bf w}^T{\bf v}=0$ . Instead, they satisfy ${\bf w}^T{\bf \Sigma}^{-1}{\bf v} = 0$ . Things would therefore look much better if we parameterized our dual space differently. Namely, by considering linear functionals of the form $f^{\Sigma^{-1}}_{\bf z}({\bf v}) = {\bf z}^T{\bf \Sigma}^{-1}{\bf v}$ . The new parameters $\bf z$ could now indeed be interpreted as an “orthogonal direction” and things overall would make more sense.

However when we work with actual machine learning models, we still prefer to have our functions in the simple form of a dot product, i.e. $f_{\bf w}$ , without any ugly $\bf\Sigma$ -s inside. What happens if we turn a “skewed space” linear functional from its natural representation into a simple inner product?

(9) $f^{\Sigma^{-1}}_{\bf z}({\bf v}) = {\bf z}^T{\bf\Sigma}^{-1}{\bf v} = ({\bf \Sigma}^{-1}{\bf z})^T{\bf v} = f_{\bf w}({\bf v}),$

where ${\bf w} = {\bf \Sigma}^{-1}{\bf z}$ . (Note that we can lose the transpose because $\bf \Sigma$ is symmetric).

What it means, in simple terms, is that when you fit linear models in a skewed space, your resulting weight vectors will always be of the form ${\bf \Sigma}^{-1}{\bf z}$ . Or, in other words, ${\bf\Sigma}^{-1}$ is a transformation, which maps from “skewed perpendiculars” to “true perpendiculars”. Let me show you what this means visually.

Consider again the two “orthogonal” vectors from the skewed world example above:

Let us interpret the blue vector as an element of the dual space. That is, it is the $\bf z$ vector in a linear functional ${\bf z}^T{\bf\Sigma}^{-1}{\bf v}$ . The red vector is an element of the “data space”, which would be mapped to 0 by this functional (because the two vectors are “orthogonal”, remember).

For example, if the blue vector was meant to be a linear classifier, it would have its separating line along the red vector, just like that:

If we now wanted to use this classifier, we could, of course, work in the “skewed space” and use the expression ${\bf z}^T{\bf\Sigma}^{-1}{\bf v}$ to evaluate the functional. However, why don’t we find the actual normal $\bf w$ to that red separating line so that we wouldn’t need to do an extra matrix multiplication every time we use the function?

It is not too hard to see that ${\bf w}={\bf\Sigma}^{-1}{\bf z}$ will give us that normal. Here it is, the black arrow:

Therefore, next time, whenever you see expressions like ${\bf w}^T{\bf\Sigma}^{-1}{\bf v}$ or $({\bf v}-{\bf w})^T{\bf\Sigma}^{-1}({\bf v}-{\bf w})$ , remember that those are simply inner products and (squared) distances in a skewed space, while ${\bf \Sigma}^{-1}{\bf z}$ is a conversion from a skewed normal to a true normal. Also remember that the “skew” was estimated by pretending that the data were normally-distributed.

Once you see it, the role of the covariance matrix in some methods like the Fisher’s discriminant or Canonical correlation analysis might become much more obvious.

The Dual Space Metric Tensor

“But wait”, you should say here. “You have been talking about expressions like ${\bf w}^T{\bf\Sigma}^{-1}{\bf v}$ all the time, while things like ${\bf w}^T{\bf\Sigma}{\bf v}$ are also quite common in practice. What about those?”

Hopefully you know enough now to suspect that ${\bf w}^T{\bf\Sigma}{\bf v}$ is again an inner product or a squared norm in some deformed space, just not the “internal data metric space”, that we considered so far. Which space is it? It turns out it is the “internal dual metric space”. That is, whilst the expression ${\bf w}^T{\bf\Sigma}^{-1}{\bf v}$ denoted the “new inner product” between the points, the expression ${\bf w}^T{\bf\Sigma}{\bf v}$ denotes the “new inner product” between the parameter vectors. Let us see why it is so.

Consider an example again. Suppose that our space transformation $\bf A$ scaled all points by 2 along the $x$ axis. The point (1,0) became (2,0), the point (3, 1) became (6, 1), etc. Think of it as changing the units of measurement - before we measured the $x$ axis in kilograms, and now we measure it in pounds. Consequently, the norm of the point (2,0) according to the new metric, $|(2,0)|_{\Sigma^{-1}}$ will be 1, because 2 pounds is still just 1 kilogram “deep down inside”.

What should happen to the parameter ("direction") vectors due to this transformation? Can we say that the parameter vector (1,0) also got scaled to (2,0) and that the norm of the parameter vector (2,0) is now therefore also 1? No! Recall that if our initial data denoted kilograms, our dual vectors must have denoted “units per kilogram”. After the transformation they will be denoting “units per pound”, correspondingly. To stay consistent we must therefore convert the parameter vector (”1 unit per kilogram”, 0) to its equivalent (“0.5 units per pound”,0). Consequently, the norm of the parameter vector (0.5,0) in the new metric will be 1 and, by the same logic, the norm of the dual vector (2,0) in the new metric must be 4. You see, the “importance of a parameter/direction” gets scaled inversely to the “importance of data” along that parameter or direction.

More formally, if ${\bf x}'={\bf Ax}$ , then

(10) $\begin{align*} f_{\bf w}({\bf x}) &= {\bf w}^T{\bf x} = {\bf w}^T{\bf A}^{-1}{\bf x}'\\ & =(({\bf A}^{-1})^T{\bf w})^T{\bf x}'=f_{({\bf A}^{-1})^T{\bf w}}({\bf x}'). \end{align*}$

This means, that the transformation $\bf A$ of the data points implies the transformation ${\bf B}:=({\bf A}^{-1})^T$ of the dual vectors. The metric tensor for the dual space must thus be:

(11) $({\bf BB}^T)^{-1}=(({\bf A}^{-1})^T{\bf A}^{-1})^{-1}={\bf AA}^T={\bf \Sigma}.$

Remember the illustration of the “unit circle” in the ${\bf \Sigma}^{-1}$ metric? This is how the unit circle looks in the corresponding $\bf\Sigma$ metric. It is rotated by the same angle, but it is stretched in the direction where it was squished before.

Intuitively, the norm (“importance”) of the dual vectors along the directions in which the data was stretched by $\bf A$ becomes proportionally larger (note that the “unit circle” is, on the contrary, “squished” along those directions).

But the “stretch” of the space deformation in any direction can be measured by the variance of the data. It is therefore not a coincidence that ${\bf w}^T{\bf \Sigma}{\bf w}$ is exactly the variance of the data along the direction $\bf w$ (assuming $|{\bf w}|=1$ ).

The Covariance Estimate

Once we start viewing the covariance matrix as a transformation-driven metric tensor, many things become clearer, but one thing becomes extremely puzzling: why is the inverse covariance of the data a good estimate for that metric tensor? After all, it is not obvious that ${\bf X}^T{\bf X}/n$ (where $\bf X$ is the data matrix) must be related to the $\bf\Sigma$ in the distribution equation $\exp(-0.5{\bf x}^T{\bf\Sigma}^{-1}{\bf x})$ .

Here is one possible way to see the connection. Firstly, let us take it for granted that if $\bf X$ is sampled from a symmetric Gaussian, then ${\bf X}^T{\bf X}/n$ is, on average, a unit matrix. This has nothing to do with transformations, but just a consequence of pairwise independence of variables in the symmetric Gaussian.

Now, consider the transformed data, ${\bf Y}={\bf XA}^T$ (vectors in the data matrix are row-wise, hence the multiplication on the right with a transpose). What is the covariance estimate of $\bf Y$ ?

(12) ${\bf Y}^T{\bf Y}/n=({\bf XA}^T)^T{\bf XA}^T/n={\bf A}({\bf X}^T{\bf X}){\bf A}^T/n\approx {\bf AA}^T,$

the familiar tensor.

This is a place where one could see that a covariance matrix may make sense outside the context of a Gaussian distribution, after all. Indeed, if you assume that your data was generated from any distribution $P$ with uncorrelated variables of unit variance and then transformed using some matrix $\bf A$ , the expression ${\bf X}^T{\bf X}/n$ will still be an estimate of ${\bf AA}^T$ , the metric tensor for the corresponding (dual) space deformation.

However, note that out of all possible initial distributions $P$ , the normal distribution is exactly the one with the maximum entropy, i.e. the “most generic”. Thus, if you base your analysis on the mean and the covariance matrix (which is what you do with PCA, for example), you could just as well assume your data to be normally distributed. In fact, a good rule of thumb is to remember, that whenever you even mention the word "covariance matrix", you are implicitly fitting a Gaussian distribution to your data.
Tags: Algebra, Explanation, Machine learning, Mathematics, Probability theory, Statistics, Visualization
The Secrets of Spring Motion

Posted by Konstantin 17.11.2016 3 Comments

Imagine a weight hanging on a spring. Let us pull the weight a bit and release it into motion. What will its motion look like? If you remember some of your high-school physics, you should probably answer that the resulting motion is a simple harmonic oscillation, best described by a sinewave. Although this is a fair answer, it actually misses an interesting property of real-life springs. A property most people don't think much about, because it goes a bit beyond the high school curriculum. This property is best illustrated by

The Slinky Drop

The "slinky drop" is a fun little experiment which has got its share of internet fame.

The Slinky Drop

When the top end of a suspended slinky is released, the bottom seems to patiently wait for the top to arrive before starting to fall as well. This looks rather unexpected. After all, we know that things fall down according to a parabola, and we know that springs collapse according to a sinewave, however neither of the two rules seem to apply here. If you browse around, you will see lots of awesome videos demonstrating or explaining this effect. There are news articles, forum discussions, blog posts and even research papers dedicated to the magical slinky. However, most of them are either too sketchy or too complex, and none seem to mention the important general implications, so let me give a shot at another explanation here.

The Slinky Drop Explained Once More

Let us start with the classical, "high school" model of a spring. The spring has some length $L$ in the relaxed state, and if we stretch it, making it longer by $\Delta L$ , the two ends of the spring exert a contracting force of $k\Delta L$ . Assume we hold the top of the spring at the vertical coordinate $y_{\mathrm{top}}=0$ and have it balance out. The lower end will then position at the coordinate $y_{\mathrm{bot}} = -(L+mg/k)$ , where the gravity force $mg$ is balanced out exactly by the spring force.

How would the two ends of the spring behave if we let go off the top now? Here's how:

The horozontal axis here denotes the time, the vertical axis - is the vertical position. The blue curve is the trajectory of the top end of the spring, the green curve - trajectory of the bottom end. The dotted blue line is offset from the blue line by exactly $L$ - the length of the spring in relaxed state.

Observe that the lower end (the green curve), similarly to the slinky, "waits" for quite a long time for the top to approach before starting to move with discernible velocity. Why is it the case? The trajectory of the lower point can be decomposed in two separate movements. Firstly, the point is trying to fall down due to gravity, following a parabola. Secondly, the point is being affected by string tension and thus follows a cosine trajectory. Here's how the two trajectories look like separately:

They are surprisingly similar at the start, aren't they? And indeed, the cosine function does resemble a parabola up to $o(x^3)$ . Recall the corresponding Taylor expansion:

$\cos(x) = 1 - \frac{x^2}{2} + \frac{x^4}{24} + \dots \approx 1 - \frac{x^2}{2}.$

If we align the two curves above, we can see how well they match up at the beginning:

Consequently, the two forces happen to "cancel" each other long enough to leave an impression that the lower end "waits" for the upper for some time. This effect is, however, much more pronounced in the slinky. Why so?

Because, of course, a single spring is not a good model for the slinky. It is more correct to regard a slinky as a chain of strings. Observe what happens if we model the slinky as a chain of just three simple springs:

Each curve here is the trajectory of one of the nodes inbetween the three individual springs. We can see that the top two curves behave just like a single spring did - the green node waits a bit for the blue and then starts moving. The red one, however, has to wait longer, until the green node moves sufficiently far away. The bottom, in turn, will only start moving observably when the red node approaches it close enough, which means it has to wait even longer yet - by that time the top has already arrived. If we consider a more detailed model, the movement of a slinky composed of, say, 9 basic springs, the effect of intermediate nodes "waiting" becomes even more pronounced:

To make a "mathematically perfect" model of a slinky we have to go to the limit of having infinitely many infinitely small springs. Let's briefly take a look at how that solution looks like.

The Continuous Slinky

Let $x$ denote the coordinate of a point on a "relaxed" slinky. For example, in the two discrete models above the slinky had 4 and 10 points, numbered $1,\dots, 4$ and $1,\dots, 10$ respectively. The continuous slinky will have infinitely many points numbered $[0,1]$ .

Let $h(x,t)$ denote the vertical coordinate of a point $x$ at time $t$ . The acceleration of point $x$ at time $t$ is then, by definition $\frac{\partial^2 h(x,t)}{\partial^2 t}$ , and there are two components affecting it: the gravitational pull $-g$ and the force of the spring.

The spring force acting on a point $x$ is proportional to the stretch of the spring at that point $\frac{\partial h(x,t)}{\partial x}$ . As each point is affected by the stretch from above and below, we have to consider a difference of the "top" and "bottom" stretches, which is thus the derivative of the stretch, i.e. $\frac{\partial^2 h(x,t)}{\partial^2 x}$ . Consequently, the dynamics of the slinky can be described by the equation:

$\frac{\partial^2 h(x,t)}{\partial^2 t} = a\frac{\partial^2 h(x,t)}{\partial^2 x} - g.$

where $a$ is some positive constant. Let us denote the second derivatives by $h_{tt}$ and $h_{xx}$ , replace $a$ with $v^2$ and rearrange to get:

(1) $h_{tt} - v^2 h_{xx} = -g,$

which is known as the wave equation. The name stems from the fact that solutions to this equation always resemble "waves" propagating at a constant speed $v$ through some medium. In our case the medium will be the slinky itself. Now it becomes apparent that, indeed, the lower end of the slinky should not move before the wave of disturbance, unleashed by releasing the top end, reaches it. Most of the explanations of the slinky drop seem to refer to that fact. However when it is stated alone, without the wave-equation-model context, it is at best a rather incomplete explanation.

Given how famous the equation is, it is not too hard to solve it. We'll need to do it twice - first to find the initial configuration of a suspended slinky, then to compute its dynamics when the top is released.

In the beginning the slinky must satisfy $h_t(x, t) = 0$ (because it is not moving at all), $h(0, t) = 0$ (because the top end is located at coordinate 0), and $h_x(1, t) = 0$ (because there is no stretch at the bottom). Combining this with (1) and searching for a polynomial solution, we get:

$h(x, t) = h_0(x) = \frac{g}{2v^2}x(x-2).$

Next, we release the slinky, hence the conditions $h_t(x,t)=0$ and $h(0,t)=0$ disappear and we may use the d'Alembert's formula with reflected boundaries to get the solution:

$h(x,t) = \frac{1}{2}(\phi(x-vt) + \phi(x+vt)) - \frac{gt^2}{2},$

$\text{ where }\phi(x) = h_0(\mathrm{mod}(x, 2)).$

Here's how the solution looks like visually:

Notice how the part of the slinky to which the wave has not arrived yet, stays completely fixed in place. Here are the trajectories of 4 equally-spaced points on the slinky:

Note how, quite surprisingly, all points of the slinky are actually moving with a constant speed, changing it abruptly at certain moments. Somewhat magically, the mean of all these piecewise-linear trajectories (i.e. the trajectory of the center of mass of the slinky) is still a smooth parabola, just as it should be:

The Secret of Spring Motion

Now let us come back to where we started. Imagine a weight on a spring. What will its motion be like? Obviously, any real-life spring is, just like the slinky, best modeled not as a Hooke's simple spring, but rather via the wave equation. Which means that when you let go off the weight, the weight will send a deformation wave, which will move along the spring back and forth, affecting the pure sinewave movement you might be expecting from the simple Hooke's law. Watch closely:

Here is how the movement of the individual nodes looks like:

The fat red line is the trajectory of the weight, and it is certainly not a sinewave. It is a curve inbetween the piecewise-linear "sawtooth" (which is the limit case when the weight is zero) and the true sinusoid (which is the limit case when the mass of the spring is zero). Here's how the zero-weight case looks like:

And this is the other extreme - the massless spring:

These observations can be summarized into the following obviously-sounding conclusion: the basic Hooke's law applies exactly only to the the massless spring. Any real spring has a mass and thus forms an oscillation wave traveling back and forth along its length, which will interfere with the weight's simple harmonic oscillation, making it "less simple and harmonic". Luckily, if the mass of the weight is large enough, this interference is negligible.

And that is, in my opinion, one of the interesting, yet often overlooked aspects of spring motion.

Tags: Explanation, Mathematics, Paradox, Physics, Science, Theory, Visualization
The 3D Web

Posted by Konstantin 25.07.2016 No Comments
The web started off with the simple idea of adding hyperlinks to words within text documents. The hyperlinks would let the reader to easily "jump" from one document or page to another, thus undermining the need to read the text sequentially, page by page, like a book. Suddenly, the information available to a computer user expanded from single documents into a whole network of disparate, interconnected pages. The old-school process of reading the pages in a fixed order became the process of browsing this network.

Hyperlinks could be added to the corresponding words by annotating these words with appropriate mark-up tags. For example, "this sentence" would become "this <a href="other_document">sentence</a>". This looked ugly on a text-only screen, hence browsers were soon born - applications, which could render such mark-up in a nicer way. And once you view your text through a special application which knows how to interpret mark-up (a serious advance back in the 90s), you do not have to limit yourself to only tagging your text with hyperlinks - text appearance (such as whether it should be bold, italic, large or small) could just as well be specified using the same kind of mark-up.

Fast-forward 25 years, most documents on the web consist nearly entirely of mark-up - some have nearly no text at all. The browsers now are extremely complex systems whose job goes way beyond turning hyperlinks into underlined words. They run code, display graphics and show videos. The web experience becomes more graphical and interactive with each year.

There is one area, however, that the web has not yet fully embraced - 3D graphics. First attempts to enrich the web experience with 3D go back as far as 94, when VRML was born. It picked some traction in the scientific community - projects were made which used VRML to, say, visualize molecules. Unfortunately, the common web developers of the time mostly regarded VRML as an arcane technology irrelevant to their tasks, and the layman user would not care to install a heavyweight VRML plug-in in order to view a molecule in the browser. Now, if it were possible to make, say, an addictive 3D shooter game with VRML, things would probably be different - a critical mass of users would be tempted to install the plug-in to play the game, and the developers would become tempted to learn the arcane tech to develop a selling product for that critical mass. Apparently, no selling games were created with VRML.

It took about fifteen years for the browsers to develop native support for 3D rendering technology. It is now indeed possible to create addictive 3D games, which run in the browser. And although a whole market for in-browser 3D applications has been born, the common web developer of our time still regards 3D as an arcane technology irrelevant to their tasks. It requires writing code in an unfamiliar API and it does not seem to interoperate with the rest of your webpage well. The successors of VRML still look like rather niche products for the specialized audience.

I have recently discovered the A-Frame project and I have a feeling that this might finally bring 3D into the web as a truly common primitive. It interoperates smoothly with HTML and Javascript, it works out of the box on most browsers, it supports 3D virtual reality headsets, and it relies on an extremely intuitive Entity-Component approach to modeling (just like the Unity game engine, if you know what that means). Let me show you by example what I mean:
```
<a-scene>
    <a-cube color="#d22" rotation="0 13 0">
        <a-animation attribute="position"
                     dur="1000"
                     easing="ease-in-out-quad"
                     direction="alternate"
                     to="0 2 0"
                     repeat="indefinite">
         </a-animation>
     </a-cube>
</a-scene>
```
This piece of markup can be included anywhere within your HTML page, you can use your favourite Javascript framework to add interactivity or even create the scene dynamically. You are not limited to any fixed set of entities or behaviours - adding new components is quite straightforward. The result seems to work on most browsers, even the mobile ones.

Of course, things are not perfect, A-Frame's version number is 0.2.0 at the moment, and there are some rough edges to be polished and components to be developed. None the less, the next time you need to include a visualization on your webpage, try using D3 with A-frame, for example. It's quite enjoyable and feels way more natural than any of the 3D-web technologies I've tried so far.
Tags: 3D, Programming, Visualization, Web
Holography in Simple Terms

Posted by Konstantin 24.07.2015 No Comments

Holography is a fascinating way of storing and displaying visual information - one which will, hopefully, one day become as commonplace as photography and video is today. Unfortunately, at the current moment it is still a largely exotic trade, limited to artistic exhibitions or science museum displays. While the explanation of how photography works is usually part of the standard school curriculum, holography, at the same time, is often regarded as some mysterious sci-fi technology. The problem is made worse by the fact that there do not seem to be too many explanations around in the Internet, which would aim to properly dispel the aura of mystery. Here is my take at fixing that problem.

Let us start by considering the following illustration, where two hypothetical eyes are observing two hypothetical light sources.

The light from the two lamps reaches the eyes at different angles and intensities, which depend on the positions of the eyes relative to the lights. Because of that difference in angles and intensities, the brain, attached to the eyes, is capable of "triangulating" the location of the lamps, thus forming an enjoyable "three-dimensional" perception of the whole scene for itself.

Let us position a piece of glass in front of the light sources. We can regard this glass as a set of "pixels", through which light rays fly from the lamps towards the eyes. The figure below illustrates three such randomly selected "pixels" on the glass along with the directions of rays that leave them.

What if we could design a piece of glass which would fully consist of such hypothetical "pixels", with each pixel emitting two light rays in the appropriate directions? The eyes, observing such a piece of glass, would not know that the rays are really produced "in the pixels". Indeed, they perceive exactly the same light rays as if the light was coming from two lamps, positioned somewhere behind the glass. Hence, the brain, attached to the eyes, has to perform the same triangulation as the one it did when the lamps were real, and is not be able to distinguish the "fake glass" from a real scene with the lamps.

Although this idea of faking reality might look nice, the task of manufacturing small enough "pixels", which would be able to emit light in a number of given directions with given intensities and colors, is, unfortunately, technologically too complex to be practical.

This is where Physics comes to help us out. "Hey, guys," - she says, - "you are forgetting that light is not really a set of rays flying from from point A to point B along straight lines. Light is a wave. Hence, assuming that your light sources are emitting coherent light of a given frequency, a correct illustration of what happens in your scene with two lamps would be the following:"

If we now consider an arbitrary point-"pixel" on our glass, the electromagnetic waves which pass through it are perceived there in the form of some oscillation. The whole situation above could be modeled by considering waves on a water surface. To simulate the "glass" (the blue line) let us have a horizontal row of buoys float on the surface. Each of the buoys will be moving up and down in its place due to the waves, and we can record this movement.

Here comes a question: what if we remove the original two oscillation sources and instead will start "replaying" the recorded movements of the buoys buy forcing them to move up and down and thus form their own waves? The example below simulates this situation using 19 "buoys":

Does not look nice at all, does it? Now Mathematics comes along to help: "Try harder!" - she says. "If your buoys are dense enough, everything will work out!". Hence, we increase the number of buoys and try again:

Indeed, now we observe how a row of oscillating buoys (an analogue of a "glass" in our model) is generating a wave in front of it, which is completely identical to the one that was produced when two sources were oscillating behind it. No matter where we position ourselves in front of the buoy row, our perception will happily fool us into thinking that the wave we see is really a result of two point sources oscillating somewhere in a distance. Our brain will even help us to estimate the exact position of those sources somewhere behind the row. He does not know he is being fooled. Try covering the upper half of the animation above to get a better understanding of this illusion.

Hence, if we can make a glass, which consists of "pixels" each of which could "record" a profile of a light wave and then be able to reproduce it, the "image" observed by looking at such glass would be perfectly identical to an image that would be observed if the glass would simply pass through it the light from some sources located behind it. But is it possible? Isn't asking each "pixel" to record an arbitrary oscillation too much?

Here is when Mathematics comes along to help again. "Do you want to see a magic trick?" - she asks. "Yes!" - we answer.

Remember that you have two light sources emitting waves behind your piece of glass. As we agreed, they are emitting coherent light, i.e. light of the same frequency (for example, you could shine a laser light on the two points and have them reflect this light). Now let us see what happens at one of the pixels-"buoys" of your glass.

This point receives light waves from both sources. As the distances to the two sources are different, the two light waves reach the pixel with different intensities and phases. Formally, suppose the wave from the first source reaches the pixel with amplitude $a_1$ and phase $b_1$ and the wave from the second source reaches the pixel with amplitude $a_2$ and phase $b_2$ . The two waves add up, so the pixel "feels" the overall oscillation of the form

$a_1 \sin(ft + b_1) + a_2 \sin(ft + b_2).$

Now if you have not two, but three, four, ten, or a million of point sources behind the glass, the arriving signal will be, correspondingly, a sum of three, four, ten, or even a million sinewaves with different amplitudes and phases.

Now here comes the trick, hold your breath: Sum of any number of such sinewaves is itself just a single sinewave with some amplitude and phase.

In other words, even if there is a million point sources emitting coherent light from behind the glass, each "pixel-buoy" of our glass will still perform a simple oscillation of the form $A \sin(ft + B)$ . Hence, to "record" the oscillation of each "pixel-buoy" we must store only two parameters: its intensity and phase.

Now everything starts to look much simpler. Storing light intensity of a pixel is simple - any photograph does that. Dark spots on a photo or a screen emit (or reflect) low intensity light, white spots emit high intensity. It only remains to store the phase. Several methods have been proposed for that. Conceptually the simplest one (but technologically the most advanced) makes use of particular crystals which can shift the phase of light that passes through them. Each "pixel" could then be made out of such a crystal with an appropriate phase shift coded in it. When a "glass" made of such crystals is lit from the back by a coherent light source, each pixel would introduce the necessary phase shift into the light which passes through it, and thus reproduce the necessary wave front. I presume that this HoloVisio project is relying on this idea to design a proper holographic monitor.

The conventional holograms, the ones you typically see in science museums, are made using a simpler principle, though. Omitting some technical details, the idea behind it is the following. Let us add to the scene a reference ray of coherent light.

Now at some pixels on our "glass" the phase of the reference ray will coincide with the phase of the light coming from the scene. The overall wave oscillation at those points will be increased. At other points the phases of the reference and the scene light will mismatch and the signal will be weakened. We can record the resulting interference pattern by storing light spots on the glass at the pixels where the phases matched and dark spots where the phases did not match.

Now, whenever we shine the reference ray again through the resulting plate, only the parts of the ray with the "correct" phase will be let through, as those are exactly the places where the glass is transparent to the reference ray. As a result, you'll obtain a decent holographic reproduction. Here's a figure from Wikipedia, illustrating this principle of recording and reproducing holograms:

Considering the speed at which our photocameras have been transitioning over the recent decades from megapixel resolutions into gigapixels and observing the fast spread of stereoscopic "3D" from movie theaters into TV-sets and portable devices, it is both surprising and sad to see no notable signs of proper holographic display technology being developed in parallel. I do hope to see mass-produced holographic displays along with appropriate software within my own lifetime, though.

Tags: 3D, Mathematics, Physics, Visualization
My Favourite Statistical Method

Posted by Konstantin 05.04.2015 4 Comments

When it comes to data analysis, there are hundreds of exciting approaches: simple summary statistics and hypothesis tests, various clustering methods, linear and nonlinear regression or classification techniques, neural networks of various types and depths, decision rules and frequent itemsets, feature extractors and dimension reductors, ensemble methods, bayesian approaches and graphical models, logic-based approaches and fuzzy stuff, ant colonies, genetic algorithms and other optimization methods, monte-carlo algorithms, sampling and density estimation, logic-based and graph methods. Don't even get me started on the numerous visualization techniques.

This sheer number of options is, however, both a blessing and a curse at the same time. In many practical situations just having those methods at your disposal may pose more problems than solutions. First you need to pick one of the approaches that might possibly fit your purpose. Then you will try to adapt it appropriately, spend several iterations torturing the data only to obtain very dubious first results, come to the conclusion that most probably you are doing something wrong, reconvince yourself that you need to try harder in that direction, spend some more iterations testing various parameter settings. Nothing works as you want it to, so you start everything from scratch with another method to find yourself obtaining new, even more dubious results, torturing the data even further, getting tired of that and finally settling on something "intermediately decent", which "probably makes sense", although you are not so sure any more and feel frustrated.

I guess life of a statistician was probably way simpler back in the days when you could run a couple of t-tests, or an F-test from a linear regression and call it a day. In fact, it seems that many experimental (e.g. wetlab) scientists still live in that kind of world, when it comes to analyzing their experimental results. The world of T-tests is cozy and safe. They don't get you frustrated. Unfortunately, t-tests can feel ad-hockish, because they force you to believe that something "is normally distributed". Also, in practice, they are mainly used to confirm the obvious rather than discover something new from the data. A simple scatterplot will most often be better than a t-test as an analysis method. Hence, I am not a big fan of T-tests. However, I do have my own favourite statistical method, which always feels cozy and safe, and never gets me frustrated. I tend to apply it whenever I see a chance. It is the Fisher exact test in the particular context of feature selection.

My appreciation of it stems from my background in bioinformatics and some experience with motif detection in particular. Suppose you have measured the DNA sequences for a bunch of genes. What can you do to learn something new about the sequence structure from that data? One of your best bets is to first group your sequences according to some known criteria. Suppose you know from previous experiments that some of the genes are cancer-related whereas others are not. As soon as you have specified those groups, you can start making observations like the following: "It seems that 10 out of my 20 cancer-related genes have the subsequence GATGAG in their DNA code. The same sequence is present in only 5 out of 100 non-cancer-related ones. How probable would it be to obtain similar counts of GATGAG, if the two groups were picked randomly?" If the probability to get those counts at random is very low, then obviously there is something fishy about GATGAG and cancer - perhaps they are related. To compute this probability you will need to use the hypergeometric distribution, and the resulting test (i.e. the question "how probable is this situation in a random split?") is known as the Fishers' exact test.

This simple logic (with a small addition of a multiple testing correction on top) has worked wonders for finding actually important short sequences on the DNA. Of course it is not limited to sequence search. One of our research group's most popular web tools uses the same approach to discover functional annotations, that are "significantly overrepresented" in a given group of genes. The same approach can be used to construct decision trees, and in pretty much any other "supervised learning" situation, where you have groups of objects and want to find binary features of those objects, associated with the groups.

Although in general the Fisher test is just one particular measure of association, it is, as I noted above, rather "cozy and comfortable". It does not force me to make any weird assumptions, there is no "ad-hoc" aspect to it, it is simple to compute and, most importantly, in my experience it nearly always produces "relevant" results.

Words overrepresented in the speeches of Greece MPs

A week ago me, Ilya and Alex happened to take part in a small data analysis hackathon, dedicated to the analysis of speech transcripts from the European Parliament. Somewhat analogously to DNA sequences, speeches can be grouped in various ways: you can group them by the speaker who gave them, by country, gender or political party of that speaker, by the month or year when the speech was given or by any combination of such groupings. The obvious "features" of a speech are words, which can be either present or not present in it. Once you view the problem this way the task of finding group-specific words becomes self-evident and the Fisher test is the natural solution to it. We implemented this idea and extracted "country-specific" and "time-specific" words from the speeches (other options were left out due to time constraints). As is usual the case with my favourite method, the obtained results look relevant, informative and, when shown in the form of a word cloud, fun. Check them out.

The complete source code of the analysis scripts and the visualization application is available on Github.

Tags: Bioinformatics, Data analysis, Data mining, Programming, Project, Statistics, Tool, Visualization
Postscript

Posted by Konstantin 15.03.2015 2 Comments
Anyone who has had to deal with scientific literature must have encountered Postscript (".ps") files. Although the popularity of this format is gradually fading behind the overwhelming use of PDFs, you can still find .ps documents on some major research paper repositores, such as arxiv.org or citeseerx. Most people who happen to produce those .ps or .eps documents, do it using auxiliary tools, most commonly while typesetting their papers in LaTeX, or while preparing images for those papers using a vector graphics editor (e.g. Inkscape). As a result, Postscript tends to be regarded by the majority of its users as some random intermediate file format, akin to any of the myriad of other vector graphics formats.

I find this situation unfortunate and unfair towards Postscript. After all, PS is not your ordinary vector graphics format. It is a fully-fledged Turing-complete programming language, that is well thought through and elegant in its own ways. If it were up to me, I would include a compulsory lecture on Postscript into any modern computer science curriculum. Let me briefly show you why.

Stack-based programming

Firstly, PostScript is perhaps the de-facto standard example of a proper purely stack-based language in the modern world. Other languages of this group are nowadays either dead, too simple, too niche, or not practical. Like any stack language, it looks a bit unusual, yet it is simple to reason about and its complete specification is decently short. Let me show you some examples:
```
2 2 add     % 2+2 (the two numbers are pushed to the stack,
            % then the add command pops them and replaces with
            % the resulting value 4)
/x 2 def                  % x := 2 (push symbol "x", push value 2,
                          %         pop them and create a definition)
/y x 2 add def            % y := x + 2 (you get the idea)
(Hello!) show             % print "Hello!"
x 0 gt {(Yes) show} if    % if (x > 0) print "Yes"
```
Adding in a couple of commands that specify font parameters and current position on the page, we may write a perfectly valid PS file that would perform arithmetic operations, e.g:
```
/Courier 10 selectfont   % Font we'll be using
72 720 moveto            % Move cursor to position (72pt, 720pt)
                         % (0, 0) is the lower-left corner
(Hello! 2+2=) show
2 2 add                  % Compute 2+2
( ) cvs                  % Convert the number to a string.
                         % (First we had to provide a 1-character
                         % string as a buffer to store the result)
show                     % Output "4"
```
Computer graphics

Postscript has all the basic facilities you'd expect from a programming language: numbers, strings, arrays, dictionaries, loops, conditionals, basic input/output. In addition, being primarily a 2D graphics language, it has all the standard graphics primitives. Here's a triangle, for example:
```
newpath           % Define a triangle
    72 720 moveto
    172 720 lineto
    72 620 lineto
closepath
gsave             % Save current path
10 setlinewidth   % Set stroke width
stroke            % Stroke (destroys current path)
grestore          % Restore saved path again
0.5 setgray       % Set fill color
fill              % Fill
```
Postscript natively supports the standard graphics transformation stack:
```
/triangle {       % Define a triangle of size 1 n the 0,0 frame
    newpath
        0 0 moveto
        1 0 lineto
        0 1 lineto
    closepath
    fill
} def

72 720 translate      % Move origin to 72, 720
gsave                 % Push current graphics transform
    -90 rotate        % Draw a rotated triangle
    72 72 scale       % .. with 1in dimensions
    triangle
grestore              % Restore back to non-scaled, non-rotated frame
gsave
    100 0 translate   % Second triangle will be next to the first one
    32 32 scale       % .. smaller than the first one
    triangle          % .. and not rotated
grestore
```
Here is the result of the code above:

Two triangles

The most common example of using a transformation stack is drawing fractals:
```
/triangle {
    newpath
        0 0 moveto
        100 0 lineto
        0 -100 lineto
    closepath
    fill
} def

/sierpinski {
    dup 0 gt
    {
        1 sub
        gsave 0.5 0.5 scale dup sierpinski grestore
        gsave 50 0 translate 0.5 0.5 scale dup sierpinski grestore
        gsave 0 -50 translate 0.5 0.5 scale sierpinski grestore
    }
    { pop triangle }
    ifelse
} def
72 720 translate  % Move origin to 72, 720
5 5 scale
5 sierpinski
```
Sierpinski triangle

With some more effort you can implement nonlinear dynamic system (Mandelbrot, Julia) fractals, IFS fractals, or even proper 3D raytracing in pure PostScript. Interestingly, some printers execute PostScript natively, which means all of those computations can happen directly on your printer. Moreover, it means that you can make a document that will make your printer print infinitely many pages. So far I could not find a printer that would work that way, though.

System access

Finally, it is important to note that PostScript has (usually read-only) access to the information on your system. This makes it possible to create documents, the content of which depends on the user that opens it or machine where they are opened or printed. For example, the document below will print "Hello, %username", where %username is your system username:
```
/Courier 10 selectfont
72 720 moveto
(Hi, ) show
(LOGNAME) getenv {} {(USERNAME) getenv pop} ifelse show
(!) show
```
I am sure, for most people, downloading a research paper from arxiv.org that would greet them by name, would probably seem creepy. Hence this is probably not the kind of functionality one would exploit with good intentions. Indeed, Dominique has an awesome paper that proposes a way in which paper reviewers could possibly be deanonymized by including user-specific typos in the document. Check out the demo with the source code.

I guess this is, among other things, one of the reasons we are going to see less and less of Postscript files around. But none the less, this language is worth checking out, even if only once.
Tags: Computer science, Fun, Language, Programming, Visualization
A Brief Visual Guide to Bar Charts

Posted by Konstantin 26.10.2012 4 Comments

It is annoying how popular it is to ignore the Y-axis limits on bar charts nowadays. Unfortunately, this is also the default mode for most plotting packages, so no one wants to do anything about it. But something must be done.

Tags: Data analysis, Project, Visualization
Venn Diagrams in Python

Posted by Konstantin 13.10.2012 38 Comments

I have recently discovered that simple Venn diagrams are surprisingly popular in bioinformatics. So popular they are, in fact, that there are several bioinformatics research papers devoted solely to their use. And those are highly accessed papers, let me add! Yet, despite this wild popularity, tools that let you render a decent Venn diagram programmatically seem to be rather scarce.

Vennerable plot

If you google a bit, you will find a bunch of on-line tools of varying degrees of quality and ability (1, 2, 3, 4, 5, 6, 7, 8, 9,...), a Java-based tool, a Perl library, a couple of Python scripts (1, 2), some R libraries (1, 2, 3, 4, 5), and lots of forum discussions. Seems to be plenty, doesn't it? Well, it turns out that if you want your diagram to be area-weighted (i.e. the regions of the diagram should be roughly proportional to the corresponding set sizes), 4 of those 18 links won't do. If you want to generate and configure the diagram conveniently from a script, drop another 9. Then, if you want the diagram to look nice, drop 4 more, and all you are left with is the Vennerable R package. Unfortunately, Vennerable plots are still a pain to configure — even adding a plot title seems to be very tricky, not speaking of highlighting and annotating a region on the diagram.

Having been totally disappointed in the state of the art of contemporary Venn-diagramming tools, I made a small Python package for drawing Venn diagrams that has the necessary flexibility. At least it lets me put plot titles and annotate diagram regions as I fancy.

Matplotlib-venn plot

Package installation goes by the standard method: easy_install matplotlib-venn

For basic usage examples, consult the PyPI page.

Tags: Bioinformatics, Project, Python, Tool, Visualization
On Visualization and Data Analysis

Posted by Konstantin 23.01.2012 2 Comments

Visualization is a very powerful method for data analysis. Very often, plotting a bunch of scatterplots, barplots, heatmaps, animations or other kinds of imagery is enough to immediately see by your own eyes, whether there are any interesting patterns in the data (which often means you have nearly solved the problem) or not (which means you should prepare yourself for a long-term battle with the data which might not end up succesfully for you).

Visualization is powerful because by visualizing data you essentially "plug it" directly into your brain's processing engine, using the visual interface that happens to be supported by your brain. You need to convert the data into CSV or an XLS format to load it into Excel. Analogously, you need a 2d image or an animation to load the data into your brain - it is that simple.

This view suggests two immediate developments. Firstly, why don't we use the other "interfaces" that our brain has with the outside world for data processing? Could converting data to something which sounds, feels, tastes or smells be a useful method for exploiting our brain's analytic capabilities even further? Obviously, visual input has the most impact simply due to the fact that the retina is an immediate part of the brain. However, auditory signals, for example, seem to have a powerful processing system in our brain dedicated to them too.

Secondly, if we can appreciate how much our brain is capable of extracting from a single image, why don't we try to automate such an approach? Modern computer vision has reached sufficient maturity to be capable of extracting fairly complex informative features from images. This suggests that a particular 2d plot of a dataset can be used as a kind of an informative "data fingerprint" which, when processed by a computer vision-driven engine, could be analyzed on the presence of "visible" patterns and visual similarity to other datasets.

The fun part is that there has been some research done in this direction. Consider the paper "Computer Vision for Music Identification" by Yan Ke et al. The authors propose to convert pieces of music into a spectrogram image. Those spectrogram images can then be compared to each other using methods of computer vision, thus resulting in an efficient similarity metric, usable for search and identification of musical pieces. The authors claim to achieve 95% precision at 90% recall, which compares favourably to alternative methods. I think it would be exciting to see more of such techniques applied in a wider range of areas.

Representing audio as pictures, figure from (Y.Ke, 2005)

Tags: Computer vision, Data analysis, Pattern analysis, Project, Visualization
On Supervised Visualization of Multivariate Data

Posted by Konstantin 26.02.2009 No Comments
The first step in the analysis of multivariate data is visualization. Histograms of attribute distributions, scatterplots, box-and-whiskers diagrams, parallel coordinate plots, self organizing maps, and even plots of happy faces - are all means of helping a human to visually comprehend multidimensional data and expoit the enormous power of the human brain to detect patterns. Of all these techniques, two-dimensional scatterplots are perhaps the most popular, as they tend to provide an especially "realistic" feel for the data. But when your data has more than two attributes (perhaps hundreds or thousands), how do you choose the two projection coordinates that would provide you with the "best angle" on the data?

The easiest answer to that question is, of course, to pick a pair of attributes A_i and A_j, and simply plot one versus the other. Unfortunately, this doesn't usually work well, especially when the dataset does have hundreds of attributes. Therefore, the most popular approach in practice is to use PCA and project the data onto the two largest principal components, which mostly results in a rather insightful image.

The PCA projection is, however, completely unsupervised. If your data has class labels assigned to points, PCA does not take them into account. No matter what is the labeling, PCA will always produce the same projection onto the coordinates with the highest variation. This might leave an improper impression that the two classes overlap a lot when in fact they do not. Therefore, this is not what you need. Usually, in the case of labeled data you expect from a scatterplot to provide an indication of how separated the two classes are from each other, and how difficult could it be to discriminate between them. It turns out that it is very easy to construct a linear projection with such properties.

The Linear classifier-based Scatterplot

Assume there are two classes in the data and we are interested in a linear projection, that demonstrates how separated the classes are. Let us train a linear classifier to discriminate the two classes. It does not matter which algorithm you use, as long as it results in a separating hyperplane. Now naturally, the normal to this hyperplane is the main coordinate of interest to you: it is the direction along which the data will be classified linearly by your algorithm. If there is a coordinate for demonstrating separation, this must be it. The choice of the second projection coordinate does not matter much, so I would propose picking any direction orthogonal to the first.

When you have three classes you could select the first projection coordinate as the normal of a hyperplane, separating the first class from the second, and the second coordinate as the normal of a hyperplane, separating the first class from the third.

Finally, note that in general you need not limit yourself to linear classifiers only. Any classifier of the form $y_i = \mathrm{sign}(f(x_i))$ will provide you with an informative coordinate projection function $f(x)$ . This is a natural "supervised" alternative to kernel-PCA or SOM.

Naive supervised linear scatterplot (NS-plot)

To be somewhat more specific, here's a suggestion of a very simple implementation for the abovementioned idea. To avoid the use of a potentially complicated linear classifier training algorithm, let us just pick the vector connecting the means of the two classes as the first projection coordinate. The second coordinate is chosen at random and then orthogonalized with the first one. The Scilab code of the whole algorithm is therefore the following:
```
  // Input:
  //   X - the data matrix (instances in rows, attributes in columns)
  //   C - class assignments (C(i) is the class of instance X(i,:))
  mean_1 = mean(X(C ==  1, :), 'r')';
  mean_2 = mean(X(C == -1, :), 'r')';
  v1 = (mean_2 - mean_1)/norm(mean_2 - mean_1);
  v2 = rand(v1);
  v2 = v2 - v2'*v1*v1;
  v2 = v2/norm(v2);
  X_proj = X*[v1 v2];
  // Output:
  //   X_proj - the projected coordinates
```
Notice how much simpler and more efficient it is than PCA. Despite the simplicity, I haven't seen the use of such a plot anywhere else, so let me coin the boring name NS-plot for it. Personal experience shows that the resulting plot is visually never much worse than a PCA plot, and most often the two plots complement each other. Let me illustrate that on two simple examples.

The IRIS dataset. The plots below show the PCA and the NS plots of the famous iris dataset (where I removed the first class). There is clearly no strong advantage of one plot over the other except that PCA is more difficult to compute.

The ARCENE dataset. The following plots depict the 1000-attribute ARCENE sample dataset. We can see how PCA prefers to stress the unsupervised clustering present in the dataset, thus potentially deemphasizing the specifics of class labeling. In this case, I would say, the PCA and the NS plots complement each other.

Bonus

Noticed the circled points on the plots above? This is one other small trick that I find quite useful, and that does not seem to be widely known. The circled points denote the "boundary" - these are the points whose nearest neighbor is of a different class than their own. The more boundary points there are - the more difficult is the classification problem. The boundary is not an absolute notion, because there are various ways to define distance between points. My suggestion would be to standardize all attributes and use the euclidean norm, unless you have good reasons to do something else (e.g. you a-priori know good weights for the attributes, etc).
Tags: Data analysis, Visualization