Four Years Remaining

The Mystery of Early Stopping

Posted by Konstantin 06.12.2017 7 Comments

Early stopping is a technique that is very often used when training neural networks, as well as with some other iterative machine learning algorithms. The idea is quite intuitive - let us measure the performance of our model on a separate validation dataset during the training iterations. We may then observe that, despite constant score improvements on the training data, the model's performance on the validation dataset would only improve during the first stage of training, reach an optimum at some point and then start getting worse with further iterations.

The early stopping principle

It thus seems reasonable to stop training at the point when the minimal validation error is achieved. Training the model any further only leads to overfitting. Right? The reasoning sounds solid and, indeed, early stopping is often claimed to improve generalization in practice. Most people seem to take the benefit of the technique for granted. In this post I would like to introduce some skepticism into this view or at least illustrate that things are not necessarily as obvious as they may seem from the diagram with the two lines above.

How does Early Stopping Work?

To get a better feeling of what early stopping actually does, let us examine its application to a very simple "machine learning model" - the estimation of the mean. Namely, suppose we are given a sample of 50 points $\mathbf{x}_i$ from a normal distribution with unit covariance and we need to estimate the mean $\mathbf{w}$ of this distribution.

Sample

The maximum likelihood estimate of $\mathbf{w}$ can be found as the point which has the smallest sum of squared distances to all the points in the sample. In other words, "model fitting" boils down to finding the minimum of the following objective function:

$f_\mathrm{train}(\mathbf{w}) := \sum_{i=1}^{50} \Vert \mathbf{x}_i - \mathbf{w}\Vert^2$

As our estimate is based on a finite sample, it, of course, won't necessarily be exactly equal to the true mean of the distribution, which I chose in this particular example to be exactly (0,0):

Sample mean as a minimum of the objective function

The circles in the illustration above are the contours of the objective function, which, as you might guess, is a paraboloid bowl. The red dot marks its bottom and is thus the solution to our optimization problem, i.e. the estimate of the mean we are looking for. We may find this solution in various ways. For example, a natural closed-form analytical solution is simply the mean of the training set. For our purposes, however, we will be using the gradient descent iterative optimization algorithm. It is also quite straightforward: start with any point (we'll pick (-0.5, 0) for concreteness' sake) and descend in small steps downwards until we reach the bottom of the bowl:

Gradient descent
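The descent itself takes only a few lines. Here is a minimal NumPy sketch of the procedure just described; the random seed, the step size `learning_rate` and the iteration count are arbitrary assumptions, not the values behind the figures.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed (arbitrary assumption)
X = rng.normal(size=(50, 2))             # 50 points, true mean (0, 0), unit covariance

def grad_f_train(w, points):
    """Gradient of sum_i ||x_i - w||^2 with respect to w."""
    return 2.0 * (w - points).sum(axis=0)

w = np.array([-0.5, 0.0])                # starting point used in the text
learning_rate = 0.001                    # assumed step size
trajectory = [w.copy()]
for _ in range(200):                     # assumed iteration budget
    w = w - learning_rate * grad_f_train(w, X)
    trajectory.append(w.copy())

print("gradient descent:", w)
print("sample mean:     ", X.mean(axis=0))
```

With these settings every step shrinks the distance to the bottom of the bowl by a constant factor, so the final iterate agrees with the closed-form sample mean to many decimal places.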

Let us now introduce early stopping into the fitting process. We will split our 50 points randomly into two separate sets: 40 points will be used to fit the model and 10 will form the early stopping validation set. Thus, technically, we now have two different objective functions to deal with:

$f_\mathrm{fit}(\mathbf{w}) := \sum_{i=1}^{40} \Vert \mathbf{x}_i - \mathbf{w}\Vert^2$

and

$f_\mathrm{stop}(\mathbf{w}) := \sum_{i=41}^{50} \Vert \mathbf{x}_i - \mathbf{w}\Vert^2.$

Each of those defines its own "paraboloid bowl", both slightly different from the original one (because those are different subsets of data):

Fitting and early stopping objectives

As our algorithm descends towards the red point, we will be tracking the value of $f_\mathrm{stop}$ at each step along the way:

Gradient descent with validation

With a bit of imagination you should see in the image above how the validation error decreases as the yellow trajectory approaches the purple dot and then starts to increase after some point midway. The spot where the validation error achieves the minimum (and thus the result of the early stopping algorithm) is shown by the green dot on the figure below:

Early stopping

In a sense, the validation function now acts as a kind of a "guardian", preventing the optimization from converging towards the bottom of our main objective. The algorithm is forced to settle on a model, which is neither an optimum of $f_\mathrm{fit}$ nor of $f_\mathrm{stop}$ . Moreover, both $f_\mathrm{fit}$ and $f_\mathrm{stop}$ use less data than $f_\mathrm{train}$ , and are thus inherently a worse representation of the problem altogether.
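To make the "guardian" concrete, here is a small sketch of early stopping for this toy model, assuming NumPy. It descends on $f_\mathrm{fit}$ and remembers the iterate with the lowest $f_\mathrm{stop}$. The helper name `fit_with_early_stopping`, the seed, the step size and the number of steps are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)           # fixed seed (arbitrary assumption)
X = rng.normal(size=(50, 2))
# The points are i.i.d., so taking the first 40 is as good as a random split.
X_fit, X_stop = X[:40], X[40:]

def objective(w, points):
    """Sum of squared distances from w to the given points."""
    return ((points - w) ** 2).sum()

def fit_with_early_stopping(X_fit, X_stop, w0, learning_rate=0.001, n_steps=200):
    """Descend on f_fit; return (iterate with the lowest f_stop, final iterate)."""
    w = np.array(w0, dtype=float)
    best_w, best_val = w.copy(), objective(w, X_stop)
    for _ in range(n_steps):
        w = w - learning_rate * 2.0 * (w - X_fit).sum(axis=0)
        val = objective(w, X_stop)
        if best_val > val:               # validation objective improved
            best_w, best_val = w.copy(), val
    return best_w, w

w_early, w_converged = fit_with_early_stopping(X_fit, X_stop, w0=(-0.5, 0.0))
print("early-stopped estimate:", w_early)
print("converged estimate:    ", w_converged)
print("mean of fitting set:   ", X_fit.mean(axis=0))
```

The early-stopped point is never worse than the converged one on $f_\mathrm{stop}$ and never better on $f_\mathrm{fit}$, which is exactly the "optimum of nothing" situation described above.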

So, by applying early stopping we effectively reduced our training set size, used an even less reliable dataset to abort training, and settled on a solution which is not an optimum of anything at all. Sounds rather stupid, doesn't it?

Indeed, observe the distribution of the estimates found with (blue) and without (red) early stopping in repeated experiments (each time with a new random dataset):

Solutions found with and without early stopping

As we see, early stopping greatly increases the variance of the estimate and adds a small bias towards our optimization starting point.
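The repeated experiment can be imitated roughly as follows. This is a sketch under assumed settings (500 repetitions, a 40/10 split, a fixed step size and iteration count), so the numbers it prints will not match the figure exactly, but the qualitative picture should be the same.

```python
import numpy as np

def gd_path(points, w0, learning_rate=0.001, n_steps=150):
    """All gradient descent iterates for sum_i ||x_i - w||^2, as an array."""
    w = np.array(w0, dtype=float)
    path = [w.copy()]
    for _ in range(n_steps):
        w = w - learning_rate * 2.0 * (w - points).sum(axis=0)
        path.append(w.copy())
    return np.array(path)

rng = np.random.default_rng(2)           # fixed seed (arbitrary assumption)
w0 = (-0.5, 0.0)
full_estimates, early_estimates = [], []
for _ in range(500):                     # assumed number of repetitions
    X = rng.normal(size=(50, 2))         # fresh dataset, true mean (0, 0)
    # Without early stopping: descend on all 50 points until convergence.
    full_estimates.append(gd_path(X, w0)[-1])
    # With early stopping: descend on 40 points and keep the iterate
    # that has the lowest objective on the remaining 10 points.
    path = gd_path(X[:40], w0)
    val = ((X[40:, None, :] - path[None, :, :]) ** 2).sum(axis=(0, 2))
    early_estimates.append(path[val.argmin()])

full_estimates = np.array(full_estimates)
early_estimates = np.array(early_estimates)
err_full = (full_estimates ** 2).sum(axis=1)     # squared distance to (0, 0)
err_early = (early_estimates ** 2).sum(axis=1)

print("mean squared error without early stopping:", err_full.mean())
print("mean squared error with early stopping:   ", err_early.mean())
print("average early-stopped estimate:           ", early_estimates.mean(axis=0))
```

The first coordinate of the average early-stopped estimate comes out negative, i.e. pulled towards the starting point (-0.5, 0).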

Finally, let us see how the quality of the fit depends on the size of the validation set:

Fit quality vs validation set size

Here the y axis shows the squared distance of the estimated point to the true value (0,0), smaller is better (the dashed line is the expected distance of a randomly picked point from the data). The x axis shows all possible sizes of the validation set. We see that using no early stopping at all (x=0) results in the best expected fit. If we do decide to use early stopping, then for best results we should split the data approximately equally into training and validation sets. Interestingly, there does not seem to be much difference in whether we pick 30%, 50% or 70% of data for the validation set - the validation set seems to play just as much of a role in the final estimate as the training data.

Early Stopping with Non-convex Objectives

The experiment above seems to demonstrate that early stopping should be almost certainly useless (if not harmful) for fitting simple convex models. However, it is never used with such models in practice. Instead, it is most often applied to the training of multilayer neural networks. Could it be the case that the method somehow becomes useful when the objective is highly non-convex? Let us run a small experiment, measuring the benefits of early stopping for fitting a convolutional neural network on the MNIST dataset. For simplicity, I took the standard example from the Keras codebase, and modified it slightly. Here is the result we get when training the most basic model:

MNIST - Basic

The y axis depicts log-loss on the 10k MNIST test set, the x axis shows the proportion of the 60k MNIST training set set aside for early stopping. Ignoring small random measurement noise, we may observe that using early stopping with about 10% of the training data does seem to convey a benefit. Thus, contrary to our previous primitive example, when the objective is complex, early stopping does work as a regularization method. Why and how does it work here? Here's one intuition I find believable (there are alternative possible explanations and measurements, none of which I find too convincing or clear, though): stopping the training early prevents the algorithm from walking too far away from the initial parameter values. This limits the overall space of models and is vaguely analogous to suppressing the norm of the parameter vector. In other words, early stopping resembles an ad-hoc version of $\ell_p$ regularization.
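The analogy can be made exact for the toy mean-estimation model from the first part of the post, where everything is computable by hand. This is only a sketch for that quadratic objective, not a statement about neural networks; the step size $\eta$, the penalty weight $\alpha$, the number of fitting points $n$, their mean $\mathbf{m}$ and the starting point $\mathbf{w}_0$ are notation introduced here. Since the gradient of $\sum_{i=1}^{n} \Vert \mathbf{x}_i - \mathbf{w}\Vert^2$ is $2n(\mathbf{w} - \mathbf{m})$, every gradient descent step shrinks the offset from $\mathbf{m}$ by the same factor, so after $t$ steps

$\mathbf{w}_t = \mathbf{m} + (1 - 2n\eta)^t(\mathbf{w}_0 - \mathbf{m}).$

On the other hand, a fit with an $\ell_2$ penalty that pulls the solution towards the starting point has the closed-form solution

$\arg\min_{\mathbf{w}} \left(\sum_{i=1}^{n} \Vert \mathbf{x}_i - \mathbf{w}\Vert^2 + \alpha\Vert \mathbf{w} - \mathbf{w}_0\Vert^2\right) = \mathbf{m} + \frac{\alpha}{n+\alpha}(\mathbf{w}_0 - \mathbf{m}).$

Hence, for $0 < 2n\eta < 1$, stopping after $t$ steps yields the same point as the penalized problem with $\frac{\alpha}{n+\alpha} = (1 - 2n\eta)^t$: the earlier we stop, the larger the implied penalty. Note that the penalty here is centered at the initial parameter values rather than at zero.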

Indeed, observe how the use of early stopping affects the results of fitting the same model with a small $\ell_2$ -penalty added to the objective:

MNIST - L2

All of the benefits of early stopping are gone now, and the baseline (non-early-stopped, $\ell_2$ -regularized) model is actually better overall than it was before. Let us now try an even more heavily regularized model by adding dropout (instead of the $\ell_2$ penalty), as is customary for deep neural networks. We can observe an even cleaner result:

MNIST - Dropout

Early stopping is again not useful at all, and the overall model is better than all of our previous attempts.

Conclusion: Do We Need Early Stopping?

Given the reasoning and the anecdotal experimental evidence above, I personally tend to think that beliefs in the usefulness of early stopping (in the context of neural network training) may well be overrated. Even if it may improve generalization for some nonlinear models, you would most probably achieve the same effect more reliably using other regularization techniques, such as dropout or a simple $\ell_2$ penalty.

Note, though, that there is a difference between early stopping in the context of neural networks and, say, boosting models. In the latter case early stopping is actually more explicitly limiting the complexity of the final model and, I suspect, might have a much more meaningful effect. At least we can't directly carry over the experimental examples and results in this blog post to that case.

Also note that no matter whether early stopping helps or harms the generalization of the trained model, it is still a useful heuristic as to when to stop a lengthy training process automatically if we simply need results that are good enough.

Tags: Data analysis, Explanation, Machine learning, Project, Python, Statistics, Theory, Unclear
The Difficulties of Self-Identification

Posted by Konstantin 07.03.2017 No Comments
Ever since the "Prior Confusion" post I was planning to formulate one of its paragraphs as the following abstract puzzle, but somehow it took me 8 years to write it up.

According to fictional statistical studies, the following is known about a fictional chronic disease "statistite":
1. About 30% of people in the world have statistite.
2. About 35% of men in the world have it.
3. In Estonia, 20% of people have statistite.
4. Out of people younger than 20 years, just 5% have the disease.
5. A recent study of a random sample of visitors to the Central Hospital demonstrated that 40% of them suffer from statistite.
Mart, a 19-year-old Estonian male medical student, is standing in the foyer of the Central Hospital, reading these facts from an information sheet and wondering: what are his current chances of having statistite? How should he model himself: should he consider himself as primarily "an average man", "a typical Estonian", "just a young person", or "an average visitor of the hospital"? Could he combine the different aspects of his personality to make better use of the available information? How? In general, what would be the best possible probability estimate, given the data?
Tags: Bayes, Logic, Machine learning, Philosophy, Probability theory, Puzzle, Statistics, Theory
What is the Covariance Matrix?

Posted by Konstantin 23.11.2016 16 Comments
Basic linear algebra, introductory statistics and some familiarity with core machine learning concepts (such as PCA and linear models) are the prerequisites of this post. Otherwise it will probably make no sense. An abridged version of this text is also posted on Quora.

Most textbooks on statistics cover covariance right in their first chapters. It is defined as a useful "measure of dependency" between two random variables:

$\mathrm{cov}(X,Y) = E[(X - E[X])(Y - E[Y])].$

The textbook would usually provide some intuition on why it is defined as it is, prove a couple of properties, such as bilinearity, define the covariance matrix for multiple variables as ${\bf\Sigma}_{i,j} = \mathrm{cov}(X_i, X_j)$ , and stop there. Later on the covariance matrix would pop up here and there in seemingly random ways. In one place you would have to take its inverse, in another - compute the eigenvectors, or multiply a vector by it, or do something else for no apparent reason apart from "that's the solution we came up with by solving an optimization task".

In reality, though, there are some very good and quite intuitive reasons for why the covariance matrix appears in various techniques in one or another way. This post aims to show that, illustrating some curious corners of linear algebra in the process.

Meet the Normal Distribution

The best way to truly understand the covariance matrix is to forget the textbook definitions completely and depart from a different point instead. Namely, from the definition of the multivariate Gaussian distribution:

We say that the vector $\bf x$ has a normal (or Gaussian) distribution with mean $\bf \mu$ and covariance $\bf \Sigma$ if:

$\Pr({\bf x}) =|2\pi{\bf\Sigma}|^{-1/2} \exp\left(-\frac{1}{2}({\bf x} - {\bf\mu})^T{\bf\Sigma}^{-1}({\bf x} - {\bf \mu})\right).$

To simplify the math a bit, we will limit ourselves to the centered distribution (i.e. ${\bf\mu} = {\bf 0}$ ) and refrain from writing out the normalizing constant $|2\pi{\bf\Sigma}|^{-1/2}$ . Now, the definition of the (centered) multivariate Gaussian looks as follows:

$\Pr({\bf x}) \propto \exp\left(-0.5{\bf x}^T{\bf\Sigma}^{-1}{\bf x}\right).$

Much simpler, isn't it? Finally, let us define the covariance matrix as nothing else but the parameter of the Gaussian distribution. That's it. You will see where it will lead us in a moment.

Transforming the Symmetric Gaussian

Consider a symmetric Gaussian distribution, i.e. the one with ${\bf \Sigma = \bf I}$ (the identity matrix). Let us take a sample from it, which will of course be a symmetric, round cloud of points:

We know from above that the likelihood of each point in this sample is

(1) $P({\bf x}) \propto \exp(-0.5 {\bf x}^T {\bf x}).$

Now let us apply a linear transformation ${\bf A}$ to the points, i.e. let ${\bf y} ={\bf Ax}$ . Suppose that, for the sake of this example, ${\bf A}$ scales the vertical axis by 0.5 and then rotates everything by 30 degrees. We will get the following new cloud of points ${\bf y}$ :

What is the distribution of ${\bf y}$ ? Just substitute ${\bf x}={\bf A}^{-1}{\bf y}$ into (1), to get:

(2) $\begin{align*} P({\bf y}) &\propto \exp(-0.5 ({\bf A}^{-1}{\bf y})^T({\bf A}^{-1}{\bf y}))\\ &=\exp(-0.5{\bf y}^T({\bf AA}^T)^{-1}{\bf y}) \end{align*}$

This is exactly the Gaussian distribution with covariance ${\bf \Sigma} = {\bf AA}^T$ . The logic works both ways: if we have a Gaussian distribution with covariance $\bf \Sigma$ , we can regard it as a distribution which was obtained by transforming the symmetric Gaussian by some ${\bf A}$ , and we are given ${\bf AA}^T$ .
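This is easy to check numerically. The sketch below (NumPy assumed; seed and sample size are arbitrary choices) builds the example transformation, applies it to a symmetric cloud, and compares the sample covariance of the result with ${\bf AA}^T$. It also evaluates both sides of (2) for one point.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed (arbitrary assumption)
X = rng.normal(size=(100_000, 2))        # symmetric Gaussian cloud, one point per row

theta = np.deg2rad(30)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
A = rotation @ np.diag([1.0, 0.5])       # scale the vertical axis by 0.5, then rotate

Y = X @ A.T                              # y = A x for every row x
sigma = A @ A.T
sigma_hat = Y.T @ Y / len(Y)             # sample covariance of the (centered) cloud

print("A A^T:\n", np.round(sigma, 3))
print("sample covariance of y:\n", np.round(sigma_hat, 3))

# Both sides of (2) for a single point y.
y = Y[0]
x = np.linalg.solve(A, y)                # x = A^{-1} y
lhs = np.exp(-0.5 * x @ x)
rhs = np.exp(-0.5 * y @ np.linalg.solve(sigma, y))
print(lhs, rhs)
```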

More generally, if we have any data, then, when we compute its covariance to be $\bf\Sigma$ , we can say that if our data were Gaussian, then it could have been obtained from a symmetric cloud using some transformation $\bf A$ , and we just estimated the matrix ${\bf AA}^T$ , corresponding to this transformation.

Note that we do not know the actual $\bf A$ , and it is mathematically totally fair. There can be many different transformations of the symmetric Gaussian which result in the same distribution shape. For example, if $\bf A$ is just a rotation by some angle, the transformation does not affect the shape of the distribution at all. Correspondingly, ${\bf AA}^T = {\bf I}$ for all rotation matrices. When we see a unit covariance matrix we really do not know, whether it is the “originally symmetric” distribution, or a “rotated symmetric distribution”. And we should not really care - those two are identical.

There is a theorem in linear algebra, which says that any symmetric matrix $\bf \Sigma$ can be represented as:

(3) ${\bf \Sigma} = {\bf VDV}^T,$

where ${\bf V}$ is orthogonal (i.e. a rotation) and ${\bf D}$ is diagonal (i.e. a coordinate-wise scaling). If we rewrite it slightly, we will get:

(4) ${\bf \Sigma} = ({\bf VD}^{1/2})({\bf VD}^{1/2})^T = {\bf AA}^T,$

where ${\bf A} = {\bf VD}^{1/2}$ (the square root ${\bf D}^{1/2}$ is well-defined because the eigenvalues of a covariance matrix are non-negative). This, in simple words, means that any covariance matrix $\bf \Sigma$ could have been the result of transforming the data using a coordinate-wise scaling ${\bf D}^{1/2}$ followed by a rotation $\bf V$ . Just like in our example with $\bf x$ and $\bf y$ above.
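Here is a small numerical illustration of (3) and (4), assuming NumPy. The matrix `sigma` is a made-up example. Note that `np.linalg.eigh` returns an orthogonal $\bf V$, which may include a reflection in addition to a rotation; that makes no difference to the argument.

```python
import numpy as np

sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])            # hypothetical covariance matrix

eigenvalues, V = np.linalg.eigh(sigma)   # sigma = V diag(eigenvalues) V^T
D = np.diag(eigenvalues)
A = V @ np.sqrt(D)                       # A = V D^{1/2}

print(np.allclose(V @ D @ V.T, sigma))   # decomposition (3) holds
print(np.allclose(A @ A.T, sigma))       # factorization (4) holds

# A is not unique: inserting any rotation Q first gives the same A A^T.
phi = 0.7                                # arbitrary angle
Q = np.array([[np.cos(phi), -np.sin(phi)],
              [np.sin(phi),  np.cos(phi)]])
A_other = A @ Q
print(np.allclose(A_other @ A_other.T, sigma))
```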

Principal Component Analysis

Given the above intuition, PCA already becomes a very obvious technique. Suppose we are given some data. Let us assume (or “pretend”) it came from a normal distribution, and let us ask the following questions:
1. What could have been the rotation $\bf V$ and scaling ${\bf D}^{1/2}$ , which produced our data from a symmetric cloud?
2. What were the original, “symmetric-cloud” coordinates $\bf x$ before this transformation was applied?
3. Which original coordinates were scaled the most by $\bf D$ and thus contribute most to the spread of the data now? Can we only leave those and throw the rest out?
All of those questions can be answered in a straightforward manner if we just decompose $\bf \Sigma$ into $\bf V$ and $\bf D$ according to (3). But (3) is exactly the eigenvalue decomposition of $\bf\Sigma$ . I’ll leave you to think for just a bit and you’ll see how this observation lets you derive everything there is about PCA and more.
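As a hint, the three questions translate into code almost literally. This is a sketch assuming NumPy; `A_true` is a hypothetical transformation that the analysis pretends not to know. The recovered coordinates match the original symmetric cloud only up to an orthogonal transformation, which, as discussed above, we should not care about.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed (arbitrary assumption)
A_true = np.array([[2.0, 0.0],
                   [1.0, 0.5]])           # hypothetical "unknown" transformation
Y = rng.normal(size=(50_000, 2)) @ A_true.T
Y = Y - Y.mean(axis=0)                   # center the data

sigma_hat = Y.T @ Y / len(Y)

# Question 1: a rotation V and scaling D^{1/2} that could have produced the data.
eigenvalues, V = np.linalg.eigh(sigma_hat)
order = np.argsort(eigenvalues)[::-1]    # largest scaling first
eigenvalues, V = eigenvalues[order], V[:, order]

# Question 2: undo the rotation, then the scaling, to get "symmetric-cloud" coordinates.
X_recovered = (Y @ V) / np.sqrt(eigenvalues)
print(np.round(X_recovered.T @ X_recovered / len(Y), 3))   # identity matrix

# Question 3: keep only the most stretched coordinate.
first_component = Y @ V[:, 0]
explained = eigenvalues[0] / eigenvalues.sum()
print("share of variance kept:", explained)
```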

The Metric Tensor

Bear with me for just a bit more. One way to summarize the observations above is to say that we can (and should) regard ${\bf\Sigma}^{-1}$ as a metric tensor. A metric tensor is just a fancy formal name for a matrix, which summarizes the deformation of space. However, rather than claiming that it in some sense determines a particular transformation $\bf A$ (which it does not, as we saw), we shall say that it affects the way we compute angles and distances in our transformed space.

Namely, let us redefine, for any two vectors $\bf v$ and $\bf w$ , their inner product as:

(5) $\langle {\bf v}, {\bf w}\rangle_{\Sigma^{-1}} = {\bf v}^T{\bf \Sigma}^{-1}{\bf w}.$

To stay consistent we will also need to redefine the norm of any vector as

(6) $|{\bf v}|_{\Sigma^{-1}} = \sqrt{{\bf v}^T{\bf \Sigma}^{-1}{\bf v}},$

and the distance between any two vectors as

(7) $|{\bf v}-{\bf w}|_{\Sigma^{-1}} = \sqrt{({\bf v}-{\bf w})^T{\bf \Sigma}^{-1}({\bf v}-{\bf w})}.$

Those definitions now describe a kind of a “skewed world” of points. For example, a unit circle (a set of points with “skewed distance” 1 to the center) in this world might look as follows:

And here is an example of two vectors, which are considered “orthogonal”, a.k.a. “perpendicular” in this strange world:

Although it may look weird at first, note that the new inner product we defined is actually just the dot product of the “untransformed” originals of the vectors:

(8) ${\bf v}^T{\bf \Sigma}^{-1}{\bf w} = {\bf v}^T({\bf AA}^T)^{-1}{\bf w}=({\bf A}^{-1}{\bf v})^T({\bf A}^{-1}{\bf w}).$

The following illustration might shed light on what is actually happening in this $\Sigma$ -“skewed” world. Somehow “deep down inside”, the ellipse thinks of itself as a circle and the two vectors behave as if they were (2,2) and (-2,2).

Getting back to our example with the transformed points, we could now say that the point-cloud $\bf y$ is actually a perfectly round and symmetric cloud “deep down inside”, it just happens to live in a skewed space. The deformation of this space is described by the tensor ${\bf\Sigma}^{-1}$ (which is, as we know, equal to $({\bf AA}^T)^{-1}$ ). The PCA now becomes a method for analyzing the deformation of space, how cool is that?
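Identity (8) can be played with numerically. In the sketch below (NumPy assumed) the transformation is the scale-and-rotate example from earlier, and the two vectors are the images of (2,2) and (-2,2). The helper name `skewed_inner` is introduced here for illustration.

```python
import numpy as np

theta = np.deg2rad(30)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
A = rotation @ np.diag([1.0, 0.5])       # the example transformation
sigma_inv = np.linalg.inv(A @ A.T)       # the metric tensor

def skewed_inner(v, w):
    """Inner product (5): v^T Sigma^{-1} w."""
    return v @ sigma_inv @ w

# Two vectors that are (2, 2) and (-2, 2) "deep down inside".
v = A @ np.array([2.0, 2.0])
w = A @ np.array([-2.0, 2.0])

print("skewed inner product:", skewed_inner(v, w))          # zero up to rounding
print("ordinary dot product:", v @ w)                       # not zero
print("skewed norm of v:    ", np.sqrt(skewed_inner(v, v))) # sqrt(8), the norm of (2, 2)
```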

The Dual Space

We are not done yet. There’s one interesting property of “skewed” spaces worth knowing about. Namely, the elements of their dual space have a particular form. No worries, I’ll explain in a second.

Let us forget the whole skewed space story for a moment, and get back to the usual inner product ${\bf w}^T{\bf v}$ . Think of this inner product as a function $f_{\bf w}({\bf v})$ , which takes a vector $\bf v$ and maps it to a real number, the dot product of $\bf v$ and $\bf w$ . Regard the $\bf w$ here as the parameter (“weight vector”) of the function. If you have done any machine learning at all, you have certainly come across such linear functionals over and over, sometimes in disguise. Now, the set of all possible linear functionals $f_{\bf w}$ is known as the dual space to your “data space”.

Note that each linear functional is determined uniquely by the parameter vector $\bf w$ , which has the same dimensionality as $\bf v$ , so apparently the dual space is in some sense equivalent to your data space - just the interpretation is different. An element $\bf v$ of your “data space” denotes, well, a data point. An element $\bf w$ of the dual space denotes a function $f_{\bf w}$ , which projects your data points on the direction $\bf w$ (recall that if $\bf w$ is unit-length, ${\bf w}^T{\bf v}$ is exactly the length of the perpendicular projection of $\bf v$ upon the direction $\bf w$ ). So, in some sense, if $\bf v$ -s are “vectors”, $\bf w$ -s are “directions, perpendicular to these vectors”. Another way to understand the difference is to note that if, say, the elements of your data points numerically correspond to amounts in kilograms, the elements of $\bf w$ would have to correspond to “units per kilogram”. Still with me?

Let us now get back to the skewed space. If $\bf v$ are elements of a skewed Euclidean space with the metric tensor ${\bf\Sigma}^{-1}$ , is a function $f_{\bf w}({\bf v}) = {\bf w}^T{\bf v}$ an element of a dual space? Yes, it is, because, after all, it is a linear functional. However, the parameterization of this function is inconvenient, because, due to the skewed tensor, we cannot interpret it as projecting vectors upon $\bf w$ nor can we say that $\bf w$ is an “orthogonal direction” (to a separating hyperplane of a classifier, for example). Because, remember, in the skewed space it is not true that orthogonal vectors satisfy ${\bf w}^T{\bf v}=0$ . Instead, they satisfy ${\bf w}^T{\bf \Sigma}^{-1}{\bf v} = 0$ . Things would therefore look much better if we parameterized our dual space differently. Namely, by considering linear functionals of the form $f^{\Sigma^{-1}}_{\bf z}({\bf v}) = {\bf z}^T{\bf \Sigma}^{-1}{\bf v}$ . The new parameters $\bf z$ could now indeed be interpreted as an “orthogonal direction” and things overall would make more sense.

However when we work with actual machine learning models, we still prefer to have our functions in the simple form of a dot product, i.e. $f_{\bf w}$ , without any ugly $\bf\Sigma$ -s inside. What happens if we turn a “skewed space” linear functional from its natural representation into a simple inner product?

(9) $f^{\Sigma^{-1}}_{\bf z}({\bf v}) = {\bf z}^T{\bf\Sigma}^{-1}{\bf v} = ({\bf \Sigma}^{-1}{\bf z})^T{\bf v} = f_{\bf w}({\bf v}),$

where ${\bf w} = {\bf \Sigma}^{-1}{\bf z}$ . (Note that we can lose the transpose because $\bf \Sigma$ is symmetric).

What it means, in simple terms, is that when you fit linear models in a skewed space, your resulting weight vectors will always be of the form ${\bf \Sigma}^{-1}{\bf z}$ . Or, in other words, ${\bf\Sigma}^{-1}$ is a transformation, which maps from “skewed perpendiculars” to “true perpendiculars”. Let me show you what this means visually.

Consider again the two “orthogonal” vectors from the skewed world example above:

Let us interpret the blue vector as an element of the dual space. That is, it is the $\bf z$ vector in a linear functional ${\bf z}^T{\bf\Sigma}^{-1}{\bf v}$ . The red vector is an element of the “data space”, which would be mapped to 0 by this functional (because the two vectors are “orthogonal”, remember).

For example, if the blue vector was meant to be a linear classifier, it would have its separating line along the red vector, just like that:

If we now wanted to use this classifier, we could, of course, work in the “skewed space” and use the expression ${\bf z}^T{\bf\Sigma}^{-1}{\bf v}$ to evaluate the functional. However, why don’t we find the actual normal $\bf w$ to that red separating line so that we wouldn’t need to do an extra matrix multiplication every time we use the function?

It is not too hard to see that ${\bf w}={\bf\Sigma}^{-1}{\bf z}$ will give us that normal. Here it is, the black arrow:

Therefore, next time, whenever you see expressions like ${\bf w}^T{\bf\Sigma}^{-1}{\bf v}$ or $({\bf v}-{\bf w})^T{\bf\Sigma}^{-1}({\bf v}-{\bf w})$ , remember that those are simply inner products and (squared) distances in a skewed space, while ${\bf \Sigma}^{-1}{\bf z}$ is a conversion from a skewed normal to a true normal. Also remember that the “skew” was estimated by pretending that the data were normally-distributed.
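The conversion from a skewed normal to a true normal can be verified on a small example. This sketch assumes NumPy; the covariance `sigma` and the vector `z` are made up, and the Cholesky factor is used simply as one convenient choice of $\bf A$ with ${\bf AA}^T = {\bf\Sigma}$.

```python
import numpy as np

sigma = np.array([[1.0, 0.6],
                  [0.6, 0.5]])            # hypothetical covariance matrix
A = np.linalg.cholesky(sigma)            # one possible A with A A^T = sigma

z = np.array([1.0, 1.0])                 # "skewed normal" (the blue vector)

# Build a data vector v that is orthogonal to z in the skewed sense:
# take the untransformed original of z, rotate it by 90 degrees, map it back.
z0 = np.linalg.solve(A, z)               # A^{-1} z
v = A @ np.array([-z0[1], z0[0]])        # the "red vector"

w = np.linalg.solve(sigma, z)            # true normal: w = Sigma^{-1} z

print("z^T Sigma^{-1} v =", z @ np.linalg.solve(sigma, v))  # zero up to rounding
print("w^T v            =", w @ v)                          # zero up to rounding
print("z^T v            =", z @ v)                          # not zero
```

So `w` is perpendicular to the separating line in the ordinary sense, while `z` is perpendicular to it only in the skewed sense.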

Once you see it, the role of the covariance matrix in some methods like the Fisher’s discriminant or Canonical correlation analysis might become much more obvious.

The Dual Space Metric Tensor

“But wait”, you should say here. “You have been talking about expressions like ${\bf w}^T{\bf\Sigma}^{-1}{\bf v}$ all the time, while things like ${\bf w}^T{\bf\Sigma}{\bf v}$ are also quite common in practice. What about those?”

Hopefully you know enough now to suspect that ${\bf w}^T{\bf\Sigma}{\bf v}$ is again an inner product or a squared norm in some deformed space, just not the “internal data metric space”, that we considered so far. Which space is it? It turns out it is the “internal dual metric space”. That is, whilst the expression ${\bf w}^T{\bf\Sigma}^{-1}{\bf v}$ denoted the “new inner product” between the points, the expression ${\bf w}^T{\bf\Sigma}{\bf v}$ denotes the “new inner product” between the parameter vectors. Let us see why it is so.

Consider an example again. Suppose that our space transformation $\bf A$ scaled all points by 2 along the $x$ axis. The point (1,0) became (2,0), the point (3, 1) became (6, 1), etc. Think of it as changing the units of measurement - before we measured the $x$ axis in kilograms, and now we measure it in pounds. Consequently, the norm of the point (2,0) according to the new metric, $|(2,0)|_{\Sigma^{-1}}$ will be 1, because 2 pounds is still just 1 kilogram “deep down inside”.

What should happen to the parameter ("direction") vectors due to this transformation? Can we say that the parameter vector (1,0) also got scaled to (2,0) and that the norm of the parameter vector (2,0) is now therefore also 1? No! Recall that if our initial data denoted kilograms, our dual vectors must have denoted “units per kilogram”. After the transformation they will be denoting “units per pound”, correspondingly. To stay consistent we must therefore convert the parameter vector (“1 unit per kilogram”, 0) to its equivalent (“0.5 units per pound”, 0). Consequently, the norm of the parameter vector (0.5,0) in the new metric will be 1 and, by the same logic, the norm of the dual vector (2,0) in the new metric must be 4. You see, the “importance of a parameter/direction” gets scaled inversely to the “importance of data” along that parameter or direction.

More formally, if ${\bf x}'={\bf Ax}$ , then

(10) $\begin{align*} f_{\bf w}({\bf x}) &= {\bf w}^T{\bf x} = {\bf w}^T{\bf A}^{-1}{\bf x}'\\ & =(({\bf A}^{-1})^T{\bf w})^T{\bf x}'=f_{({\bf A}^{-1})^T{\bf w}}({\bf x}'). \end{align*}$

This means that the transformation $\bf A$ of the data points implies the transformation ${\bf B}:=({\bf A}^{-1})^T$ of the dual vectors. The metric tensor for the dual space must thus be:

(11) $({\bf BB}^T)^{-1}=(({\bf A}^{-1})^T{\bf A}^{-1})^{-1}={\bf AA}^T={\bf \Sigma}.$

Remember the illustration of the “unit circle” in the ${\bf \Sigma}^{-1}$ metric? This is how the unit circle looks in the corresponding $\bf\Sigma$ metric. It is rotated by the same angle, but it is stretched in the direction where it was squished before.

Intuitively, the norm (“importance”) of the dual vectors along the directions in which the data was stretched by $\bf A$ becomes proportionally larger (note that the “unit circle” is, on the contrary, “squished” along those directions).

But the “stretch” of the space deformation in any direction can be measured by the variance of the data. It is therefore not a coincidence that ${\bf w}^T{\bf \Sigma}{\bf w}$ is exactly the variance of the data along the direction $\bf w$ (assuming $|{\bf w}|=1$ ).
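The kilograms-and-pounds example and the variance claim can both be checked in a few lines. This is a sketch assuming NumPy; the helper `metric_norm`, the direction `u`, the seed and the sample size are illustrative choices.

```python
import numpy as np

A = np.diag([2.0, 1.0])                  # scale the x axis by 2 ("kilograms to pounds")
sigma = A @ A.T                          # metric tensor of the dual space
sigma_inv = np.linalg.inv(sigma)         # metric tensor of the data space

def metric_norm(vec, metric):
    """Norm of vec under the given metric tensor."""
    return float(np.sqrt(vec @ metric @ vec))

print(metric_norm(np.array([2.0, 0.0]), sigma_inv))  # data point (2,0):  1.0
print(metric_norm(np.array([0.5, 0.0]), sigma))      # dual vector (0.5,0): 1.0
print(metric_norm(np.array([2.0, 0.0]), sigma))      # dual vector (2,0):  4.0

# Equation (10): the functional keeps its value if w is mapped by B = (A^{-1})^T.
B = np.linalg.inv(A).T
x, w = np.array([3.0, 1.0]), np.array([1.0, 0.0])
print(w @ x, (B @ w) @ (A @ x))

# w^T Sigma w is the variance of the transformed data along a unit direction.
rng = np.random.default_rng(0)           # fixed seed (arbitrary assumption)
Y = rng.normal(size=(200_000, 2)) @ A.T
u = np.array([0.6, 0.8])                 # unit-length direction
projected_variance = np.var(Y @ u)
predicted_variance = u @ sigma @ u
print(projected_variance, predicted_variance)
```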

The Covariance Estimate

Once we start viewing the covariance matrix as a transformation-driven metric tensor, many things become clearer, but one thing becomes extremely puzzling: why is the inverse covariance of the data a good estimate for that metric tensor? After all, it is not obvious that ${\bf X}^T{\bf X}/n$ (where $\bf X$ is the data matrix) must be related to the $\bf\Sigma$ in the distribution equation $\exp(-0.5{\bf x}^T{\bf\Sigma}^{-1}{\bf x})$ .

Here is one possible way to see the connection. Firstly, let us take it for granted that if $\bf X$ is sampled from a symmetric Gaussian, then ${\bf X}^T{\bf X}/n$ is, on average, a unit matrix. This has nothing to do with transformations, but is just a consequence of the pairwise independence and unit variance of the variables in the symmetric Gaussian.

Now, consider the transformed data, ${\bf Y}={\bf XA}^T$ (vectors in the data matrix are row-wise, hence the multiplication on the right with a transpose). What is the covariance estimate of $\bf Y$ ?

(12) ${\bf Y}^T{\bf Y}/n=({\bf XA}^T)^T{\bf XA}^T/n={\bf A}({\bf X}^T{\bf X}){\bf A}^T/n\approx {\bf AA}^T,$

the familiar tensor.

This is a place where one could see that a covariance matrix may make sense outside the context of a Gaussian distribution, after all. Indeed, if you assume that your data was generated from any distribution $P$ with uncorrelated variables of unit variance and then transformed using some matrix $\bf A$ , the expression ${\bf Y}^T{\bf Y}/n$ (with $\bf Y$ denoting the transformed data, as in (12)) will still be an estimate of ${\bf AA}^T$ , the metric tensor for the corresponding (dual) space deformation.
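For instance, here is the same computation with a decidedly non-Gaussian starting cloud. This is a sketch assuming NumPy; the uniform distribution, the matrix `A`, the seed and the sample size are arbitrary illustrative choices. A uniform variable on $[-\sqrt{3}, \sqrt{3}]$ has zero mean and unit variance, and the two coordinates are drawn independently.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed (arbitrary assumption)
n = 200_000
# Independent uniform coordinates with zero mean and unit variance.
X = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, 2))

A = np.array([[1.5, 0.0],
              [0.8, 0.4]])                # hypothetical transformation
Y = X @ A.T                              # transformed data, one point per row

estimate = Y.T @ Y / n                   # covariance estimate of the transformed data
print(np.round(estimate, 3))
print(A @ A.T)
```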

However, note that out of all possible initial distributions $P$ , the normal distribution is exactly the one with the maximum entropy, i.e. the “most generic”. Thus, if you base your analysis on the mean and the covariance matrix (which is what you do with PCA, for example), you could just as well assume your data to be normally distributed. In fact, a good rule of thumb is to remember that whenever you even mention the words "covariance matrix", you are implicitly fitting a Gaussian distribution to your data.
Tags: Algebra, Explanation, Machine learning, Mathematics, Probability theory, Statistics, Visualization
Bayesian Statistics in Layman's Terms

Posted by Konstantin 09.01.2016 No Comments
This is a (slightly updated) repost of my quora answer to the corresponding question.

There are many ways in which smart people tend to explain Bayesian statistics and contrast it with a "non-Bayesian" one. One usually highlights that the primary concept of a Bayesian approach is the desire to model everything as a probability distribution. Once this fact is clear, many smart people would proceed to claim that this is, in fact, what fundamentally sets Bayesian statistics aside from the "classical" one. However, I feel that this kind of explanation is somewhat incomplete. It is not like classical statisticians do not use complete probability distributions. The difference is in general somewhat more subtle and philosophical.

Consider the question "what is your height?". For a classical statistician there exists some abstract "true answer", say "180cm", which is a fixed number - your one and only height. The problem is, of course, you do not know this number because every measurement is slightly different, so the classical statistician will add that "there is a normally-distributed measurement error". In the world of a pure Bayesian there are almost no "fixed numbers" - everything is a probability distribution, and so is your height! That is, a Bayesian should say that "your height is a Normal distribution centered around 180cm".

Note that from the mathematical perspective there is no difference between the two representations - in both cases the number 180cm is mentioned, and the normal distribution. However, from philosophical, syntactical, methodological and "mental" perspectives this tends to have serious implications, and there has historically been a kind of ongoing intellectual feud between the statisticians who lean more towards the first or the second approach (it is somewhat reminiscent of how there is a divide among physicists with regard to their support of the Copenhagen interpretation of quantum mechanics).

One of the implications of denying the fact that things in the world are mostly fixed (and are all pure distributions instead) is that you may not use many of the common sense inference methods directly. What is my height if I stand on a chair? "Well, it is your height plus the height of a chair", a classical statistician would say. He can keep in mind the measurement errors, if necessary, but those could be dealt with later. In the Bayesian world heights are not numbers, so the procedure of adding heights implies convolving two distributions to get the resulting distribution. If both distributions are Gaussian, the result will match that of the "common sense", but note that now the common sense somehow became "just one special case". Moreover, a Bayesian might even keep the possibility that "your height and the height of the chair are dependent" in the back of his mind, just in case. Because when you speak about two numbers in the Bayesian world, you must immediately start thinking about their joint distribution.
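
To make the "adding heights means convolving distributions" point concrete, here is a minimal sketch. The numbers are hypothetical and not from the discussion above: both quantities are assumed to be independent Gaussians, with made-up means and standard deviations in centimetres. Under those assumptions the means add, the variances add, and a quick simulation should agree with the closed form.

```python
import math
import random
import statistics

# Hypothetical numbers (assumptions for illustration only): both quantities
# are independent Gaussians, standard deviations given in centimetres.
HEIGHT_MEAN, HEIGHT_SD = 180.0, 1.0
CHAIR_MEAN, CHAIR_SD = 45.0, 0.5

# Closed form for the sum of two independent Gaussians:
# the means add up and the variances add up.
total_mean = HEIGHT_MEAN + CHAIR_MEAN
total_sd = math.sqrt(HEIGHT_SD ** 2 + CHAIR_SD ** 2)

# Monte Carlo check: sample both quantities and add the samples.
rng = random.Random(0)
samples = [
    rng.gauss(HEIGHT_MEAN, HEIGHT_SD) + rng.gauss(CHAIR_MEAN, CHAIR_SD)
    for _ in range(100_000)
]
sample_mean = statistics.fmean(samples)
sample_sd = statistics.pstdev(samples)

print(f"closed form: mean={total_mean:.2f}, sd={total_sd:.3f}")
print(f"simulated:   mean={sample_mean:.2f}, sd={sample_sd:.3f}")
```

If the two quantities were dependent, the variances would no longer simply add, which is exactly the case the Bayesian keeps "in the back of his mind".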

On the other hand, modeling everything in probabilities lets you use probability theory inference methods (Bayes rule, convolutions, marginalizations, etc) everywhere, without the need to differentiate between "fixed numbers" and "random measurement errors" and this adds peace of mind as well as tends to make your explanations clearer. A Bayesian confidence interval, for example, is a "fixed interval such that 95% of height measurements fall into it". A classical confidence interval, on the other hand, is "a random interval such that the true height may fall into it with 95% probability". Again, mathematically and numerically those may often be the same, but think how different the two explanations are.

Bayesian "thinking" tends to be more flexible for complex models. Many classical statistics models would stick to fixed parameters, point or "interval" inferences, and try to "hide" the complexity of probability distributions as much as possible. As a result, reasoning about a system with many highly interconnected concepts becomes flawed. Consider a sequence of three questions:
- What is the height of this truck?
- Will it fit under this 3m bridge?
- Do we need to pick another route?
In the "classical" mindset you would tend to give fixed answers to the questions.
- "Height of the truck is 297".
- "Yes, 297<300, hence it will fit".
- "No, we do not need".
Sometimes you may be more careful and work with confidence intervals, but it still feels unwieldy:
- "The confidence interval on the height of the truck is 290..310"
- ".. aahm, it might not fit..."
- "let's pick another route, just in case"
Note, if a followup question appears that depends on the previous inferences (e.g. "do we need to remodel the truck") answering it becomes even harder because the true uncertainty is "lost" in the intermediate steps. Such problems are never present if you are disciplined as a Bayesian. Note the answers:
- "The height of the truck is a normal distribution N(297, 10)"
- "It will fit under the bridge with probability 60%"
- "We need another route with probability 40%"
At any point the information about the uncertainty is preserved in the distributions and you are free to combine it further, or apply a decision-theoretic utility model. This makes Bayesian networks possible, for example.
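
The Bayesian answers above can be checked with a few lines of arithmetic. This is only a sketch, and it rests on one assumption not spelled out in the text: the second parameter of N(297, 10) is read as a standard deviation in centimetres. With that reading the probability of fitting under the 300cm bridge comes out at roughly 62%, in line with the rounded "60%" quoted above.

```python
import math


def normal_cdf(x, mean, sd):
    """P(X <= x) for X ~ Normal(mean, sd), with sd a standard deviation."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


# Assumption: "N(297, 10)" means mean 297cm and standard deviation 10cm.
TRUCK_MEAN, TRUCK_SD = 297.0, 10.0
BRIDGE_HEIGHT = 300.0

p_fit = normal_cdf(BRIDGE_HEIGHT, TRUCK_MEAN, TRUCK_SD)
p_reroute = 1.0 - p_fit

print(f"P(truck fits)         = {p_fit:.3f}")
print(f"P(need another route) = {p_reroute:.3f}")
```

The uncertainty is carried through both steps instead of being collapsed into a yes/no answer after the first question.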

It is interesting to see how this largely philosophical preference leads to two completely different (albeit complementary) sets of techniques. Indeed, if you are a true classical statistician, your work revolves around parameterized probability distributions. You write them down like $P_\alpha(x)$ , where $x$ is the "truly random" value from some probability space, and $\alpha$ is the "fixed but unknown" parameter. Your whole "school of thought" is now focused on clever ad-hoc techniques for computing estimates of this fixed parameter from the provided distribution.

For a pure Bayesian, however, there is no "fixed" $\alpha$ that has to be treated somehow separately. Instead, $\alpha$ is also a part of some probability space, and instead of writing $P_\alpha(x)$ he would safely write $P(x| \alpha)$ , $P(\alpha | x)$ , or $P(x, \alpha)$ . As a result, the probability distributions he works with are not parameterized any more, and all of the clever techniques that the classical statisticians have invented over the centuries for estimating parameters become seemingly useless. At this point a classical statistician puts his hands down and goes home, as there is nothing for him to do - there are no "unknowns". The Bayesian is, however, left to struggle with mathematically trivial, yet computationally incredibly heavy methods for extracting essentially the same values that the classical statistician could have obtained using his "parameter estimation" approaches. That's why the Bayesian "school of thought" is mostly focused on computationally-efficient methods for marginalization and sampling.

In reality, of course, a Bayesian would quite often give up and "cheat", at least partially parameterizing his models and making use of the classical estimation methods, while a "classical" statistician might happen to write $P(x|\alpha)$ and apply the Bayes rule here and there, whenever it seems appropriate. A number of computations derived from the two theoretical backgrounds end up exactly the same.

Thus, in practice, labeling things as "Bayesian" or "non-Bayesian" is still largely a philosophical choice. For example, there are methods in machine learning, ensemble learners, that are for some reason never labeled/marketed as being "Bayesian", nor were they probably invented by someone "Bayesian", although at their core those would be among the best examples of where a Bayesian approach is different from a classical one. Those are also quite often among the best-performing models, by the way.
Tags: Bayes, Machine learning, Probability theory, Statistics, Theory
When the Best is not the Best

Posted by Konstantin 04.01.2016 7 Comments

Collecting large amounts of data and then using it to "teach" computers to automatically recognize patterns is pretty much standard practice nowadays. It seems that, given enough data and the right methods, computers can get quite precise at detecting or predicting nearly anything, whether it is face recognition, fraud detection or movie recommendations.

Whenever a new classification system is created, it is taken for granted that the system should be as precise as possible. Of course, classifiers that never make mistakes are rare, but if it is possible, we should strive to have them make as few mistakes as possible, right? Here is a fun example, where things are not as obvious.

Consider a bank, which, as is normal for a bank, makes money by giving loans to its customers. Of course, there is always a risk that a customer will default (i.e. not repay the loan). To account for that, the bank has a risk scoring system which, for a given loan application, assesses the probability that the corresponding customer may default. This probability is later used to compute the interest rate offered for the customer. To simplify a bit, the issued interest on a loan might be computed as the sum of customer's predicted default risk probability and a fixed profit margin. For example, if a customer is expected to default with probability 10% and the bank wants 5% profit on its loans on average, the loan might be issued at slightly above 15% interest. This would cover both the expected losses due to non-repayments as well as the profit margin.

Now, suppose the bank managed to develop a perfect scoring algorithm. That is, each application gets a rating of either having 0% or 100% risk. Suppose as well that within a month the bank processes 1000 applications, half of which are predicted to be perfectly good, and half - perfectly bad. This means that 500 loans get issued with a 5% interest rate, while 500 do not get issued at all.

Think what would happen if the system did not do such a great job and confused 50 of the bad applications with the good ones. In this case 450 applications would be classified as "100%" risk, while 550 would be assigned a risk score of "9.1%" (we still require the system to provide valid risk probability estimates). The bank would then issue a total of 550 loans at 15%. Of course, 50 of those would not get repaid, yet this loss would be covered by the increased interest paid by the honest borrowers. The financial returns are thus exactly the same as with the perfect classifier. However, the bank now has more clients. More applications were signed, and more contract fees were received.
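
The arithmetic behind this comparison can be sketched in a few lines. The sketch makes simplifying assumptions that are mine rather than part of the example: every loan has a principal of one unit, runs for a single period, and a defaulted loan loses the entire principal. The 15% rate is the rounded figure used above.

```python
def profit(loans_issued, defaults, interest):
    """Net profit in units of principal.

    Assumptions (for illustration only): unit principal per loan, one
    repayment period, and a defaulted loan returns nothing.
    """
    repaid = (loans_issued - defaults) * (1.0 + interest)
    return repaid - loans_issued


# Perfect classifier: 500 loans at 5%, nobody defaults.
perfect = profit(loans_issued=500, defaults=0, interest=0.05)

# Imperfect classifier: 550 loans at 15%, 50 of them default.
imperfect = profit(loans_issued=550, defaults=50, interest=0.15)

# Risk score assigned to the mixed group of 550 applications.
risk_score = 50 / 550

print(f"risk score of the mixed group: {risk_score:.1%}")
print(f"profit, perfect classifier:    {perfect:.2f}")
print(f"profit, imperfect classifier:  {imperfect:.2f}")
```

Under these assumptions both scenarios yield the same profit of 25 units, while the second one serves 50 more clients.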

True, the clients might be a bit less happy for getting a higher interest rate, but assuming they were ready to pay it anyway, the bank does not care. In fact, the bank would be more than happy to segment its customers by offering higher interest rates to low-risk customers anyway. It cannot do it openly, though. The established practices usually constrain banks to make use of "reasonable" scorecards and offer better interest rates to low-risk customers.

Hence, at least in this particular example, a "worse" classifier is in fact better for business. Perfect precision is not really the ultimately desired feature. Instead, the system is much more useful when it provides a relevant and "smooth" distribution of predicted risk scores, making sure the scores themselves are decently precise estimates for the probability of a default.

Tags: Data analysis, Economics, Machine learning, Paradox, Probability theory, Project, Statistics
Overwhelming Statistical Evidence

Posted by Konstantin 21.05.2015 No Comments

Here's a curious quote from Alan Turing's famous paper from 1950:

Makes you appreciate how seriously one person's wishful thinking, coupled with dedication and publicity skills, may sometimes affect the scientific world.

Tags: Fun, History, Science, Statistics, Superstition
My Favourite Statistical Method

Posted by Konstantin 05.04.2015 4 Comments

When it comes to data analysis, there are hundreds of exciting approaches: simple summary statistics and hypothesis tests, various clustering methods, linear and nonlinear regression or classification techniques, neural networks of various types and depths, decision rules and frequent itemsets, feature extractors and dimension reductors, ensemble methods, bayesian approaches and graphical models, logic-based approaches and fuzzy stuff, ant colonies, genetic algorithms and other optimization methods, monte-carlo algorithms, sampling and density estimation, logic-based and graph methods. Don't even get me started on the numerous visualization techniques.

This sheer number of options is, however, both a blessing and a curse at the same time. In many practical situations just having those methods at your disposal may pose more problems than solutions. First you need to pick one of the approaches that might possibly fit your purpose. Then you will try to adapt it appropriately, spend several iterations torturing the data only to obtain very dubious first results, come to the conclusion that most probably you are doing something wrong, reconvince yourself that you need to try harder in that direction, spend some more iterations testing various parameter settings. Nothing works as you want it to, so you start everything from scratch with another method to find yourself obtaining new, even more dubious results, torturing the data even further, getting tired of that and finally settling on something "intermediately decent", which "probably makes sense", although you are not so sure any more and feel frustrated.

I guess the life of a statistician was probably way simpler back in the days when you could run a couple of t-tests, or an F-test from a linear regression and call it a day. In fact, it seems that many experimental (e.g. wetlab) scientists still live in that kind of world, when it comes to analyzing their experimental results. The world of t-tests is cozy and safe. They don't get you frustrated. Unfortunately, t-tests can feel ad-hockish, because they force you to believe that something "is normally distributed". Also, in practice, they are mainly used to confirm the obvious rather than discover something new from the data. A simple scatterplot will most often be better than a t-test as an analysis method. Hence, I am not a big fan of t-tests. However, I do have my own favourite statistical method, which always feels cozy and safe, and never gets me frustrated. I tend to apply it whenever I see a chance. It is the Fisher exact test in the particular context of feature selection.

My appreciation of it stems from my background in bioinformatics and some experience with motif detection in particular. Suppose you have measured the DNA sequences for a bunch of genes. What can you do to learn something new about the sequence structure from that data? One of your best bets is to first group your sequences according to some known criteria. Suppose you know from previous experiments that some of the genes are cancer-related whereas others are not. As soon as you have specified those groups, you can start making observations like the following: "It seems that 10 out of my 20 cancer-related genes have the subsequence GATGAG in their DNA code. The same sequence is present in only 5 out of 100 non-cancer-related ones. How probable would it be to obtain similar counts of GATGAG, if the two groups were picked randomly?" If the probability of getting those counts at random is very low, then obviously there is something fishy about GATGAG and cancer - perhaps they are related. To compute this probability you will need to use the hypergeometric distribution, and the resulting test (i.e. the question "how probable is this situation in a random split?") is known as Fisher's exact test.
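
For the GATGAG example the computation is short enough to write out by hand. The following is a minimal sketch using only the standard library. The function names are mine, and I assume the one-sided version of the test, i.e. the probability of seeing at least 10 motif-carrying genes among the 20 cancer-related ones when 15 of the 120 genes carry the motif overall.

```python
from math import comb


def hypergeom_pmf(k, population, marked, draws):
    """P(exactly k marked items among `draws` drawn without replacement)."""
    return (comb(marked, k) * comb(population - marked, draws - k)
            / comb(population, draws))


def fisher_one_sided(hits_in_group, group_size, hits_total, population):
    """P(at least `hits_in_group` hits in the group under a random split)."""
    upper = min(group_size, hits_total)
    return sum(
        hypergeom_pmf(k, population, hits_total, group_size)
        for k in range(hits_in_group, upper + 1)
    )


# Counts from the example: 10 of 20 cancer-related genes and 5 of 100
# other genes contain GATGAG, so 15 of 120 genes contain it in total.
p_value = fisher_one_sided(hits_in_group=10, group_size=20,
                           hits_total=15, population=120)
print(f"one-sided p-value: {p_value:.2e}")
```

The resulting probability is tiny, which is what makes GATGAG look "fishy". In a real motif search this p-value would still need a multiple testing correction, since many candidate subsequences are tested at once.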

This simple logic (with a small addition of a multiple testing correction on top) has worked wonders for finding actually important short sequences on the DNA. Of course it is not limited to sequence search. One of our research group's most popular web tools uses the same approach to discover functional annotations, that are "significantly overrepresented" in a given group of genes. The same approach can be used to construct decision trees, and in pretty much any other "supervised learning" situation, where you have groups of objects and want to find binary features of those objects, associated with the groups.

Although in general the Fisher test is just one particular measure of association, it is, as I noted above, rather "cozy and comfortable". It does not force me to make any weird assumptions, there is no "ad-hoc" aspect to it, it is simple to compute and, most importantly, in my experience it nearly always produces "relevant" results.

Words overrepresented in the speeches of Greek MPs

A week ago Ilya, Alex and I happened to take part in a small data analysis hackathon, dedicated to the analysis of speech transcripts from the European Parliament. Somewhat analogously to DNA sequences, speeches can be grouped in various ways: you can group them by the speaker who gave them, by country, gender or political party of that speaker, by the month or year when the speech was given or by any combination of such groupings. The obvious "features" of a speech are words, which can be either present or not present in it. Once you view the problem this way the task of finding group-specific words becomes self-evident and the Fisher test is the natural solution to it. We implemented this idea and extracted "country-specific" and "time-specific" words from the speeches (other options were left out due to time constraints). As is usually the case with my favourite method, the obtained results look relevant, informative and, when shown in the form of a word cloud, fun. Check them out.

The complete source code of the analysis scripts and the visualization application is available on Github.

Tags: Bioinformatics, Data analysis, Data mining, Programming, Project, Statistics, Tool, Visualization
Frequentists vs Bayesians (xkcd strip 9.11.2012)

Posted by Konstantin 12.11.2012 No Comments

This relates nicely to several previous posts here.

Frequentists vs Bayesians

Copyright © xkcd.

Tags: Bayes, Comic, Fun, Probability theory, Statistics
Unlucky Numbers

Posted by Konstantin 30.05.2012 No Comments

Sportloto lottery ticket

Consider the following hypothetical lottery scheme. A player pays a dollar for a lottery ticket. On the ticket he has to mark a number between 1 and 100. When a sufficient number of tickets has been sold, the single "lucky number" is drawn randomly using a lotto machine, and half of the proceeds from ticket sales are shared equally among all tickets that bet on that number. The other half goes to charity.

Now, given the whole "charity" deal, the amount of money going into the lottery is greater than the amount of money paid back, hence the game is obviously disadvantageous for the players. The expected returns of an average ticket are just $0.5, hence "the house always wins", and "lottery is a tax on people who do not understand probability theory", as they say. Right? It turns out things are not that simple.

Suppose there are 100 000 people in the country who play this lottery, each one buying a single ticket. Let us imagine that, for some reason, everyone who plays the lottery is extremely superstitious and will never bet on the number 13 because it is universally despised as unlucky. Knowing that, let us now go and buy a single ticket, betting on 13. Behold: we have just paid one dollar for a 1% chance to win 50 000 dollars! Indeed, there is a 1% chance the lottery machine will draw 13 as the winning number and if this happens we will be the only candidate for receiving the whole winning fund - $50 000 in our case. Consequently, the expected returns for our ticket are $50 000 x 0.01 = $500, and the bet is well worth its price.

In general, it is easy to show that betting on any number which is sufficiently unpopular, namely any number which fewer than 500 of the 100 000 participants decide to bet on, results in positive expected returns (note that on average we expect about 1000 people to bet on a "random" number).

To highlight the concept a bit more, let us consider an even better hypothetical possibility. Let us say that all of the 100 000 lottery players decide to bet on their birth dates. This means that their bets would cover only the numbers between 1 and 31. The smart idea then would be to buy 69 tickets, betting on each of the remaining numbers (32..100). Such a bet costs $69 and wins the sum of $50 034 with probability 69%. The expected returns per each dollar invested are still around $500, but in addition you win with astonishing certainty.
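
The numbers in the last three paragraphs can be reproduced with a short sketch. The function name is mine, and for the single-ticket case I follow the simplification used above of ignoring our own dollar in the prize fund.

```python
def expected_winnings(pot, rivals_on_number, n_numbers=100):
    """Expected payout of one ticket on a number that `rivals_on_number`
    other tickets also bet on, with `pot` shared equally among winners."""
    win_probability = 1 / n_numbers
    share = pot / (rivals_on_number + 1)
    return win_probability * share


POT = 50_000  # half of the 100 000 dollars paid by the other players

alone_on_13 = expected_winnings(POT, rivals_on_number=0)
break_even = expected_winnings(POT, rivals_on_number=499)
typical = expected_winnings(POT, rivals_on_number=1000)

# Birthday scenario: buy one ticket for each of the numbers 32..100.
tickets = 69
birthday_pot = (100_000 + tickets) / 2
birthday_win_probability = tickets / 100
return_per_dollar = birthday_win_probability * birthday_pot / tickets

print(f"alone on 13:        ${alone_on_13:.2f} per $1 ticket")
print(f"499 rivals:         ${break_even:.2f} per $1 ticket")
print(f"1000 rivals:        ${typical:.2f} per $1 ticket")
print(f"birthday scenario:  ${return_per_dollar:.2f} per $1 invested")
```

With 499 rivals the ticket exactly breaks even, so anything less popular than that has positive expected returns, while a "typical" number with about 1000 rivals returns roughly fifty cents per dollar.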

Does this have anything to do with reality? It turns out it does. This article from 1980 (in Russian) studies the popular Soviet lottery "Sportloto", in which the players had to select 5 or 6 numbers from a grid of 36 or 45 numbers respectively (see illustration above). The drawing was performed, and the players who managed to guess enough numbers would share a portion of the lottery fund. Note that this is just a more elaborate variation of the "pick one out of 100" lottery above. And of course, psychological aspects play a large role in biasing players' number selections. People tend to prefer numbers towards the bottom of the grid to those on the first lines. People prefer smaller numbers to larger ones. And, most importantly, people tend to avoid picking regular patterns (e.g. all numbers in a sequence, or those forming a nice rectangle), as such combinations intuitively seem to be "too improbable to happen".

This results in a situation where betting on a "psychologically improbable" pattern of numbers may turn out to be profitable in terms of expected returns. The authors of the mentioned article actually used historical lottery drawing data to estimate the returns they would have had if they had constantly participated in Sportloto using such patterns, and reported a ratio of winnings to spending of around 1.15 to 1.39.

This is not meant to encourage you to gamble (moreover, most lotteries do not work this way), but if you have to, do not underestimate the luckiness of the unlucky numbers.

Tags: Gambling, Paradox, Psychology, Statistics
Making Your Results Look Convincing

Posted by Konstantin 10.01.2012 No Comments
It is not uncommon for a long-running scientific study or an experiment to produce results which are, at best, uninteresting. The measured effect may be too weak to be reported on convincingly given the data at hand. Nonetheless, resources have been put into it, many man-months have been spent, and thus a paper must be published. The researcher must therefore present his results in a way convincing enough for the reviewers to be lulled into acceptance.

The following are the three best methods for doing that (and I have seen those being used in practice). Next time you read someone's paper (or write your own), keep them in mind.
1. Use an irrelevant (and preferably strict) hypothesis test.
  Suppose you want to show that a set of measurements in one group differs from the set of measurements in the other group. The typical approach here is the t-test or the Wilcoxon test, both of which detect whether elements in one group are on average greater than those in the other group. If, however, you find that the tests fail on your data (i.e., there is no easily detectable difference in measurement magnitudes), why don't you try something like the Kolmogorov-Smirnov test, which checks whether the distributions of the two groups are different? It is a much stricter condition. In fact, the tiniest outlier in your data will easily get you a low p-value and thus something to stick in the face of a reviewer. If even the KS test did not work, try testing something even less relevant, such as whether your data is normally distributed. Most probably it is not, so here's your low p-value! Remember - the smaller your p-values, the better your paper!
2. Avoid significance testing completely
  If you can't get a low p-value anywhere, do not worry. Significance testing is going somewhat out of fashion nowadays anyway, so it is possible to avoid it and still sound convincing. If one group of measurements has 40% of successes and the other has 42% - why not simply present those two numbers as obvious proof that the second group is better. Using ratios is also a smart idea. Say, some baseline algorithm has a 1% chance of success. You now test your algorithm and discover that out of 10 trials it had 1 success. That means your algorithm has just demonstrated a 10% success rate, which is ten times better than the baseline! Finally, ROC curves can often be used to hide the fact that your data is too tiny to make any conclusions. No one really ever checks for significance of those.
3. Sweep multiple testing under the carpet
  If you are analyzing a dataset with 1000 attributes and 50 datapoints, it is not really very surprising if one of those attributes will seem "interesting" (e.g. highly correlated with the target effect) purely by chance - there is often nothing significant in finding one out of a thousand. However, if you only mention that one (or perhaps 10-50) of the original attributes, your results will magically become significant and no reviewer will be able to catch your cheating.
There are certainly more, and I'll keep the post updated if I come up with a worthy addition. If you have something to add, please do comment.
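
For the reader on the reviewing side, two of the tricks above can be exposed with a few lines of arithmetic. This is only a sketch under assumptions that are mine: the trials in method 2 are independent, and in method 3 each attribute is tested independently at a significance level of 0.05 (the list itself does not fix a level).

```python
# Method 2: a baseline with a 1% success rate, 10 trials, 1 success.
# How likely is at least one success if the new algorithm is no better
# than the baseline? (Assumes independent trials.)
baseline_rate = 0.01
trials = 10
p_at_least_one_success = 1 - (1 - baseline_rate) ** trials

# Method 3: 1000 attributes, each tested at level alpha.
# (alpha = 0.05 and independence between tests are assumptions.)
n_attributes = 1000
alpha = 0.05
expected_false_positives = n_attributes * alpha
p_any_false_positive = 1 - (1 - alpha) ** n_attributes

print(f"P(>=1 success by luck in 10 trials): {p_at_least_one_success:.3f}")
print(f"expected 'interesting' attributes:   {expected_false_positives:.0f}")
print(f"P(at least one looks significant):   {p_any_false_positive:.6f}")
```

A "ten times better" result that the baseline itself would produce by luck almost one time in ten is no evidence at all, and with a thousand attributes about fifty spurious "discoveries" are expected before any real effect is involved.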
Tags: Data analysis, Fun, How to, Statistics