Four Years Remaining

The Mystery of Early Stopping

Posted by Konstantin 06.12.2017 7 Comments

Early stopping is a technique that is very often used when training neural networks, as well as with some other iterative machine learning algorithms. The idea is quite intuitive - let us measure the performance of our model on a separate validation dataset during the training iterations. We may then observe that, despite constant score improvements on the training data, the model's performance on the validation dataset would only improve during the first stage of training, reach an optimum at some point and then turn to getting worse with further iterations.

The early stopping principle

It thus seems reasonable to stop training at the point when the minimal validation error is achieved. Training the model any further only leads to overfitting. Right? The reasoning sounds solid and, indeed, early stopping is often claimed to improve generalization in practice. Most people seem to take the benefit of the technique for granted. In this post I would like to introduce some skepticism into this view or at least illustrate that things are not necessarily as obvious as they may seem from the diagram with the two lines above.

How does Early Stopping Work?

To get a better feeling of what early stopping actually does, let us examine its application to a very simple "machine learning model" - the estimation of the mean. Namely, suppose we are given a sample of 50 points $\mathbf{x}_i$ from a normal distribution with unit covariance and we need to estimate the mean $\mathbf{w}$ of this distribution.

Sample

The maximum likelihood estimate of $\mathbf{w}$ can be found as the point which has the smallest sum of squared distances to all the points in the sample. In other words, "model fitting" boils down to finding the minimum of the following objective function:

$f_\mathrm{train}(\mathrm{w}) := \sum_{i=1}^{50} \Vert \mathbf{x}_i - \mathbf{w}\Vert^2$

As our estimate is based on a finite sample, it, of course, won't necessarily be exactly equal to the true mean of the distribution, which I chose in this particular example to be exactly (0,0):

Sample mean as a minimum of the objective function

The circles in the illustration above are the contours of the objective function, which, as you might guess, is a paraboloid bowl. The red dot marks its bottom and is thus the solution to our optimization problem, i.e. the estimate of the mean we are looking for. We may find this solution in various ways. For example, a natural closed-form analytical solution is simply the mean of the training set. For our purposes, however, we will be using the gradient descent iterative optimization algorithm. It is also quite straightforward: start with any point (we'll pick (-0.5, 0) for concreteness' sake) and descend in small steps downwards until we reach the bottom of the bowl:

Gradient descent

Let us now introduce early stopping into the fitting process. We will split our 50 points randomly into two separate sets: 40 points will be used to fit the model and 10 will form the early stopping validation set. Thus, technically, we now have two different objective functions to deal with:

$f_\mathrm{fit}(\mathrm{w}) := \sum_{i=1}^{40} \Vert \mathbf{x}_i - \mathbf{w}\Vert^2$

and

$f_\mathrm{stop}(\mathrm{w}) := \sum_{i=41}^{50} \Vert \mathbf{x}_i - \mathbf{w}\Vert^2.$

Each of those defines its own "paraboloid bowl", both slightly different from the original one (because those are different subsets of data):

Fitting and early stopping objectives

As our algorithm descends towards the red point, we will be tracking the value of $f_\mathrm{stop}$ at each step along the way:

Gradient descent with validation

With a bit of imagination you should see on the image above, how the validation error decreases as the yellow trajectory approaches the purple dot and then starts to increase after some point midway. The spot where the validation error achieves the minimum (and thus the result of the early stopping algorithm) is shown by the green dot on the figure below:

Early stopping

In a sense, the validation function now acts as a kind of a "guardian", preventing the optimization from converging towards the bottom of our main objective. The algorithm is forced to settle on a model, which is neither an optimum of $f_\mathrm{fit}$ nor of $f_\mathrm{stop}$ . Moreover, both $f_\mathrm{fit}$ and $f_\mathrm{stop}$ use less data than $f_\mathrm{train}$ , and are thus inherently a worse representation of the problem altogether.

So, by applying early stopping we effectively reduced our training set size, used an even less reliable dataset to abort training, and settled on a solution which is not an optimum of anything at all. Sounds rather stupid, doesn't it?

Indeed, observe the distribution of the estimates found with (blue) and without (red) early stopping in repeated experiments (each time with a new random dataset):

Solutions found with and without early stopping

As we see, early stopping greatly increases the variance of the estimate and adds a small bias towards our optimization starting point.

Finally, let us see how the quality of the fit depends on the size of the validation set:

Fit quality vs validation set size

Here the y axis shows the squared distance of the estimated point to the true value (0,0), smaller is better (the dashed line is the expected distance of a randomly picked point from the data). The x axis shows all possible sizes of the validation set. We see that using no early stopping at all (x=0) results in the best expected fit. If we do decide to use early stopping, then for best results we should split the data approximately equally into training and validation sets. Interestingly, there do not seem to be much difference in whether we pick 30%, 50% or 70% of data for the validation set - the validation set seems to play just as much role in the final estimate as the training data.

Early Stopping with Non-convex Objectives

The experiment above seems to demonstrate that early stopping should be almost certainly useless (if not harmful) for fitting simple convex models. However, it is never used with such models in practice. Instead, it is most often applied to the training of multilayer neural networks. Could it be the case that the method somehow becomes useful when the objective is highly non-convex? Let us run a small experiment, measuring the benefits of early stopping for fitting a convolutional neural-network on the MNIST dataset. For simplicity, I took the standard example from the Keras codebase, and modified it slightly. Here is the result we get when training the the most basic model:

MNIST - Basic

The y axis depicts log-loss on the 10k MNIST test set, the x axis shows the proportion of the 60k MNIST training set set aside for early stopping. Ignoring small random measurement noise, we may observe that using early stopping with about 10% of the training data does seem to convey a benefit. Thus, contrary to our previous primitive example, when the objective is complex, early stopping does work as a regularization method. Why and how does it work here? Here's one intuition I find believable (there are alternative possible explanations and measurements, none of which I find too convincing or clear, though): stopping the training early prevents the algorithm from walking too far away from the initial parameter values. This limits the overall space of models and is vaguely analogous to suppressing the norm of the parameter vector. In other words, early stopping resembles an ad-hoc version of $\ell_p$ regularization.

Indeed, observe how the use of early stopping affects the results of fitting the same model with a small $\ell_2$ -penalty added to the objective:

MNIST - L2

All of the benefits of early stopping are gone now, and the baseline (non-early-stopped, $\ell_2$ -regularized) model is actually better overall than it was before. Let us now try an even more heavily regularized model by adding dropout (instead of the $\ell_2$ penalty), as is customary for deep neural networks. We can observe an even cleaner result:

MNIST - Dropout

Early stopping is again not useful at all, and the overall model is better than all of our previous attempts.

Conclusion: Do We Need Early Stopping?

Given the reasoning and the anecdotal experimental evidence above, I personally tend to think that beliefs in the usefulness of early stopping (in the context of neural network training) may be well overrated. Even if it may improve generalization for some nonlinear models, you would most probably achieve the same effect more reliably using other regularization techniques, such as dropout or a simple $\ell_2$ penalty.

Note, though, that there is a difference between early stopping in the context of neural networks and, say, boosting models. In the latter case early stopping is actually more explicitly limiting the complexity of the final model and, I suspect, might have a much more meaningful effect. At least we can't directly carry over the experimental examples and results in this blog post to that case.

Also note, that no matter whether early stopping helps or harms the generalization of the trained model, it is still a useful heuristic as to when to stop a lengthy training process automatically if we simply need results that are good enough.

Tags: Data analysis, Explanation, Machine learning, Project, Python, Statistics, Theory, Unclear
Strong Typing

Posted by Konstantin 12.03.2014 No Comments
Whenever you write a program, you want this program to behave correctly and do what you want it to do. Thus, programming always goes together with the mental act of proving to yourself (and sometimes to other people as well), that the code you write is correct. Most often this "proof" is implicit, dissolved in the way you write your code and comment it. In fact, in my personal opinion, "good code" is exactly the one, where a human-reviewer is able to verify its correctness without too much effort.

It is natural to use computers to help us verify correctness. Everyone who has ever programmed in a strictly-typed programming language, such as Java or Haskell, is familiar with the need to specify types of variables and functions and follow strict rules of type-safety. But of course, ensuring type-safety is just the compiler's way to help you ensure some basic claims about the program, such as "this variable will always contain an integer" or "this function will always be invoked with exactly three parameters".

This is very convenient, yet can be somewhat limiting or annoying at times. After all, type-safety requires you to write code that "can be type-checked". Although very often this is expected of "good code" anyway, there are situations where you would like some more flexibility. For this reason some languages impose no type-safety rules at all (e.g. Python or Javascript), and some languages let you disable type-checking for parts of code.

Rather than disabling the type checker, another principled way to allow more flexibility is to make the type-checker smarter. This is the promise of dependent types. In principle, a language, which supports dependent types, would let you make much more detailed statements about your program and have your program automatically checked for correctness with respect to those statements. Rather than being limited to primitive claims like "this variable is an integer", the use of dependent types enables you to assert things like "this is a sorted list", or "this is an odd integer", and so on up to nearly arbitrary level of detail, in the form of a type annotation. At least that much I found out during a course at the recent winter school.

The course was based on the Agda programming language, and the first thing I decided to try implementing in Agda is a well-typed version of the following simple function:
```
f t = if t then true else 0
```
It might look like something trivial, yet most traditional strictly typed languages would not let you write this. After all, a function has to have a return type, and it has to be either a Boolean or an Integer, but not both. In this case, however, we expect our function to have a parameter-dependent type:
```
f : (t : Bool) → if t then Bool else Integer
```
Given that Agda is designed to support dependent types, how complicated could it be to implement such a simple function? It turns out, it takes a beginner more than an hour of thinking and a couple of consultations with the specialists in the field. The resulting code will include at least three different definitions of "if-then-else" statements and, I must admit, some aspects of it are still not completely clear to me.

IF-THEN-ELSE in Agda, including all the boilerplate code

This is the longest code I've ever had to write to specify a simple if-then-else statement. The point of this blog post is to share the amusement and suggest you to check out Agda if you are in the mood for some puzzle-solving.

As for dependent types, I guess those are not becoming mainstream any time soon.
Tags: Computer science, Fun, Procrastination, Programming, Theory, Unclear
The Meaning of Money

Posted by Konstantin 25.09.2008 13 Comments
Money is something we take for granted and we usually don't think too much into what it actually is. As a result, when it comes up as a discussion topic, one may easily discover that nearly everyone has a personal twist on the understanding of what exactly is "money", and how this notion should be interpreted. At least that is the case within the circle of people, who are far from economics; and computer science people most often are. Last week this fact got itself another confirmation, when Swen came up with some fiscal policy-related problem, which didn't seem like a problem to me at all. After a long round of discussion we found out that when we said "money" we meant different things. So I thought I'd put down my understanding of this notion in words, so that next time I could point here and say: look, that's what I mean.

Of course, I don't pretend to be even a bit authoritative. After all, I don't know anything about economics (although I did take the elementary course, as any computer science student should). But I do find my understanding quite consistent with observation and common knowledge.

The Ideal Case

It's easy to start by imagining a country, call it "Veggieland", where people are completely inactive. They don't eat, work or play. Don't shop, produce or consume. Just lie around like vegetables, enjoy the sun and somehow manage to stay alive nonetheless. It is clear, that in this world, no money is needed: there is simply nothing to do with it.

Next, imagine that suddenly one veggielander Adam started longing for a glass of water, but he had no idea where to get it. At the same time another veggielander Betty saw Adam's problem and said: "Hey Adam, I can give you a glass of water, but you have to scratch my back for that", because Betty really liked when someone scratched her back. They exchanged services and were happy. Still no money was needed in Veggieland.

After that day Adam discovered that he was really good at scratching Betty's back, and he even liked it. So next time he came to Betty and proposed to scratch her back even though he didn't want water. What he asked in return was a promise that she would get him a glass of water next time he would ask. After a month or so of scratching Betty's back he collected 100 water glass promises, each written with Betty's hand on a nice pink piece of paper, and he found out that this was much more water than he would need until the end of his life. But he had to do something with these promises, because he couldn't "unscratch" Betty's back, right? So he went to that other guy Carl and said: "Hey, Betty has promised me a glass of water, and I can pass this promise to you, if you give me that shiny glass bead that you have, because I adore glass beads". Carl always trusted Betty's promises so he agreed. Now Adam had a bead and 99 water promises, Carl had one water promise. Carl then went to Betty, showed her the pink piece of paper, Betty exchanged it for a glass of water and burned it. Now there were just 99 promises left in circulation in Veggieland.

A month has passed and veggielanders understood that they could actually pass Betty's promises around, and give out their own promises in exchange for services. They wrote promises on pink pieces of paper and picked the "water glass promise" as a universal measure: now Daffy was providing floozies in exchange for 2 water glass promises, and Ewan was selling tiddlybums at a price 3 water glasses per portion. The system worked, because it satisfied two simple conditions:
1. Firstly, at any given moment the number of undone work was equal the number of these pink paper promises in circulation.
2. Secondly, the number of promised work was always feasible.
These conditions were easy to satisfy in Veggieland, because veggielanders were all honest and friendly people. They decided that any time someone fulfills his own promise, she should destroy one pink paper, and any time someone gives a promise, she should write a new pink paper, but no one should give a promise he can't hold.

As a result the people of Veggieland ended up with a perfect monetary system, driving its economics, and allowing citizens to trade services and goods. The analogy of this system with the real monetary systems around is straightforward, yet it is much easier to understand its properties, and thus the properties of the real-world money. The following are the important observations:
- Money itself is just an abstract promise, it has no form or inherent value. As long conditions 1 and 2 are satisfied, there is really no difference whether the function of money is performed using pink papers, glass beads, or even word-of-mouth promises. Money is not a resource, don't be misled by the idea of the gold standard times, that money equals gold. The point of the gold standard was just that, at the time, gold could somewhy satisfy properties 1 and 2 better than printed money and checks. Otherwise, you don't really need to "support" money with goods for no purpose, but to provide an illusion to people that money is indeed worth it's promise.
- The amount of money in circulation does not equal the amount of natural resources of the land. It is equal to the amount of economic activity, i.e., the number of unfulfilled promises. If someone discovers a gold mine, or simply brings in a lot of gold into the country, this will begin having impact only when people start actually demanding this resource and giving out promises in exchange.
- Who makes money? Well, ideally, each person that gives a reasonable promise in exchange for a real service increases the amount of money in circulation, and the person that fulfills a promise - decreases. Of course, real people are not honest and therefore real monetary systems are a crude approximation, with banks being in response of this dynamic money emission process.
- The choice of currency does make a difference. If someone comes to Veggieland with a bunch of euros (which are, in a sense, promises given by the europeans), the veggielanders won't be inclined to exchange them for their own promises, because although the newcomer can easily make use of the veggielanders' services, the contrary is not true: Veggielanders just never leave their country! A corollary of this observation is that the larger is the geographical span of a particular currency, the further away from the ideal trade unit it will be.
The Real Life

Now let's look at how monetary system is implemented in real countries. There are two main aspects to this implementation.

First issue is the choice and social support for the base monetary unit. Each country typically figures out its favourite name and denomination for the unit and calls it its "currency". A social contract is then made among the citizens, that all promises in the country are to be measured in this unit, and everyone should accept this unit in return for services. Note that this social contract guarantees the absolute liquidity of the currency and puts it strictly aside from anything else you might imagine to be a "close analogue" for money - be it gold, government bonds or securities. The latter are just not liquid enough to work as a general promise-container.

Secondly, the process of emission and subtraction of money from circulation. This certainly cannot be trusted to people, like it was in Veggieland. A banking system serves the purpose. The idea is that a single central trusted party is chosen, which emits and discards money. A person comes to the bank asking for a loan, the bank confirms that the person is trustworthy enough to keep his promises and gives him some money: money has just been emitted into circulation. Note that in our digital world most money is emitted in digital form, no paper money needs to be printed for that. In other words, the number you see on your account is actually larger than the number of paper money stored "behind" it. This is very natural, because otherwise the bank would have to keep paper-money-printing and destroying machines in it, which would be dangerous.

Most real monetary systems are just a bit more complicated, having a two-level banking system. The central bank crudely regulates money supply by emitting paper money, giving loans to commercial banks and instructing them to as to how much "more" money they may emit via the reserve requirements. The commercial banks then provide actual banking services to the people. With this distributed system people have a choice of several banks to trust their finances to, not just one. And if it turns out that one bank emits too much (i.e. gives too many loans), so that the emitted promises cannot be met, only the people trusting this bank (i.e. holding "this bank's" money) will suffer.

Final Questions

Despite the fact that the above theory seems consistent to me, it is not without complications. As I believe you've already tired of reading this post, I'll avoid spilling too much further thought upon you and just list the three points I find most confusing briefly.
- Despite being an abstract trade unit, money actually has value: it can bring interests. How do these interests "earned from nowhere" correspond to money emission? Why should owing lots of abstract promises necessarily generate more abstract promises?
- Internet banks take a fee for transfers, which makes digital money somewhat less liquid than paper money. I find it incorrect. Why should I pay to pay? I understand that banks need resources to support their services, but the amount of resources should not be proportional to the number of operations performed.
- If all the customers of a bank one day come to claim their savings in cash, the bank won't have enough cash and it will be perceived as a collapse for the bank. Considering the fact that digital money is, in fact, still real money, I see this as a certain drawback of the current two-level banking system. Or is it a necessary condition to distribute trust among the commercial banks?
If you think I'm completely wrong with all that, you are welcome to state your opinion in the comments.
Tags: Economics, Unclear
Understanding Cross-validation

Posted by Konstantin 17.09.2008 2 Comments
Suppose the other day you came up with a fresh and brilliant idea: to make an automated system that would track the social and economical situation in various countries and predict when and where a new war is going to happen. So you collected historical data on numerous incidents in the world, fed that data into your favourite machine learning model, played around with the parameters, and voilà - you've got something that seems to work. There is just one thing left to do - assess whether the system you created is any good. At the very least, you need to measure its precision: the probability that a positive prediction will in fact be true.

But how do you measure precision in this case? The straightforward way would be to wait a pair of years and compare the numbers of actual and predicted wars on the planet - this would provide an approximate estimate of the true generalization ability of your system. However, this is not a very convenient option, so you better figure out how to assess your new system using only the existing data. And this is when things can get confusing (at least if you think long enough). The most popular solutions here are, of course, holdout testing and cross-validation, but their interpretation is often overlooked, so I believe they are worth thinking about for a minute.

Holdout

Conceptually the simplest way is the holdout validation: you randomly split your data into a training set of size n and a test set of size m, use former for training and estimate performance on the latter. The two things to note here are the following:
- Due to the finite size of the test set you will only get an approximate idea of the true performance. The rule of thumb is that the performance that you measure has an error of the order 1/sqrt(m). You get this number if you try to compute the 99% confidence interval under the normality assumption.
- You are actually measuring the performance of the classifier constructed on your training set, and not the one that you are interested in. This introduces another bias in your measurement. Most often this bias is pessimistic (i.e. the performance of a classifier trained on a smaller set of data is usually worse than the performance of the classifier, trained on the full dataset), however if the dataset contains outliers this need not be the case. I am not aware of any results describing the magnitude of this bias, but I presume that for small datasets it depends on the proportion of data used for training.
Cross-validation

The idea of cross-validation is to repeat the holdout experiment described above k times and average the results.
- If on each iteration your training would result in the same classifier, you could say that the resulting measured performance has an error of the order of 1/sqrt(km). However, each iteration of cross-validation results in a different classifier being built on the selected training set. As a result, you are testing k different classifiers, each with precision 1/sqrt(m), take their average performance, and hope to obtain a better estimate for the performance of your "main" classifier. This hope is justified, because you believe that all of the k classifiers are similar to the "main" one. But what if it is not true? Can it happen so that the variance introduced by the differences in the k classifiers is significant enough to be considered? It is, after all, possible, if the data is very scarce. But again, cross-validation is used precisely when the data is scarce.
- Similarly to the case with holdout, the cross-validation estimate will most often be pessimistically biased, and it seems that no one really knows how large the bias is going to be.
Double Cross-validation

Finally, things get more complicated when you have some "hyper" parameters in your model that you need to tune (such as the λ tradeoff parameter for regularized models). If you simply estimate the performance for a range of parameter values and then pick the one reporting the best value, you are introducing an optimistic bias, because you have in fact tested your algorithm on the same data that was used for training. Therefore, it would be fair to "re-test" your algorithm on some additional data. It is not uncommon to see cases, when a round of cross-validation is used to select the model parameters, and another round is used to estimate the performance (that is, you split the training set in turn into a training and testing sets for model selection). But now things get strange.
- If on each iteration of "outer" cross validation, we re-select the "high-impact" model parameters using "inner" cross-validation, aren't we introducing just too much variability into the k models of the outer round? What performance are we measuring? Is it optimistically or pessimistically biased now? In general, how close should this doubly-cross-validated performance measure to the real model performance? How should this number be interpreted?
If some minutes of thought will help you answer these questions, I believe these would be minutes well worth spending.
Tags: Machine learning, Model selection, Unclear