• Posted by Konstantin 06.12.2017 6 Comments

    This is still a work in progress, a couple of experimental results are to be added soon.

    Early stopping is a technique that is very often used when training neural networks, as well as with some other iterative machine learning algorithms. The idea is quite intuitive - let us measure the performance of our model on a separate validation dataset during the training iterations. We may then observe that, despite constant score improvements on the training data, the model's performance on the validation dataset would only improve during the first stage of training, reach an optimum at some point and then turn to getting worse with further iterations.

    The early stopping principle

    The early stopping principle

    It thus seems reasonable to stop training at the point when the minimal validation error is achieved. Training the model any further only leads to overfitting. Right? The reasoning sounds solid and, indeed, early stopping is often claimed to improve generalization in practice. Most people seem to take the benefit of the technique for granted. In this post I would like to introduce some skepticism into this view or at least illustrate that things are not necessarily as obvious as they may seem from the diagram with the two lines above.

    How does Early Stopping Work?

    To get a better feeling of what early stopping actually does, let us examine its application to a very simple "machine learning model" - the estimation of the mean. Namely, suppose we are given a sample of 50 points \mathbf{x}_i from a normal distribution with unit covariance and we need to estimate the mean \mathbf{w} of this distribution.

    Sample

    Sample

    The maximum likelihood estimate of \mathbf{w} can be found as the point which has the smallest sum of squared distances to all the points in the sample. In other words, "model fitting" boils down to finding the minimum of the following objective function:

        \[f_\mathrm{train}(\mathrm{w}) := \sum_{i=1}^{50} \Vert \mathbf{x}_i - \mathbf{w}\Vert^2\]

    As our estimate is based on a finite sample, it, of course, won't necessarily be exactly equal to the true mean of the distribution, which I chose in this particular example to be exactly (0,0):

    Sample mean as a minimum of the objective function

    Sample mean as a minimum of the objective function

    The circles in the illustration above are the contours of the objective function, which, as you might guess, is a paraboloid bowl. The red dot marks its bottom and is thus the solution to our optimization problem, i.e. the estimate of the mean we are looking for. We may find this solution in various ways. For example, a natural closed-form analytical solution is simply the mean of the training set. For our purposes, however, we will be using the gradient descent iterative optimization algorithm. It is also quite straightforward: start with any point (we'll pick (-0.5, 0) for concreteness' sake) and descend in small steps downwards until we reach the bottom of the bowl:

    Gradient descent

    Gradient descent

    Let us now introduce early stopping into the fitting process. We will split our 50 points randomly into two separate sets: 40 points will be used to fit the model and 10 will form the early stopping validation set. Thus, technically, we now have two different objective functions to deal with:

        \[f_\mathrm{fit}(\mathrm{w}) := \sum_{i=1}^{40} \Vert \mathbf{x}_i - \mathbf{w}\Vert^2\]

    and

        \[f_\mathrm{stop}(\mathrm{w}) := \sum_{i=41}^{50} \Vert \mathbf{x}_i - \mathbf{w}\Vert^2.\]

    Each of those defines its own "paraboloid bowl", both slightly different from the original one (because those are different subsets of data):

    Fitting and early stopping objectives

    Fitting and early stopping objectives

    As our algorithm descends towards the red point, we will be tracking the value of f_\mathrm{stop} at each step along the way:

    Gradient descent with validation

    Gradient descent with validation

    With a bit of imagination you should see on the image above, how the validation error decreases as the yellow trajectory approaches the purple dot and then starts to increase after some point midway. The spot where the validation error achieves the minimum (and thus the result of the early stopping algorithm) is shown by the green dot on the figure below:

    Early stopping

    Early stopping

    In a sense, the validation function now acts as a kind of a "guardian", preventing the optimization from converging towards the bottom of our main objective. The algorithm is forced to settle on a model, which is neither an optimum of f_\mathrm{fit} nor of f_\mathrm{stop}. Moreover, both f_\mathrm{fit} and f_\mathrm{stop} use less data than f_\mathrm{train}, and are thus inherently a worse representation of the problem altogether.

    So, by applying early stopping we effectively reduced our training set size, used an even less reliable dataset to abort training, and settled on a solution which is not an optimum of anything at all. Sounds rather stupid, doesn't it?

    Indeed, observe the distribution of the estimates found with (blue) and without (red) early stopping in repeated experiments (each time with a new random dataset):

    Solutions found with and without early stopping

    Solutions found with and without early stopping

    As we see, early stopping greatly increases the variance of the estimate and adds a small bias towards our optimization starting point.

    Finally, let us see how the quality of the fit depends on the size of the validation set:

    Fit quality vs validation set size

    Fit quality vs validation set size

    Here the y axis shows the squared distance of the estimated point to the true value (0,0), smaller is better (the dashed line is the expected distance of a randomly picked point from the data).  The x axis shows all possible sizes of the validation set. We see that using no early stopping at all (x=0) results in the best expected fit. If we do decide to use early stopping, then for best results we should split the data approximately equally into training and validation sets. Interestingly, there do not seem to be much difference in whether we pick 30%, 50% or 70% of data for the validation set - the validation set seems to play just as much role in the final estimate as the training data.

    Early Stopping with Non-convex Objectives

    The experiment above seems to demonstrate that early stopping should be almost certainly useless (if not harmful) for fitting simple convex models. However, it is never used with such models in practice. Instead, it is most often applied to the training of multilayer neural networks. Could it be the case that the method somehow becomes useful when the objective is highly non-convex? I have no simple theory to prove or disprove it, nor would an extensive experimental exploration fit in the scope of this blog post. Let us, however, run a small experiment, measuring the benefits of early stopping for fitting a convolutional neural-network on the MNIST dataset. For simplicity, I took the standard example from the Keras codebase, and modified to include early stopping. Here is the result I got out of a single run:

    MNIST Experiment

    MNIST Experiment

    The y axis depicts log-loss on the 10k MNIST test set, the x axis shows the proportion of the 60k MNIST training set set aside for early stopping. Once again, we see that using no early stopping (and running a fixed number of 100 epochs, which is about twice the number of epochs required with early stopping) results in better test error in the end.

    Is Early Stopping Useful At All?

    The idea that early stopping is a useful regularization method is quite widespread and not without grounds.  However, given the reasoning and the anecdotal experimental evidence above, I personally tend to believe that beliefs in its usefulness in the context of neural network training may be well overrated. We may regard early stopping as a kind of a regularization tool. Indeed, if you start training from a parameter vector of zeroes, then stopping the training early is vaguely analogous to suppressing the norm of the parameter vector by preventing it from leaving a certain area around zero. However, we could achieve a similar effect much cleaner by applying \ell_1 or \ell_2 regularization penalty to the parameters directly.

    Note, though, that there is a slight difference between early stopping in the context of neural networks and, say, boosting models. In the latter case early stopping is quite directly limiting the complexity of the overall model, and I feel this may result in a somewhat different overall effect, not directly comparable to neural network training. In that context it seems to make more sense to me.

    Also note, that no matter whether early stopping helps or harms the generalization of the trained model, it is still a useful heuristic as to when to stop a lengthy training automatically if we just need to get some "good enough" results.

     

    Tags: , , , , , , ,