Four Years Remaining

The Mystery of Early Stopping

Posted by Konstantin 06.12.2017 7 Comments

Early stopping is a technique that is very often used when training neural networks, as well as with some other iterative machine learning algorithms. The idea is quite intuitive - let us measure the performance of our model on a separate validation dataset during the training iterations. We may then observe that, despite constant score improvements on the training data, the model's performance on the validation dataset would only improve during the first stage of training, reach an optimum at some point and then turn to getting worse with further iterations.

The early stopping principle

It thus seems reasonable to stop training at the point when the minimal validation error is achieved. Training the model any further only leads to overfitting. Right? The reasoning sounds solid and, indeed, early stopping is often claimed to improve generalization in practice. Most people seem to take the benefit of the technique for granted. In this post I would like to introduce some skepticism into this view or at least illustrate that things are not necessarily as obvious as they may seem from the diagram with the two lines above.

How does Early Stopping Work?

To get a better feeling of what early stopping actually does, let us examine its application to a very simple "machine learning model" - the estimation of the mean. Namely, suppose we are given a sample of 50 points $\mathbf{x}_i$ from a normal distribution with unit covariance and we need to estimate the mean $\mathbf{w}$ of this distribution.

Sample

The maximum likelihood estimate of $\mathbf{w}$ can be found as the point which has the smallest sum of squared distances to all the points in the sample. In other words, "model fitting" boils down to finding the minimum of the following objective function:

$f_\mathrm{train}(\mathrm{w}) := \sum_{i=1}^{50} \Vert \mathbf{x}_i - \mathbf{w}\Vert^2$

As our estimate is based on a finite sample, it, of course, won't necessarily be exactly equal to the true mean of the distribution, which I chose in this particular example to be exactly (0,0):

Sample mean as a minimum of the objective function

The circles in the illustration above are the contours of the objective function, which, as you might guess, is a paraboloid bowl. The red dot marks its bottom and is thus the solution to our optimization problem, i.e. the estimate of the mean we are looking for. We may find this solution in various ways. For example, a natural closed-form analytical solution is simply the mean of the training set. For our purposes, however, we will be using the gradient descent iterative optimization algorithm. It is also quite straightforward: start with any point (we'll pick (-0.5, 0) for concreteness' sake) and descend in small steps downwards until we reach the bottom of the bowl:

Gradient descent

Let us now introduce early stopping into the fitting process. We will split our 50 points randomly into two separate sets: 40 points will be used to fit the model and 10 will form the early stopping validation set. Thus, technically, we now have two different objective functions to deal with:

$f_\mathrm{fit}(\mathrm{w}) := \sum_{i=1}^{40} \Vert \mathbf{x}_i - \mathbf{w}\Vert^2$

and

$f_\mathrm{stop}(\mathrm{w}) := \sum_{i=41}^{50} \Vert \mathbf{x}_i - \mathbf{w}\Vert^2.$

Each of those defines its own "paraboloid bowl", both slightly different from the original one (because those are different subsets of data):

Fitting and early stopping objectives

As our algorithm descends towards the red point, we will be tracking the value of $f_\mathrm{stop}$ at each step along the way:

Gradient descent with validation

With a bit of imagination you should see on the image above, how the validation error decreases as the yellow trajectory approaches the purple dot and then starts to increase after some point midway. The spot where the validation error achieves the minimum (and thus the result of the early stopping algorithm) is shown by the green dot on the figure below:

Early stopping

In a sense, the validation function now acts as a kind of a "guardian", preventing the optimization from converging towards the bottom of our main objective. The algorithm is forced to settle on a model, which is neither an optimum of $f_\mathrm{fit}$ nor of $f_\mathrm{stop}$ . Moreover, both $f_\mathrm{fit}$ and $f_\mathrm{stop}$ use less data than $f_\mathrm{train}$ , and are thus inherently a worse representation of the problem altogether.

So, by applying early stopping we effectively reduced our training set size, used an even less reliable dataset to abort training, and settled on a solution which is not an optimum of anything at all. Sounds rather stupid, doesn't it?

Indeed, observe the distribution of the estimates found with (blue) and without (red) early stopping in repeated experiments (each time with a new random dataset):

Solutions found with and without early stopping

As we see, early stopping greatly increases the variance of the estimate and adds a small bias towards our optimization starting point.

Finally, let us see how the quality of the fit depends on the size of the validation set:

Fit quality vs validation set size

Here the y axis shows the squared distance of the estimated point to the true value (0,0), smaller is better (the dashed line is the expected distance of a randomly picked point from the data). The x axis shows all possible sizes of the validation set. We see that using no early stopping at all (x=0) results in the best expected fit. If we do decide to use early stopping, then for best results we should split the data approximately equally into training and validation sets. Interestingly, there do not seem to be much difference in whether we pick 30%, 50% or 70% of data for the validation set - the validation set seems to play just as much role in the final estimate as the training data.

Early Stopping with Non-convex Objectives

The experiment above seems to demonstrate that early stopping should be almost certainly useless (if not harmful) for fitting simple convex models. However, it is never used with such models in practice. Instead, it is most often applied to the training of multilayer neural networks. Could it be the case that the method somehow becomes useful when the objective is highly non-convex? Let us run a small experiment, measuring the benefits of early stopping for fitting a convolutional neural-network on the MNIST dataset. For simplicity, I took the standard example from the Keras codebase, and modified it slightly. Here is the result we get when training the the most basic model:

MNIST - Basic

The y axis depicts log-loss on the 10k MNIST test set, the x axis shows the proportion of the 60k MNIST training set set aside for early stopping. Ignoring small random measurement noise, we may observe that using early stopping with about 10% of the training data does seem to convey a benefit. Thus, contrary to our previous primitive example, when the objective is complex, early stopping does work as a regularization method. Why and how does it work here? Here's one intuition I find believable (there are alternative possible explanations and measurements, none of which I find too convincing or clear, though): stopping the training early prevents the algorithm from walking too far away from the initial parameter values. This limits the overall space of models and is vaguely analogous to suppressing the norm of the parameter vector. In other words, early stopping resembles an ad-hoc version of $\ell_p$ regularization.

Indeed, observe how the use of early stopping affects the results of fitting the same model with a small $\ell_2$ -penalty added to the objective:

MNIST - L2

All of the benefits of early stopping are gone now, and the baseline (non-early-stopped, $\ell_2$ -regularized) model is actually better overall than it was before. Let us now try an even more heavily regularized model by adding dropout (instead of the $\ell_2$ penalty), as is customary for deep neural networks. We can observe an even cleaner result:

MNIST - Dropout

Early stopping is again not useful at all, and the overall model is better than all of our previous attempts.

Conclusion: Do We Need Early Stopping?

Given the reasoning and the anecdotal experimental evidence above, I personally tend to think that beliefs in the usefulness of early stopping (in the context of neural network training) may be well overrated. Even if it may improve generalization for some nonlinear models, you would most probably achieve the same effect more reliably using other regularization techniques, such as dropout or a simple $\ell_2$ penalty.

Note, though, that there is a difference between early stopping in the context of neural networks and, say, boosting models. In the latter case early stopping is actually more explicitly limiting the complexity of the final model and, I suspect, might have a much more meaningful effect. At least we can't directly carry over the experimental examples and results in this blog post to that case.

Also note, that no matter whether early stopping helps or harms the generalization of the trained model, it is still a useful heuristic as to when to stop a lengthy training process automatically if we simply need results that are good enough.

Tags: Data analysis, Explanation, Machine learning, Project, Python, Statistics, Theory, Unclear
Sorting in Linear Time

Posted by Konstantin 25.07.2017 3 Comments
Every student of computer science, who has managed to keep even a tiny shred of attention at their algorithms course, should know that sorting $n$ numbers is a task that requires at least $\Omega(n \log n)$ time in general. There are some special cases, such as sorting small integers, where you can use counting sort or radix sort to beat this baseline, but as long as your numbers are hypothetically arbitrarily large, you are stuck with the $\Omega(n \log n)$ lower bound. Right?

Well, not really. One thing that many algorithms courses tend to skim over rather briefly is the discussion of the choice of the computation model, under which the algorithm of interest is supposed to run. In particular, the $\Omega(n \log n)$ bound for sorting holds for the comparison-only model of computation — the abstract situation where the algorithm may only perform pairwise comparisons of the numbers to be sorted. No arithmetic, bit-shifts or anything else your typical processor is normally trained to do is allowed. This is, obviously, not a very realistic model for a modern computer.

Let us thus consider a different computation model instead, which allows our computer to perform any of the basic arithmetic or bitwise operations on numbers in constant time. In addition, to be especially abstract, let us also assume that our computer is capable of handling numbers of arbitrary size. This is the so-called unit-cost RAM model.

It turns out that in this case one can sort arbitrarily large numbers in linear time. The method for achieving this (presented in the work of W. Paul and J. Simon, not to be confused with Paul Simon) is completely impractical, yet quite insightful and amusing (in the geeky sense). Let me illustrate it here.

Paul-and-Simon Sorting

The easiest way to show an algorithm is to step it through an example. Let us therefore consider the example task of sorting the following array of three numbers:
```
a = [5, 3, 9]
```
Representing the same numbers in binary:
```
[101, 11, 1001]
```
Our algorithm starts with a linear pass to find the bit-width of the largest number in the array. In our case the largest number is 9 and has 4 bits:
```
bits = max([ceil(log2(x)) for x in a])     # bits = 4
n = len(a)                                 # n = 3
```
Next the algorithm will create a $(4+1)\cdot 3^2 = 45$ -bit number A of the following binary form:
```
 1 {5} 1 {5} 1 {5} 1 {3} 1 {3} 1 {3} 1 {9} 1 {9} 1 {9}
```
where {9}, {3} and {5} denote the 4-bit representations of the corresponding numbers. In simple terms, we need to pack each array element repeated $n$ times together into a single number. It can be computed in linear time using, for example, the following code:
```
temp, A = 0, 0
for x in a:
    temp = (temp<<(n*(bits+1))) + (1<<bits) + x
for i in range(n):
    A = (A<<(bits+1)) + temp
```
The result is 23834505373497, namely:
```
101011010110101100111001110011110011100111001
```
Next, we need to compute another 45-bit number B, which will also pack all the elements of the array $n$ times, however this time they will be separated by 0-bits and interleaved as follows:
```
 0 {5} 0 {3} 0 {9} 0 {5} 0 {3} 0 {9} 0 {5} 0 {3} 0 {9}
```
This again can be done in linear time:
```
temp, B = 0, 0
for x in a:
    temp = (temp<<(bits+1)) + x
for i in range(n):
    B = (B<<(n*(bits+1))) + temp
```
The result is 5610472248425, namely:
```
001010001101001001010001101001001010001101001
```
Finally, here comes the magic trick: we subtract B from A. Observe how with this single operation we now actually perform all pairwise subtractions of the numbers in the array:
```
A = 1 {5} 1 {5} 1 {5} 1 {3} 1 {3} 1 {3} 1 {9} 1 {9} 1 {9} 
B = 0 {5} 0 {3} 0 {9} 0 {5} 0 {3} 0 {9} 0 {5} 0 {3} 0 {9}
```
Consider what happens to the bits separating all the pairs. If the number on top is greater or equal to the number on the bottom of the pair, the corresponding separating bit on the left will not be carried in the subtraction, and the corresponding bit of the result will be 1. However, whenever the number on the top is less than the number on the bottom, the resulting bit will be zeroed out due to carrying:
```
A   = 1 {5} 1 {5} 1 { 5} 1 { 3} 1 {3} 1 { 3} 1 {9} 1 {9} 1 {9} 
B   = 0 {5} 0 {3} 0 { 9} 0 { 5} 0 {3} 0 { 9} 0 {5} 0 {3} 0 {9}
A-B = 1 {0} 1 {2} 0 {12} 0 {14} 1 {0} 0 {10} 1 {4} 1 {6} 1 {0}
```
The same in binary (highlighted groups correspond to repetitions of the original array elements in the number A):
```
A   = 1 0101 1 0101 1 0101|1 0011 1 0011 1 0011|1 1001 1 1001 1 1001
B   = 0 0101 0 0011 0 1001|0 0101 0 0011 0 1001|0 0101 0 0011 0 1001
A-B = 1 0000 1 0010 0 1100|0 1110 1 0000 0 1010|1 0100 1 0110 1 0000
```
Each "separator" bit of A-B is effectively the result of a comparison of every array element with every other. Let us now extract these bits using a bitwise AND and sum them within each group. It takes another couple of linear passes:
```
x = A-B >> bits
mask, result = 0, 0
for i in range(n):
    mask = (mask<<(n*(bits+1))) + 1
for i in range(n):
    result += x & mask
    x = x >> (bits+1)
```
The result is now the following number:
```
result = 10|000000000000001|000000000000011
```
It is a packed binary representation of the array r = [2, 1, 3]. The number 2 here tells us that there are two elements in a, which are less or equal than a[0]=5. Similarly, the number 1 says that there is only one element less or equal than a[1]=3, and the number 3 means there are three elements less or equal than a[2]=9. In other words, this is an array of ranks, which tells us how the original array elements should be rearranged into sorted order:
```
r = [result >> (n*(bits+1)*(n-i-1)) & ((1<<(n*(bits+1)))-1) 
                                          for i in range(n)]
a_sorted = [None]*n
for i in range(n):
    a_sorted[r[i]-1] = a[i]
```
And voilà, the sorted array! As presented above, the method would only work for arrays consisting of distinct non-negative integers. However, with some modifications it can be adapted to arbitrary arrays of integers or floats. This is left as an exercise to the reader.

The General Implications

There are several things one can learn from the "Paul-and-Simon sort". Firstly, it shows the immense power of the unit-cost RAM computational model. By packing arbitrary amounts of data into a single register of unlimited size, we may force our imaginary computer to perform enormously complex parallel computations in a single step. Indeed, it is known that PSPACE-complete problems can be solved in polynomial time in the unlimited-precision RAM model. This, however, assumes that the machine can do arbitrary arithmetic operations. If you limit it to only additions, subtractions and multiplications (but not divisions or bit-shifts), you still cannot sort integers faster than $\Omega(n \log n)$ even using infinitely-sized registers (this is the main result of the Paul and Simon's article that inspired this post). Not obvious, is it?

Of course, real computers can usually only perform constant-time operations on registers of a fixed size. This is formalized in the $w$ -bit word-RAM model, and in this model the "Paul and Simon sort" degrades from a $O(n)$ into a $O(n^3)$ algorithm (with $O(n^2)$ memory consumption). This is a nice illustration of how the same algorithm can have different complexity based on the chosen execution model.

The third thing that the "Paul and Simon sort" highlights very clearly is the power of arithmetic operations on packed values and bitstrings. In fact, this idea has been applied to derive practically usable integer sorting algorithms with nearly-linear complexity. The latter paper by Han & Thorup expresses the idea quite well:

In case you need the full code of the step-by-step explanation presented above, here it is.
Tags: Algorithms, Computer science, Programming, Theory
Implication and Provability

Posted by Konstantin 28.03.2017 No Comments
Consider the following question:
Which of the following two statements is logically true?
1. All planets of the Solar System orbit the Sun. The Earth orbits the Sun. Consequently, the Earth is a planet of the Solar System.
2. God is the creator of all things which exist. The Earth exists. Consequently, God created the Earth.
I've seen this question or variations of it pop up as "provocative" posts in social networks several times. At times they might invite lengthy discussions, where the participants would split into camps - some claim that the first statement is true, because Earth is indeed a planet of the Solar System and God did not create the Earth. Others would laugh at the stupidity of their opponents and argue that, obviously, only the second statement is correct, because it makes a valid logical implication, while the first one does not.

Not once, however, have I ever seen a proper formal explanation of what is happening here. And although it is fairly trivial (once you know it), I guess it is worth writing up. The root of the problem here is the difference between implication and provability - something I myself remember struggling a bit to understand when I first had to encounter these notions in a course on mathematical logic years ago.

Indeed, any textbook on propositional logic will tell you in one of the first chapters that you may write

$A \Rightarrow B$

to express the statement " $A$ implies $B$ ". A chapter or so later you will learn that there is also a possibility to write

$A \vdash B$

to express a confusingly similar statement, that " $B$ is provable from $A$ ". To confirm your confusion, another chapter down the road you should discover, that $A \Rightarrow B$ is the same as $\vdash A \Rightarrow B$ , which, in turn, is logically equivalent to $A \vdash B$ . Therefore, indeed, whenever $A \Rightarrow B$ is true, $A \vdash B$ is true, and vice-versa. Is there a difference between $\vdash$ and $\Rightarrow$ then, and why do we need the two different symbols at all? The "provocative" question above provides an opportunity to illustrate this.

The spoken language is rather informal, and there can be several ways of formally interpreting the same statement. Both statements in the puzzle are given in the form " $A$ , $B$ , consequently $C$ ". Here are at least four different ways to put them formally, which make the two statements true or false in different ways.

The Pure Logic Interpretation

Anyone who has enough experience solving logic puzzles would know that both statements should be interpreted as abstract claims about provability (i.e. deducibility):

$A, B \vdash C.$

As mentioned above, this is equivalent to

$(A\,\&\, B) \Rightarrow C.$

or

$\vdash (A\,\&\, B) \Rightarrow C.$

In this interpretation the first statement is wrong and the second is a correct implication.

The Pragmatic Interpretation

People who have less experience with math puzzles would often assume that they should not exclude their common sense knowledge from the task. The corresponding formal statement of the problem then becomes the following:

$[\text{common knowledge}] \vdash (A\,\&\, B) \Rightarrow C.$

In this case both statements become true. The first one is true simply because the consequent $C$ is true on its own, given common knowledge (the Earth is indeed a planet) - the antecedents and provability do not play any role at all. The second is true because it is a valid reasoning, independently of the common knowledge.

This type of interpretation is used in rhetorical phrases like "If this is true, I am a Dutchman".

The Overly Strict Interpretation

Some people may prefer to believe that a logical statement should only be deemed correct if every single part of it is true and logically valid. The two claims must then be interpreted as follows:

$([\text{common}] \vdash A)\,\&\, ([\text{common}] \vdash B)\,\&\, (A, B\vdash C).$

Here the issue of provability is combined with the question about the truthfulness of the facts used. Both statements are false - the first fails on logic, and the second on facts (assuming that God creating the Earth is not part of common knowledge).

The Oversimplified Interpretation

Finally, people very unfamiliar with strict logic would sometimes tend to ignore the words "consequently", "therefore" or "then", interpreting them as a kind of an extended synonym for "and". In their minds the two statements could be regarded as follows:

$[\text{common}] \vdash A\,\&\, B\,\&\, C.$

From this perspective, the first statement becomes true and the second (again, assuming the aspects of creation are not commonly known) is false.

Although the author of the original question most probably did really assume the "pure logic" interpretation, as is customary for such puzzles, note how much leeway there can be when converting a seemingly simple phrase in English to a formal statement. In particular, observe that questions about provability, where you deliberately have to abstain from relying on common knowledge, may be different from questions about facts and implications, where common sense may (or must) be assumed and you can sometimes skip the whole "reasoning" part if you know the consequent is true anyway.

Here is an quiz question to check whether you understood what I meant to explain.

"The sky is blue, and therefore the Earth is round." True or false?
Tags: Explanation, Logic, Mathematics, Puzzle, Theory
The Difficulties of Self-Identification

Posted by Konstantin 07.03.2017 No Comments
Ever since the "Prior Confusion" post I was planning to formulate one of its paragraphs as the following abstract puzzle, but somehow it took me 8 years to write it up.

According to fictional statistical studies, the following is known about a fictional chronic disease "statistite":
1. About 30% of people in the world have statistite.
2. About 35% of men in the world have it.
3. In Estonia, 20% of people have statistite.
4. Out of people younger than 20 years, just 5% have the disease.
5. A recent study of a random sample of visitors to the Central Hospital demonstrated that 40% of them suffer from statistite.
Mart, a 19-year Estonian male medical student is standing in the foyer of the Central Hospital, reading these facts from an information sheet and wondering: what are his current chances of having statistite? How should he model himself: should he consider himself as primarily "an average man", "a typical Estonian", "just a young person", or "an average visitor of the hospital"? Could he combine the different aspects of his personality to make better use of the available information? How? In general, what would be the best possible probability estimate, given the data?
Tags: Bayes, Logic, Machine learning, Philosophy, Probability theory, Puzzle, Statistics, Theory
The Secrets of Spring Motion

Posted by Konstantin 17.11.2016 3 Comments

Imagine a weight hanging on a spring. Let us pull the weight a bit and release it into motion. What will its motion look like? If you remember some of your high-school physics, you should probably answer that the resulting motion is a simple harmonic oscillation, best described by a sinewave. Although this is a fair answer, it actually misses an interesting property of real-life springs. A property most people don't think much about, because it goes a bit beyond the high school curriculum. This property is best illustrated by

The Slinky Drop

The "slinky drop" is a fun little experiment which has got its share of internet fame.

The Slinky Drop

When the top end of a suspended slinky is released, the bottom seems to patiently wait for the top to arrive before starting to fall as well. This looks rather unexpected. After all, we know that things fall down according to a parabola, and we know that springs collapse according to a sinewave, however neither of the two rules seem to apply here. If you browse around, you will see lots of awesome videos demonstrating or explaining this effect. There are news articles, forum discussions, blog posts and even research papers dedicated to the magical slinky. However, most of them are either too sketchy or too complex, and none seem to mention the important general implications, so let me give a shot at another explanation here.

The Slinky Drop Explained Once More

Let us start with the classical, "high school" model of a spring. The spring has some length $L$ in the relaxed state, and if we stretch it, making it longer by $\Delta L$ , the two ends of the spring exert a contracting force of $k\Delta L$ . Assume we hold the top of the spring at the vertical coordinate $y_{\mathrm{top}}=0$ and have it balance out. The lower end will then position at the coordinate $y_{\mathrm{bot}} = -(L+mg/k)$ , where the gravity force $mg$ is balanced out exactly by the spring force.

How would the two ends of the spring behave if we let go off the top now? Here's how:

The horozontal axis here denotes the time, the vertical axis - is the vertical position. The blue curve is the trajectory of the top end of the spring, the green curve - trajectory of the bottom end. The dotted blue line is offset from the blue line by exactly $L$ - the length of the spring in relaxed state.

Observe that the lower end (the green curve), similarly to the slinky, "waits" for quite a long time for the top to approach before starting to move with discernible velocity. Why is it the case? The trajectory of the lower point can be decomposed in two separate movements. Firstly, the point is trying to fall down due to gravity, following a parabola. Secondly, the point is being affected by string tension and thus follows a cosine trajectory. Here's how the two trajectories look like separately:

They are surprisingly similar at the start, aren't they? And indeed, the cosine function does resemble a parabola up to $o(x^3)$ . Recall the corresponding Taylor expansion:

$\cos(x) = 1 - \frac{x^2}{2} + \frac{x^4}{24} + \dots \approx 1 - \frac{x^2}{2}.$

If we align the two curves above, we can see how well they match up at the beginning:

Consequently, the two forces happen to "cancel" each other long enough to leave an impression that the lower end "waits" for the upper for some time. This effect is, however, much more pronounced in the slinky. Why so?

Because, of course, a single spring is not a good model for the slinky. It is more correct to regard a slinky as a chain of strings. Observe what happens if we model the slinky as a chain of just three simple springs:

Each curve here is the trajectory of one of the nodes inbetween the three individual springs. We can see that the top two curves behave just like a single spring did - the green node waits a bit for the blue and then starts moving. The red one, however, has to wait longer, until the green node moves sufficiently far away. The bottom, in turn, will only start moving observably when the red node approaches it close enough, which means it has to wait even longer yet - by that time the top has already arrived. If we consider a more detailed model, the movement of a slinky composed of, say, 9 basic springs, the effect of intermediate nodes "waiting" becomes even more pronounced:

To make a "mathematically perfect" model of a slinky we have to go to the limit of having infinitely many infinitely small springs. Let's briefly take a look at how that solution looks like.

The Continuous Slinky

Let $x$ denote the coordinate of a point on a "relaxed" slinky. For example, in the two discrete models above the slinky had 4 and 10 points, numbered $1,\dots, 4$ and $1,\dots, 10$ respectively. The continuous slinky will have infinitely many points numbered $[0,1]$ .

Let $h(x,t)$ denote the vertical coordinate of a point $x$ at time $t$ . The acceleration of point $x$ at time $t$ is then, by definition $\frac{\partial^2 h(x,t)}{\partial^2 t}$ , and there are two components affecting it: the gravitational pull $-g$ and the force of the spring.

The spring force acting on a point $x$ is proportional to the stretch of the spring at that point $\frac{\partial h(x,t)}{\partial x}$ . As each point is affected by the stretch from above and below, we have to consider a difference of the "top" and "bottom" stretches, which is thus the derivative of the stretch, i.e. $\frac{\partial^2 h(x,t)}{\partial^2 x}$ . Consequently, the dynamics of the slinky can be described by the equation:

$\frac{\partial^2 h(x,t)}{\partial^2 t} = a\frac{\partial^2 h(x,t)}{\partial^2 x} - g.$

where $a$ is some positive constant. Let us denote the second derivatives by $h_{tt}$ and $h_{xx}$ , replace $a$ with $v^2$ and rearrange to get:

(1) $h_{tt} - v^2 h_{xx} = -g,$

which is known as the wave equation. The name stems from the fact that solutions to this equation always resemble "waves" propagating at a constant speed $v$ through some medium. In our case the medium will be the slinky itself. Now it becomes apparent that, indeed, the lower end of the slinky should not move before the wave of disturbance, unleashed by releasing the top end, reaches it. Most of the explanations of the slinky drop seem to refer to that fact. However when it is stated alone, without the wave-equation-model context, it is at best a rather incomplete explanation.

Given how famous the equation is, it is not too hard to solve it. We'll need to do it twice - first to find the initial configuration of a suspended slinky, then to compute its dynamics when the top is released.

In the beginning the slinky must satisfy $h_t(x, t) = 0$ (because it is not moving at all), $h(0, t) = 0$ (because the top end is located at coordinate 0), and $h_x(1, t) = 0$ (because there is no stretch at the bottom). Combining this with (1) and searching for a polynomial solution, we get:

$h(x, t) = h_0(x) = \frac{g}{2v^2}x(x-2).$

Next, we release the slinky, hence the conditions $h_t(x,t)=0$ and $h(0,t)=0$ disappear and we may use the d'Alembert's formula with reflected boundaries to get the solution:

$h(x,t) = \frac{1}{2}(\phi(x-vt) + \phi(x+vt)) - \frac{gt^2}{2},$

$\text{ where }\phi(x) = h_0(\mathrm{mod}(x, 2)).$

Here's how the solution looks like visually:

Notice how the part of the slinky to which the wave has not arrived yet, stays completely fixed in place. Here are the trajectories of 4 equally-spaced points on the slinky:

Note how, quite surprisingly, all points of the slinky are actually moving with a constant speed, changing it abruptly at certain moments. Somewhat magically, the mean of all these piecewise-linear trajectories (i.e. the trajectory of the center of mass of the slinky) is still a smooth parabola, just as it should be:

The Secret of Spring Motion

Now let us come back to where we started. Imagine a weight on a spring. What will its motion be like? Obviously, any real-life spring is, just like the slinky, best modeled not as a Hooke's simple spring, but rather via the wave equation. Which means that when you let go off the weight, the weight will send a deformation wave, which will move along the spring back and forth, affecting the pure sinewave movement you might be expecting from the simple Hooke's law. Watch closely:

Here is how the movement of the individual nodes looks like:

The fat red line is the trajectory of the weight, and it is certainly not a sinewave. It is a curve inbetween the piecewise-linear "sawtooth" (which is the limit case when the weight is zero) and the true sinusoid (which is the limit case when the mass of the spring is zero). Here's how the zero-weight case looks like:

And this is the other extreme - the massless spring:

These observations can be summarized into the following obviously-sounding conclusion: the basic Hooke's law applies exactly only to the the massless spring. Any real spring has a mass and thus forms an oscillation wave traveling back and forth along its length, which will interfere with the weight's simple harmonic oscillation, making it "less simple and harmonic". Luckily, if the mass of the weight is large enough, this interference is negligible.

And that is, in my opinion, one of the interesting, yet often overlooked aspects of spring motion.

Tags: Explanation, Mathematics, Paradox, Physics, Science, Theory, Visualization
The Meaning of the Gaussian Kernel

Posted by Konstantin 11.11.2016 4 Comments

A question on Quora reminded me that I wanted to post this explanation here every time I got a chance to teach SVMs and Kernel methods, but I never found the time. The post expects basic knowledge of those topics from the reader.

Introductory Background

The concept of kernel methods is probably one of the coolest tricks in machine learning. With most machine learning research nowadays being centered around neural networks, they have gone somewhat out of fashion recently, but I suspect they will strike back one day in some way or another.

The idea of a kernel method starts with the curious observation that if you take a dot product of two vectors, $x^T y$ , and square it, the result can be regarded as a dot product of two "feature vectors", where the features are all pairwise products of the original inputs:

$\begin{align*} &k(x,y) = (x^Ty)^2 = (\sum_i x_iy_i)^2 = \sum_{i,j} x_ix_jy_iy_j \\ & = \langle (x_1x_1, x_1x_2,\dots,x_ix_j,\dots,x_nx_n), (y_1y_1,\dots,y_iy_j,\dots,y_ny_n)\rangle \end{align*}$

Analogously, if you raise $x^Ty$ to the third power, you are essentially computing a dot product within a space of all possible three-way products of your inputs, and so on, without ever actually having to see those features explicitly.

If you now take any linear model (e.g. linear regression, linear classification, PCA, etc) it turns out you can replace the "real" dot product in its formulation model with such a kernel function, and this will magically convert your model into a linear model with nonlinear features (e.g. pairwise or triple products). As those features are never explicitly computed, there is no problem if there were millions or billions of them.

Consider, for example, plain old linear regression: $f(x) = w^Tx + b$ . We can "kernelize" it by first representing $w$ as a linear combination of the data points (this is called a dual representation):

$f(x) = \left(\sum_i \alpha_i x_i\right)^T x + b = \sum_i \alpha_i (x_i^T x) + b,$

and then swapping all the dot products $x_i^T x$ with a custom kernel function:

$f(x) = \sum_i \alpha_i k(x_i,x) + b.$

If we now substitute $k(x,y) = (x^T y)^2$ here, our model becomes a second degree polynomial regression. If $k(x,y) = (x^T y)^5$ it is the fifth degree polynomial regression, etc. It's like magic, you plug in different functions and things just work.

It turns out that there are lots of valid choices for the kernel function $k$ , and, of course, the Gaussian function is one of these choices:

$k(x, y) = \exp\left(-\frac{|x - y|^2}{2\sigma^2}\right).$

It is not too surprising - the Gaussian function tends to pop up everywhere, after all, but it is not obvious what "implicit features" it should represent when viewed as a kernel function. Most textbooks do not seem to cover this question in sufficient detail, usually, so let me do it here.

The Gaussian Kernel

To see the meaning of the Gaussian kernel we need to understand the couple of ways in which any kernel functions can be combined. We saw before that raising a linear kernel to the power $i$ makes a kernel with a feature space, which includes all $i$ -wise products. Now let us examine what happens if we add two or more kernel functions. Consider $k(x,y) = x^Ty + (x^Ty)^2$ , for example. It is not hard to see that it corresponds to an inner product of feature vectors of the form

$(x_1, x_2, \dots, x_n, \quad x_1x_1, x_1x_2,\dots,x_ix_j,\dots, x_n x_n),$

i.e. the concatenation of degree-1 (untransformed) features, and degree-2 (pairwise product) features.

Multiplying a kernel function with a constant $c$ is also meaningful. It corresponds to scaling the corresponding features by $\sqrt{c}$ . For example, $k(x,y) = 4x^Ty = (2x)^T(2y)$ .

Still with me? Great, now let us combine the tricks above and consider the following kernel:

$k(x,y) = 1 + x^Ty + \frac{(x^Ty)^2}{2} + \frac{(x^Ty)^3}{6}.$

Apparently, it is a kernel which corresponds to a feature mapping, which concatenates a constant feature, all original features, all pairwise products scaled down by $\sqrt{2}$ and all triple products scaled down by $\sqrt{6}$ .

Looks impressive, right? Let us continue and add more members to this kernel, so that it would contain all four-wise, five-wise, and so on up to infinity-wise products of input features. We shall choose the scaling coefficients for each term carefully, so that the resulting infinite sum would resemble a familiar expression:

$\sum_{i=0}^\infty \frac{(x^Ty)^i}{i!} = \exp(x^Ty).$

We can conclude here that $k(x,y) = \exp(x^Ty)$ is a valid kernel function, which corresponds to a feature space, which includes products of input features of any degree, up to infinity.

But we are not done yet. Suppose that we decide to normalize the inputs before applying our linear model. That is, we want to convert each vector $x$ to $\frac{x}{|x|}$ before feeding it to the model. This is quite often a smart idea, which improves generalization. It turns out we can do this “data normalization” without really touching the data points themselves, but by only tuning the kernel instead.

Consider again the linear kernel $k(x,y) = x^Ty$ . If we normalize the vectors before taking their inner product, we get

$\left(\frac{x}{|x|}\right)^T\left(\frac{y}{|y|}\right) = \frac{x^Ty}{|x||y|} = \frac{k(x,y)}{\sqrt{k(x,x)k(y,y)}}.$

With some reflection you will see that the latter expression would normalize the features for any kernel.

Let us see what happens if we apply this kernel normalization to the “infinite polynomial” (i.e. exponential) kernel we just derived:

$\begin{align*} \frac{\exp(x^Ty)}{\sqrt{\exp(x^Tx)\exp(y^Ty)}} &= \frac{\exp(x^Ty)}{\exp(|x|^2/2)\exp(|y|^2/2)} \\ &= \exp\left(-\frac{1}{2}(|x|^2+|y|^2-2x^Ty)\right) \\ &= \exp\left(-\frac{|x-y|^2}{2}\right). \end{align*}$

Voilà, the Gaussian kernel. Well, it still lacks $\sigma^2$ in the denominator but by now you hopefully see that adding it is equivalent to scaling the inputs by $1/\sigma$

To conclude: the Gaussian kernel is a normalized polynomial kernel of infinite degree (where feature products if $i$ -th degree are scaled down by $\sqrt{i!}$ before normalization). Simple, right?

An Example

The derivations above may look somewhat theoretic if not "magical", so let us work through a couple of numeric examples. Suppose our original vectors are one-dimensional (that is, real numbers), and let $x = 1$ , $y = 2$ . The value of the Gaussian kernel $k(x, y)$ for these inputs is:

$k(x, y) = \exp(-0.5|1-2|^2) \approx 0.6065306597...$

Let us see whether we can obtain the same value as a simple dot product of normalized polynomial feature vectors of a high degree. For that, we first need to compute the corresponding unnormalized feature representation:

$\phi'(x) = \left(1, x, \frac{x^2}{\sqrt{2!}}, \frac{x^3}{\sqrt{3!}}, \frac{x^4}{\sqrt{4!}}, \frac{x^5}{\sqrt{5!}}\dots\right).$

As our inputs are rather small in magnitude, we can hope that the feature sequence quickly approaches zero, so we don't really have to work with infinite vectors. Indeed, here is how the feature sequences look like:

---- $\phi'(1) =$ (1, 1, 0.707, 0.408, 0.204, 0.091, 0.037, 0.014, 0.005, 0.002, 0.001, 0.000, 0.000, ...)

---- $\phi'(2) =$ (1, 2, 2.828, 3.266, 3.266, 2.921, 2.385, 1.803, 1.275, 0.850, 0.538, 0.324, 0.187, 0.104, 0.055, 0.029, 0.014, 0.007, 0.003, 0.002, 0.001, ...)

Let us limit ourselves to just these first 21 features. To obtain the final Gaussian kernel feature representations $\phi$ we just need to normalize:

---- $\phi(1) =\frac{\phi'(1)}{|\phi'(1)|} = \frac{\phi'(1)}{1.649} =$ (0.607, 0.607, 0.429, 0.248, 0.124, 0.055, 0.023, 0.009, 0.003, 0.001, 0.000, ...)

---- $\phi(2) =\frac{\phi'(2)}{|\phi'(2)|} = \frac{\phi'(2)}{7.389} =$ (0.135, 0.271, 0.383, 0.442, 0.442, 0.395, 0.323, 0.244, 0.173, 0.115, 0.073, 0.044, 0.025, 0.014, 0.008, 0.004, 0.002, 0.001, 0.000, ...)

Finally, we compute the simple dot product of these two vectors:

$\scriptstyle\phi(1)^T\phi(2) = 0.607\cdot 0.135 + 0.607\cdot 0.271 + \dots = {\bf 0.6065306}602....$

In boldface are the decimal digits, which match the value of $\exp(-0.5|1-2|^2)$ . The discrepancy is probably more due to lack of floating-point precision rather than to our approximation.

A 2D Example

The one-dimensional example might have seemed somewhat too simplistic, so let us also go through a two-dimensional case. Here our unnormalized feature representation is the following:

$\begin{align*} \scriptscriptstyle\phi'(&\scriptscriptstyle x_1, x_2) = \left(1, x_1, \frac{x_1x_1}{\sqrt{2!}}, \frac{x_1x_2}{\sqrt{2!}}, \frac{x_2x_1}{\sqrt{2!}}, \frac{x_2x_2}{\sqrt{2!}}, \right. \\ &\scriptscriptstyle \left.\frac{x_1x_1x_1}{\sqrt{3!}}, \frac{x_1x_1x_2}{\sqrt{3!}}, \frac{x_1x_2x_1}{\sqrt{3!}}, \frac{x_1x_2x_2}{\sqrt{3!}}, \frac{x_2x_1x_2}{\sqrt{3!}}, \frac{x_2x_2x_1}{\sqrt{3!}}, \dots\right).\\ \end{align*}$

This looks pretty heavy, and we didn't even finish writing out the third degree products. If we wanted to continue all the way up to degree 20, we would end up with a vector with 2097151 elements!

Note that many products are repeated, however (e.g. $x_1x_2 = x_2x_1$ ), hence these are not really all different features. Let us try to pack them more efficiently. As you'll see in a moment, this will open up a much nicer perspective on the feature vector in general.

Basic combinatorics will tell us, that each feature of the form $\frac{x_1^a x_2^b}{\sqrt{n!}}$ must be repeated exactly $\frac{n!}{a!b!}$ times in our current feature vector. Thus, instead of repeating it, we could replace it with a single feature, scaled by $\sqrt{\frac{n!}{a!b!}}$ . "Why the square root?" you might ask here. Because when combining a repeated feature we must preserve the overall vector norm. Consider a vector $(v, v, v)$ , for example. Its norm is $\sqrt{v^2 + v^2 + v^2} = \sqrt{3}|v|$ , exactly the same as the norm of the single-element vector $(\sqrt{3}v)$ .

As we do this scaling, each feature gets converted to a nice symmetric form:

$\sqrt{\frac{n!}{a!b!}}\frac{x_1^a x_2^b}{\sqrt{n!}} = \frac{x_1^a x_2^b}{\sqrt{a!b!}} = \frac{x^a}{\sqrt{a!}}\frac{x^b}{\sqrt{b!}}.$

This means that we can compute the 2-dimensional feature vector by first expanding each parameter into a vector of powers, like we did in the previous example, and then taking all their pairwise products. This way, if we wanted to limit ourselves with maximum degree 20, we would only have to deal with $21^2$ = 231 features instead of 2097151. Nice!

Here is a new view of the unnormalized feature vector up to degree 3:

$\phi'_3(x_1, x_2) = \scriptstyle\left(1, x_1, x_2, \frac{x_1^2}{\sqrt{2!}}, \frac{x_1x_2}{\sqrt{1!1!}}, \frac{x^2}{\sqrt{2!}}, \frac{x_1^3}{\sqrt{3!}}, \frac{x_1^2x_2}{\sqrt{2!1!}}, \frac{x_1x_2^2}{\sqrt{1!2!}}, \frac{x_2^3}{\sqrt{3!}}\right).$

Let us limit ourselves to this degree-3 example and let $x = (0.7, 0.2)$ , $y = (0.1, 0.4)$ (if we picked larger values, we would need to expand our feature vectors to a higher degree to get a reasonable approximation of the Gaussian kernel). Now:

---- $\phi'_3(0.7, 0.2) =$ (1, 0.7, 0.2, 0.346, 0.140, 0.028, 0.140, 0.069, 0.020, 0.003),

---- $\phi'_3(0.1, 0.4) =$ (1, 0.1, 0.4, 0.007, 0.040, 0.113, 0.000, 0.003, 0.011, 0.026).

After normalization:

---- $\phi_3(0.7, 0.2) =$ (0.768, 0.538, 0.154, 0.266, 0.108, 0.022, 0.108, 0.053, 0.015, 0.003),

---- $\phi_3(0.1, 0.4) =$ (0.919, 0.092, 0.367, 0.006, 0.037, 0.104, 0.000, 0.003, 0.010, 0.024).

The dot product of these vectors is $0.8196\dots$ , what about the exact Gaussian kernel value?

$\exp(-0.5|x-y|^2) = 0.{\bf 81}87\dots.$

Close enough. Of course, this could be done for any number of dimensions, so let us conclude the post with the new observation we made:

The features of the unnormalized $d$ -dimensional Gaussian kernel are:

$\phi({\bf x})_{\bf a} = \prod_{i = 1}^d \frac{x_i^{a_i}}{\sqrt{a_i!}},$

where ${\bf a} = (a_1, \dots, a_d) \in \mathbb{N}_0^d$ .

The Gaussian kernel is just the normalized version of that, and we know that the norm to divide by is $\sqrt{\exp({\bf x}^T{\bf x})}$ . Thus we may also state the following:

The features of the $d$ -dimensional Gaussian kernel are:

$\phi({\bf x})_{\bf a} = \exp(-0.5|{\bf x}|^2)\prod_{i = 1}^d \frac{x_i^{a_i}}{\sqrt{a_i!}},$

where ${\bf a} = (a_1, \dots, a_d) \in \mathbb{N}_0^d$ .

That's it, now you have seen the soul of the Gaussian kernel.

Tags: Definition, Explanation, Machine learning, Mathematics, SVM, Teaching, Theory
Bayesian Statistics in Layman's Terms

Posted by Konstantin 09.01.2016 No Comments
This is a (slightly updated) repost of my quora answer to the corresponding question.

There are many ways in which smart people tend to explain Bayesian statistics and contrast it with a "non-Bayesian" one. One usually highlights that the primary concept of a Bayesian approach is the the desire to model everything as a probability distribution. Once this is fact is clear, many smart people would proceed to claim that this is, in fact, what fundamentally sets Bayesian statistics aside from the "classical" one. However, I feel that this kind of explanation is somewhat incomplete. It is not like classical statisticians do not use complete probability distributions. The difference is in general somewhat more subtle and philosophical.

Consider the question "what is your height?". For a classical statistician there exists some abstract "true answer", say "180cm", which is a fixed number - your one and only height. The problem is, of course, you do not know this number because every measurement is slightly different, so the classical statistician will add that "there is a normally-distributed measurement error". In the world of a pure Bayesian there are almost no "fixed numbers" - everything is a probability distribution, and so is your height! That is, a Bayesian should say that "your height is a Normal distribution centered around 180cm".

Note that from the mathematical perspective there is no difference between the two representations - in both cases the number 180cm is mentioned, and the normal distribution. However, from a philosophical, syntactical, methodological and "mental" perspectives this tends to have serious implications, and there has been historically a kind of an ongoing intellectual feud between the statisticians who lend more towards the first or the second approach (it is somewhat resemblant of how there is a divide among the physicists with regard to their support of the Copenhagen interpretation of quantum mechanics).

One of the implications of denying the fact that things in the world are mostly fixed (and are all pure distributions instead) is that you may not use many of the common sense inference methods directly. What is my height if I stand on a chair? "Well, it is your height plus the height of a chair", a classical statistician would say. He can keep in mind the measurement errors, if necessary, but those could be dealt with later. In the Bayesian world heights are not numbers, so the procedure of adding heights implies convoluting two distributions to get the resulting distribution. If both distributions are Gaussian, the result will match that of the "common sense", but note that now the common sense somehow became "just one special case". Moreover, a Bayesian might even keep the possibility that "your height and the height of the chair are dependent" in the back of his mind, just in case. Because when you speak about two numbers in the Bayesian world, you must immediately start thinking about their joint distribution.

On the other hand, modeling everything in probabilities lets you use probability theory inference methods (Bayes rule, convolutions, marginalizations, etc) everywhere, without the need to differentiate between "fixed numbers" and "random measurement errors" and this adds peace of mind as well as tends to make your explanations clearer. A Bayesian confidence interval, for example, is a "fixed interval such that 95% of height measurements fall into it". A classical confidence interval, on the other hand, is "a random interval such that the true height may fall into it with 95% probability". Again, mathematically and numerically those may often be the same, but think how different the two explanations are.

Bayesian "thinking" tends to be more flexible for complex models. Many classical statistics models would stick to fixed parameters, point or "interval" inferences, and try to "hide" the complexity of probability distributions as much as possible. As a result, reasoning about a system with many highly interconnected concepts becomes flawed. Consider a sequence of three questions:
- What the height of this truck?
- Will it fit under this 3m bridge?
- Do we need pick another route?
In the "classical" mindset you would tend to give fixed answers to the questions.
- "Height of the truck is 297".
- "Yes, 297<300, hence it will fit".
- "No, we do not need".
Sometimes you may be more careful and work with confidence intervals, but it still feels unwieldy:
- "The confidence interval on the height of the truck is 290..310"
- ".. aahm, it might not fit..."
- "let's pick another route, just in case"
Note, if a followup question appears that depends on the previous inferences (e.g. "do we need to remodel the truck") answering it becomes even harder because the true uncertainty is "lost" in the intermediate steps. Such problems are never present if you are disciplined as a Bayesian. Note the answers:
- "The height of the truck is a normal distribution N(297, 10)"
- "It will fit under the bridge with probability 60%"
- "We need another route with probability 40%"
At any point is information about the uncertainty is preserved in the distributions and you are free to combine it further, or apply a decision-theoretic utility model. This makes Bayesian networks possible, for example.

It is interesting to see how this largely philosophical preference leads to two completely different (albeit complementary) sets of techniques. Indeed, if you are a true classical statistician, your work revolves around parameterized probability distributions. You write them down like $P_\alpha(x)$ , where $x$ is the "truly random" value from some probability space, and $\alpha$ is the "fixed but unknown" parameter. Your whole "school of thought" is now focused on clever ad-hoc techniques for computing estimates of this fixed parameter from the provided distribution.

For a pure Bayesian, however, there is no "fixed" $\alpha$ that has to be treated somehow separately. Instead, $\alpha$ is also a part of some probability space, and instead of writing $P_\alpha(x)$ he would safely write $P(x| \alpha)$ , $P(\alpha | x)$ , or $P(x, \alpha)$ . As a result, the probability distribution he works with are not parameterized any more, and all of the clever techniques that the classical statisticians have invented over the centuries for estimating parameters become seemingly useless. At this point a classical statistician puts his hands down and goes home, as there is nothing to do for him - there are no "unknowns". The Bayesian is, however, left to struggle with mathematically trivial, yet computationally incredibly heavy methods for extracting essentially the same values that the classical statistician could have obtained using his "parameter estimation" approaches. That's why the Bayesian "school of thought" is mostly focused on computationally-efficient methods for marginalization and sampling.

In reality, of course, a Bayesian would quite often give up and "cheat", at least partially parameterizing his models and making use of the classical estimation methods, while a "classical" statistician might happen to write $P(x|\alpha)$ and apply the Bayes rule here and there, whenever it seems appropriate. A number of computations derived from the two theoretical backgrounds end up exactly the same.

Thus, in practice, labeling things as "Bayesian" or "non-Bayesian" is still largely a philosophical choice. For example, there are methods in machine learning, ensemble learners, that are somewhy never labeled/marketed as being "Bayesian" nor were they probably invented by someone "Bayesian", although at their core those would be among the best examples of where a Bayesian approach is different from a classical one. Those are also among the best performant models quite often, by the way.
Tags: Bayes, Machine learning, Probability theory, Statistics, Theory
Playing Card Cryptography

Posted by Konstantin 09.03.2015 No Comments

Playing cards are a great tool for modeling, popularizing and explaining various mathematical and algorithmic notions. Indeed, a deck of cards is a straightforward example for a finite set, a discrete distribution or a character string. Shuffling and dealing cards represents random sampling operations. A card hand denotes information possessed by a party. Turning card face down or face up looks a lot like bit flipping. Finally, card game rules describe an instance of a particular algorithm or a protocol. I am pretty sure that for most concepts in maths or computer science you can find some card game or a card trick that is directly related to it. Here are some examples of how you can simulate a simple computing machine, illustrate inductive reasoning or explain map-reduce, in particular.

Cryptographers seem to be especially keen towards playing cards. Some cryptographic primitives could have been inspired by them. There are decently secure ciphers built upon a deck of cards. Finally, there are a couple of very enlightening card-based illustrations of such nontrivial cryptographic concepts as zero-knowledge proofs and voting protocols. The recent course on secure two-party computation given by abhi shelat at the last week's EWSCS extended my personal collection of such illustrations with another awesome example — a secure two-party protocol for computing the AND function. As I could not find a description of this protocol anywhere in the internet (and abhi did not know who is the author), I thought it was worth being written up in a blog post here (along with a small modification of my own).

The Tinder Game

Consider the situation, where Alice and Bob want to find out, whether they are both interested in each other (just as if they were both users of the Tinder app). More formally, suppose that Alice has her private opinion about Bob represented as a single bit (where "0" means "not interested" and "1" means "interested"). Equivalently, Bob has his private opinion about Alice represented in the same way. They need to find out whether both bits are "1". However, if it is not the case, they would like to keep their opinions private. For example, if Alice is not interested in Bob, Bob would prefer Alice not to know that he is all over her. Because, you know, opinion asymmetry may lead to awkward social dynamics when disclosed, at least among college students.

The following simple card game happens to solve their problem. Take five cards, so that three of them are red and two are black. The cards of one color must be indistinguishable from each other (e.g. you can't simply take three different diamonds for the reds). Deal one black and one red card to Alice, one black and one red card to Bob. Put the remaining red card on the table face up.

Initial table configuration

Now Alice will put her two cards face down to the left of the open card, and Bob will put his two cards to the right of the open card:

Alice and Bob played

The rules for Alice and Bob are simple: if you like the other person, put your red card next to the central red card. Otherwise, put your black card next to the central one. For example, if both Alice and Bob like each other, they would play their cards as follows (albeit still face down):

What Alice and Bob would play if they both liked each other

Now the middle card is also turned face down, and all five cards are collected, preserving their order:

Five cards collected, preserving order

Next, Alice cuts this five-card deck, moving some number of cards from the top to the bottom (Bob should not know exactly how many). Then Bob cuts the deck (also making sure that Alice does not know how many cards he cuts). The result is equivalent to applying some cyclic rotation to the five card sequence yet neither Bob nor Alice knows how many cards were shifted in total.

The five cards can now be opened by dealing them in order onto the table to find out the result. If there happen to be three red cards one after another in a row or two black cards one after another, Alice and Bob both voted "yes". Here is one example of such a situation. It is easy to see that it is a cyclic shift of the configuration with three red aces in the middle shown above.

Both voted "yes"

Otherwise, if neither three red aces nor two black aces are side by side, we may conclude that one or both of the players voted "no". However, there is absolutely no way to find out anything more specific than that. Here is an example:

No mutual affection (or no affection at all)

Obviously, this result could not be obtained as a cyclic shift of the configuration with three aces clumped together. On the other hand, it could have been obtained as a cyclic shift from any of the three other alternatives ("yes-no", "no-no" and "no-yes"), hence if Alice voted "no" she will have no way of figuring out what was Bob's vote. Thus, playing cards along with a cryptographic mindset helped Alice and Bob discover their mutual affection or the lack of it without the risk of awkward situations. Isn't it great?

Throwing in a Zero-Knowledge Proof

There are various ways Alice or Bob can try to "break" this protocol while still trying to claim honesty. One possible example is shown below:

Alice is trying to cheat

Do you see what happened? Alice has put her ace upside down, which will later allow her to understand what was Bob's move (yet she can easily pretend that turning a card upside down was an honest mistake). Although the problem is easily solved by picking a deck with asymmetric backs, for the sake of example, let us assume that such a solution is somewhy unsuitable. Perhaps there are no requisite decks at Alice and Bob's disposal, or they need to have symmetric backs for some other reasons. This offers a nice possibility for us to practice even more playing card cryptography and try to secure the original algorithm from such "attacks" using a small imitation of a zero-knowledge proof for turn correctness.

Try solving it yourself before proceeding.

Read more...

Tags: Computer science, Cryptography, Fun, Games, Mathematics, Teaching, Theory
Strong Typing

Posted by Konstantin 12.03.2014 No Comments
Whenever you write a program, you want this program to behave correctly and do what you want it to do. Thus, programming always goes together with the mental act of proving to yourself (and sometimes to other people as well), that the code you write is correct. Most often this "proof" is implicit, dissolved in the way you write your code and comment it. In fact, in my personal opinion, "good code" is exactly the one, where a human-reviewer is able to verify its correctness without too much effort.

It is natural to use computers to help us verify correctness. Everyone who has ever programmed in a strictly-typed programming language, such as Java or Haskell, is familiar with the need to specify types of variables and functions and follow strict rules of type-safety. But of course, ensuring type-safety is just the compiler's way to help you ensure some basic claims about the program, such as "this variable will always contain an integer" or "this function will always be invoked with exactly three parameters".

This is very convenient, yet can be somewhat limiting or annoying at times. After all, type-safety requires you to write code that "can be type-checked". Although very often this is expected of "good code" anyway, there are situations where you would like some more flexibility. For this reason some languages impose no type-safety rules at all (e.g. Python or Javascript), and some languages let you disable type-checking for parts of code.

Rather than disabling the type checker, another principled way to allow more flexibility is to make the type-checker smarter. This is the promise of dependent types. In principle, a language, which supports dependent types, would let you make much more detailed statements about your program and have your program automatically checked for correctness with respect to those statements. Rather than being limited to primitive claims like "this variable is an integer", the use of dependent types enables you to assert things like "this is a sorted list", or "this is an odd integer", and so on up to nearly arbitrary level of detail, in the form of a type annotation. At least that much I found out during a course at the recent winter school.

The course was based on the Agda programming language, and the first thing I decided to try implementing in Agda is a well-typed version of the following simple function:
```
f t = if t then true else 0
```
It might look like something trivial, yet most traditional strictly typed languages would not let you write this. After all, a function has to have a return type, and it has to be either a Boolean or an Integer, but not both. In this case, however, we expect our function to have a parameter-dependent type:
```
f : (t : Bool) → if t then Bool else Integer
```
Given that Agda is designed to support dependent types, how complicated could it be to implement such a simple function? It turns out, it takes a beginner more than an hour of thinking and a couple of consultations with the specialists in the field. The resulting code will include at least three different definitions of "if-then-else" statements and, I must admit, some aspects of it are still not completely clear to me.

IF-THEN-ELSE in Agda, including all the boilerplate code

This is the longest code I've ever had to write to specify a simple if-then-else statement. The point of this blog post is to share the amusement and suggest you to check out Agda if you are in the mood for some puzzle-solving.

As for dependent types, I guess those are not becoming mainstream any time soon.
Tags: Computer science, Fun, Procrastination, Programming, Theory, Unclear
On Gödel's Theorems and the Universe

Posted by Konstantin 27.01.2013 7 Comments

Most results that are published in mathematics and natural sciences are highly technical and obscure to anyone outside the particular area of research. Consequently, most scientific publications and discoveries do not get any media attention at all. This is a problem both for the public (which would benefit from knowing a bit more about our world), the science in general (which would benefit from wider dissemination) and the scientists themselves (which would certainly appreciate if their work reached more than a couple of other people). The issue is usually addressed by the researchers trying to popularize their findings — provide simple interpretations, which could be grasped by the layman without the need to understand the foundations. Unfortunately, if the researcher was lucky to grab enough media attention, chances are high the process of "popularization" will get completely out of hands, and the message received by the public will have nothing to do with the original finding.

There are some theorems in pure mathematics, which happened to suffer especially seriously from this phenomenon. My two most favourite examples are the Heizenberg's uncertainty principle and Gödel's incompleteness theorems. Every once in a while I stumble on pieces, written by people who have zero knowledge of basic physics (not even mentioning quantum mechanics or Fourier theory), but claim to have clearly understood the meaning and the deep implications of the uncertainty principle upon our daily lives. Even more popular are Gödel's theorems. "Science proves that some things in this world can never be understood or predicted by us, mere humans!!!" is a typical take on Gödel by the press. Oh, and it is loved by religion enthusiasts!

This post is my desperate attempt to bring some sanity back into this world by providing a (yet another, but maybe slightly different) popular explanation for Gödel's ideas.

Gödel's theorems explained (with enough context)

Kurt Gödel

In our daily lives we primarily use symbols to communicate and describe the world. Those may be formulas in mathematics or simply sentences in the English language. Symbol-based systems can be used in different ways. From the perspective of arts and humanities, one is welcome to construct arbitrary sequences of words and sentences, as long as the result is "beautiful", "inspiring" or "seems smart and convincing". From a more technically-minded point of view, however, one is only welcome to operate with "true and logical" statements. However, it is not at all obvious which statements should be universally considered "true and logical". There are two ways the problem can be resolved.

Firstly, we can determine the truth by consulting "reality". For example, the phrase "All people are mortal, hence Socrates is mortal" is true because, indeed (avoiding a lecture-worth of formalities or so), in all possible universes where there are people and they are mortal, Socrates will always also be mortal.

Secondly, we can simply postulate a set of statements, which we shall universally regard as "the true ones". To distinguish those postulated truths from the "actual truths", mentioned in the paragraph above, let us call them "logically correct statements". That is, we can hypothetically write down an (infinite) list of all the "logically correct" sentences in some hypothetical "Book of All Logically Correct Statements". From that time on, every claim not in the book will be deemed "illogical" and hence "false" (the question of whether it is false in reality would, of course, depend on the goodness and relevance of the book we created).

This is really convenient, because now in order to check the correctness of the claim about Socrates we do not have to actually visit all the possible universes. It suffices to look it up in our book! If the book is any good, it will provide good answers, so the concept of "logical correctness" may be close or even equivalent to "actual truth". And in principle, there must exist "The Actual Book of Truth", which would list precisely the "actually true" statements.

Unfortunately, infinite books are somewhat inconvenient to print and carry around. Luckily, we can try to represent an infinite book in a finite way using algorithms. We have all seen how a short algorithm is capable of producing an infinite amount of nonsense, haven't we? Thus, instead of looking for an (infinitely-sized) "book of truth", we should look for a (finitely-sized) "algorithm of truth", which will be either capable of generating the whole book, or checking for any given statement, whether it belongs to the book.

Frege's propositional calculus

For clarity and historical reasons, the algorithms for describing "books of truths" are usually developed and written down using a particular "programming language", which consists of "axioms" and "inference rules". Here is one of the simplest examples of how a "program" might look like.

This is where a natural question arises: even if we assume that we have access to "the actual book of truth", will it really be possible to compress it down to a finite algorithm? Isn't it too much to ask? It turns out it is. There is really no way to create a compact representation for any but the most trivial "books of truths".

Besides the "asking too much" intuition, there are at least two other simple reasons why this is impossible. Firstly, most of the readers here know that there exist problems, that are in principle uncomputable. Those are usually about statements, the correctness of which cannot be checked in finite time by a finite algorithm, with the most famous example being the halting problem, as well as everything related to uncomputable numbers.

Naturally, if we can't even use a finite algorithm to list "all truly halting programs", our hopes for getting a finite description for any more complicated "book of truths" is gone. But there is another funny reason. Once we have decided to define the book with all true statements and devise an algorithm for generating it, we must not forget to include into the book some statements about the book and the algorithm themselves. In particular, we must consider the following phrase: "This sentence will not be included in the book generated by our algorithm."

If our algorithm does include this sentence in the generated book, the claim turns out to be false. As a result, our precious book would have false statements in it: it would be inconsistent. On the other hand, if our algorithm produces a book which does not include the above sentence, the sentence turns out to be true and the book is incomplete. It is just the liar's paradox.

And so it is — we have to put up with the situation, where it is impossible to produce a finite representation for all true statements without stumbling into logical paradoxes. Or, alternatively, that there are mathematical statements, which cannot be checked or proven within a finite time.

This describes what is known as "Gödel's first incompleteness theorem" (namely, the claim that in most cases our algorithm is only capable of producing either an incomplete or an inconsistent "book"). A curious corollary may be derived from it, which is known as the "Gödel's second incompleteness theorem". The theorem says that if our "book-generating algorithm" happens to include into it exactly the phrase "This book is consistent", it must necessarily be a false claim, and the algorithm must actually be inconsistent. This is often interpreted that no "honest" theory is capable of claiming it's own correctness (there are caveats, though).

Gödel's theorems have several interesting implications for the foundations of mathematics. None the less, those are still theorems about the fundamental impossibility of using a finite algorithm for generating a book of logically correct statements without stumbling into logical paradoxes and infinite loops. Nothing more and nothing less. They have nothing to do with the experimental sciences: note how we had to cut away all connections to reality in the very first paragraphs of the explanation. And thus, Gödel's theorems are not related to the causes of "human inability to grasp the universe", "failure or success of startup business plans", or anything else of that kind.

Tags: Computer science, Definition, Logic, Mathematics, Philosophy, Theory, Weltschmertz