Four Years Remaining

Implication and Provability

Posted by Konstantin 28.03.2017 No Comments
Consider the following question:
Which of the following two statements is logically true?
1. All planets of the Solar System orbit the Sun. The Earth orbits the Sun. Consequently, the Earth is a planet of the Solar System.
2. God is the creator of all things which exist. The Earth exists. Consequently, God created the Earth.
I've seen this question or variations of it pop up as "provocative" posts in social networks several times. At times they might invite lengthy discussions, where the participants would split into camps - some claim that the first statement is true, because Earth is indeed a planet of the Solar System and God did not create the Earth. Others would laugh at the stupidity of their opponents and argue that, obviously, only the second statement is correct, because it makes a valid logical implication, while the first one does not.

Not once, however, have I ever seen a proper formal explanation of what is happening here. And although it is fairly trivial (once you know it), I guess it is worth writing up. The root of the problem here is the difference between implication and provability - something I myself remember struggling a bit to understand when I first had to encounter these notions in a course on mathematical logic years ago.

Indeed, any textbook on propositional logic will tell you in one of the first chapters that you may write

$A \Rightarrow B$

to express the statement " $A$ implies $B$ ". A chapter or so later you will learn that there is also a possibility to write

$A \vdash B$

to express a confusingly similar statement, that " $B$ is provable from $A$ ". To confirm your confusion, another chapter down the road you should discover, that $A \Rightarrow B$ is the same as $\vdash A \Rightarrow B$ , which, in turn, is logically equivalent to $A \vdash B$ . Therefore, indeed, whenever $A \Rightarrow B$ is true, $A \vdash B$ is true, and vice-versa. Is there a difference between $\vdash$ and $\Rightarrow$ then, and why do we need the two different symbols at all? The "provocative" question above provides an opportunity to illustrate this.

The spoken language is rather informal, and there can be several ways of formally interpreting the same statement. Both statements in the puzzle are given in the form " $A$ , $B$ , consequently $C$ ". Here are at least four different ways to put them formally, which make the two statements true or false in different ways.

The Pure Logic Interpretation

Anyone who has enough experience solving logic puzzles would know that both statements should be interpreted as abstract claims about provability (i.e. deducibility):

$A, B \vdash C.$

As mentioned above, this is equivalent to

$(A\,\&\, B) \Rightarrow C.$

or

$\vdash (A\,\&\, B) \Rightarrow C.$

In this interpretation the first statement is wrong and the second is a correct implication.

The Pragmatic Interpretation

People who have less experience with math puzzles would often assume that they should not exclude their common sense knowledge from the task. The corresponding formal statement of the problem then becomes the following:

$[\text{common knowledge}] \vdash (A\,\&\, B) \Rightarrow C.$

In this case both statements become true. The first one is true simply because the consequent $C$ is true on its own, given common knowledge (the Earth is indeed a planet) - the antecedents and provability do not play any role at all. The second is true because it is a valid reasoning, independently of the common knowledge.

This type of interpretation is used in rhetorical phrases like "If this is true, I am a Dutchman".

The Overly Strict Interpretation

Some people may prefer to believe that a logical statement should only be deemed correct if every single part of it is true and logically valid. The two claims must then be interpreted as follows:

$([\text{common}] \vdash A)\,\&\, ([\text{common}] \vdash B)\,\&\, (A, B\vdash C).$

Here the issue of provability is combined with the question about the truthfulness of the facts used. Both statements are false - the first fails on logic, and the second on facts (assuming that God creating the Earth is not part of common knowledge).

The Oversimplified Interpretation

Finally, people very unfamiliar with strict logic would sometimes tend to ignore the words "consequently", "therefore" or "then", interpreting them as a kind of an extended synonym for "and". In their minds the two statements could be regarded as follows:

$[\text{common}] \vdash A\,\&\, B\,\&\, C.$

From this perspective, the first statement becomes true and the second (again, assuming the aspects of creation are not commonly known) is false.

Although the author of the original question most probably did really assume the "pure logic" interpretation, as is customary for such puzzles, note how much leeway there can be when converting a seemingly simple phrase in English to a formal statement. In particular, observe that questions about provability, where you deliberately have to abstain from relying on common knowledge, may be different from questions about facts and implications, where common sense may (or must) be assumed and you can sometimes skip the whole "reasoning" part if you know the consequent is true anyway.

Here is an quiz question to check whether you understood what I meant to explain.

"The sky is blue, and therefore the Earth is round." True or false?
Tags: Explanation, Logic, Mathematics, Puzzle, Theory
The Schrödinger's Cat Uncertainty

Posted by Konstantin 21.03.2017 No Comments

Ever since Erwin Schrödinger described a thought experiment, in which a cat in a sealed box happened to be "both dead and alive at the same time", popular science writers have been relying on it heavily to convey the mysteries of quantum physics to the layman. Unfortunately, instead of providing any useful intuition, this example has instead laid solid base to a whole bunch of misconceptions. Having read or heard something about the strange cat, people would tend to jump to profound conclusions, such as "according to quantum physics, cats can be both dead and alive at the same time" or "the notion of a conscious observer is important in quantum physics". All of these are wrong, as is the image of a cat, who is "both dead and alive at the same time". The corresponding Wikipedia page does not stress this fact well enough, hence I thought the Internet might benefit from a yet another explanatory post.

The Story of the Cat

The basic notion in quantum mechanics is a quantum system. Pretty much anything could be modeled as a quantum system, but the most common examples are elementary particles, such as electrons or photons. A quantum system is described by its state. For example, a photon has polarization, which could be vertical or horizontal. Another prominent example of a particle's state is its wave function, which represents its position in space.

There is nothing special about saying that things have state. For example, we may say that any cat has a "liveness state", because it can be either "dead" or "alive". In quantum mechanics we would denote these basic states using the bra-ket notation as $|\mathrm{dead}\rangle$ and $|\mathrm{alive}\rangle$ . The strange thing about quantum mechanical systems, though, is the fact that quantum states can be combined together to form superpositions. Not only could a photon have a purely vertical polarization $\left|\updownarrow\right\rangle$ or a purely horizontal polarization $\left|\leftrightarrow\right\rangle$ , but it could also be in a superposition of both vertical and horizontal states:

$\left|\updownarrow\right\rangle + \left|\leftrightarrow\right\rangle.$

This means that if you asked the question "is this photon polarized vertically?", you would get a positive answer with 50% probability - in another 50% of cases the measurement would report the photon as horizontally-polarized. This is not, however, the same kind of uncertainty that you get from flipping a coin. The photon is not either horizontally or vertically polarized. It is both at the same time.

Amazed by this property of quantum systems, Schrödinger attempted to construct an example, where a domestic cat could be considered to be in the state

$|\mathrm{dead}\rangle + |\mathrm{alive}\rangle,$

which means being both dead and alive at the same time. The example he came up with, in his own words (citing from Wikipedia), is the following:

A cat is penned up in a steel chamber, along with the following device (which must be secured against direct interference by the cat): in a Geiger counter, there is a tiny bit of radioactive substance, so small, that perhaps in the course of the hour one of the atoms decays, but also, with equal probability, perhaps none; if it happens, the counter tube discharges and through a relay releases a hammer that shatters a small flask of hydrocyanic acid. If one has left this entire system to itself for an hour, one would say that the cat still lives if meanwhile no atom has decayed. The first atomic decay would have poisoned it.

The idea is that after an hour of waiting, the radiactive substance must be in the state

$|\mathrm{decayed}\rangle + |\text{not decayed}\rangle,$

the poison flask should thus be in the state

$|\mathrm{broken}\rangle + |\text{not broken}\rangle,$

and the cat, consequently, should be

$|\mathrm{dead}\rangle + |\mathrm{alive}\rangle.$

Correct, right? No.

The Cat Ensemble

Superposition, which is being "in both states at once" is not the only type of uncertainty possible in quantum mechanics. There is also the "usual" kind of uncertainty, where a particle is in either of two states, we just do not exactly know which one. For example, if we measure the polarization of a photon, which was originally in the superposition $\left|\updownarrow\right\rangle + \left|\leftrightarrow\right\rangle$ , there is a 50% chance the photon will end up in the state $\left|\updownarrow\right\rangle$ after the measurement, and a 50% chance the resulting state will be $\left|\leftrightarrow\right\rangle$ . If we do the measurement, but do not look at the outcome, we know that the resulting state of the photon must be either of the two options. It is not a superposition anymore. Instead, the corresponding situation is described by a statistical ensemble:

$\{\left|\updownarrow\right\rangle: 50\%, \quad\left|\leftrightarrow\right\rangle: 50\%\}.$

Although it may seem that the difference between a superposition and a statistical ensemble is a matter of terminology, it is not. The two situations are truly different and can be distinguished experimentally. Essentially, every time a quantum system is measured (which happens, among other things, every time it interacts with a non-quantum system) all the quantum superpositions are "converted" to ensembles - concepts native to the non-quantum world. This process is sometimes referred to as decoherence.

Now recall the Schrödinger's cat. For the cat to die, a Geiger counter must register a decay event, triggering a killing procedure. The registration within the Geiger counter is effectively an act of measurement, which will, of course, "convert" the superposition state into a statistical ensemble, just like in the case of a photon which we just measured without looking at the outcome. Consequently, the poison flask will never be in a superposition of being "both broken and not". It will be either, just like any non-quantum object should. Similarly, the cat will also end up being either dead or alive - you just cannot know exactly which option it is before you peek into the box. Nothing special or quantum'y about this.

The Quantum Cat

"But what gives us the right to claim that the Geiger counter, the flask and the cat in the box are "non-quantum" objects?", an attentive reader might ask here. Could we imagine that everything, including the cat, is a quantum system, so that no actual measurement or decoherence would happen inside the box? Could the cat be "both dead and alive" then?

Indeed, we could try to model the cat as a quantum system with $|\mathrm{dead}\rangle$ and $|\mathrm{alive}\rangle$ being its basis states. In this case the cat indeed could end up in the state of being both dead and alive. However, this would not be its most exciting capability. Way more suprisingly, we could then kill and revive our cat at will, back and forth, by simply measuring its liveness state appropriately. It is easy to see how this model is unrepresentative of real cats in general, and the worry about them being able to be in superposition is just one of the many inconsistencies. The same goes for the flask and the Geiger counter, which, if considered to be quantum systems, get the magical abilities to "break" and "un-break", "measure" and "un-measure" particles at will. Those would certainly not be a real world flask nor a counter anymore.

The Cat Multiverse

There is one way to bring quantum superposition back into the picture, although it requires some rather abstract thinking. There is a theorem in quantum mechanics, which states that any statistical ensemble can be regarded as a partial view of a higher-dimensional superposition. Let us see what this means. Consider a (non-quantum) Schrödinger's cat. As it might be hopefully clear from the explanations above, the cat must be either dead or alive (not both), and we may formally represent this as a statistical ensemble:

$\{\left|\text{dead}\right\rangle: 50\%, \quad\left|\text{alive}\right\rangle: 50\%\}.$

It turns out that this ensemble is mathematically equivalent in all respects to a superposition state of a higher order:

$\left|\text{Universe A}, \text{dead}\right\rangle + \left|\text{Universe B}, \text{alive}\right\rangle,$

where "Universe A" and "Universe B" are some abstract, unobservable "states of the world". The situation can be interpreted by imagining two parallel universes: one where the cat is dead and one where it is alive. These universes exist simultaneously in a superposition, and we are present in both of them at the same time, until we open the box. When we do, the universe superposition collapses to a single choice of the two options and we are presented with either a dead, or a live cat.

Yet, although the universes happen to be in a superposition here, existing both at the same time, the cat itself remains completely ordinary, being either totally dead or fully alive, depending on the chosen universe. The Schrödinger's cat is just a cat, after all.

Tags: Explanation, Mathematics, Paradox, Philosophy, Physics, Quantum, Science
What is the Covariance Matrix?

Posted by Konstantin 23.11.2016 16 Comments
Basic linear algebra, introductory statistics and some familiarity with core machine learning concepts (such as PCA and linear models) are the prerequisites of this post. Otherwise it will probably make no sense. An abridged version of this text is also posted on Quora.

Most textbooks on statistics cover covariance right in their first chapters. It is defined as a useful "measure of dependency" between two random variables:

$\mathrm{cov}(X,Y) = E[(X - E[X])(Y - E[Y])].$

The textbook would usually provide some intuition on why it is defined as it is, prove a couple of properties, such as bilinearity, define the covariance matrix for multiple variables as ${\bf\Sigma}_{i,j} = \mathrm{cov}(X_i, X_j)$ , and stop there. Later on the covariance matrix would pop up here and there in seeminly random ways. In one place you would have to take its inverse, in another - compute the eigenvectors, or multiply a vector by it, or do something else for no apparent reason apart from "that's the solution we came up with by solving an optimization task".

In reality, though, there are some very good and quite intuitive reasons for why the covariance matrix appears in various techniques in one or another way. This post aims to show that, illustrating some curious corners of linear algebra in the process.

Meet the Normal Distribution

The best way to truly understand the covariance matrix is to forget the textbook definitions completely and depart from a different point instead. Namely, from the the definition of the multivariate Gaussian distribution:

We say that the vector $\bf x$ has a normal (or Gaussian) distribution with mean $\bf \mu$ and covariance $\bf \Sigma$ if:

$\Pr({\bf x}) =|2\pi{\bf\Sigma}|^{-1/2} \exp\left(-\frac{1}{2}({\bf x} - {\bf\mu})^T{\bf\Sigma}^{-1}({\bf x} - {\bf \mu})\right).$

To simplify the math a bit, we will limit ourselves to the centered distribution (i.e. ${\bf\mu} = {\bf 0}$ ) and refrain from writing out the normalizing constant $|2\pi{\bf\Sigma}|^{-1/2}$ . Now, the definition of the (centered) multivariate Gaussian looks as follows:

$\Pr({\bf x}) \propto \exp\left(-0.5{\bf x}^T{\bf\Sigma}^{-1}{\bf x}\right).$

Much simpler, isn't it? Finally, let us define the covariance matrix as nothing else but the parameter of the Gaussian distribution. That's it. You will see where it will lead us in a moment.

Transforming the Symmetric Gaussian

Consider a symmetric Gaussian distribution, i.e. the one with ${\bf \Sigma = \bf I}$ (the identity matrix). Let us take a sample from it, which will of course be a symmetric, round cloud of points:

We know from above that the likelihood of each point in this sample is

(1) $P({\bf x}) \propto \exp(-0.5 {\bf x}^T {\bf x}).$

Now let us apply a linear transformation ${\bf A}$ to the points, i.e. let ${\bf y} ={\bf Ax}$ . Suppose that, for the sake of this example, ${\bf A}$ scales the vertical axis by 0.5 and then rotates everything by 30 degrees. We will get the following new cloud of points ${\bf y}$ :

What is the distribution of ${\bf y}$ ? Just substitute ${\bf x}={\bf A}^{-1}{\bf y}$ into (1), to get:

(2) $\begin{align*} P({\bf y}) &\propto \exp(-0.5 ({\bf A}^{-1}{\bf y})^T({\bf A}^{-1}{\bf y}))\\ &=\exp(-0.5{\bf y}^T({\bf AA}^T)^{-1}{\bf y}) \end{align*}$

This is exactly the Gaussian distribution with covariance ${\bf \Sigma} = {\bf AA}^T$ . The logic works both ways: if we have a Gaussian distribution with covariance $\bf \Sigma$ , we can regard it as a distribution which was obtained by transforming the symmetric Gaussian by some ${\bf A}$ , and we are given ${\bf AA}^T$ .

More generally, if we have any data, then, when we compute its covariance to be $\bf\Sigma$ , we can say that if our data were Gaussian, then it could have been obtained from a symmetric cloud using some transformation $\bf A$ , and we just estimated the matrix ${\bf AA}^T$ , corresponding to this transformation.

Note that we do not know the actual $\bf A$ , and it is mathematically totally fair. There can be many different transformations of the symmetric Gaussian which result in the same distribution shape. For example, if $\bf A$ is just a rotation by some angle, the transformation does not affect the shape of the distribution at all. Correspondingly, ${\bf AA}^T = {\bf I}$ for all rotation matrices. When we see a unit covariance matrix we really do not know, whether it is the “originally symmetric” distribution, or a “rotated symmetric distribution”. And we should not really care - those two are identical.

There is a theorem in linear algebra, which says that any symmetric matrix $\bf \Sigma$ can be represented as:

(3) ${\bf \Sigma} = {\bf VDV}^T,$

where ${\bf V}$ is orthogonal (i.e. a rotation) and ${\bf D}$ is diagonal (i.e. a coordinate-wise scaling). If we rewrite it slightly, we will get:

(4) ${\bf \Sigma} = ({\bf VD}^{1/2})({\bf VD}^{1/2})^T = {\bf AA}^T,$

where ${\bf A} = {\bf VD}^{1/2}$ . This, in simple words, means that any covariance matrix $\bf \Sigma$ could have been the result of transforming the data using a coordinate-wise scaling ${\bf D}^{1/2}$ followed by a rotation $\bf V$ . Just like in our example with $\bf x$ and $\bf y$ above.

Principal Component Analysis

Given the above intuition, PCA already becomes a very obvious technique. Suppose we are given some data. Let us assume (or “pretend”) it came from a normal distribution, and let us ask the following questions:
1. What could have been the rotation $\bf V$ and scaling ${\bf D}^{1/2}$ , which produced our data from a symmetric cloud?
2. What were the original, “symmetric-cloud” coordinates $\bf x$ before this transformation was applied.
3. Which original coordinates were scaled the most by $\bf D$ and thus contribute most to the spread of the data now. Can we only leave those and throw the rest out?
All of those questions can be answered in a straightforward manner if we just decompose $\bf \Sigma$ into $\bf V$ and $\bf D$ according to (3). But (3) is exactly the eigenvalue decomposition of $\bf\Sigma$ . I’ll leave you to think for just a bit and you’ll see how this observation lets you derive everything there is about PCA and more.

The Metric Tensor

Bear me for just a bit more. One way to summarize the observations above is to say that we can (and should) regard ${\bf\Sigma}^{-1}$ as a metric tensor. A metric tensor is just a fancy formal name for a matrix, which summarizes the deformation of space. However, rather than claiming that it in some sense determines a particular transformation $\bf A$ (which it does not, as we saw), we shall say that it affects the way we compute angles and distances in our transformed space.

Namely, let us redefine, for any two vectors $\bf v$ and $\bf w$ , their inner product as:

(5) $\langle {\bf v}, {\bf w}\rangle_{\Sigma^{-1}} = {\bf v}^T{\bf \Sigma}^{-1}{\bf w}.$

To stay consistent we will also need to redefine the norm of any vector as

(6) $|{\bf v}|_{\Sigma^{-1}} = \sqrt{{\bf v}^T{\bf \Sigma}^{-1}{\bf v}},$

and the distance between any two vectors as

(7) $|{\bf v}-{\bf w}|_{\Sigma^{-1}} = \sqrt{({\bf v}-{\bf w})^T{\bf \Sigma}^{-1}({\bf v}-{\bf w})}.$

Those definitions now describe a kind of a “skewed world” of points. For example, a unit circle (a set of points with “skewed distance” 1 to the center) in this world might look as follows:

And here is an example of two vectors, which are considered “orthogonal”, a.k.a. “perpendicular” in this strange world:

Although it may look weird at first, note that the new inner product we defined is actually just the dot product of the “untransformed” originals of the vectors:

(8) ${\bf v}^T{\bf \Sigma}^{-1}{\bf w} = {\bf v}^T({\bf AA}^T)^{-1}{\bf w}=({\bf A}^{-1}{\bf v})^T({\bf A}^{-1}{\bf w}),$

The following illustration might shed light on what is actually happening in this $\Sigma$ -“skewed” world. Somehow “deep down inside”, the ellipse thinks of itself as a circle and the two vectors behave as if they were (2,2) and (-2,2).

Getting back to our example with the transformed points, we could now say that the point-cloud $\bf y$ is actually a perfectly round and symmetric cloud “deep down inside”, it just happens to live in a skewed space. The deformation of this space is described by the tensor ${\bf\Sigma}^{-1}$ (which is, as we know, equal to $({\bf AA}^T)^{-1}$ . The PCA now becomes a method for analyzing the deformation of space, how cool is that.

The Dual Space

We are not done yet. There’s one interesting property of “skewed” spaces worth knowing about. Namely, the elements of their dual space have a particular form. No worries, I’ll explain in a second.

Let us forget the whole skewed space story for a moment, and get back to the usual inner product ${\bf w}^T{\bf v}$ . Think of this inner product as a function $f_{\bf w}({\bf v})$ , which takes a vector $\bf v$ and maps it to a real number, the dot product of $\bf v$ and $\bf w$ . Regard the $\bf w$ here as the parameter (“weight vector”) of the function. If you have done any machine learning at all, you have certainly come across such linear functionals over and over, sometimes in disguise. Now, the set of all possible linear functionals $f_{\bf w}$ is known as the dual space to your “data space”.

Note that each linear functional is determined uniquely by the parameter vector $\bf w$ , which has the same dimensionality as $\bf v$ , so apparently the dual space is in some sense equivalent to your data space - just the interpretation is different. An element $\bf v$ of your “data space” denotes, well, a data point. An element $\bf w$ of the dual space denotes a function $f_{\bf w}$ , which projects your data points on the direction $\bf w$ (recall that if $\bf w$ is unit-length, ${\bf w}^T{\bf v}$ is exactly the length of the perpendicular projection of $\bf v$ upon the direction $\bf w$ ). So, in some sense, if $\bf v$ -s are “vectors”, $\bf w$ -s are “directions, perpendicular to these vectors”. Another way to understand the difference is to note that if, say, the elements of your data points numerically correspond to amounts in kilograms, the elements of $\bf w$ would have to correspond to “units per kilogram”. Still with me?

Let us now get back to the skewed space. If $\bf v$ are elements of a skewed Euclidean space with the metric tensor ${\bf\Sigma}^{-1}$ , is a function $f_{\bf w}({\bf v}) = {\bf w}^T{\bf v}$ an element of a dual space? Yes, it is, because, after all, it is a linear functional. However, the parameterization of this function is inconvenient, because, due to the skewed tensor, we cannot interpret it as projecting vectors upon $\bf w$ nor can we say that $\bf w$ is an “orthogonal direction” (to a separating hyperplane of a classifier, for example). Because, remember, in the skewed space it is not true that orthogonal vectors satisfy ${\bf w}^T{\bf v}=0$ . Instead, they satisfy ${\bf w}^T{\bf \Sigma}^{-1}{\bf v} = 0$ . Things would therefore look much better if we parameterized our dual space differently. Namely, by considering linear functionals of the form $f^{\Sigma^{-1}}_{\bf z}({\bf v}) = {\bf z}^T{\bf \Sigma}^{-1}{\bf v}$ . The new parameters $\bf z$ could now indeed be interpreted as an “orthogonal direction” and things overall would make more sense.

However when we work with actual machine learning models, we still prefer to have our functions in the simple form of a dot product, i.e. $f_{\bf w}$ , without any ugly $\bf\Sigma$ -s inside. What happens if we turn a “skewed space” linear functional from its natural representation into a simple inner product?

(9) $f^{\Sigma^{-1}}_{\bf z}({\bf v}) = {\bf z}^T{\bf\Sigma}^{-1}{\bf v} = ({\bf \Sigma}^{-1}{\bf z})^T{\bf v} = f_{\bf w}({\bf v}),$

where ${\bf w} = {\bf \Sigma}^{-1}{\bf z}$ . (Note that we can lose the transpose because $\bf \Sigma$ is symmetric).

What it means, in simple terms, is that when you fit linear models in a skewed space, your resulting weight vectors will always be of the form ${\bf \Sigma}^{-1}{\bf z}$ . Or, in other words, ${\bf\Sigma}^{-1}$ is a transformation, which maps from “skewed perpendiculars” to “true perpendiculars”. Let me show you what this means visually.

Consider again the two “orthogonal” vectors from the skewed world example above:

Let us interpret the blue vector as an element of the dual space. That is, it is the $\bf z$ vector in a linear functional ${\bf z}^T{\bf\Sigma}^{-1}{\bf v}$ . The red vector is an element of the “data space”, which would be mapped to 0 by this functional (because the two vectors are “orthogonal”, remember).

For example, if the blue vector was meant to be a linear classifier, it would have its separating line along the red vector, just like that:

If we now wanted to use this classifier, we could, of course, work in the “skewed space” and use the expression ${\bf z}^T{\bf\Sigma}^{-1}{\bf v}$ to evaluate the functional. However, why don’t we find the actual normal $\bf w$ to that red separating line so that we wouldn’t need to do an extra matrix multiplication every time we use the function?

It is not too hard to see that ${\bf w}={\bf\Sigma}^{-1}{\bf z}$ will give us that normal. Here it is, the black arrow:

Therefore, next time, whenever you see expressions like ${\bf w}^T{\bf\Sigma}^{-1}{\bf v}$ or $({\bf v}-{\bf w})^T{\bf\Sigma}^{-1}({\bf v}-{\bf w})$ , remember that those are simply inner products and (squared) distances in a skewed space, while ${\bf \Sigma}^{-1}{\bf z}$ is a conversion from a skewed normal to a true normal. Also remember that the “skew” was estimated by pretending that the data were normally-distributed.

Once you see it, the role of the covariance matrix in some methods like the Fisher’s discriminant or Canonical correlation analysis might become much more obvious.

The Dual Space Metric Tensor

“But wait”, you should say here. “You have been talking about expressions like ${\bf w}^T{\bf\Sigma}^{-1}{\bf v}$ all the time, while things like ${\bf w}^T{\bf\Sigma}{\bf v}$ are also quite common in practice. What about those?”

Hopefully you know enough now to suspect that ${\bf w}^T{\bf\Sigma}{\bf v}$ is again an inner product or a squared norm in some deformed space, just not the “internal data metric space”, that we considered so far. Which space is it? It turns out it is the “internal dual metric space”. That is, whilst the expression ${\bf w}^T{\bf\Sigma}^{-1}{\bf v}$ denoted the “new inner product” between the points, the expression ${\bf w}^T{\bf\Sigma}{\bf v}$ denotes the “new inner product” between the parameter vectors. Let us see why it is so.

Consider an example again. Suppose that our space transformation $\bf A$ scaled all points by 2 along the $x$ axis. The point (1,0) became (2,0), the point (3, 1) became (6, 1), etc. Think of it as changing the units of measurement - before we measured the $x$ axis in kilograms, and now we measure it in pounds. Consequently, the norm of the point (2,0) according to the new metric, $|(2,0)|_{\Sigma^{-1}}$ will be 1, because 2 pounds is still just 1 kilogram “deep down inside”.

What should happen to the parameter ("direction") vectors due to this transformation? Can we say that the parameter vector (1,0) also got scaled to (2,0) and that the norm of the parameter vector (2,0) is now therefore also 1? No! Recall that if our initial data denoted kilograms, our dual vectors must have denoted “units per kilogram”. After the transformation they will be denoting “units per pound”, correspondingly. To stay consistent we must therefore convert the parameter vector (”1 unit per kilogram”, 0) to its equivalent (“0.5 units per pound”,0). Consequently, the norm of the parameter vector (0.5,0) in the new metric will be 1 and, by the same logic, the norm of the dual vector (2,0) in the new metric must be 4. You see, the “importance of a parameter/direction” gets scaled inversely to the “importance of data” along that parameter or direction.

More formally, if ${\bf x}'={\bf Ax}$ , then

(10) $\begin{align*} f_{\bf w}({\bf x}) &= {\bf w}^T{\bf x} = {\bf w}^T{\bf A}^{-1}{\bf x}'\\ & =(({\bf A}^{-1})^T{\bf w})^T{\bf x}'=f_{({\bf A}^{-1})^T{\bf w}}({\bf x}'). \end{align*}$

This means, that the transformation $\bf A$ of the data points implies the transformation ${\bf B}:=({\bf A}^{-1})^T$ of the dual vectors. The metric tensor for the dual space must thus be:

(11) $({\bf BB}^T)^{-1}=(({\bf A}^{-1})^T{\bf A}^{-1})^{-1}={\bf AA}^T={\bf \Sigma}.$

Remember the illustration of the “unit circle” in the ${\bf \Sigma}^{-1}$ metric? This is how the unit circle looks in the corresponding $\bf\Sigma$ metric. It is rotated by the same angle, but it is stretched in the direction where it was squished before.

Intuitively, the norm (“importance”) of the dual vectors along the directions in which the data was stretched by $\bf A$ becomes proportionally larger (note that the “unit circle” is, on the contrary, “squished” along those directions).

But the “stretch” of the space deformation in any direction can be measured by the variance of the data. It is therefore not a coincidence that ${\bf w}^T{\bf \Sigma}{\bf w}$ is exactly the variance of the data along the direction $\bf w$ (assuming $|{\bf w}|=1$ ).

The Covariance Estimate

Once we start viewing the covariance matrix as a transformation-driven metric tensor, many things become clearer, but one thing becomes extremely puzzling: why is the inverse covariance of the data a good estimate for that metric tensor? After all, it is not obvious that ${\bf X}^T{\bf X}/n$ (where $\bf X$ is the data matrix) must be related to the $\bf\Sigma$ in the distribution equation $\exp(-0.5{\bf x}^T{\bf\Sigma}^{-1}{\bf x})$ .

Here is one possible way to see the connection. Firstly, let us take it for granted that if $\bf X$ is sampled from a symmetric Gaussian, then ${\bf X}^T{\bf X}/n$ is, on average, a unit matrix. This has nothing to do with transformations, but just a consequence of pairwise independence of variables in the symmetric Gaussian.

Now, consider the transformed data, ${\bf Y}={\bf XA}^T$ (vectors in the data matrix are row-wise, hence the multiplication on the right with a transpose). What is the covariance estimate of $\bf Y$ ?

(12) ${\bf Y}^T{\bf Y}/n=({\bf XA}^T)^T{\bf XA}^T/n={\bf A}({\bf X}^T{\bf X}){\bf A}^T/n\approx {\bf AA}^T,$

the familiar tensor.

This is a place where one could see that a covariance matrix may make sense outside the context of a Gaussian distribution, after all. Indeed, if you assume that your data was generated from any distribution $P$ with uncorrelated variables of unit variance and then transformed using some matrix $\bf A$ , the expression ${\bf X}^T{\bf X}/n$ will still be an estimate of ${\bf AA}^T$ , the metric tensor for the corresponding (dual) space deformation.

However, note that out of all possible initial distributions $P$ , the normal distribution is exactly the one with the maximum entropy, i.e. the “most generic”. Thus, if you base your analysis on the mean and the covariance matrix (which is what you do with PCA, for example), you could just as well assume your data to be normally distributed. In fact, a good rule of thumb is to remember, that whenever you even mention the word "covariance matrix", you are implicitly fitting a Gaussian distribution to your data.
Tags: Algebra, Explanation, Machine learning, Mathematics, Probability theory, Statistics, Visualization
The Secrets of Spring Motion

Posted by Konstantin 17.11.2016 3 Comments

Imagine a weight hanging on a spring. Let us pull the weight a bit and release it into motion. What will its motion look like? If you remember some of your high-school physics, you should probably answer that the resulting motion is a simple harmonic oscillation, best described by a sinewave. Although this is a fair answer, it actually misses an interesting property of real-life springs. A property most people don't think much about, because it goes a bit beyond the high school curriculum. This property is best illustrated by

The Slinky Drop

The "slinky drop" is a fun little experiment which has got its share of internet fame.

The Slinky Drop

When the top end of a suspended slinky is released, the bottom seems to patiently wait for the top to arrive before starting to fall as well. This looks rather unexpected. After all, we know that things fall down according to a parabola, and we know that springs collapse according to a sinewave, however neither of the two rules seem to apply here. If you browse around, you will see lots of awesome videos demonstrating or explaining this effect. There are news articles, forum discussions, blog posts and even research papers dedicated to the magical slinky. However, most of them are either too sketchy or too complex, and none seem to mention the important general implications, so let me give a shot at another explanation here.

The Slinky Drop Explained Once More

Let us start with the classical, "high school" model of a spring. The spring has some length $L$ in the relaxed state, and if we stretch it, making it longer by $\Delta L$ , the two ends of the spring exert a contracting force of $k\Delta L$ . Assume we hold the top of the spring at the vertical coordinate $y_{\mathrm{top}}=0$ and have it balance out. The lower end will then position at the coordinate $y_{\mathrm{bot}} = -(L+mg/k)$ , where the gravity force $mg$ is balanced out exactly by the spring force.

How would the two ends of the spring behave if we let go off the top now? Here's how:

The horozontal axis here denotes the time, the vertical axis - is the vertical position. The blue curve is the trajectory of the top end of the spring, the green curve - trajectory of the bottom end. The dotted blue line is offset from the blue line by exactly $L$ - the length of the spring in relaxed state.

Observe that the lower end (the green curve), similarly to the slinky, "waits" for quite a long time for the top to approach before starting to move with discernible velocity. Why is it the case? The trajectory of the lower point can be decomposed in two separate movements. Firstly, the point is trying to fall down due to gravity, following a parabola. Secondly, the point is being affected by string tension and thus follows a cosine trajectory. Here's how the two trajectories look like separately:

They are surprisingly similar at the start, aren't they? And indeed, the cosine function does resemble a parabola up to $o(x^3)$ . Recall the corresponding Taylor expansion:

$\cos(x) = 1 - \frac{x^2}{2} + \frac{x^4}{24} + \dots \approx 1 - \frac{x^2}{2}.$

If we align the two curves above, we can see how well they match up at the beginning:

Consequently, the two forces happen to "cancel" each other long enough to leave an impression that the lower end "waits" for the upper for some time. This effect is, however, much more pronounced in the slinky. Why so?

Because, of course, a single spring is not a good model for the slinky. It is more correct to regard a slinky as a chain of strings. Observe what happens if we model the slinky as a chain of just three simple springs:

Each curve here is the trajectory of one of the nodes inbetween the three individual springs. We can see that the top two curves behave just like a single spring did - the green node waits a bit for the blue and then starts moving. The red one, however, has to wait longer, until the green node moves sufficiently far away. The bottom, in turn, will only start moving observably when the red node approaches it close enough, which means it has to wait even longer yet - by that time the top has already arrived. If we consider a more detailed model, the movement of a slinky composed of, say, 9 basic springs, the effect of intermediate nodes "waiting" becomes even more pronounced:

To make a "mathematically perfect" model of a slinky we have to go to the limit of having infinitely many infinitely small springs. Let's briefly take a look at how that solution looks like.

The Continuous Slinky

Let $x$ denote the coordinate of a point on a "relaxed" slinky. For example, in the two discrete models above the slinky had 4 and 10 points, numbered $1,\dots, 4$ and $1,\dots, 10$ respectively. The continuous slinky will have infinitely many points numbered $[0,1]$ .

Let $h(x,t)$ denote the vertical coordinate of a point $x$ at time $t$ . The acceleration of point $x$ at time $t$ is then, by definition $\frac{\partial^2 h(x,t)}{\partial^2 t}$ , and there are two components affecting it: the gravitational pull $-g$ and the force of the spring.

The spring force acting on a point $x$ is proportional to the stretch of the spring at that point $\frac{\partial h(x,t)}{\partial x}$ . As each point is affected by the stretch from above and below, we have to consider a difference of the "top" and "bottom" stretches, which is thus the derivative of the stretch, i.e. $\frac{\partial^2 h(x,t)}{\partial^2 x}$ . Consequently, the dynamics of the slinky can be described by the equation:

$\frac{\partial^2 h(x,t)}{\partial^2 t} = a\frac{\partial^2 h(x,t)}{\partial^2 x} - g.$

where $a$ is some positive constant. Let us denote the second derivatives by $h_{tt}$ and $h_{xx}$ , replace $a$ with $v^2$ and rearrange to get:

(1) $h_{tt} - v^2 h_{xx} = -g,$

which is known as the wave equation. The name stems from the fact that solutions to this equation always resemble "waves" propagating at a constant speed $v$ through some medium. In our case the medium will be the slinky itself. Now it becomes apparent that, indeed, the lower end of the slinky should not move before the wave of disturbance, unleashed by releasing the top end, reaches it. Most of the explanations of the slinky drop seem to refer to that fact. However when it is stated alone, without the wave-equation-model context, it is at best a rather incomplete explanation.

Given how famous the equation is, it is not too hard to solve it. We'll need to do it twice - first to find the initial configuration of a suspended slinky, then to compute its dynamics when the top is released.

In the beginning the slinky must satisfy $h_t(x, t) = 0$ (because it is not moving at all), $h(0, t) = 0$ (because the top end is located at coordinate 0), and $h_x(1, t) = 0$ (because there is no stretch at the bottom). Combining this with (1) and searching for a polynomial solution, we get:

$h(x, t) = h_0(x) = \frac{g}{2v^2}x(x-2).$

Next, we release the slinky, hence the conditions $h_t(x,t)=0$ and $h(0,t)=0$ disappear and we may use the d'Alembert's formula with reflected boundaries to get the solution:

$h(x,t) = \frac{1}{2}(\phi(x-vt) + \phi(x+vt)) - \frac{gt^2}{2},$

$\text{ where }\phi(x) = h_0(\mathrm{mod}(x, 2)).$

Here's how the solution looks like visually:

Notice how the part of the slinky to which the wave has not arrived yet, stays completely fixed in place. Here are the trajectories of 4 equally-spaced points on the slinky:

Note how, quite surprisingly, all points of the slinky are actually moving with a constant speed, changing it abruptly at certain moments. Somewhat magically, the mean of all these piecewise-linear trajectories (i.e. the trajectory of the center of mass of the slinky) is still a smooth parabola, just as it should be:

The Secret of Spring Motion

Now let us come back to where we started. Imagine a weight on a spring. What will its motion be like? Obviously, any real-life spring is, just like the slinky, best modeled not as a Hooke's simple spring, but rather via the wave equation. Which means that when you let go off the weight, the weight will send a deformation wave, which will move along the spring back and forth, affecting the pure sinewave movement you might be expecting from the simple Hooke's law. Watch closely:

Here is how the movement of the individual nodes looks like:

The fat red line is the trajectory of the weight, and it is certainly not a sinewave. It is a curve inbetween the piecewise-linear "sawtooth" (which is the limit case when the weight is zero) and the true sinusoid (which is the limit case when the mass of the spring is zero). Here's how the zero-weight case looks like:

And this is the other extreme - the massless spring:

These observations can be summarized into the following obviously-sounding conclusion: the basic Hooke's law applies exactly only to the the massless spring. Any real spring has a mass and thus forms an oscillation wave traveling back and forth along its length, which will interfere with the weight's simple harmonic oscillation, making it "less simple and harmonic". Luckily, if the mass of the weight is large enough, this interference is negligible.

And that is, in my opinion, one of the interesting, yet often overlooked aspects of spring motion.

Tags: Explanation, Mathematics, Paradox, Physics, Science, Theory, Visualization
The Meaning of the Gaussian Kernel

Posted by Konstantin 11.11.2016 4 Comments

A question on Quora reminded me that I wanted to post this explanation here every time I got a chance to teach SVMs and Kernel methods, but I never found the time. The post expects basic knowledge of those topics from the reader.

Introductory Background

The concept of kernel methods is probably one of the coolest tricks in machine learning. With most machine learning research nowadays being centered around neural networks, they have gone somewhat out of fashion recently, but I suspect they will strike back one day in some way or another.

The idea of a kernel method starts with the curious observation that if you take a dot product of two vectors, $x^T y$ , and square it, the result can be regarded as a dot product of two "feature vectors", where the features are all pairwise products of the original inputs:

$\begin{align*} &k(x,y) = (x^Ty)^2 = (\sum_i x_iy_i)^2 = \sum_{i,j} x_ix_jy_iy_j \\ & = \langle (x_1x_1, x_1x_2,\dots,x_ix_j,\dots,x_nx_n), (y_1y_1,\dots,y_iy_j,\dots,y_ny_n)\rangle \end{align*}$

Analogously, if you raise $x^Ty$ to the third power, you are essentially computing a dot product within a space of all possible three-way products of your inputs, and so on, without ever actually having to see those features explicitly.

If you now take any linear model (e.g. linear regression, linear classification, PCA, etc) it turns out you can replace the "real" dot product in its formulation model with such a kernel function, and this will magically convert your model into a linear model with nonlinear features (e.g. pairwise or triple products). As those features are never explicitly computed, there is no problem if there were millions or billions of them.

Consider, for example, plain old linear regression: $f(x) = w^Tx + b$ . We can "kernelize" it by first representing $w$ as a linear combination of the data points (this is called a dual representation):

$f(x) = \left(\sum_i \alpha_i x_i\right)^T x + b = \sum_i \alpha_i (x_i^T x) + b,$

and then swapping all the dot products $x_i^T x$ with a custom kernel function:

$f(x) = \sum_i \alpha_i k(x_i,x) + b.$

If we now substitute $k(x,y) = (x^T y)^2$ here, our model becomes a second degree polynomial regression. If $k(x,y) = (x^T y)^5$ it is the fifth degree polynomial regression, etc. It's like magic, you plug in different functions and things just work.

It turns out that there are lots of valid choices for the kernel function $k$ , and, of course, the Gaussian function is one of these choices:

$k(x, y) = \exp\left(-\frac{|x - y|^2}{2\sigma^2}\right).$

It is not too surprising - the Gaussian function tends to pop up everywhere, after all, but it is not obvious what "implicit features" it should represent when viewed as a kernel function. Most textbooks do not seem to cover this question in sufficient detail, usually, so let me do it here.

The Gaussian Kernel

To see the meaning of the Gaussian kernel we need to understand the couple of ways in which any kernel functions can be combined. We saw before that raising a linear kernel to the power $i$ makes a kernel with a feature space, which includes all $i$ -wise products. Now let us examine what happens if we add two or more kernel functions. Consider $k(x,y) = x^Ty + (x^Ty)^2$ , for example. It is not hard to see that it corresponds to an inner product of feature vectors of the form

$(x_1, x_2, \dots, x_n, \quad x_1x_1, x_1x_2,\dots,x_ix_j,\dots, x_n x_n),$

i.e. the concatenation of degree-1 (untransformed) features, and degree-2 (pairwise product) features.

Multiplying a kernel function with a constant $c$ is also meaningful. It corresponds to scaling the corresponding features by $\sqrt{c}$ . For example, $k(x,y) = 4x^Ty = (2x)^T(2y)$ .

Still with me? Great, now let us combine the tricks above and consider the following kernel:

$k(x,y) = 1 + x^Ty + \frac{(x^Ty)^2}{2} + \frac{(x^Ty)^3}{6}.$

Apparently, it is a kernel which corresponds to a feature mapping, which concatenates a constant feature, all original features, all pairwise products scaled down by $\sqrt{2}$ and all triple products scaled down by $\sqrt{6}$ .

Looks impressive, right? Let us continue and add more members to this kernel, so that it would contain all four-wise, five-wise, and so on up to infinity-wise products of input features. We shall choose the scaling coefficients for each term carefully, so that the resulting infinite sum would resemble a familiar expression:

$\sum_{i=0}^\infty \frac{(x^Ty)^i}{i!} = \exp(x^Ty).$

We can conclude here that $k(x,y) = \exp(x^Ty)$ is a valid kernel function, which corresponds to a feature space, which includes products of input features of any degree, up to infinity.

But we are not done yet. Suppose that we decide to normalize the inputs before applying our linear model. That is, we want to convert each vector $x$ to $\frac{x}{|x|}$ before feeding it to the model. This is quite often a smart idea, which improves generalization. It turns out we can do this “data normalization” without really touching the data points themselves, but by only tuning the kernel instead.

Consider again the linear kernel $k(x,y) = x^Ty$ . If we normalize the vectors before taking their inner product, we get

$\left(\frac{x}{|x|}\right)^T\left(\frac{y}{|y|}\right) = \frac{x^Ty}{|x||y|} = \frac{k(x,y)}{\sqrt{k(x,x)k(y,y)}}.$

With some reflection you will see that the latter expression would normalize the features for any kernel.

Let us see what happens if we apply this kernel normalization to the “infinite polynomial” (i.e. exponential) kernel we just derived:

$\begin{align*} \frac{\exp(x^Ty)}{\sqrt{\exp(x^Tx)\exp(y^Ty)}} &= \frac{\exp(x^Ty)}{\exp(|x|^2/2)\exp(|y|^2/2)} \\ &= \exp\left(-\frac{1}{2}(|x|^2+|y|^2-2x^Ty)\right) \\ &= \exp\left(-\frac{|x-y|^2}{2}\right). \end{align*}$

Voilà, the Gaussian kernel. Well, it still lacks $\sigma^2$ in the denominator but by now you hopefully see that adding it is equivalent to scaling the inputs by $1/\sigma$

To conclude: the Gaussian kernel is a normalized polynomial kernel of infinite degree (where feature products if $i$ -th degree are scaled down by $\sqrt{i!}$ before normalization). Simple, right?

An Example

The derivations above may look somewhat theoretic if not "magical", so let us work through a couple of numeric examples. Suppose our original vectors are one-dimensional (that is, real numbers), and let $x = 1$ , $y = 2$ . The value of the Gaussian kernel $k(x, y)$ for these inputs is:

$k(x, y) = \exp(-0.5|1-2|^2) \approx 0.6065306597...$

Let us see whether we can obtain the same value as a simple dot product of normalized polynomial feature vectors of a high degree. For that, we first need to compute the corresponding unnormalized feature representation:

$\phi'(x) = \left(1, x, \frac{x^2}{\sqrt{2!}}, \frac{x^3}{\sqrt{3!}}, \frac{x^4}{\sqrt{4!}}, \frac{x^5}{\sqrt{5!}}\dots\right).$

As our inputs are rather small in magnitude, we can hope that the feature sequence quickly approaches zero, so we don't really have to work with infinite vectors. Indeed, here is how the feature sequences look like:

---- $\phi'(1) =$ (1, 1, 0.707, 0.408, 0.204, 0.091, 0.037, 0.014, 0.005, 0.002, 0.001, 0.000, 0.000, ...)

---- $\phi'(2) =$ (1, 2, 2.828, 3.266, 3.266, 2.921, 2.385, 1.803, 1.275, 0.850, 0.538, 0.324, 0.187, 0.104, 0.055, 0.029, 0.014, 0.007, 0.003, 0.002, 0.001, ...)

Let us limit ourselves to just these first 21 features. To obtain the final Gaussian kernel feature representations $\phi$ we just need to normalize:

---- $\phi(1) =\frac{\phi'(1)}{|\phi'(1)|} = \frac{\phi'(1)}{1.649} =$ (0.607, 0.607, 0.429, 0.248, 0.124, 0.055, 0.023, 0.009, 0.003, 0.001, 0.000, ...)

---- $\phi(2) =\frac{\phi'(2)}{|\phi'(2)|} = \frac{\phi'(2)}{7.389} =$ (0.135, 0.271, 0.383, 0.442, 0.442, 0.395, 0.323, 0.244, 0.173, 0.115, 0.073, 0.044, 0.025, 0.014, 0.008, 0.004, 0.002, 0.001, 0.000, ...)

Finally, we compute the simple dot product of these two vectors:

$\scriptstyle\phi(1)^T\phi(2) = 0.607\cdot 0.135 + 0.607\cdot 0.271 + \dots = {\bf 0.6065306}602....$

In boldface are the decimal digits, which match the value of $\exp(-0.5|1-2|^2)$ . The discrepancy is probably more due to lack of floating-point precision rather than to our approximation.

A 2D Example

The one-dimensional example might have seemed somewhat too simplistic, so let us also go through a two-dimensional case. Here our unnormalized feature representation is the following:

$\begin{align*} \scriptscriptstyle\phi'(&\scriptscriptstyle x_1, x_2) = \left(1, x_1, \frac{x_1x_1}{\sqrt{2!}}, \frac{x_1x_2}{\sqrt{2!}}, \frac{x_2x_1}{\sqrt{2!}}, \frac{x_2x_2}{\sqrt{2!}}, \right. \\ &\scriptscriptstyle \left.\frac{x_1x_1x_1}{\sqrt{3!}}, \frac{x_1x_1x_2}{\sqrt{3!}}, \frac{x_1x_2x_1}{\sqrt{3!}}, \frac{x_1x_2x_2}{\sqrt{3!}}, \frac{x_2x_1x_2}{\sqrt{3!}}, \frac{x_2x_2x_1}{\sqrt{3!}}, \dots\right).\\ \end{align*}$

This looks pretty heavy, and we didn't even finish writing out the third degree products. If we wanted to continue all the way up to degree 20, we would end up with a vector with 2097151 elements!

Note that many products are repeated, however (e.g. $x_1x_2 = x_2x_1$ ), hence these are not really all different features. Let us try to pack them more efficiently. As you'll see in a moment, this will open up a much nicer perspective on the feature vector in general.

Basic combinatorics will tell us, that each feature of the form $\frac{x_1^a x_2^b}{\sqrt{n!}}$ must be repeated exactly $\frac{n!}{a!b!}$ times in our current feature vector. Thus, instead of repeating it, we could replace it with a single feature, scaled by $\sqrt{\frac{n!}{a!b!}}$ . "Why the square root?" you might ask here. Because when combining a repeated feature we must preserve the overall vector norm. Consider a vector $(v, v, v)$ , for example. Its norm is $\sqrt{v^2 + v^2 + v^2} = \sqrt{3}|v|$ , exactly the same as the norm of the single-element vector $(\sqrt{3}v)$ .

As we do this scaling, each feature gets converted to a nice symmetric form:

$\sqrt{\frac{n!}{a!b!}}\frac{x_1^a x_2^b}{\sqrt{n!}} = \frac{x_1^a x_2^b}{\sqrt{a!b!}} = \frac{x^a}{\sqrt{a!}}\frac{x^b}{\sqrt{b!}}.$

This means that we can compute the 2-dimensional feature vector by first expanding each parameter into a vector of powers, like we did in the previous example, and then taking all their pairwise products. This way, if we wanted to limit ourselves with maximum degree 20, we would only have to deal with $21^2$ = 231 features instead of 2097151. Nice!

Here is a new view of the unnormalized feature vector up to degree 3:

$\phi'_3(x_1, x_2) = \scriptstyle\left(1, x_1, x_2, \frac{x_1^2}{\sqrt{2!}}, \frac{x_1x_2}{\sqrt{1!1!}}, \frac{x^2}{\sqrt{2!}}, \frac{x_1^3}{\sqrt{3!}}, \frac{x_1^2x_2}{\sqrt{2!1!}}, \frac{x_1x_2^2}{\sqrt{1!2!}}, \frac{x_2^3}{\sqrt{3!}}\right).$

Let us limit ourselves to this degree-3 example and let $x = (0.7, 0.2)$ , $y = (0.1, 0.4)$ (if we picked larger values, we would need to expand our feature vectors to a higher degree to get a reasonable approximation of the Gaussian kernel). Now:

---- $\phi'_3(0.7, 0.2) =$ (1, 0.7, 0.2, 0.346, 0.140, 0.028, 0.140, 0.069, 0.020, 0.003),

---- $\phi'_3(0.1, 0.4) =$ (1, 0.1, 0.4, 0.007, 0.040, 0.113, 0.000, 0.003, 0.011, 0.026).

After normalization:

---- $\phi_3(0.7, 0.2) =$ (0.768, 0.538, 0.154, 0.266, 0.108, 0.022, 0.108, 0.053, 0.015, 0.003),

---- $\phi_3(0.1, 0.4) =$ (0.919, 0.092, 0.367, 0.006, 0.037, 0.104, 0.000, 0.003, 0.010, 0.024).

The dot product of these vectors is $0.8196\dots$ , what about the exact Gaussian kernel value?

$\exp(-0.5|x-y|^2) = 0.{\bf 81}87\dots.$

Close enough. Of course, this could be done for any number of dimensions, so let us conclude the post with the new observation we made:

The features of the unnormalized $d$ -dimensional Gaussian kernel are:

$\phi({\bf x})_{\bf a} = \prod_{i = 1}^d \frac{x_i^{a_i}}{\sqrt{a_i!}},$

where ${\bf a} = (a_1, \dots, a_d) \in \mathbb{N}_0^d$ .

The Gaussian kernel is just the normalized version of that, and we know that the norm to divide by is $\sqrt{\exp({\bf x}^T{\bf x})}$ . Thus we may also state the following:

The features of the $d$ -dimensional Gaussian kernel are:

$\phi({\bf x})_{\bf a} = \exp(-0.5|{\bf x}|^2)\prod_{i = 1}^d \frac{x_i^{a_i}}{\sqrt{a_i!}},$

where ${\bf a} = (a_1, \dots, a_d) \in \mathbb{N}_0^d$ .

That's it, now you have seen the soul of the Gaussian kernel.

Tags: Definition, Explanation, Machine learning, Mathematics, SVM, Teaching, Theory
Holography in Simple Terms

Posted by Konstantin 24.07.2015 No Comments

Holography is a fascinating way of storing and displaying visual information - one which will, hopefully, one day become as commonplace as photography and video is today. Unfortunately, at the current moment it is still a largely exotic trade, limited to artistic exhibitions or science museum displays. While the explanation of how photography works is usually part of the standard school curriculum, holography, at the same time, is often regarded as some mysterious sci-fi technology. The problem is made worse by the fact that there do not seem to be too many explanations around in the Internet, which would aim to properly dispel the aura of mystery. Here is my take at fixing that problem.

Let us start by considering the following illustration, where two hypothetical eyes are observing two hypothetical light sources.

The light from the two lamps reaches the eyes at different angles and intensities, which depend on the positions of the eyes relative to the lights. Because of that difference in angles and intensities, the brain, attached to the eyes, is capable of "triangulating" the location of the lamps, thus forming an enjoyable "three-dimensional" perception of the whole scene for itself.

Let us position a piece of glass in front of the light sources. We can regard this glass as a set of "pixels", through which light rays fly from the lamps towards the eyes. The figure below illustrates three such randomly selected "pixels" on the glass along with the directions of rays that leave them.

What if we could design a piece of glass which would fully consist of such hypothetical "pixels", with each pixel emitting two light rays in the appropriate directions? The eyes, observing such a piece of glass, would not know that the rays are really produced "in the pixels". Indeed, they perceive exactly the same light rays as if the light was coming from two lamps, positioned somewhere behind the glass. Hence, the brain, attached to the eyes, has to perform the same triangulation as the one it did when the lamps were real, and is not be able to distinguish the "fake glass" from a real scene with the lamps.

Although this idea of faking reality might look nice, the task of manufacturing small enough "pixels", which would be able to emit light in a number of given directions with given intensities and colors, is, unfortunately, technologically too complex to be practical.

This is where Physics comes to help us out. "Hey, guys," - she says, - "you are forgetting that light is not really a set of rays flying from from point A to point B along straight lines. Light is a wave. Hence, assuming that your light sources are emitting coherent light of a given frequency, a correct illustration of what happens in your scene with two lamps would be the following:"

If we now consider an arbitrary point-"pixel" on our glass, the electromagnetic waves which pass through it are perceived there in the form of some oscillation. The whole situation above could be modeled by considering waves on a water surface. To simulate the "glass" (the blue line) let us have a horizontal row of buoys float on the surface. Each of the buoys will be moving up and down in its place due to the waves, and we can record this movement.

Here comes a question: what if we remove the original two oscillation sources and instead will start "replaying" the recorded movements of the buoys buy forcing them to move up and down and thus form their own waves? The example below simulates this situation using 19 "buoys":

Does not look nice at all, does it? Now Mathematics comes along to help: "Try harder!" - she says. "If your buoys are dense enough, everything will work out!". Hence, we increase the number of buoys and try again:

Indeed, now we observe how a row of oscillating buoys (an analogue of a "glass" in our model) is generating a wave in front of it, which is completely identical to the one that was produced when two sources were oscillating behind it. No matter where we position ourselves in front of the buoy row, our perception will happily fool us into thinking that the wave we see is really a result of two point sources oscillating somewhere in a distance. Our brain will even help us to estimate the exact position of those sources somewhere behind the row. He does not know he is being fooled. Try covering the upper half of the animation above to get a better understanding of this illusion.

Hence, if we can make a glass, which consists of "pixels" each of which could "record" a profile of a light wave and then be able to reproduce it, the "image" observed by looking at such glass would be perfectly identical to an image that would be observed if the glass would simply pass through it the light from some sources located behind it. But is it possible? Isn't asking each "pixel" to record an arbitrary oscillation too much?

Here is when Mathematics comes along to help again. "Do you want to see a magic trick?" - she asks. "Yes!" - we answer.

Remember that you have two light sources emitting waves behind your piece of glass. As we agreed, they are emitting coherent light, i.e. light of the same frequency (for example, you could shine a laser light on the two points and have them reflect this light). Now let us see what happens at one of the pixels-"buoys" of your glass.

This point receives light waves from both sources. As the distances to the two sources are different, the two light waves reach the pixel with different intensities and phases. Formally, suppose the wave from the first source reaches the pixel with amplitude $a_1$ and phase $b_1$ and the wave from the second source reaches the pixel with amplitude $a_2$ and phase $b_2$ . The two waves add up, so the pixel "feels" the overall oscillation of the form

$a_1 \sin(ft + b_1) + a_2 \sin(ft + b_2).$

Now if you have not two, but three, four, ten, or a million of point sources behind the glass, the arriving signal will be, correspondingly, a sum of three, four, ten, or even a million sinewaves with different amplitudes and phases.

Now here comes the trick, hold your breath: Sum of any number of such sinewaves is itself just a single sinewave with some amplitude and phase.

In other words, even if there is a million point sources emitting coherent light from behind the glass, each "pixel-buoy" of our glass will still perform a simple oscillation of the form $A \sin(ft + B)$ . Hence, to "record" the oscillation of each "pixel-buoy" we must store only two parameters: its intensity and phase.

Now everything starts to look much simpler. Storing light intensity of a pixel is simple - any photograph does that. Dark spots on a photo or a screen emit (or reflect) low intensity light, white spots emit high intensity. It only remains to store the phase. Several methods have been proposed for that. Conceptually the simplest one (but technologically the most advanced) makes use of particular crystals which can shift the phase of light that passes through them. Each "pixel" could then be made out of such a crystal with an appropriate phase shift coded in it. When a "glass" made of such crystals is lit from the back by a coherent light source, each pixel would introduce the necessary phase shift into the light which passes through it, and thus reproduce the necessary wave front. I presume that this HoloVisio project is relying on this idea to design a proper holographic monitor.

The conventional holograms, the ones you typically see in science museums, are made using a simpler principle, though. Omitting some technical details, the idea behind it is the following. Let us add to the scene a reference ray of coherent light.

Now at some pixels on our "glass" the phase of the reference ray will coincide with the phase of the light coming from the scene. The overall wave oscillation at those points will be increased. At other points the phases of the reference and the scene light will mismatch and the signal will be weakened. We can record the resulting interference pattern by storing light spots on the glass at the pixels where the phases matched and dark spots where the phases did not match.

Now, whenever we shine the reference ray again through the resulting plate, only the parts of the ray with the "correct" phase will be let through, as those are exactly the places where the glass is transparent to the reference ray. As a result, you'll obtain a decent holographic reproduction. Here's a figure from Wikipedia, illustrating this principle of recording and reproducing holograms:

Considering the speed at which our photocameras have been transitioning over the recent decades from megapixel resolutions into gigapixels and observing the fast spread of stereoscopic "3D" from movie theaters into TV-sets and portable devices, it is both surprising and sad to see no notable signs of proper holographic display technology being developed in parallel. I do hope to see mass-produced holographic displays along with appropriate software within my own lifetime, though.

Tags: 3D, Mathematics, Physics, Visualization
Playing Card Cryptography

Posted by Konstantin 09.03.2015 No Comments

Playing cards are a great tool for modeling, popularizing and explaining various mathematical and algorithmic notions. Indeed, a deck of cards is a straightforward example for a finite set, a discrete distribution or a character string. Shuffling and dealing cards represents random sampling operations. A card hand denotes information possessed by a party. Turning card face down or face up looks a lot like bit flipping. Finally, card game rules describe an instance of a particular algorithm or a protocol. I am pretty sure that for most concepts in maths or computer science you can find some card game or a card trick that is directly related to it. Here are some examples of how you can simulate a simple computing machine, illustrate inductive reasoning or explain map-reduce, in particular.

Cryptographers seem to be especially keen towards playing cards. Some cryptographic primitives could have been inspired by them. There are decently secure ciphers built upon a deck of cards. Finally, there are a couple of very enlightening card-based illustrations of such nontrivial cryptographic concepts as zero-knowledge proofs and voting protocols. The recent course on secure two-party computation given by abhi shelat at the last week's EWSCS extended my personal collection of such illustrations with another awesome example — a secure two-party protocol for computing the AND function. As I could not find a description of this protocol anywhere in the internet (and abhi did not know who is the author), I thought it was worth being written up in a blog post here (along with a small modification of my own).

The Tinder Game

Consider the situation, where Alice and Bob want to find out, whether they are both interested in each other (just as if they were both users of the Tinder app). More formally, suppose that Alice has her private opinion about Bob represented as a single bit (where "0" means "not interested" and "1" means "interested"). Equivalently, Bob has his private opinion about Alice represented in the same way. They need to find out whether both bits are "1". However, if it is not the case, they would like to keep their opinions private. For example, if Alice is not interested in Bob, Bob would prefer Alice not to know that he is all over her. Because, you know, opinion asymmetry may lead to awkward social dynamics when disclosed, at least among college students.

The following simple card game happens to solve their problem. Take five cards, so that three of them are red and two are black. The cards of one color must be indistinguishable from each other (e.g. you can't simply take three different diamonds for the reds). Deal one black and one red card to Alice, one black and one red card to Bob. Put the remaining red card on the table face up.

Initial table configuration

Now Alice will put her two cards face down to the left of the open card, and Bob will put his two cards to the right of the open card:

Alice and Bob played

The rules for Alice and Bob are simple: if you like the other person, put your red card next to the central red card. Otherwise, put your black card next to the central one. For example, if both Alice and Bob like each other, they would play their cards as follows (albeit still face down):

What Alice and Bob would play if they both liked each other

Now the middle card is also turned face down, and all five cards are collected, preserving their order:

Five cards collected, preserving order

Next, Alice cuts this five-card deck, moving some number of cards from the top to the bottom (Bob should not know exactly how many). Then Bob cuts the deck (also making sure that Alice does not know how many cards he cuts). The result is equivalent to applying some cyclic rotation to the five card sequence yet neither Bob nor Alice knows how many cards were shifted in total.

The five cards can now be opened by dealing them in order onto the table to find out the result. If there happen to be three red cards one after another in a row or two black cards one after another, Alice and Bob both voted "yes". Here is one example of such a situation. It is easy to see that it is a cyclic shift of the configuration with three red aces in the middle shown above.

Both voted "yes"

Otherwise, if neither three red aces nor two black aces are side by side, we may conclude that one or both of the players voted "no". However, there is absolutely no way to find out anything more specific than that. Here is an example:

No mutual affection (or no affection at all)

Obviously, this result could not be obtained as a cyclic shift of the configuration with three aces clumped together. On the other hand, it could have been obtained as a cyclic shift from any of the three other alternatives ("yes-no", "no-no" and "no-yes"), hence if Alice voted "no" she will have no way of figuring out what was Bob's vote. Thus, playing cards along with a cryptographic mindset helped Alice and Bob discover their mutual affection or the lack of it without the risk of awkward situations. Isn't it great?

Throwing in a Zero-Knowledge Proof

There are various ways Alice or Bob can try to "break" this protocol while still trying to claim honesty. One possible example is shown below:

Alice is trying to cheat

Do you see what happened? Alice has put her ace upside down, which will later allow her to understand what was Bob's move (yet she can easily pretend that turning a card upside down was an honest mistake). Although the problem is easily solved by picking a deck with asymmetric backs, for the sake of example, let us assume that such a solution is somewhy unsuitable. Perhaps there are no requisite decks at Alice and Bob's disposal, or they need to have symmetric backs for some other reasons. This offers a nice possibility for us to practice even more playing card cryptography and try to secure the original algorithm from such "attacks" using a small imitation of a zero-knowledge proof for turn correctness.

Try solving it yourself before proceeding.

Read more...

Tags: Computer science, Cryptography, Fun, Games, Mathematics, Teaching, Theory
Float Precision

Posted by Konstantin 13.01.2015 No Comments
I haven't updated this blog for quite some time, which is a shame. Before I resume I wanted to reproduce here a couple of my old posts from other places. This particular piece stems from a post on our research group's webpage from more than 8 years ago, but is about an issue that never stops popping up in practice.

Precision of floating point numbers is a very subtle issue. It creeps up so rarely that many people (me included) would get it out of their heads completely before stumbling upon it in some unexpected place again.

Indeed, most of the time it is not a problem at all, that floating point computations are not ideally precise, and no one cares about the small additive noise that it produces, as long as you remember to avoid exact comparisons between floats. Sometimes, however, the noise can severely spoil your day by violating the core assumptions, such as "distance is always greater than zero", or "cosine of an angle never exceeds 1".

The following is, I think, a marvelous example, discovered by Alex, while debugging an obscure problem in one Python program. The choice of the language is absolutely irrelevant, however, so I took the liberty of presenting it here using Javascript (because this lets you reproduce it in your browser's console, if you wish). For Python fans, there is an interactive version available here as well.

A cosine distance metric is a measure of dissimilarity of two vectors, often used in information retrieval and clustering, that is defined as follows:

$\mathrm{cdist}(\mathbf{x},\mathbf{y}) = 1 - \frac{\mathbf{x}^T\mathbf{y}}{|\mathbf{x}| \; |\mathbf{y}|}$

A straightforward way to put this definition into code is, for example, the following:
```
function length(x) {
    return Math.sqrt(x[0]*x[0] + x[1]*x[1]);
}

function cosine_similarity(x, y) {
    return (x[0]*y[0] + x[1]*y[1])/length(x)/length(y);
}

function cosine_distance(x, y) {
    return 1 - cosine_similarity(x, y);
}
```
Now, mathematically, the cosine distance is a valid distance function and is thus always positive. Unfortunately, the floating-point implementation of it presented above, is not the same. Check this out:
```
> Math.sign(cosine_distance([6.0, 6.0], [9.0, 9.0]))
< -1
```
Please, beware of float comparisons. In particular, think twice next time you use the sign() function.
Tags: Mathematics, Programming, Python
The Substring Paradox

Posted by Konstantin 25.02.2013 No Comments
If anyone tells you he or she understands probability theory, do not believe them. That person simply does not know enough of it to admit, that probability theory is riddled with paradoxes, where common sense must step aside and wait in silence, or your brain will hurt. Substring statistics is probably among the lesser-known, yet magically unintuitive examples.

Consider a sequence of random coin flips. Each coin flip is either a "heads" or a "tails", hence the sequence might written down as a sequence of H and T-s: HTHTHHTT...

It is easy to show that the probability of the sequence to begin with, say, HHH is equal to P(HHH) = 1/8th, as is the case with any other three-letter combination: P(HHT) = P(THH) = P(THT) = 1/8, etc. Moreover, by symmetry, the probability of seeing a particular three-letter combination at any fixed position in the sequence is still always 1/8-th. All three-letter substrings seem to be equivalent here.

But let us now play a game, where we throw a coin until we see a particular three-letter combination. To be more specific, let us wait until either HHT or HHH comes up. Suppose I win in the first case and you win in the second one. Obviously, the game first proceeds until two heads are flipped. Then, whichever coin flip comes up next determines the winner. Sounds pretty fair, doesn't it? Well, it turns out that, surprisingly, if you count carefully the expected number of coin flips to obtain HHT, it happens to be 8, whereas for HHH it is 14! Ha! Does it mean I have an advantage? Suprisingly again, no. The probability of HHT occuring before HHH in any given sequence is still precisely 0.5 and, as we reasoned initially, the game is still fair.

We can, however, construct even more curious situations with four-letter combinations. Suppose I bet on HTHT and you bet on THTT. The expected number of coin flips to obtain my combination can be computed to be 20. The expected number of flips to get your combination is smaller: 18 flips. However, it is still more probable (64%) that my combination will happen before yours!

If this is not amusing enough, suppose that four of us are playing such a game. Player A bets on the string THH, Player B bets on HHT, player C on HTT and player D on TTH. It turns out that A's combination will occur earlier than B's with probability 75%. B's combination, however, wins over C's with probability 66.7%. C's combination, though, wins over D's with probability 75%. And, to close the loop, D wins over A with probability 66.7%! This is just like the nontransitive dice.

Hopefully, you are amazed enough at this point to require an explanation for how this all might happen. Let me leave it to you as a small puzzle:
- Explain in simple terms, how can it happen so that the expected time to first occurrence of otherwise equiprobable substrings may be different?
- Explain in simple terms, how can it be so that one substring has higher than 50% chance of occuring earlier than some other substring.
- Finally, explain why the two effects above are not strictly related to each other.
PS: The theory used to compute actual probabilities and expected times to occurrence of a substring is elegant yet somewhat involved. For the practically-minded, here is the code to check the calculations.
Tags: Mathematics, Paradox, Probability theory, Puzzle, Python
On Gödel's Theorems and the Universe

Posted by Konstantin 27.01.2013 7 Comments

Most results that are published in mathematics and natural sciences are highly technical and obscure to anyone outside the particular area of research. Consequently, most scientific publications and discoveries do not get any media attention at all. This is a problem both for the public (which would benefit from knowing a bit more about our world), the science in general (which would benefit from wider dissemination) and the scientists themselves (which would certainly appreciate if their work reached more than a couple of other people). The issue is usually addressed by the researchers trying to popularize their findings — provide simple interpretations, which could be grasped by the layman without the need to understand the foundations. Unfortunately, if the researcher was lucky to grab enough media attention, chances are high the process of "popularization" will get completely out of hands, and the message received by the public will have nothing to do with the original finding.

There are some theorems in pure mathematics, which happened to suffer especially seriously from this phenomenon. My two most favourite examples are the Heizenberg's uncertainty principle and Gödel's incompleteness theorems. Every once in a while I stumble on pieces, written by people who have zero knowledge of basic physics (not even mentioning quantum mechanics or Fourier theory), but claim to have clearly understood the meaning and the deep implications of the uncertainty principle upon our daily lives. Even more popular are Gödel's theorems. "Science proves that some things in this world can never be understood or predicted by us, mere humans!!!" is a typical take on Gödel by the press. Oh, and it is loved by religion enthusiasts!

This post is my desperate attempt to bring some sanity back into this world by providing a (yet another, but maybe slightly different) popular explanation for Gödel's ideas.

Gödel's theorems explained (with enough context)

Kurt Gödel

In our daily lives we primarily use symbols to communicate and describe the world. Those may be formulas in mathematics or simply sentences in the English language. Symbol-based systems can be used in different ways. From the perspective of arts and humanities, one is welcome to construct arbitrary sequences of words and sentences, as long as the result is "beautiful", "inspiring" or "seems smart and convincing". From a more technically-minded point of view, however, one is only welcome to operate with "true and logical" statements. However, it is not at all obvious which statements should be universally considered "true and logical". There are two ways the problem can be resolved.

Firstly, we can determine the truth by consulting "reality". For example, the phrase "All people are mortal, hence Socrates is mortal" is true because, indeed (avoiding a lecture-worth of formalities or so), in all possible universes where there are people and they are mortal, Socrates will always also be mortal.

Secondly, we can simply postulate a set of statements, which we shall universally regard as "the true ones". To distinguish those postulated truths from the "actual truths", mentioned in the paragraph above, let us call them "logically correct statements". That is, we can hypothetically write down an (infinite) list of all the "logically correct" sentences in some hypothetical "Book of All Logically Correct Statements". From that time on, every claim not in the book will be deemed "illogical" and hence "false" (the question of whether it is false in reality would, of course, depend on the goodness and relevance of the book we created).

This is really convenient, because now in order to check the correctness of the claim about Socrates we do not have to actually visit all the possible universes. It suffices to look it up in our book! If the book is any good, it will provide good answers, so the concept of "logical correctness" may be close or even equivalent to "actual truth". And in principle, there must exist "The Actual Book of Truth", which would list precisely the "actually true" statements.

Unfortunately, infinite books are somewhat inconvenient to print and carry around. Luckily, we can try to represent an infinite book in a finite way using algorithms. We have all seen how a short algorithm is capable of producing an infinite amount of nonsense, haven't we? Thus, instead of looking for an (infinitely-sized) "book of truth", we should look for a (finitely-sized) "algorithm of truth", which will be either capable of generating the whole book, or checking for any given statement, whether it belongs to the book.

Frege's propositional calculus

For clarity and historical reasons, the algorithms for describing "books of truths" are usually developed and written down using a particular "programming language", which consists of "axioms" and "inference rules". Here is one of the simplest examples of how a "program" might look like.

This is where a natural question arises: even if we assume that we have access to "the actual book of truth", will it really be possible to compress it down to a finite algorithm? Isn't it too much to ask? It turns out it is. There is really no way to create a compact representation for any but the most trivial "books of truths".

Besides the "asking too much" intuition, there are at least two other simple reasons why this is impossible. Firstly, most of the readers here know that there exist problems, that are in principle uncomputable. Those are usually about statements, the correctness of which cannot be checked in finite time by a finite algorithm, with the most famous example being the halting problem, as well as everything related to uncomputable numbers.

Naturally, if we can't even use a finite algorithm to list "all truly halting programs", our hopes for getting a finite description for any more complicated "book of truths" is gone. But there is another funny reason. Once we have decided to define the book with all true statements and devise an algorithm for generating it, we must not forget to include into the book some statements about the book and the algorithm themselves. In particular, we must consider the following phrase: "This sentence will not be included in the book generated by our algorithm."

If our algorithm does include this sentence in the generated book, the claim turns out to be false. As a result, our precious book would have false statements in it: it would be inconsistent. On the other hand, if our algorithm produces a book which does not include the above sentence, the sentence turns out to be true and the book is incomplete. It is just the liar's paradox.

And so it is — we have to put up with the situation, where it is impossible to produce a finite representation for all true statements without stumbling into logical paradoxes. Or, alternatively, that there are mathematical statements, which cannot be checked or proven within a finite time.

This describes what is known as "Gödel's first incompleteness theorem" (namely, the claim that in most cases our algorithm is only capable of producing either an incomplete or an inconsistent "book"). A curious corollary may be derived from it, which is known as the "Gödel's second incompleteness theorem". The theorem says that if our "book-generating algorithm" happens to include into it exactly the phrase "This book is consistent", it must necessarily be a false claim, and the algorithm must actually be inconsistent. This is often interpreted that no "honest" theory is capable of claiming it's own correctness (there are caveats, though).

Gödel's theorems have several interesting implications for the foundations of mathematics. None the less, those are still theorems about the fundamental impossibility of using a finite algorithm for generating a book of logically correct statements without stumbling into logical paradoxes and infinite loops. Nothing more and nothing less. They have nothing to do with the experimental sciences: note how we had to cut away all connections to reality in the very first paragraphs of the explanation. And thus, Gödel's theorems are not related to the causes of "human inability to grasp the universe", "failure or success of startup business plans", or anything else of that kind.

Tags: Computer science, Definition, Logic, Mathematics, Philosophy, Theory, Weltschmertz