• Posted by Konstantin 23.11.2016

    Basic linear algebra, introductory statistics and some familiarity with core machine learning concepts (such as PCA and linear models) are the prerequisites of this post. Otherwise it will probably make no sense. An abridged version of this text is also posted on Quora.

    Most textbooks on statistics cover covariance right in their first chapters. It is defined as a useful "measure of dependency" between two random variables:

        \[\mathrm{cov}(X,Y) = E[(X - E[X])(Y - E[Y])].\]

    The textbook would usually provide some intuition on why it is defined as it is, prove a couple of properties, such as bilinearity, define the covariance matrix for multiple variables as {\bf\Sigma}_{i,j} = \mathrm{cov}(X_i, X_j), and stop there. Later on the covariance matrix would pop up here and there in seemingly random ways. In one place you would have to take its inverse, in another - compute the eigenvectors, or multiply a vector by it, or do something else for no apparent reason apart from "that's the solution we came up with by solving an optimization task".

    In reality, though, there are some very good and quite intuitive reasons for why the covariance matrix appears in various techniques in one or another way. This post aims to show that, illustrating some curious corners of linear algebra in the process.

    Meet the Normal Distribution

    The best way to truly understand the covariance matrix is to forget the textbook definitions completely and start from a different point instead. Namely, from the definition of the multivariate Gaussian distribution:

    We say that the vector \bf x has a normal (or Gaussian) distribution with mean \bf \mu and covariance \bf \Sigma if:

        \[\Pr({\bf x}) =|2\pi{\bf\Sigma}|^{-1/2} \exp\left(-\frac{1}{2}({\bf x} - {\bf\mu})^T{\bf\Sigma}^{-1}({\bf x} - {\bf \mu})\right).\]

    To simplify the math a bit, we will limit ourselves to the centered distribution (i.e. {\bf\mu} = {\bf 0}) and refrain from writing out the normalizing constant |2\pi{\bf\Sigma}|^{-1/2}. Now, the definition of the (centered) multivariate Gaussian looks as follows:

        \[\Pr({\bf x}) \propto \exp\left(-0.5{\bf x}^T{\bf\Sigma}^{-1}{\bf x}\right).\]

    Much simpler, isn't it? Finally, let us define the covariance matrix as nothing else but the parameter of the Gaussian distribution. That's it. You will see where it will lead us in a moment.

    Transforming the Symmetric Gaussian

    Consider a symmetric Gaussian distribution, i.e. the one with {\bf \Sigma = \bf I} (the identity matrix). Let us take a sample from it, which will of course be a symmetric, round cloud of points:

    We know from above that the likelihood of each point in this sample is

    (1)   \[P({\bf x}) \propto \exp(-0.5 {\bf x}^T {\bf x}).\]

    Now let us apply a linear transformation {\bf A} to the points, i.e. let {\bf y} ={\bf Ax}. Suppose that, for the sake of this example, {\bf A} scales the vertical axis by 0.5 and then rotates everything by 30 degrees. We will get the following new cloud of points {\bf y}:
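
    If you would like to reproduce this experiment yourself, here is a minimal NumPy sketch of it. The 0.5-scaling and 30-degree rotation come from the example above; the sample size and the random seed are arbitrary choices of mine.

        import numpy as np

        rng = np.random.default_rng(0)

        # A sample from the symmetric Gaussian (Sigma = I): a round cloud of points.
        # Points are stored as rows.
        x = rng.standard_normal((1000, 2))

        # A scales the vertical axis by 0.5 and then rotates everything by 30 degrees.
        theta = np.deg2rad(30)
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
        S = np.diag([1.0, 0.5])
        A = R @ S

        # y = A x for every point (points are rows, hence the multiplication by A^T).
        y = x @ A.T

        # As derived just below, the empirical covariance of y comes out close to A A^T.
        print(np.cov(y, rowvar=False))
        print(A @ A.T)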

    What is the distribution of {\bf y}? Just substitute {\bf x}={\bf A}^{-1}{\bf y} into (1), to get:

    (2)   \begin{align*} P({\bf y}) &\propto \exp(-0.5 ({\bf A}^{-1}{\bf y})^T({\bf A}^{-1}{\bf y}))\\ &=\exp(-0.5{\bf y}^T({\bf AA}^T)^{-1}{\bf y}) \end{align*}

    This is exactly the Gaussian distribution with covariance {\bf \Sigma} = {\bf AA}^T. The logic works both ways: if we have a Gaussian distribution with covariance \bf \Sigma, we can regard it as a distribution which was obtained by transforming the symmetric Gaussian by some {\bf A}, and all that {\bf\Sigma} tells us about that transformation is the product {\bf AA}^T.

    More generally, if we have any data, then, when we compute its covariance to be \bf\Sigma, we can say that if our data were Gaussian, then it could have been obtained from a symmetric cloud using some transformation \bf A, and we just estimated the matrix {\bf AA}^T, corresponding to this transformation.

    Note that we do not know the actual \bf A, and that is mathematically perfectly fair. There can be many different transformations of the symmetric Gaussian which result in the same distribution shape. For example, if \bf A is just a rotation by some angle, the transformation does not affect the shape of the distribution at all. Correspondingly, {\bf AA}^T = {\bf I} for all rotation matrices. When we see a unit covariance matrix, we really do not know whether it is the “originally symmetric” distribution or a “rotated symmetric distribution”. And we should not really care - those two are identical.
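
    If you want a quick numeric sanity check of the "rotations change nothing" point, here is one; the 45-degree angle, the seed and the sample size are arbitrary choices of mine.

        import numpy as np

        theta = np.deg2rad(45)                 # any angle will do
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])

        # For a rotation matrix, R R^T is the identity ...
        print(np.allclose(R @ R.T, np.eye(2)))          # True

        # ... so rotating a symmetric cloud leaves its covariance (approximately) equal to I.
        rng = np.random.default_rng(5)
        x = rng.standard_normal((100_000, 2))
        print(np.cov(x @ R.T, rowvar=False))            # close to the identity matrix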

    There is a theorem in linear algebra, which says that any symmetric matrix \bf \Sigma can be represented as:

    (3)   \[{\bf \Sigma} = {\bf VDV}^T,\]

    where {\bf V} is orthogonal (i.e. a rotation) and {\bf D} is diagonal (i.e. a coordinate-wise scaling). Because a covariance matrix is positive semidefinite, the entries of {\bf D} are non-negative, so the {\bf D}^{1/2} below is well defined. If we rewrite it slightly, we will get:

    (4)   \[{\bf \Sigma} = ({\bf VD}^{1/2})({\bf VD}^{1/2})^T = {\bf AA}^T,\]

    where {\bf A} = {\bf VD}^{1/2}. This, in simple words, means that any covariance matrix \bf \Sigma could have been the result of transforming the data using a coordinate-wise scaling {\bf D}^{1/2} followed by a rotation \bf V. Just like in our example with \bf x and \bf y above.
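
    Here is a small NumPy sketch of this decomposition; the particular {\bf\Sigma} is just an arbitrary example of mine. For a symmetric matrix, np.linalg.eigh returns exactly the {\bf V} and the diagonal of {\bf D} from (3).

        import numpy as np

        Sigma = np.array([[2.0, 0.6],
                          [0.6, 1.0]])          # some covariance matrix

        # Spectral decomposition: Sigma = V D V^T, V orthogonal, D diagonal.
        eigvals, V = np.linalg.eigh(Sigma)
        D = np.diag(eigvals)

        # One possible "generating" transformation: coordinate-wise scaling, then rotation.
        A = V @ np.sqrt(D)

        print(np.allclose(V @ D @ V.T, Sigma))  # True: the decomposition (3)
        print(np.allclose(A @ A.T, Sigma))      # True: A A^T recovers Sigma, as in (4)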

    Principal Component Analysis

    Given the above intuition, PCA already becomes a very obvious technique. Suppose we are given some data. Let us assume (or “pretend”) it came from a normal distribution, and let us ask the following questions:

    1. What could have been the rotation \bf V and scaling {\bf D}^{1/2} which produced our data from a symmetric cloud?
    2. What were the original, “symmetric-cloud” coordinates \bf x before this transformation was applied?
    3. Which original coordinates were scaled the most by \bf D and thus contribute most to the spread of the data now? Can we keep only those and throw the rest out?

    All of those questions can be answered in a straightforward manner if we just decompose \bf \Sigma into \bf V and \bf D according to (3). But (3) is exactly the eigenvalue decomposition of \bf\Sigma. I’ll leave you to think for just a bit and you’ll see how this observation lets you derive everything there is about PCA and more.
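
    For concreteness, here is one way this looks in NumPy. The data below is synthetic, generated from a transformation of my own choosing purely for illustration; with real data you would start from the estimated covariance.

        import numpy as np

        rng = np.random.default_rng(1)

        # Synthetic centred data: a round cloud pushed through a known transformation.
        A_true = np.array([[2.0, 0.0],
                           [1.0, 0.5]])
        Y = rng.standard_normal((1000, 2)) @ A_true.T

        # Estimate the covariance and decompose it: Sigma = V D V^T.
        Sigma = np.cov(Y, rowvar=False)
        eigvals, V = np.linalg.eigh(Sigma)      # eigenvalues in ascending order

        # Questions 1-2: a candidate rotation/scaling and the "symmetric-cloud" coordinates.
        X_hat = Y @ V @ np.diag(1.0 / np.sqrt(eigvals))
        print(np.cov(X_hat, rowvar=False))      # close to the identity: a round cloud again

        # Question 3: keep only the direction that was scaled the most (largest eigenvalue).
        top_direction = V[:, -1]
        scores = Y @ top_direction              # the usual 1-D PCA projection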

    The Metric Tensor

    Bear with me for just a bit more. One way to summarize the observations above is to say that we can (and should) regard {\bf\Sigma}^{-1} as a metric tensor. A metric tensor is just a fancy formal name for a matrix that summarizes the deformation of space. However, rather than claiming that it in some sense determines a particular transformation \bf A (which it does not, as we saw), we shall say that it affects the way we compute angles and distances in our transformed space.

    Namely, let us redefine, for any two vectors \bf v and \bf w, their inner product as:

    (5)   \[\langle {\bf v}, {\bf w}\rangle_{\Sigma^{-1}} = {\bf v}^T{\bf \Sigma}^{-1}{\bf w}.\]

    To stay consistent we will also need to redefine the norm of any vector as

    (6)   \[|{\bf v}|_{\Sigma^{-1}} = \sqrt{{\bf v}^T{\bf \Sigma}^{-1}{\bf v}},\]

    and the distance between any two vectors as

    (7)   \[|{\bf v}-{\bf w}|_{\Sigma^{-1}} = \sqrt{({\bf v}-{\bf w})^T{\bf \Sigma}^{-1}({\bf v}-{\bf w})}.\]
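
    In code, these three definitions are one-liners; the distance (7), by the way, is exactly what is usually called the Mahalanobis distance. A sketch, with an arbitrary example {\bf\Sigma}:

        import numpy as np

        Sigma = np.array([[2.0, 0.6],
                          [0.6, 1.0]])
        Sigma_inv = np.linalg.inv(Sigma)

        def skewed_inner(v, w, M=Sigma_inv):
            """Inner product (5) in the metric given by M (here Sigma^{-1})."""
            return v @ M @ w

        def skewed_norm(v, M=Sigma_inv):
            """Norm (6)."""
            return np.sqrt(skewed_inner(v, v, M))

        def skewed_dist(v, w, M=Sigma_inv):
            """Distance (7), a.k.a. the Mahalanobis distance."""
            return skewed_norm(v - w, M)

        v = np.array([1.0, 0.0])
        w = np.array([0.0, 1.0])
        print(skewed_inner(v, w), skewed_norm(v), skewed_dist(v, w))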

    Those definitions now describe a kind of “skewed world” of points. For example, a unit circle (a set of points with “skewed distance” 1 to the center) in this world might look as follows:

    And here is an example of two vectors, which are considered “orthogonal”, a.k.a. “perpendicular” in this strange world:

    Although it may look weird at first, note that the new inner product we defined is actually just the dot product of the “untransformed” originals of the vectors:

    (8)   \[{\bf v}^T{\bf \Sigma}^{-1}{\bf w} = {\bf v}^T({\bf AA}^T)^{-1}{\bf w}=({\bf A}^{-1}{\bf v})^T({\bf A}^{-1}{\bf w}).\]

    The following illustration might shed light on what is actually happening in this \Sigma-“skewed” world. Somehow “deep down inside”, the ellipse thinks of itself as a circle and the two vectors behave as if they were (2,2) and (-2,2).

    Getting back to our example with the transformed points, we could now say that the point-cloud \bf y is actually a perfectly round and symmetric cloud “deep down inside”, it just happens to live in a skewed space. The deformation of this space is described by the tensor {\bf\Sigma}^{-1} (which is, as we know, equal to ({\bf AA}^T)^{-1}). PCA now becomes a method for analyzing the deformation of space - how cool is that?
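
    Identity (8) is also easy to check numerically; the matrix {\bf A} and the two random vectors below are arbitrary choices of mine.

        import numpy as np

        rng = np.random.default_rng(2)

        A = np.array([[2.0, 0.0],
                      [1.0, 0.5]])              # any invertible A
        Sigma_inv = np.linalg.inv(A @ A.T)

        v, w = rng.standard_normal(2), rng.standard_normal(2)

        lhs = v @ Sigma_inv @ w                                 # the "skewed" inner product
        rhs = np.linalg.solve(A, v) @ np.linalg.solve(A, w)     # dot product of the untransformed originals
        print(np.isclose(lhs, rhs))                             # True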

    The Dual Space

    We are not done yet. There’s one interesting property of “skewed” spaces worth knowing about. Namely, the elements of their dual space have a particular form. No worries, I’ll explain in a second.

    Let us forget the whole skewed space story for a moment, and get back to the usual inner product {\bf w}^T{\bf v}. Think of this inner product as a function f_{\bf w}({\bf v}), which takes a vector \bf v and maps it to a real number, the dot product of \bf v and \bf w. Regard the \bf w here as the parameter (“weight vector”) of the function. If you have done any machine learning at all, you have certainly come across such linear functionals over and over, sometimes in disguise. Now, the set of all possible linear functionals f_{\bf w} is known as the dual space to your “data space”.

    Note that each linear functional is determined uniquely by the parameter vector \bf w, which has the same dimensionality as \bf v, so apparently the dual space is in some sense equivalent to your data space - just the interpretation is different. An element \bf v of your “data space” denotes, well, a data point. An element \bf w of the dual space denotes a function f_{\bf w}, which projects your data points on the direction \bf w (recall that if \bf w is unit-length, {\bf w}^T{\bf v} is exactly the length of the perpendicular projection of \bf v upon the direction \bf w). So, in some sense, if \bf v-s are “vectors”, \bf w-s are “directions, perpendicular to these vectors”. Another way to understand the difference is to note that if, say, the elements of your data points numerically correspond to amounts in kilograms, the elements of \bf w would have to correspond to “units per kilogram”. Still with me?

    Let us now get back to the skewed space. If \bf v are elements of a skewed Euclidean space with the metric tensor {\bf\Sigma}^{-1}, is a function f_{\bf w}({\bf v}) = {\bf w}^T{\bf v} an element of a dual space? Yes, it is, because, after all, it is a linear functional. However, the parameterization of this function is inconvenient, because, due to the skewed tensor, we cannot interpret it as projecting vectors upon \bf w nor can we say that \bf w is an “orthogonal direction” (to a separating hyperplane of a classifier, for example). Because, remember, in the skewed space it is not true that orthogonal vectors satisfy {\bf w}^T{\bf v}=0. Instead, they satisfy {\bf w}^T{\bf \Sigma}^{-1}{\bf v} = 0. Things would therefore look much better if we parameterized our dual space differently. Namely, by considering linear functionals of the form f^{\Sigma^{-1}}_{\bf z}({\bf v}) = {\bf z}^T{\bf \Sigma}^{-1}{\bf v}. The new parameters \bf z could now indeed be interpreted as an “orthogonal direction” and things overall would make more sense.

    However, when we work with actual machine learning models, we still prefer to have our functions in the simple form of a dot product, i.e. f_{\bf w}, without any ugly \bf\Sigma-s inside. What happens if we turn a “skewed space” linear functional from its natural representation into a simple inner product?

    (9)   \[f^{\Sigma^{-1}}_{\bf z}({\bf v}) = {\bf z}^T{\bf\Sigma}^{-1}{\bf v} = ({\bf \Sigma}^{-1}{\bf z})^T{\bf v} = f_{\bf w}({\bf v}),\]

    where {\bf w} = {\bf \Sigma}^{-1}{\bf z}. (Note that we can lose the transpose because \bf \Sigma is symmetric).

    What it means, in simple terms, is that when you fit linear models in a skewed space, your resulting weight vectors will always be of the form {\bf \Sigma}^{-1}{\bf z}. Or, in other words, {\bf\Sigma}^{-1} is a transformation, which maps from “skewed perpendiculars” to “true perpendiculars”. Let me show you what this means visually.

    Consider again the two “orthogonal” vectors from the skewed world example above:

    Let us interpret the blue vector as an element of the dual space. That is, it is the \bf z vector in a linear functional {\bf z}^T{\bf\Sigma}^{-1}{\bf v}. The red vector is an element of the “data space”, which would be mapped to 0 by this functional (because the two vectors are “orthogonal”, remember).

    For example, if the blue vector was meant to be a linear classifier, it would have its separating line along the red vector, just like that:

    If we now wanted to use this classifier, we could, of course, work in the “skewed space” and use the expression {\bf z}^T{\bf\Sigma}^{-1}{\bf v} to evaluate the functional. However, why don’t we find the actual normal \bf w to that red separating line so that we wouldn’t need to do an extra matrix multiplication every time we use the function?

    It is not too hard to see that {\bf w}={\bf\Sigma}^{-1}{\bf z} will give us that normal. Here it is, the black arrow:

    Therefore, next time, whenever you see expressions like {\bf w}^T{\bf\Sigma}^{-1}{\bf v} or ({\bf v}-{\bf w})^T{\bf\Sigma}^{-1}({\bf v}-{\bf w}), remember that those are simply inner products and (squared) distances in a skewed space, while {\bf \Sigma}^{-1}{\bf z} is a conversion from a skewed normal to a true normal. Also remember that the “skew” was estimated by pretending that the data were normally-distributed.
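
    Here is a small numeric sketch of this "skewed normal to true normal" conversion. The {\bf\Sigma} is an arbitrary example of mine, and {\bf v} is constructed by hand to be "skewed-orthogonal" to {\bf z}.

        import numpy as np

        Sigma = np.array([[2.0, 0.6],
                          [0.6, 1.0]])
        Sigma_inv = np.linalg.inv(Sigma)

        # A dual-space "direction" z and a data vector v with z^T Sigma^{-1} v = 0.
        z = np.array([1.0, 0.0])
        v = np.array([-Sigma_inv[0, 1], Sigma_inv[0, 0]])
        print(np.isclose(z @ Sigma_inv @ v, 0.0))   # True: "orthogonal" in the skewed sense

        # The true normal to the same separating line is w = Sigma^{-1} z ...
        w = Sigma_inv @ z
        # ... and it is orthogonal to v in the ordinary sense, no Sigma needed.
        print(np.isclose(w @ v, 0.0))               # True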

    Once you see it, the role of the covariance matrix in some methods like Fisher’s discriminant or canonical correlation analysis might become much more obvious.

    The Dual Space Metric Tensor

    “But wait”, you should say here. “You have been talking about expressions like {\bf w}^T{\bf\Sigma}^{-1}{\bf v} all the time, while things like {\bf w}^T{\bf\Sigma}{\bf v} are also quite common in practice. What about those?”

    Hopefully you know enough now to suspect that {\bf w}^T{\bf\Sigma}{\bf v} is again an inner product or a squared norm in some deformed space, just not the “internal data metric space” that we considered so far. Which space is it? It turns out it is the “internal dual metric space”. That is, whilst the expression {\bf w}^T{\bf\Sigma}^{-1}{\bf v} denoted the “new inner product” between the points, the expression {\bf w}^T{\bf\Sigma}{\bf v} denotes the “new inner product” between the parameter vectors. Let us see why it is so.

    Consider an example again. Suppose that our space transformation \bf A scaled all points by 2 along the x axis. The point (1,0) became (2,0), the point (3, 1) became (6, 1), etc. Think of it as changing the units of measurement - before we measured the x axis in kilograms, and now we measure it in pounds. Consequently, the norm of the point (2,0) according to the new metric, |(2,0)|_{\Sigma^{-1}} will be 1, because 2 pounds is still just 1 kilogram “deep down inside”.

    What should happen to the parameter ("direction") vectors due to this transformation? Can we say that the parameter vector (1,0) also got scaled to (2,0) and that the norm of the parameter vector (2,0) is now therefore also 1? No! Recall that if our initial data denoted kilograms, our dual vectors must have denoted “units per kilogram”. After the transformation they will be denoting “units per pound”, correspondingly. To stay consistent we must therefore convert the parameter vector (”1 unit per kilogram”, 0) to its equivalent (“0.5 units per pound”,0). Consequently, the norm of the parameter vector (0.5,0) in the new metric will be 1 and, by the same logic, the norm of the dual vector (2,0) in the new metric must be 4. You see, the “importance of a parameter/direction” gets scaled inversely to the “importance of data” along that parameter or direction.

    More formally, if {\bf x}'={\bf Ax}, then

    (10)   \begin{align*} f_{\bf w}({\bf x}) &= {\bf w}^T{\bf x} = {\bf w}^T{\bf A}^{-1}{\bf x}'\\ & =(({\bf A}^{-1})^T{\bf w})^T{\bf x}'=f_{({\bf A}^{-1})^T{\bf w}}({\bf x}'). \end{align*}

    This means that the transformation \bf A of the data points implies the transformation {\bf B}:=({\bf A}^{-1})^T of the dual vectors. The metric tensor for the dual space must thus be:

    (11)   \[({\bf BB}^T)^{-1}=(({\bf A}^{-1})^T{\bf A}^{-1})^{-1}={\bf AA}^T={\bf \Sigma}.\]
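
    To make this concrete, here is the kilograms-to-pounds example above in numbers, idealizing the conversion factor to exactly 2, as in the text.

        import numpy as np

        A = np.diag([2.0, 1.0])                 # "kilograms to pounds" along the x axis
        Sigma = A @ A.T                         # diag(4, 1); the data metric is Sigma^{-1}
        Sigma_inv = np.linalg.inv(Sigma)

        point = np.array([2.0, 0.0])            # "2 pounds", i.e. 1 kg deep down inside
        print(np.sqrt(point @ Sigma_inv @ point))          # 1.0

        dual_small = np.array([0.5, 0.0])       # "0.5 units per pound" = 1 unit per kg
        dual_big   = np.array([2.0, 0.0])
        print(np.sqrt(dual_small @ Sigma @ dual_small))    # 1.0 (dual metric: Sigma)
        print(np.sqrt(dual_big @ Sigma @ dual_big))        # 4.0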

    Remember the illustration of the “unit circle” in the {\bf \Sigma}^{-1} metric? This is how the unit circle looks in the corresponding \bf\Sigma metric. It is rotated by the same angle, but it is stretched in the direction where it was squished before.

    Intuitively, the norm (“importance”) of the dual vectors along the directions in which the data was stretched by \bf A becomes proportionally larger (note that the “unit circle” is, on the contrary, “squished” along those directions).

    But the “stretch” of the space deformation in any direction can be measured by the variance of the data. It is therefore not a coincidence that {\bf w}^T{\bf \Sigma}{\bf w} is exactly the variance of the data along the direction \bf w (assuming |{\bf w}|=1).
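
    This last claim is easy to check by simulation; the transformation {\bf A}, the direction {\bf w}, the seed and the sample size below are arbitrary choices of mine.

        import numpy as np

        rng = np.random.default_rng(3)

        A = np.array([[2.0, 0.0],
                      [1.0, 0.5]])
        Sigma = A @ A.T
        Y = rng.standard_normal((100_000, 2)) @ A.T     # data with covariance ~ Sigma

        w = np.array([1.0, 1.0])
        w = w / np.linalg.norm(w)                       # a unit-length direction

        print(np.var(Y @ w))                            # empirical variance along w ...
        print(w @ Sigma @ w)                            # ... matches w^T Sigma w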

    The Covariance Estimate

    Once we start viewing the covariance matrix as a transformation-driven metric tensor, many things become clearer, but one thing becomes extremely puzzling: why is the inverse covariance of the data a good estimate for that metric tensor? After all, it is not obvious that {\bf X}^T{\bf X}/n (where \bf X is the data matrix) must be related to the \bf\Sigma in the distribution equation \exp(-0.5{\bf x}^T{\bf\Sigma}^{-1}{\bf x}).

    Here is one possible way to see the connection. Firstly, let us take it for granted that if \bf X is sampled from a symmetric Gaussian, then {\bf X}^T{\bf X}/n is, on average, the identity matrix. This has nothing to do with transformations; it is simply a consequence of the variables in the symmetric Gaussian being pairwise independent with zero mean and unit variance.

    Now, consider the transformed data, {\bf Y}={\bf XA}^T (vectors in the data matrix are row-wise, hence the multiplication on the right with a transpose). What is the covariance estimate of \bf Y?

    (12)   \[{\bf Y}^T{\bf Y}/n=({\bf XA}^T)^T{\bf XA}^T/n={\bf A}({\bf X}^T{\bf X}){\bf A}^T/n\approx {\bf AA}^T,\]

    the familiar tensor.

    This is a place where one could see that a covariance matrix may make sense outside the context of a Gaussian distribution, after all. Indeed, if you assume that your data was generated from any distribution P with uncorrelated variables of unit variance and then transformed using some matrix \bf A, the expression {\bf X}^T{\bf X}/n will still be an estimate of {\bf AA}^T, the metric tensor for the corresponding (dual) space deformation.
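
    Here is that claim checked with a decidedly non-Gaussian P: a uniform distribution rescaled to unit variance. The matrix {\bf A}, the seed and the sample size are arbitrary choices of mine.

        import numpy as np

        rng = np.random.default_rng(4)
        n = 100_000

        # A non-Gaussian P with uncorrelated, zero-mean, unit-variance components:
        # the uniform distribution on [-sqrt(3), sqrt(3)] has variance exactly 1.
        X = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, 2))

        A = np.array([[2.0, 0.0],
                      [1.0, 0.5]])
        Y = X @ A.T

        print(Y.T @ Y / n)                      # close to A A^T ...
        print(A @ A.T)                          # ... the metric tensor of the deformation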

    However, note that out of all possible initial distributions P, the normal distribution is exactly the one with the maximum entropy, i.e. the “most generic”. Thus, if you base your analysis on the mean and the covariance matrix (which is what you do with PCA, for example), you could just as well assume your data to be normally distributed. In fact, a good rule of thumb is to remember that whenever you even mention the words "covariance matrix", you are implicitly fitting a Gaussian distribution to your data.

    Posted by Konstantin @ 11:04 pm


  • 16 Comments

    1. Lao on 07.12.2016 at 12:08 (Reply)

      Good article.

      It gave me a sense of the meaning of the eigenvectors of the covariance matrix - or, in other words, of what linear transformation the covariance matrix provides.

    2. Iwabn on 22.01.2017 at 08:17 (Reply)

      Shouldn't the left side of equation (2) be P(A^-1 y) instead of P(y) directly? And then how do we get rid of the A^-1?

      1. Konstantin on 03.03.2017 at 10:33 (Reply)

        The equation expresses P(y), not P(A^-1 y).

    3. Simon on 27.06.2017 at 01:38 (Reply)

      Excellent article. Gives a very nice intuitive explanation, thank you.

    4. Chris on 10.11.2017 at 20:29 (Reply)

      Hi, I am really a fan of this explanation. I find these kinds of connections between statistical thinking and other math (linear algebra) strengthen my intuition the most. Are you aware of any other sources or blogs that drill down into concepts like this from a linear algebra point of view?

    5. […] What is the Covariance Matrix?, 2016 […]

    6. Arun on 13.05.2018 at 14:48 (Reply)

      Intuitively, the norm (“importance”) of the dual vectors along the directions in which the data was stretched by \bf A becomes proportionally larger..

      Is "larger" correct, or is it a typo that should read "smaller" rather than "larger"?

      1. Konstantin on 13.05.2018 at 23:31 (Reply)

        Think of it this way. Suppose you measured your data in kilograms and changed to pounds.

        Suppose you had a dual vector (5). Originally it meant (5 units/kg) and, say, its norm was equal to 5.

        After the switch to pounds, the dual vector (5) will be denoting (5 units/pound), which is equivalent to (10 units/kg), and, consequently, its norm should be equal to 10 (if you want to preserve the physical meaning of your computations).

        Consequently, the norm of the (numerically the same) dual vectors has grown in the new coordinates. At the same time "semantically the same" vectors got squished (to preserve the norm they had before).

    7. Lakshman on 13.05.2018 at 15:21 (Reply)

      Dear sir,
      I could not understand the statement "So, in some sense, if v-s are “vectors”, w-s are “directions, perpendicular to these vectors”". From a linear algebra perspective, if we project v upon a unit vector w, then the residual of v is perpendicular to w; I cannot see w as perpendicular to v. Please explain.

      1. Konstantin on 13.05.2018 at 23:50 (Reply)

        The wording is too vague here, indeed - the term "perpendicular" (should have been "orthogonal") was meant to refer to the illustration before it, where you have two vectors with an inner product of zero. If one of them were taken to be the "primal" vector, the second one would correspond to the "dual" one.

        Of course, whenever the inner product is non-zero, we should not say that "w is orthogonal to v". The key idea is just the fact that the vector w can only "act" on v via the inner product.

        Here's another way to phrase how a dual space vector differs from a primal one; perhaps it will help:

        The components of a primal vector denote the "amounts of movement" along each axis. The components of a dual vector denote the "price (or weight)" assigned to movement along each axis. "Amount" and "price" are "semantically orthogonal" in the sense that you may only meaningfully multiply them together, not add to each other.

        1. Lakshman on 14.05.2018 at 10:14 (Reply)

          Thank you for such a great explanation...

    8. Arun Chauhan on 16.03.2019 at 08:40 (Reply)

      Can you please help me to understand this: "The eigenvalues of data co-variance matrix, Φ^TΦ over (data space), measure the curvature of the likelihood function over (parameter space)." (Reference: the book Pattern Recognition and Machine Learning, Section 3.5.3, 6th line)

      1. Kris on 16.04.2019 at 20:10 (Reply)

        The eigenvectors of the data covariance matrix can be thought of as the local, new "coordinates", while the eigenvalues tell you how the space curves. For example, if the covariance matrix is constant in the parameters, the space is flat, since the axes are scaled by the same amount for different parameter values, and this is not so for non-constant covariance matrices.

      2. Konstantin on 04.05.2019 at 22:49 (Reply)

        This question is reasonably easy to answer with some handwaving at the whiteboard yet might be hard to explain in a short comment.

        In any case, try solving "Exercise 112" here (and any of the preceding exercises you might need to understand the necessary basics). This could help grasp the needed intuition.

    9. […] [4] Konstantin, fouryears.eu, “What is the Covariance Matrix?“ […]

    10. Aldo D'Ottavio on 13.12.2019 at 04:13 (Reply)

      Very nice note! Thank you for taking the time to write it up.

      Do you have a book or article reference which you would recommend for this?
