• Posted by Konstantin 11.11.2016

    A question on Quora reminded me of this explanation, which I have wanted to post here every time I got a chance to teach SVMs and kernel methods, but never found the time for. The post assumes basic knowledge of those topics from the reader.

    Introductory Background

    The concept of kernel methods is probably one of the coolest tricks in machine learning. With most machine learning research nowadays centered around neural networks, kernel methods have gone somewhat out of fashion, but I suspect they will strike back one day in some form or another.

    The idea of a kernel method starts with the curious observation that if you take a dot product of two vectors, x^T y, and square it, the result can be regarded as a dot product of two "feature vectors", where the features are all pairwise products of the original inputs:

        \begin{align*} &k(x,y) = (x^Ty)^2 = (\sum_i x_iy_i)^2 = \sum_{i,j} x_ix_jy_iy_j \\ & = \langle (x_1x_1, x_1x_2,\dots,x_ix_j,\dots,x_nx_n), (y_1y_1,\dots,y_iy_j,\dots,y_ny_n)\rangle \end{align*}

    Analogously, if you raise x^Ty to the third power, you are essentially computing a dot product within a space of all possible three-way products of your inputs, and so on, without ever actually having to see those features explicitly.
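
    For the skeptical reader, here is a quick numeric sanity check of this identity, written as a small sketch in numpy (the test vectors are arbitrary):

```python
import numpy as np

def pairwise_products(v):
    # All ordered pairwise products v_i * v_j, i.e. the "feature vector" above.
    return np.array([v[i] * v[j] for i in range(len(v)) for j in range(len(v))])

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])

print((x @ y) ** 2)                                 # 20.25
print(pairwise_products(x) @ pairwise_products(y))  # also 20.25
```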

    If you now take any linear model (e.g. linear regression, linear classification, PCA, etc.), it turns out you can replace the "real" dot product in its formulation with such a kernel function, and this will magically convert your model into a linear model with nonlinear features (e.g. pairwise or triple products). As those features are never explicitly computed, there is no problem even if there are millions or billions of them.

    Consider, for example, plain old linear regression: f(x) = w^Tx + b. We can "kernelize" it by first representing w as a linear combination of the data points (this is called a dual representation):

        \[f(x) = \left(\sum_i \alpha_i x_i\right)^T x + b = \sum_i \alpha_i (x_i^T x) + b,\]

    and then swapping all the dot products x_i^T x with a custom kernel function:

        \[f(x) = \sum_i \alpha_i k(x_i,x) + b.\]

    If we now substitute k(x,y) = (x^T y)^2 here, our model becomes a second-degree polynomial regression. With k(x,y) = (x^T y)^5 it is a fifth-degree polynomial regression, and so on. It's like magic: you plug in different functions and things just work.
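
    The post does not commit to a particular way of finding the coefficients \alpha_i, so purely as an illustration (a hypothetical sketch, not necessarily what is meant above), here is kernel ridge regression, which obtains them in closed form as \alpha = (K + \lambda I)^{-1} y and then lets you plug in any kernel:

```python
import numpy as np

def kernel_ridge_fit(X, y, kernel, lam=1e-3):
    # Dual coefficients: alpha = (K + lam * I)^(-1) y, one per training point.
    K = np.array([[kernel(a, b) for b in X] for a in X])
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_train, alpha, kernel, x):
    # f(x) = sum_i alpha_i * k(x_i, x)   (the bias term b is omitted for brevity)
    return sum(a * kernel(xi, x) for a, xi in zip(alpha, X_train))

# Plug in different kernel functions and "things just work":
linear = lambda u, v: u @ v
poly2  = lambda u, v: (u @ v) ** 2
gauss  = lambda u, v: np.exp(-0.5 * np.sum((u - v) ** 2))

X = np.linspace(-1, 1, 20).reshape(-1, 1)
y = X[:, 0] ** 2                       # a target that is quadratic in the input

alpha = kernel_ridge_fit(X, y, poly2)
print(predict(X, alpha, poly2, np.array([0.5])))   # close to 0.25
```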

    It turns out that there are lots of valid choices for the kernel function k, and, of course, the Gaussian function is one of these choices:

        \[k(x, y) = \exp\left(-\frac{|x - y|^2}{2\sigma^2}\right).\]

    This is not too surprising: the Gaussian function tends to pop up everywhere, after all. But it is not obvious what "implicit features" it represents when viewed as a kernel function. Most textbooks do not seem to cover this question in sufficient detail, so let me do it here.

    The Gaussian Kernel

    To see the meaning of the Gaussian kernel we need to understand a couple of ways in which kernel functions can be combined. We saw before that raising a linear kernel to the power i produces a kernel whose feature space includes all i-wise products. Now let us examine what happens if we add two or more kernel functions. Consider k(x,y) = x^Ty + (x^Ty)^2, for example. It is not hard to see that it corresponds to an inner product of feature vectors of the form

        \[(x_1, x_2, \dots, x_n, \quad x_1x_1, x_1x_2,\dots,x_ix_j,\dots, x_n x_n),\]

    i.e. the concatenation of degree-1 (untransformed) features, and degree-2 (pairwise product) features.

    Multiplying a kernel function by a constant c is also meaningful. It corresponds to scaling the corresponding features by \sqrt{c}. For example, k(x,y) = 4x^Ty = (2x)^T(2y).
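
    Here is a small numeric illustration of both facts (again just a sketch with arbitrary vectors):

```python
import numpy as np

def pairwise_products(v):
    return np.array([v[i] * v[j] for i in range(len(v)) for j in range(len(v))])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# k(x,y) = x^T y + (x^T y)^2  <=>  concatenating degree-1 and degree-2 features:
print(x @ y + (x @ y) ** 2)                                # 2.0
print(np.concatenate([x, pairwise_products(x)]) @
      np.concatenate([y, pairwise_products(y)]))           # also 2.0

# k(x,y) = 4 x^T y  <=>  scaling the features by sqrt(4) = 2:
print(4 * (x @ y), (2 * x) @ (2 * y))                      # both 4.0
```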

    Still with me? Great, now let us combine the tricks above and consider the following kernel:

        \[k(x,y) = 1 + x^Ty + \frac{(x^Ty)^2}{2} + \frac{(x^Ty)^3}{6}.\]

    Evidently, this is a kernel corresponding to a feature mapping that concatenates a constant feature, all original features, all pairwise products scaled down by \sqrt{2}, and all triple products scaled down by \sqrt{6}.

    Looks impressive, right? Let us continue and add more terms to this kernel, so that it contains all four-wise, five-wise, and so on up to infinity-wise products of the input features. We shall choose the scaling coefficients for each term carefully, so that the resulting infinite sum resembles a familiar expression:

        \[\sum_{i=0}^\infty \frac{(x^Ty)^i}{i!} = \exp(x^Ty).\]

    We can conclude that k(x,y) = \exp(x^Ty) is a valid kernel function, corresponding to a feature space that includes products of the input features of every degree, up to infinity.
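
    To make the claim tangible, here is a quick check that the truncated sums of scaled polynomial kernels really do approach \exp(x^Ty) (a sketch with arbitrary small test vectors):

```python
import numpy as np
from math import factorial

def exp_kernel_truncated(x, y, max_degree):
    # Sum of scaled polynomial kernels (x^T y)^i / i!  for i = 0 .. max_degree.
    s = x @ y
    return sum(s ** i / factorial(i) for i in range(max_degree + 1))

x = np.array([0.3, -0.5])
y = np.array([0.8,  0.1])

for d in (1, 2, 5, 10):
    print(d, exp_kernel_truncated(x, y, d))
print("exact:", np.exp(x @ y))   # the truncated sums converge to this value
```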

    But we are not done yet. Suppose that we decide to normalize the inputs before applying our linear model. That is, we want to convert each vector x to \frac{x}{|x|} before feeding it to the model. This is quite often a smart idea, which improves generalization. It turns out we can do this “data normalization” without really touching the data points themselves, but by only tuning the kernel instead.

    Consider again the linear kernel k(x,y) = x^Ty. If we normalize the vectors before taking their inner product, we get

        \[\left(\frac{x}{|x|}\right)^T\left(\frac{y}{|y|}\right) = \frac{x^Ty}{|x||y|} = \frac{k(x,y)}{\sqrt{k(x,x)k(y,y)}}.\]

    With some reflection you will see that the latter expression would normalize the features for any kernel.
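
    As a sanity check of this claim for the linear kernel (a small sketch; the helper name normalize_kernel is just something I made up for illustration):

```python
import numpy as np

def normalize_kernel(k):
    # k(x,y) / sqrt(k(x,x) * k(y,y)) -- "data normalization" done inside the kernel.
    return lambda x, y: k(x, y) / np.sqrt(k(x, x) * k(y, y))

linear = lambda u, v: u @ v

x = np.array([3.0, 4.0])
y = np.array([1.0, 0.0])

print(normalize_kernel(linear)(x, y))                     # 0.6
print((x / np.linalg.norm(x)) @ (y / np.linalg.norm(y)))  # also 0.6
```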

    Let us see what happens if we apply this kernel normalization to the “infinite polynomial” (i.e. exponential) kernel we just derived:

        \begin{align*} \frac{\exp(x^Ty)}{\sqrt{\exp(x^Tx)\exp(y^Ty)}} &= \frac{\exp(x^Ty)}{\exp(|x|^2/2)\exp(|y|^2/2)}  \\ &= \exp\left(-\frac{1}{2}(|x|^2+|y|^2-2x^Ty)\right) \\ &= \exp\left(-\frac{|x-y|^2}{2}\right). \end{align*}

    Voilà, the Gaussian kernel. Well, it still lacks the \sigma^2 in the denominator, but by now you hopefully see that adding it is equivalent to scaling the inputs by 1/\sigma.

    To conclude: the Gaussian kernel is a normalized polynomial kernel of infinite degree (where feature products of the i-th degree are scaled down by \sqrt{i!} before normalization). Simple, right?
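
    A quick numeric confirmation of the whole derivation (the test vectors are arbitrary):

```python
import numpy as np

exp_kernel = lambda u, v: np.exp(u @ v)
gaussian   = lambda u, v: np.exp(-0.5 * np.sum((u - v) ** 2))

def normalized(k, x, y):
    # The kernel normalization trick derived above.
    return k(x, y) / np.sqrt(k(x, x) * k(y, y))

x = np.array([0.7, -1.2])
y = np.array([0.3,  0.5])

print(normalized(exp_kernel, x, y))   # ~0.2176
print(gaussian(x, y))                 # the same value, ~0.2176
```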

    An Example

    The derivations above may look somewhat theoretical, if not "magical", so let us work through a couple of numeric examples. Suppose our original vectors are one-dimensional (that is, real numbers), and let x = 1, y = 2. The value of the Gaussian kernel k(x, y) for these inputs is:

        \[k(x, y) = \exp(-0.5|1-2|^2) \approx 0.6065306597...\]

    Let us see whether we can obtain the same value as a simple dot product of normalized polynomial feature vectors of a high degree. For that, we first need to compute the corresponding unnormalized feature representation:

        \[\phi'(x) = \left(1, x, \frac{x^2}{\sqrt{2!}}, \frac{x^3}{\sqrt{3!}}, \frac{x^4}{\sqrt{4!}}, \frac{x^5}{\sqrt{5!}}\dots\right).\]

    As our inputs are rather small in magnitude, we can hope that the feature sequences quickly approach zero, so we don't really have to work with infinite vectors. Indeed, here is what the feature sequences look like:

        \[\phi'(1) = (1, 1, 0.707, 0.408, 0.204, 0.091, 0.037, 0.014, 0.005, 0.002, 0.001, 0.000, 0.000, \dots)\]

        \[\phi'(2) = (1, 2, 2.828, 3.266, 3.266, 2.921, 2.385, 1.803, 1.275, 0.850, 0.538, 0.324, 0.187, 0.104, 0.055, 0.029, 0.014, 0.007, 0.003, 0.002, 0.001, \dots)\]

    Let us limit ourselves to just these first 21 features. To obtain the final Gaussian kernel feature representations \phi we just need to normalize:

        \[\phi(1) = \frac{\phi'(1)}{|\phi'(1)|} = \frac{\phi'(1)}{1.649} = (0.607, 0.607, 0.429, 0.248, 0.124, 0.055, 0.023, 0.009, 0.003, 0.001, 0.000, \dots)\]

        \[\phi(2) = \frac{\phi'(2)}{|\phi'(2)|} = \frac{\phi'(2)}{7.389} = (0.135, 0.271, 0.383, 0.442, 0.442, 0.395, 0.323, 0.244, 0.173, 0.115, 0.073, 0.044, 0.025, 0.014, 0.008, 0.004, 0.002, 0.001, 0.000, \dots)\]

    Finally, we compute the simple dot product of these two vectors:

        \[\scriptstyle\phi(1)^T\phi(2) = 0.607\cdot 0.135 + 0.607\cdot 0.271 + \dots = {\bf 0.6065306}602....\]

    The digits in boldface are those that match the value of \exp(-0.5|1-2|^2). The discrepancy is probably due more to limited floating-point precision than to the truncation of the feature vectors.
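
    The same computation is easy to reproduce in code (a sketch that, as above, truncates the feature vector at degree 20):

```python
import numpy as np
from math import factorial

def phi(x, max_degree=20):
    # Truncated features x^i / sqrt(i!), followed by the normalization step.
    v = np.array([x ** i / np.sqrt(factorial(i)) for i in range(max_degree + 1)])
    return v / np.linalg.norm(v)

print(phi(1.0) @ phi(2.0))                 # 0.60653066...
print(np.exp(-0.5 * (1.0 - 2.0) ** 2))     # 0.6065306597...
```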

    A 2D Example

    The one-dimensional example might have seemed somewhat too simplistic, so let us also go through a two-dimensional case. Here our unnormalized feature representation is the following:

        \begin{align*} \scriptscriptstyle\phi'(&\scriptscriptstyle x_1, x_2) = \left(1, x_1, x_2, \frac{x_1x_1}{\sqrt{2!}}, \frac{x_1x_2}{\sqrt{2!}}, \frac{x_2x_1}{\sqrt{2!}}, \frac{x_2x_2}{\sqrt{2!}}, \right. \\ &\scriptscriptstyle \left.\frac{x_1x_1x_1}{\sqrt{3!}}, \frac{x_1x_1x_2}{\sqrt{3!}}, \frac{x_1x_2x_1}{\sqrt{3!}}, \frac{x_1x_2x_2}{\sqrt{3!}}, \frac{x_2x_1x_2}{\sqrt{3!}},  \frac{x_2x_2x_1}{\sqrt{3!}}, \dots\right).\\ \end{align*}

    This looks pretty heavy, and we didn't even finish writing out the third degree products. If we wanted to continue all the way up to degree 20, we would end up with a vector with 2097151 elements!

    Note that many products are repeated, however (e.g. x_1x_2 = x_2x_1), hence these are not really all different features. Let us try to pack them more efficiently. As you'll see in a moment, this will open up a much nicer perspective on the feature vector in general.

    Basic combinatorics tells us that each feature of the form \frac{x_1^a x_2^b}{\sqrt{n!}} (with a + b = n) must be repeated exactly \frac{n!}{a!b!} times in our current feature vector. Thus, instead of repeating it, we can replace it with a single feature, scaled by \sqrt{\frac{n!}{a!b!}}. "Why the square root?", you might ask. Because when collapsing a repeated feature into one we must preserve the overall vector norm. Consider a vector (v, v, v), for example. Its norm is \sqrt{v^2 + v^2 + v^2} = \sqrt{3}|v|, exactly the same as the norm of the single-element vector (\sqrt{3}v).

    As we do this scaling, each feature gets converted to a nice symmetric form:

        \[\sqrt{\frac{n!}{a!b!}}\frac{x_1^a x_2^b}{\sqrt{n!}} = \frac{x_1^a x_2^b}{\sqrt{a!b!}} = \frac{x_1^a}{\sqrt{a!}}\frac{x_2^b}{\sqrt{b!}}.\]

    This means that we can compute the 2-dimensional feature vector by first expanding each coordinate into a vector of powers, like we did in the previous example, and then taking all their pairwise products. This way, if we wanted to limit ourselves to maximum degree 20, we would only have to deal with at most 21 \cdot 21 = 441 pairwise products (of which just 231 have total degree at most 20) instead of 2097151. Nice!
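
    In code this trick is essentially one outer product (a sketch that keeps all 441 products; the test points are the same ones used in the worked example below):

```python
import numpy as np
from math import factorial

def power_vector(t, max_degree=20):
    # The 1D expansion (1, t, t^2/sqrt(2!), ..., t^max_degree/sqrt(max_degree!)).
    return np.array([t ** i / np.sqrt(factorial(i)) for i in range(max_degree + 1)])

def phi2(x):
    # All pairwise products of the two per-coordinate expansions, then normalization.
    v = np.outer(power_vector(x[0]), power_vector(x[1])).ravel()
    return v / np.linalg.norm(v)

x = np.array([0.7, 0.2])
y = np.array([0.1, 0.4])

print(phi2(x) @ phi2(y))                     # ~0.8187
print(np.exp(-0.5 * np.sum((x - y) ** 2)))   # 0.8187...
```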

    Here is a new view of the unnormalized feature vector up to degree 3:

        \[\phi'_3(x_1, x_2) = \scriptstyle\left(1, x_1, x_2, \frac{x_1^2}{\sqrt{2!}}, \frac{x_1x_2}{\sqrt{1!1!}}, \frac{x_2^2}{\sqrt{2!}}, \frac{x_1^3}{\sqrt{3!}}, \frac{x_1^2x_2}{\sqrt{2!1!}}, \frac{x_1x_2^2}{\sqrt{1!2!}}, \frac{x_2^3}{\sqrt{3!}}\right).\]

    Let us limit ourselves to this degree-3 example and let x = (0.7, 0.2), y = (0.1, 0.4) (if we picked larger values, we would need to expand our feature vectors to a higher degree to get a reasonable approximation of the Gaussian kernel). Now:

        \[\phi'_3(0.7, 0.2) = (1, 0.7, 0.2, 0.346, 0.140, 0.028, 0.140, 0.069, 0.020, 0.003),\]

        \[\phi'_3(0.1, 0.4) = (1, 0.1, 0.4, 0.007, 0.040, 0.113, 0.000, 0.003, 0.011, 0.026).\]

    After normalization:

        \[\phi_3(0.7, 0.2) = (0.768, 0.538, 0.154, 0.266, 0.108, 0.022, 0.108, 0.053, 0.015, 0.003),\]

        \[\phi_3(0.1, 0.4) = (0.919, 0.092, 0.367, 0.006, 0.037, 0.104, 0.000, 0.003, 0.010, 0.024).\]

    The dot product of these vectors is 0.8196\dots. What about the exact Gaussian kernel value?

        \[\exp(-0.5|x-y|^2) = 0.{\bf 81}87\dots.\]

    Close enough. Of course, the same could be done for any number of dimensions, so let us conclude the post with the general observation we have arrived at:

    The features of the unnormalized d-dimensional Gaussian kernel are:

        \[\phi({\bf x})_{\bf a} = \prod_{i = 1}^d \frac{x_i^{a_i}}{\sqrt{a_i!}},\]

    where {\bf a} = (a_1, \dots, a_d) \in \mathbb{N}_0^d.

    The Gaussian kernel is just the normalized version of that, and we know that the norm to divide by is \sqrt{\exp({\bf x}^T{\bf x})}. Thus we may also state the following:

    The features of the d-dimensional Gaussian kernel are:

        \[\phi({\bf x})_{\bf a} = \exp(-0.5|{\bf x}|^2)\prod_{i = 1}^d \frac{x_i^{a_i}}{\sqrt{a_i!}},\]

    where {\bf a} = (a_1, \dots, a_d) \in \mathbb{N}_0^d.
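
    To make the boxed statement concrete, here is a sketch that enumerates the multi-indices {\bf a} directly, truncating each exponent (a good approximation for inputs that are small in magnitude):

```python
import numpy as np
from math import factorial
from itertools import product

def gaussian_features(x, max_power=15):
    # phi(x)_a = exp(-|x|^2 / 2) * prod_i x_i^{a_i} / sqrt(a_i!),
    # with each exponent a_i truncated to the range 0..max_power.
    scale = np.exp(-0.5 * np.sum(x ** 2))
    return np.array([scale * np.prod([xi ** a / np.sqrt(factorial(a))
                                      for xi, a in zip(x, idx)])
                     for idx in product(range(max_power + 1), repeat=len(x))])

x = np.array([0.7, 0.2, -0.3])
y = np.array([0.1, 0.4,  0.5])

print(gaussian_features(x) @ gaussian_features(y))   # ~0.5945
print(np.exp(-0.5 * np.sum((x - y) ** 2)))           # 0.59452...
```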

    That's it, now you have seen the soul of the Gaussian kernel.


  • Posted by Konstantin 07.06.2012

    The following text will only make sense to you if you know the technical details of Support Vector Machine learning.

    Having recently had to give some lectures (1,2,3,4) and exercise sessions (1,2,3,4) on linear classification in a machine learning course, I discovered that one of the most famous linear classification methods, the classical Support Vector Machine, does not seem to be fully specified in any of the prominent reference materials. Namely, the SVM learning procedure produces two parameters: the coefficient vector \bf\alpha and the bias term b. While the problem of finding \bf\alpha is typically explained in sufficient detail, the procedure of computing b is usually swept under the carpet.

    Here is, for example, what we see in Vapnik's seminal work, the birthplace of SVMs (Section 10.2, page 411):

    Vapnik (page 411)

    According to my calculations this is plainly incorrect, because the corresponding Kuhn-Tucker conditions (in the "soft margin" SVM that we are considering here) are actually:

        \[\alpha_t^0\left(\frac{1}{\gamma}\sum_{i=1}^\ell \alpha_i^0y_i(x_t * x_i) + b-1+\xi_t\right) = 0\]

    The important difference is the presence of the \xi_t term, which is unknown, hence the equation is not directly useful for finding b. Later on, the reader is also presented with the following summarizing description of the algorithm:

    Vapnik (page 425)

    Do you see any mention of a way to compute b? I would expect the author to be somewhat more helpful with respect to this detail, which is obviously important for anyone wishing to implement their own version of an SVM.

    OK, that was an old book; let's take a look at some newer materials. The two books which could be regarded as authoritative sources on the subject of SVMs are probably "An Introduction to Support Vector Machines" and "Kernel Methods for Pattern Analysis". Here is what the first book tells us about computing b (Proposition 6.11, page 106):

    Cristianini (page 106)

    This suggests that in order to compute b we need to find \alpha_i^*, such that C>\alpha_i^*>0, i.e. there must exist at least one training point lying exactly on the margin. Although in most practical situations such support vectors will indeed exist, it is also theoretically possible that there won't be any, i.e. not a single support vector will lie exactly on the margin. Thus, for purposes of implementing SVMs, the provided specification is incomplete.
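
    For reference, that textbook recipe translates into code roughly as follows (a hedged sketch, assuming the usual dual formulation with a solved coefficient vector alpha, labels y in {-1, +1} and a precomputed kernel matrix K; the tolerance constant is something I made up):

```python
import numpy as np

def bias_textbook(alpha, y, K, C, tol=1e-8):
    # Pick any i with 0 < alpha_i < C (a point lying exactly on the margin)
    # and solve y_i * (sum_j alpha_j y_j K_ji + b) = 1 for b.
    on_margin = np.where((alpha > tol) & (alpha < C - tol))[0]
    if len(on_margin) == 0:
        raise ValueError("no support vector lies exactly on the margin")
    i = on_margin[0]
    return y[i] - np.sum(alpha * y * K[:, i])
```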

    The same problem is present in the second book:

    Shawe-Taylor

    Take a careful look at lines 5-6, which claim that in order to compute b we need to choose i, j such that the corresponding \alpha_i, \alpha_j are strictly between 0 and C. Such \alpha_i, \alpha_j, however, are not guaranteed to exist.

    So then, what is the proper, fully general way of computing b? Of course, it is not too hard to derive, and thus makes for a great course homework exercise (e.g. Exercise 4 here). If you managed to read up to this point, I presume you should be motivated enough to try solving it. If you give up, you are welcome to consult my sample solution (Exercise 4, page 7).
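
    Without spoiling the sample solution linked above, here is a sketch of one fully general approach (not necessarily identical to that solution): collect the lower and upper bounds that the Kuhn-Tucker conditions place on b and pick any value in the resulting interval, e.g. its midpoint. The assumed inputs are the same as before: a solved alpha, labels y in {-1, +1}, kernel matrix K and the box constraint C.

```python
import numpy as np

def bias_general(alpha, y, K, C, tol=1e-8):
    g = K @ (alpha * y)          # g_i = sum_j alpha_j y_j k(x_j, x_i)
    e = y - g                    # the value of b "suggested" by each point

    at_zero   = alpha < tol                  # alpha_i = 0:      y_i (g_i + b) >= 1
    at_C      = alpha > C - tol              # alpha_i = C:      y_i (g_i + b) <= 1
    on_margin = ~at_zero & ~at_C             # 0 < alpha_i < C:  y_i (g_i + b) = 1

    # Rewriting these constraints in terms of b gives lower and upper bounds:
    lower_mask = on_margin | ((y > 0) & at_zero) | ((y < 0) & at_C)   # b >= e_i
    upper_mask = on_margin | ((y < 0) & at_zero) | ((y > 0) & at_C)   # b <= e_i

    lower = e[lower_mask].max() if lower_mask.any() else -np.inf
    upper = e[upper_mask].min() if upper_mask.any() else np.inf

    # At the dual optimum lower <= upper; any b in between satisfies all the
    # Kuhn-Tucker conditions, whether or not a margin support vector exists.
    if np.isinf(lower):
        return upper
    if np.isinf(upper):
        return lower
    return (lower + upper) / 2
```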
