## The SVM Bias Term Conspiracy

Posted by Konstantin 07.06.2012

The following text will only make sense to you if you know the technical details of Support Vector Machine learning.

Having recently had to give some lectures (1,2,3,4) and exercise sessions (1,2,3,4) on linear classification in a machine learning course, I discovered that one of the most famous linear classification methods, the classical Support Vector Machine, does not seem to be fully specified in any of the prominent reference materials. Namely, the SVM learning procedure produces two parameters: the coefficient vector $\bf\alpha$ and the bias term $b$. While the problem of finding $\bf\alpha$ is typically explained in sufficient detail, the procedure of computing $b$ is usually swept under the carpet.

Here is, for example, what we see in the Vapnik's seminal work, the birthplace of SVMs (Section 10.2, page 411):

According to my calculations this is plain incorrect, because the corresponding Kuhn-Tucker conditions (in the "soft margin" SVM that we are considering here) are actually:

$\alpha_t^0\left(y_t\left(\frac{1}{\gamma}\sum_{i=1}^\ell \alpha_i^0y_i(x_t * x_i) + b\right)-1+\xi_t\right) = 0$

The important difference is the presence of the $\xi_t$ term, which is unknown, and hence the equation is not directly usable for finding $b$. Later on, the reader is also presented with the following summarizing description of the algorithm:

Do you see any mention of the way to compute $b$? I would expect the author to be somewhat more helpful with respect to this detail, which is obviously important for anyone wishing to implement their own version of the SVM.
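To spell out why the condition above cannot simply be solved for $b$, write, in more standard notation than the book's, $f(x_t) = \sum_{i=1}^\ell \alpha_i y_i (x_t * x_i) + b$ with box constraints $0 \le \alpha_i \le C$. The two relevant complementary slackness conditions are then:

$\alpha_t\left(y_tf(x_t) - 1 + \xi_t\right) = 0, \qquad (C-\alpha_t)\,\xi_t = 0.$

The second condition guarantees $\xi_t = 0$ only when $\alpha_t < C$, and only in that case does the first condition reduce to $y_tf(x_t) = 1$, an equation that can be solved for $b$ provided additionally $\alpha_t > 0$. This is exactly the assumption made, more or less explicitly, in the books quoted below.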

OK, that was an old book; let's take a look at some newer materials. The two books which could be regarded as authoritative sources on the subject of SVMs are probably "An Introduction to Support Vector Machines" and "Kernel Methods for Pattern Analysis". Here is what the first book tells us about computing $b$ (Proposition 6.11, page 106):

This suggests that in order to compute $b$ we need to find $\alpha_i^*$, such that $C>\alpha_i^*>0$, i.e. there must exist at least one training point lying exactly on the margin. Although in most practical situations such support vectors will indeed exist, it is also theoretically possible that there won't be any, i.e. not a single support vector will lie exactly on the margin. Thus, for purposes of implementing SVMs, the provided specification is incomplete.

The same problem is present in the second book:

Take a careful look at lines 5-6, which claim that in order to compute $b$ we need to choose $i,j$ such that the corresponding $\alpha_i, \alpha_j$ are strictly between 0 and $C$. However, such $\alpha_i, \alpha_j$ need not exist.

So then, what is the proper, fully general way of computing $b$? Of course, it is not too hard to derive, and thus makes for a great course homework exercise (e.g. Exercise 4 here). If you managed to read up to this point, I presume you should be motivated enough to try solving it. If you give up, you are welcome to consult my sample solution (Exercise 4, page 7).
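For the impatient, here is a minimal sketch of what such a general recipe can look like (the min/max construction also discussed in the comments below; the NumPy formulation and the function name are mine, not those of the linked solution):

```python
import numpy as np

def bias_range(alpha, y, K, C, tol=1e-8):
    """Return the interval [b_low, b_up] of bias values consistent with
    the KKT conditions of a soft-margin SVM dual solution.

    alpha : (n,) dual coefficients, 0 <= alpha_i <= C
    y     : (n,) labels in {-1, +1}
    K     : (n, n) kernel (Gram) matrix
    C     : box-constraint parameter
    """
    f = K @ (alpha * y)  # decision values without the bias term
    lower, upper = [], []
    for a_i, y_i, f_i in zip(alpha, y, f):
        # alpha_i < C  =>  xi_i = 0  =>  y_i * (f_i + b) >= 1
        # alpha_i > 0  =>               y_i * (f_i + b) <= 1
        if y_i > 0:
            if a_i < C - tol:
                lower.append(1 - f_i)   # b >= 1 - f_i
            if a_i > tol:
                upper.append(1 - f_i)   # b <= 1 - f_i
        else:
            if a_i < C - tol:
                upper.append(-1 - f_i)  # b <= -1 - f_i
            if a_i > tol:
                lower.append(-1 - f_i)  # b >= -1 - f_i
    b_low = max(lower) if lower else -np.inf
    b_up = min(upper) if upper else np.inf
    return b_low, b_up
```

Any $b$ in the returned interval satisfies the KKT conditions (the midpoint is a reasonable choice); when some $0 < \alpha_i < C$ exists, the interval collapses, up to numerical tolerance, to the single value given in the textbooks.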

Posted by Konstantin @ 8:49 pm

M0nZDeRR on 05.12.2012 at 22:21

I usually prefer to get rid of the bias term by considering previously centered data instead. For the case of Tikhonov regularization (squared loss function) it is straightforward to recover an optimal bias term from the resulting expansion parameters (alphas, in your case) and the original means (mu(X) and mu(Y)). I’m not sure about the SVM case (same penalty term but hinge loss, not squared) but it’s potentially possible to do the same trick.

P.S.: Of course I mean centering the data in feature space, i.e., centering the kernel, not in the input domain.

Konstantin on 06.12.2012 at 14:48

Yes, for L2-loss with L2-regularization, centering is equivalent to removing the bias term. For the SVM’s hinge loss, though, this is not the case.

Consider a simple dataset of one-dimensional points: two negative instances at 1 and 2, and two positive instances at 3 and 1002. The mean of those data points is 252, which is obviously not a correct bias (assuming the classifier f(x) = sign(x – b)).

The correct SVM bias would be any number between 2 and 3.

It is probably true, though, that for “good” data without bad outliers, using the mean for the SVM bias (e.g. centering the kernel) might often work reasonably well.
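Here is, in addition, a quick numeric sanity check of the toy example above. It is only a sketch, assuming scikit-learn's SVC with a large C to approximate the hard-margin solution; since SVC represents the classifier as sign(w*x + intercept), the threshold in the sign(x – b) parametrization is -intercept/w:

```python
import numpy as np
from sklearn.svm import SVC

# The toy dataset from above: 1-D points, labels in {-1, +1}.
X = np.array([[1.0], [2.0], [3.0], [1002.0]])
y = np.array([-1, -1, 1, 1])

# A large C approximates the hard-margin SVM on this separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0, 0]
threshold = -clf.intercept_[0] / w
print(threshold)  # ~2.5: inside (2, 3), and nowhere near the mean 252
```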

M0nZDeRR on 06.12.2012 at 18:09

Yep. I agree. For the purely L2 case one can recover an optimal bias with

b = mean(Y) - mean(X)*alpha,

which is not the case for hinge or eps.-insensitive losses.

However, I’m still curious whether there exist statistics fx() and fy() such that one can omit the bias by doing the following:

alpha = solveSVM(X - fx(X), Y - fy(Y), [C gamma ..])

b = fy(Y) - fx(X)*alpha;

Probably fx() and fy() may depend on C (and epsilon).

Konstantin on 06.12.2012 at 20:17

You can obviously find a proper centering of X that will let you fit the alphas correctly using a “b-free” procedure. Also, there is a way to compute b from the vector of (Y – alpha*X) values. However, this computation is not of the form (fy(Y) – fx(X)*alpha). Instead, it is a combination of taking min’s and max’s to find a range of suitable values for b.

Note that, contrary to the L2 case, the solution for b is not unique, hence a formula of the form b := fy(Y) – alpha*fx(X) would not be sufficient.
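To make the contrast with the L2 case explicit (a sketch, writing $w$ for the primal weight vector recovered from the alphas, and assuming the regularizer does not penalize $b$): for the squared loss, the stationarity condition in $b$ is

$\frac{\partial}{\partial b}\sum_{i=1}^\ell (y_i - w * x_i - b)^2 = -2\sum_{i=1}^\ell (y_i - w * x_i - b) = 0,$

which immediately gives the unique value $b = \bar{y} - w * \bar{x}$, i.e. exactly the mean-based formula above. The hinge loss is only piecewise linear in $b$, so the corresponding subgradient condition pins $b$ down to an interval rather than a single value.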

