Please report your SVM’s kernel!

I’m tired of reading papers that use an SVM but don’t say which kernel they used.  (There are tons of such papers in NLP and, I think, other areas that do applied machine learning.)  I suspect a lot of these papers are actually using a linear kernel.

An un-kernelized, linear SVM is nearly the same as logistic regression — every feature independently increases or decreases the classifier’s output prediction.  But a quadratic kernelized SVM is much more like boosted depth-2 decision trees.  It can do automatic combinations of pairs of features — a potentially very different thing, since you can start throwing in features that don’t do anything on their own but might have useful interactions with others.  (And of course, more complicated kernels do progressively more complicated and non-linear things.)
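
To make that concrete, here is a toy sketch of my own (nothing package-specific, and ignoring the kernel’s scaling constants): a quadratic kernel implicitly scores the same pairwise products you would get by expanding the feature vector by hand.

    import numpy as np

    def quadratic_expand(x):
        # Explicitly build the pairwise products a quadratic kernel uses implicitly.
        pairs = [x[i] * x[j] for i in range(len(x)) for j in range(i, len(x))]
        return np.concatenate([x, pairs])

    # Two features that do nothing on their own but interact usefully
    # (say, a negation marker and a sentiment word):
    x = np.array([1.0, 1.0])
    print(quadratic_expand(x))   # original features plus the interaction terms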

I have heard people say they download an SVM package, try a bunch of different kernels, and find the linear kernel is the best. In such cases they could have just used a logistic regression.  (Which is way faster and simpler to train!  You can implement SGD for it in a few lines of code!)
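
For what it’s worth, those few lines look roughly like this (a sketch, assuming a numpy feature matrix X of shape n-by-d and 0/1 labels y):

    import numpy as np

    def sgd_logreg(X, y, epochs=10, lr=0.1, l2=1e-4):
        # Plain SGD for L2-regularized logistic regression; y holds 0/1 labels.
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for i in np.random.permutation(len(y)):
                p = 1.0 / (1.0 + np.exp(-X[i].dot(w)))   # predicted probability
                w += lr * ((y[i] - p) * X[i] - l2 * w)   # gradient step on the log loss
        return w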

A linear SVM sometimes has a tiny bit better accuracy than logistic regression, because hinge loss is a tiny bit more like error rate than is log-loss.  But I really doubt this would matter in any real-world application, where much bigger issues are happening (like data cleanliness, feature engineering, etc.)

If a linear classifier is doing better than non-linear ones, that’s saying something pretty important about your problem.  Saying that you’re using an SVM is missing the point.  An SVM is interesting only when it’s kernelized.  Otherwise it’s just a needlessly complicated variant of logistic regression.


13 Responses to Please report your SVM’s kernel!

  1. Amen to that, brother. I’ve only written those same lines about a dozen times in reviews. I would add a plea to discuss regularization, too.

    The gradient for regularized SVM is just as easy as for regularized logistic regression, so it’s just as easy to implement with stochastic gradient descent (see the sketch at the end of this comment).

    I think log loss makes sense on its own if you want to estimate the probability of outcomes and heavily penalize really, really low estimates for things that actually happen.

    If you really want to, you can also kernelize logistic regression and solve the dual problem. It’s not done much.

    Instead, you often see people add interaction features. That’s what our project at Columbia’s about. Then we give the interaction features their own priors to pool their estimates. It’s coherent in a multilevel regression setting. I don’t know how you’d do this same kind of multilevel modeling with kernels.

    Another advantage to logistic regression is that it plays nicely with other probabilistic modeling components. That is, it’s easy to drop a regression into a graphical model to account for covariates. For instance, Mimno et al. did this to model the effect of document-level covariates (like source of the document) on the topic distribution.
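
    Here is the sketch mentioned above (my own illustration, assuming labels in {-1, +1} and an L2 penalty): the two per-example updates differ only in the loss term.

        import numpy as np

        def sgd_step(w, x, y, lr=0.1, l2=1e-4, loss="hinge"):
            # One SGD step for an L2-regularized linear classifier; y is -1 or +1.
            margin = y * x.dot(w)
            if loss == "hinge":             # SVM: subgradient of max(0, 1 - margin)
                grad = -y * x if margin < 1 else 0.0
            else:                           # logistic: gradient of log(1 + exp(-margin))
                grad = -y * x / (1.0 + np.exp(margin))
            return w - lr * (grad + l2 * w)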

  3. anagi says:

    Hello there, ….

    While I do agree with you in general about providing the details when using an SVM, like the kernel and regularization method, I am not sure about your statement that “An un-kernelized, linear SVM is nearly the same as logistic regression — every feature independently increases or decreases the classifier’s output prediction”. The latter is true for any linear model, so why do you specifically relate it to logistic regression?

    As far as I know, one difference between an SVM and logistic regression is the function that gets minimized.

  4. I mentioned logistic regression because it’s the usual (generalized) linear model used for classification. Usually SVMs are used for classification. As you said, the only difference between a linear SVM and logistic regression is the loss function that gets minimized, and they’re quite similar loss functions.

  5. anagi says:

    Hi Brendan

    Another difference, slightly tied to the minimization of the loss function, is the starting point of the problem… In a generalized linear model, you assume that the error term has a distribution from the exponential family (e.g. normal, Poisson, binomial, …), whereas in an SVM you don’t make any assumption about the “error” term, and hence the optimization function can differ (depending on the type of SVM one utilizes)… and so while both the generalized linear model and the SVM (with a linear kernel) might share the linearity property, there is a fundamentally different perspective in the way you approach a problem: the GLM is a probabilistic approach, while the SVM is a data-driven approach.

    It might be a trivial distinction, but it helps me differentiate the applicability of both methods… :)

  6. @anagi — I still don’t think that distinction could help you decide which to apply. Linear SVMs and logistic regression will give you basically identical results.

    The difference does slightly change how you reason about them, and more importantly, how you think of their related families of models/algorithms.

  7. Shay says:

    [saw a link to your post on Buzz!]

    You have to keep in mind that, from the point of view of an end user, support vector machines may be much easier to use, because they have off-the-shelf implementations, while, at least in my impression, good implementations of logistic regression are less common.

    I don’t think there is a reason to use logistic regression over an SVM, just like there is perhaps no reason to use an SVM over logistic regression, in most cases. It just depends on how easily you can get things done, if you are just interested in a specific application.

    And, yes, it is very likely that people use a simple linear kernel when they do not report which kernel they are using. That’s originally how SVMs were developed. The “kernel trick” came out later, I believe, in another paper. (You also have to be careful with wording here… SVMs are *always* linear, even when using a kernel, just in a different feature space. The non-linearity is with respect to the original feature space of the problem… but I am sure you know that.)

    And by the way… I believe there is also a kernelized version of logistic regression :-)

  8. anagi says:

    @ Brendan: agreed :)

  9. brendano says:

    @Shay, nicely done. Good point.

    @anagi — maybe the useful conceptual difference is: geometric vs. statistical views of classification.

  10. @Shay:

    1. There are plenty of solid logistic regression implementations. Start with something like R, which has it built in for small problems, or use any of the larger scale L-BFGS or stochastic gradient versions that are out there. Both SVMs and logistic regression are almost trivial to implement with SGD. If you’re an NLP person, you may need to look for “max entropy” — the field got off on a confusing footing terminologically.

    You can also use just about any neural networks package. If you use a logistic sigmoid activation function with a single output neuron, you have standard logistic regression (this is how it’s presented in MacKay’s text, for instance). Backpropagation is just stochastic gradient descent. The so-called “deep belief nets” are like stacked logistic regressions.

    2. The primary reason to use logistic regression is that it gives you probabilities out. The error function is set up so that you’re effectively minimizing probabilistic predictive loss. I said this in my original comment above.

    A second reason to use it is that it has a natural generalization to K-way classifiers (see the softmax sketch at the end of this comment). Last I saw, SVMs tended to estimate a bunch of binary problems and then combine classifiers, which seems rather clunky.

    Another reason probabilities are convenient is that it means a logistic regression can play nicely with other models and in other model components. For instance, Lafferty and Blei used a logistic normal distribution as a prior on document topics in an extension of LDA in order to deal with correlated topics. You see the logistic transform all over in stats, including in many latent variable models such as the item-response model or the Bradley-Terry model for ranking contests or comparisons.

    3. The error function is complex in logistic regression. On the latent scale (i.e., as in Albert and Chib’s formulation of logistic regression), the error is normal. But on the actual prediction scale, which is in the [0,1] range, this error gets transformed by the inverse logit.

    You can change the sigmoid curve (i.e., with the cumulative normal, you get probit regression instead of logit). You can also change the error term in the latent regression.

    The error’s not estimated as a free parameter in logistic regression. You fit the deterministic part of the regression to minimize error, just like in SVMs, then the actual error’s estimated from the residuals (difference between predictions and true values).

    In SVMs, the error is the simpler hinge loss function (actually, it’s not differentiable, which is more complex, but at least it’s piecewise linear).

    It’s pretty easy to extend logistic regression to structured prediction problems like taggers or parsers. I’d imagine you could do that with SVMs, too.

    A. Yes, you can kernelize logistic regression, too. It’s less common, though. The problem with operating in kernelized dual space is that it doesn’t scale well with number of training instances. So there’s been work on pruning the support in the context of SVMs.

    PS: We might as well add perceptrons to the list here. They were invented after logistic regression but before SVMs and have an even simpler error function. And maybe voted perceptrons if you really want to tie things back to boosting.
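
    As a footnote to point 2 above, here is a minimal sketch of the K-way generalization (my own illustration, assuming a K-by-d weight matrix W and a d-dimensional feature vector x): a single softmax replaces the combination of binary classifiers.

        import numpy as np

        def softmax_probs(W, x):
            # K-way (multinomial) logistic regression: one weight vector per class.
            scores = W.dot(x)
            scores -= scores.max()          # subtract the max for numerical stability
            e = np.exp(scores)
            return e / e.sum()              # probabilities over the K classes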

  11. brendano says:

    @Bob — preaching to the choir on logistic normals :)

    http://www.cs.cmu.edu/~scohen/jmlr10pgcovariance.pdf

    Also, Jacob and I used logistic normal priors for a topic model, but it was perhaps less central than Shay’s or Blei/Lafferty’s: http://brenocon.com/eisenstein_oconnor_smith_xing.emnlp2010.geographic_lexical_variation.pdf

  12. PadmaSree says:

    I am new to SVMs… I need to understand the purpose of the kernel, and what different types of kernels we can use…
    Can you please help me…

  13. PadmaSree says:

    If you have materials, can you please send them to me? :)