Statistics vs. Machine Learning, fight!
So it’s pretty clear by now that statistics and machine learning aren’t very different fields. I was recently pointed to a very amusing comparison by the excellent statistician (and machine learning expert) Robert Tibshirani. Reproduced here:
Glossary

| Machine learning | Statistics |
|---|---|
| network, graphs | model |
| weights | parameters |
| learning | fitting |
| generalization | test set performance |
| supervised learning | regression/classification |
| unsupervised learning | density estimation, clustering |
| large grant = $1,000,000 | large grant = $50,000 |
| nice place to have a meeting: Snowbird, Utah, French Alps | nice place to have a meeting: Las Vegas in August |
Hah. Or rather, ouch! I had two thoughts reading this. (1) Poor statisticians. Machine learners invent annoying new terms, sound cooler, and have all the fun. (2) What’s wrong with statistics? They have way less funding and influence than it seems they might deserve.
There are several issues going on here, both substantive and cultural:
There might be too much re-making-up of terms on the ML side. But lots of these are useful. “Weights” is a great, intuitive term for the parameters of a linear model. I use it all the time to explain classifiers and regressions to non-experts. I was surprised to see “test set” on the statistics side; I’m used to thinking of held-out test set accuracy as an extremely common ML technique, while in statistics model fit is assessed with parametric assumptions for standard errors and such. I really like cross-validation and bootstrapping as ways of thinking about generalization. Again, they are far easier to grasp than the sampling and hypothesis testing approaches to parameter inference, which keep getting taught to, and misunderstood by, generations of confused Introduction to Statistics students. For example, how many times has it been explained that no, a p-value is NOT the probability your model is wrong? Yet scientific papers regularly treat significance levels in that manner (look how many stars are on this result!). On the other hand, cross-validation accuracy *is* something you can interpret as being related to the probability your model is right.
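To make the cross-validation point concrete, here is a minimal sketch of k-fold cross-validation in Python with NumPy. The helper names (`k_fold_accuracy`, `fit_centroids`, `predict_centroids`), the nearest-centroid classifier, and the synthetic two-blob data are all assumptions made up for illustration; none of them come from the glossary above or from any particular library.

```python
import numpy as np

def k_fold_accuracy(X, y, fit, predict, k=5, seed=0):
    """Average held-out accuracy over k folds (illustrative helper)."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))      # shuffle once, then split
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])
        predictions = predict(model, X[test_idx])
        scores.append(np.mean(predictions == y[test_idx]))
    return float(np.mean(scores))

# A deliberately simple classifier: assign each point to the nearest class mean.
def fit_centroids(X, y):
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_centroids(model, X):
    classes, centroids = model
    # Squared distance from every point to every centroid.
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dists.argmin(axis=1)]

# Synthetic data (an assumption): two well-separated Gaussian blobs.
data_rng = np.random.default_rng(42)
X = np.vstack([
    data_rng.normal(-2.0, 1.0, size=(100, 2)),
    data_rng.normal(2.0, 1.0, size=(100, 2)),
])
y = np.array([0] * 100 + [1] * 100)

cv_accuracy = k_fold_accuracy(X, y, fit_centroids, predict_centroids)
print(f"5-fold CV accuracy: {cv_accuracy:.3f}")
```

The number printed is an estimate of how often the classifier is right on data it has not seen, which is the sense in which held-out accuracy is easier to interpret than a p-value.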
I’ll also note that there are definitely a number of topics in ML that aren’t very related to statistics or probability. Max-margin methods: if all we care about is prediction, why bother using a probability model at all? Why not just optimize the spatial geometry instead? SVMs don’t require a lick of probability theory to understand. (Of course probability-based approaches are huge in ML, but it’s important to remember they’re not the only game in town, and there is no necessary reason they must be.) And then there are non-traditional settings such as online learning, reinforcement learning, and active learning, where the structure of access to information is in play. Conversely, there are certainly plenty of things in statistics that aren’t considered part of ML: say, regression diagnostics and significance testing. Finally, many ML problems involve large, high-dimensional data and models, where computational issues are very important. For example, in statistical machine translation, alignment models are described with probability theory and fit to data, but their structure is complex enough that optimal inference is intractable, and how you do approximate inference (EM, Viterbi, beam search, etc.) is a major issue.
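The max-margin idea can be sketched without any probability model at all. What follows is a rough illustration and not a production SVM solver: it fits a linear SVM by plain subgradient descent on the regularized hinge loss. The function name `train_linear_svm`, the hyperparameter values, and the toy data are hypothetical choices made for this example.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Subgradient descent on lam/2 * ||w||^2 + mean(max(0, 1 - y * (X @ w + b))).

    Labels y must be -1 or +1. No likelihood or probability appears anywhere.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        violators = margins < 1            # inside the margin or misclassified
        # Only margin violators contribute to the hinge-loss subgradient.
        grad_w = lam * w - (y[violators][:, None] * X[violators]).sum(axis=0) / n
        grad_b = -y[violators].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data (an assumption): two Gaussian blobs labeled -1 and +1.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(-2.0, 1.0, size=(100, 2)),
    rng.normal(2.0, 1.0, size=(100, 2)),
])
y = np.array([-1.0] * 100 + [1.0] * 100)

w, b = train_linear_svm(X, y)
train_accuracy = float(np.mean(np.sign(X @ w + b) == y))
print(f"training accuracy: {train_accuracy:.3f}")
```

Everything here is geometry: a separating direction `w`, an offset `b`, and a penalty for points that fall on the wrong side of the margin.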
But the most interesting differences between stats and ML are institutional.
I’ve been hearing lots of friends compare two dueling courses at Stanford: CS229, the CS department’s “machine learning” course taught by Andrew Ng; and Stat 315 A/B, the Statistics department’s “statistical learning” sequence taught by some combination of Tibshirani, Jerome Friedman, and Trevor Hastie. These people are all top-of-the-line researchers in the field. Their courses’ contents are extremely similar; I’d bet any of them could teach most of the material from the other side.
What differs most is the teaching style. CS has far better lecture notes. Of course, the stats people wrote a very good book; but better lecture notes win because I can access them later and send them to people for free. CS students I’ve talked to think the CS course is better taught; I can’t find stats students who take the CS course. (My sample is biased, though I know people in both.) Finally, the CS course has a big, open-ended project component; the Stats course follows more of a traditional problem set and tests format.
I think this is reflective of the differences in institutional culture between CS and Stats. There’s an interesting John Langford post on part of the issue, which he calls “The Stats Handicap”. He points out that stats Ph.D.’s have a big disadvantage in the job market because statistics has an old-school journal-oriented publishing culture, so students publish much less and have less experience engaging with a research community. CS is conference-oriented — certain conferences have a higher prestige than many journals (e.g. NIPS in ML, CHI in HCI) — and this results in faster turnaround, dissemination, and collaboration. (I’ve heard others make similar comparisons between CS and psychology.) I’d expect any discipline with a larger conference emphasis to have better courses since they should reward presentation/teaching skills — or at least encourage practice — more than in journal world.
ML sounds like it’s young, vibrant, interesting to learn, and growing; Stats does not.
Is marketing a problem? Machine learning terms definitely sound pretty cool. Maybe the perspective of computational intelligence lends itself to cool names. Though the Stanford statisticians certainly know how to play this game — for example, they made up their own names for variants of L1 and L2-regularized regression, leaving annoyed people like me forever googling “lasso” and “ridge” trying to remember which is which. (On the other hand, perhaps that’s child’s play compared to the true original sin of ML nomenclature: tossing around the highly deceptive term “neural network” for a stack of linear functions paired with a wonky, overhyped training algorithm; the combination of which, many years later, still causes confusion. Definitely blame CS for that one.)
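For reference, the two penalties can be written side by side. The notation is the usual textbook convention and is an assumption here, not something taken from the post: $y$ is the response vector, $X$ the design matrix, $\beta$ the coefficient vector, and $\lambda \ge 0$ the penalty strength.

```latex
% Ridge regression: squared-error loss with an L2 (squared Euclidean) penalty
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2

% Lasso: squared-error loss with an L1 (absolute value) penalty
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1
```

So ridge is the L2 one and lasso is the L1 one; the L1 penalty is the one that tends to drive some coefficients exactly to zero.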
Another issue is the definition of statistics itself. In 1997, Jerome Friedman wrote an extremely interesting analysis of the situation: “Data Mining and Statistics: What’s the Connection?”. He points out, quite correctly, the statistical impoverishment of some common approaches to data mining. You can certainly blame statistics for not marketing its ideas well enough, or blame CS for ignoring statistics. For example, there’s a good case that a lot of genetic algorithm and neural network research was much ado about nothing: over-complicated, cool-sounding hammers looking for nails when all you needed were some time-honored statistical and optimization techniques. (E.g. why NN when you haven’t tried a straight-up GLM? Why GA when you haven’t tried Nelder-Mead?) But this problem has been rectified somewhat. For example, NLP has seen a big move to simple linear models as the default technique, and NNs and GAs have fallen from grace in mainstream ML.
Friedman argues part of the problem is in how statisticians approach problems and the world:
One can catalog a long history of Statistics (as a field) ignoring useful methodology developed in other data related fields. Here are some of them that had seminal beginnings in Statistics but for the most part were subsequently ignored in our field: Pattern Recognition, Neural Networks, Machine Learning, Graphical Models, Chemometrics, Data Visualization.
That is not to say statistics is not important; it’s incredibly important. He quotes Efron as saying “Statistics has been the most successful information science.” However, information science is becoming bigger and broader and more exciting, thanks to computation and ever-increasing amounts of data. What should statisticians do? Friedman continues (light editing and emphasis are mine):
One view says that our field should concentrate on that small part of information science that we do best, namely probabilistic inference based on mathematics. If this view is adopted, we should become resigned to the fact that the role of Statistics as a player in the “information revolution” will steadily diminish over time.
Another point of view holds that statistics ought to be concerned with data analysis. The field should be defined in terms of a set of problems — rather than a set of tools — that pertain to data. Should this point of view ever become the dominant one, a big change would be required in our practice and academic programs.
First and foremost, we would have to make peace with computing. It’s here to stay; that’s where the data is. This has been one of the most glaring omissions in the set of tools that have so far defined Statistics. Had we incorporated computing methodology from its inception as a fundamental statistical tool (as opposed to simply a convenient way to apply our existing tools) many of the other data related fields would not have needed to exist. They would have been part of our field.
Friedman wrote this article more than 10 years ago. All his observations about the importance and increasing prevalence of data and computing power are even more true today than back then. Has the field of statistics changed? Not clear. (I’d appreciate seeing evidence to the contrary.)
On the other hand a world of data *has* to be increasingly statistical. The positive spin from Efron:
A new generation of scientific devices, typified by microarrays, produce data on a gargantuan scale – with millions of data points and thousands of parameters to consider at the same time. These experiments are “deeply statistical”. Common sense, and even good scientific intuition, won’t do the job by themselves. Careful statistical reasoning is the only way to see through the haze of randomness to the structure underneath. Massive data collection, in astronomy, psychology, biology, medicine, and commerce, is a fact of 21st Century science, and a good reason to buy statistics futures if they are ever offered on the NASDAQ.
I know that I’m interested in quantitative information science, including statistics and data analysis. Machine learning has many strengths, but it is definitely an odd way to go about analysis. But there’s a good case that statistics, as traditionally defined, is only going to have a smaller role in the future. “Data mining” sounds more relevant, but does it even exist as a coherent subject? Maybe it’s time to study a more applied statistical field like econometrics.
3. December 2008 at 2:30 pm :
[...] O’Connor has a thoughtful comparison of machine learning and statistics this morning. [...]
3. December 2008 at 5:24 pm :
[...] puts machine learning and statistics in a jar and shakes the jar December 3, 2008 Brendan O’Connor puts machine learning and statistics in a jar and shakes the jar: ML sounds like it’s young, vibrant, interesting to learn, and growing; Stats does [...]
3. December 2008 at 10:19 pm :
Somewhat related to this:
http://tinyurl.com/breiman2001
Leo Breiman, Statistical Modeling: The Two Cultures.
From the abstract
“The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets.”
4. December 2008 at 1:58 am :
Very interesting, very related paper; thanks.
4. December 2008 at 4:23 am :
To paraphrase Feynman, something (some problem, some theorem, some assessment, some analysis, some method) isn’t new just because it has been given a new name.
4. December 2008 at 6:34 am :
Cool post. It seems to be a general phenomenon what happens when applied sciences (psychology, biology, economics) adopt methods of the [relatively more] pure sciences (physics, statistics, mathematics), or when newer disciplines (synthetic biology) venture into space occupied by an older one (electrical engineering).
In the long run, things do get cleaned up. But it may take a younger generation of thinkers, with a stronger grasp of the pure sciences (think of the mathematicians who’ve ventured into economics, or the physicists who’ve gotten into synthetic biology) to sort out the mess of nomenclature and integrate the fields.
4. December 2008 at 8:51 am :
[...] The “Statistics vs. Machine Learning, fight! (also: ML nomenclature’s original sin)” entry on O’Connor’s blog is a rather entertaining look at the strange rivalry between two important computing fields [...]
4. December 2008 at 6:08 pm :
Tibshirani’s graph should really have included two additional factors: (1) average number of courses taught/year, and (2) median student/post-doc/faculty stipends and salaries. I think it’s part of the explanation of grant size, since most CS folks I knew at CMU simply bought themselves out of teaching by having grants pay their salaries. This makes CS departments much more highly leveraged (grant money needed per person to sustain the operation; at CMU our budget from the university didn’t even cover tenured faculty salaries, much less T.A.s). Even so, a $1M grant in CS is going to support a lot more people than a $50K grant in stats. You see an even stronger form of this effect in medical research, which has huge salaries and huge grants with armies of cheap post-docs.
Maybe we just need larger grants so we can go to the Alps and pay expensive graduate students and get away without having to actually teach for a living. I think the big tell would be once you put in average courses taught/year and see the statisticians with 3 or 4 and the machine learning people with 1 or maybe 2.
4. December 2008 at 6:38 pm :
Blame the physicists for the term “max entropy”, which is just plain old logistic regression (as are one-layer neural networks with sigmoid activations or softmax). But the non-Bayesian statisticians get the blame for “regularization”, and “lasso”/“ridge”; they’re just priors to the Bayesians.
Don’t diss back-prop! It’s having a renaissance in stochastic gradient methods all over machine learning.
Folks in machine learning are discovering Bayesian methods of dealing with uncertainty, whereas the Bayesian statisticians have been using graphical models in custom and general-purpose systems like BUGS for decades.
Dan Jurafsky and I were just discussing ML vs. stats, because we’ve both been doing more social science type stats. We were both surprised, like Brendan, that the statisticians don’t use cross-validation. I speculated it’s largely because the statistical paradigm of evaluating fit leads them to build models focused on analyzing existing data sets rather than to doing forward-looking predictions, but there are lots of counterexamples, such as FiveThirtyEight, which predicted the 2008 U.S. presidential election very accurately using Bayesian methods over polls.
The other issue Dan and I discussed is that statisticians care deeply about their coefficients (weights, parameters, whatever), whereas machine learning folks tend to toss them all into a bin and let priors and cross-validation sort them out. Sure, we might look at the feature weights to make sure the algorithm’s doing something sensible, but we don’t write papers where the point is to explore the effect of the word “the” (a feature) on estimation (there actually should be more of these papers in ML, in my opinion). For instance, statisticians very much want to explore the effect of a person’s weight on their chance of diabetes and aren’t going to be very happy giving a doctor an SVM and saying “trust it, it worked well on cross-validation”. And they want to examine the role of income or church attendance on voting. The goal is to explore the parameters (“effects”) as much as predict which way a state’s going to vote in the next election.
Finally, let me point out that the main systems used for microarrays in practice are simple linear factor models that any statistician would recognize, like dChip. What’s the justification for Efron’s comment that “Careful statistical reasoning is the only way to see through the haze of randomness to the structure underneath.”? Does statistics imply probability? If not, what about SVMs, as Brendan asks? Maybe all we need is room-sized 3D visualization coupled to human brain power.
4. December 2008 at 10:03 pm :
On descriptive statistics by attention to coefficients — lots of social science empirical work involves small, limited situations where they’re trying to find out if certain effects are in play; extrapolation to other situations is usually done with reasoning by analogy. If your reasoning and decision making in future situations is going to be qualitative, a trained-up SVM from a different situation isn’t useful; but knowing the top 3 coefficients from a linear model there *is* useful qualitative information.
I think this is the point of that bit about the Jeff Hammerbacher talk we were discussing at http://anyall.org/blog/2008/07/the-macgyver-of-data-analysis/: he’s assuming the domain of analyzing web behavior logs and figuring out how to make a website better. You could worry about automated decision making for what content to show people (ranking, recommendations etc.), but probably the most productive thing to do is extract qualitative insights from the data to inform the design process. This is a pretty social science-y domain; t-tests and linear regressions are going to be the tools of choice.
7. December 2008 at 9:39 am :
Another response, from Andrew Gelman — on a rather pro-CS note: http://www.stat.columbia.edu/~cook/movabletype/archives/2008/12/machine-learnin.html
I wish a statistician would come here and aggressively defend their discipline. At the very least, what about experimental design? Or tricky low-evidence situations: don’t you want a statistician, not an MLer, to testify at a trial about whether an event was a coincidence?
8. December 2008 at 12:50 am :
Have you ever studied game theory?
8. December 2008 at 5:42 am :
yes, why?
11. December 2008 at 3:33 am :
“…the highly deceptive term “neural network” for a stack of linear functions paired with a wonky, overhyped training algorithm; …”
The term “neural network” covers a broad range of techniques, but I don’t think the above description accurately describes any of them. For one thing, any “stack of linear functions” reduces algebraically to a single linear function. I imagine that you are referring to a multi-layer perceptron, but that is built of a stack of non-linear functions.
-Will Dwinnell
Data Mining in MATLAB
11. December 2008 at 3:44 am :
I did mean a multilayer perceptron. Individual units are “linear” in the generalized linear models sense: the response is a function of a linear combination of the inputs. (The same way a logistic regression is a linear model; you stack them up to get a multilayer NN.) Trained with backpropagation this is theoretically very powerful, but unfortunately it is tricky to use in practice. Thus “overhyped.”
Sorry for any confusion. And nice blog, by the way.
11. December 2008 at 5:34 pm :
@brendano
I was just asking because of your comment about econometrics. I just discovered game theory a couple of months ago and have been reading some books about applications to business strategy, like Nalebuff/Brandenburger’s “Co-opetition.” Great stuff; seems applicable to the things you’re interested in.
I met a Caltech PhD on a flight back from San Jose a couple of weeks ago who had just finished his degree in theoretical computer science (emphasis on game theory). He interviewed at YHOO and they were looking for algorithms that identify Nash equilibria in huge data sets of user interactions.
fwiw
;-)
16. December 2008 at 1:32 am :
On the conference vs journal oriented cultures, one frustration I have with conference oriented cultures is that they still feel too slow and competitive for getting feedback on work but so fast that they encourage the publication of a lot of low hanging fruit work so you can be up on a podium every year (or multiple times a year, depending on how many prestigious conferences are in your field).
How do people decide what is a valid contribution for presentation at NIPS? I can’t seem to make sense of it at CHI, except for a particularly traditional sort of “build system - evaluate on 10 research lab mates - summarize results” that ends up not creating very powerful or synthetic new knowledge.
30. December 2008 at 9:51 pm :
Hey Lilly, I don’t really know what the NIPS criteria are. They do both theory and applied papers. I do know that John Langford has a bunch of interesting things to say about it and in general about review criteria.
http://hunch.net/?p=499
http://hunch.net/?p=191
http://hunch.net/?p=223
30. December 2008 at 9:52 pm :
Ah, he has a number of interesting posts here: http://hunch.net/?cat=33
28. February 2009 at 10:36 am :
[...] a read that you will probably find enjoyable: Statistics versus Machine Learning, with the main characteristics and differences of these in [...]
4. March 2009 at 8:52 am :
I took issue with the “stack of linear functions” too. A neural network with any modeling power has the crucial ingredient of nonlinearities at each hidden layer. The way I think about a single layer neural network is as a logistic regression model operating on a set of features where the feature extraction is learned as well. As for “overhyped” = tricky to use, yes, that’s true. There are entire books written on effectively training the damned things. Unlike SVMs, you need to know a little something about both your domain problem and the model in order to get good results. There’s been a recent renaissance, as Bob points out, in neural networks research with the advent of methods for training deep networks (which is nearly impossible with gradient descent + backprop alone, unless you tie a lot of the weights together, i.e. convolutional networks).
The basic dichotomy between statistics and machine learning that I see is in academic lineage. Yes, they’ve invented new names for lots of things, but that’s mostly because the machine learning community grew out of computer scientists, engineers, physicists (I am often taken aback at just how many physicists seem to pop up), and yes, theoretical neuroscientists back in the 1980s, with very little crosstalk with statistics. There’s a strong difference in problem focus, as you mention.
I also think your discussion centers somewhat unfairly on classification and regression. There’s plenty of interesting work being done in unsupervised learning of complex, generative models of data, both with prior knowledge built in and without. Pleasantly, the two communities have converged on graphical models as a common parlance for describing probabilistic models of data.
5. March 2009 at 3:23 am :
Just discovered this post. Fantastic!! It aligns with many of my thoughts, especially since I’m a biostatistician interested in high-dimensional problems where ML techniques seem to be “easier”. Still learning about ML methods, though. You’re right about the cross-validation bit. Statisticians aren’t necessarily trained in predictive modeling and their techniques, including CV, model averaging, bagging, … I’ve recently felt the need to learn these areas since they’re apropos of some problems I’m consulting on. There NEEDS to be more cross-fertilization of the two fields, since we keep re-inventing wheels.
13. March 2009 at 8:46 am :
A great blog and interesting discussion - I’m an engineer who’s spent the last decade in biotech and pharma. Whenever someone non-analytical asks what I do, I say “data mining” which is not far from the truth. However, more recently I’ve spent more time with statistical modeling and the associated community in biostatistics.
I definitely see the differences Brendan and Bob mention - the focus on understanding the factor parameters/effects has a lot to do with the fact that the same analysts helped design the study, which includes specifying which data to collect and contrasts to select. Many of the machine learning folks seem to be more contract mercenaries / collaborating scientists who came on post-study.
The Netflix contest seems to be an interesting context to compare the approaches and insights gained. The highest performing [most?] groups seem to be dominated by ML. My guess is that Netflix will favor the white-box modelers to help them decide how to modify their actual suggestion engine, as opposed to wrapping around the best performer’s algorithm.
My experience is in biological data, which is almost always underspecified and overdetermined, full of correlated variables (the correlation itself being informative), and answers used as a starting point of another study/experiment/analysis. A ML-informed statistical (or statistically-informed ML) modeling approach ends up being the most useful. I use R mostly, with a lot of other domain-specific tools, and am starting to use MATLAB more as well (deployment of visualizations).
Keep up the great articles…
1. May 2009 at 3:09 am :
Abhijit -
> Statisticians aren’t necessarily trained in predictive modeling
> and their techniques, including CV, model averaging, bagging…
I agree and have thought the same thing myself, but it’s still funny to read that sentence given that there are so many papers from stats journals on those topics. I think they were all, or most of them, invented by statisticians too. Maybe this is showing a difference between statistics proper versus applied stats in traditional biology and social science.
But sometimes the ideas are there just named differently. Economists know about the held-out accuracy method of evaluation (e.g. CV & friends). They call it “out of sample predictions”. Economists, of course, broke away hard from mainline stats a while ago, calling it “econometrics” and reinventing names for EVERYTHING, plus throwing in a bizarre obsession with the method of moments. In terms of intellectual arrogance and needless renaming/duplication, economists are much worse than computer scientists and engineers. Maybe as bad as physicists.