Updates: CMU, Facebook

It’s been a good year. Last fall I started a master’s program in the Language Technologies department at CMU SCS, taking some great classes, hanging out with a cool lab, and writing two new papers (for ICWSM, involving Twitter: polls and tweetmotif; also did some coref work, financial text regression stuff, and looked at social lexicography.) I also applied to CS and stats PhD programs at several universities. Next year I’ll be starting the PhD program in the Machine Learning Department here at CMU.

I’m excited! Just the other day I was looking at videos on my old hard drive and found a presentation by Tom Mitchell on “the Discipline of Machine Learning” that I downloaded back in 2007 or so. (Can’t find it online right now, but this is similar.) That might be where I heard of the department first. Maybe some day I will be smarter than the guy who wrote this rant (though I am much more pro-stats and anti-ML these days…).

Also, I was recently named a finalist for the 2010 Facebook Fellowship program. CMU did impressively well here, with the number of winners and finalists (5+22) per school being:

9 Carnegie Mellon University
3 University of Michigan
3 Stanford University
2 University of Washington
2 University of Illinois at Urbana-Champagne
2 University of California at Berkeley
1 University of California at Irvine
1 Princeton University
1 Massachusetts Institute of Technology
1 Duke University
1 Cornell University
1 Arizona State University

And separately, I’ll be an intern at the Facebook Data team this summer (back in Palo Alto) starting in June. (Yes, notwithstanding my severe annoyances at bugs in the site itself.) I’ve enjoyed reading the work they’ve been doing there in the last year or two, and it seems like a very good place, if not the best place, to try doing some newfangled computational social science (whatever that is).

quick note: cer et al 2010

Quick note, reading this paper from their tweet.

update this reaction might be totally wrong; in particular, the conll dependencies for at least some languages were done completely by hand.


Malt and MSTParser were designed for the Yamada and Matsumodo dependencies formalism (the one used for the CoNLL dependency parsing shared task, from the penn2malt tool). Their feature sets and probably many other design decisions were created to support that. If you compare their outputs side-by-side, you will see that the Stanford Dependencies are a substantially different formalism; for example, compound verbs are handled very differently (the paper talks about copula example).

I think the following conclusion is premature:

Notwithstanding the very large amount of research that has gone into dependency
parsing algorithms in the last five years, our central conclusion is that the quality of the Charniak, Charniak-Johnson reranking, and Berkeley parsers is so high that in the vast majority of cases, dependency parse consumers are better off using them, and then converting the output to typed dependencies.

Re-run the experiment with Yamada and Matsumodo (or maybe Johannson and Nugues / pennconverter) then that would be convincing. In the meantime, a fair test of dependency parsing approach for the Stanford formalism would be to use a dependency parser and design richer features to support it. The paper points outs cases where Malt’s and MSTparser’s algorithms aren’t powerful enough; I wonder how TurboParser’s would fare. But this requires lots of work (writing new features) so maybe no one will do it.

How Facebook privacy failed me

At some point, I put extra email addresses on Facebook because I thought it was necessary for something, but didn’t want to show them, so in the privacy settings marked their visibility as “Only Me.” It turns out that right now, Facebook is blatantly ignoring that privacy setting, and instead showing them to the world.

Here are my settings:

Here is a fragment of my profile, viewed from a friend’s account half an hour ago:

I would complain to their customer service, but I can’t find a link from their Help Center page.

Obviously, this particular issue is a very minor concern for me. But it hardly instills faith in the system — especially considering that privacy bugs are ones that the affected user, almost by definition, can’t see. I’ve also had other weird issues where changes to privacy settings don’t seem to stick when I save then later go back to the page. It’s annoying and hard to verify these things — which is why important “social utilities” like Facebook have to be 99.99999% bug-free for things like this in order to deserve user trust.

Update #1: only by talking to a Facebook employee, I learned that if you search for “bugs” on the Help Center page, you get to the bug report form, and they place a high priority on privacy bugs. I guess this makes me feel a bit better. I’d be interested in knowing the incidence rate of privacy bugs like this. How often does people’s information get revealed without them knowing it? Does the general public of Facebook users have any way of knowing? This seems like territory a third-party consumer or privacy advocacy organization should work on (e.g. Consumer Union or EFF or something.)

Update #2: The bug seems to have disappeared from my friend’s view of my profile. They fixed the bug fast; this was seen about 30 minutes after reporting the bug. Great!

But still, I’m not going to trust Facebook’s privacy settings again, unless there is a credible argument and independent verification that they actually implement what they promise. Since even a low error rate is still bad, verification is hard (rare event detection problem), so I’m not sure what would be a convincing demonstration that privacy settings actually work…

Update #3: Apparently this was a brief but massive and wide-reaching bug. Maureen reports.

List of probabilistic model mini-language toolkits

There are an increasing number of systems that attempt to allow the user to specify a probabilistic model in a high-level language — for example, declare a (Bayesian) generative model as a hierarchy of various distributions — then automatically run training and inference algorithms on a data set. Now, you could always learn a good math library, and implement every model from scratch, but the motivation for this approach is you’ll avoid doing lots of repetitive and error-prone programming. I’m not yet convinced that any of them completely achieve this goal, but it would be great if they succeeded and we could use high-level frameworks for everything.

Everyone seems to know about only a few of them, so here’s a meager attempt to list together a bunch that can be freely downloaded. There is one package that is far more mature and been around much longer than the rest, so let’s start with:

  • BUGS - Bayesian Inference under Gibbs Sampling. Specify a generative model, then it does inference with a Gibbs sampler, thus being able to handle a wide variety of different sorts of models. The classic version has extensive GUI diagnostics for convergence and the like. BUGS can also be used from R. (The model definition language itself is R-like but not actually R.)

    BUGS has had many users from a variety of fields. There are many books and zillions of courses and other resources showing how to use it to do Bayesian statistical data analysis. BUGS is supposed to be too slow once you get to thousands of parameters. The original implementation, WinBUGS, is written in Delphi, a variant of Pascal (!); its first release was in 1996. There are also two alternative open-source implementations (OpenBUGS, JAGS).

    This is clearly very mature and successful software. Any new attempts to make something new should be compared against BUGS.

Next are systems that are much newer, generally less than several years old. Their languages all fall broadly into the category of probabilistic graphical models, but there are plenty of differences and specializations and assumptions that are a project in itself to understand. In lieu of doing a real synthesis, I’ll just list them with brief explanations.

  • Factorie focuses on factor graphs and discriminative undirected models. Claims to scale to millions of parameters. Written in Scala. New as of 2009. Its introductory paper is interesting. From Andrew McCallum’s group at UMass Amhearst.

  • Infer.NET. I only just learned of it. New as of 2008. Focuses on message-passing inference. Written in C#. From MSR Cambridge. I actually can’t tell whether you get its source code in the download. All other systems here are clearly open source (except WinBUGS, but OpenBUGS is a real alternative).

  • Church. Very new (as of 2009?), without much written about it yet. Focuses on generative models. Seems small/limited compared to the first three. Written in Scheme. From MIT.

  • PMTK - Probabilistic Modeling Toolkit. I actually have no idea whether it does model specification-driven inference, but the author’s previous similar-looking toolkit (BNT) is fairly well-known, so it’s in this list. Written in Matlab. From Kevin Murphy.

  • HBC - Hierarchical Bayesian Compiler. Similar idea as BUGS, though see webpage for a clear statement of its somewhat different goals. It compiles the Gibbs sampler to C, so it’s much faster. Seems to be unmaintained. Written in Haskell. From Hal Daume.

Finally, there are a few systems that seem to be more specialized. I certainly haven’t listed all of them; see the Factorie paper for a list of a few others.

  • Alchemy - an implementation of the Markov Logic Network formalism, an undirected graphical model over log-linear-weighted first-order logic. So, unlike BUGS and the above systems, there are no customized probability distributions for anything; everything is a Boltzmann (log-linear) distribution. At least, that’s how I understood it from the original paper. The FOL is essentially a language the user uses to define log-linear features. Alchemy then runs training algorithms to fit the their weights to data.

    From Pedros Domingos’ group at UWashington. Written in C++. I’ve heard people complain that Alchemy is too slow. But in fairness, all these systems are slower than customized implementations.

  • Dyna is specialized for dynamic programming. The formalism is weighted Horn clauses (weighted Prolog). Implements agenda-based training/inference algorithms that generalize Baum-Welch, PCFG chart parsers, and the like. Written in C++, compiles to C++. Seems unmaintained. From Jason Eisner’s group at John Hopkins.

    Since it only does dynamic programs, Dyna usefully supports a much more limited set of models than the above systems. But I expect that means it can train and infer with models that the above would be hopeless to handle, since dynamic programming gives you big-O efficiency gains over more general algorithms. (But on the other hand, even dynamic programming can be too generic and slow compared to direct, customized implementations. That’s the danger of all these systems, of course.)

  • BLOG - first-order logic with probability, though a fairly different formalism than MLNs. Focuses on problems with unknown and unknown numbers of objects. I personally don’t understand the use case very well. Its name stands for “Bayesian logic,” which seems like an unfairly broad characterization given all the other work in this area. From Brian Milch. Seems unmaintained? Written in Java.

An interesting axis of variation of all these is whether the model specification language is Turing-complete or not, and to what extent training and inference can be combined with external code.

  • Turing-complete: Factorie, Infer.NET, Church, and Dyna are all Turing complete. The modeling languages of the first three are embedded in general procedural programming languages (Scala, C#, and Scheme respectively). Dyna is Turing complete in two different ways: it has a complete Prolog-ish engine, which is technically Turing complete but is gonna be a pain to do anything normal in (I simply mean, since Prolog is technically Turing-complete but a total pain to do anything non-Prolog-y in); but also, it compiles to C++.
  • Not Turing-complete: BUGS, HBC, Alchemy/MLN, and BLOG use specialized mini-languages. BUGS’ and HBC’s languages are essentially the same as standard probabilistic model notation, though BUGS is imperative. Alchemy and BLOG are logic variants.
  • Compiles to Turing-complete: HBC compiles to C, and Dyna compiles to C++, which are then intended to be hacked up and/or embedded in larger programs. I imagine this is a maintainability nightmare, but could be fine for one-off projects.

Another interesting variation is to what extent the systems handle probabilistic relations. BUGS and HBC don’t really try at all beyond plates; Alchemy, BLOG, and Factorie basically specialize in this; Dyna kind of does in a way; and the rest I can’t tell.

In summary, lots of interesting variation here. Given how many of these things are new and changing, this area will probably look much different in a few years.

Seeing how “art” and “pharmaceuticals” are linguistically similar in web text

Earlier this week I asked the question,

How are “art” and “pharmaceuticals” similar?

People sent me lots of submissions! Some are great, some are a bit of a stretch.

  • Overpriced by an order of magnitude.
  • The letters of “art” are found embedded, in order, in “pharmaceuticals”.
  • Search keywords that cost the most to advertise on?
  • “Wyeth”: I think this means this, and this.
  • “Romeo and Juliet” famously includes both “art” (wherefore art thou) and pharmaceuticals (poison!)
  • Some art has been created out of pharmaceuticals.
  • Some art has been created under the influence of pharmaceuticals.

I was asking because I was playing around with a dataset of 100,000 noun phrases’ appearances on the web, from the Reading the Web project at CMU. That is, for a noun like “art”, this data has a large list of phrases in which the word “art” is used, across some 200 million web pages. For two noun concepts, we can see what they have in common and what’s different by looking at examples of how people use them when writing. So, for “art” versus “pharmaceuticals”:

common contexts for “art” but not “pharmaceuticals” [7394 total] common contexts for both “art” and “pharmaceuticals” [165 total] common contexts for “pharmaceuticals” but not “art” [206 total]
‘m into _
’s interested in _
A collection of _
_ has been described by
structure of _
study in _
_ have been shown in
The knowledge of _
_ is a commodity
_ is a creation
_ is a world
an exhibition of _
the commercialization of _
the confinement of _
_ is cast in

areas such as _
prices of _
storage of _
producers of _
_ designed for
the provision of _
_ sold in
the same way as _
_ are among
The production of _
the analysis of _
advances in _
specialising in _
a career in _
_ stolen from

a greater amount of _
standards for _
marketer of _
market for _
prescriptions for _
the supply of _
the availability of _
advertising for _
the appropriate use of _
shipment of _
a cocktail of _
classes of _
a complete inventory of _
_ related downloads
new generations of _

The middle column, showing ways in which people talk about both “art” and “pharmaceuticals”, makes it pretty clear. What they have in common is that they’re both products: you can buy, sell, produce, and store them. (There’s also an intellectual goods aspect: they both can be stolen.) This really didn’t occur to me at first; silly me, I thought art was a thing of beauty removed from such mundane considerations. A number of the submitted answers, though, center around the theme of them both being expensive — so we have positive agreement between corpus statistics and human judgments!

Examining massive numbers of contexts like this follows what the infinitely wise Dinosaur Comics calls “a statistically-based descriptivist approach to semantics.” Or as linguist J.R. Firth put it, “You shall know a word by the company it keeps.” Many subtleties of the two concepts can be seen just in their context lists. For example, in the left column, we see that only art “is a commodity”. Well, certainly pharmaceuticals are a commodity too. But that’s so obvious it’s not worth saying. Proclaiming that “art is a commodity”, however, is interesting. Maybe we think about this (possible) fact more.

As for the data: it comes from 200 million web pages (500 million sentences), and is filtered to contexts that appear more than five hundred times in the data. It was collected as part of a research project that seeks to extract a database of knowledge from this information — “reading the web”. (Yes, Hadoop was involved.) To make the table, I took the contexts’ set differences and intersection and showed a random subsample from each.

A final note. Will pointed out that in Alice in Wonderland, the Mad Hatter asks, “Why is a raven like a writing desk?” I tried that query on this data, but unfortunately, it didn’t contain many instances of “raven”. However, it does include a proper name “Raven” — which turns out to be an anime character. Not the first time I’ve seen the Internet’s massive amount of anime knowledge get in the way of a very serious semantic extraction system!

Many thanks to Adam, Joanna, Will, Vikas, and Michael for the submitted answers.

Quiz: “art” and “pharmaceuticals”

A lexical semantics question:

How are “art” and “pharmaceuticals” similar?

I have a data-driven answer, but am curious how easy it is to guess it, and in what sense it’s valid. I’ll post my answer and supporting evidence on Tuesday.

Don’t MAWK AWK - the fastest and most elegant big data munging language!

update 4/30/2010: I have since found large datasets where mawk is buggy and gives the wrong result. nawk seems safe.


When one of these newfangled “Big Data” sets comes your way, the very first thing you have to do is data munging: shuffling around file formats, renaming fields and the like. Once you’re dealing with hundreds of megabytes of data, even simple operations can take plenty of time.

For one recent ad-hoc task I had — reformatting 1GB of textual feature data into a form Matlab and R can read — I tried writing implementations in several languages, with help from my classmate Elijah. The results really surprised us:

 Language   Time (min:sec)  Speed (vs. gawk) Lines of code Notes Type
mawk 1:06 7.8x 3 Mike Brennan’s Awk, system default on Ubuntu/Debian Linux. VM

java 1:20 6.4x 32 version 1.6 (-server didn’t matter) VM+JIT

c-ish c++ 1:35 5.4x 42 g++ 4.0.1 with -O3, using stdio.h Native

python 2:15 3.8x 20 version 2.5, system default on OSX 10.5 VM

perl 3:00 2.9x 17 version 5.8.8, system default on OSX 10.5 VM

nawk 6:10 1.4x 3 Brian Kernighan’s “One True Awk”, system default on OSX, *BSD ?

c++ 6:50 1.3x 48 g++ 4.0.1 with -O3, using fstream, stringstream Native

ruby 7:30 1.1x 22 version 1.8.4, system default on OSX 10.5; also tried 1.9, but was slower Interpreted

gawk 8:35 1x 3 GNU Awk, system default on RedHat/Fedora Linux Interpreted

To be clear, Read more »

Patches to Rainbow, the old text classifier that won’t go away

I’ve been reading several somewhat recent finance papers (Antweiler and Frank 2005, Das and Chen 2007) that use Rainbow, the text classification software originally written by Andrew McCallum back in 1996. The last version is from 2002 and the homepage announces he isn’t really supporting it any more.

However, as far as I can tell, it might still be the easiest-to-use text classifier package out there. You don’t have to program — just invoke commandline arguments — and it can accommodate reasonably sized datasets, does tokenization, stopword filtering, etc. for you, and has some useful feature selection and other options. Based on my limited usage, it seems well-implemented. If anyone knows of a better one I’d love to hear it. I once looked at, among other things, GATE and UIMA, and they seemed too hard to use if you wanted to download something that did simple text classification; or else, maybe they didn’t have documentation on how to use them in that manner. Rainbow does. If I had to recommend a text classifier to a social scientist today, I might say they should Rainbow.

(GATE and UIMA call themsleves “architectures”. I usually don’t want an architecture, I want a program that does stuff. LingPipe was the only other system I found that had good web documentation saying how to use it to do text classification. It looks like a good option, if you’re willing to write some code. There are numerous academic efforts to make automated content analysis systems that at a high level sound like the right sort of thing, but nearly all of them have poor web docs so it’s hard to tell whether they do what you want.)

In the meantime, the current Rainbow download has issues compiling on modern GCC and Mac OSX — some issues documented here. I worked through them put my patched version (only tested on GCC 4.0, OSX 10.5) up here: github.com/brendano/bow

Another R flashmob today

Dan Goldstein sends word they’re doing another Stackoverflow R flashmob today. It’s a neat trick. The R tag there is becoming pretty useful.

Beautiful Data book chapter

Today I received my copy of Beautiful Data, a just-released anthology of articles about, well, working with data.  Lukas and I contributed a chapter on analyzing social perceptions in web data.  See it here. After a long process of drafting, proofreading, re-drafting, and bothering the publishers under rather sudden deadlines, I’ve resolved to never use graphics again in anything I write :)

Here’s our final figure, a k-means clustering of face photos via perceived social attributes (social concepts/types? with exemplars?):

I just started reading the rest of the book and it’s very fun.  Peter Norvig’s chapter on language models is gripping.  (It does word segmentation, ciphers, and more, in that lovely python-centric tutorial style extending his previous spell correction article.)  There are also chapters by many other great researchers and practitioners (some of whom you may have seen around this blog or its neighborhood) like Andrew Gelman, Hadley Wickham, Michal Migurski, Jeffrey Heer, and still more… I’m impressed just by the talent-gathering-and-organizing operation. Big kudos to editors Toby Segaran and Jeff Hammerbacher, and O’Reilly’s Julie Steele.

I also have an apparently secret code that gets you a discount, so email me if you want it.  I wonder if I’m not supposed to give out many of them.  Hm.