Cognition, systems, decisions, visualization, machine learning, etc.
About
This is a blog on artificial intelligence and social science — call it "Social Science++", with an emphasis on computer science and statistics. My general website is anyall.org.
Michael Jackson just died while Iran is in turmoil. I am reminded of a passage in Marjane Satrapi’s wonderful graphic novel Persepolis, a memoir of growing up in revolutionary Iran in the 80’s.
(Read the book to see how it ends.)
I wonder how much coincidences of news event timing can influence perceptions. Clearly, large news stories can crowd out other ones. Are there any other effects of joint appearances? Celebrity deaths are fairly exogenous shocks — there might be a nice natural experiment somewhere here.
It is rather surprising that systematic studies of human abilities were not undertaken until the second half of the last century… An accurate method was available for measuring the circumference of the earth 2,000 years before the first systematic measures of human ability were developed.
–Jum Nunnally, Psychometric Theory (1967)
(Social science textbooks from the 60’s and 70’s are rad.)
I think I like the top better, without the map lines, like those night satellite photos: pointwise ghosts of high-end human economic development.
This data is a fairly extreme sample of convenience: I’m only looking at tweets posted by certain types of iPhone clients, because they conveniently report exact gps-derived latitude/longitude numbers. (search.twitter.com has geographic proximity operators — which are very cool! — but they seem to usually use zip codes or other user information that’s not available in the per-tweet API data.) So there’s only 30,000 messages out of 1.2 million spritzer tweets over ~3 days (itself only a small single-digit percentage sample of twitter).
Will Fitzgerald just wrote about an excellent article by Steven Strogatz on Zipf’s Law for the populations of cities. If you look at the biggest city, then the next biggest city, etc., there tends to be a power-law fall-off in size.
I was wondering what this looks like, so here’s the classic zipfian plot (log-size vs. log-rank) for city population data from populationdata.net:
If you fit a power law — that is, a line on the above logsize-logrank plot — you can use rank to predict the sizes of smaller cities very accurately, according to Will’s analysis. Larger cities are more problematic, lying off the line.
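Here is a minimal sketch of what "fitting a line on the logsize-logrank plot" means, written in Python rather than the R used for the plots in this post. The function names and the toy sizes are made up for illustration; this is not the populationdata.net data and not Will's analysis, just ordinary least squares on (log rank, log size).

```python
import math

def fit_power_law(sizes):
    """Least-squares line through (log rank, log size).

    Returns (slope, intercept) of log(size) = intercept + slope * log(rank).
    Needs at least two sizes, all positive.
    """
    ranked = sorted(sizes, reverse=True)  # rank 1 is the biggest city
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(size) for size in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict_size(rank, slope, intercept):
    """Predicted size at a given rank under the fitted power law."""
    return math.exp(intercept + slope * math.log(rank))

# Toy data (an assumption, not real cities): size = 1,000,000 / rank exactly.
toy_sizes = [1_000_000 / r for r in range(1, 21)]
slope, intercept = fit_power_law(toy_sizes)
```

On the toy data the fitted slope comes out at about -1, the textbook Zipf case. Real city data will not sit exactly on the line, especially at the top ranks.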
I was curious whether the power law holds within countries as well. The above plot was only for the countries that had more than 10 cities in the dataset — just eight countries. So here are those same cities again, but plotted against ranks within their respective countries.
The answer is — usually, yes, the power law looks like it holds within countries as well. (Country names are French in this data … Etats-Unis = USA, Allemagne = Germany, etc.) Russia seems to have the biggest difference between its head vs. tail cities. The tail cities have the linear logsize-logrank relationship, but the top 3 cities (Moscow, St. Petersburg, Nizhny Novgorod) seem to have their own different slope.
If you randomly subsample out of a Zipf distribution, the samples will be Zipfian as well, so this isn’t too surprising. If, on the other hand, you’re a fan of theories that power law population relationships might happen as a result of the structural dynamics of growth — for example, winners-win (i.e. rich-get-richer) growth patterns can sometimes result in zipf-distributed sizes — then there’s a case that these dynamics might be happening at both the world and country levels.
Also: this is the first time I’ve used Hadley Wickham’s ggplot2 and it was great. All of the fun of lattice minus a lot of the pain, plus default display options that aren’t ugly as hell :)
Update: alternative view of those two above graphs.
This was brought to you via the following R code: Read more »
Last week, I, with my awesome friends David Ahn and Mike Krieger, finished hacking together an experimental prototype, TweetMotif, for exploratory search on Twitter. If you want to know what people are thinking about something, the normal search interface search.twitter.com gives really cool information, but it’s hard to wade through hundreds or thousands of results. We take tweets matching a query and group together similar messages, showing significant terms and phrases that co-occur with the user query. Try it out at tweetmotif.com. Here’s an example for a current hot topic, #WolframAlpha:
It’s currently showing tweets that match both #WolframAlpha as well as two interesting bigrams: “queries failed” and “google killer”. TweetMotif doesn’t attempt to derive the meaning or sentiment toward the phrases — NLP is hard, and doing this much is hard enough! — but it’s easy for you to look at the tweets themselves and figure out what’s going on.
Here’s another fun example right now, a query for Dollhouse:
I love that the #wolframalpha topic has “infected” the dollhouse space. Someone pointed out a connection between them, but really they’re connected through bot spam. TweetMotif’s duplicate detection algorithm found 22 messages here where each is basically a list of all the trending topics. This seems to be a popular form of twitter spambots.
I learned a ton making this system, and I’ll try to write more about the technical details in a future post. It’s interesting to hear people speculate on how it works; everyone gives a different answer. I guess this goes to show you that search/NLP is still a pretty unsettled, not-completely-understood area.
There are lots of interesting TweetMotif examples. More prosaic, less news-y queries like sandwich yield cool things like major ingredients of sandwiches and types of sandwiches. (These are basically distributional similarity candidates for synonym and meronym acquisition, though a bit too noisy to use in their current form.) And in a few cases, like for understanding currently unfolding events, TweetMotif might even be useful! It would be nice to expand the set of usefully served queries. We’re occasionally posting interesting queries at twitter.com/tweetmotif.
And oh yeah. We have a beautiful iPhone interface!
Check it out folks. This is a functional prototype, so you can play with it right now at tweetmotif.com.
I’m doing word and bigram counts on a corpus of tweets. I want to store and rapidly retrieve them later for language model purposes. So there’s a big table of counts that get incremented many times. The easiest way to get something running is to use an open-source key/value store; but which? There’s recently been some development in this area so I thought it would be good to revisit and evaluate some options.
Here are timings for a single counting process: iterate over 45,000 short text messages, tokenize them, then increment counters for their unigrams and bigrams. (The speed of the data store is only one component of performance.) There are about 17 increments per tweet: 400k unique terms and 750k total count. This is substantially smaller than what I need, but it’s small enough to easily test. I used several very different architectures and packages, explained below.
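To make the counting workload concrete, here is a rough sketch in Python. It uses an in-memory Counter as a stand-in for a key/value store, so it is not one of the packages that were timed. The whitespace tokenizer and the two toy messages are assumptions for illustration only.

```python
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace splitting. The real tokenizer is not
    # described here and is likely more careful about punctuation.
    return text.lower().split()

def count_ngrams(messages):
    """Increment one counter per unigram and per bigram in each message."""
    counts = Counter()
    for message in messages:
        tokens = tokenize(message)
        counts.update((tok,) for tok in tokens)   # unigrams, keyed as 1-tuples
        counts.update(zip(tokens, tokens[1:]))    # bigrams, keyed as 2-tuples
    return counts

# Toy messages, made up for this example.
toy_messages = ["google killer or not", "not a google killer"]
counts = count_ngrams(toy_messages)
```

Each message with n tokens causes n unigram increments and n - 1 bigram increments, which is where a per-tweet increment count comes from. Swapping the Counter for an external store turns every one of those increments into a read-modify-write against that store.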
This is fun — Jamie Callan’s group at CMU LTI just finished a crawl of 1 billion web pages. It’s 5 terabytes compressed — big enough so they have to send it to you by mailing hard drives.
One of their motivations was to have a corpus large enough such that research results on it would be taken seriously by search engine companies. To my mind, this raises the question of whether academics should try to innovate in web search, when it’s a research area incredibly dependent on really large, expensive-to-acquire datasets. And what’s the point? To slightly improve Google someday? Don’t they do that pretty well themselves?
On the other hand, having a billion web pages around sounds like a lot of fun. Someone should get Amazon to add this to the AWS Public Datasets. Then, to process the data, instead of paying to get 5 TB of data shipped to you, you instead pay Amazon to rent virtual computers that can access the data. This costs less only to a certain point, of course.
It always seemed to me that a problem with Amazon’s public datasets program is that they want data that’s genuinely large enough you need to rent lots of computing power to work on it; but there are very few public datasets large enough to warrant that. (For example, they have Freebase up there, but I think it’s slightly too small to justify that; e.g. I can fit all of freebase just fine on my laptop and run a grep on it in like 5 minutes flat.) But 1 billion web pages is more arguably appropriate for this treatment.
The bigger problem with big-data research initiatives is that organizations with petabyte-scale data are always going to keep it private; e.g. from giant corporations — walmart retail purchase records, or the facebook friend graph, or google search query logs — or else from governments of course. Maybe biology and computational genetics is the big exception to this tendency. At least the public data situation for web research just got a lot better.
A binary classifier makes decisions with confidence levels. Usually it’s imperfect: wherever you put a decision threshold, some items will fall on the wrong side, and those are the errors. I made this diagram a while ago for Turker voting; the same principle applies for any binary classifier.
So there are a zillion ways to evaluate a binary classifier. Accuracy? Accuracy on different item types (sens, spec)? Accuracy on different classifier decisions (prec, npv)? And worse, over the years every field has given these metrics different names. Signal detection, bioinformatics, medicine, statistics, machine learning, and more I’m sure. But in R, there’s the excellent ROCR package to compute and visualize all the different metrics.
I wanted to have a small, easy-to-use function that calls ROCR and reports the basic information I’m interested in. For preds, a vector of predictions (as confidence scores), and labels, the true labels for the instances, it works like this:
> binary_eval(preds, labels)
These are four graphs showing variation of classifier performance as the cutoff changes. Read more »
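For reference, here is a small sketch of the metrics named above at one fixed cutoff. It is plain Python, not ROCR and not the binary_eval function from this post. The function name, the "score >= cutoff means positive" convention, and the toy scores are all assumptions for illustration.

```python
def confusion_metrics(preds, labels, cutoff):
    """Accuracy, sens, spec, prec and npv at a single decision cutoff.

    preds  : confidence scores, higher means more likely positive
    labels : true labels, 1 (positive) or 0 (negative)
    """
    tp = fp = tn = fn = 0
    for score, label in zip(preds, labels):
        predicted_positive = score >= cutoff  # assumed convention
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 1:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        # Undefined when the denominator is empty (e.g. no positive calls).
        return num / den if den else float("nan")

    return {
        "acc": ratio(tp + tn, tp + fp + tn + fn),
        "sens": ratio(tp, tp + fn),  # accuracy on truly positive items
        "spec": ratio(tn, tn + fp),  # accuracy on truly negative items
        "prec": ratio(tp, tp + fp),  # accuracy of positive decisions
        "npv": ratio(tn, tn + fn),   # accuracy of negative decisions
    }

# Toy scores and labels, made up for this example.
toy_preds = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0]
metrics = confusion_metrics(toy_preds, toy_labels, cutoff=0.5)
```

Sweeping the cutoff over the range of scores and recomputing these numbers each time gives the kind of performance-versus-cutoff curves described above.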