Moral psychology on Amazon Mechanical Turk
There’s a lot of exciting work in moral psychology right now. I’ve been telling various poor fools who listen to me to read something from Jonathan Haidt or Joshua Greene, but of course there’s a sea of too many articles and books of varying quality and intended audience. But just last week Steven Pinker wrote a great NYT magazine article, “The Moral Instinct,” which summarizes current research and tries to spell out a few implications. I recommend it highly, if just for presenting so many awesome examples. (Yes, this blog has poked fun at Pinker before. But in any case, he is a brilliant expository writer. The Language Instinct is still one of my favorite popular science books.)
For a while now I’ve been thinking that recruiting subjects online could lend itself to collecting some really interesting behavioral science data. A few months ago I tried doing this with Amazon Mechanical Turk, a horribly misnamed web service that actually lets you create web-based tasks and pay online workers to do them. Its canonical commercial applications include tedious tasks like search quality evaluation or image labeling, where you really need human data to perform well. You put up, say, several thousand images you want classified as “porn” or “not-porn”, say you’ll pay workers $0.01 to label ten images, then sit back and watch the data roll in.
So AMT advertises itself as a data annotation or machine learning substitute system, but I think its main innovation is finding out that there are lots and lots of people with free time willing to do online work for very, very low amounts of money. You can run any task you want, including surveys, and people happily respond for mere pennies. (Far below minimum wage, I might add — their motivation seems to be more like casual gaming or so.) To that end, I tried out running one of the standard moral psych survey questions to see what would happen — the so-called “trolley problem”:
A runaway trolley is hurtling down a track towards five people who have been tied down in its path. If nothing happens, they will be killed. Fortunately, you have a switch which would divert the trolley to a different track. Unfortunately, the other track has one person tied down to it. Should you flip the switch?
It’s supposed to be a classic dilemma of consequentialist vs. deontological moral reasoning. Is it acceptable to sacrifice for the greater good? Is it permissible to take an action that will cause a preventable death? And so on. I think it’s neat just because when I pose it to people, different folks really do disagree, give different answers, and are willing to argue about it. There are some interesting recent fMRI findings (due to Greene I think?) that people who refuse to flip the switch seem to be engaged in a more emotional response, whereas those who do flip it seem to be using deliberative reasoning systems. (Some, like Greene and Pinker, seem to go further and argue this is a substantive normative reason to favor flipping the switch; whether or not you feel like getting sucked into that debate, though, there’s clearly something interesting happening here.)
So I ran this on AMT; the participants (they call themselves “turkers”) had to answer yes or no. Turns out 77% say they’d flip the switch.
I also ran two variant scenarios of the same logical dilemma, to sacrifice one person to save five:
A trolley is hurtling down a track towards five people. You are on a bridge under which it will pass, and you can stop it by dropping a heavy weight in front of it. As it happens, there is a very fat man next to you - your only way to stop the trolley is to push him over the bridge and onto the track, killing him to save five. Should you proceed?
and
A brilliant transplant surgeon has five patients, each in need of a different organ, each of whom will die without that organ. Unfortunately, there are no organs available to perform any of these five transplant operations. A healthy young traveler, just passing through the city the doctor works in, comes in for a routine checkup. In the course of doing the checkup, the doctor discovers that his organs are compatible with all five of his dying patients. Suppose further that if the young man were to disappear, no-one would suspect the doctor. Should the doctor sacrifice the man to save his other patients?
These two, of course, feel a lot harder to say “Yes” to, but if you were willing to say “Yes” to the original question, it is hard to justify why you would say “No” here. The participants’ responses followed what you would expect: fewer said “Yes” to these scenarios. Here are the Yes/No responses to each of the questions (100 responses for each):
| Question | Yes | No |
|---|---|---|
| surgeon | 2 | 98 |
| fat man | 30 | 70 |
| switch, save 5 | 77 | 23 |
| switch, save 10 | 82 | 18 |
| switch, save 15 | 83 | 17 |
| switch, save 20 | 83 | 17 |
Only two people thought it was acceptable to sacrifice for organs, and fewer than half as many would push the fat man as would flip the switch (30 vs. 77). I also ran variants of the switch version with more and more people on the tracks; the Yes response creeps upwards but levels off at 83% and never reaches 100%. The differences among the first three questions are statistically significant (unpaired t-tests, all p<.001; this seems like the wrong test, can anyone correct me?).
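Since each answer is a yes/no, one alternative to a t-test is to compare proportions directly. Here is a minimal sketch of a pooled two-proportion z-test on the counts from the table above, using only the standard library. This is not the test that produced the p-values quoted above, and the function name is made up for illustration. It also assumes the 100 responses per question are independent, which is shaky here because some turkers answered more than one question, and the normal approximation is rough for the surgeon row (only 2 Yes answers), where an exact test would be safer.

```python
import math

def two_proportion_z_test(yes_a, n_a, yes_b, n_b):
    """Two-sided pooled z-test for a difference between two proportions.

    Hypothetical helper for illustration; assumes independent samples.
    """
    p_a = yes_a / n_a
    p_b = yes_b / n_b
    pooled = (yes_a + yes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Yes counts from the table above; each question had 100 responses.
N = 100
yes_counts = {"surgeon": 2, "fat man": 30, "switch, save 5": 77}

pairs = [
    ("surgeon", "fat man"),
    ("fat man", "switch, save 5"),
    ("surgeon", "switch, save 5"),
]

results = {}
for a, b in pairs:
    results[(a, b)] = two_proportion_z_test(yes_counts[a], N, yes_counts[b], N)
    z, p = results[(a, b)]
    print(f"{a} vs. {b}: z = {z:.2f}, p = {p:.2g}")
```

Under those assumptions all three pairwise comparisons still come out well below p = .001, so the conclusion does not seem to hinge on the choice of test.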
What’s amazing is how fast responses happen. I started getting responses just minutes after posting the question. I actually posted each of the six questions as a separate, standalone task; but many of the turkers who did one found the rest in the task pool and did them too. (So what was supposed to be a between-subjects design fell into something else, oops!) The whole thing cost $6 and was done in a matter of hours. It’s very encouraging — AMT allows you to very quickly iterate and try out different designs and such. It’s a bit of a pain to use, though; Amazon has certainly done a poor job in exploiting its full potential. (They have a form builder which was good enough to quickly write up these tasks, but to do anything moderately sophisticated, even just getting your data back out, you have to write programs against their somewhat mediocre API; you have to know how to use an XML parser, etc. Hm.)
I also tried an explicitly within-subject version, where each participant answered the three basic versions. I was interested in consistency — presumably very few people would sacrifice for organs but refuse to divert the trolley. For 141 participants, here are the frequencies of the different answer triples:
| % with this response triple | flip switch? | push fat man? | sacrifice traveler for organs? |
|---|---|---|---|
| 42.6 | Y | N | N |
| 29.8 | Y | Y | N |
| 20.6 | N | N | N |
| 5.0 | Y | Y | Y |
| 0.7 | Y | N | Y |
| 0.7 | N | Y | Y |
| 0.7 | N | Y | N |
I personally find the most common responses coherent with my own gut reactions — from left to right, I feel less and less good about sacrificing in each case. Perhaps all people feel the same gut reactions, and use different ad hoc reasons to draw the line in different places?
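To put a number on that consistency, here is a rough sketch that turns the percentages in the table back into counts and tallies how many participants gave an “ordered” triple, meaning they never said Yes to a scenario to the right of one they said No to. The counts are reconstructed by rounding the published percentages against the 141 participants, so treat them as an assumption rather than the raw data.

```python
# Percentages from the table above, keyed by (switch, fat man, surgeon).
triple_pct = {
    ("Y", "N", "N"): 42.6,
    ("Y", "Y", "N"): 29.8,
    ("N", "N", "N"): 20.6,
    ("Y", "Y", "Y"): 5.0,
    ("Y", "N", "Y"): 0.7,
    ("N", "Y", "Y"): 0.7,
    ("N", "Y", "N"): 0.7,
}
N_PARTICIPANTS = 141

# Assumption: counts recovered by rounding; not taken from the raw data.
counts = {t: round(pct / 100 * N_PARTICIPANTS) for t, pct in triple_pct.items()}

def is_ordered(triple):
    """True if no 'N' is followed by a 'Y' when reading left to right."""
    return all(not (a == "N" and b == "Y") for a, b in zip(triple, triple[1:]))

ordered = sum(c for t, c in counts.items() if is_ordered(t))

# Marginal Yes counts per scenario, in table column order.
marginal_yes = [
    sum(c for t, c in counts.items() if t[i] == "Y") for i in range(3)
]

print(f"reconstructed total: {sum(counts.values())}")
print(f"ordered triples: {ordered} of {N_PARTICIPANTS}")
for name, yes in zip(["switch", "fat man", "surgeon"], marginal_yes):
    print(f"{name}: {yes} Yes ({yes / N_PARTICIPANTS:.1%})")
```

With these reconstructed counts, 138 of the 141 participants gave an ordered triple, and the per-scenario Yes rates come out to roughly 78%, 36% and 6%.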
I’m sorry that this post started with neat moral psychology then degenerated into methodology, but hey it’s fun. I’ve seen only two instances of any sort of research paper being written using AMT, both by computer scientists; here’s a nice blog post on an information retrieval experiment (it’s a great blog, btw); and someone mentioned to me this one on data processing accuracy also. Anyone know of any others? It’s clearly an interesting approach.
25. April 2008 at 10:06 pm :
Interesting post! I linked to it here: http://boldlygo.org/blog/make-decisions-based-on-means-or-consequence/
15. August 2008 at 2:43 am :
We used AMT to do psychology experiments around summer of 2007, and the results are published in an ACM CHI conference article here:
http://www-users.cs.umn.edu/~echi/papers/2008-CHI2008/2008-02-mech-turk-online-experiments-chi1049-kittur.pdf
Aniket Kittur, Ed H. Chi, Bongwon Suh. Crowdsourcing User Studies With Mechanical Turk. In Proceedings of the ACM Conference on Human-factors in Computing Systems (CHI2008). (to appear). ACM Press, 2008. Florence, Italy.
15. August 2008 at 6:58 am :
Ed, thanks for the link to your paper! Everyone seems to be getting on the AMT wave :) We have a conference paper in review and if it’s accepted (or if it’s not, I suppose) I’ll post about it here…
Do you have the HITs or their templates saved anywhere? I’m curious to see the difference between the two different ones you ran, since you said you found big differences in the quality of Turker responses between them. I read through your CHI 2007 paper (“conflict and cooperation”) but couldn’t figure out exactly what the task was.
From reading your blog, looks like you folks have already discovered Panos Ipeirotis’s blog, and then perhaps ours (blog.doloreslabs.com). If you’ve heard or do any more cool AMT work I’d be eager to know.
12. October 2008 at 10:00 pm :
[...] have a response that is in some way the right one. Yet as Brendan O’Connor writes in this interesting article on the research into moral psychology, the scenario that surrounds a situation is all important [...]
10. November 2008 at 6:33 am :
Interesting experiment - thanks for sharing data. My question deals with worker motivation. Did you require workers to explain their positions? I.e., was the rationale field optional? If not, what extrinsic motivation would any worker have to read the problem? Wasn’t the quickest path to cashout a random click (of either Position A or Position B)?
10. November 2008 at 9:00 am :
Ryan: Nope, no rationales at all. There was no extrinsic motivation at all — you’re right, they could have done a random click and been fully rewarded.
I find the fact that this survey worked at all to be encouraging support for the use of MTurk for social science research.
4. April 2009 at 9:59 pm :
people who need organ transplants are likely going to be much sicker and have far shorter lives of lower quality than average people. moreover, in many cases people need organ transplants only due to risks they themselves assumed, such as alcoholics needing liver transplants.
Therefore, in the train example, the quality, length, and fault of each life saved or sacrificed are comparable. In the transplant example, however, they are not comparable: the healthy patient has no fault or other indications of sub-normal quality or length of life.
Therefore, the two examples are not comparable in that sacrificing the train victim does not implicate sacrificing the healthy patient.
19. April 2009 at 11:23 pm :
Very useful
21. April 2009 at 6:31 pm :
Unpaired t-tests sound fine to me. If you were doing a lot of them, you would want to adjust the critical threshold value. (With alpha at 5% and 20 t-tests, you would expect about one to test as significant when it is not.)
You could use ANOVA, which is designed for multiple t-tests and handles adjustments to critical values for you. It can also find interaction effects but you have no need for that here.
But for only three tests, I think t-tests are fine.
13. May 2009 at 1:44 pm :
I don’t know what to take from the results of this particular experiment, but makes me wonder what other experiments might be possible using MTurk.
Norman
1. July 2009 at 5:49 am :
Thats pretty cool info, would like to know more about this..!!