Don’t MAWK AWK - the fastest and most elegant big data munging language!

When one of these newfangled “Big Data” sets comes your way, the very first thing you have to do is data munging: shuffling around file formats, renaming fields and the like. Once you’re dealing with hundreds of megabytes of data, even simple operations can take plenty of time.

For one recent ad-hoc task I had — reformatting 1GB of textual feature data into a form Matlab and R can read — I tried writing implementations in several languages, with help from my classmate Elijah. The results really surprised us:

 Language   Time (min:sec)  Speed (vs. gawk) Lines of code Notes Type
mawk 1:06 7.8x 3 Mike Brennan’s Awk, system default on Ubuntu/Debian Linux. VM

java 1:20 6.4x 32 version 1.6 (-server didn’t matter) VM+JIT

c-ish c++ 1:35 5.4x 42 g++ 4.0.1 with -O3, using stdio.h Native

python 2:15 3.8x 20 version 2.5, system default on OSX 10.5 VM

perl 3:00 2.9x 17 version 5.8.8, system default on OSX 10.5 VM

nawk 6:10 1.4x 3 Brian Kernighan’s “One True Awk”, system default on OSX, *BSD ?

c++ 6:50 1.3x 48 g++ 4.0.1 with -O3, using fstream, stringstream Native

ruby 7:30 1.1x 22 version 1.8.4, system default on OSX 10.5; also tried 1.9, but was slower Interpreted

gawk 8:35 1x 3 GNU Awk, system default on RedHat/Fedora Linux Interpreted

To be clear, the problem is to take several files of (item name, feature name, value) triples, like:

000794107-10-K-19960401 limited 1
000794107-10-K-19960401 colleges 1
000794107-10-K-19960401 code 2
...
004334108-10-K-19961230 recognition 1
004334108-10-K-19961230 gross 8
...

And then rename items and features into sequential numbers as a sparse matrix: (i, j, value) triples. Items should count up from inside each file; but features should be shared across files, so they need a shared counter. Finally, we need to write a mapping of feature IDs back to their names for later inspection; this can just be a list.

This task is simple, but it’s representative of many data munging tasks out there. It inputs and outputs textual data. It’s probably one-off. The algorithm is easy — especially since it’s a subtask of something larger — but still complex enough you’ll need a debug cycle or two. You want to get it done fast so you can get on to the real work. Complex programming tools, like debuggers, are of little use — you figure out what’s going on by inspecting the output. Complex data processing environments, like Hadoop or an RDBMS, are also of little use — you have to munge in the first place to load data into them.

It turns out, this is a task AWK was made for. It’s a language dating from the original Bell Labs Unix era — circa 1977 — and it’s extremely specialized for processing delimited text files in a single pass. Perl was created in part to supersede it, but for this core use case, Awk is much more elegant and clearer. The implementation here is only 8 lines of code, expanded from merely 3 when I first wrote it.

Since it’s a standardized language, many implementations exist. One of them, MAWK, is incredibly efficient. It outperforms all other languages, including statically typed compiled ones like Java and C++! It wins on both LOC and performance criteria — a rare feat indeed, transcending the usual competition of slow-but-easy scripting languages versus fast-but-hard compiled languages.

[There's another big pro-VM story here: Java beat C++, both in LOC/ease-of-programming as well as performance. C++ was, as usual, a total nightmare to write. What you don't see in the LOC numbers is the sheer amount of time spent googling through every weird issue. For example, apparently g++ requires you to define the hash function in order to use a hash_map of string keys. Or, there are a zillion different ways to split a string, none of them standard. Then there are 2 different I/O and 2 different string libraries given its C heritage, and if you make the wrong choices, performance is terrible. I'm totally sure that given some more rewriting, the C++ implementation can be made the fastest. It's just a question of how much pain you go through to find the right rewrites. All the other implementations were written in the most straightforward, simplest way possible. C++ abjectly fails the "get it done quick" criterion.]

My most pleasant surprise learning Awk was its shortcuts for reading and writing files. Like shell, there is no concept of a file handle — you don’t open and close files, you just specify the filename and the VM figures it out. Even Perl, king of syntactic shorthand, doesn’t have this useful feature; and even Ruby, with its elegant open()-block-cleanup construct, is clunkier. It sounds minor, but this eliminates a number of bugs you can make in scripts.

But what I most appreciate about Awk is the discourse structure of its programs: every clause is a potentially conditional action to be performed with the current record. If you want actions to be taken upon program start or exit, you declare special clauses for that. Awk manages all these features while being staying incredibly small and simple — the advantages of being a domain-specific language. I think it feels a little more like a super-flexible, index-challenged version of SQL than it does a standard scripting language. I suspect Awk’s simplicity and specialization is part of why Mike Brennan was able to make Mawk so insanely fast. If your only datatypes are strings and hashes, then compile-time type inference is pretty easy.

Awk is also well-suited to the “Disk is the New Tape” era. That is, right now, hard drive sizes are rapidly growing — allowing very large datasets — but random access seek times aren’t catching up. In this setting, the only way to process data will be via linear scans, accessing one item of data at a time. (E.g. running variance, online learning, column stores, etc.) This is the core philosophy behind Hadoop’s computation model — and Awk’s. If hard drives are like tape drives, then it’s worth looking in to other blast-from-the-past technologies! (Similar point about SAS, in fact.)

There are many Awk tutorials on the web. This one is decent, though I strongly recommend Ken Church’s classic tutorial Unix for Poets. It shows how to do all sorts of great things with Unix text processing tools, including Awk.

All the code, results, and data can be obtained at github.com/brendano/awkspeed. I’d love to see results for more languages. And I hope someday someone tries writing an LLVM Awk — will it be even faster?

18 comments to “Don’t MAWK AWK - the fastest and most elegant big data munging language!”

  1. Adding units for the timings to the blog post would help put it into context.

  2. Cool! I’ll definitely have to learn more about AWK’s awesomeness.

  3. Pete - fixed, thanks. Anything else confusing?

  4. Two points about Java:

    - It takes only a few lines to separate reading and splitting lines to different thread and have it communicate with main tread through BlockingQueue

    - Java’s standard string split is slow. Essentially it compiles the column separator as a regular expression and uses it every time it is called. Rolling your own split function *should* be unnecessary but having one around is useful.

    For one time use these tricks are not really necessary but when you have to read several gigabytes of tab-separated data several times in a row…

    The separate read/split thread could probably be made an integral part of AWK.

  5. You should try doing a simple ruby –help before posting such nonsense. I’m sure you missed similar features in other languages. The unix way is to pipe the file and processes stdin. No need to open and close a file. I didn’t read the entire article, but if you didn’t you should post all your code.

  6. Err, you should read the article before calling BS. He’s reading from multiple files and writing back to similarly named files. If there’s a way to write to multiple files in that manner, while using some global values, I don’t want to see it in shell, it would look almost as much like snoopy barf as perl.

  7. Got your perl tweaked down to around 1:25s avg: http://gist.github.com/184768. Made it a bit more idiomatic modern Perl and then optimized (reduced hash lookups to once per line by stashing the value off to a lexical scalar, yadda yadda yadda).

    Interesting post though; now I need to grab mawk for RHEL boxen at work and see what speed gains it gets vs the stock awk.

  8. Fletch, thanks! That’s great. Obviously I don’t know much perl :)

    OJ, I never thought about using threads like that. Interesting. Have you implemented something like this before? And assuming you have 2 cores available, how much can it increase performance?

  9. Actually, a Perl script can look a lot like an awk script if we exploit
    some of Perl’s command line arguments … see the perlrun man page
    and variables … see the perlvar man page.

    Building on Fletch’s rewrite and writing the vocabulary to stderr
    here is a two line Perl script to do the job.


    #!/usr/bin/perl -ani_org
    # USAGE: ./me file1 file2 file3 ... 2>vocab # bash style redirect of stderr

    exists $jm{$F[1]} or print STDERR “$F[1]\n”;
    print(($im{$ARGV,$F[0]} ||= ++$i{$ARGV}).’ ‘.($jm{$F[1]} ||= ++$j).” $F[2]\n”);

    Where ‘-a’ causes perl to do an awk-like automatic split in to @F
    ‘-n’ causes perl to put a loop around the script
    ‘-i_orig’ cause perl to update each file “in place”
    first backing up each file to file_orig
    ‘$ARGV’ is the name of the file currently being read
    ‘$;’ we are relying on the default for this not be in
    any of the item names ($F[0]) or file names ($ARGV)
    in the expression $im{$ARGV,$F[0]}

    As a possible tweak for speed we could remove some of the hashing
    by resetting %imap at the top of each file by adding one line of code.
    This is a rather crude way of what Fletch did with ‘my’ variables.
    Here we get rid of one more hash access than Fletch did.
    But he could do that in his code also.


    #!/usr/bin/perl -ani_org
    # USAGE: ./me file1 file2 file3 ... 2>vocab # bash style redirect of stderr

    if ($curfile ne $ARGV) { undef %imap; $i = 0; $curfile = $ARGV }
    exists $jm{$F[1]} or print STDERR “$F[1]\n”;
    print(($im{$F[0]} ||= ++$i).’ ‘.($jm{$F[1]} ||= ++$j).” $F[2]\n”);

    Okay, I’m stretching the definition of a line …

    If one dislikes the structure of the print statements, we could
    exploit some special variables
    ‘$,’ sets the output field separator
    ‘$\’ sets the output record separator
    and write


    #!/usr/bin/perl -ani_org
    # USAGE: ./me file1 file2 file3 ... 2>vocab # bash style redirect of stderr

    BEGIN { $, = ' '; $\ = "\n" }
    if ($curfile ne $ARGV) { undef %imap; $i = 0; $curfile = $ARGV }
    exists $jm{$F[1]} or print STDERR $F[1];
    print(($im{$F[0]} ||= ++$i), ($jm{$F[1]} ||= ++$j), $F[2]);

    If we rejected writing the vocabulary to stderr we add to the BEGIN statement

    #!/usr/bin/perl -ani_org
    # USAGE: ./me file1 file2 file3 ...

    BEGIN {
    open(VOCAB, '>vocab') or die $!;
    $, = ' ';
    $\ = "\n";
    }
    if ($curfile ne $ARGV) { undef %imap; $i = 0; $curfile = $ARGV }
    exists $jm{$F[1]} or print STDERR $F[1];
    print(($im{$F[0]} ||= ++$i), ($jm{$F[1]} ||= ++$j), $F[2]);

    Finally, if we don’t like the way the ‘-i’ option works we can
    add one line to do it our selves


    #!/usr/bin/perl -an
    # USAGE: ./me file1 file2 file3 ...

    BEGIN {
    open(VOCAB, '>vocab') or die $!;
    $, = ' ';
    $\ = "\n";
    }
    if ($curfile ne $ARGV) { # starting new file
    open(STDOUT, '>', $ARGV . 'n') or die $!;
    undef %imap;
    $i = 0;
    $curfile = $ARGV
    }
    exists $jm{$F[1]} or print VOCAB $F[1];
    print(($im{$F[0]} ||= ++$i), ($jm{$F[1]} ||= ++$j), $F[2]);

  10. [...] Don’t MAWK AWK – the fastest and most elegant big data munging language! [...]

  11. As a general comment: I’m actually *really* surprised at how well Java is doing here. I’ve spent more time than I’d like to admit doing this sort of task in Java, and inevitably what happens when I need it to be repeatable is that I have to tear apart java.io and hand roll specific implementations, and the code typically becomes at a minimum twice as fast and usually more when I do. There’s too much in memory copying and the unicode support is a real slow down if you don’t need it.

    On the String.split front: An easy speedup is to do static final Pattern WHITESPACE = Pattern.compile(” “); and use WHITESPACE.split(string) instead. It seemed to shave about 10 seconds off the runtime on my computer, which left it still slower than awk.

  12. I’ve done a faster ruby version. http://gist.github.com/185234.
    I’m on a linux-x86_64, 3ghz Penom II, gawk seems much faster on my box

    33.8s mawk
    36.3s gcc c
    51.0s java
    67.0s perl Fletch.pl
    71.7s python
    87.8s perl
    95.8s nawk
    101.4s gawk
    114.0s gcc
    133.0s ruby1.9 eay.rb
    136.8s ruby1.8 eay.rb
    327.6s ruby1.8
    372.9s ruby1.9

  13. Brendano: On my dual core machine, the difference between threaded and non-threaded version is around 25%. I’m not sure why the difference is not any greater as this thing puts both CPUs near full load. I thought it could be thread communication overhead, but packing multiple lines into larger packets did not help.

    It is interesting that the single thread version takes CPU load over 100%, probably due parallelized GC.

    The code is available here. Uglier and more wordy than I thought.

  14. I’ve just hit this post and I hope I’m still in time to comment on it. I’ve been using AWK for more than a decade now, to the point that I have written a whole application development framework for Web applications, largely based on AWK. I confirm that MAWK is lightening fast but I’m getting the impression it is no longer actively developed. I hope I’m wrong. MAWK is still affected by a few bugs that seem to have been around for years now, and nobody seems to be interested with fixing them. I have just stumbled upon this one.

  15. It seems to me that the issue of processing speed is almost irrelevant for data formating tasks. All tools seem fast enough. For a typical new data set, one needs to do it once and then never do it again. A lot more time is spent on writing and testing the code to accomplish this task. Based on these presumptions, the AWK implementations still beat everything else. AWK was designed for digging through formated text files and it’s nice to see that it still excels at such tasks.

  16. I am an AWK fan and I use gawk. I didn’t knew that mawk it wasn’t interpreted language, so I gave it a try, inspired by your article.

    I did some tests but I miss certain functionality like gensub, asort, asorti functions .

    Are there any mawk functions I can use as a replacement of the above ones?

  17. Not that I know of…

  18. This is kind of silly. The purpose of awk, etc is for a quick and dirty. Mawk, like dash is broken in subtle ways. Always script to standard available utilities or pay the price.

    Why on earth would you use Ruby for this? That’s just plain ignorant, although vanilla PHP might be interesting. Assuming your data set is formatted as you specified, C code follows. Linear search might be an issue if nelem gets overlarge, but likely not even close to a bottleneck. 140 odd lines with comments, with assumptions galore.


    // File: 2num.c
    //
    // Last Modified: 2010-03-04 02:54:53
    //
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #define BUFSIZE 65535
    char big_buf[BUFSIZE + 1];

    #define LIST_BREAK 16384
    struct list_struct {
    char **list;
    int size,ix,new;
    } list1,list2;

    int in_fd,out_fd,vocab;

    //———————————————————————-
    // CHECK_LIST
    // Simple Linear search is faster up to around 16K elements
    //———————————————————————-
    check_list(struct list_struct *lp, char *val) {
    char **p;
    int ix;
    for(p = lp->list,ix = 0; ix ix && lp->list[ix] != NULL; ix++) {
    if(strcmp(lp->list[ix],val) == 0) {
    return(ix + 1);
    break;
    }
    }
    // Not there
    if(lp->ix > lp->size) {
    lp->list = realloc(lp->list,lp->size + LIST_BREAK);
    }
    lp->list[lp->ix++] = strdup(val);
    lp->new = 1;
    return(lp->ix);
    }

    //———————————————————————-
    // PROCESS
    //———————————————————————-
    void process(char **vals) {
    int n1,n2;
    char obuf[32];

    n1 = check_list(&list1,vals[0]);
    n2 = check_list(&list2,vals[1]);
    if(list2.new) { // Optmizeme into a buffer
    write(vocab,vals[1],strlen(vals[1]));
    write(vocab,”\n”,1);
    list2.new = 0;
    }
    // Optimize into a buffer …
    sprintf(obuf,”%d %d %s\n”,n1,n2,vals[2]);
    write(out_fd,obuf,strlen(obuf));
    }

    //———————————————————————-
    // PARSE
    //———————————————————————-
    char **parse(char *line) {
    static char *p[3];
    p[0] = line;
    p[1] = index(line,’ ‘);
    if(p[1] == NULL) {
    p[1] = “”;
    p[2] = “”;
    return(p);
    }
    *(p[1])++ = ”;
    p[2] = index(p[1],’ ‘);
    if(p[2] == NULL) {
    p[2] = “”;
    } else {
    *(p[2])++ = ”;
    }
    return(p);
    }

    //———————————————————————-
    // MAIN
    //———————————————————————-
    main(int argc, char *argv[]) {
    int ix, iy, bytes_read,buflen;
    char *p,*q, **vals;
    char buf[4096];

    list1.list = malloc(LIST_BREAK);
    list1.size = LIST_BREAK;
    list1.ix = 0;
    list1.new = 0;

    list2.list = malloc(LIST_BREAK);
    list2.size = LIST_BREAK;
    list2.ix = 0;
    list2.new = 0;

    vocab = open(”vocab”,O_WRONLY|O_CREAT|O_TRUNC,0666);
    if(vocab == -1) {
    perror(”Open vocab”);
    exit(-1);
    }
    for(ix = 1; ix 0;) {
    // Fill the buffer
    for(buflen = 0; buflen < BUFSIZE;) {
    bytes_read = read(in_fd, big_buf + buflen,BUFSIZE - buflen);
    if(bytes_read 0 && p != NULL;) { // Parse off lines and process them
    q = index(p,’\n’);
    if(q == NULL) { // Read Some More
    buflen = strlen(p);
    memcpy(big_buf,p,buflen);
    break;
    }
    *q++ = ”;
    vals = parse(p);
    process(vals);
    p = q;
    }
    }
    close(in_fd);
    close(out_fd);
    list1.ix = 0;
    }
    close(vocab);
    }


Leave a comment

XHTML - You can use:<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>