Calculating running variance in Python and C++

It’s fairly obvious that an average can be calculated online, but interestingly, there’s also a way to calculate a running variance and standard deviation. Read all about it here.

I’m playing around with the Netflix Prize data of 100 million movie ratings, and a huge problem is figuring out how to load and calculate everything in memory. I’m having success with NumPy, the numeric library for Python, because it compactly stores arrays with C/Fortran binary layouts. For example, 100 million 32-bit floats = 100M * 4 = 400MB of memory, which is manageable. And it’s much easier to play around interactively in ipython/matplotlib rather than write C++ for everything.

Unfortunately, the simple ways to calculate variance on an array of that size create wasteful intermediate data structures as long as the original array.

>>> mean( (x-mean(x)) ** 2 )            # two intermediate structures
>>> tmp=x-mean(x); tmp**=2; mean(tmp)   # one intermediate structure

That’s an extra 400 or 800 megs of memory being thrown around. (And if x was an array of integers, the x-mean(x) step implicitly converts to 64-bit doubles which, well, doubles things again!)

So, following John Cook’s explanation, I wrote running_stat, a C++ and Python implementation of running variance. It takes almost no memory and is faster than NumPy’s native variance function. Demo:

In [1]: from numpy import *
In [2]: x = arange(1e8)                      # python RSIZE = 774 MB

In [3]: timeit -n1 -r5 std(x)                # RSIZE goes as high as 2.2 GB
1 loops, best of 5: 4.01 s per loop

In [4]: import running_stat
In [5]: timeit -n1 -r5 running_stat.std(x)   # RSIZE = 774 MB the whole time
1 loops, best of 5: 1.66 s per loop

The C++ implementation is very simple and can be ripped out of running_stat.cc.

Link: github.com/brendano/running_stat

I wonder if Haskell’s laziness, perhaps along with my friend Patrick’s Haskell BLAS bindings, might magically solve some of the memory usage in the naive implementation.

4 comments to “Calculating running variance in Python and C++”

  1. Thanks for the plug but BLAS is pretty irrelevant here. Any decent array library will support a fold opration to give you the result with a constant amount of memory.

    Also, in python, couldn’t you just use a for loop?

  2. could use a python for loop, but it’s slower than having numpy do an array operation because it’s in C :)

    i guess figuring out how to push array operation into low-level libraries is a step simpler than matrix operations in low-level libraries…

  3. I thought you were after an windowed variance.

    If you are ever interested, you can compute the windowed max-min in constant time, irrespective of the window size! This is not entirely trivial, by the way. See my code:

    http://code.google.com/p/maxminfilters/

  4. Pretty cool, thanks!

    Is windowed variance even harder?


Leave a comment

XHTML - You can use:<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>