Calculating running variance in Python and C++
It’s fairly obvious that an average can be calculated online, but interestingly, there’s also a way to calculate a running variance and standard deviation. Read all about it here.
I’m playing around with the Netflix Prize data of 100 million movie ratings, and a huge problem is figuring out how to load and calculate everything in memory. I’m having success with NumPy, the numeric library for Python, because it compactly stores arrays with C/Fortran binary layouts. For example, 100 million 32-bit floats = 100M * 4 = 400MB of memory, which is manageable. And it’s much easier to play around interactively in ipython/matplotlib rather than write C++ for everything.
Unfortunately, the simple ways to calculate variance on an array of that size create wasteful intermediate data structures as long as the original array.
>>> mean( (x-mean(x)) ** 2 ) # two intermediate structures >>> tmp=x-mean(x); tmp**=2; mean(tmp) # one intermediate structure
That’s an extra 400 or 800 megs of memory being thrown around. (And if x was an array of integers, the x-mean(x) step implicitly converts to 64-bit doubles which, well, doubles things again!)
So, following John Cook’s explanation, I wrote running_stat, a C++ and Python implementation of running variance. It takes almost no memory and is faster than NumPy’s native variance function. Demo:
In [1]: from numpy import * In [2]: x = arange(1e8) # python RSIZE = 774 MB In [3]: timeit -n1 -r5 std(x) # RSIZE goes as high as 2.2 GB 1 loops, best of 5: 4.01 s per loop In [4]: import running_stat In [5]: timeit -n1 -r5 running_stat.std(x) # RSIZE = 774 MB the whole time 1 loops, best of 5: 1.66 s per loop
The C++ implementation is very simple and can be ripped out of running_stat.cc.
Link: github.com/brendano/running_stat
I wonder if Haskell’s laziness, perhaps along with my friend Patrick’s Haskell BLAS bindings, might magically solve some of the memory usage in the naive implementation.
30. November 2008 at 8:06 am :
Thanks for the plug but BLAS is pretty irrelevant here. Any decent array library will support a fold opration to give you the result with a constant amount of memory.
Also, in python, couldn’t you just use a for loop?
30. November 2008 at 8:51 pm :
could use a python for loop, but it’s slower than having numpy do an array operation because it’s in C :)
i guess figuring out how to push array operation into low-level libraries is a step simpler than matrix operations in low-level libraries…
3. December 2008 at 5:51 pm :
I thought you were after an windowed variance.
If you are ever interested, you can compute the windowed max-min in constant time, irrespective of the window size! This is not entirely trivial, by the way. See my code:
http://code.google.com/p/maxminfilters/
3. December 2008 at 5:55 pm :
Pretty cool, thanks!
Is windowed variance even harder?