Faking big arrays in matplotlib

18 Jul 2012 // programming

Sometimes, in your number-crunching, you may deal with large arrays. This causes problems when you try to graph your results.

In Python, I use the venerable numpy library, which quite happily copes with, memory permitting, arrays that are tens of thousands long.

Unfortunately, if you pass these large numpy arrays into the matplotlib library to generate pretty pictures, matplotlib will choke.

The first time I encountered this, my terminal froze as my macbook air used up all its RAM and I lost my mouse. It took me a while to figure out where the problem was but I finally traced it to Matplotlib. I then tried all sorts of things to reduce the memory load (I was looping over a bunch of images of ~1000×1000 array of floats) including all sorts of sheningans like ordering the garbage collector to clean up at the end of each loop.

Recently, I did some graphing and came across the same problem, except I now had much larger numpy arrays (mapping RNA reads onto genes ~100,000 base-pairs). Even with shitty hacks, Matplotlib still choked.

And then I figured out the solution. The trick was to precalculate an intermediate array that was bigger than the screen coordinate but much smaller than the original coordinates. After all, that's what Matplotlib does internally. In my particular problem, I wanted to display histograms. So here is a squash_array function that takes a histogram, and returns a new histogram that is much shorter of target width n_new_hist:

def squash_array(hist, n_new_hist):
  new_heights = numpy.zeros(n_new_hist)
  new_counts = numpy.zeros(n_new_hist)
  n_hist = len(hist)
  ratio = float(n_new_hist)/float(n_hist)
  for i_hist in range(n_hist):
    left = i_hist*ratio
    right = (i_hist+1)*ratio
    i_new_hist = int(math.floor(left))
    j_new_hist = int(math.floor(right))
    if j_new_hist == n_new_hist or i_new_hist == j_new_hist:
      new_heights[i_new_hist] += hist[i_hist]
      new_counts[i_new_hist] += 1.0
    else:
      # if this bin is sandwiched between two new bins
      # then have to work out the ratios of the two new bins
      # and add only a fraction to each
      left_part = (j_new_hist - left)/ratio
      right_part = (right - j_new_hist)/ratio
      new_heights[i_new_hist] += hist[i_hist]*left_part
      new_counts[i_new_hist] += left_part
      new_heights[j_new_hist] += hist[i_hist]*right_part
      new_counts[j_new_hist] += right_part
  new_hist = numpy.zeros(n_new_hist)
  for i_new_hist in range(n_new_hist):
    new_hist[i_new_hist] = new_heights[i_new_hist]/new_counts[i_new_hist]
  return new_hist

Of course, the new histogram will now have width n_new_hist. As well, if your original histogram was in whole counts, the new histogram will have fractional counts to cope with bins that straddle the boundaries of the original bins. In that case, the new histogram will take a bit from one, and a bit from the other. When you plot this with matplotlib, the ticks will be all wrong! Never fear, here's my tick faking routine. First we set the x ticks as the original histogram and extract the parameters (this assumes you've imported matplotlib as pylab):

# set original x ticks
pylab.xlim([0, n_hist])
locs, labels = pylab.xticks()
n_tick = len(locs)
labels = ["%.f" % l for l in locs]

Then we generate the fake ticks and set those instead.

# calculate new fake locs to fake x ticks
interval = n_new_hist*(locs[1]-locs[0])/n_hist
locs = [i*interval for i in range(n_tick)]
pylab.xticks(locs, labels)
pylab.xlim([0, n_new_hist])

And now when you plot the histogram, everything will look as if you plotted the original histogram (make sure you fix the ticks after plotting the histogram):

pylab.bar(range(n_new_hist), squash_array(hist, n_new_hist))

If you choose n_new_hist well, then your system will be able to plot a 100,000 bin histogram just as easily as a 500 bin histogram.