Parsing Large XML files, Serially, in Python

28 Jun 2013 // programming

So you've been told you have to read this bioinformatic data format, and you just realized that it's essentially one cluster-fuck of XML that's 750MB large.

As you might have figured out already, to read large XML files in one go:

import xml.etree.ElementTree as etree
data = etree.parse('a_very_big.xml')

will kill your memory and freeze your computer.

Before you slit your wrists in despair, let me tell you that in Python, it's not that bad. The good news is that the Python ElementTree module has a great API to handle this situation. The bad news is that nobody has put all the information in one place. Hopefully this article fixes that.

Iterate it

You need to read it in a chunk at a time. All the XML docs say use SAX, but don't. The ElementTree library has a perfectly good Python iterator that does the job:

import xml.etree.ElementTree as etree
for event, elem in etree.iterparse(xmL, events=('start', 'end', 'start-ns', 'end-ns')):
  print event, elem

This will parse the XML file in chunks at a time and give it to you at every step of the way. start will trigger when a tag is first encountered. At this point elem will be empty except for elem.attrib that contains the properties of the tag. end will trigger when the closing tag is encountered, and everything in-between has been read.

Namespaces

start-ns is probably the most important thing you will encounter to process XML serially. ElementTree has wisely provided this call for you to gather all the namespaces in the file. You will need to store this in a special namespace dictionary nsmap:

nsmap = {}
for event, elem in etree.iterparse(xmL, events=('start-ns')):
  ns, url = elem
  nsmap[ns] = url
print nsmap

This nsmap is the key to your sanity. Basically ElementTree reports all tags in the fully expanded name, where namespaces are expanded out in full:

GAML:attribute -> {http://www.bioml.com/gaml/}attribute

However, once you've collected your nsmap, you can construct the tags in a straightforward using the fixtag function:

def fixtag(ns, tag, nsmap):
  return '{' + nsmap[ns] + '}' + tag

So in the GAML example above, to look for closing tags:

nsmap = {}
results = []
for event, elem in etree.iterparse(xmL, events=('end', 'start-ns')):
  if event == 'start-ns':
    ns, url = elem
    nsmap[ns] = url
  if event == 'end':
    if elem.tag == fixtag('GAML', 'peptide', nsmap):
      result = process_peptide(elem, nsmap)
      results.append(result)

And we send the peptide subbranch of the XML file into a parsing function. This structure will be nicely parsed.

In one of my XML files, I have some implicit top-level namespace, where tag expands out to:

peptide -> {http://regis-web.systemsbiology.net/protXML}peptide

Turns out the namespace is actually '', so for fixtag, I use:

fixtag('', peptide, nsmap)

Pruning the branch

Now here's the most important part for serial reading, once you've parsed your chunk, you have to clear it with .clear():

for event, elem in etree.iterparse(xmL, events=('end')):
  if event == 'end':
    if elem.tag == fixtag('GAML', 'peptide', nsmap):
      process_peptide(elem, nsmap)
      elem.clear()

The garbage collection will then claim the freed memory. It is thus very important that you choose independent sub-trees to parse, so that clearing that sub-tree won't affect other parts.

You may have noticed that in the parsing function process_peptide, I pass in both elem and nsmap. That's because you need to generate the full tag to find things even in your sub-tree:

ydata = etree.findall(fixtag('GAML', 'ydata', nsmap))

Another alternative is that the find functions in etree can take nsmap and do some parsing, allowing perhaps clearer strings in the search path:

data = etree.findall('GAML:ydata/GAML:xdata', namespaces=nsmap)

Going down the rabbit hole

Now if your XML document creator has created a well thought-out document, then it's quite easy to traverse the tree by following the tags. But hey, there's no such thing as a well-thought out XML document. Have you ever seen shit like this?

<group>
  <group>
    <group>
       <parameter name="this" value="shit"/>
       <parameter name="always" value="happens"/>
       <parameter name="to" value="me"/>
    </group>
  </group>
</group>

Well I have, and this kind of nesting screws up your tag matching if you're looking for a <group> tag. The answer is that you need to keep track of the level of nesting using the start event in iterparse:

group_nesting = 0
for event, elem in etree.iterparse(xmL, events=('start', 'end')):
  if event == 'start':
    if elem.tag == 'group':
      group_nesting += 1
  if event == 'end':
    if elem.tag == 'group':
      if group_nesting == 3:
        process_params(elem, nsmap)
        elem.clear()
      group_nesting -= 1

Reading XML serially in Python: it's not so bad, is it?