Parse, Traverse and Search XML Docs

October 24, 2010

In this series of articles you will learn XML processing with SAX and DOM. The practical examples are kept simple, yet they give enough detail to prepare you for professional development. They use available training source code so that you can easily try them yourself. I will not reuse this source code in my later web services tutorials, because its web service examples rely on obsolete libraries. The rest of the code still works with the standard Python libraries and is useful for other purposes.

I assume that you are already familiar with the Python interpreter, packages and modules, and that you have some fluency with the language and its syntax. I also assume that you have a good understanding of data structures and have used them in programming tasks.

Parsing XML

Parsing an XML document is very simple. We use the “binary.xml” from the given example source code to demonstrate this.

>>> import os
>>> from xml.dom import minidom
>>> binaryxml = minidom.parse(os.path.expanduser('~/diveintopython/py/kgp/binary.xml'))
>>> binaryxml
<xml.dom.minidom.Document instance at 0xb777c38c>
>>> print binaryxml.toxml()
<?xml version="1.0" ?>
<grammar>
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
<ref id="byte">
  <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\
<xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
</ref>
</grammar>

First we import the minidom module from the xml.dom package. Here minidom.parse takes one argument, the path to the XML file, and returns a parsed representation of the XML document.

The object returned from minidom.parse is a Document object. It is a descendant of the Node class. This Document object is the root level of a complex tree-like structure of interlocking Python objects that completely represents the XML document you passed to minidom.parse.

toxml is a method of the Node class (and is therefore available on the Document object you got from minidom.parse). toxml returns the XML that this Node represents as a string, which print then displays. For the Document node, this is the entire XML document.
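If you do not have binary.xml at hand, a minimal sketch like the following can stand in for it. It assumes an inline copy of the grammar (the name GRAMMAR_XML is mine, the long xref line is written unwrapped, and no DOCTYPE is included) and uses minidom.parseString, which parses a string instead of a file.

```python
from xml.dom import minidom

# Inline stand-in for binary.xml (assumption: no DOCTYPE, xref line unwrapped).
GRAMMAR_XML = """<grammar>
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
<ref id="byte">
  <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
</ref>
</grammar>"""

# parseString builds the same kind of Document object as minidom.parse.
doc = minidom.parseString(GRAMMAR_XML)

# toxml() returns the serialized document as a string; print displays it.
serialized = doc.toxml()
print(serialized)

# documentElement is the root element of the document.
root_name = doc.documentElement.tagName
print(root_name)
```

The later sketches in this article parse small inline strings in the same way, so each one can be run on its own.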

Traversal

To understand traversal and nodes, try the following example code.

>>> binaryxml.childNodes
[<xml.dom.minidom.DocumentType instance at 0xb777c68c>, <DOM Element: grammar at 0xb777c6ec>]
>>> binaryxml.childNodes[0]
<xml.dom.minidom.DocumentType instance at 0xb777c68c>
>>> binaryxml.firstChild
<xml.dom.minidom.DocumentType instance at 0xb777c68c>

Every Node has a childNodes attribute, which is a list of Node objects. A Document has exactly one root element (in this case, the grammar element), but as the output above shows, it can have other children too: binary.xml declares a document type, so the DocumentType node comes first and the grammar element second. To get a particular child node, just use regular list syntax. Remember, there is nothing special going on here; this is just a regular Python list of regular Python objects. Getting the first child node of a node is a common activity, so the Node class has a firstChild attribute, which is synonymous with childNodes[0]. (There is also a lastChild attribute, which is synonymous with childNodes[-1].) The Document object also has a documentElement attribute, which always refers to the root element wherever it sits in childNodes.

>>> grammarElement = binaryxml.documentElement
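The relationship between childNodes, firstChild and lastChild can be checked with a small sketch. The one-line document below is a made-up stand-in, not binary.xml, and it has no DOCTYPE, so its Document node has the root element as its only child.

```python
from xml.dom import minidom

# Hypothetical one-line document, used only for illustration.
doc = minidom.parseString('<grammar><ref id="bit"/><ref id="byte"/></grammar>')
root = doc.documentElement

# childNodes behaves like a regular Python list.
children = root.childNodes
print(len(children))

# firstChild and lastChild are the very same objects as the ends of childNodes.
same_first = root.firstChild is children[0]
same_last = root.lastChild is children[-1]
print(same_first)
print(same_last)

# Without a DOCTYPE, the only child of the Document is the root element.
only_child = len(doc.childNodes) == 1 and doc.childNodes[0] is root
print(only_child)
```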
>>> print grammarElement.toxml()
<grammar>
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
<ref id="byte">
  <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\
<xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
</ref>
</grammar>

The toxml method is defined in the Node class, so it is available on any XML node and not just on the Document node.

>>> grammarElement.childNodes
[<DOM Text node "\n">, <DOM Element: ref at 17533332>, \
<DOM Text node "\n">, <DOM Element: ref at 17549660>, <DOM Text node "\n">]
>>> print grammarElement.firstChild.toxml()
>>> print grammarElement.childNodes[1].toxml()
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
>>> print grammarElement.childNodes[3].toxml()
<ref id="byte">
  <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>
<xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
</ref>
>>> print grammarElement.lastChild.toxml()

The XML in binary.xml seems to give grammar just two child nodes, the two ref elements. But you’re missing something: the line breaks. After the '<grammar>' and before the first '<ref>' there is a newline, and this text counts as a child node of the grammar element. Similarly, there is a newline after each '</ref>', and these count as child nodes too. Therefore grammarElement.childNodes is actually a list of 5 objects: 3 Text objects and 2 Element objects.

The first child is a Text object representing the newline after the '<grammar>' tag and before the first '<ref>' tag. The second child is an Element object representing the first ref element. The third child is a Text object representing the newline between the two ref elements. The fourth child is an Element object representing the second ref element. The last child is a Text object representing the newline after the '</ref>' end tag and before the '</grammar>' end tag.
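When the whitespace Text nodes get in the way, one option is to keep only the children whose nodeType is Node.ELEMENT_NODE. The sketch below assumes a cut-down grammar with empty ref elements; the helper name element_children is mine, and getAttribute is the standard minidom call for reading an attribute value.

```python
from xml.dom import minidom, Node

# Cut-down stand-in for binary.xml: same line breaks, empty ref elements.
doc = minidom.parseString("""<grammar>
<ref id="bit"/>
<ref id="byte"/>
</grammar>""")
grammar = doc.documentElement


def element_children(node):
    """Return only the Element children of node, skipping Text nodes."""
    return [child for child in node.childNodes
            if child.nodeType == Node.ELEMENT_NODE]


# Three newline Text nodes plus two ref elements.
all_count = len(grammar.childNodes)
refs = element_children(grammar)
ref_ids = [ref.getAttribute("id") for ref in refs]

print(all_count)
print(len(refs))
print(ref_ids[0])
print(ref_ids[1])
```

Indexing into the filtered list avoids having to remember that the ref elements sit at positions 1 and 3 of childNodes.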

>>> grammarElement
<DOM Element: grammar at 19167148>
>>> refElement = grammarElement.childNodes[1]
>>> refElement
<DOM Element: ref at 17987740>
>>> refElement.childNodes
[<DOM Text node "\n">, <DOM Text node "  ">, <DOM Element: p at 19315844>, \
<DOM Text node "\n">, <DOM Text node "  ">, \
<DOM Element: p at 19462036>, <DOM Text node "\n">]
>>> pElement = refElement.childNodes[2]
>>> pElement
<DOM Element: p at 19315844>
>>> print pElement.toxml()
<p>0</p>
>>> pElement.firstChild
<DOM Text node "0">
>>> pElement.firstChild.data
u'0'

As you saw in the previous example, the first ref element is grammarElement.childNodes[1], since childNodes[0] is a Text node for the newline. The ref element has its own set of child nodes, one for the newline, a separate one for the spaces, one for the p element, and so forth. You can even use the toxml method here, deeply nested within the document. The p element has only one child node (you can’t tell that from this example, but look at pElement.childNodes if you don’t believe me), and it is a Text node for the single character '0'. The .data attribute of a Text node gives you the actual string that the text node represents. But what is that 'u' in front of the string?

The 'u' means that this is a unicode string rather than the plain byte string that Python 2 uses by default. XML is defined in terms of Unicode characters, so minidom returns all text as unicode objects, and that is why 'u' precedes '0'.
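Reading .data from each Text child is a common enough step that it is worth wrapping in a helper. The sketch below is one possible way to do it; the function name text_of and the tiny inline document are assumptions made for illustration.

```python
from xml.dom import minidom, Node

# Hypothetical fragment with no whitespace between the elements.
doc = minidom.parseString("<ref><p>0</p><p>1</p></ref>")


def text_of(element):
    """Concatenate the data of the Text nodes directly under element."""
    return "".join(child.data for child in element.childNodes
                   if child.nodeType == Node.TEXT_NODE)


# Each p element holds a single Text node with one character.
values = [text_of(p) for p in doc.documentElement.childNodes]
print(values[0])
print(values[1])
```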

Searching for elements

Traversing XML documents by stepping through each node can be tedious. If you want to find a particular element in a huge XML document, there is a shortcut: getElementsByTagName.

>>> from xml.dom import minidom
>>> binaryxml = minidom.parse('binary.xml')
>>> refList = binaryxml.getElementsByTagName('ref')
>>> refList
[<DOM Element: ref at 136138108>, <DOM Element: ref at 136144292>]
>>> print refList[0].toxml()
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
>>> print refList[1].toxml()
<ref id="byte">
  <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\
<xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
</ref>

getElementsByTagName takes one argument, the name of the element you wish to find. It returns a list of Element objects corresponding to the XML elements that have that name. In this case you have two ref elements.

>>> firstRef = refList[0]
>>> print firstRef.toxml()
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>

The first object in your refList is the 'bit' ref element. You can use the same getElementsByTagName method on this Element to find all the <p> elements within the 'bit' ref element. The getElementsByTagName method returns a list of all the elements it found (in this case 2).

>>> pList = firstRef.getElementsByTagName("p")
>>> pList
[<DOM Element: p at 136140116>, <DOM Element: p at 136142172>]
>>> print pList[0].toxml()
<p>0</p>
>>> print pList[1].toxml()
<p>1</p>

Now run the same search on the whole document.

>>> pList = binaryxml.getElementsByTagName("p")
>>> pList
[<DOM Element: p at 136140116>, <DOM Element: p at 136142172>, <DOM Element: p at 136146124>]
>>> pList[0].toxml()
'<p>0</p>'
>>> pList[1].toxml()
'<p>1</p>'
>>> pList[2].toxml()
'<p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\
<xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>'

Note carefully the difference between this and the previous example. Previously, you were searching for p elements within firstRef, but here you are searching for p elements within binaryxml, the root-level object that represents the entire XML document. This search does find the p elements nested within the ref elements within the root grammar element. The first two p elements are within the first ref (the ‘bit’ ref). The last p element is the one within the second ref (the ‘byte’ ref).

This article keeps theory to a minimum and focuses on techniques shown by example. This tutorial has been derived from Dive into Python, Chapter 9.
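The difference between searching one element and searching the whole document can be seen by counting the matches. The sketch below assumes an inline copy of the grammar (named GRAMMAR_XML here) and indexes the ref elements by their id attribute with getAttribute, which is a convenience of mine and not part of the original example.

```python
from xml.dom import minidom

# Inline stand-in for binary.xml (assumption: no DOCTYPE, xref line unwrapped).
GRAMMAR_XML = """<grammar>
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
<ref id="byte">
  <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
</ref>
</grammar>"""

doc = minidom.parseString(GRAMMAR_XML)

# Index every ref element by its id so it can be looked up by name.
refs_by_id = {}
for ref in doc.getElementsByTagName("ref"):
    refs_by_id[ref.getAttribute("id")] = ref

# Searching from an element only sees that element's descendants.
p_in_bit = len(refs_by_id["bit"].getElementsByTagName("p"))

# Searching from the Document sees every matching element in the tree.
p_in_doc = len(doc.getElementsByTagName("p"))

# The search is recursive: the xref elements are nested inside a p element.
xref_in_byte = len(refs_by_id["byte"].getElementsByTagName("xref"))

print(p_in_bit)
print(p_in_doc)
print(xref_in_byte)
```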
