The Python programming language provides an increasing amount of support for XML technologies. This document attempts to introduce some basic XML processing concepts to readers who have not yet started to use Python with XML, and it takes the form of a tutorial. It is assumed that the reader knows the basic terminology of XML and is comfortable with "XML as text".
(There is also a Polish translation of this document.)
I believe that Python release 2.0 or greater is most appropriate for XML processing - this is partly due to the introduction of Unicode support and the reasonably high probability that, for a number of readers, some XML documents encountered will use character sets which are not so easily supported using traditional Python character strings. I cannot really imagine how Python 1.5.2, for example, is able to handle "non-Western-European" characters, but given the length of time since the introduction of Python 2.0, as well as the familiarity the Python community has with the newer features of the releases from that time until the present, Python 2.0 is a safe and not unreasonable requirement.
Some releases of Python come with a fair amount of built-in XML support, and the extent of this support can be tested by starting Python interactively and testing for the presence of the minidom module:
import xml.dom.minidom
Should this module be imported without complaints from the interpreter, it would appear that your Python version is fairly recent and probably good enough for the purposes of this tutorial. Otherwise, you should download the PyXML package, choosing release 0.6.6 or greater.
To access the functions defined in the examples in this page, you can use xhtmlhook by first installing the software, then adding http://www.boddie.org.uk/python/ to your PYTHONPATH or to sys.path, and then importing this page as XML_intro (which is done below in the first prompt-based example).
If you just want to copy and paste the code, feel free to ignore references to xhtmlhook and XML_intro: if you have copied the example functions and pasted them into Python (either at the prompt or in a file) then the rest of the example code should just work.
All the activities in this tutorial require the import of the minidom module. Therefore, for all of the program fragments featured, this module must have been made available through an import statement similar to the following:
import xml.dom.minidom
We could import just the classes we need from the module, but we can leave our options open at this point. Entering the following statement gives us an idea of the contents of the module:
dir(xml.dom.minidom)
Before we start to experiment, here is a note about namespaces. It is very tempting not to use namespaces when starting out with XML documents and XML processing, but namespaces provide an interesting way of associating XML elements with certain meanings, applications and "domains". Rather than cause confusion later by introducing namespaces after the basic operations have been presented, I believe that they are easy enough to use at the start not to cause confusion at all.
First, import the minidom module:
import xml.dom.minidom
To create a new XML document, just instantiate a new Document object:
def get_a_document():
    doc = xml.dom.minidom.Document()
This document is not really interesting without some contents, however, so we should add something to it.
XML documents have a single "root element" inside which all other elements and pieces of text are placed. We could create an XML document which describes departments in a company with a "root element" called "business" within the namespace "http://www.boddie.org.uk/paul/business" - we want to communicate the fact that the "business" element means something special to us, and that the use of our namespace indicates that it is our special "business" element rather than any old home-made "business" element; the following statement creates such an element:
    business_element = doc.createElementNS("http://www.boddie.org.uk/paul/business", "business")
At this point, the element exists but has not been placed in the document; we need to add it to the document at the "root level" as follows:
    doc.appendChild(business_element)
Let us return the created objects at the end of this function:
    return doc, business_element
Now, the "root element" has been added - we can investigate this by querying the document about the elements within it. At the Python prompt...
>>> import xhtmlhook # to read this document
>>> from XML_intro.Creating import *
>>> doc, business_element = get_a_document()
>>> doc.childNodes
[<DOM Element: business at 136860108>]
This shows a list of elements with only one element present within it. We can, of course, examine the list and the element more closely:
>>> doc.childNodes[0].namespaceURI
'http://www.boddie.org.uk/paul/business'
This was the namespace that we gave our element.
>>> doc.childNodes[0].localName
'business'
This was the element name that we used.
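Assembled in one place, the fragments above might look like the following sketch. The `NS` constant is a convenience introduced here rather than a name used by the tutorial, and `documentElement` is the standard DOM shortcut for the root element.

```python
import xml.dom.minidom

# Convenience constant for this sketch only (not a name from the tutorial).
NS = "http://www.boddie.org.uk/paul/business"

def get_a_document():
    # Create an empty document, then attach a namespaced root element to it.
    doc = xml.dom.minidom.Document()
    business_element = doc.createElementNS(NS, "business")
    doc.appendChild(business_element)
    return doc, business_element

doc, business_element = get_a_document()
root = doc.childNodes[0]

print(root.namespaceURI)             # the namespace given to the element
print(root.localName)                # the element name without any prefix
print(root is doc.documentElement)   # both routes reach the same object
```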
We can add elements within this one. For example, a "location" element might be interesting to describe a particular location of part of a company, and such an element could be created in a similar way to that used above:
def add_a_location(doc, business_element):
    location_element = doc.createElementNS("http://www.boddie.org.uk/paul/business", "location")
We add the element as a child node of the old element in the same fashion as before:
    business_element.appendChild(location_element)
Let us return the created object at the end of this function:
    return location_element
At this point, it is possible to navigate down from the "root" of the document to the newly added element:
>>> location_element = add_a_location(doc, business_element)
>>> doc.childNodes[0].childNodes
[<DOM Element: location at 136781996>]
>>> doc.childNodes[0].childNodes[0]
<DOM Element: location at 136781996>
>>> doc.childNodes[0].childNodes[0].namespaceURI
'http://www.boddie.org.uk/paul/business'
>>> doc.childNodes[0].childNodes[0].localName
'location'
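The same navigation can be checked in a standalone script. This is only a sketch: the `NS` constant and the variable `found` are names introduced here for illustration, and `parentNode` is the standard DOM attribute for walking back up the tree.

```python
import xml.dom.minidom

# Convenience constant for this sketch only.
NS = "http://www.boddie.org.uk/paul/business"

doc = xml.dom.minidom.Document()
business_element = doc.createElementNS(NS, "business")
doc.appendChild(business_element)

def add_a_location(doc, business_element):
    # Elements are always created by the document, then attached to a parent.
    location_element = doc.createElementNS(NS, "location")
    business_element.appendChild(location_element)
    return location_element

location_element = add_a_location(doc, business_element)

# Walk down from the document to the new element, and back up again.
found = doc.childNodes[0].childNodes[0]
print(found is location_element)
print(found.parentNode is business_element)
```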
Text is a central feature of XML documents - inside elements, blocks of text may be stored and retrieved. We might have, in our example, an element which is called "surroundings", and this could be found within the "location" element as a means of describing the surroundings of a particular company location. Inside the "surroundings" element, there could be a block of text which forms such a description.
Here, we create the new "surroundings" element and add it within the "location" element:
def add_surroundings(doc, location_element):
    surroundings_element = doc.createElementNS("http://www.boddie.org.uk/paul/business", "surroundings")
    location_element.appendChild(surroundings_element)
And we specify the descriptive text within this new element by creating a new text node:
    description = doc.createTextNode("A quiet, scenic park with lots of wildlife.")
Of course, we need to add this to the document, and since the text is to be included within the "surroundings" element, it makes sense to add it to that element as a child node:
    surroundings_element.appendChild(description)
Let us return the created object from this function:
    return surroundings_element
We may now find our way from the document "root", if we want to:
>>> surroundings_element = add_surroundings(doc, location_element)
>>> doc.childNodes[0].childNodes[0].childNodes[0].childNodes[0]
<DOM Text node "A quiet, s...">
It is possible to see the entire contents of the text node using the nodeValue attribute of the node:
>>> doc.childNodes[0].childNodes[0].childNodes[0].childNodes[0].nodeValue
'A quiet, scenic park with lots of wildlife.'
We can, as with elements (as we shall see in a moment), add many text nodes within an element:
def add_more_surroundings(doc, surroundings_element):
    description = doc.createTextNode(" It's usually sunny here, too.")
    surroundings_element.appendChild(description)
Here is the "proof" that this worked:
>>> add_more_surroundings(doc, surroundings_element)
>>> surroundings_element.childNodes
[<DOM Text node "A quiet, s...">, <DOM Text node " It's usual...">]
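While an element holds several text nodes, its full text has to be assembled from all of them. The sketch below shows one way to do that; `get_text` is a hypothetical helper written for this example, not part of minidom, and `NS` is a convenience constant.

```python
import xml.dom.minidom

# Convenience constant for this sketch only.
NS = "http://www.boddie.org.uk/paul/business"

doc = xml.dom.minidom.Document()
surroundings_element = doc.createElementNS(NS, "surroundings")
doc.appendChild(surroundings_element)

# Two separate text nodes inside the same element.
surroundings_element.appendChild(
    doc.createTextNode("A quiet, scenic park with lots of wildlife."))
surroundings_element.appendChild(
    doc.createTextNode(" It's usually sunny here, too."))

def get_text(element):
    # Hypothetical helper: join the data of every direct text child.
    return "".join(node.data for node in element.childNodes
                   if node.nodeType == node.TEXT_NODE)

print(len(surroundings_element.childNodes))
print(get_text(surroundings_element))
```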
We might want all our text within one node in future, however. Fortunately, a method exists to collect text nodes together (note the "-ize" spelling):
def fix_element(element):
    element.normalize()
The results can be investigated, too:
>>> fix_element(surroundings_element)
>>> surroundings_element.childNodes[0].nodeValue
"A quiet, scenic park with lots of wildlife. It's usually sunny here, too."
Elements in XML documents may have attributes attached to them. For example, the "location" element could have another element within it (alongside the "surroundings" element) entitled "building", and this new element could have an attribute called "name":
def add_building(doc, location_element):
    building_element = doc.createElementNS("http://www.boddie.org.uk/paul/business", "building")
After adding this element...
    location_element.appendChild(building_element)
    return building_element
...it should be noticed that the new element appears after the "surroundings" element as a child element (or node) of "location". This can be seen at the Python prompt as follows:
>>> building_element = add_building(doc, location_element)
>>> location_element.childNodes
[<DOM Element: surroundings at 136727844>, <DOM Element: building at 136286548>]
Now, we may add an attribute directly to the new element like this:
def name_building(building_element):
    building_element.setAttributeNS("http://www.boddie.org.uk/paul/business", "business:name", "Ivory Tower")
After the namespace and the qualified attribute name, the value is specified. This attribute does not need to be explicitly added to the element, although we could have used other means of creating and adding it. This can be tested as follows:
>>> name_building(building_element)
>>> building_element.getAttributeNS("http://www.boddie.org.uk/paul/business", "name")
'Ivory Tower'
One important thing to note is the use of the "qualified name" when setting the attribute and the "local name" when getting the attribute value.
One might expect that by setting an attribute with a particular namespace, there would not be any need to explicitly state a prefix to appear before the local name in the final, written XML document - after all, it should be possible to detect that a namespace has been employed and that a prefix is required in the full attribute name so that the attribute is recognised as being associated with that namespace. Unfortunately, we do not seem to have that luxury and must explicitly specify a prefix as part of the qualified name when setting such an attribute; in the above example, the qualified name consists of the prefix "business" and local name "name".
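The difference between the qualified name and the local name can be seen in a short sketch. It relies only on standard minidom methods (`setAttributeNS`, `getAttributeNS`, `getAttributeNodeNS`, `getAttribute`); the `NS` constant is a convenience introduced here.

```python
import xml.dom.minidom

# Convenience constant for this sketch only.
NS = "http://www.boddie.org.uk/paul/business"

doc = xml.dom.minidom.Document()
building_element = doc.createElementNS(NS, "building")
doc.appendChild(building_element)

# Setting uses the qualified name: prefix "business" plus local name "name".
building_element.setAttributeNS(NS, "business:name", "Ivory Tower")

# Getting uses the namespace together with the local name only.
by_local_name = building_element.getAttributeNS(NS, "name")

# Passing the qualified name as the local name finds nothing (empty string).
by_qualified_name = building_element.getAttributeNS(NS, "business:name")

# The attribute node itself records both parts of the qualified name.
attribute = building_element.getAttributeNodeNS(NS, "name")

print(by_local_name)
print(repr(by_qualified_name))
print(attribute.prefix, attribute.localName)
```

The non-namespace-aware `getAttribute("business:name")` also returns the value, because minidom indexes the attribute under its qualified name as well.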
All the above effort should not be wasted, and so we will attempt to write the document out. One of the easiest ways of doing this, whilst respecting the namespaces that we have carefully included, is to use another module (along with the minidom module):
import xml.dom.ext
import xml.dom.minidom
Inside this module are a number of useful functions and classes. However, we shall use the PrettyPrint function to write the document to "standard output", i.e. the screen:
def write_to_screen(doc):
    xml.dom.ext.PrettyPrint(doc)
We can then try this out:
>>> from XML_intro.Writing import *
>>> write_to_screen(doc)
<?xml version='1.0' encoding='UTF-8'?>
<business xmlns='http://www.boddie.org.uk/paul/business' xmlns:business='http://www.boddie.org.uk/paul/business'>
  <location>
    <surroundings>A quiet, scenic park with lots of wildlife.</surroundings>
    <building business:name='Ivory Tower'/>
  </location>
</business>
We could have used another, simpler printing function or class, but we are usually so accustomed to (or spoilt by) nicely formatted textual XML documents that anything less than PrettyPrint probably would not do! Especially since PrettyPrint allows us to write the document to a file:
def write_to_file(doc, name="/tmp/doc.xml"):
    file_object = open(name, "w")
    xml.dom.ext.PrettyPrint(doc, file_object)
    file_object.close()
Or even easier:
def write_to_file_easier(doc, name="/tmp/doc.xml"):
    xml.dom.ext.PrettyPrint(doc, open(name, "w"))
NOTE: Should we not use "wb" for portability reasons?
>>> write_to_file(doc)
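The xml.dom.ext module belongs to the PyXML package rather than to the standard library. Where it is not installed, minidom's own `writexml` method is a possible substitute; the sketch below uses it instead of PrettyPrint and is not equivalent in every detail. In particular, it assumes the namespace declaration must be added by hand, because minidom's serializer only writes attributes that actually exist on the element. The names `write_to_stream`, `write_to_file_utf8` and `NS` are introduced here for illustration.

```python
import io
import xml.dom.minidom

# Convenience constant for this sketch only.
NS = "http://www.boddie.org.uk/paul/business"

doc = xml.dom.minidom.Document()
business_element = doc.createElementNS(NS, "business")
# Assumption: unlike PrettyPrint, writexml does not generate namespace
# declarations, so the default namespace is declared explicitly here.
business_element.setAttribute("xmlns", NS)
doc.appendChild(business_element)
business_element.appendChild(doc.createElementNS(NS, "location"))

def write_to_stream(doc, stream):
    # addindent and newl ask for indented output with one element per line.
    doc.writexml(stream, addindent="  ", newl="\n", encoding="UTF-8")

def write_to_file_utf8(doc, name):
    # Text mode with an explicit encoding keeps the bytes on disk consistent
    # with the encoding named in the XML declaration.
    with open(name, "w", encoding="utf-8") as file_object:
        write_to_stream(doc, file_object)

buffer = io.StringIO()
write_to_stream(doc, buffer)
text = buffer.getvalue()
print(text)
```

Writing to an in-memory buffer first, as here, is a convenient way to inspect the output before committing it to a file.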
XML would not be very useful if we could not read it back later for subsequent processing. Fortunately, minidom makes this fairly easy to achieve:
import xml.dom.minidom

def get_a_document(name="/tmp/doc.xml"):
    return xml.dom.minidom.parse(name)
Or, if a file is already open for the purposes of reading...
def get_a_document_from_file(file_object):
    return xml.dom.minidom.parse(file_object)
Unfortunately, if a file is written out in a prettyprinted fashion, it gets read in with all the extra padding, and this padding appears as text nodes between elements where we never inserted any kind of text nodes before. For example:
>>> from XML_intro.Reading import *
>>> doc2 = get_a_document()
>>> doc2.childNodes[0].childNodes[0]
<DOM Text node "\012">
What is this new text node? (We expected to see the "location" element's DOM object here instead.) Well, it appears to be a newline character which was inserted into our file to make the contents of that file look good when viewed in a text editor, for example. There are more of these things, too:
>>> doc2.childNodes[0].childNodes[1]
<DOM Text node " ">
This text node is a piece of indentation - something which made the textual form of the "location" element appear slightly further right than the "left margin" used by the "business" element. If we keep looking, though, we can find the "location" element:
>>> doc2.childNodes[0].childNodes[2]
<DOM Element: location at 137096548>
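The effect can be reproduced without a file by parsing a string. The sample document and the `describe` helper below are made up for this sketch. Exactly how the whitespace is divided into text nodes can vary between parser versions, so the session above may not match node for node.

```python
import xml.dom.minidom

# Hypothetical pretty-printed sample, similar in shape to the tutorial's file.
SOURCE = """<business xmlns="http://www.boddie.org.uk/paul/business">
  <location>
    <surroundings>A quiet, scenic park with lots of wildlife.</surroundings>
  </location>
</business>"""

doc2 = xml.dom.minidom.parseString(SOURCE)
root = doc2.documentElement

def describe(node):
    # Hypothetical helper: summarise a node by its type.
    if node.nodeType == node.TEXT_NODE:
        return "text " + repr(node.data)
    if node.nodeType == node.ELEMENT_NODE:
        return "element " + node.localName
    return "other"

summary = [describe(node) for node in root.childNodes]
for line in summary:
    print(line)
```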
So how do we avoid referencing the wrong nodes in the document?
Here is a brief description of each of the above options:
What we could do in the above case is to loop over the child nodes and examine each node to see what type it is. If it is an element then we investigate it, starting with "business" just to be safe:
def find_business_element(doc):
    business_element = None
    for e in doc.childNodes:
        if e.nodeType == e.ELEMENT_NODE and e.localName == "business":
            business_element = e
            break
    return business_element
We know that if business_element is not None, then we found an element called "business". Naturally, we can change the name of the element to be found according to the situation. Note that we compare the value of the nodeType attribute to a special attribute called ELEMENT_NODE. It is not obvious from this example, but special attributes such as ELEMENT_NODE and TEXT_NODE actually belong to the xml.dom.minidom.Node class, and are therefore available in (or shared by) all node objects.
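The same filtering idea can be generalised to any parent node. In the sketch below, `child_elements` is a hypothetical helper and the sample document is made up. In current standard-library versions the node type constants are defined on `xml.dom.Node`, which `xml.dom.minidom.Node` subclasses, so either class can be used to reach them.

```python
import xml.dom.minidom
from xml.dom import Node

# Convenience constant for this sketch only.
NS = "http://www.boddie.org.uk/paul/business"

# Hypothetical pretty-printed sample document.
SOURCE = """<business xmlns="http://www.boddie.org.uk/paul/business">
  <location>
    <surroundings>A quiet, scenic park with lots of wildlife.</surroundings>
  </location>
</business>"""

def child_elements(parent, namespace_uri, local_name):
    # Hypothetical helper: keep only matching element children, skipping
    # the whitespace text nodes left behind by pretty-printing.
    return [node for node in parent.childNodes
            if node.nodeType == Node.ELEMENT_NODE
            and node.namespaceURI == namespace_uri
            and node.localName == local_name]

doc2 = xml.dom.minidom.parseString(SOURCE)
business_element = child_elements(doc2, NS, "business")[0]
locations = child_elements(business_element, NS, "location")

print(Node.ELEMENT_NODE, Node.TEXT_NODE)
print(len(locations))
```

Comparing the namespace as well as the local name avoids matching a same-named element from a different vocabulary.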
NOTE: This should be documented somewhere. I haven't investigated it yet.
It should be possible to show the "surroundings" element's textual contents just by searching for the "surroundings" element, finding its child nodes, and then getting their values. Here, we use the getElementsByTagNameNS method on the document object to find all occurrences of the "surroundings" element in the document:
def get_surroundings_elements(doc):
    return doc.getElementsByTagNameNS("http://www.boddie.org.uk/paul/business", "surroundings")
The result can then be investigated:
>>> elements = get_surroundings_elements(doc2)
>>> elements
[<DOM Element: surroundings at 137101100>]
So, the contents of the element can indeed be discovered, too:
>>> elements[0]
<DOM Element: surroundings at 137101100>
>>> elements[0].childNodes
[<DOM Text node "A quiet, s...">]
>>> elements[0].childNodes[0].nodeValue
u"A quiet, scenic park with lots of wildlife. It's usually sunny here, too."
Note that the description that appears as the value of the text node is actually a Unicode string, but this should not concern you too much - Unicode strings are widely accepted by minidom and behave a lot like "traditional" Python strings.
Now, there are a number of problems with the above approach:
The principal problem with using such convenience functions is that any structure or context that an element may have, which is taken into consideration by descending into the document yourself, is lost when the results of the function are returned: all "surroundings" elements are bundled together into one package and handed back. However, we can learn about the context of the elements by exploring various useful attributes of those elements.
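The bundling effect is easy to see with a document that has more than one location. The sample document below is hypothetical and exists only for this sketch.

```python
import xml.dom.minidom

# Convenience constant for this sketch only.
NS = "http://www.boddie.org.uk/paul/business"

# Hypothetical sample with two locations (not from the tutorial).
SOURCE = """<business xmlns="http://www.boddie.org.uk/paul/business">
  <location>
    <surroundings>A quiet, scenic park.</surroundings>
  </location>
  <location>
    <surroundings>A busy city street.</surroundings>
  </location>
</business>"""

doc2 = xml.dom.minidom.parseString(SOURCE)
elements = doc2.getElementsByTagNameNS(NS, "surroundings")

# The result is a flat list in document order; nothing in the list itself
# says which "location" each description came from.
descriptions = [element.childNodes[0].nodeValue for element in elements]
print(descriptions)
```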
Every "surroundings" element should have a "location" element as its parent. We can check this at the Python prompt:
>>> elements[0].parentNode
<DOM Element: location at 137096548>
This at least allows us to compare "location" nodes against each other using something like this:
def examine_descriptions(elements):
    if elements[0].parentNode is elements[1].parentNode:
        print("They both describe the same location.")
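A variation on this comparison, taking two nodes rather than a list, can be tried on a small hypothetical document. The helper name `same_location` and the sample data are assumptions made for this sketch.

```python
import xml.dom.minidom

# Convenience constant for this sketch only.
NS = "http://www.boddie.org.uk/paul/business"

# Hypothetical sample: the first location has two children, the second one.
SOURCE = """<business xmlns="http://www.boddie.org.uk/paul/business">
  <location>
    <surroundings>A quiet, scenic park.</surroundings>
    <building/>
  </location>
  <location>
    <surroundings>A busy city street.</surroundings>
  </location>
</business>"""

doc2 = xml.dom.minidom.parseString(SOURCE)
surroundings = doc2.getElementsByTagNameNS(NS, "surroundings")
buildings = doc2.getElementsByTagNameNS(NS, "building")

def same_location(first, second):
    # Hypothetical helper: two nodes share a location when they have the
    # very same parent object.
    return first.parentNode is second.parentNode

print(same_location(surroundings[0], buildings[0]))
print(same_location(surroundings[0], surroundings[1]))
```

Taking the two nodes as separate arguments avoids the index error that a list-based version would raise when fewer than two elements are found.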