Python and XML - XML Processing with Python


by Christopher A. Jones, Fred L. Drake Jr




If you are a Python programmer who wants to incorporate XML into your skill set, this is the book for you. Python has attracted a wide variety of developers, who use it either as glue to connect critical programming tasks together, or as a complete cross-platform application development language. Yet, because it is object-oriented and has powerful text manipulation abilities, Python is an ideal language for manipulating XML.

Python & XML gives you a solid foundation for using these two languages together. Loaded with practical examples, this new volume highlights common application tasks, so that you can learn by doing. The book starts with the basics then quickly progresses to complex topics, like transforming XML with XSLT, querying XML with XPath, and working with XML dialects and validation. It also explores the more advanced issues: using Python with SOAP and distributed web services, and using Python to create scalable streams between distributed applications (like databases and web servers).

The book provides effective practical applications, while referencing many of the tools involved in XML processing and Python, and highlights cross-platform issues along with tasks relevant to enterprise computing. You will find ample coverage of XML flow analysis and details on ways in which you can transport XML through your network.

Whether you are using Python as an application language, or as an administrative or middleware scripting language, you are sure to benefit from this book. If you want to use Python to manipulate XML, this is your guide.


Product Details

O'Reilly Media, Incorporated
Publication date:
Product dimensions:
7.10(w) x 9.18(h) x 0.96(d)

Read an Excerpt

Chapter 1: Python and XML

Python and XML are two very different animals, each with a rich history. Python is a full-scale programming language that has grown from scripting world roots in a very organic way, through the vision and guidance of Python's inventor Guido van Rossum. Guido continues to take into account the needs of Python developers as Python matures. XML, on the other hand, though strongly impacted by the ideas of a small cadre of visionaries, has grown from standards committee roots. It has seen both quiet adoption and wrenching battles over its future. Why bother putting the two technologies together?

Before the Python/XML combination, there seemed no easy or effective way to work with XML in a distributed environment. Developers were forced to rely on a variety of tools used in awkward combination with one another. We used shell scripting and Perl to process text and interact with the operating system, and then used Java XML APIs for processing XML and network programming. The shell provided an excellent means of file manipulation and interaction with the Unix system, and Perl was a good choice for simple text manipulation, providing access to the Unix APIs. Unfortunately, neither sported a sophisticated object model. Java, on the other hand, featured an object-oriented environment, a robust platform API for network programming, threads, and graphical user interface (GUI) application development. But with Java, we found an immediate lack of text manipulation power; scripting languages typically provided strong text processing. Python presented a perfect solution--as it seemed to combine the strengths of all of these various options.

Like most scripting languages, Python features excellent text and file manipulation capabilities. Yet unlike most scripting languages, Python sports a powerful object-oriented environment with a robust platform API for network programming, threads, and graphical user interface development. It can be extended with components written in C and C++ with ease, allowing it to be connected to most existing libraries. To top it off, Python has been shown to be more portable than other popular interpreted languages, running comfortably on platforms ranging from massively parallel Connection Machines to personal digital assistants and other embedded systems. As a result, Python is an excellent choice for XML programming and distributed application development.

It could be said that Python brings much sanity and robustness to the scripting world, much in the same way that Java once brought sanity and robustness to the C++ world. As always, there are trade-offs. In moving from C++ to Java, you find a simpler language with stronger object-oriented underpinnings. In moving to a simpler language further removed from the low-level details of memory management and the hardware, you gain robustness and an improved ability to locate coding errors. You also encounter a rich API equipped with easy thread management, network programming, and support for Internet technologies and protocols. As may be expected, this flexibility comes at a cost: You also encounter some reduced performance when comparing it with languages such as C and C++.

Likewise, when choosing a scripting language like Python over C, C++, or even Java, you do make some concessions. You trade performance for robustness and for the ability to develop more rapidly. In the area of enterprise and Internet systems development, choosing reliable software, flexible design, and rapid development and deployment are factors that outweigh the performance gains you might get by using a language such as C++. If you do need some of the performance back, you can still implement speed-sensitive components of your application in C or C++, but you can avoid doing so until you have profiling data to help you pinpoint what is really a problem and what only might be a problem. (How to perform the analysis and write extensions in C/C++ is a topic for other books.)

Regardless of your feelings on scripting languages, Java, or C++, this book focuses on XML and the Python language. For those who are new to XML, let's start with an overview of why it is interesting, and then we'll move on to using it from Python and seeing how we make our XML applications easier to create.

Key Advantages of XML

XML has a few key advantages that make it the data language of choice on the Internet. These advantages were designed into XML from the beginning, and, in fact, are what make it so appealing to Internet developers.

Application Neutrality

First, XML is both human- and machine-readable. This is not a subtle point: Have you ever tried to read a Microsoft Word document with a text editor? You can't if it was saved as a .doc file, because the information in a .doc document is in a binary (computer readable only) format, even though most Word documents primarily consist of text. A Word document cannot be shared with any other application besides Word--unless that application has been taught the intricacies of Word's binary format. In this case, the application must also be taught to expect changes in Word's format each time there is a new release from Microsoft.

This sounds annoying for the developer, but how bad is it, really? After all, Word is incredibly popular, so it must not be too hard to figure out. Let's look at the top of the Word file that contains this chapter:

Ï_ࡱ_á                > _ ÿ    _           _   B_       _  D_  _  
ÿÿÿ    ?_  @_  A_ ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿì¥Á 7         _  ≤_¿      _     _  >_  _ 
bjbjU_U_                         __ 0¸_ 7|  7|  W_  _     C            
           ÿÿ_         ÿÿ_         ÿÿ_                 l     Ê_      
Ê_  Ê_      Ê_      Ê_      Ê_      Ê_  ¶           _      

This certainly looks familiar to anyone who has ever opened a Word file with a text editor. We don't see our recognizable text (the content we intended) so we must assume it is buried deep in the file. Determining what the true content is and where it is can be difficult, but it shouldn't be. It is our data, after all. Let's try another supported format: "Rich Text Format," or RTF. Unlike the .doc file, this format is text-based, and should therefore be a bit easier to decipher. We search down in the file to find the start of our text:

\par }\pard \s34\qr
date-967302179\pnrnot1\adjustright\rin0\lin0\itap0 {\b0\fs48 Combining
Python and XML}{
\b0\deleted\fs48\revauthdel1\revdttmdel-2041034726 Fundamentals}{\b0\f
s48\revised\revauth1\revdttm-2041034726 ?}{\b0\fs48 
\par }\pard\plain \qj 

Better: The chapter title is visible, so we can try to decipher the structure from that point forward. The markup appears to be complex, and there's a hint of an old version of the chapter title. To extract the text we actually want, we need to understand the Word model for revision tracking, which still presents many challenges.

XML, on the other hand, is application-neutral. In other words, an XML document is usually processed by an XML parser or processor, but if one is not available, an XML document can be easily read and parsed. Data kept in XML is not trapped within the constraints of one particular software application. The ability to read rich data files can become very valuable when, for example, 20 years from now, you dig up a CD-ROM of old business forms that you suddenly find you need again--will QuickBooks still allow you to extract this same data in 2021? With XML you can read the data with any text editor (provided there aren't radical shifts in text encoding). You can even read it with your eyes.

Let's look at this chapter in XML. Using markup from a common document type for software manuals and documentation (DocBook), it appears somewhat verbose, and doesn't include change-tracking information, but we can identify the text quite easily now:

  <title>Python and XML</title>
  <para>Python and XML are two very different animals, each with a
    rich history.  Python is a full-scale programming language that has grown
    from scripting world roots, and has done so in a very organic way

Note also that additional characters appear in the document (other than the document content); these are called markup. We saw this in the RTF version of the document as well, but there were many more bits of text that were difficult to decipher, and we can reasonably surmise that the strange data in the MS Word document would correspond to this in some way. Were this a book on RTF, you would quickly surmise two things: that RTF is much more like a printer control language than the example of XML we just looked at, and that writing a program that understands it would be quite difficult. In this book, we're going to show you that XML can be used to define languages that fit your application, and that creating programs that can decipher XML is not too difficult a task, especially with the help of Python.
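To make that last claim a little more concrete, here is a minimal sketch that pulls the title and paragraph text out of a DocBook-style fragment using the xml.etree.ElementTree module from the Python standard library. The <chapter> wrapper element and the shortened paragraph are assumptions made for illustration, since only part of the chapter's markup is shown above, and extract_text is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the chapter source; the <chapter>
# wrapper element is assumed so that the fragment has a single root.
DOCBOOK = """\
<chapter>
  <title>Python and XML</title>
  <para>Python and XML are two very different animals, each with a
    rich history.</para>
</chapter>
"""

def extract_text(xml_source):
    """Return (title, [paragraphs]) with runs of whitespace collapsed."""
    root = ET.fromstring(xml_source)
    title = " ".join(root.findtext("title", default="").split())
    paragraphs = [" ".join("".join(para.itertext()).split())
                  for para in root.iter("para")]
    return title, paragraphs

title, paragraphs = extract_text(DOCBOOK)
print(title)
for text in paragraphs:
    print(text)
```

The markup is stripped by the parser, so the program only ever handles the content and the element names, not angle brackets.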

Hierarchical Structure

XML is hierarchical, and allows you to choose your own tag names. This is quite different from HTML. In XML, you are free to create elements of any type, and stack other elements within those elements. For example, consider an address entry:

<?xml version="1.0"?>
<address>
  <name>Bubba McBubba</name>
  <street>123 Happy Go Lucky Ln.</street>
</address>

In the above well-formed XML code, I came up with a few record names and then lumped them together with data. XML processing software, such as a parser (which you use to interpret the syntactic constructs in an XML document), would be able to represent this data in many ways, because its structure has been communicated. For example, if we were to look at what an application programmer might write in source code, we could turn this record into an object initialized this way:

addr = Address()
addr.name = "Bubba McBubba"
addr.street = "123 Happy Go Lucky Ln."
addr.city = "Seattle"
addr.state = "WA"
addr.zip = "98056"

This approach makes XML well suited as a format for many serialized objects. (There are some constructs for which XML is not so well suited, including many formats for large numerical datasets used in scientific computing.) XML's hierarchical structure makes it easy to apply the concept of object interfaces to documents--it's quite easy to build application-specific objects directly from the information stream given mappings from element names to object types. We'll later see that we can model more than simple hierarchical structures with XML.
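As a rough sketch of such a mapping, the code below builds an object directly from an address record. The Address class, the ELEMENT_TYPES table, the build_object helper, and the <address> root element with its city, state, and zip children are all assumptions introduced for illustration; the parsing itself uses the standard library's xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

class Address:
    """Hypothetical record type; attributes are filled in from child elements."""
    def __repr__(self):
        return "Address(%r)" % (vars(self),)

# Assumed complete form of the address entry, wrapped in a single root element.
ADDRESS_XML = """\
<?xml version="1.0"?>
<address>
  <name>Bubba McBubba</name>
  <street>123 Happy Go Lucky Ln.</street>
  <city>Seattle</city>
  <state>WA</state>
  <zip>98056</zip>
</address>
"""

# Mapping from element names to object types (only one type in this sketch).
ELEMENT_TYPES = {"address": Address}

def build_object(xml_source):
    """Create the object named by the root element and copy its fields over."""
    root = ET.fromstring(xml_source)
    obj = ELEMENT_TYPES[root.tag]()
    for child in root:
        setattr(obj, child.tag, (child.text or "").strip())
    return obj

addr = build_object(ADDRESS_XML)
print(addr.name, addr.city, addr.state, addr.zip)
```

Supporting another record type under this scheme would mean adding one more entry to the table; the loop that copies fields would stay the same.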

Platform Neutrality

Remember that XML is cross-platform. While this is mainly a feature of its text-based format, it's still very much true. The use of certain text encodings ensures that there are no misconceptions among platforms as to the arrangement of an XML document. Therefore, it's easy to pass an XML purchase order from a Unix machine to a wireless personal digital assistant. XML is designed for use in conjunction with existing Internet infrastructure using HTTP, SSL, and other messaging protocols as they evolve. These qualities make XML lend itself to distributed applications; it has been successfully used as a foundation for message queuing systems, instant messaging applications, and remote procedure call frameworks. We'll examine these applications further in Chapter 9 and Chapter 10. It also means that our document example given earlier is more than simply application-neutral, but can be readily moved from one type of machine to another without loss of information. A chapter of a technical book can be written by a programmer on his or her favorite flavor of Unix, and then sent to a publisher using book composition software on a Macintosh, and the many difficult format conversions can be avoided.

International Language Support

As the Internet becomes increasingly pervasive in our daily lives, both for professional and personal use, we become more aware of the world around us: that it is a culture-rich and diversified place. As technologists, however, we are still learning the significance of making our software work in ways that support more than one language at a time; making our text-processing routines "8-bit safe" is not only no longer sufficient, it's no longer even close.

Standards bodies all over the world have come up with ways that computers can interchange text written in their national languages, and sometimes they've come up with several, each having varying degrees of acceptance. Unfortunately, most applications do not include information about which language or interchange standard their data is written in, so it is difficult to share information across the cultural and linguistic boundaries the different standards represent, and sometimes it is difficult to share information within such boundaries if multiple standards are prominent.

The difficulties are compounded by very substantial cultural differences in how text is handled. There are many different writing systems in addition to the western European left-to-right, top-to-bottom system in which this book is written; right-to-left is not uncommon, and top-to-bottom "lines" of text arranged right-to-left on the page are used in China. Hebrew uses a right-to-left writing system, but numbers are written using Arabic numerals from left to right. Other systems support textual annotations written in parallel with the text. Consider what happens when a document includes text from different writing systems!

Standards bodies are aware of this problem, and have been working on solutions for years. The editors of the XML specification have wisely avoided proposing new solutions to most of these problems, and are instead choosing to build on the work of experts on the topic and existing standards.

The International Organization for Standardization (ISO) and the Unicode Consortium have arrived at a single standard that, while not perfect, is perhaps the most capable standard attempting to unify the world's text representations, with the intent that all languages and alphabets (including ideographic and hieroglyphic character sets) can be represented. The standard is known as ISO/IEC 10646, or more commonly, Unicode. Not all national standards bodies have agreed that Unicode is the standard for all future text interchange applications, especially in Asia, but there is widespread belief that Unicode is the best thing available to serve everyone. The standard deals with issues including multi-directional text, capitalization rules, and encoding algorithms that can be used to ensure various properties of data streams. The standard does not deal specifically with language issues that are not tied intimately to character issues. Software sensitive to natural language may still need to do a lot beyond using Unicode to ensure proper collation of names in a particular language (or multiple languages!). Some languages will require substantial additional support for proper text rendering (Arabic, for instance, which requires different letterforms for characters based on their position within a word and based on neighboring letterforms).

What the World-Wide Web Consortium did to make it easier to use both the older interchange standards and Unicode was both a simple and masterful stroke. It required that all XML documents be Unicode, and specified that they must describe their own encoding in such a way that all XML processors were able to determine what encoding any XML document was written in. A few specific encodings must be recognized by all processors, so that it is always possible to generate XML that can be read anywhere, and can represent all of the world's characters. There is also a feature that allows the content of XML documents to be labeled with the actual language it is written in, but that's not used as much as it could be at this time.
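A small sketch may help show what a self-describing encoding buys us. The code below serializes the same one-element document in three encodings and parses each byte stream with the standard library's xml.etree.ElementTree module. The <word> element, the sample text, and the make_document helper are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

TEXT = "caf\u00e9"  # "café": contains one non-ASCII character

def make_document(encoding):
    """Serialize the same tiny document as bytes in the given encoding."""
    source = '<?xml version="1.0" encoding="%s"?><word>%s</word>' % (
        encoding, TEXT)
    return source.encode(encoding)

latin1_bytes = make_document("iso-8859-1")
utf8_bytes = make_document("utf-8")
utf16_bytes = make_document("utf-16")  # Python prepends a byte order mark

# The byte streams differ, yet the parser recovers the same text from each,
# because every document names its own encoding.
decoded = [ET.fromstring(data).text
           for data in (latin1_bytes, utf8_bytes, utf16_bytes)]
print(decoded)
```

The application code never has to be told which encoding was used; that information travels with the document.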

Since XML documents are Unicode documents, the languages of the world are supported. The use of Unicode and encodings in XML will be discussed in some detail in Chapter 2. Unicode strings have been a part of Python since version 2.0, and the Python standard library includes support for a large number of encodings.
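As a brief illustration of the encoding support in the standard library, the sketch below round-trips a short Hebrew word through three codecs. It assumes a Python 3 interpreter, where every str is a Unicode string; the word and the choice of codecs are arbitrary examples.

```python
# Assumes Python 3, where str holds Unicode code points.
greeting = "\u05e9\u05dc\u05d5\u05dd"  # the Hebrew word "shalom"

sizes = {}
for codec in ("utf-8", "utf-16", "iso-8859-8"):
    data = greeting.encode(codec)           # text -> bytes
    assert data.decode(codec) == greeting   # bytes -> text, nothing lost
    sizes[codec] = len(data)

for codec, size in sizes.items():
    print(codec, size, "bytes")
```

The same four characters occupy a different number of bytes in each encoding, which is one reason a document needs to say which encoding it uses.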

The XML Specifications

In the trade press, we often see references about how XML "now supports" some particular industry-specific application. The article that follows is often confused, offering some small morsel of information about an industry consortium that has released a new specification for an XML-based language to support interoperability of data within the consortium's industry. As technical people, we usually note that it doesn't apply to the industries we are involved in, or it does but the specification is too early a draft to be useful. In fact, our managers will probably agree with us most of the time, or they'll be privy to some relevant information that causes them to disagree. If we step up the corporate ladder a couple more rungs, however, we often find an increase in the level of confusion over XML. Sometimes, this is accompanied by either a call to "adopt XML" (too often with a list of specific specifications that are not intended to be used together), or a reaction that XML is too immature to use at all.

So we need to think about just what we can work with that will meet the following criteria:

  • Makes technical sense for our application
  • Is sufficiently well-defined that implementation is possible
  • Can be explained and justified to (at least) our direct managers
  • Won't freak out the upper management

OK, we're technical people, so we may have to ignore that last item; it certainly won't be covered in this book. In fact, most of this really can't be covered in technical material. There are many specifications in various stages of maturity, and most are specific to one industry or another. What we can do is point out what the foundation specifications are, because those you will need regardless of your industry or other requirements.

XML 1.0 Recommendation

The XML specification itself is a document created and maintained by the W3C. As of this writing, the current version is Extensible Markup Language (XML) 1.0 (Second Edition), and is available from the W3C web site. (The second edition differs from the first only in that some editorial corrections and clarifications have been made; the specification is stable.)

XML itself is not a markup language, but a meta-language that can be used to define specific markup languages. In this, it inherits much from SGML. The specification covers five aspects of markup languages:

  • Range of structural forms which can be marked
  • Specific syntax of markup components
  • A schema language used to define specific languages
  • Definition of validity constraints
  • Minimum requirements for processing tools

Unlike SGML, XML allows itself to be used without defining an explicit markup language in any formal way. Whether or not this is useful for your applications, it has greatly accelerated the acceptance of XML-based technologies in some developer communities. This can happen because of the lower cost of entrance to the XML space: It is possible to adopt XML without learning some of the more esoteric corners of the specification, and development prototypes can start using XML technologies without a lot of advance planning.
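A minimal sketch of what this looks like in practice: the code below checks two documents for well-formedness with the standard library's xml.etree.ElementTree module, with no DTD or schema involved. The element names and the is_well_formed helper are invented for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_source):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(xml_source)
    except ET.ParseError:
        return False
    return True

# Element names invented on the spot; no DTD or schema is declared anywhere.
prototype = "<order><item qty='2'>widget</item></order>"
broken = "<order><item>widget</order>"  # <item> is never closed

print(is_well_formed(prototype))
print(is_well_formed(broken))
```

A prototype can start from a check like this and add a formal definition of its markup language later, if one turns out to be needed.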

Chapter 2 presents the most widely used parts of the specification and goes into more depth on items of greater importance to most readers of this book. If any of the details are of particular interest to you, please spend some time reading relevant parts of the specification. While it is at times a bit convoluted, it is not generally a difficult specification to read....


Meet the Author

Jones has extensive background in Internet systems programming and XML. He is co-founder of Planet 7 Technologies, a Seattle-based commercial software company specializing in XML transport software.

Drake is a member of the PythonLabs team, and has been contributing to Python since 1995. He took over maintenance of Python's documentation in 1998, changing the face of both the printed and online forms. He holds a Bachelor of Architecture degree as well as a Master of Science in computer science.
