Learning XML

by Erik T. Ray, John Posner (Editor), Chris Maden

The arrival of support for XML (the Extensible Markup Language) in browsers and authoring tools has followed a long period of intense hype. Major databases, authoring tools (including Microsoft's Office 2000), and browsers are committed to XML support. Many content creators and programmers for the Web and other media are left wondering, "What can XML and its associated standards really do for me?" Getting the most from XML requires being able to tag and transform XML documents so they can be processed by web browsers, databases, mobile phones, printers, XML processors, voice response systems, and LDAP directories, just to name a few targets.

In Learning XML, the author explains XML and its capabilities succinctly and professionally, with references to real-life projects and other cogent examples. Learning XML shows the purpose of XML markup itself, the CSS and XSL styling languages, and the XLink and XPointer specifications for creating rich link structures.

The basic advantages of XML over HTML are that XML lets a web designer define tags that are meaningful for the particular documents or database output to be used, and that it enforces an unambiguous structure that supports error-checking. XML supports enhanced styling and linking standards (allowing, for instance, simultaneous linking to the same document in multiple languages) and a range of new applications.

For writers producing XML documents, this book demystifies files and the process of creating them with the appropriate structure and format. Designers will learn what parts of XML are most helpful to their team and will get started on creating Document Type Definitions. For programmers, the book makes syntax and structures clear. It also discusses the stylesheets needed for viewing documents in the next generation of browsers, databases, and other devices.

Editorial Reviews

The Barnes & Noble Review
By now we've seen dozens of introductory XML books, and we're becoming harder to impress. Learning XML impresses us mightily.

Erik Ray's day job is helping O'Reilly implement XML workflow -- talk about eating your own dog food! In this book, he does exactly what he promises: presents a "birds-eye view of the XML landscape." It's not a programming book: It's focused on key ideas and tools you need to understand whatever you want to do with XML -- document management, web sites, application development/integration, B2B, you name it.

Learning XML is just plain well written. Clear explanations. Simple examples. Insight into what's solid about XML and what's still in flux. And coverage of every topic that matters, from links to presentation, document models to transformation, internationalization to basic programming. If you're just starting out with XML, you're lucky to have it. (Bill Camarda)

Bill Camarda is a consultant and writer with nearly 20 years' experience in helping technology companies deploy and market advanced software, computing, and networking products and services. His 15 books include Special Edition Using Word 2000 and Upgrading & Fixing Networks For Dummies®, Second Edition.

Product Details

O'Reilly Media, Incorporated
Publication date:
Edition description:
Older Edition
Product dimensions:
7.02(w) x 9.18(h) x 0.77(d)

Read an Excerpt

Chapter 2: Markup and Core Concepts


The Anatomy of a Document
Elements: The Building Blocks of XML
Attributes: More Muscle for Elements
Namespaces: Expanding Your Vocabulary
Entities: Placeholders for Content
Miscellaneous Markup
Well-Formed Documents
Getting the Most out of Markup
XML Application: DocBook

This is probably the most important chapter in the book, as it describes the fundamental building blocks of all XML-derived languages: elements, attributes, entities, and processing instructions. It explains what a document is, and what it means to say it is well-formed or valid. Mastering these concepts is a prerequisite to understanding the many technologies, applications, and software related to XML.

How do we know so much about the syntactical details of XML? It's all described in a technical document maintained by the W3C, the XML recommendation (http://www.w3.org/TR/2000/REC-xml-20001006). It's not light reading, and most users of XML won't need it, but you may be curious to know where this is coming from. For those interested in the standards process and what all the jargon means, take a look at Tim Bray's interactive, annotated version of the recommendation at http://www.xml.com/axml/testaxml.htm.

The Anatomy of a Document

Example 2-1 shows a bite-sized XML example. Let's take a look.

Example 2.1. A Small XML Document

<?xml version="1.0"?>
<time-o-gram pri="important">
  <to>Sarah</to>
  <subject>Reminder</subject>
  <message>Don't forget to recharge K-9 
    <emphasis>twice a day</emphasis>. 
    Also, I think we should have his 
    bearings checked out. See you soon 
    (or late). I have a date with 
    some <villain>Daleks</villain>...
  </message>
  <from>The Doctor</from>
</time-o-gram>

It's a goofy example, but perfectly acceptable XML. XML lets you name the parts anything you want, unlike HTML, which limits you to predefined tag names. XML doesn't care how you're going to use the document, how it will appear when formatted, or even what the names of the elements mean. All that matters is that you follow the basic rules for markup described in this chapter. This is not to say that matters of organization aren't important, however. You should choose element names that make sense in the context of the document, instead of random things like signs of the zodiac. This is more for your benefit and the benefit of the people using your XML application than anything else.

This example, like all XML, consists of content interspersed with markup symbols. The angle brackets (<>) and the names they enclose are called tags. Tags demarcate and label the parts of the document, and add other information that helps define the structure. The text between the tags is the content of the document, raw information that may be the body of a message, a title, or a field of data. The markup and the content complement each other, creating an information entity with partitioned, labeled data in a handy package.
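To make the split between markup and content concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree module. The choice of Python, the variable names, and the shortened copy of the memo are assumptions made for illustration; none of them come from the book.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the memo, including the <to> and <subject>
# elements that the surrounding text describes.
MEMO = """<?xml version="1.0"?>
<time-o-gram pri="important">
  <to>Sarah</to>
  <subject>Reminder</subject>
  <message>Don't forget to recharge K-9
    <emphasis>twice a day</emphasis>.
    I have a date with some <villain>Daleks</villain>...
  </message>
  <from>The Doctor</from>
</time-o-gram>"""

root = ET.fromstring(MEMO)

# Markup: the tag names that label each part of the document,
# listed in document order.
tag_names = [element.tag for element in root.iter()]

# Content: the text found between the tags, with runs of
# whitespace collapsed to single spaces.
content = " ".join("".join(root.itertext()).split())

print(tag_names)
print(content)
```

The first list holds only the labels; the second string holds only the raw information. Neither is very useful without the other, which is the point of mixing them in one document.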

Although XML is designed to be relatively readable by humans, it isn't intended to create a finished document. In other words, you can't open up just any XML-tagged document in a browser and expect it to be formatted nicely.[1] XML is really meant as a way to hold content so that, when combined with other resources such as a stylesheet, the document becomes a finished product, complete with style and polish.

[1]Some browsers, such as Internet Explorer 5.0, do attempt to handle XML in an intelligent way, often by displaying it as a hierarchical outline that can be understood by humans. However, while it looks a lot better than munged-together text, it is still not what you would expect in a finished document. For example, a table should look like a table, a paragraph should be a block of text, and so on. XML on its own cannot convey that information to a browser.

We'll look at how to combine a stylesheet with an XML document to generate formatted output in Chapter 4, "Presentation: Creating the End Product". For now, let's just imagine what it might look like with a simple stylesheet applied. For example, it could be rendered as shown in Example 2-2.

Example 2.2. The Memorandum, Formatted with a Stylesheet

Priority: important
To: Sarah
Subject: Reminder
Don't forget to recharge K-9 twice a day. 
Also, I think we should have his bearings checked out. 
See you soon (or late).  I have a date with some Daleks...
From: The Doctor

The rendering of this example is purely speculative at this point. If we used some other stylesheet, we could format the same memo a different way. It could change the order of elements, say by displaying the From: line above the message body. Or it could compress the message body to a width of 20 characters. Or it could go even further by using different fonts, creating a border around the message, causing parts to blink on and off--whatever you want. The beauty of XML is that it doesn't put any restrictions on how you present the document.
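As a rough illustration of one document feeding several presentations, the sketch below formats the memo as plain text in two different orders. It is only a stand-in for a real stylesheet: the render function, its sender_first flag, and the shortened memo are hypothetical, and the book itself uses CSS and XSL for this job rather than Python.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the memo, for illustration only.
MEMO = """<time-o-gram pri="important">
  <to>Sarah</to>
  <subject>Reminder</subject>
  <message>Don't forget to recharge K-9
    <emphasis>twice a day</emphasis>.</message>
  <from>The Doctor</from>
</time-o-gram>"""


def render(root, sender_first=False):
    """Format the memo as plain text (a hypothetical stand-in for a stylesheet)."""
    # Flatten the mixed content of <message> into one line of text.
    body = " ".join("".join(root.find("message").itertext()).split())
    header = [
        f"Priority: {root.get('pri')}",
        f"To: {root.findtext('to')}",
        f"Subject: {root.findtext('subject')}",
    ]
    sender = f"From: {root.findtext('from')}"
    # The same data, presented in one of two orders.
    lines = header + ([sender, body] if sender_first else [body, sender])
    return "\n".join(lines)


root = ET.fromstring(MEMO)
print(render(root))
print()
print(render(root, sender_first=True))
```

Nothing in the XML changes between the two calls; only the presentation logic does.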

Let's look closely at the markup to discern its structure. As Figure 2-1 demonstrates, the markup tags divide the memo into regions, represented in the diagram as boxes containing other boxes. The first box contains a special declarative prolog that provides administrative information about the document. (We'll come back to that in a moment.) The other boxes are called elements. They act as containers and labels of text. The largest element, labeled <time-o-gram>, surrounds all the other elements and acts as a package that holds together all the subparts. Inside it are specialized elements that represent the distinct functional parts of the document. Looking at this diagram, we can say that the major parts of a <time-o-gram> are the destination (<to>), the sender (<from>), a message teaser (<subject>), and the message body (<message>). The last is the most complex, mixing elements and text together in its content. So we can see from this example that even a simple XML document can harbor several levels of structure.

Figure 2.1. Elements in the memo document

A Tree View

Elements divide the document into its constituent parts. They can contain text, other elements, or both. Figure 2-2 breaks out the hierarchy of elements in our memo. This diagram, called a tree because of its branching shape, is a useful representation for discussing the relationships between document parts. The black rectangles represent the seven elements. The top element (<time-o-gram>) is called the root element. You'll often hear it called the document element, because it encloses all the other elements and thus defines the boundary of the document. The rectangles at the end of the element chains are called leaves, and represent the actual content of the document. Every object in the picture with arrows leading to or from it is a node.

Figure 2.2. Tree diagram of the memo

There's one piece of Figure 2-2 that we haven't yet mentioned: the box on the left labeled pri. It was inside the <time-o-gram> tag, but here we see it branching off the element. This is a special kind of content called an attribute that provides additional information about an element. Like an element, an attribute has a label (pri) and some content (important). You can think of it as a name/value pair contained in the <time-o-gram> element tag. Attributes are used mainly for modifying an element's behavior rather than holding data; later processing might print "High Priority" in large letters at the top of the document, for example.
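Seen from a program, an attribute really is just a name/value pair hanging off its element. The following sketch assumes Python's xml.etree.ElementTree; the banner logic is a hypothetical example of "later processing", not something the book specifies.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<time-o-gram pri="important"><from>The Doctor</from></time-o-gram>'
)

# Attributes are stored on the element itself, apart from its content.
priority = root.get("pri")
attributes = dict(root.attrib)

# Hypothetical processing step: let the attribute change how the
# document is treated instead of adding to its text.
banner = "HIGH PRIORITY" if priority == "important" else ""

print(attributes)
print(banner)
```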

Now let's stretch the tree metaphor further and think about the diagram as a sort of family tree, where every node is a parent or a child (or both) of other nodes. Note, though, that unlike a family tree, an XML element has only one parent. With this perspective, we can see that the root element (a grizzled old <time-o-gram>) is the ancestor of all the other elements. Its children are the four elements directly beneath it. They, in turn, have children, and so on until we reach the childless leaf nodes, which contain the text of the document and any empty elements. Elements that share the same parent are said to be siblings.
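The family relationships can be explored with a few lines of code. This is a sketch under assumptions: it uses Python's xml.etree.ElementTree, which keeps no parent pointers, so the parent_of map and the siblings helper are names invented here to fill that gap.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the memo, for illustration only.
MEMO = """<time-o-gram pri="important">
  <to>Sarah</to>
  <subject>Reminder</subject>
  <message>Recharge K-9 <emphasis>twice a day</emphasis>.
    I have a date with some <villain>Daleks</villain>...</message>
  <from>The Doctor</from>
</time-o-gram>"""

root = ET.fromstring(MEMO)

# Every element has exactly one parent, so a child-to-parent
# dictionary captures the whole family tree.
parent_of = {child: parent for parent in root.iter() for child in parent}


def siblings(element):
    """Return the tags of the other elements that share this element's parent."""
    parent = parent_of.get(element)
    if parent is None:  # the root element has no parent, hence no siblings
        return []
    return [other.tag for other in parent if other is not element]


children_of_root = [child.tag for child in root]
villain = root.find(".//villain")

print(children_of_root)
print(parent_of[villain].tag, siblings(villain))
```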

Every node in the tree can be thought of as the root of a smaller subtree. Subtrees have all the properties of a regular tree, and the top of each subtree is the ancestor of all the descendant nodes below it. We will see in Chapter 6, "Transformation: Repurposing Documents", that an XML document can be processed easily by breaking it down into smaller subtrees and reassembling the result later. Figure 2-3 shows some examples of subtrees in our <time-o-gram> example.
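A small sketch shows what treating a node as the root of its own subtree looks like in practice. It assumes Python's xml.etree.ElementTree and a pared-down memo; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<time-o-gram><to>Sarah</to>"
    "<message>Recharge K-9 <emphasis>twice a day</emphasis>.</message>"
    "</time-o-gram>"
)

# Pick any element and treat it as the root of a smaller tree.
message = root.find("message")
subtree_tags = [element.tag for element in message.iter()]

# Serialized on its own, the subtree is a well-formed document
# fragment that can be parsed again independently.
fragment = ET.tostring(message, encoding="unicode")
reparsed = ET.fromstring(fragment)

print(subtree_tags)
print(fragment)
```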

Figure 2.3. Some subtrees

And that's the 10-minute overview of XML. The power of XML is its simplicity. In the rest of this chapter, we'll talk about the details of the markup.

The Document Prolog

Somehow, we need to tip off the world that our document is marked up in XML. If we leave it to a computer program to guess, we're asking for trouble. A lot of markup languages look similar, and when you add different versions to the mix, it becomes difficult to tell them apart. This is especially true for documents on the World Wide Web, where there are literally hundreds of different file formats in use.

The top of an XML document is graced with special information called the document prolog. At its simplest, the prolog merely says that this is an XML document and declares the version of XML being used:

<?xml version="1.0"?>

But the prolog can hold additional information that nails down such details as the document type definition being used, declarations of special pieces of text, the text encoding, and instructions to XML processors.

Let's look at a breakdown of the prolog, and then we'll examine each part in more detail. Figure 2-4 shows an XML document. At the top is an XML declaration (1). After this is a document type declaration (2) that links to a document type definition (3) in a separate file. This is followed by a set of declarations (4). These four parts together comprise the prolog (6), although not every prolog will have all four parts. Finally, the root element (5) contains the rest of the document. This ordering cannot be changed: if there is an XML declaration, it must be on the first line; if there is a document type declaration, it must precede the root element.

Figure 2.4. A Document with a prolog and a root element

Let's take a closer look at our <time-o-gram> document's prolog, shown here in Example 2-3. Note that because we're examining the prolog in more detail, the numbers in Example 2-3 aren't the same as those in Figure 2-4.

Example 2.3. A Document Prolog

<?xml version="1.0" encoding="utf-8"?>               (1)
<!DOCTYPE time-o-gram                                (2)
    PUBLIC "-//LordsOfTime//DTD TimeOGram 1.8//EN"   (3)
    "http://www.lordsoftime.org/DTDs/timeogram.dtd"  (4)
[                                                    (5)
    <!ENTITY sj "Sarah Jane">                        (6)
    <!ENTITY me "Doctor Who">
]>                                                   (7)

1. The XML declaration describes some of the most general properties of the document, telling the XML processor that it needs an XML parser to interpret this document.

2. The document type declaration describes the root element type, in this case <time-o-gram>, and (on lines 3 and 4) designates a document type definition (DTD) to control markup structure.

3. The identity code, called a public identifier, specifies the DTD to use.

4. A system identifier specifies the location of the DTD. In this example, the system identifier is a URL.

5. This is the beginning of the internal subset, which provides a place for special declarations.

6. Inside this internal subset are two entity declarations.

7. The closing delimiters of the internal subset (]) and the document type declaration (>) complete the prolog.

Each of these terms is described in more detail later in this chapter.
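To see the internal subset doing something, the sketch below parses a document whose prolog declares the two entities and whose body refers to them. It assumes Python's xml.etree.ElementTree and deliberately leaves out the public and system identifiers so that nothing depends on fetching an external DTD; the <to> and <from> contents are made up for the demonstration.

```python
import xml.etree.ElementTree as ET

# Simplified prolog: an XML declaration plus a document type
# declaration that contains only an internal subset.
DOCUMENT = b"""<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE time-o-gram [
    <!ENTITY sj "Sarah Jane">
    <!ENTITY me "Doctor Who">
]>
<time-o-gram>
  <to>&sj;</to>
  <from>&me;</from>
</time-o-gram>"""

root = ET.fromstring(DOCUMENT)

# The parser replaces each entity reference with the text
# declared for it in the internal subset.
recipient = root.findtext("to")
sender = root.findtext("from")

print(recipient)
print(sender)
```

The prolog comes first, and the root element follows it, matching the ordering rule described above.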

The XML declaration

The XML declaration is an announcement to the XML processor that this document is marked up in XML. Its form is shown in Figure 2-5. The declaration begins with the five-character delimiter <?xml (1), followed by some number of property definitions (2), each of which has a property name (3) and value in quotes (4). The declaration ends with the two-character closing delimiter ?> (5).

Figure 2.5. XML declaration syntax

There are three properties that you can set:


version
Sets the version number. Currently there is only one XML version, so the value is always 1.0. However, as new versions are approved, this property will tell the XML processor which version to use. You should always define this property in your prolog.

encoding
Defines the character encoding used in the document, such as US-ASCII or iso-8859-1. If you know you're using a character set other than the standard Latin characters of UTF-8 (e.g., Japanese Katakana, or Cyrillic), you should declare this property. Otherwise, it's okay to leave it out. Character encodings are explained in Chapter 7, "Internationalization"....
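The effect of the encoding property can be sketched by saving the same short document in two encodings. The example assumes Python's xml.etree.ElementTree; the <note> element and its text are invented for the demonstration.

```python
import xml.etree.ElementTree as ET

# A tiny document containing two accented Latin characters.
text = "<note>d\u00e9j\u00e0 vu</note>"

# The same document stored in two encodings, each announced
# by the encoding property of its XML declaration.
latin1_bytes = (
    b'<?xml version="1.0" encoding="iso-8859-1"?>' + text.encode("iso-8859-1")
)
utf8_bytes = b'<?xml version="1.0" encoding="utf-8"?>' + text.encode("utf-8")

# The bytes on disk differ, but the parser recovers the same
# characters because the declaration says how to decode them.
from_latin1 = ET.fromstring(latin1_bytes).text
from_utf8 = ET.fromstring(utf8_bytes).text

print(latin1_bytes != utf8_bytes)
print(from_latin1 == from_utf8)
```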

Meet the Author

Erik T. Ray has worked for O'Reilly Media, Inc. as a software developer and XML specialist since 1995. He helped to establish a complete publishing solution using DocBook-XML and Perl to produce books in print, on CD-ROM, and for the new Safari web library of books. As the author of the O'Reilly best seller Learning XML and numerous articles to technical journals, Erik is known for his clear and entertaining writing style. When not hammering out code, he enjoys playing card games, reading about hemorrhagic fevers, practicing Buddhist meditation, and collecting toys. He lives in Saugus, MA with his wife Jeannine and 7 parrots.

Customer Reviews


Learning XML 5 out of 5 based on 0 ratings. 1 reviews.
Guest More than 1 year ago
XML in some circles has become the so-called 'next big thing' and this book covers it extensively, from its 'history,' to viewing it properly, creating a document type definition to XML programming. Although XML is still in a kind of 'experimental' stage, this book shows numerous examples of how it can be used effectively. XML lets you create your 'own markup language,' but there are many rules involved with setting up an XML document. Things like elements, attributes, and namespaces all come into play and are explained extensively in this book. There is also some discussion about DocBook, which originated from SGML. Also covered are stylesheets, a section about XHTML (an XML-HTML 'hybrid'), document models (both DTDs and schemas), and internationalization standards. The book's appendixes contain a variety of online resources, standards, and even a glossary. While it may not be THE definitive XML reference book, it covers a lot of ground in its approximately 335 pages.