Chapter 2: Markup and Core Concepts
The Anatomy of a Document
Elements: The Building Blocks of XML
Attributes: More Muscle for Elements
Namespaces: Expanding Your Vocabulary
Entities: Placeholders for Content
Getting the Most out of Markup
XML Application: DocBook
This is probably the most important chapter in the book, as it
describes the fundamental building blocks of all XML-derived
languages: elements, attributes, entities, and processing
instructions. It explains what a document is, and what it means
to say it is well-formed or valid. Mastering these concepts is a
prerequisite to understanding the many technologies,
applications, and software related to XML.
How do we know
so much about the syntactical details of XML?
It's all described in a technical document maintained by the W3C,
the XML recommendation (http://www.w3.org/TR/2000/REC-xml-20001006).
It's not light reading, and most users of XML won't need it,
but you may be curious to know where this is coming from. For
those interested in the standards process and what all the
jargon means, take a look at Tim Bray's interactive, annotated version
of the recommendation at http://www.xml.com/axml/testaxml.htm.
The Anatomy of a Document
Example 2-1 shows a bite-sized XML
example. Let's take a look.
Example 2.1. A Small XML Document
<message>Don't forget to recharge K-9
<emphasis>twice a day</emphasis>.
Also, I think we should have his
bearings checked out. See you soon
(or late). I have a date with
It's a goofy example, but perfectly acceptable XML. XML lets
you name the parts anything you want, unlike HTML, which limits you
to predefined tag names. XML doesn't care
how you're going to use the document, how it will appear when
formatted, or even what the names of the elements mean. All that matters
is that you follow the basic rules for markup described in
this chapter. This is not to say that matters of organization aren't
important, however. You should choose element names that make sense in
the context of the document, instead of random things like
signs of the zodiac. This is more for your benefit and the benefit of
the people using your XML application than anything else.
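The point that XML cares only about the markup rules, not about the names you pick, can be checked with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed and both sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Element names are up to the author; the parser attaches no meaning to them.
zodiac = "<aries><taurus>Hello</taurus></aries>"

# Overlapping tags break the nesting rule, so the parser rejects this.
overlapping = "<note><b>Hello</note></b>"

print(is_well_formed(zodiac))       # True
print(is_well_formed(overlapping))  # False
```

The zodiac names are accepted without complaint, which is exactly why choosing sensible names is left to you.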
This example, like all XML, consists of content interspersed
with markup symbols. The angle brackets (<>) and the names
they enclose are called tags. Tags demarcate and label
the parts of the document, and add other information that helps define
the structure. The text between the tags is the content of the document,
raw information that may be the body of a message, a title, or a field
of data. The markup and the content complement each other, creating an
information entity with partitioned, labeled data in a handy package.
Although XML is designed to be relatively readable by humans,
it isn't intended to create a finished document. In other words, you
can't open up just any XML-tagged document in a browser and expect
it to be formatted nicely. XML is really meant as a way to hold content so that,
when combined with other resources such as a stylesheet, the document
becomes a finished product with style and polish.
We'll look at how to combine a stylesheet
with an XML document to generate formatted output in
Chapter 4, "Presentation: Creating the End Product". For now, let's just imagine what it
might look like with a simple stylesheet applied. For example, it
could be rendered as shown in Example 2-2.
Example 2.2. The Memorandum, Formatted with a Stylesheet
Don't forget to recharge K-9 twice a day.
Also, I think we should have his bearings checked out.
See you soon (or late). I have a date with some Daleks...
From: The Doctor
The rendering of this example is purely speculative at this
point. If we used some other stylesheet, we could format the same memo
a different way. It could change the order of elements, say by
displaying the From: line above the message body. Or it could
compress the message body to a width of 20 characters. Or it could
go even further by using different fonts, creating a border around the
message, causing parts to blink on and off--whatever you
want. The beauty of XML is that it doesn't put any restrictions on
how you present the document.
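To make the separation of content and presentation concrete, here is a rough sketch in Python. It is not a real stylesheet language; the two functions render_plain and render_narrow merely stand in for two different stylesheets, and the shortened memo text is an assumption made for the example.

```python
import textwrap
import xml.etree.ElementTree as ET

# A shortened, hypothetical memo; only <message>, <emphasis>, and <from> appear.
MEMO = ("<time-o-gram><message>Don't forget to recharge K-9 "
        "<emphasis>twice a day</emphasis>.</message>"
        "<from>The Doctor</from></time-o-gram>")

def parts(xml_text):
    """Extract the message body (markup removed) and the sender."""
    root = ET.fromstring(xml_text)
    body = "".join(root.find("message").itertext())
    body = " ".join(body.split())  # collapse runs of whitespace
    return body, root.findtext("from")

def render_plain(xml_text):
    """First 'stylesheet': body first, then the From: line."""
    body, sender = parts(xml_text)
    return f"{body}\nFrom: {sender}"

def render_narrow(xml_text, width=20):
    """Second 'stylesheet': From: line on top, body wrapped to a narrow column."""
    body, sender = parts(xml_text)
    return f"From: {sender}\n" + textwrap.fill(body, width=width)

print(render_plain(MEMO))
print(render_narrow(MEMO))
```

Both functions read the same document; only the presentation logic differs.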
Let's look closely at the markup to discern its
structure. As Figure 2-1 demonstrates, the
markup tags divide the memo into regions, represented in the diagram
as boxes containing other boxes. The first box contains a special
declarative prolog that provides administrative information about the
document. (We'll come back to that in a moment.) The other boxes are
called elements. They act as containers and
labels of text. The largest element, labeled <time-o-gram>, surrounds all the other
elements and acts as a package that holds together all the
subparts. Inside it are specialized elements that represent the
distinct functional parts of the document. Looking at this diagram, we
can say that the major parts of a <time-o-gram> are the destination (<to>), the sender (<from>), a message teaser (<subject>), and the message body (<message>). The last is the most complex,
mixing elements and text together in its content. So we can see
from this example that even a simple XML document can harbor several
levels of structure.
Figure 2.1. Elements in the memo document
A Tree View
Elements divide the document into its constituent parts. They
can contain text, other elements, or both. Figure 2-2 breaks out the hierarchy of elements in
our memo. This diagram, called a tree because
of its branching shape, is a useful representation for discussing the
relationships between document parts. The black rectangles represent the seven
elements. The top element (<time-o-gram>) is called the root
element. You'll often hear it called the
document element, because it encloses all the
other elements and thus defines the boundary of the document. The
rectangles at the end of the element chains are called
leaves, and represent the actual content of
the document. Every object in the picture with arrows leading to or
from it is a node.
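The tree view can also be explored in code. The sketch below uses Python's xml.etree.ElementTree; the memo markup is a hypothetical reconstruction in which the element names <time-o-gram>, <to>, <subject>, <message>, <emphasis>, and <from> come from the discussion above, while the <villain> element, the pri value, and the text of <to> and <subject> are assumptions. Note that ElementTree keeps text inside the element objects rather than as separate leaf nodes, so the outline shows elements only.

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction of the memo (see the assumptions noted above).
MEMO = """<time-o-gram pri="important">
<to>Sarah</to>
<subject>Reminder</subject>
<message>Don't forget to recharge K-9
<emphasis>twice a day</emphasis>. I have a date with
some <villain>Daleks</villain>...</message>
<from>The Doctor</from>
</time-o-gram>"""

def outline(element, depth=0):
    """Return one indented line per element, parents before children."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(MEMO)
print("\n".join(outline(root)))

# iter() visits the root and every descendant element.
print(len(list(root.iter())))  # 7

# Elements with no child elements sit at the ends of the branches.
childless = [e.tag for e in root.iter() if len(e) == 0]
print(childless)
```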
Figure 2.2. Tree diagram of the memo
There's one piece of Figure 2-2 that we
haven't yet mentioned: the box on the left labeled
pri. It was inside the
<time-o-gram> tag, but here we see it
branching off the element. This is a special kind of content called an
attribute that provides additional information
about an element. Like an element, an attribute has a
label (pri) and some content.
You can think of it as a name/value pair contained in the
<time-o-gram> element tag. Attributes
are used mainly for
modifying an element's behavior rather than holding data;
later processing might print "High Priority" in large letters
at the top of the document, for example.
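As a small illustration of the name/value idea, the following Python sketch reads the pri attribute with xml.etree.ElementTree. The value "important", the lang attribute, and the banner function are assumptions made for the example, not part of the memo's definition.

```python
import xml.etree.ElementTree as ET

# The attribute value "important" is assumed for illustration.
root = ET.fromstring('<time-o-gram pri="important"><to>Sarah</to></time-o-gram>')

# Attributes are exposed as a dictionary of name/value pairs.
print(root.attrib)                      # {'pri': 'important'}
print(root.get("pri"))                  # important
print(root.get("lang", "unspecified"))  # falls back when the attribute is absent

def banner(element):
    """Hypothetical later processing step driven by the attribute."""
    return "HIGH PRIORITY" if element.get("pri") == "important" else ""

print(banner(root))
```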
Now let's stretch the tree metaphor further and think about
the diagram as a sort of family tree, where every node is a parent or a
child (or both) of other nodes. Note, though, that unlike a family tree,
an XML element has only one parent.
With this perspective, we can see that the root element (a
grizzled old <time-o-gram>) is
the ancestor of all the other elements. Its children are the four elements
directly beneath it. They, in turn, have children, and so on until we
reach the childless leaf nodes, which contain the text of the document
and any empty elements.
Elements that share the same parent are said to be siblings.
Every node in the tree can be thought of as the root of a
smaller subtree. Subtrees have all the
properties of a regular tree, and the top of each subtree is the ancestor
of all the descendant nodes below it. We will see in
Chapter 6, "Transformation: Repurposing Documents", that an XML document can be processed
easily by breaking
it down into smaller subtrees and reassembling the result
later. Figure 2-3 shows some examples
of subtrees in our <time-o-gram> document.
Figure 2.3. Some subtrees
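The family-tree vocabulary maps directly onto code. This Python sketch (a shortened, hypothetical version of the memo is assumed) finds ancestors and siblings and pulls out a subtree. ElementTree elements do not record their parent, so the sketch builds a lookup table first; the helper names ancestors and siblings are invented for the example.

```python
import xml.etree.ElementTree as ET

# A shortened, hypothetical version of the memo.
MEMO = ("<time-o-gram><to>Sarah</to><subject>Reminder</subject>"
        "<message>Recharge K-9 <emphasis>twice a day</emphasis>.</message>"
        "<from>The Doctor</from></time-o-gram>")

root = ET.fromstring(MEMO)

# Every element except the root appears exactly once as a key here,
# which reflects the rule that an element has only one parent.
parent_of = {child: parent for parent in root.iter() for child in parent}

def ancestors(element):
    """Tags of all ancestors, nearest first."""
    chain = []
    while element in parent_of:
        element = parent_of[element]
        chain.append(element.tag)
    return chain

def siblings(element):
    """Tags of the other elements that share this element's parent."""
    parent = parent_of.get(element)
    if parent is None:
        return []
    return [e.tag for e in parent if e is not element]

message = root.find("message")
emphasis = root.find(".//emphasis")

print(ancestors(emphasis))  # ['message', 'time-o-gram']
print(siblings(message))    # ['to', 'subject', 'from']

# Any element can be treated as the root of its own subtree.
print(ET.tostring(message, encoding="unicode"))
```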
And that's the 10-minute overview of XML. The power of XML is
its simplicity. In the rest of this chapter, we'll talk about the details
of the markup.
The Document Prolog
Somehow, we need to tip off the world that our document is
marked up in XML. If we leave it to a computer program to
guess, we're asking for trouble. A lot of markup languages look similar,
and when you add different versions to the mix, it becomes
difficult to tell them apart. This is especially true for
documents on the World Wide Web, where there are literally hundreds of
different file formats in use.
The top of an XML document is graced with special
information called the document prolog. At its
simplest, the prolog merely says that this is an XML document and
declares the version of XML being used:
<?xml version="1.0"?>
But the prolog can hold additional information that nails down
such details as the document type definition being used, declarations of
special pieces of text, the text encoding, and instructions to XML processors.
Let's look at a breakdown of the prolog, and then we'll examine each
part in more detail. Figure 2-4 shows an
XML document. At the top is an XML declaration (1). After this is
a document type declaration (2) that links to a document type
definition (3) in a separate file. This is followed by a set of
declarations (4). These four parts together comprise the prolog (6),
although not every prolog will have all four parts. Finally, the root
element (5) contains the rest of the document. This ordering cannot be
changed: if there is an XML declaration, it must be on the first
line; if there is a document type declaration, it must precede the root element.
Figure 2.4. A Document with a prolog and a root element
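The ordering rules can be observed by feeding a parser documents that break them. This is a minimal sketch using Python's xml.etree.ElementTree; the helper parses and the three sample documents are invented for the example, and the internal entity is only there to give the document type declaration something to hold.

```python
import xml.etree.ElementTree as ET

def parses(text):
    """Return True if the parser accepts the document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# XML declaration, then document type declaration, then root element.
good = ('<?xml version="1.0"?>\n'
        '<!DOCTYPE time-o-gram [ <!ENTITY me "Doctor Who"> ]>\n'
        '<time-o-gram>&me;</time-o-gram>')

# A single newline in front pushes the XML declaration off the first line.
late_declaration = "\n" + good

# Here the document type declaration follows the root element.
doctype_after_root = ('<?xml version="1.0"?>\n'
                      '<time-o-gram/>\n'
                      '<!DOCTYPE time-o-gram>')

print(parses(good))                # True
print(parses(late_declaration))    # False
print(parses(doctype_after_root))  # False
```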
Let's take a closer look at our
<time-o-gram> document's prolog,
shown here in Example 2-3. Note that because
we're examining the prolog in more detail, the numbers in
Example 2-3 aren't the same as those in Figure 2-4.
Example 2.3. A Document Prolog
<?xml version="1.0" encoding="utf-8"?>
  PUBLIC "-//LordsOfTime//DTD TimeOGram 1.8//EN"
  <!ENTITY sj "Sarah Jane">
  <!ENTITY me "Doctor Who">
(1) The XML declaration describes
some of the most general properties of the document, telling the
XML processor that it needs an XML parser to interpret this
document.
(2) The document type declaration describes the root element type, in this case
<time-o-gram>, and (on lines 3 and 4)
designates a document type definition (DTD) to control markup structure.
(3) The identity code, called a
public identifier, specifies the DTD to use.
(4) A system identifier specifies the location of the DTD. In this example, the system
identifier is a URL.
(5) This is the beginning of the
internal subset, which
provides a place for special declarations.
(6) Inside this internal subset are two
entity declarations.
(7) The end of both the internal
subset (]) and the document type declaration
(>) complete the prolog.
Each of these terms is described in more detail later in this chapter.
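To see the internal subset at work, the sketch below parses a small document with Python's xml.etree.ElementTree. It reuses the two entity declarations from the prolog but deliberately leaves out the public and system identifiers, so no external DTD is involved; the <to> and <from> contents are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Prolog with an internal subset only; the element content is illustrative.
DOC = b"""<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE time-o-gram [
  <!ENTITY sj "Sarah Jane">
  <!ENTITY me "Doctor Who">
]>
<time-o-gram>
<to>&sj;</to>
<from>&me;</from>
</time-o-gram>"""

root = ET.fromstring(DOC)

# The parser substitutes each entity reference with its declared text.
print(root.findtext("to"))    # Sarah Jane
print(root.findtext("from"))  # Doctor Who

def has_undefined_entity(text):
    """Return True if parsing fails, as it does for an undeclared entity."""
    try:
        ET.fromstring(text)
        return False
    except ET.ParseError:
        return True

# Without the declaration in the prolog, the same reference is an error.
print(has_undefined_entity("<to>&sj;</to>"))  # True
```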
The XML declaration
The XML declaration is an announcement to the XML processor
that this document is marked up in XML. Its form is shown in
Figure 2-5. The declaration
begins with the five-character delimiter
<?xml (1), followed by some number of property
definitions (2), each of which has a property name (3) and value in
quotes (4). The declaration ends with the two-character closing delimiter ?>.
Figure 2.5. XML declaration syntax
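A parser can report what the XML declaration says. The following sketch uses the XmlDeclHandler callback of Python's xml.parsers.expat module; the function read_declaration and the sample documents are invented for the example. The callback receives the three properties an XML declaration can carry (version, encoding, and standalone), and expat passes -1 for standalone when it is omitted.

```python
import xml.parsers.expat

def read_declaration(data):
    """Return the properties found in the XML declaration of data (bytes)."""
    found = {}

    def on_declaration(version, encoding, standalone):
        found["version"] = version
        found["encoding"] = encoding
        # expat reports -1 if standalone is omitted, otherwise 0 or 1.
        found["standalone"] = {-1: None, 0: "no", 1: "yes"}[standalone]

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_declaration
    parser.Parse(data, True)
    return found

print(read_declaration(b'<?xml version="1.0" encoding="utf-8"?><doc/>'))
# {'version': '1.0', 'encoding': 'utf-8', 'standalone': None}

# A document without a declaration never triggers the callback.
print(read_declaration(b"<doc/>"))  # {}
```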
There are three properties that you can set:
version
Sets the version number. Currently there is only
one XML version, so the value is always 1.0. However, as new versions
are approved, this property will tell the XML processor
which version to use. You should always define this property in your
prolog.
encoding
Defines the character encoding used in the document,
such as US-ASCII or
iso-8859-1. If you know you're using a character
set other than the standard Latin characters of UTF-8 (e.g., Japanese
Katakana, or Cyrillic), you should declare this property. Otherwise,
it's okay to leave it out. Character encodings are explained in Chapter 7, "Internationalization"....