XSLT: The Ultimate Guide to Transforming Web Data

by Johan Hjelm, Peter Stark
Extensible Stylesheet Language Transformations (XSLT) is the language used in XSL style sheets to transform XML documents into other XML documents. XSLT represents a major breakthrough for digital content, providing a new way of translating Web content from XML into the various standards populating the wired and wireless world such as HTML, XHTML, WML, and others. Written by the leading architects of the XSLT transformation technology responsible for the standardization efforts for the W3C technologies, this book introduces Web developers and content designers to what is widely expected to replace Perl as a Web translator. Readers will find expert guidance on how to create the transformation sheets that guide the process of translation, how to optimize content for the most frequent formats through the use of transformation hints, as well as how to install and use the necessary software.

Editorial Reviews

From the Publisher
"... if you are looking for an accessible, hands-on approach, particularly one aimed at XSLT beginners, this is a useful enough volume ..." (Computer Bulletin, November 2002)
The Barnes & Noble Review
XSLT is a simple tool for doing simple things -- changing the structure of XML documents, renaming elements and attributes, fragmenting one XML document into several. But these "simple" capabilities give you remarkable power. Suddenly, you can translate content for a wide variety of applications and environments, including wireless devices. And you're no longer required to compromise the ways you create or organize information in order to control how that information is displayed. If you're working with XML, XSLT is your next "need-to-know" technology -- and XSLT Professional Developer's Guide is the place to learn it.

Written by Johan Hjelm and Peter Stark, leaders of the W3C XSLT standardization effort, this book starts with XSLT's rationale, then gives you a feel for XSLT by walking you through a simple translation. You'll gradually build your XSLT programming skills, starting with simple techniques like if and choose, then moving on to attributes, variables, sorting, counting, comparing strings, and result tree fragments. Once you've nailed the basics, it's on to more advanced techniques -- such as calling templates and using XSLT extensions. You'll find expert coverage of XSLT's relationship with databases and DOM, styling for mobile presentation, and a whole lot more. (Bill Camarda)

Bill Camarda is a consultant, writer, and web/multimedia content developer with nearly 20 years' experience in helping technology companies deploy and market advanced software, computing, and networking products and services. His 15 books include Special Edition Using Word 2000 and Upgrading & Fixing Networks For Dummies®, Second Edition.

Product Details

Series: Professional Developer's Guide Series, #8
Product dimensions: 7.51(w) x 9.20(h) x 0.85(d)

Read an Excerpt

An XML-Based Web

The World Wide Web (WWW) contains a massive amount of information. Most information on the Web is marked up in Hypertext Markup Language (HTML), which enables Web browsers to display the information. But surveys indicate that 70 percent of all HTML markup has errors. With HTML markup, the information is divided into headers, paragraphs, lists, links, and other structural elements that we consider to make up a document. The browser uses the markup to present the document according to its internal rules about what headers, paragraphs, lists, and links look like. What is missing in this model is information about what kind of information is contained inside the headers and paragraphs of the document. All the HTML markup tells us is that the following text is a paragraph, not whether it is an address, a receipt, or a fragment from a best-selling novel. Here are two typical HTML paragraphs:

<p>John Wiley &amp; Sons, Inc.</p>
<p>www.wiley.com/compbooks/</p>

The <p> elements don't indicate what kind of information they contain. The browser presents them like any other paragraphs. The fact that they actually represent a name and a Web address is not available to the browser, or to the user who is viewing the document. Consider what we can do with XML:

<address>
<name>John Wiley &amp; Sons, Inc.</name>
<www>www.wiley.com/compbooks/</www>
</address>

We defined our own elements that represent the kind of information we want to describe, instead of just saying where in the document structure it fits in. This is the power of XML over plain HTML.

Most content on the Web (in terms of pages, if not in number of servers) comes from databases, and is presented as HTML only as a convenience. Since a database is a structured format, using XML as an intermediary format to the presentation comes naturally. This is especially true if the information is going to be used in several different presentations.

Now that we have used XML to indicate what kind of information the document contains, how does the browser know how to present our markup? All browsers know how to present HTML markup. Browsers don't know how to present the <address> element that we just invented.

There are two ways:

1. Use Cascading Style Sheets (CSS). In the style sheet, you declare how each element should be presented (for instance, <address> should be 40 points Times Roman Bold). The style sheet for the document can be stored in a separate file, and used for other documents that contain the same elements. A characteristic of CSS is that the style rules never change the actual markup. Style rules are added as a layer above the markup.

2. Use XSLT to transform the markup into something that is known to all browsers; for example, HTML.

The following example is part of a language for describing mobile telephones, expressed in XML:

<phone>
<manufacturer>Ericsson</manufacturer>
<model>R320</model>
<network>GSM 900</network>
</phone>

The meaning of these element names is not defined by XML but is declared separately in a document called the Document Type Definition (DTD) or in an XML schema (which is another type of document).

Both of these declare which data types the elements can have, in which order they should come, which elements can be contained in other elements, and so on. The author of the document can define a DTD or use a predefined one.
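As a rough illustration of what such a declaration might look like, here is a minimal sketch of the phone document carrying its own DTD in an internal subset. The content model shown (exactly one manufacturer, model, and network, in that order, each containing only text) is an assumption made for this example, not a published document type.

```xml
<?xml version="1.0"?>
<!-- Assumed DTD for illustration: a phone holds three text-only children in a fixed order. -->
<!DOCTYPE phone [
  <!ELEMENT phone (manufacturer, model, network)>
  <!ELEMENT manufacturer (#PCDATA)>
  <!ELEMENT model (#PCDATA)>
  <!ELEMENT network (#PCDATA)>
]>
<phone>
  <manufacturer>Ericsson</manufacturer>
  <model>R320</model>
  <network>GSM 900</network>
</phone>
```

A validating XML processor would reject a phone document that, for instance, listed the network before the model, because that order is not allowed by the content model above.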

A DTD (or a schema) enables you to declare what an element name should mean, but it has to be done within a set of rules. The benefit of XML is that it represents an agreement about what an element is and where the < and > should go and a few other things such as how to find out the character encoding of the document. It does not sound like much, and in fact, it is not very much: The XML specification is very brief. But it is what it takes to make two computers on the Web read the same document without protests. When the XML processor reads the XML document, it passes up the element names and their content to an application that understands their meaning (or at least can present the information on a screen or on paper).

Before XML, there was no common agreement on how to mark up information. The markup (the elements and the attributes) adds information to the data (what is inside the elements). With XML, the Web has a universal data format that can represent everything from documents to the primitives of a communication protocol.

XML is, despite claims that you might sometimes see, not intended to represent all types of information. Not all data can be represented as an XML tree, either, because not all data has the required structure. It is quite possible to represent an image as XML. (There is nothing that says that the content of elements has to be text, although the element names have to be.) But given the success of XML on the Web, the simple tree structure with elements and attributes seems to be capable of describing most types of data. We will look more at XML in Chapter 2, "XML Technologies: XML, XHTML, and WML."

The Web and its infrastructure are full of XML documents, and they are proliferating because XML enables authors of DTDs to create their own applications. An XML application is nothing but a way to describe an information set. Descriptions can be quite varied, depending on who creates them -- and the same situation applies to XML markup. A mobile phone can be classified by its network and manufacturer, as we did previously. But in an inventory application, the main thing might not be the manufacturer but instead the package. It might be classified by the battery duration or by any other attribute that it might have. What the XML elements describe might depend on what the author perceives as important at the moment.

The Case for Transformations

With so much data on the Web expressed as XML, there will be a repeated need to change XML documents and not just the trivial cases described previously. There is a need to change the structure, to change the names of the nodes, and to insert or remove content. There are a number of reasons for these needs:

  • To present documents that are represented in an in-house XML document type in a browser that cannot handle that specific markup language.
  • To present information on different devices. Before it can be shown in a Web browser, in a Wireless Application Protocol (WAP) browser, or as plain text, the private format of the XML document must first be changed into HTML, Wireless Markup Language (WML), or plain text. The process of transforming the XML documents into the language that the presentation device supports is called styling.
  • To upgrade XML documents to a newer document type with richer functionality. A Web author who was on the cutting edge in 1998 and created Web services for mobile phones will have lots of useless WML 1.0 documents lying around and will want to upgrade to a new WML version that is supported by at least one mobile phone on the Web.
  • To support the multiple XML-based schema languages. Most developers need to use only one but still want to be able to change their schemas into a different schema language.

These use cases are common and will become more common in the near future. Today, most users are using PCs to browse the Web, but in Japan, 30 million subscribers are using mobile telephones with the iMode system, which uses a special version of HTML. In Europe and in the rest of Asia, manufacturers have released mobile phones using WML, which is an XML application, to present data. Market figures point to a prolific use of mobile phones with Internet access (to the tune of several hundred million dollars in the next few years). It will not be long before mobile phones have overtaken the PC as the default Web browsing device. Television sets are also becoming Web enabled, and more exotic equipment such as microwave ovens and refrigerators is gaining Internet access. And DoCoMo (the company behind iMode) as well as the WAP Forum (the organization creating WML) have decided that future versions of their systems will be based on XHTML.

With Web-enabled mobile phones and TVs, there is an increased need for tools that style the same XML document for different types of presentations. And there will be an increasing number of applications where the result is not presentation but input to another program. XML has not only established itself as the default data format of the Web, but it has also established itself as the favorite format for data exchange (for instance, between different database applications). Where there once was one document type, HTML, there are now many document types. The number of markup languages on the Web increases. More languages mean more versions and variants to keep track of -- and transform between.

Enter XSLT

If you have a document in an XML format and want to transform it into other XML formats, you need some way of describing how the original format resembles and differs from the format into which you want to transform the document. The XSLT language is used to express rules for how an XML document should be changed, transformed, renamed, and filtered. Each rule identifies a set of nodes from the source XML document and then describes which actions should be applied to those nodes; for example, move this element here, change the name of that element, add this text, and so on. The rules are contained in a document called the transformation sheet (or style sheet), a name that has historic origins. We use transformation sheet and style sheet interchangeably in this book.
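To make the idea of rules concrete, here is a minimal sketch of a transformation sheet with three template rules, written against the phone document used in this chapter. The target element name type is hypothetical and chosen only for illustration.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Default rule: copy every node and attribute unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Rename rule: every model element becomes a type element (hypothetical name). -->
  <xsl:template match="model">
    <type>
      <xsl:apply-templates select="@*|node()"/>
    </type>
  </xsl:template>

  <!-- Filter rule: an empty template drops network elements and their content. -->
  <xsl:template match="network"/>

</xsl:stylesheet>
```

Each xsl:template is one rule: the match attribute identifies the nodes, and the body says what to produce for them. The more specific rules for model and network take precedence over the general copying rule, so applied to the phone document this should leave a phone element containing a manufacturer and a type, with the network removed.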

The transformation sheet, together with the original document, is the input to an XSLT processor, which generates the output document. The result of a transformation does not have to be an XML document, however. It can be plain text, HTML, or any other data types that can be described in XSLT. As long as they have a structure and that structure can be transformed into some other structure, you can use XSLT to transform the document.
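As a small sketch of a result that is not XML, the following transformation sheet asks the processor for plain text output. It assumes the phone document shown in this chapter; normalize-space() is used so that stray whitespace inside the source elements does not leak into the result.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Serialize the result as plain text rather than as XML markup. -->
  <xsl:output method="text"/>

  <xsl:template match="/phone">
    <xsl:value-of select="normalize-space(manufacturer)"/>
    <xsl:text> </xsl:text>
    <xsl:value-of select="normalize-space(model)"/>
    <xsl:text> (</xsl:text>
    <xsl:value-of select="normalize-space(network)"/>
    <!-- &#10; is a line feed that ends the output line. -->
    <xsl:text>)&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

For the Ericsson example, this should produce the single line "Ericsson R320 (GSM 900)". The xsl:text elements are there because whitespace-only text between instructions in a transformation sheet is otherwise discarded.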

For simple transformations, XSLT is simple -- and a transformation sheet can be written by hand in a text editor such as Emacs or Windows Notepad. For complicated transformations, an XSLT authoring tool is recommended. XSLT performs much the same functions as other scripting languages, like Perl and ECMAScript (the standardized form of JavaScript). It is, however, designed especially to process XML. While it is a full-fledged programming language and has been used to write advanced software, it is optimized to transform content from XML to other formats.

Again, consider the following XML document, which could be output from a database (a serialization of a relational database table where the element names are the names of the columns and the values are the values in the fields):

<phone>
<manufacturer>Ericsson</manufacturer>
<model>R320</model>
<network>GSM 900</network>
</phone>

Now, we want to translate this information into an HTML table that looks like the following:

<table xmlns="http://www.w3.org/1999/xhtml">
<tr><th>Manufacturer</th><td>Ericsson</td></tr>
<tr><th>Model</th><td>R320</td></tr>
<tr><th>Network</th><td>GSM 900</td></tr>
</table>

Here is the transformation sheet that will perform this action:

<table xmlns="http://www.w3.org/1999/xhtml"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xsl:version="1.0">
<tr>
<th>Manufacturer</th>
<td><xsl:value-of select="/phone/manufacturer" /></td>
</tr>
<tr>
<th>Model</th>
<td><xsl:value-of select="/phone/model" /></td>
</tr>
<tr>
<th>Network</th>
<td><xsl:value-of select="/phone/network" /></td>
</tr>
</table>

When the XSLT processor processes the transformation sheet, the xsl:value-of element selects an element from the source XML document, takes the text from inside the element, and outputs the text as the result. The HTML elements are copied to the output stream unchanged. The resulting document is, as expected, an XHTML table that has the same values as the XML document:

<table xmlns="http://www.w3.org/1999/xhtml">
<tr><th>Manufacturer</th><td>Ericsson</td></tr>
<tr><th>Model</th><td>R320</td></tr>
<tr><th>Network</th><td>GSM 900</td></tr>
</table>
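The same transformation can also be written in the longer form that most transformation sheets use, with an xsl:stylesheet root element and an explicit template rule. The following is a sketch of that equivalent form; the single template matches the document root (/), and the indentation is a choice made here for readability.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/1999/xhtml">

  <!-- One rule for the whole document: build the table when the root is reached. -->
  <xsl:template match="/">
    <table>
      <tr>
        <th>Manufacturer</th>
        <td><xsl:value-of select="/phone/manufacturer"/></td>
      </tr>
      <tr>
        <th>Model</th>
        <td><xsl:value-of select="/phone/model"/></td>
      </tr>
      <tr>
        <th>Network</th>
        <td><xsl:value-of select="/phone/network"/></td>
      </tr>
    </table>
  </xsl:template>

</xsl:stylesheet>
```

The longer form becomes worthwhile as soon as a transformation needs more than one rule, because additional xsl:template elements can be added next to the first one.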

The xmlns attributes and the xsl prefixes are part of XML namespaces and are essential for a transformation sheet to work. We will return to the topic of XML namespaces later, but because they are such an essential part of XSLT, here is a short explanation.

A namespace is just a set of names, which can be arbitrary. The set has a unique name, identified by a Uniform Resource Identifier (URI), as shown earlier. One way of thinking about this concept is that the URI anchors the namespace. Because all URIs are unique (they are based on the Domain Name System, or DNS, a central registry that ensures that the same Internet domain does not occur in different places), the names in the namespace will also be unique. In the transformation sheet that we showed earlier, the xmlns:xsl attribute declares that all names that are prefixed with xsl belong to the XSLT namespace. The xmlns attribute declares that all names that do not have prefixes belong to the XHTML namespace. Because namespaces have unique names, it is possible for the XSLT processor to distinguish between the names that are part of the XSLT language and the names that are part of the XML documents that the transformation sheet is going to transform from and to.
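One consequence worth seeing in practice is that the namespace URI, not the prefix, is what marks an element as an XSLT instruction. The sketch below uses the arbitrary prefix t instead of xsl (a choice made only for this illustration) and should behave like any other XSLT 1.0 transformation sheet.

```xml
<?xml version="1.0"?>
<!-- The prefix "t" is arbitrary; the URI bound to it is what identifies XSLT. -->
<t:stylesheet version="1.0"
    xmlns:t="http://www.w3.org/1999/XSL/Transform">

  <t:output method="text"/>

  <!-- Output only the model name of the phone as plain text. -->
  <t:template match="/phone">
    <t:value-of select="normalize-space(model)"/>
  </t:template>

</t:stylesheet>
```

By convention nearly everyone uses the xsl prefix, which keeps transformation sheets easy to read, but a processor only looks at the URI that the prefix is bound to.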

Why XML and Not C?

We will look more closely at how a transformation takes place and will examine the XSLT language more in Chapter 3, "Simple Transformations." But as you might have noted, the XSLT language, which is used to transform XML documents, is itself expressed in XML. A transformation sheet is an XML document. In other words, if you use an XML editor to edit the documents that you need to transform, you can use the same editor to also edit the transformation sheet. Also, as we saw in the example, XSLT language constructs (for example, the xsl:value-of element) can be embedded into an XML document. In other words, you can transform transformation sheets. It is all XML.

It is not necessary to use XSLT to transform XML documents. It is possible to write a C or Java program or to use your favorite script language to perform all of the transformations described previously without using XSLT. Most programming languages today have standard libraries that you can use to change XML documents. So, why use XSLT? Why learn a new language?

As more and more information is stored as XML, the need for tools to create and change XML documents increases. As the need increases, it spreads from advanced developers that use C or Java every day to HTML authors and people who do not want to learn a complete programming language just to make a simple change in a few XML documents.

Not everyone is familiar with programming languages such as C or Java. And using any of these languages, even for a very simple transformation, requires the developer to pay attention to more concepts than the transformation itself: memory management, variables, compilation, and everything that comes with them. Because you are reading this book, you are either not satisfied with using C, Java, or a script language to change XML documents, or you are not an expert in those full-fledged programming languages.

The XML transformation sheet, as we noted earlier, is also an XML document. In other words, it can be managed by using the same mechanisms and tools that you use to manage your other XML documents. If you just have a few files, you might not need anything except the file system and the Web server to handle your document management. But if you have many documents and use a content-management system, you will recognize the advantage of being able to handle all documents by using the same system.

When to Use XSLT

What XSLT provides is a simple tool for performing simple actions. Here is a summary of what a transformation sheet can do:

  • Change the structure of the XML document. This function might mean changing the place of elements, inserting new elements, and removing elements. If in a new version of a document type a new element with more functionality has replaced an old one, it is easy to transform the old document to the new one.
  • Rename elements and attributes. This action might mean mapping names from one namespace into another. Or, one company might call paragraphs "paragraphs" while another company calls them "p" and a third company calls them "para." A simple transformation sheet can change the vocabulary into the desired one. It is also possible to turn elements into attributes.
  • Fragment one XML document into several smaller ones.
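Two of the actions in this list can be sketched in a few lines. The first rule below maps the vocabularies of two hypothetical companies onto a single p element; the second turns the child elements of the phone document into attributes. The element names paragraph and para come from the example in the list, and everything else is an assumption made for illustration.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Vocabulary mapping: paragraph and para elements both become p. -->
  <xsl:template match="paragraph | para">
    <p><xsl:apply-templates/></p>
  </xsl:template>

  <!-- Elements to attributes: each child of phone becomes an attribute of phone. -->
  <xsl:template match="phone">
    <phone>
      <xsl:for-each select="*">
        <!-- The attribute takes its name from the child element. -->
        <xsl:attribute name="{name()}">
          <xsl:value-of select="normalize-space(.)"/>
        </xsl:attribute>
      </xsl:for-each>
    </phone>
  </xsl:template>

</xsl:stylesheet>
```

Applied to the phone document, the second rule should yield a single empty phone element carrying manufacturer, model, and network attributes. Going in this direction is easy because attributes hold only text; an element with nested structure could not be flattened into an attribute without losing that structure.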

In practice, the transformations that can be done to an XML document depend on how much structure the document has. In a document that has a pronounced structure, where almost every character is inside its own element and attributes, a transformation sheet has many places to "hook into" and change. In a document that has a sparse structure, however (perhaps just one element with text inside), there is not much that a transformation sheet can do. As a rule, it is easy to transform "down" from much structure into less structure, but it is difficult to transform "up."

As we described previously, the formatting of the output does not have to be associated with the elements. There are other types of meanings, however, that can be associated with the element name. Some XML languages contain element types that are associated with semantic meaning. The WML language that is used for Wireless Application Protocol (WAP) phones has element types that represent complex logic that must be executed when the element is processed by the application (WAP browser). Element types that come with a heavy baggage of logic cannot, without losing functionality on the way, be transformed into another element from a different language unless that other element has equivalent or greater functionality than the original. It is easy to transform "dumb" XML documents that have little semantics in them into documents that have rich semantics. The opposite is, however, difficult.

XSLT is unidirectional. Transformations are not reversible. This concept is something that you need to consider -- not when writing transformation sheets, but when writing the content and selecting the markup language that you will use for the original content.

Style Sheets and XSL

As we mentioned in the introduction, one of the advantages of XML is the separation of the markup and the formatting. For online formatting of documents, the advantage is the association of element names with formatting through a Cascading Style Sheet (CSS). CSS is the most popular style sheet language on the Web and is intended for lightweight formatting of documents, which means that the formatting can be adapted to different presentation devices.

CSS is much less capable than XSL, the other formatting language of the W3C, but is easier to learn. XSL consists of two pieces: XSLT and XSL-FO (formatting objects, which are used to format an XML document with XSL). A CSS style sheet cannot change the structure of the source document or rename any of the names; rather, it can only format the presentation. Actually, CSS can do very little of what XSLT can do but is implemented in several browsers (although the quality of the implementation varies), which means that it is more suited for on-screen display formatting than XSL. XSLT and CSS are often used together. The classic use case is to transform a data-centric XML document into an HTML document plus a CSS style sheet. You can change links to style sheets in a document by using XSLT.
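The classic combination of XSLT and CSS can be sketched as follows: the transformation sheet produces the HTML structure and inserts a link to a CSS style sheet that handles the formatting. The file name phone.css and the class names are hypothetical and stand in for whatever the site actually uses.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Serialize the result as HTML. -->
  <xsl:output method="html"/>

  <xsl:template match="/phone">
    <html>
      <head>
        <title><xsl:value-of select="normalize-space(model)"/></title>
        <!-- Hypothetical CSS file; the formatting rules live there, not here. -->
        <link rel="stylesheet" type="text/css" href="phone.css"/>
      </head>
      <body>
        <p class="manufacturer"><xsl:value-of select="manufacturer"/></p>
        <p class="model"><xsl:value-of select="model"/></p>
        <p class="network"><xsl:value-of select="network"/></p>
      </body>
    </html>
  </xsl:template>

</xsl:stylesheet>
```

Changing the href value in the transformation sheet is enough to point every generated page at a different CSS style sheet, for instance one tuned for a smaller screen.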

One special styling scenario is to use XSLT to transform an XML document into formatting objects, as defined by the XSL Formatting Objects (XSL-FO) language. Originally, XSL Transformations and XSL Formatting Objects were one language called XSL. But as the work on XSL progressed, the transformation part was factored out and became XSLT. What was left of XSL was the definition of the formatting objects. This history explains the origin of the name, XSLT, or XSL Transformations.

CSS is not written in an XML-based language because it was defined before XML existed. In other words, CSS style sheets are not XML documents. They cannot be handled with the same tools as other XML documents -- something that is a disadvantage if you are trying to create a consistent data environment.

If you want to do more advanced formatting of the document (for instance, format it as a book that will be printed), you will want to look at transformations into XML-based formatting. XSL demands more in terms of processing power and is more suited for use on large documents (for instance, manuals that are to be printed from an electronic storage format).
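As a rough sketch of what a transformation into XML-based formatting can look like, the following transformation sheet turns the phone document into a one-page XSL-FO document. The page size, the master name page, and the font settings are assumptions made for this example; a separate XSL-FO formatter would be needed to turn the result into a printable page.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fo="http://www.w3.org/1999/XSL/Format">

  <xsl:template match="/phone">
    <fo:root>
      <!-- Page layout: one A4-sized page master (assumed dimensions). -->
      <fo:layout-master-set>
        <fo:simple-page-master master-name="page"
            page-height="297mm" page-width="210mm">
          <fo:region-body/>
        </fo:simple-page-master>
      </fo:layout-master-set>

      <!-- Content: flows into the body region of the page master above. -->
      <fo:page-sequence master-reference="page">
        <fo:flow flow-name="xsl-region-body">
          <fo:block font-size="18pt" font-weight="bold">
            <xsl:value-of select="normalize-space(manufacturer)"/>
            <xsl:text> </xsl:text>
            <xsl:value-of select="normalize-space(model)"/>
          </fo:block>
          <fo:block font-size="12pt">
            <xsl:text>Network: </xsl:text>
            <xsl:value-of select="normalize-space(network)"/>
          </fo:block>
        </fo:flow>
      </fo:page-sequence>
    </fo:root>
  </xsl:template>

</xsl:stylesheet>
```

Compared with CSS, the formatting objects describe pages, regions, and flows explicitly, which is part of why this approach fits printed documents better than on-screen display.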

We will talk more about CSS and XSL Formatting Objects in Chapter 8, "XSLT and Style." But now, we will look more at the XML technologies behind XSLT.

Meet the Author

JOHAN HJELM is Senior Research Project Manager and a Senior Specialist for Ericsson in Yokosuka Research Park, Japan. He is the author of several books, including Designing Wireless Information Services (Wiley).
PETER STARK is Senior Engineer for Ericsson and has been working in the W3C (World Wide Web Consortium) and IETF (Internet Engineering Task Force) standards efforts for several years. He is the editor for XHTML Basic.
