Ever since the invention of the printing press, writers have made notes on manuscripts to instruct the printers concerning typesetting, and other production issues. These notes were called markup, and a collection of such notes that conform to a defined syntax and grammar can be called a language. For example, proofreaders use a hand-written markup language (ML) to communicate corrections to authors. Even the modern use of punctuation is a form of markup that remains with the text to advise the reader how to interpret that text. Most of these MLs use a distinct appearance so as to differentiate markup from the text to which it refers. Proofreaders' marks use a combination of cursive handwriting and special symbols to distinguish markup from the typeset text. Similarly, punctuation uses special symbols that cannot be confused with the alphabet and numbers that represent the textual content. Some punctuation symbols are so necessary to the understanding and production of printed English that they were included in the ASCII character set, the basis of the character sets used in almost all modern computers. Therefore these symbols also became part of modern programming language syntaxes, the standardization of the symbol set driving their re-appropriation for roles other than the punctuation of English.
The ASCII standard also defined a set of symbols (the "C0 control characters", with hexadecimal values 00 to 1F) that were intended to be used to mark up the structure of data transmissions. Only a few of these symbols found widespread acceptance, and their use was often inconsistent. The most common example is the character(s) used to delimit the end of a line of text in a document.
Teletype machines used the physical motion-based character pair CR-LF (carriage-return, line-feed), that was later used by both MS-DOS and MS-Windows. In contrast, Unix uses a single LF character, and the MacOS uses a single CR character. Because of these conflicting and non-standard uses of ASCII, document interchange between these systems often requires a translation step - a simple text file cannot be shared without conversion - and this is just the simplest of markup issues that doesn't even address the question of what constitutes a "line" of text. Most word-processing programs have eliminated the use of a text "line", and have instead treated end-of-line markup as "end-of-paragraph", with the ASCII period-space (". ") or period-space-space (".  ") strings being used to delimit sentences (though this method is imperfect).
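The translation step mentioned above can be sketched in a few lines of Java. This is only an illustrative sketch: the class name LineEndings and the method toUnix are invented for this example, and it simply maps both the CR-LF pair and a lone CR onto a single LF.

```java
/** Minimal sketch: normalize CR-LF and lone CR line endings to a single LF. */
final class LineEndings {
    private LineEndings() {}

    /** Replaces CR-LF pairs first, then any remaining lone CR, with LF. */
    static String toUnix(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) {
        String dosStyle = "line one\r\nline two\r\n";  // CR-LF, as on MS-DOS and MS-Windows
        String macStyle = "line one\rline two\r";      // lone CR, as on the MacOS
        // Both inputs normalize to the same LF-delimited text.
        System.out.println(toUnix(dosStyle).equals(toUnix(macStyle))); // prints true
    }
}
```

The CR-LF pair has to be handled before the lone CR; otherwise each pair would turn into two line breaks.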
Various forms of delimiters have been used to define the boundaries of containers for content, special symbol glyphs, presentation style of the text, or other special features of a document. For example, the C and C++ programming languages use braces ({ ... }) to delimit units of data or code, such as functions, data structures, and object definitions. A typesetting language, intended for manual human editing, might use more readable strings like .begin and .end. Other languages use other characters, or literal strings of characters - commonly called tags. Of course, there has often been conflict between different sets of tags and their interpretation. Without common delimiter vocabularies, much less common internal data formats, it has been very difficult to convert data from one format to another, or otherwise share data between applications and organizations.
In 1969, a person walked on the Moon for the first time. In the same year, Ed Mosher, Ray Lorie, and Charles F. Goldfarb of IBM Research invented the first modern markup language, Generalized Markup Language (GML). GML was a self-referential language for marking the structure of an arbitrary set of data, and was intended to be a meta-language - a language that could be used to describe other languages, their grammars and vocabularies. GML later became Standard Generalized Markup Language (SGML). In 1986, SGML was adopted as an international data storage and exchange standard by the International Organization for Standardization (ISO), designated ISO 8879 (see http://www.iso.ch). With the major impact of the World Wide Web (WWW) upon human commerce and communications, it could be argued that the quiet invention of GML was a more significant event in the history of technology than the high adventure of that first trip to another celestial body.
SGML is an extremely powerful (and rather complicated) markup language that has been widely used by the U.S. government and its contractors, large manufacturing companies, and publishers of technical information. Publishers often construct paper documents, such as books, reports, and reference manuals in SGML. These SGML documents are then transformed into a presentable format, and then sent to the typesetter and printer. SGML is also used to exchange technical specifications for manufacturing. However, its complexities and the high cost of its implementation have meant that most businesses and individuals cannot afford to embrace this useful technology.
More information about SGML can be found at http://www.oasis-open.org/cover
With advances in the development of the World Wide Web there was a drive for a simpler approach.
Origins and Goals of XML
In 1996, the World Wide Web Consortium (or W3C, http://www.w3.org) began the process of designing an extensible markup language that would combine the flexibility and power of SGML with the widespread acceptance of HTML. The language that became XML drew on the specification of SGML, and indeed, was specified to be a subset of this language. Using SGML as a starting point allowed the design team to concentrate on making what already worked simpler. SGML already provided an open-ended language that could be extended by anyone for their own purposes. The intention that XML should be simpler than SGML was driven by the consideration of ease-of-use: in part the reading and writing of markup by persons using simple and commonly available tools, but also the simplifying of computer processing of documents and interchange datasets. Due to its many optional features, SGML is so complex that it is difficult to write generic parsers, whereas XML parsers are much simpler. In addition, XML leverages existing Internet protocols and software for easy data processing and transmission. Being a proper subset of SGML, XML also retains backwards compatibility with existing SGML-oriented systems, so data marked up in XML could still be used in these systems, saving SGML-based industries a lot of money in conversion costs, whilst leveraging the greater accessibility provided by the Web.
XML 1.0 became a W3C Recommendation in February 1998. The formal specification, including the grammar in Extended Backus-Naur Form (EBNF) notation, is readily available on the Web from the W3C (at http://www.w3.org/TR/REC-xml); and there is also an excellent annotated version by Tim Bray, one of the co-editors of the XML specification (at http://www.xml.com/axml/testaxml.htm).
An XML 1.0 FAQ, maintained by Peter Flynn et al. on behalf of the W3C's XML Special Interest Group at http://www.ucc.ie/xml/, provides extensive links to other topics related to XML.
XML is a simple, standard way to delimit text data. It has been described as "the ASCII of the Web". It is as if you could use your favorite programming language to create an arbitrary data structure, and then share it with anyone using any other language on any other computing platform. XML tags name the concept you are describing, and named attributes modify the tagged structures. So, you can formally describe the syntax you have devised and share it with others.
Without worrying too much about the particulars of the syntax, we can immediately see how powerful a mechanism the simple addition of tags describing the information they envelop is.
This data description mechanism in XML means it is a great way to share information over the Internet, because:
However, XML is not, directly, a replacement for HTML. You can read every word of the XML Recommendation (the World Wide Web Consortium's equivalent of a standard) and not find a single word related to visual presentation. Unlike HTML, which fuses data and presentation, XML is about data alone.
Although XML itself is data, the XML community has not forgotten presentation. Unlike traditional methods of presenting data, which relied on extensive bodies of code, the presentation techniques for styling XML are data driven. These range from the simple to the extremely complex. Regardless of the technique chosen, however, XML styling is accomplished through another document dedicated to the task, called a style sheet. In it a designer specifies formatting styles and rules that determine when the styles should be applied. The same style sheet can then be used with multiple documents to create a similar appearance...
Professional XML is a broad compendium that investigates and describes how the total XML concept will work for programmers. It's the next edition of the popular XML Applications (Wrox 1998).
The focus of Professional XML is on real-world applications that use XML as an enabling technology. It presents good design techniques, and shows how to interface XML-enabled applications with Web applications and database systems. It explores the frontiers of XML and previews some nascent technologies. Whether your requirements are oriented toward data exchange or visual styling, this book will cover all the relevant techniques in the XML community.
Professional XML is for anyone who wants to use XML to build applications and systems. Web site developers can learn techniques to take their sites to the next level of sophistication. Managers, designers, and software architects can learn where XML fits into their systems and how to use it to solve problems in application integration. For further details about the book, and other books in our range, visit the Wrox Press Web Site.
In Chapter 5 we looked at how to write applications using the Document Object Model. In this chapter we'll look at an alternative way of processing an XML document: the SAX interface. We'll start by discussing why you might choose to use the SAX interface rather than the DOM. Then we'll explore the interface by writing some simple applications. We'll also discuss some design patterns that are useful when creating more complex SAX applications, and finally we'll look at where SAX is going next.
SAX is a very different style of interface from DOM. With DOM, your application asks what is in the document by following object references in memory; with SAX, the parser tells the application what is in the document by notifying the application of a stream of parsing events.
SAX stands for "Simple API for XML". Or if you really want it in full, the Simple Application Programming Interface for Extensible Markup Language.
As the name implies, SAX is an interface that allows you to write applications to read the data held in an XML document. It's primarily a Java interface, and all of our examples will be in Java. (Since we don't have the space to explain Java in this chapter we will assume knowledge of it for the purposes of this exposition. See Beginning Java 2, Wrox Press ISBN 1861002238, or the documentation at http://www.java.sun.com for more information.)
The SAX interface is supported by virtually every Java XML parser, and the level of compatibility is excellent. For a list of some of the implementations see http://www.xmlsoftware.com or David Megginson's site at http://www.megginson.com/SAX/
To write a SAX application in Java, you'll need to install the SAX classes (in addition to the Java JDK, of course). In most cases you'll find that the XML Parser does this for you automatically (we'll tell you where you can get parsers shortly). Check to see that classes such as org.xml.sax.Parser are present somewhere on your classpath. If not, you can install them from http://www.megginson.com/SAX/
We'll say a few words later on about where SAX came from and where it's going. But for the moment, we'll just mention a most remarkable feature: SAX doesn't belong to any standards body or consortium, nor to any company or individual; it just exists in cyberspace for anyone to implement and everyone to use. In particular, unlike most of the XML family of standards it has nothing to do with the W3C.
SAX development is co-ordinated by David Megginson, and its specification can be found on his site: http://www.megginson.com/SAX/ That specification, with trivial editorial changes, is reproduced for convenience in Appendix C of this book.
There are essentially three ways you can read an XML document from a program.
1. You can just read it as a file and sort out the tags for yourself. This is the hacker's approach, and we don't recommend it. You'll quickly find that dealing with all the special cases (different character encodings, escape conventions, internal and external entities, defaulted attributes and so on) is much harder work than you thought; probably you won't deal with all these special cases correctly and sooner or later someone will feed you a perfectly good XML document that your program can't handle. Avoid the temptation: it's not as if XML parsers are expensive (most are free).
2. You can use a parser that analyses the document and constructs a tree representation of its contents in memory: the output from the parser passes into the Document Object Model, or DOM. Your program can then start at the top of the tree and navigate around it, following references from one element to another to find the information it needs.
3. You can use a parser that reads the document and tells your program about the symbols it finds, as it finds them. For example it will tell you when it finds a start tag, when it finds some character data, and when it finds an end tag. This is called an event-based interface because the parser notifies the application of significant events as they occur. If this is the right kind of interface for you, use SAX.
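For contrast with the event-based approach, the tree-based approach (option 2) might look like the following sketch. It assumes the JAXP classes bundled with current Java runtimes (javax.xml.parsers and org.w3c.dom), which may differ from the parser packages available when this chapter was written; the class name DomFirstTitle and the element name "book" are illustrative assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

/** Sketch of the tree-based approach: build the whole tree, then navigate it. */
final class DomFirstTitle {
    private DomFirstTitle() {}

    /** Returns the text of the first "book" element, or null if there is none. */
    static String firstBookTitle(String xml) {
        try {
            // The parser reads the entire document into memory before we see any of it.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Start at the top of the tree and follow references downwards.
            Element root = doc.getDocumentElement();
            Element first = (Element) root.getElementsByTagName("book").item(0);
            return first == null ? null : first.getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstBookTitle("<books><book>Professional XML</book></books>"));
    }
}
```

Here the application is in control and asks the tree for what it wants, which is the opposite of the event-based style described next.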
Let's look at event-based parsing in a little more detail.
You may have come across the term 'event-based' in user interface programming, where an application is written to respond to events such as mouse-clicks as they occur. An event-based parser is similar: in particular, you have to get used to the idea that your application is not in control. Once things have been set in motion you don't call the parser, the parser calls you. That can seem strange at first, but once you get used to it, it's not a problem. In fact, it's much easier than user-interface programming, because unlike a user going crazy with a mouse, the XML parsing events occur in a rather predictable sequence. XML elements have to be properly nested, so you know that every element that's been opened will sooner or later be closed, and so on.
Consider a simple XML file such as the following:
<books>
  <book>Professional XML</book>
</books>
As the parser processes this, it will call a sequence of methods such as the following (we'll describe the actual method names and parameters later, this is just for illustration):
startElement( "books" )
startElement( "book" )
characters( "Professional XML" )
endElement( "book" )
endElement( "books" )
All your application has to do is to provide methods to be called when the events such as startElement and endElement occur.
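As a rough sketch of what providing such methods can look like, the following program records each event as the parser reports it. It is written against the SAX 2 classes shipped with current Java runtimes (org.xml.sax.helpers.DefaultHandler and javax.xml.parsers.SAXParserFactory), not the SAX 1.0 interfaces this chapter goes on to describe, so the method signatures differ from those shown later; the class name EventTrace is invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Sketch: collect a readable trace of the events a SAX parser reports. */
final class EventTrace {
    private EventTrace() {}

    /** Parses the XML text and returns one entry per event, in arrival order. */
    static List<String> trace(String xml) {
        final List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                events.add("startElement(" + qName + ")");
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                events.add("characters(" + new String(ch, start, length) + ")");
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("endElement(" + qName + ")");
            }
        };
        try {
            // Once parse() is called, the parser is in control and calls the handler.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return events;
    }

    public static void main(String[] args) {
        trace("<books><book>Professional XML</book></books>").forEach(System.out::println);
    }
}
```

The input is deliberately written without whitespace between the tags, since a parser would report that whitespace as additional events. A parser is also allowed to deliver one run of text in several characters calls, so the trace may contain more than one characters entry per element.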
Given that you have a choice, it's important to understand when it's best to use an event-based interface like SAX, and when it's better to use a tree-based interface like the DOM.
Both interfaces are well standardized and widely supported, so whichever you choose, you have a wide choice of good quality parsers available, most of which are free. In fact many of the parsers support both interfaces.
The following sections outline the most obvious benefits of the SAX interface.
Because there is no need to load the whole file into memory, memory consumption is typically much less than the DOM, and it doesn't increase with the size of the file. Of course the actual amount of memory used by the DOM depends on the parser, but in many cases a 100 KB document will occupy at least 1 MB of memory.
A word of caution though: if your SAX application builds its own in-memory representation of the document, it is likely to take up just as much space as if you allowed the parser to build it.
Your application might want to construct a data structure using high-level objects such as books, authors, and publishers rather than low-level elements, attributes, and processing instructions. These "business objects" might only be distantly related to the contents of the XML file; for example, they may combine data from the XML file and other sources. If you want to build up an application-oriented data structure in memory in this way, there is very little advantage in building up a low-level DOM structure first and then demolishing it. Just process each event as it occurs, to make the appropriate incremental change to your business object model.
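A minimal sketch of this idea follows. The Book class and the element names book, title, and author are assumptions made up for the illustration, and the code uses the SAX 2 DefaultHandler from current Java runtimes, which is not the SAX 1.0 API described in this chapter.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Sketch: build application-level objects directly from parsing events. */
final class CatalogLoader {
    private CatalogLoader() {}

    /** Hypothetical business object; not a type from any library. */
    static final class Book {
        final String title;
        final String author;

        Book(String title, String author) {
            this.title = title;
            this.author = author;
        }
    }

    static List<Book> load(String xml) {
        final List<Book> books = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private String title;
            private String author;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                text.setLength(0); // start collecting text for the new element
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length); // may arrive in several pieces
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                switch (qName) {
                    case "title":
                        title = text.toString().trim();
                        break;
                    case "author":
                        author = text.toString().trim();
                        break;
                    case "book":
                        // The element is complete: make the incremental change to the model.
                        books.add(new Book(title, author));
                        title = null;
                        author = null;
                        break;
                    default:
                        break;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return books;
    }
}
```

No element tree is ever built here: the only objects that survive the parse are the Book instances the application actually wants.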
If you are only interested, say, in counting how many books have arrived in the library this week, or in determining their average price, it is very inefficient and quite unnecessary to read all the data that you don't want into memory along with the small amount that you do want. One of the beauties of SAX is that it makes it very easy to ignore the data you aren't interested in.
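The average-price case could be sketched as below. The element name price and the class name PriceAverager are assumptions for the example, and the handler again uses the SAX 2 DefaultHandler from current Java runtimes. Everything except the price text is ignored simply by not writing any code for it.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Sketch: compute an average from "price" elements and ignore everything else. */
final class PriceAverager extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private boolean inPrice;
    private int count;
    private double total;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        inPrice = "price".equals(qName);
        if (inPrice) {
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inPrice) {
            text.append(ch, start, length); // text outside price elements is dropped
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("price".equals(qName)) {
            total += Double.parseDouble(text.toString().trim());
            count++;
        }
        inPrice = false;
    }

    /** Returns the mean of all price values, or 0.0 when the document has none. */
    static double averagePrice(String xml) {
        PriceAverager handler = new PriceAverager();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return handler.count == 0 ? 0.0 : handler.total / handler.count;
    }
}
```

The handler keeps only a running total and a count, so its memory use stays the same however many books the document contains.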
As the name suggests, it's really quite simple to use.
If it's possible to get the information you need from a single serial pass through the document, SAX will almost certainly be the fastest way to get it.
Having looked at the benefits it is only fair to address the potential drawbacks in using SAX.
Because the document is not in memory you have to handle the data in the order it arrives. SAX can be difficult to use when the document contains a lot of internal cross-references, for example using ID and IDREF attributes.
Complex searches can be quite messy to program as the responsibility is on you to maintain data structures holding any context information you need to retain, for example the attributes of the ancestors of the current element.
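One common way to keep such context is a stack that is pushed on every start tag and popped on every end tag. The sketch below records the full ancestor path of each title element; the class name PathTracker and the element names are illustrative assumptions, and the same stack could hold copies of ancestor attributes instead of names. It uses the SAX 2 DefaultHandler from current Java runtimes.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Sketch: the application maintains its own stack of open elements as context. */
final class PathTracker extends DefaultHandler {
    private final Deque<String> ancestors = new ArrayDeque<>();
    private final List<String> paths = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        ancestors.addLast(qName); // push: this element is now open
        if ("title".equals(qName)) {
            paths.add("/" + String.join("/", ancestors));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        ancestors.removeLast(); // pop: elements close in reverse order of opening
    }

    /** Returns the path from the root to every "title" element, in document order. */
    static List<String> titlePaths(String xml) {
        PathTracker handler = new PathTracker();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return handler.paths;
    }
}
```

Because XML elements are properly nested, a simple push and pop is enough to know the ancestors of the current element at any point in the parse.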
SAX 1.0 doesn't tell you anything about the contents of the DTD. Actually the DOM doesn't tell you much about it either, though some vendors have extended the DOM interface to do so. This isn't a problem for most applications: the DTD is mainly of interest to the parser; and as we'll see towards the end of the chapter the problem is fixed in SAX 2.0.
The design principle in SAX is that it doesn't provide you with lexical information. SAX tries to tell you what the writer of the document wanted to say, and avoids troubling you with details of the way they chose to say it. For example:
These restrictions are only a problem if you want to reproduce the way the document was written, perhaps for the benefit of future editing. For example, if you are writing an application designed to leave the existing content of the document intact, but to add some extra information from another source, the document author might get upset if you change the order of the attributes arbitrarily, or lose all the comments. In fact, most of the restrictions apply just as much to the DOM, although it does give you a little more information in some areas: for example, it retains comments. Again, many of the restrictions are fixed in SAX 2.0; though not all, for example the order of attributes is still a closely guarded secret, as is the choice of delimiter (single or double quotes).
The DOM allows you to create or modify a document in memory, as well as reading a document from an XML source file. SAX, by contrast, is designed for reading XML documents, not for writing them.
Actually it turns out that the SAX interface is quite handy for writing XML documents as well as reading them. As we'll see later, the same stream of events that the parser sends to the application when reading an XML document can equally be sent from the application to an XML generator when writing one.
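To make that idea concrete, here is a rough sketch in which the application, rather than a parser, calls the event methods, and the receiving object turns each event into XML text. It is not the generator discussed later in the chapter: the class name XmlWriterHandler and its bookList helper are invented for illustration, it handles only elements, attributes, and character data, and it uses the SAX 2 helper classes from current Java runtimes.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Sketch: a handler that serializes the events it receives as XML text. */
final class XmlWriterHandler extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        out.append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            out.append(' ').append(atts.getQName(i))
               .append("=\"").append(escape(atts.getValue(i))).append('"');
        }
        out.append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        out.append(escape(new String(ch, start, length)));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        out.append("</").append(qName).append('>');
    }

    /** Escapes the characters that would otherwise be read as markup. */
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** The application plays the role of the parser and fires the events itself. */
    static String bookList(String... titles) {
        XmlWriterHandler writer = new XmlWriterHandler();
        Attributes none = new AttributesImpl(); // empty attribute list
        writer.startElement("", "books", "books", none);
        for (String title : titles) {
            writer.startElement("", "book", "book", none);
            char[] text = title.toCharArray();
            writer.characters(text, 0, text.length);
            writer.endElement("", "book", "book");
        }
        writer.endElement("", "books", "books");
        return writer.out.toString();
    }

    public static void main(String[] args) {
        System.out.println(bookList("Professional XML"));
        // prints <books><book>Professional XML</book></books>
    }
}
```

Since the writer accepts the same calls a parser makes, the same class could also be handed to a parser as its handler, which would copy a document from input to output.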
Although there are many XML parsers that support the SAX interface, at the time of writing there isn't a parser built into a mainstream web browser that supports it. You can incorporate a SAX-compliant parser within a Java applet, of course, but the overhead of downloading it from the server may strain the patience of a user with a slow Internet connection. In practice, your choice of interfaces for client-side XML programming is rather limited...
What Does This Book Cover?
This book explains and demonstrates the essential techniques for designing, using, and displaying XML documents. First and foremost, this book covers the fundamentals of XML as they are codified by the World Wide Web Consortium (W3C). The W3C is the standards body that originated XML in a formal way and continues to develop specifications for XML. Although the wider XML community is increasingly jumping in and offering new XML-related ideas outside the control of the W3C, the W3C is still central and important to the development of XML.
The focus of this book is on learning how to use XML as an enabling technology in real-world applications. It presents good design techniques, and shows how to interface XML-enabled applications with Web applications and database systems. It explores the frontiers of XML and previews some nascent technologies. Whether your requirements are oriented toward data exchange or visual styling, this book will cover all the relevant techniques in the XML community.
Each chapter contains a practical example. As XML is a platform-neutral technology, the examples cover a variety of languages, parsers, and servers. All the techniques are relevant across all the platforms, so you can get valuable insight from the examples even if they are not implemented using your favorite platform.
Who Is This Book For?
This book is for anyone who wants to use XML to build applications and systems. Web site developers can learn techniques to take their sites to the next level of sophistication, while programmers and software architects can learn where XML fits into their systems and how to use it to solve problems in application integration.
XML applications are usually distributed in nature and are commonly Web oriented. This is not a book specifically about distributed systems or Web development, so you do not need deep familiarity with those areas. A general awareness of multi-tier architectures and internetworking via the Web will be sufficient.
How is this Book Structured?
Each chapter of this book takes up a separate topic pertaining to XML. Chapter 1 provides a conceptual introduction to the main aspects of XML. Chapters 2 and 3 are closely related as they cover the fundamentals of XML. Chapter 2 gets you started by covering the basic syntax and rules of XML. Chapter 3 takes you forward by providing tools for formally defining your own problem-specific XML vocabulary. The remaining chapters, however, are largely self-contained in terms of the techniques and technologies they present.
The main chapters are tied together with a unifying example. The example will assume a publisher wants to present their catalog of books in XML form. We will start by devising rules for describing books in a catalog, then build on those rules to show how each technology takes a turn in helping us build XML applications. You will see how book catalogs can be turned into documents, how such documents can be manipulated and accessed in code, and how their content can be styled for human readers. Since such an application would not, in practice, exist in a vacuum, you will also see how XML applications interface with databases.
There are several threads that run through this book which are outlined in the next section. This should allow you to read through the book focusing only on those issues that are important to you, skimming other sections.
XML is evolving from its simple roots as a document markup language to a large, wide-ranging field of related markup technologies. It is this growth that is powering XML applications. With growth comes divergence. Different readers will come to this book with different expectations. XML is different things to different people. While we hope that you will read this book cover to cover, that is not necessary. Indeed, that may not even be the best way for everyone to approach this book.
This book has three threads springing from a common core. While you can certainly start at the first chapter and work your way sequentially through to the last, you can follow a more direct path to the knowledge you need. Everyone should read the core chapters to gain a common understanding of what XML encompasses. From there, you can approach XML as data or as content for visual presentation and styling.
Chapters 2 (Well-formed XML) and 3 (DTDs) cover the fundamentals of XML 1.0. Chapter 2 gives you the basic syntax, while Chapter 3 tells you how to formally specify an XML vocabulary in a way that every XML programmer is expected to understand. These chapters form the irreducible minimum you need to understand XML and begin working with it. Chapter 4 (on Data Modeling) gives you effective guidelines and lessons in creating good XML structures. It's hard to recover from a bad XML vocabulary, but a good one will forgive a lot of programming mistakes. Chapter 5 teaches you the Document Object Model (DOM), the W3C's API for XML documents, among other things. This takes you out of the realm of documents and into the world of applications.
These four chapters are enough for you to begin XML applications programming. When you are finished with them, you will know what XML is, how to structure it, and how to manipulate XML documents in code. Although a wealth of XML techniques lies ahead, you will have a firm foundation upon which to build.
So the 'Core' thread includes:
Chapter 2: Well-formed XML
Chapter 3: Document Type Definitions
Chapter 4: Data Modeling
Chapter 5: Document Object Model
XML as Data
As you will see in the core chapters, XML, unlike HTML, clearly separates the content of a document from its visual representation. In fact, for the purposes of many applications, visual rendering of XML documents is not important. These applications treat XML as data. The concern here is with using XML as an interface between programs and systems. This may be the most exciting area of XML today especially where XML can enable e-commerce as a technique for Web applications that negotiate commercial transactions.
Chapter 6 starts this thread. It discusses an event driven API (called SAX) for manipulating XML documents. As such, the API is especially useful for processing very large quantities of XML, streams of XML, or for when you need the smallest possible footprint in a parser. Chapter 7 introduces Namespaces and Schemas, two areas that let us express concepts more creatively and effectively than we can with DTDs. They are the emerging core for describing data in XML.
Chapter 8 will show you how to link documents and query within a document for a particular element. The querying technology used in the examples actually stems from the styling side of XML, and this chapter does double duty by appearing in the 'Presentation' thread as well as this one. It is useful in this thread for demonstrating how queries can be used in quickly finding elements we need, and for showing how we can relate different XML documents. Chapter 9 (Manipulating XML) also covers techniques for transforming XML documents for various purposes. It is interesting from the standpoint of data because it presents some very powerful techniques for translating between vocabularies. It will prove useful for the interchange of data, particularly in e-commerce and business-to-business situations. Again, this chapter also has a bearing on the 'Presentation' thread as it introduces the idea of transforming XML documents into other languages, which can help when it comes to presenting XML for a user to view.
Chapter 10 (XML and Databases) is all about data. Relational databases and XML are two approaches to capturing data for computing, although they play different roles. This chapter teaches you how to interface the traditional approach to data storage with the use of XML. Chapter 11 (Server to Server) will show you how to reach out to another server when you don't have the data locally. This is a novel technique that is going to become common as web applications move to the forefront of computing. Chapter 12 then draws on the information in these two preceding chapters in its discussion of the use of XML as the messaging medium for e-commerce interactions. In this case, the other server belongs to a business partner. It examines the issues of exchanging data in this context, where XML fits into this picture, and details of how it is used.
Wrapping up this thread is the discussion of WAP (the Wireless Application Protocol) and its associated use of XML in the Wireless Markup Language (WML), in Chapter 14. Much of WAP is concerned with the metamorphosis of data from the verbose form of XML to a compact binary representation without losing the benefits of the former for use on mobile devices. Considering this problem and seeing WAP's solution will let you better appreciate the benefits of XML as a data exchange medium. In addition, if XML is going to be used to store and transfer data, you'll want to put your data on all the common data devices, which will increasingly include wireless devices like cell phones and dedicated Web devices.
So our XML-as-data thread consists of:
Chapter 6: SAX: The Simple API for XML
Chapter 7: Namespaces and Schemas
Chapter 8: Linking and Querying
Chapter 9: Manipulating XML
Chapter 10: XML and Databases
Chapter 11: Server to Server
Chapter 12: eBusiness
Chapter 14: WAP and WML
Visual Presentation of XML
The data thread is great for moving data about between machines, but if you plan to pass XML to a human, you will be interested in the styling thread. Unlike more traditional computing fields that have focused on data, such as relational databases, the XML community has given quite a bit of thought to how data can be rendered efficiently. XML's solution is, appropriately, data-driven. Whether we use CSS or XSL, we apply the data in style sheets to the data in an XML document to produce a visual representation for the human consumer of our data.
Chapter 8, Linking and Querying, starts this thread. This is because a subset of the querying technology lets a programmer specify a set of criteria that is used to select a part of a document that has to be styled. Styling can be as precise as specifying how to render particular elements depending on the context in which they are found. The same type of element can be rendered differently depending on who its parent is, or what else appears near it. With the context in place, Chapter 9 gives programmers the knowledge for transforming XML, if needed, into some other format suited to presentation. This is at the heart of data-driven styling.
Chapter 13 (Styling) builds on Chapters 8 and 9 to teach you styling for XML. Our style sheets become powerful sets of rules that are applied to the data in XML documents to create a visual presentation. From one set of data, you can quickly and efficiently produce multiple views for presentation. This is where the benefits of separating data from presentation are fully realized.
Chapter 14 (WAP) is included in the presentation thread because styling is an important consideration for small devices, and small devices are the primary users of wireless communications. It addresses how designers can compress the visual representation to fit the constraints of a very small display. This parallels the consideration our data counterparts have to give to compressing the data to fit through a low bandwidth network connection. Because your styling is driven by a style sheet and not embedded with the data, you can create an effective presentation format specifically for the wireless device.
To recap, our presentation thread consists of:
Chapter 8: Linking and Querying
Chapter 9: Manipulating XML
Chapter 13: Styling
Chapter 14: WAP
Posted May 11, 2001
Coverage of a wide variety of topics under the XML umbrella. Coverage is accurate and wide, but, sometimes, not very deep. This book is an excellent Intermediate Level reference for XML, SOAP, The XML Document Object Model (DOM) and SAX.