The Unicode Standard, Version 3.0

by The Unicode Consortium


  • Characters for all the languages of the world
  • The standard for the new millennium
  • Required for XML and the Internet
  • The basis for modern software standards and products
  • The official way to implement ISO/IEC 10646
  • The key to global interoperability

The authoritative, technical guide to the creation of software for worldwide use.

Detailed specifications for Unicode:

  • Structure, conformance, encoding forms, character properties, semantics, equivalence, combining characters, logical ordering, conversion, allocation, big/little endian usage, Korean syllable formation, control characters, case mappings, numeric values, mathematical properties, writing directions (Arabic, Japanese, English, and so on), character shaping (Arabic, Devanagari, Tamil, and so on)

Expanded implementation guidelines by experts in global software design:

  • Normalization, sorting and searching, case mapping, compression, language tagging, boundaries (characters, word, lines, and sentences), rendering of non-spacing marks, transcoding to other character sets, handling unknown characters, surrogate pairs, numbers, editing and selection, keyboard input, and more

Comprehensive charts, references, glossary, and indexes:

  • Codes, names, appearances, aliases, cross-references, equivalences, radical-stroke ideographic index, Shift-JIS index, and more


The comprehensive Unicode Character Database for:

  • Character codes, names, properties, decompositions, upper-, lower-, and title cases, normalizations, shaping

International, national, and vendor character mappings for:

  • Western European, Japanese, Chinese, Korean, Greek, Russian, and others
  • Windows, Macintosh, Unix, and Linux

Unicode Technical Reports that extend the standard for:

  • Sorting, displaying, normalizing, linebreaking, compression, serialization, regular expressions, CR/LF, XML, case mappings, and more

Editorial Reviews

A book/CD-ROM technical guide to the Unicode character encoding standard, the international character code for information processing that includes all major scripts of the world and is the foundation for development of software for worldwide use. Early chapters cover information engineers need to produce a conforming implementation. Later chapters give basic information about each script and discuss specific characters. Includes reference and background information, plus a glossary. The CD-ROM contains a Unicode Character Database, technical reports, and international, national, and vendor character mappings for various languages. Annotation c. Book News, Inc., Portland, OR (

Product Details

Pearson Education
Product dimensions: 8.89(w) x 11.19(h) x 1.77(d)

Read an Excerpt

Chapter 2: General Structure

This chapter discusses the fundamental principles governing the design of the Unicode Standard and presents an overview of its main features. It includes discussion of text processes, unification principles, allocation of codespace, character properties, writing direction, and a description of combining marks and how they are employed in Unicode character encoding. This chapter also gives general requirements for creating a text-processing system that conforms to the Unicode Standard. Formal requirements for conformance appear in Chapter 3, Conformance. Character properties, both normative and informative, are given in Chapter 4, Character Properties. A set of guidelines for implementers is provided in Chapter 5, Implementation Guidelines.

Architectural Context

A character code standard such as the Unicode Standard enables the implementation of useful processes operating on textual data. The interesting end products are not the character codes but the text processes, because these directly serve the needs of a system's users. Character codes are like nuts and bolts: minor, but essential and ubiquitous components used in many different ways in the construction of computer software systems. No single design of a character set can be optimal for all uses, so the architecture of the Unicode Standard strikes a balance among several competing requirements.

Basic Text Processes

Most computer systems provide low-level functionality for a small number of basic text processes from which more sophisticated text-processing capabilities are built. The following text processes are supported by most computer systems to some degree:

  • Rendering characters visible (including ligatures, contextual forms, and so on)
  • Breaking lines while rendering (including hyphenation)
  • Modifying appearance, such as point size, kerning, underlining, slant, and weight (light, demi, bold, and so on)
  • Determining units such as "word" and "sentence"
  • Interacting with users in processes such as selecting and highlighting text
  • Modifying keyboard input and editing stored text through insertion and deletion
  • Comparing text in operations such as determining the sort order of two strings, or filtering or matching strings
  • Analyzing text content in operations such as spell-checking, hyphenation, and parsing morphology (that is, determining word roots, stems, and affixes)
  • Treating text as bulk data for operations such as compressing and decompressing, truncating, transmitting, and receiving

Text Elements, Code Values, and Text Processes

One of the more profound challenges in designing a worldwide character encoding stems from the fact that, for each text process, written languages differ in what is considered a fundamental unit of text, or a text element.

For example, in traditional German orthography, the letter combination "ck" is a text element for the process of hyphenation (where it appears as "k-k"), but not for the process of sorting; in Spanish, the combination "ll" may be a text element for the traditional process of sorting (where it is sorted between "l" and "m"), but not for the process of rendering; and in English, the objects "A" and "a" are usually distinct text elements for the process of rendering, but generally not distinct for the process of searching text. The text elements in a given language depend upon the specific text process; a text element for spell-checking may have different boundaries from a text element for sorting purposes.
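The English case can be illustrated with a short Python sketch. It is not part of the standard; the helper name contains_ignoring_case is hypothetical, and Python's built-in str.casefold() stands in for whatever caseless-matching rule a real search process would use. Comparing stored code values keeps "S" and "s" distinct, while the search process treats them as the same text element.

```python
def contains_ignoring_case(text: str, query: str) -> bool:
    """Hypothetical helper: search that treats "A" and "a" as one text element."""
    # str.casefold() is Python's built-in transform for caseless matching.
    return query.casefold() in text.casefold()


sample = "Unicode Standard"

# Comparing the stored code values directly: "S" and "s" are different codes.
exact_match = "standard" in sample

# The search process ignores the case distinction.
caseless_match = contains_ignoring_case(sample, "standard")
```

Here exact_match is False and caseless_match is True: the same stored text yields different text elements depending on the process applied to it.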

A character encoding standard provides the fundamental units of encoding (that is, the abstract characters), which must exist in a unique relationship to the assigned numerical code values. These code values are the smallest addressable units of stored text.

An important class of text elements is called a grapheme, which typically corresponds to what a user thinks of as a "character." Figure 2-1 illustrates the relationship between abstract characters and graphemes.
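As a rough illustration of the difference between code values and graphemes, the following Python sketch groups each base character with the combining marks that follow it. The function name approximate_graphemes is hypothetical, and the grouping rule is a deliberate simplification: it only looks at the combining class reported by Python's unicodedata module, whereas real grapheme boundaries involve more rules than this.

```python
import unicodedata


def approximate_graphemes(text):
    """Hypothetical helper: group base characters with following combining marks.

    Simplification: a character is attached to the previous cluster only if
    its canonical combining class is nonzero.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch   # attach the mark to the preceding base
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


# "résumé" spelled with "e" followed by U+0301 COMBINING ACUTE ACCENT.
word = "re\u0301sume\u0301"

code_value_count = len(word)                       # counts stored code points
grapheme_count = len(approximate_graphemes(word))  # counts user-visible units
```

For this sample the string holds 8 code points but only 6 of the units a user would call "characters".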

The design of the character encoding must provide precisely the set of code values that allows programmers to design applications capable of implementing a variety of text processes in the desired languages. These code values may not map directly to any particular set of text elements that is used by one of these processes.

Text Processes and Encoding

In the case of English text using an encoding scheme such as ASCII, the relationships between the encoding and the basic text processes built on it are seemingly straightforward: characters are generally rendered visible one by one in distinct rectangles from left to right in linear order. Thus one character code inside the computer corresponds to one logical character in a process such as simple English rendering.

When designing an international and multilingual text encoding such as the Unicode Standard, the relationship between the encoding and implementation of basic text processes must be considered explicitly, for several reasons:

  • It is not always obvious that one set of text characters is an optimal encoding for a given language. For example, two approaches exist for the encoding of accented characters commonly used in French or Swedish: ISO/IEC 8859 defines letters such as "ä" and "ö" as individual characters, whereas ISO 5426 represents them by composition instead. In the Swedish language, both are considered distinct letters of the alphabet, following the letter "z". In French, the diaeresis on a vowel merely marks it as being pronounced in isolation. In practice, both approaches can be used to implement either language.
  • No encoding can support all basic text processes equally well. As a result, some trade-offs are necessary. For example, ASCII defines separate codes for uppercase and lowercase letters. This choice causes some text processes, such as rendering, to be carried out more easily, but other processes, such as comparison, to become more difficult. A different encoding design for English, such as case-shift control codes, would have the opposite effect. In designing a new encoding scheme for complex scripts, such trade-offs must be evaluated and decisions made explicitly, rather than unconsciously.
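The two approaches to accented letters described in the first point, a single code for the whole letter versus a base letter plus a combining mark, can both be written in Unicode. The Python sketch below is only an illustration using the standard unicodedata module; the variable names are made up for the example. It shows that the two spellings differ as stored code values but become equal once both are converted to the same normalization form.

```python
import unicodedata

# One code value for the whole letter (LATIN SMALL LETTER A WITH DIAERESIS).
precomposed = "\u00E4"

# Base letter "a" followed by U+0308 COMBINING DIAERESIS.
composed = "a\u0308"

# As stored code values, the two spellings differ.
same_code_values = precomposed == composed

# After normalizing to a common form, they compare equal.
equal_after_nfc = unicodedata.normalize("NFC", composed) == precomposed
equal_after_nfd = unicodedata.normalize("NFD", precomposed) == composed
```

Here same_code_values is False, while both normalized comparisons are True, which is why a comparison process usually normalizes text before comparing it.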

For these reasons, design of the Unicode Standard is not specific to the design of particular basic text-processing algorithms. Instead, it provides an encoding that can be used with a wide variety of algorithms. In particular, sorting and string comparison algorithms cannot assume that the assignment of Unicode character code numbers provides an alphabetical ordering for lexicographic string comparison. Culturally expected sorting orders require arbitrarily complex sorting algorithms. The expected sort sequence for the same characters differs across languages; thus, in general, no single acceptable lexicographic ordering exists. (See Section 5.17, Sorting and Searching, for implementation guidelines.)
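The point about sort order can be made concrete with a toy Python sketch. Both key functions below are hypothetical and far simpler than any real collation algorithm: one treats the diaeresis as a mere diacritic and sorts "ä" with "a", the other treats "ä" and "ö" as distinct letters that follow "z", as described above for Swedish. The sample words and the alphabet string are assumptions chosen for the illustration.

```python
import unicodedata

# Sample words, written with escapes so the precomposed letters are explicit.
words = ["zon", "\u00E4ta", "apa", "\u00F6l"]  # zon, äta, apa, öl

# Plain code-point order: whatever the numeric code values happen to give.
by_code_point = sorted(words)


def diacritic_key(word):
    """Toy key: drop combining marks so that "ä" sorts with "a"."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)


# Toy alphabet in which "ä" and "ö" are separate letters after "z".
LETTERS_AFTER_Z = "abcdefghijklmnopqrstuvwxyz\u00E4\u00F6"


def distinct_letter_key(word):
    """Toy key: rank letters by position in LETTERS_AFTER_Z.

    Raises ValueError for any character outside that toy alphabet.
    """
    return [LETTERS_AFTER_Z.index(ch) for ch in word]


diacritic_order = sorted(words, key=diacritic_key)
distinct_letter_order = sorted(words, key=distinct_letter_key)
```

With these words, diacritic_order is apa, äta, öl, zon, while distinct_letter_order is apa, zon, äta, öl. Code-point order happens to agree with the second convention for this sample but not the first, so no single ordering of the code values can serve both.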

Text processes supporting many languages are often more complex than they are for English. The character encoding design of the Unicode Standard strives to minimize this additional complexity, enabling modern computer systems to interchange, render, and manipulate text in a user's own script and language, and possibly in other languages as well...

Meet the Author

The Unicode Consortium is a non-profit organization founded to develop, extend, and promote use of the Unicode Standard. Members include companies and organizations that are the leaders in globalization technology. The Consortium is the source of unrivaled internationalization expertise.
