The Unicode Standard, Version 3.0



  • Characters for all the languages of the world
  • The standard for the new millennium
  • Required for XML and the Internet
  • The basis for modern software standards and products
  • The official way to implement ISO/IEC 10646
  • The key to global interoperability

The authoritative, technical guide to the creation of software for worldwide use.

Detailed specifications for Unicode:

  • Structure, conformance, encoding forms, character properties, semantics, equivalence, combining characters, logical ordering, conversion, allocation, big/little endian usage, Korean syllable formation, control characters, case mappings, numeric values, mathematical properties, writing directions (Arabic, Japanese, English, and so on), character shaping (Arabic, Devanagari, Tamil, and so on)

Expanded implementation guidelines by experts in global software design:

  • Normalization, sorting and searching, case mapping, compression, language tagging, boundaries (characters, word, lines, and sentences), rendering of non-spacing marks, transcoding to other character sets, handling unknown characters, surrogate pairs, numbers, editing and selection, keyboard input, and more

Comprehensive charts, references, glossary, and indexes:

  • Codes, names, appearances, aliases, cross-references, equivalences, radical-stroke ideographic index, Shift-JIS index, and more


The comprehensive Unicode Character Database for:

  • Character codes, names, properties, decompositions, upper-, lower-, and title cases, normalizations, shaping

International, national, and vendor character mappings for:

  • Western European, Japanese, Chinese, Korean, Greek, Russian, and others
  • Windows, Macintosh, Unix, and Linux

Unicode Technical Reports that extend the standard for:

  • Sorting, displaying, normalizing, linebreaking, compression, serialization, regular expressions, CR/LF, XML, case mappings, and more


Editorial Reviews

A book/CD-ROM technical guide to the Unicode character encoding standard, the international character code for information processing that includes all major scripts of the world and is the foundation for development of software for worldwide use. Early chapters cover information engineers need to produce a conforming implementation. Later chapters give basic information about each script and discuss specific characters. Includes reference and background information, plus a glossary. The CD-ROM contains a Unicode Character Database, technical reports, and international, national, and vendor character mappings for various languages. Annotation © Book News, Inc., Portland, OR (

Product Details

  • ISBN-13: 9780201616330
  • Publisher: Pearson Education
  • Publication date: 2/16/2000
  • Edition description: BK&CD ROM
  • Pages: 1072
  • Product dimensions: 8.89 (w) x 11.19 (h) x 1.77 (d)

Meet the Author

The Unicode Consortium is a non-profit organization founded to develop, extend, and promote use of the Unicode Standard. Members include companies and organizations that are the leaders in globalization technology. The Consortium is the source of unrivaled internationalization expertise.

Read an Excerpt

Chapter 2: General Structure

This chapter discusses the fundamental principles governing the design of the Unicode Standard and presents an overview of its main features. It includes discussion of text processes, unification principles, allocation of codespace, character properties, writing direction, and a description of combining marks and how they are employed in Unicode character encoding. This chapter also gives general requirements for creating a text-processing system that conforms to the Unicode Standard. Formal requirements for conformance appear in Chapter 3, Conformance. Character properties, both normative and informative, are given in Chapter 4, Character Properties. A set of guidelines for implementers is provided in Chapter 5, Implementation Guidelines.

Architectural Context

A character code standard such as the Unicode Standard enables the implementation of useful processes operating on textual data. The interesting end products are not the character codes but the text processes, because these directly serve the needs of a system's users. Character codes are like nuts and bolts: minor, but essential and ubiquitous components used in many different ways in the construction of computer software systems. No single design of a character set can be optimal for all uses, so the architecture of the Unicode Standard strikes a balance among several competing requirements.

Basic Text Processes

Most computer systems provide low-level functionality for a small number of basic text processes from which more sophisticated text-processing capabilities are built. The following text processes are supported by most computer systems to some degree:

  • Rendering characters visible (including ligatures, contextual forms, and so on)
  • Breaking lines while rendering (including hyphenation)
  • Modifying appearance, such as point size, kerning, underlining, slant, and weight (light, demi, bold, and so on)
  • Determining units such as "word" and "sentence"
  • Interacting with users in processes such as selecting and highlighting text
  • Modifying keyboard input and editing stored text through insertion and deletion
  • Comparing text in operations such as determining the sort order of two strings, or filtering or matching strings
  • Analyzing text content in operations such as spell-checking, hyphenation, and parsing morphology (that is, determining word roots, stems, and affixes)
  • Treating text as bulk data for operations such as compressing and decompressing, truncating, transmitting, and receiving

Text Elements, Code Values, and Text Processes

One of the more profound challenges in designing a worldwide character encoding stems from the fact that, for each text process, written languages differ in what is considered a fundamental unit of text, or a text element.

For example, in traditional German orthography, the letter combination "ck" is a text element for the process of hyphenation (where it appears as "k-k"), but not for the process of sorting; in Spanish, the combination "ll" may be a text element for the traditional process of sorting (where it is sorted between "l" and "m"), but not for the process of rendering; and in English, the objects "A" and "a" are usually distinct text elements for the process of rendering, but generally not distinct for the process of searching text. The text elements in a given language depend upon the specific text process; a text element for spell-checking may have different boundaries from a text element for sorting purposes.

A character encoding standard provides the fundamental units of encoding (that is, the abstract characters), which must exist in a unique relationship to the assigned numerical code values. These code values are the smallest addressable units of stored text.

An important class of text elements is called a grapheme, which typically corresponds to what a user thinks of as a "character." Figure 2-1 illustrates the relationship between abstract characters and graphemes.

The design of the character encoding must provide precisely the set of code values that allows programmers to design applications capable of implementing a variety of text processes in the desired languages. These code values may not map directly to any particular set of text elements that is used by one of these processes.
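
The gap between code values and what a user perceives as a "character" can be made concrete with a small sketch. It uses Python's standard `unicodedata` module (not something the book itself refers to), and the helper `count_base_units` is a hypothetical name introduced here; it only approximates grapheme counting by skipping combining marks and is not a full boundary algorithm.

```python
import unicodedata

# One user-perceived character ("a" with a grave accent) stored two ways.
precomposed = "\u00E0"   # LATIN SMALL LETTER A WITH GRAVE
decomposed = "a\u0300"   # LATIN SMALL LETTER A + COMBINING GRAVE ACCENT


def count_base_units(text: str) -> int:
    """Rough grapheme count: code points whose combining class is 0.

    Hypothetical helper for illustration only; real grapheme boundaries
    need more rules than this.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# Number of stored code values differs between the two spellings...
print(len(precomposed), len(decomposed))
# ...while the approximate count of user-level characters is the same.
print(count_base_units(precomposed), count_base_units(decomposed))
# Normalizing to a composed form makes the two spellings compare equal.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

A process such as cursor movement would work with the second count, while storage and transmission deal with the first.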

Text Processes and Encoding

In the case of English text using an encoding scheme such as ASCII, the relationships between the encoding and the basic text processes built on it are seemingly straightforward: characters are generally rendered visible one by one in distinct rectangles from left to right in linear order. Thus one character code inside the computer corresponds to one logical character in a process such as simple English rendering.

When designing an international and multilingual text encoding such as the Unicode Standard, the relationship between the encoding and implementation of basic text processes must be considered explicitly, for several reasons:

  • It is not always obvious that one set of text characters is an optimal encoding for a given language. For example, two approaches exist for the encoding of accented characters commonly used in French or Swedish: ISO/IEC 8859 defines letters such as "ä" and "ö" as individual characters, whereas ISO 5426 represents them by composition instead. In the Swedish language, both are considered distinct letters of the alphabet, following the letter "z". In French, the diaeresis on a vowel merely marks it as being pronounced in isolation. In practice, both approaches can be used to implement either language.
  • No encoding can support all basic text processes equally well. As a result, some trade-offs are necessary. For example, ASCII defines separate codes for uppercase and lowercase letters. This choice causes some text processes, such as rendering, to be carried out more easily, but other processes, such as comparison, to become more difficult. A different encoding design for English, such as case-shift control codes, would have the opposite effect. In designing a new encoding scheme for complex scripts, such trade-offs must be evaluated and decisions made explicitly, rather than unconsciously.

For these reasons, design of the Unicode Standard is not specific to the design of particular basic text-processing algorithms. Instead, it provides an encoding that can be used with a wide variety of algorithms. In particular, sorting and string comparison algorithms cannot assume that the assignment of Unicode character code numbers provides an alphabetical ordering for lexicographic string comparison. Culturally expected sorting orders require arbitrarily complex sorting algorithms. The expected sort sequence for the same characters differs across languages; thus, in general, no single acceptable lexicographic ordering exists. (See Section 5.17, Sorting and Searching, for implementation guidelines.)
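
As a rough illustration of why code value order is not alphabetical order, the sketch below sorts a few words first by raw code point and then with a simple key. The key function `rough_key` is a hypothetical helper, not an algorithm from the standard: it strips combining marks and folds case, which is far from a culturally correct collation (it would be wrong for languages that treat accented letters as separate letters of the alphabet).

```python
import unicodedata

words = ["zebra", "Zoo", "apple", "\u00C4pfel", "Apple"]  # "\u00C4" is A with diaeresis

# Plain sorted() compares code point values: all uppercase ASCII letters
# come before lowercase ones, and the accented capital comes after "z".
by_code_point = sorted(words)


def rough_key(word: str):
    """Illustrative key only: decompose, drop combining marks, fold case.

    The original word is kept as a tie-breaker so the order is deterministic.
    """
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (stripped.casefold(), word)


by_rough_key = sorted(words, key=rough_key)

print(by_code_point)
print(by_rough_key)
```

The two results differ, and neither is right for every language, which is the point of the paragraph above: an implementation needs a separate, language-aware collation step.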

Text processes supporting many languages are often more complex than they are for English. The character encoding design of the Unicode Standard strives to minimize this additional complexity, enabling modern computer systems to interchange, render, and manipulate text in a user's own script and language, and possibly in other languages as well...


Table of Contents

Unicode Consortium Members and Directors.
Full Members.
Current Associate Members.
Current Liaison Members.
Current Specialist Members.
Current Individual Members.
Current Members of the Board of Directors.

About the Unicode Standard.
Concepts, Architecture, Conformance, and Guidelines.
Character Block Descriptions.
Charts and Index.
Appendices and Tables.
The Unicode Character Database and Technical Reports.
On the CD-ROM.

Notational Conventions.
Extended BNF.

Unicode Website.
Unicode Anonymous FTP Site.
Unicode Public Mailing List.
How to Contact the Unicode Consortium.

1. Introduction.
Standards Coverage.
New Characters.

Design Basis.
Text Handling.
Interpreting Characters.
Text Elements.

The Unicode Standard and ISO/IEC 10646.
The Unicode Consortium.
The Unicode Technical Committee.

2. General Structure.
Architectural Context.
Basic Text Processes.
Text Elements, Code Values, and Text Processes.
Text Processes and Encoding.

Unicode Design Principles.
Sixteen-Bit Character Codes.
Characters, Not Glyphs.
Plain Text.
Logical Order.
Dynamic Composition.
Equivalent Sequence.

Encoding Forms.
Character Encoding Schemes.

Unicode Allocation.
Allocation Areas.
Codespace Assignment for Graphic Characters.
Nongraphic Characters, Reserved and Unassigned Codes.

Writing Direction.
Combining Characters.
Sequence of Base Characters and Diacritics.
Multiple Combining Characters.
Multiple Base Characters.
Spacing Clones of European Diacritical Marks.

Special Character and Noncharacter Values.
Byte Order Mark (BOM).
Special Noncharacter Values.
Layout and Format Control Characters.
The Replacement Character.

Controls and Control Sequences.
Control Characters.
Representing Control Sequences.

Conforming to the Unicode Standard.
Characters Not Used in a Subset.

Referencing Versions of the Unicode Standard.

3. Conformance.
Conformance Requirements.
Byte Ordering.
Invalid Code Values.
Bidirectional Text.
Unicode Technical Reports.

Characters and Coded Representations.
Simple Properties.
Compatibility Decomposition.
Canonical Decomposition.

Special Character Properties.
Canonical Ordering Behavior.
Combining Classes.
Canonical Ordering.
Use with Collation.

Conjoining Jamo Behavior.
Syllable Boundaries.
Standard Syllables.
Hangul Syllable Composition.
Hangul Syllable Decomposition.
Hangul Syllable Names.

Bidirectional Behavior.
Directional Formatting Codes.
Basic Display Algorithm.
Resolving Embedding Levels.
Reordering Resolved Levels.
Bidirectional Conformance.
Implementation Notes.

4. Character Properties.
Case — Normative.
Combining Classes — Normative.
Directionality — Normative.
Jamo Short Names — Normative.
General Category — Normative in Part.
Numeric Value — Normative.
Mirrored — Normative.
Unicode 1.0 Names.
Mathematical Property.
Letters and Other Useful Properties.

5. Implementation Guidelines.
Transcoding to Other Standards.
Multistage Tables.
7-Bit or 8-Bit Transmission.
Mapping Table Resources.

ANSI/ISO C wchar_t.
Unknown and Missing Characters.
Unassigned and Private Use Character Codes.
Interpretable but Unrenderable Characters.
Reassigned Characters.

Handling Surrogate Pairs.
Handling Numbers.
Handling Properties.
Line Handling.
Regular Expressions.
Language Information in Plain Text.
Requirements for Language Tagging.
Working with Language Tags.
Language Tags and Han Unification.

Editing and Selection.
Consistent Text Elements.

Strategies for Handling Nonspacing Marks.
Keyboard Input.

Rendering Nonspacing Marks.
Positioning Methods.

Locating Text Element Boundaries.
Boundary Specification.
Example Specifications.
Grapheme Boundaries.
Word Boundaries.
Line Boundaries.
Sentence Boundaries.
Random Access.

Syntactic Rule.

Sorting and Searching.
Culturally Expected Sorting.
Unicode Character Equivalence.
Similar Characters.
Levels of Comparison.
Ignorable Characters.
Multiple Mappings.
Collating Out-of-Scope Characters.
Unmapped Characters.
Sublinear Searching.

Case Mappings.

6. Punctuation.
General Punctuation.
Punctuation: U+0020-U+00BF.
General Punctuation: U+2000-U+206F.
CJK Symbols and Punctuation: U+3000-U+303F.
CJK Compatibility Forms: U+FE30-U+FE4F.
Small Form Variants: U+FE50-U+FE6F.

7. European Alphabetic Scripts.
Latin.
Letters of Basic Latin: U+0041-U+007A.
Letters of the Latin-1 Supplement: U+00C0-U+00FF.
Latin Extended-A: U+0100-U+017F.
Latin Extended-B: U+0180-U+024F.
IPA Extensions: U+0250-U+02AF.
Latin Extended Additional: U+1E00-U+1EFF.
Latin Ligatures: U+FB00-U+FB06.

Greek: U+0370-U+03FF.
Greek Extended: U+1F00-U+1FFF.

Cyrillic: U+0400-U+04FF.

Armenian: U+0530-U+058F.

Georgian: U+10A0-U+10FF.

Runic: U+16A0-U+16F0.

Ogham: U+1680-U+169F.

Modifier Letters.
Spacing Modifier Letters: U+02B0-U+02FF.

Combining Marks.
Combining Diacritical Marks: U+0300-U+036F.
Combining Marks for Symbols: U+20D0-U+20FF.
Combining Half Marks: U+FE20-U+FE2F.

8. Middle Eastern Scripts.
Hebrew: U+0590-U+05FF.
Alphabetic Presentation Forms: U+FB1D-U+FB4F.

Arabic: U+0600-U+06FF.
Cursive Joining.
Arabic Presentation Forms-A: U+FB50-U+FDFF.
Arabic Presentation Forms-B: U+FE70-U+FEFF.

Syriac: U+0700-U+074F.
Syriac Shaping.
Syriac Cursive Joining.

Thaana: U+0780-U+07BF.

9. South and Southeast Asian Scripts.
Devanagari: U+0900-U+097F.

Bengali: U+0980-U+09FF.

Gurmukhi: U+0A00-U+0A7F.

Gujarati: U+0A80-U+0AFF.

Oriya: U+0B00-U+0B7F.

Tamil: U+0B80-U+0BFF.

Telugu: U+0C00-U+0C7F.

Kannada: U+0C80-U+0CFF.

Malayalam: U+0D00-U+0D7F.

Sinhala: U+0D80-U+0DFF.

Thai: U+0E00-U+0E7F.

Lao: U+0E80-U+0EFF.

Tibetan: U+0F00-U+0FBF.

Myanmar: U+1000-U+109F.

Khmer: U+1780-U+17FF.

10. East Asian Scripts.
CJK Unified Ideographs.
CJK Compatibility Ideographs: U+F900-U+FAFF.
Kanbun: U+3190-U+319F.
CJK and KangXi Radicals: U+2E80-U+2FD5.
Ideographic Description: U+2FF0-U+2FFB.

Hiragana: U+3040-U+309F.

Katakana: U+30A0-U+30FF.
Halfwidth and Fullwidth Forms: U+FF00-U+FFEF.

Hangul Jamo: U+1100-U+11FF.
Hangul Compatibility Jamo: U+3130-U+318F.
Hangul Syllables: U+AC00-U+D7A3.

Bopomofo: U+3100-U+312F.

Yi: U+A000-U+A4CF.

11. Additional Scripts.
Ethiopic: U+1200-U+137F.

Cherokee: U+13A0-U+13FF.

Canadian Aboriginal Syllabics.
Canadian Aboriginal Syllabics: U+1400-U+167F.

Mongolian: U+1800-U+18AF.

12. Symbols.
Currency Symbols.
Currency Symbols: U+20A0-U+20CF.

Letterlike Symbols.
Letterlike Symbols: U+2100-U+214F.

Number Forms.
Number Forms: U+2150-U+218F.
Superscripts and Subscripts: U+2070-U+209F.

Mathematical Operators.
Mathematical Operators: U+2200-U+22FF.
Arrows: U+2190-U+21FF.

Technical Symbols.
Control Pictures: U+2400-U+243F.
Miscellaneous Technical: U+2300-U+23FF.
Optical Character Recognition: U+2440-U+245F.

Geometrical Symbols.
Box Drawing: U+2500-U+257F.
Block Elements: U+2580-U+259F.
Geometric Shapes: U+25A0-U+25FF.

Miscellaneous Symbols and Dingbats.
Miscellaneous Symbols: U+2600-U+26FF.
Dingbats: U+2700-U+27BF.

Enclosed and Square.
Enclosed Alphanumerics: U+2460-U+24FF.
Enclosed CJK Letters and Months: U+3200-U+32FF.
CJK Compatibility: U+3300-U+33FF.

Braille: U+2800-U+28FF.

13. Special Areas and Format Characters.
Control Codes.
C0 Control Codes: U+0000-U+001F.
C1 Control Codes: U+0080-U+009F.

Layout Controls.
Layout Controls.

Deprecated Format Characters.
Deprecated Format Characters: U+206A-U+206F.

Surrogates Area.
Surrogates Area: U+D800-U+DFFF.

Private Use Area.
Private Use Area: U+E000-U+F8FF.

Specials: U+FEFF, U+FFF0-U+FFFF.

14. Code Charts.
Character Names List.
Images in the Code Charts and Character Lists.
Cross References.
Case Form Mappings.
Information about Languages.
Reserved Characters.

CJK Unified Ideographs.
Hangul Syllables.

15. Han Indices.
Han Radical-Stroke Index.
Shift-JIS Index.

Appendix A: Han Unification History.
Appendix B: Submitting New Characters.
Proposal Guidelines.
Requirements of Proposal Form and Process.
Interim Solutions.
Sending Proposals.

Appendix C: Relationship to ISO/IEC 10646.
Unicode 1.0.
Unicode 2.0.
Unicode 3.0.

Encoding Forms in ISO/IEC 10646.
Zero Extending.

UCS Transformation Formats.

Synchronization of the Standards.
Identification of Features for the Unicode Standard.
Character Names.
Character Functional Specifications.

Appendix D: Changes from Unicode Version 2.0.
Versions of the Unicode Standard.
Changes from Unicode Version 2.0 to Version 2.1.
New Characters Added.
Character Semantics Changes.
Changes Affecting Conformance.

Changes from Unicode Version 2.1 to Version 3.0.
New Characters Added.
Character Semantics Changes.
Changes Affecting Conformance.
Unicode Technical Reports.

Source Standards.
Source Dictionaries for Han Unification.
Other Sources for the Unicode Standard.
Selected Resources.

Unicode Names Index.
General Index.



This book, The Unicode Standard, Version 3.0, is the authoritative source of information on the Unicode character encoding standard, the international character code for information processing that includes all major scripts of the world and is the foundation for development of software for worldwide use. As well as encoding characters used for written communication in a simple and consistent manner, the Unicode Standard defines character properties and algorithms for use in implementations. Version 3.0 expands on material from Versions 2.0 and 2.1 and supersedes all other previous versions. The previous versions of the Unicode Standard are:
  • The Unicode Standard, Version 1.0, Volume 1 (1991)
  • The Unicode Standard, Version 1.0, Volume 2 (1992)
  • The Unicode Standard, Version 1.1, Unicode Technical Report #4 (1993)
  • The Unicode Standard, Version 2.0 (1996)
  • The Unicode Standard, Version 2.1, Unicode Technical Report #8 (1998)
Major additions to Version 3.0 include:
  • conformance rules for transformation formats
  • new scripts including Ethiopic, Khmer, Mongolian, Myanmar, and Sinhala
  • restructured and enhanced character block descriptions
  • clarified bidirectional algorithm
  • updated implementation guidelines
  • a Shift-JIS index
The Unicode Standard maintains consistency with the international standard ISO/IEC 10646. Version 3.0 of the Unicode Standard corresponds to ISO/IEC 10646-1:2000.
0.1 About the Unicode Standard
This book defines Version 3.0 of the Unicode Standard. The general principles and architecture of the Unicode Standard, requirements for conformance, and guidelines for implementers precede the actual coding information. Useful ancillary information is given in the appendices. The accompanying CD-ROM contains tables of use to implementers and all technical reports published to date.
Concepts, Architecture, Conformance, and Guidelines
The first five chapters of Version 3.0 introduce the Unicode Standard and provide the information an engineer needs to produce a conforming implementation. Basic text processing, working with combining marks, encoding forms, and doing bidirectional text layout are all described. A special chapter on implementation guidelines answers many common questions that arise when implementing Unicode.
  • Chapter 1 introduces the standard's basic concepts, design basis, and coverage, and discusses basic text handling requirements.
  • Chapter 2 sets forth the fundamental principles underlying the Unicode Standard and covers specific topics such as text processes, overall character properties, and the use of combining marks.
  • Chapter 3 constitutes the formal statement of conformance. This chapter also presents the normative algorithms for three processes: the canonical ordering of combining marks, the encoding of Korean Hangul syllables by conjoining jamo, and the formatting of bidirectional text.
  • Chapter 4 describes character properties in detail, both normative (required) and informative. Tables giving additional character property information appear on the CD-ROM.
  • Chapter 5 discusses implementation issues, including compression, strategies for dealing with unknown and unsupported characters, and transcoding to other standards.
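
To give a feel for the first of the Chapter 3 algorithms mentioned above, the canonical ordering of combining marks, here is a minimal sketch. It assumes Python's `unicodedata` module as the source of combining classes; `canonical_reorder` is a hypothetical helper written for this illustration and handles only already-decomposed input.

```python
import unicodedata


def canonical_reorder(text: str) -> str:
    """Stable-sort each run of combining marks by combining class.

    Illustrative sketch only: it does not decompose characters first.
    """
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # A character of class 0 ends the current run of marks.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)


# "a" + COMBINING ACUTE ACCENT (class 230) + COMBINING DOT BELOW (class 220)
text = "a\u0301\u0323"
reordered = canonical_reorder(text)

print([f"U+{ord(ch):04X}" for ch in reordered])
# The library's own canonical decomposition yields the same order.
print(reordered == unicodedata.normalize("NFD", text))
```

Because the sort is stable, marks with the same combining class keep their relative order, so only sequences whose marks do not interact typographically are rearranged.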
Character Block Descriptions
Chapters 6 through 13 contain the character block descriptions that give basic information about each script or collection and may discuss specific characters or pertinent layout information.
  • Chapter 6 describes the general punctuation characters.
  • Chapter 7 presents the European Alphabetic scripts, including Latin, Greek, Cyrillic, Armenian, Georgian, Runic, Ogham, and associated combining marks.
  • Chapter 8 presents the Middle Eastern, right-to-left scripts: Hebrew, Arabic, Syriac, and Thaana.
  • Chapter 9 covers the South and Southeast Asian scripts, including Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Tibetan, Thai, Lao, Khmer, and Myanmar.
  • Chapter 10 presents the East Asian scripts, including Han, Hiragana, Katakana, Hangul, Bopomofo, and Yi.
  • Chapter 11 presents other scripts, including Ethiopic, Cherokee, Canadian Aboriginal Syllabics, and Mongolian.
  • Chapter 12 presents symbols, including currency, letterlike and technical symbols, and mathematical operators.
  • Chapter 13 describes special characters such as the Private Use Area, surrogates, and specials.
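
Since each block occupies a contiguous range of code values, finding the block for a character is a simple range lookup. The sketch below uses a handful of ranges copied from the table of contents above; the `BLOCKS` table is deliberately incomplete and `block_of` is a hypothetical helper name.

```python
from bisect import bisect_right

# A few block ranges as listed in the table of contents (not a complete list).
BLOCKS = [
    (0x0370, 0x03FF, "Greek"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0530, 0x058F, "Armenian"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x0600, 0x06FF, "Arabic"),
    (0x0900, 0x097F, "Devanagari"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
]
STARTS = [start for start, _, _ in BLOCKS]


def block_of(ch: str):
    """Return the block name for ch, or None if it is not in the sample table."""
    cp = ord(ch)
    i = bisect_right(STARTS, cp) - 1   # last block starting at or before cp
    if i >= 0:
        _, end, name = BLOCKS[i]
        if cp <= end:
            return name
    return None


print(block_of("\u0416"))  # CYRILLIC CAPITAL LETTER ZHE
print(block_of("\u3042"))  # HIRAGANA LETTER A
print(block_of("A"))       # Basic Latin is not in this sample table
```

A binary search over sorted start values keeps the lookup fast even when the table holds every block.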
Charts and Index
The next two chapters document the Unicode Standard's character code assignments, their names and important descriptive information, and Han indices that aid in locating specific ideographs encoded in Unicode.
  • Chapter 14 gives the code charts and the Character Names List. The code charts contain the normative character encoding assignments, and the names list contains normative information as well as useful cross references and informational notes.
  • Chapter 15 provides a radical-stroke index to East Asian ideographs, as well as a Shift-JIS index.
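
What a Shift-JIS index expresses, a mapping between Shift-JIS codes and Unicode values, can be sketched with Python's built-in `shift_jis` codec. This is an assumption made for illustration: the codec's mapping table is not guaranteed to match the printed index in every entry, and `sjis_index_entry` is a hypothetical helper.

```python
def sjis_index_entry(ch: str) -> str:
    """Format one index line: Shift-JIS bytes in hex, then the U+ value."""
    sjis = ch.encode("shift_jis")
    return f"{sjis.hex().upper()} -> U+{ord(ch):04X}"


hiragana_a = "\u3042"  # HIRAGANA LETTER A
encoded = hiragana_a.encode("shift_jis")
decoded = encoded.decode("shift_jis")

print(sjis_index_entry(hiragana_a))
print(decoded == hiragana_a)  # the mapping round-trips for this character
```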
Appendices and Tables
The appendices contain detailed background information on important topics: character encoding systems, submission of proposals, and the history of Unicode and its relationship to ISO/IEC 10646.
  • Appendix A describes the history of Han Unification in the Unicode Standard.
  • Appendix B gives instructions on how to submit characters for consideration as additions to the Unicode Standard.
  • Appendix C details the relationship between the Unicode Standard and ISO/IEC 10646.
  • Appendix D lists the changes to the Unicode Standard since Version 2.0.
The appendices are followed by a glossary of terms, a bibliography, and two indices: an index to Unicode characters and an index to the text of Chapters 1 through 15.
The Unicode Character Database and Technical Reports
The Unicode Character Database is the name for a collection of files that contain character code values, character names, and character property data. It is described more fully in the file UnicodeCharacterDatabase.html. Version 3.0.0 of the database is provided on the accompanying CD-ROM. Updates and revisions will be made available online. See for information on the latest available version.
The following Unicode Technical Reports are formally part of this standard:
  • UTR #11: East Asian Width, Version 5.0
  • UTR #13: Unicode Newline Guidelines, Version 5.0
  • UTR #14: Line Breaking Properties, Version 6.0
  • UTR #15: Unicode Normalization Forms, Version 18.0
The latest available version of these reports is provided on the CD-ROM. Updates and revisions will be made available online. For information on the latest available version, see the CD-ROM
The CD-ROM contains the Unicode Character Database, which gives character codes, character names, character properties, and decompositions for decomposable or compatibility characters. In addition to the Unicode Character Database and Unicode Technical Reports that are part of this standard, the CD-ROM also contains additional technical reports (covering topics such as compression, collation, and transformation formats), as well as property-based mapping tables (for example, tables for case) and transcoding tables for international, national, and industry character sets (including the Han cross-reference table). For the complete contents of the CD-ROM, see its READ ME file. Please consult the Unicode Consortium's online resources (see Section 0.3, Resources) to obtain the most up-to-date versions of the materials on the CD-ROM.
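
The kinds of data the Unicode Character Database carries can be explored without the CD-ROM. The sketch below assumes Python's standard `unicodedata` module, which exposes a later version of the database than 3.0.0; `describe` is a hypothetical helper that gathers a few properties for one character.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few Unicode Character Database properties for one character."""
    return {
        "code": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),           # general category
        "combining_class": unicodedata.combining(ch),   # canonical combining class
        "bidi": unicodedata.bidirectional(ch),          # directionality
        "decomposition": unicodedata.decomposition(ch), # empty if none
    }


for ch in ("\u00E4", "\u0300", "\u05D0"):
    print(describe(ch))
```

The first character shows a canonical decomposition, the second a nonzero combining class, and the third a right-to-left directionality.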
0.2 Notational Conventions
Throughout this book, certain typographic conventions are used. In running text, an individual Unicode value is expressed as U+nnnn, where nnnn is a four-digit number in hexadecimal notation, using the digits 0-9 and the letters A-F (for 10 through 15, respectively).
  • U+0416 is the Unicode value for the character named CYRILLIC CAPITAL LETTER ZHE.
In tables, the U+ may be omitted for brevity.
A range of Unicode values is expressed as U+xxxx→U+yyyy, or U+xxxx—U+yyyy, or xxxx..yyyy, where xxxx and yyyy are the first and last Unicode values in the range, and the arrow, long dash, or two dots indicate a contiguous range inclusive of the endpoints.
  • The range U+0900→U+097F contains 128 character values.
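
The two conventions above are easy to check mechanically. In this small sketch, `u_plus`, `parse_u_plus`, and `range_size` are hypothetical helper names, and the name lookup relies on Python's `unicodedata` module.

```python
import unicodedata


def u_plus(ch: str) -> str:
    """Format a character's value as U+nnnn with at least four hex digits."""
    return f"U+{ord(ch):04X}"


def parse_u_plus(text: str) -> str:
    """Turn a string such as 'U+0416' back into the character it denotes."""
    return chr(int(text[2:], 16))


def range_size(first: int, last: int) -> int:
    """Number of values in a range that includes both endpoints."""
    return last - first + 1


zhe = parse_u_plus("U+0416")
print(u_plus(zhe), unicodedata.name(zhe))
print(range_size(0x0900, 0x097F))
```

The `+ 1` in `range_size` reflects that ranges in this notation include both endpoints.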
All Unicode characters have unique names, which are identical to those of the English-language edition of International Standard ISO/IEC 10646. Unicode character names contain only uppercase Latin letters A through Z, digits, space, and hyphen-minus; this convention makes it easy to generate computer-language identifiers automatically from the names. Unified East Asian ideographs are named CJK UNIFIED IDEOGRAPH-X, where X is replaced with the hexadecimal Unicode value (for example, CJK UNIFIED IDEOGRAPH-4E00). The names of Hangul syllables are generated algorithmically; for details, see Hangul Syllable Names in Section 3.11, Conjoining Jamo Behavior.
In running text, a formal Unicode name is shown in small capitals (for example, GREEK SMALL LETTER MU), and alternative names (aliases) appear in italics (for example, umlaut). Italics are also used to refer to a text element that is not explicitly encoded (for example, pasekh alef) or to set off a foreign word (for example, the Welsh word ynghyd). Phonemic transcriptions are shown between slashes, as in Khmer /khnyom/.
The symbols used in the character names list are described at the beginning of Chapter 14, Code Charts.
In the text of this book, the word "Unicode" when used alone as a noun refers to the Unicode Standard.
In this book, unambiguous dates of the current common era, such as 1999, are unlabeled. In cases of ambiguity, CE is used. Dates before the common era are labeled with BCE.
Extended BNF
The Unicode Standard and technical reports use an extended BNF format for describing syntax. As different conventions are used for BNF, Table 0-1, Extended BNF, lists the notation used here.

A sequence of characters is sometimes listed in text with angle brackets, such as <a, grave> or <U+0061, U+0300>.

Table 0-1. Extended BNF

Symbols        Meaning
x := ...       production rule
x y            the sequence consisting of x then y
x*             zero or more occurrences of x
x?             zero or one occurrence of x
x+             one or more occurrences of x
x | y          either x or y
( x )          for grouping
x || y         equivalent to (x | y | (x y))
{ x }          equivalent to (x)?
"abc"          string literals ( "_" is sometimes used to denote space for clarity)
'abc'          string literals (alternative form)
\u1234         Unicode characters within string literals or character classes
\v00101234     Unicode scalar values within string literals or character classes
U+HHHH         Unicode character literal: equivalent to '\uHHHH'
U-HHHHHHHH     Unicode character literal: equivalent to '\vHHHHHHHH'
charClass      character class (syntax below)
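
The U+HHHH and U-HHHHHHHH literals, the angle-bracket sequence notation, and the naming conventions described earlier can be tried out with a short Python sketch. The helper names (`parse_literal`, `parse_sequence`, `to_identifier`) are hypothetical, and the identifier rule (space and hyphen-minus both become underscores) is only one plausible choice, not something the standard prescribes. Python's `unicodedata` module reflects whatever Unicode version the interpreter ships with, which is later than Version 3.0.

```python
import re
import unicodedata

# U+HHHH (four hex digits) or U-HHHHHHHH (eight hex digits), as in Table 0-1.
_LITERAL = re.compile(r"U\+([0-9A-Fa-f]{4})|U-([0-9A-Fa-f]{8})")


def parse_literal(text):
    """Convert a 'U+HHHH' or 'U-HHHHHHHH' literal to a one-character string."""
    match = _LITERAL.fullmatch(text)
    if match is None:
        raise ValueError(f"not a Unicode character literal: {text!r}")
    return chr(int(match.group(1) or match.group(2), 16))


def parse_sequence(text):
    """Convert a sequence such as '<U+0061, U+0300>' to a string."""
    inner = text.strip().strip("<>")
    return "".join(parse_literal(part.strip()) for part in inner.split(","))


def to_identifier(name):
    """Hypothetical rule: turn a character name into an identifier."""
    return name.replace(" ", "_").replace("-", "_")


if __name__ == "__main__":
    for ch in parse_sequence("<U+0061, U+0300>"):
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))
    ideograph = parse_literal("U+4E00")
    print(unicodedata.name(ideograph))
    print(to_identifier(unicodedata.name(ideograph)))
```

Because names are restricted to uppercase letters, digits, space, and hyphen-minus, the two replacements in `to_identifier` are enough to produce a valid identifier in most programming languages.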


Character Classes. A character class is constructed from one or two base sets. It is either a single base set, the negation of a base set, or the (set) difference between two base sets. The base sets themselves are bounded by brackets and contain lists of characters, ranges of characters, general categories, or negations of general categories. The syntax follows:

charClass := baseSet
           | '¬' baseSet
           | baseSet '-' baseSet
baseSet   := '[' item (','? item)* ']'
item      := char
           | char '-' char
           | '{' '¬'? category '}'

General categories, such as {Uppercase Letter}, are defined in Chapter 4, Character Properties. A main category such as {Mark} is equivalent to a list of its subcategories: {Non-Spacing Mark}, {Spacing Combining Mark}, {Enclosing Mark}. Examples are found in Table 0-2, Character Class Examples.

Table 0-2. Character Class Examples

Syntax                           Matches
[a-z]                            English lowercase letters
[a-z]-[c]                        English lowercase letters except for c
¬[c]                             all characters but c
[0-9]                            European decimal digits
[\u0030-\u0039]                  (same as above, using Unicode escapes)
[0-9, A-F, a-f]                  hexadecimal digits
[{Letter},{Non-Spacing Mark}]    all letters and non-spacing marks
[{L},{Mn}]                       (same as above, using abbreviated notation)
[{¬Cn}]                          all assigned Unicode characters
[\u0600-\u06FF]-[{Cn}]           all assigned Arabic characters
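
As a rough illustration of how the character class syntax could be evaluated, the following Python sketch compiles a class into a predicate on single characters. It assumes base sets are delimited by square brackets, as the prose above describes, and that spaces between items are insignificant. The names `compile_class` and `LONG_NAMES` are hypothetical, the long-name table is a small illustrative subset, and category data comes from Python's `unicodedata` module rather than from Unicode 3.0 itself.

```python
import re
import unicodedata

# Illustrative subset of long category names mapped to their abbreviations.
LONG_NAMES = {
    "Letter": "L",
    "Uppercase Letter": "Lu",
    "Mark": "M",
    "Non-Spacing Mark": "Mn",
}

_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")
_CLASS = re.compile(r"(¬)?\s*\[(.*?)\](?:\s*-\s*\[(.*?)\])?")


def _parse_base_set(body):
    """Build a predicate from the text between '[' and ']'."""
    body = _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), body)
    tests = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch in ", ":  # separators (spaces assumed insignificant)
            i += 1
        elif ch == "{":  # general category, optionally negated
            end = body.index("}", i)
            spec = body[i + 1:end]
            negate = spec.startswith("¬")
            name = spec[1:] if negate else spec
            prefix = LONG_NAMES.get(name, name)
            # A main category such as "L" matches every subcategory "L?".
            tests.append(
                lambda c, p=prefix, n=negate:
                    unicodedata.category(c).startswith(p) != n
            )
            i = end + 1
        elif body[i + 1:i + 2] == "-" and i + 2 < len(body):  # range
            lo, hi = ch, body[i + 2]
            tests.append(lambda c, lo=lo, hi=hi: lo <= c <= hi)
            i += 3
        else:  # single character
            tests.append(lambda c, ch=ch: c == ch)
            i += 1
    return lambda c: any(t(c) for t in tests)


def compile_class(text):
    """Compile a charClass expression into a function str -> bool."""
    match = _CLASS.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a character class: {text!r}")
    negated, first, second = match.groups()
    base = _parse_base_set(first)
    if second is not None:  # set difference: baseSet '-' baseSet
        minus = _parse_base_set(second)
        return lambda c: base(c) and not minus(c)
    if negated:  # negation: '¬' baseSet
        return lambda c: not base(c)
    return base


if __name__ == "__main__":
    lowercase_except_c = compile_class("[a-z]-[c]")
    print(lowercase_except_c("b"), lowercase_except_c("c"))
```

The sketch does not handle escaping of syntax characters such as `]`, `,`, or `{` inside a base set; a fuller implementation would need to.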

Operators used in this standard are listed in Table 0-3, Operators.

Table 0-3. Operators

~ allow break here (see Section 5.15, Locating Text Element Boundaries)
x do not allow a break here
→ is transformed to, or behaves like
/ integer division (rounded down)
% modulo operation; equivalent to the integer remainder for positive numbers
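
The `/` and `%` operators in Table 0-3 correspond to Python's `//` and `%`. One place they are typically combined is the arithmetic behind the algorithmic Hangul syllable names mentioned earlier. The sketch below splits a Hangul syllable code point into its three jamo indices; the constants are the ones usually given for Section 3.11 and are stated here as assumptions, since this excerpt does not list them, and the function name `hangul_indices` is hypothetical.

```python
# Constants assumed from Section 3.11, Conjoining Jamo Behavior.
S_BASE = 0xAC00                   # first precomposed Hangul syllable
L_COUNT = 19                      # leading consonants
V_COUNT = 21                      # vowels
T_COUNT = 28                      # trailing consonants, including "none"
N_COUNT = V_COUNT * T_COUNT       # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT       # 11172 syllables in total


def hangul_indices(code_point):
    """Return (leading, vowel, trailing) indices for a Hangul syllable."""
    s_index = code_point - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError(f"U+{code_point:04X} is not a Hangul syllable")
    l_index = s_index // N_COUNT              # "/" in Table 0-3
    v_index = (s_index % N_COUNT) // T_COUNT  # "%" then "/"
    t_index = s_index % T_COUNT
    return l_index, v_index, t_index


if __name__ == "__main__":
    print(hangul_indices(0xD4DB))
    # For negative operands, floor division and modulo differ from
    # truncating division and remainder, which is why Table 0-3 limits
    # the equivalence to positive numbers.
    print(-7 // 2, -7 % 2)
```

Mapping the three indices to jamo short names, and from there to a full syllable name, requires the name tables from Section 3.11 and is omitted here.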


0.3 Resources
The Unicode Consortium provides a number of online resources for obtaining information and data about the Unicode Standard, as well as updates and corrigenda. They are listed below.

  • Unicode Web Site
  • Unicode Anonymous FTP Site
  • Unicode Public Mailing List
  • Postal address: P.O. Box 391476, Mountain View, CA 94039-1476 USA
Please check the Web site for up-to-date contact information, including telephone, fax, and courier delivery address.
