The Unicode Standard, Version 4.0




The Unicode Standard provides a unique code number for every character in electronic text, no matter what the platform, no matter what the application, no matter what the language. It is required for XML and is at the core of modern software products. Unicode 4.0 contains 96,248 characters covering languages of the world. The Unicode Standard contains extensive descriptions of each writing system, as well as definitions of character properties and detailed conformance requirements. It is the complete and definitive user's guide for novices and experts alike. This edition, The Unicode Standard, Version 4.0, adds 47,188 new characters for minority and historic scripts, several sets of symbols, and a very large collection of additional CJK ideographs. It provides updated specifications covering structure, conformance, character behavior and semantics, as well as implementation guidelines, detailed discussions of writing systems, comprehensive charts, and an extensive glossary. The accompanying CD-ROM includes the text of all the Unicode Standard Annexes and the entire Unicode Character Database.

Product Details

  • ISBN-13: 9780321185785
  • Publisher: Addison-Wesley
  • Publication date: 9/5/2003
  • Pages: 1504
  • Product dimensions: 8.80 (w) x 11.20 (h) x 2.20 (d)

Table of Contents

Unicode Consortium Members and Directors
1 Introduction
2 General Structure
3 Conformance
4 Character Properties
5 Implementation Guidelines
6 Writing Systems and Punctuation
7 European Alphabetic Scripts
8 Middle Eastern Scripts
9 South Asian Scripts
10 Southeast Asian Scripts
11 East Asian Scripts
12 Additional Modern Scripts
13 Archaic Scripts
14 Symbols
15 Special Areas and Format Characters
16 Code Charts
17 Han Radical-Stroke Index
A Han Unification History
B Abstracts of Unicode Technical Reports
C Relationship to ISO/IEC 10646
D Changes from Unicode Version 3.0
G: Glossary
R: References
I: Indices


This book, The Unicode Standard, Version 4.0, is the authoritative source of information on the Unicode character encoding standard.

Version 4.0 expands on and supersedes all other previous versions. The text of the standard has been extensively rewritten to improve its structure and clarity.

Major additions to Version 4.0 since Version 3.0 include:

  • extensive additions of CJK characters to cover dictionaries and historic usage
  • many new symbols for mathematical and technical publication
  • substantially improved specification of conformance requirements, incorporating the character encoding model
  • encoding of supplementary characters
  • formalized policies for stability of the standard
  • clarification of semantics of special characters, including the byte order mark
  • major expansion of Unicode Character Database properties and of specifications for text boundaries and casing
  • more minority scripts, including Limbu, Tai Le, Osmanya, and Philippine scripts
  • more historic scripts, including Linear B, Cypriot, and Ugaritic
  • tightened definition of encoding terms, including UTF-32

Furthermore, many individual characters were added to meet the requirements of users and implementers alike. The Unicode Standard maintains consistency with the international standard ISO/IEC 10646. Version 4.0 of the Unicode Standard corresponds to ISO/IEC 10646:2003.

0.1 About the Unicode Standard

This book, together with the Unicode Standard Annexes described in Appendix B and the Unicode Character Database, defines Version 4.0 of the Unicode Standard. The book gives the general principles, requirements for conformance, and guidelines for implementers, followed by character code charts and names.

Concepts, Architecture, Conformance, and Guidelines

The first five chapters of Version 4.0 introduce the Unicode Standard and provide the fundamental information needed to produce a conforming implementation. Basic text processing, working with combining marks, and encoding forms are all described. A special chapter on implementation guidelines answers many common questions that arise when implementing Unicode.

Chapter 1 introduces the standard's basic concepts, design basis, and coverage, and discusses basic text handling requirements.

Chapter 2 sets forth the fundamental principles underlying the Unicode Standard and covers specific topics such as text processes, overall character properties, and the use of combining marks.

Chapter 3 constitutes the formal statement of conformance. This chapter also presents the normative algorithms for two processes: the canonical ordering of combining marks and the encoding of Korean Hangul syllables by conjoining jamo.

Chapter 4 describes character properties in detail, both normative (required) and informative. Tables giving additional character property information appear in the Unicode Character Database.

Chapter 5 discusses implementation issues, including compression, strategies for dealing with unknown and unsupported characters, and transcoding to other standards.

Character Block Descriptions

Chapters 6 through 15 contain the character block descriptions that give basic information about each script or group of symbols and may discuss specific characters or pertinent layout information. Some of this information is required in order to produce conformant implementations of these scripts and other collections of characters.

Chapter 6 introduces writing systems and describes the general punctuation characters.

Chapter 7 presents the European Alphabetic scripts, including Latin, Greek, Cyrillic, Armenian, Georgian, and associated combining marks.

Chapter 8 presents the Middle Eastern, right-to-left scripts: Hebrew, Arabic, Syriac, and Thaana.

Chapter 9 covers the South Asian scripts, including Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Tibetan, and Limbu.

Chapter 10 covers the Southeast Asian scripts, including Thai, Lao, Tai Le, Myanmar, Khmer, and Philippine scripts.

Chapter 11 presents the East Asian scripts, including Han, Hiragana, Katakana, Hangul, Bopomofo, and Yi.

Chapter 12 presents other scripts, including Ethiopic, Mongolian, Osmanya, Cherokee, Canadian Aboriginal Syllabics, Deseret, and Shavian.

Chapter 13 describes archaic scripts, including Ogham, Old Italic, Runic, Gothic, Ugaritic, Linear B, and Cypriot.

Chapter 14 presents symbols, including currency, letterlike and technical symbols, mathematical operators, and musical symbols.

Chapter 15 describes other topics such as private-use characters, surrogate code points, and special characters.

Charts and Han Radical-Stroke Index

The next two chapters document the Unicode Standard's character code assignments, their names and important descriptive information, and provide a Han radical-stroke index that aids in locating specific ideographs encoded in Unicode.

Chapter 16 gives the code charts and the Character Names List. The code charts contain the normative character encoding assignments, and the names list contains normative information as well as useful cross references and informational notes.

Chapter 17 provides a radical-stroke index to East Asian ideographs.


The appendices contain detailed background information on important topics regarding the history of the Unicode Standard and its relationship to ISO/IEC 10646.

Appendix A describes the history of Han Unification in the Unicode Standard.

Appendix B provides abstracts of Unicode Technical Reports and lists other important Unicode resources.

Appendix C details the relationship between the Unicode Standard and ISO/IEC 10646.

Appendix D lists the changes to the Unicode Standard since Version 3.0.

The appendices are followed by a glossary of terms, a bibliography, and two indices: an index to Unicode characters and an index to the text of the book.

0.2 The Unicode Character Database and Technical Reports

The Unicode Character Database is a collection of data files that contain character code points, character names, and character property data. The data files, together with the documentation of the Unicode Character Database, are found on the Unicode Web site.

The files for Version 4.0.0 of the Unicode Character Database are also supplied on the CD-ROM that accompanies this book.

Information on versions of the Unicode Standard can be found on the Unicode Web site.

All versions of all Unicode Technical Reports, Unicode Technical Standards, and Unicode Standard Annexes are available on the Unicode Web site.

The latest available version of each document at the time of publication is included on the CD-ROM. See Appendix B for a summary overview of important Unicode Technical Standards, Unicode Technical Reports and Unicode Standard Annexes.

On the CD-ROM

The CD-ROM also contains additional information, such as sample code, which is maintained on the Unicode ftp site or via http. For the complete contents of the CD-ROM see its ReadMe.txt file.

0.3 Notational Conventions

Throughout this book, certain typographic conventions are used.

Code Points

In running text, an individual Unicode code point can be expressed as U+n, where n is from four to six hexadecimal digits, using the digits 0-9 and uppercase letters A-F (for 10 through 15, respectively). There should be no leading zeros, unless the code point would have fewer than four hexadecimal digits; for example, U+0001, U+0012, U+0123, U+1234, U+12345, U+102345.

  • U+0416 is the Unicode code point for the character named CYRILLIC CAPITAL LETTER ZHE.

In tables, the U+ may be omitted for brevity.
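The U+n convention described above can be sketched in a few lines of Python. The function names `format_code_point` and `parse_code_point` are hypothetical helpers chosen for this illustration; they are not part of the standard or of any library.

```python
import re

# Four hex digits (leading zeros allowed only to reach four digits),
# or five digits with no leading zero, or six digits starting with "10".
_U_NOTATION = re.compile(r"U\+([0-9A-F]{4}|[1-9A-F][0-9A-F]{4}|10[0-9A-F]{4})")


def format_code_point(cp: int) -> str:
    """Render an integer code point in U+n notation."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp!r}")
    # Pad to at least four uppercase hexadecimal digits, never more than needed.
    return f"U+{cp:04X}"


def parse_code_point(text: str) -> int:
    """Parse U+n notation back into an integer code point."""
    match = _U_NOTATION.fullmatch(text)
    if match is None:
        raise ValueError(f"not in U+n notation: {text!r}")
    return int(match.group(1), 16)


if __name__ == "__main__":
    for value in (0x1, 0x12, 0x123, 0x1234, 0x12345, 0x102345):
        print(format_code_point(value))
```

The sample values in the `__main__` block are the ones listed in the text, so the printed forms can be compared against it directly.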

A range of Unicode code points is expressed as U+xxxx-U+yyyy or xxxx..yyyy, where xxxx and yyyy are the first and last Unicode values in the range, and the long dash or two dots indicate a contiguous range inclusive of the endpoints. For ranges involving supplementary characters, the code points in the ranges are expressed with five or six hexadecimal digits.

  • The range U+0900-U+097F contains 128 Unicode code points.
  • The Plane 16 private use characters are in the range 100000..10FFFD.
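As a rough check on the range notation, the following Python sketch parses both spellings and counts the code points in each range. The helper name `parse_range` is an assumption made for this example, and accepting a hyphen, an en dash, or two dots as the separator is a simplification of the typographic convention.

```python
import re

# Optional "U+" prefixes, four to six uppercase hex digits on each side,
# separated by "..", a hyphen, or an en dash.
_RANGE = re.compile(
    r"(?:U\+)?([0-9A-F]{4,6})(?:\.\.|[-\u2013])(?:U\+)?([0-9A-F]{4,6})"
)


def parse_range(text: str) -> range:
    """Return the inclusive code point range written as xxxx..yyyy or U+xxxx-U+yyyy."""
    match = _RANGE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a code point range: {text!r}")
    first, last = int(match.group(1), 16), int(match.group(2), 16)
    if first > last or last > 0x10FFFF:
        raise ValueError(f"invalid range bounds: {text!r}")
    # Python ranges exclude the end point, so add 1 to include the last value.
    return range(first, last + 1)


if __name__ == "__main__":
    print(len(parse_range("U+0900-U+097F")))   # size of the first example range
    print(len(parse_range("100000..10FFFD")))  # size of the Plane 16 private use range
```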

Character Names

All Unicode characters have unique names, which are identical to those of the English language edition of International Standard ISO/IEC 10646. Unicode character names contain only uppercase Latin letters A through Z, digits, space, and hyphen-minus; this convention makes it easy to generate computer-language identifiers automatically from the names. Unified CJK ideographs are named CJK UNIFIED IDEOGRAPH-X, where X is replaced with the hexadecimal Unicode code point; for example, CJK UNIFIED IDEOGRAPH-4E00. The names of Hangul syllables are generated algorithmically; for details, see Hangul Syllable Names in Section 3.12, Conjoining Jamo Behavior.
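The naming rules lend themselves to a short illustration. The Python sketch below uses the standard library's `unicodedata` module, which reflects whichever version of the Unicode Character Database ships with the interpreter rather than Version 4.0 specifically. The helpers `cjk_ideograph_name` and `name_to_identifier` are hypothetical, and the identifier mapping shown is only one possible scheme, not one prescribed by the standard.

```python
import unicodedata


def cjk_ideograph_name(cp: int) -> str:
    """Build the name of a unified CJK ideograph from its code point."""
    return f"CJK UNIFIED IDEOGRAPH-{cp:04X}"


def name_to_identifier(name: str) -> str:
    """Turn a character name into an identifier by replacing space and hyphen-minus.

    This simple mapping is an assumption for illustration; it is not
    guaranteed to keep distinct names distinct in every case.
    """
    return name.replace(" ", "_").replace("-", "_")


if __name__ == "__main__":
    print(cjk_ideograph_name(0x4E00))
    print(unicodedata.name("\u4e00"))   # name reported by the bundled database
    print(unicodedata.name("\uac00"))   # an algorithmically named Hangul syllable
    print(name_to_identifier(unicodedata.name("\u0416")))
```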

In running text, a formal Unicode name is shown in small capitals, and alternative names (aliases) appear in italics (for example, umlaut). Italics are also used to refer to a text element that is not explicitly encoded (for example, pasekh alef) or to set off a non-English word (for example, the Welsh word ynghyd).


A sequence of two or more code points may be represented by a comma-delimited list, set off by angle brackets. For this purpose angle brackets consist of U+003C LESS-THAN SIGN and U+003E GREATER-THAN SIGN. Spaces are optional after the comma, and U+ notation for the code point is also optional; for example, "<U+0061, U+0300>". When the usage is clear from the context, a sequence of characters may also be represented with generic short names, for example as in "<a, grave>", or the angle brackets may be omitted.

In contrast to sequences of code points, a sequence of one or more code units may be represented by a list set off by angle brackets, but without comma delimitation or U+ notation. For example, the notation "<nn nn nn nn>" represents a sequence of bytes, as for the UTF-8 encoding form of a Unicode character. The notation "<nnnn nnnn>" represents a sequence of 16-bit code units, as for the UTF-16 encoding form of a Unicode character.
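The difference between the two notations is easier to see on a concrete string. The Python sketch below prints a code point sequence with commas and U+ prefixes, and the corresponding UTF-8 and UTF-16 code unit sequences without them. The three function names are hypothetical helpers for this illustration, and the sample characters are arbitrary choices.

```python
def code_point_sequence(text: str) -> str:
    """Comma-delimited code points in angle brackets, using U+ notation."""
    return "<" + ", ".join(f"U+{ord(ch):04X}" for ch in text) + ">"


def utf8_code_units(text: str) -> str:
    """Space-separated UTF-8 bytes (8-bit code units) in angle brackets."""
    return "<" + " ".join(f"{byte:02X}" for byte in text.encode("utf-8")) + ">"


def utf16_code_units(text: str) -> str:
    """Space-separated 16-bit UTF-16 code units in angle brackets."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    return "<" + " ".join(f"{unit:04X}" for unit in units) + ">"


if __name__ == "__main__":
    sample = "a\u0300"                        # a letter plus a combining grave accent
    print(code_point_sequence(sample))
    print(utf8_code_units("\u0416"))          # a two-byte UTF-8 sequence
    print(utf16_code_units("\U00010302"))     # a supplementary character: two code units
```

A supplementary character such as U+10302 is a single code point but two UTF-16 code units, which is why the two notations are kept visually distinct.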


Phonemic transcriptions are shown between slashes, as in Khmer /khnyom/.

Phonetic transcriptions are shown between square brackets, using the International Phonetic Alphabet. (Full details on the IPA can be found on the International Phonetic Association's Web site.)

A leading asterisk is used to represent an incorrect or nonoccurring linguistic form.

The symbols used in the character names list are described at the beginning of Chapter 16, Code Charts.

In the text of this book, the word "Unicode" when used alone as a noun refers to the Unicode Standard.

Unambiguous dates of the current common era, such as 1999, are unlabeled. In cases of ambiguity, CE is used. Dates before the common era are labeled with BCE.

The term byte, as used in this standard, always refers to a unit of eight bits. This corresponds to the use of the term octet in some other standards.

Extended BNF

The Unicode Standard and technical reports use an extended BNF format for describing syntax. As different conventions are used for BNF, Table 0-1, Extended BNF, lists the notation used here.

In other environments, such as programming languages or mark-up, alternative notation for sequences of code points or code units may be used.

Character Classes. A code point class is a specification of an unordered set of code points. Whenever the code points are all assigned characters, it can also be referred to as a character class. The specification consists of any of the following:

  • A literal code point
  • A range of literal code points
  • A set of code points having a given Unicode character property value, as defined in the Unicode Character Database (see PropertyAliases.txt and PropertyValueAliases.txt)
  • Non-boolean properties given as an expression <property> = <property_value> or <property> ≠ <property_value>, such as "General_Category=Titlecase_Letter"
  • Boolean properties given as an expression <property> = true or <property> ≠ true, such as "Uppercase=true"
  • Combinations of logical operations on classes

Further extensions to this specification of character classes are used in some Unicode Standard Annexes and Unicode Technical Reports. Such extensions are described in those documents, as appropriate.

A partial formal BNF syntax for character classes as used in this standard is given by the following.

char_class := "[" char_class - char_class "]" // set difference
           := "[" item_list "]"
           := "[" property ("=" | "≠") property_value "]"
item_list  := item (","? item)?
item       := code_point // either literal or escaped
           := code_point - code_point // inclusive range

Whenever any character could be interpreted as a syntax character, it must be escaped. Where no ambiguity would result (with normal operator precedence), extra square brackets can be discarded. If a space character is used as a literal, it is escaped. Examples are found in Table 0-2, Character Class Examples.
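Because a character class is an unordered set of code points, its semantics can be modelled with ordinary set operations. The Python sketch below is a minimal illustration under stated assumptions: the helper names are hypothetical, general categories come from the standard library's `unicodedata` module (which reflects the interpreter's bundled Unicode version, not Version 4.0), and no parsing of the bracket syntax is attempted. It builds two classes that the standard uses as examples: English lowercase letters except for c, and all assigned Arabic characters.

```python
import unicodedata


def code_point_range(first: int, last: int) -> set[int]:
    """The set of code points in the inclusive range first..last."""
    return set(range(first, last + 1))


def with_general_category(prefix: str, universe: set[int]) -> set[int]:
    """Members of `universe` whose General_Category value starts with `prefix`.

    A one-letter prefix such as "L" selects a whole major category;
    a two-letter value such as "Cn" (unassigned) selects exactly one.
    """
    return {cp for cp in universe if unicodedata.category(chr(cp)).startswith(prefix)}


# English lowercase letters except for c: a range minus a literal (set difference).
lowercase_except_c = code_point_range(ord("a"), ord("z")) - {ord("c")}

# All assigned Arabic characters: the block range minus its unassigned code points.
arabic_block = code_point_range(0x0600, 0x06FF)
assigned_arabic = arabic_block - with_general_category("Cn", arabic_block)

if __name__ == "__main__":
    print("".join(chr(cp) for cp in sorted(lowercase_except_c)))
    print(len(assigned_arabic), "assigned code points in 0600..06FF")
```

The size of the second set depends on the Unicode version available to the interpreter, so no fixed count is given here.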

For more information about character classes, see Unicode Technical Report #18, "Unicode Regular Expression Guidelines."


Operators used in this standard are listed in Table 0-3, Operators.

0.4 Resources

The Unicode Consortium provides a number of online resources for obtaining information and data about the Unicode Standard, as well as updates and corrigenda. They are listed below.

  • Unicode Web Site
  • Unicode Anonymous FTP Site
  • Unicode Email Discussion List

Subscription instructions for the email discussion list are posted on the Unicode Web site.

Table 0-2. Character Class Examples

  • [a-z]-[c] English lowercase letters except for c
  • [0-9] European decimal digits
  • [\u0030-\u0039] (same as above, using Unicode escapes)
  • [0-9,A-F,a-f] hexadecimal digits
  • [{gc=letter},{gc=non-spacing_mark}] all letters and non-spacing marks
  • [{gc=L},{gc=Mn}] (same as above, using abbreviated notation)
  • [{gc≠unassigned}] all assigned Unicode characters
  • [\u0600-\u06FF]-[{gc=unassigned}] all assigned Arabic characters
  • [Alphabetic=t] characters with the Alphabetic property
  • [Line_Break≠Infix_Numeric] all code points that do not have the line break property of Infix_Numeric

How to Contact the Unicode Consortium

Contact the Unicode Consortium for membership information and to order publications (including additional copies of this book).

Postal address:
P.O. Box 391476
Mountain View, CA 94039-1476

Please check the Web site for up-to-date contact information, including telephone, fax, and courier delivery address.


Customer Reviews

  • Anonymous

    Posted September 21, 2003

    All the languages of man

    Anyone dealing with XML or Java soon runs into Unicode because this is the standard for representing characters in electronic form in those computer languages. Java, for instance, was designed from its inception to use Unicode. Earlier computer languages like C and C++ can have routines added to handle these, while C# uses XML and hence Unicode. But chances are, when you deal with Unicode, you only deal with a subset. Often only a small subset at that, unless you are using Chinese/Japanese. Typically you work with ASCII and the codes for your spoken language if that is not a Western European language. Very few of us deal with much more than this. Which illustrates the appeal of the book. The Big Picture. ALL of Unicode. The breadth is stunning. It shows the written form of every major spoken language and many minor ones. Has the pictograms for Chinese [of course]. But also the symbols for Khmer, Canadian Aboriginal, Tamil, Syriac, et cetera, et cetera. Thumbing through this, you may encounter languages that you did not even know existed. It is one thing to say that we live in a multilingual world. But it is another to actually see it expressed comprehensively at the most basic level.

    There are two audiences for this book. The first is any computer person who has to deal with issues of internationalisation. But another audience is every Department of Languages or Cultural Anthropology in a university. If this describes your background, then you should know that you do not need facility in computing to appreciate the significance of this book. You can use it as a standard reference, akin to the Oxford English Dictionary vis-a-vis the English language. Look, ignore the computer stuff in the text. Yes, you can do this. The book groups related languages into common chapters. The explanatory text is lucid and the graphics for the languages let you easily cross compare.

    Of course, at a higher level of meaning like sentences, you will need specialised texts in those languages. But to understand a language, you need to start at its letters or pictograms. Think of this book as an index into all the languages of man.
