CJKV Information Processing





First published a decade ago, CJKV Information Processing quickly became the unsurpassed source of information on processing text in Chinese, Japanese, Korean, and Vietnamese. It has now been thoroughly updated to provide web and application developers with the latest techniques and tools for disseminating information directly to audiences in East Asia. This second edition reflects the considerable impact that Unicode, XML, OpenType, and newer operating systems such as Windows XP, Vista, Mac OS X, and Linux have had on East Asian text processing in recent years.

Written by its original author, Ken Lunde, a Senior Computer Scientist in CJKV Type Development at Adobe Systems, this book will help you:

  • Learn about CJKV writing systems and scripts, and their transliteration methods
  • Explore trends and developments in character sets and encodings, particularly Unicode
  • Examine the world of typography, specifically how CJKV text is laid out on a page
  • Learn information-processing techniques, such as code conversion algorithms and how to apply them using different programming languages
  • Process CJKV text using different platforms, text editors, and word processors
  • Become more informed about CJKV dictionaries, dictionary software, and machine translation software and services
  • Manage CJKV content and presentation when publishing in print or for the Web

Internationalizing and localizing applications is paramount in today's global market — especially for audiences in East Asia, the fastest-growing segment of the computing world. CJKV Information Processing will help you understand how to develop web and other applications effectively in a field that many find difficult to master.


Product Details

  • ISBN-13: 9780596514471
  • Publisher: O'Reilly Media, Incorporated
  • Publication date: 12/30/2008
  • Edition description: Second Edition
  • Edition number: 2
  • Pages: 912
  • Sales rank: 1,133,345
  • Product dimensions: 7.00 (w) x 9.10 (h) x 2.10 (d)

Meet the Author

Ken Lunde was born in 1965 in Madison, Wisconsin, grew up in Mount Horeb, Wisconsin, and entered the University of Wisconsin-Madison in 1985 as a freshman. He graduated with a Bachelor of Arts degree in linguistics in 1987. He received his Master of Arts degree in linguistics in 1988. He finally received his Doctor of Philosophy degree in linguistics in 1994, and his dissertation was entitled "Prescriptive Kanji Simplification." He joined Adobe Systems Incorporated in 1991, and is currently Project Manager, CJK Type Development.


Read an Excerpt

Chapter 9: Information Processing Techniques


Perl

Usually described as a scripting language, Perl, developed by Larry Wall, is much, much more than that. Perl's main strengths include rapid development, regular expressions (described later in this chapter), and hashes (associative arrays). It is not so much these individual features that provide Perl with extraordinary text-manipulation capabilities, but rather how these features are intertwined with one another. Other programming languages offer similar features, but there is often no convenient way for them to function together. In Perl, for example, a regular expression can be used to parse text, and at the same time used to store the resulting items into a hash for subsequent lookup.
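The combination described above, parsing with a regular expression and storing the captured items in a hash in the same pass, can be sketched in Python, where a dict plays the role of a Perl hash. The sample data, the pattern, and the name `build_lookup` are assumptions made for illustration; this is not code from the book.

```python
import re

# Hypothetical sample data: one "reading<TAB>word" pair per line.
SAMPLE = "nihon\t日本\nkanji\t漢字\nkana\t仮名\n"

# One pattern both validates a line and captures its two fields.
ENTRY = re.compile(r"^(\w+)\t(\S+)$", re.MULTILINE)


def build_lookup(text):
    """Parse every matching line and store reading -> word in a dict."""
    return {m.group(1): m.group(2) for m in ENTRY.finditer(text)}


lookup = build_lookup(SAMPLE)
print(lookup["kanji"])
```

Lines that do not match the pattern are skipped silently, which may or may not be what a real program wants.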

Perl is the programming language of choice for those who write CGI programs or do other web-related programming (a topic that is discussed at the end of Chapter 13, The World Wide Web), because it is well suited for the task.

Although the current incarnation of Perl has no built-in support for internationalization (to the level that Java currently has), it is something that is being discussed by its developers. There are, however, clever ways to use Perl for handling multiple-byte data, most of which make use of regular expression tricks and techniques. The Perl code examples provided in Appendix W should be studied by any serious Perl programmer. Gisle Aas and Martin Schwartz have been diligently working on some extremely useful Unicode modules for Perl (such as Unicode::String, Unicode::Map8, and Unicode::Map), so you can expect some useful and interesting things to happen in the future. The Unicode::Map module by Martin Schwartz, in particular, already supports code conversion between Unicode and a number of legacy CJKV encodings.
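As a rough illustration of the regular expression approach to multiple-byte data, the sketch below splits an EUC-JP byte string into whole characters, so that a two-byte character is never cut in half. It is written in Python rather than Perl and is not taken from Appendix W. The byte ranges are the standard EUC-JP code set ranges, and the names `EUC_JP_CHAR` and `split_euc_jp` are hypothetical.

```python
import re

# One alternative per EUC-JP code set; order matters for the prefixed forms.
EUC_JP_CHAR = re.compile(
    rb"[\x00-\x7f]"                    # ASCII / JIS-Roman, one byte
    rb"|\x8e[\xa1-\xdf]"               # half-width katakana, two bytes
    rb"|\x8f[\xa1-\xfe][\xa1-\xfe]"    # JIS X 0212, three bytes
    rb"|[\xa1-\xfe][\xa1-\xfe]"        # JIS X 0208, two bytes
)


def split_euc_jp(data):
    """Return a list of byte strings, one per EUC-JP character."""
    chars = []
    pos = 0
    while pos < len(data):
        m = EUC_JP_CHAR.match(data, pos)
        if m is None:
            raise ValueError(f"invalid EUC-JP byte at offset {pos}")
        chars.append(m.group())
        pos = m.end()
    return chars


data = "CJKV 漢字".encode("euc_jp")
print(len(data), len(split_euc_jp(data)))
```

The same pattern can be reused for counting characters or truncating a string at a character boundary.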

Kazumasa Utashiro has developed a useful Japanese-enabling Perl library called jcode.pl, which includes Japanese code conversion routines. Some may find the Japanese version of Perl, called JPerl, to be useful, although I suggest using programming techniques that avoid JPerl for optimal portability. JPerl adds Japanese support to the following features: regular expressions, formats, some built-in functions (chop and split), and the tr/// operator. The definitive guide to Perl is Programming Perl, Second Edition, by Larry Wall et al. (O'Reilly & Associates, 1996). Tom Christiansen and Nathan Torkington's Perl Cookbook (O'Reilly & Associates, 1998) is also highly recommended as a companion volume to Programming Perl. The comp.lang.perl.misc newsgroup should also be of interest. The best place to find Perl is at CPAN (Comprehensive Perl Archive Network).


Python

Like Perl, Python is also sometimes described as a scripting language. Python was developed by Guido van Rossum, and is a high-level programming language that provides valuable programming features such as hashes and regular expressions.

An excellent guide to Python is Mark Lutz's Programming Python (O'Reilly & Associates, 1996). The comp.lang.python newsgroup should also be of interest if you want to learn about recent Python developments and join discussions. There is also a Python web site from which Python itself is available.


Tcl

Tcl, which stands for Tool Command Language, is a programming language that was originally developed by John Ousterhout while a professor at UC Berkeley. Like Perl and Python, Tcl is considered a high-level scripting language that provides built-in facilities for hashes and regular expressions. John later founded Scriptics Corporation, where Tcl is now being advanced.

Some important milestones in Tcl's history include its byte-code compiler, introduced for Version 8.0, and support for Unicode (in the form of UTF-8 encoding), which began with Version 8.1. Tcl will also have a regex package comparable to Perl's by the time you read this. Before Version 8.0, the lack of a byte-code compiler had always kept Tcl slower than Perl.

Tcl is rarely used alone, but rather with its GUI (Graphical User Interface) component called Tk (standing for Tool Kit).

Other Programming Environments

While it is possible to write multiple-byte-enabled programs using all of the programming languages mentioned above, there are some programming environments that have done all this work for you, meaning that you need not worry about multiple-byte enabling your own source code because you depend on a module to do it for you. This may not sound terribly exciting for companies with sufficient resources and multiple-byte expertise, but may be a savior for smaller companies with limited resources.

One example of such a programming environment is Visix's Galaxy Global, a multilingual product based on their Galaxy product. (Visix Software has since gone out of business.)

Perhaps of greater interest is Basis Technology's "Rosette: C++ Library for Unicode," which is a compact, general-purpose Unicode-based source code library. Embedded into an application, this library adds Unicode text processing capabilities that are robust and efficient across a variety of platforms (MacOS, Unix, Windows, and so on). Its functions adhere to the latest Unicode specifications. Major functions include code conversion between major legacy encodings and Unicode encodings, character classification (identification of a character), and character property conversion (such as half- to full-width katakana conversion). Basis Technology also offers a general-purpose code conversion utility, called "Uniconv," built using this library. Also of interest are UniScape's Global C and Global Checker packages, Sybase's Unilib, and Alis Technologies' Batam (their own Tango web browser is an example of this library's usage in a real product).
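To make "character property conversion" concrete, here is a minimal sketch of half- to full-width katakana conversion. It does not use any of the libraries named above. It relies on Python's standard `unicodedata` module and on the assumption that applying NFKC normalization only to runs in the half-width katakana range (U+FF61 to U+FF9F) is an acceptable stand-in. The function name `to_fullwidth_katakana` is hypothetical.

```python
import re
import unicodedata

# Half-width katakana block, including half-width punctuation and the
# half-width voiced and semi-voiced sound marks.
HALFWIDTH_KATAKANA = re.compile("[\uff61-\uff9f]+")


def to_fullwidth_katakana(text):
    """Convert half-width katakana runs to full-width, leaving other text alone."""
    return HALFWIDTH_KATAKANA.sub(
        lambda m: unicodedata.normalize("NFKC", m.group()), text
    )


print(to_fullwidth_katakana("ｶﾀｶﾅ"))
```

Restricting normalization to the matched runs keeps full-width Latin letters and other compatibility characters untouched. NFKC applied to the whole string would change those as well.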

Code Conversion Algorithms

It is very important to understand that only the encoding methods for the national character sets are mutually compatible, and work quite well for round-trip conversions. The vendor-defined character sets often include characters that do not map to anything meaningful in the national character set standards. When dealing with the Japanese ISO-2022-JP, Shift-JIS, and EUC-JP encodings, for example, algorithms are used to perform code conversion. This involves mathematical operations that are applied equally to every character represented under an encoding method, and is known as algorithmic conversion....
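The arithmetic behind algorithmic conversion can be sketched for a single two-byte JIS X 0208 character. This is a minimal Python sketch of the widely published conversion rules, not the book's own code. It assumes the input is a valid two-byte code with both bytes in the range 0x21 to 0x7E, it ignores ISO-2022-JP escape sequences entirely, and the function names are hypothetical.

```python
def jis_to_euc(j1, j2):
    """7-bit JIS X 0208 code -> EUC-JP: set the high bit of both bytes."""
    return j1 | 0x80, j2 | 0x80


def euc_to_jis(e1, e2):
    """EUC-JP -> 7-bit JIS X 0208 code: clear the high bit of both bytes."""
    return e1 & 0x7F, e2 & 0x7F


def jis_to_sjis(j1, j2):
    """7-bit JIS X 0208 code -> Shift-JIS, using row and cell offsets."""
    if j1 % 2:                       # odd row: lower half of a Shift-JIS row
        s2 = j2 + (0x20 if j2 >= 0x60 else 0x1F)
    else:                            # even row: upper half of a Shift-JIS row
        s2 = j2 + 0x7E
    s1 = ((j1 + 1) >> 1) + (0x70 if j1 < 0x5F else 0xB0)
    return s1, s2


# Cross-check against Python's own codecs for two sample characters.
for ch in "漢字":
    e1, e2 = ch.encode("euc_jp")
    sjis = bytes(jis_to_sjis(*euc_to_jis(e1, e2)))
    print(ch, sjis.hex(), sjis == ch.encode("shift_jis"))
```

The same operations apply to every character in the set, which is why no lookup table is needed. Conversion to or from Unicode, by contrast, requires mapping tables.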


Table of Contents

1. CJKV Information Processing Overview
2. Writing Systems
3. Character Set Standards
4. Encoding Methods
5. Input Methods
6. Font Formats
7. Typography
8. Output Methods
9. Information Processing Techniques
10. Operating Systems, Text Editors, and Word Processors
11. Dictionaries and Dictionary Software
12. The Internet
13. The World Wide Web
A. Code Conversion Tables
B. Notation Conversion Table
C. Vendor Character Set Standards
D. Vendor Encoding Methods
E. GB 2312-80 Table
F. GB/T 12345-90 Table
G. CNS 11643-1992 Table
H. Big Five Table
I. Hong Kong GCCS Table
J. JIS X 0208:1997 Table
K. JIS X 0212-1990 Table
L. KS X 1001:1992 Table
M. KS X 1002:1991 Hanja Table
N. Hangul Reading Table
O. TCVN 6056:1995 Table
P. Code Table Indexes
Q. Character Lists and Mapping Tables
R. Chinese Character Lists
S. Single-Byte Code Tables
T. Software and Document Sources
U. Mailing Lists
V. Professional Organizations
W. Perl Code Examples
X. Glossary