CJKV Information Processing

by Ken Lunde

First published a decade ago, CJKV Information Processing quickly became the unsurpassed source of information on processing text in Chinese, Japanese, Korean, and Vietnamese. It has now been thoroughly updated to provide web and application developers with the latest techniques and tools for disseminating information directly to audiences in East Asia. This second edition reflects the considerable impact that Unicode, XML, OpenType, and newer operating systems such as Windows XP, Vista, Mac OS X, and Linux have had on East Asian text processing in recent years.

Written by its original author, Ken Lunde, a Senior Computer Scientist in CJKV Type Development at Adobe Systems, this book will help you:

  • Learn about CJKV writing systems and scripts, and their transliteration methods
  • Explore trends and developments in character sets and encodings, particularly Unicode
  • Examine the world of typography, specifically how CJKV text is laid out on a page
  • Learn information-processing techniques, such as code conversion algorithms and how to apply them using different programming languages
  • Process CJKV text using different platforms, text editors, and word processors
  • Become more informed about CJKV dictionaries, dictionary software, and machine translation software and services
  • Manage CJKV content and presentation when publishing in print or for the Web

Internationalizing and localizing applications is paramount in today's global market — especially for audiences in East Asia, the fastest-growing segment of the computing world. CJKV Information Processing will help you understand how to develop web and other applications effectively in a field that many find difficult to master.

Product Details

Publisher: O'Reilly Media, Incorporated
Edition description: Second Edition
Product dimensions: 7.00(w) x 9.10(h) x 2.10(d)

Read an Excerpt

Chapter 9: Information Processing Techniques

Perl

Usually described as a scripting language, Perl, developed by Larry Wall, is much, much more than that. Perl's main strengths include rapid development, regular expressions (described later in this chapter), and hashes (associative arrays). It is not so much these individual features that provide Perl with extraordinary text-manipulation capabilities, but rather how these features are intertwined with one another. Other programming languages offer similar features, but there is often no convenient way for them to function together. In Perl, for example, a regular expression can be used to parse text, and at the same time used to store the resulting items into a hash for subsequent lookup.
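The pairing of a regular expression with a hash can be sketched briefly. The sketch below uses Python rather than Perl, with a Python dict standing in for the hash; the sample text, the "key = value" line format, and the name parse_pairs are all hypothetical and serve only to illustrate the idea.

```python
import re

# Hypothetical sample input: one "key = value" pair per line.
SAMPLE = """\
name = Ken
lang = Perl
topic = CJKV
"""

# A single pattern both recognizes a line and captures its two fields.
PAIR = re.compile(r"^[ \t]*(\w+)[ \t]*=[ \t]*(.*?)[ \t]*$", re.MULTILINE)


def parse_pairs(text):
    """Parse text with the regex and store the captures in a dict (hash)."""
    return {m.group(1): m.group(2) for m in PAIR.finditer(text)}


table = parse_pairs(SAMPLE)
print(table["lang"])  # Perl
```

The parse and the store happen in one pass: each match supplies a key and a value, and the resulting table is ready for lookup as soon as the scan ends.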

Perl is the programming language of choice for those who write CGI programs or do other web-related programming (a topic that is discussed at the end of Chapter 13, The World Wide Web), because it is well suited for the task.

Although the current incarnation of Perl has no built-in support for internationalization (to the level that Java currently has), it is something that is being discussed by its developers. There are, however, clever ways to use Perl for handling multiple-byte data, most of which make use of regular expression tricks and techniques. The Perl code examples provided in Appendix W should be studied by any serious Perl programmer. Gisle Aas and Martin Schwartz have been diligently working on some extremely useful Unicode modules for Perl (such as Unicode::String, Unicode::Map8, and Unicode::Map), so you can expect some useful and interesting things to happen in the future. The Unicode::Map module by Martin Schwartz, in particular, already supports code conversion between Unicode and a number of legacy CJKV encodings.
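One commonly used regular expression technique for multiple-byte data is a pattern that matches exactly one encoded character, so that a byte string can be walked character by character instead of byte by byte. The following is a minimal sketch of that idea for EUC-JP, written in Python rather than Perl. It is not taken from the appendix; the byte ranges follow the standard EUC-JP code set layout, and the names EUC_JP_CHAR and split_euc_jp are hypothetical.

```python
import re

# One alternative per EUC-JP code set, operating on raw bytes.
EUC_JP_CHAR = re.compile(
    rb"[\x00-\x7F]"                  # ASCII/JIS-Roman: one byte
    rb"|\x8E[\xA1-\xDF]"             # half-width katakana: SS2 + one byte
    rb"|\x8F[\xA1-\xFE][\xA1-\xFE]"  # JIS X 0212: SS3 + two bytes
    rb"|[\xA1-\xFE][\xA1-\xFE]"      # JIS X 0208: two bytes
)


def split_euc_jp(data):
    """Return the encoded characters of an EUC-JP byte string as a list."""
    chars = EUC_JP_CHAR.findall(data)
    # findall silently skips bytes that match no alternative, so compare
    # the total matched length with the input length to detect bad input.
    if sum(len(c) for c in chars) != len(data):
        raise ValueError("input is not well-formed EUC-JP")
    return chars


encoded = "日本語 abc".encode("euc_jp")
print(len(encoded), len(split_euc_jp(encoded)))  # 10 7
```

Once the text is split this way, operations such as counting, truncating, or substituting characters cannot accidentally cut a multiple-byte character in half.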

Kazumasa Utashiro has developed a useful Japanese-enabling Perl library called jcode.pl, which includes Japanese code conversion routines. Some may find the Japanese version of Perl, called JPerl, to be useful, although I suggest using programming techniques that avoid JPerl for optimal portability. JPerl adds Japanese support to the following features: regular expressions, formats, some built-in functions (chop and split), and the tr/// operator. The definitive guide to Perl is Programming Perl, Second Edition, by Larry Wall et al. (O'Reilly & Associates, 1996). Tom Christiansen and Nathan Torkington's Perl Cookbook (O'Reilly & Associates, 1998) is also highly recommended as a companion volume to Programming Perl. The comp.lang.perl.misc newsgroup should also be of interest. The best place to find Perl is at CPAN (Comprehensive Perl Archive Network).

Python

Like Perl, Python is also sometimes described as a scripting language. Python was developed by Guido van Rossum, and is a high-level programming language that provides valuable programming features such as hashes and regular expressions.

An excellent guide to Python is Mark Lutz's Programming Python (O'Reilly & Associates, 1996). The comp.lang.python newsgroup should also be of interest if you want to learn about recent Python developments and join discussions. There is also a Python web site from which Python itself is available.

Tcl

Tcl, which stands for Tool Command Language, is a programming language that was originally developed by John Ousterhout while he was a professor at UC Berkeley. Like Perl and Python, Tcl is considered a high-level scripting language that provides built-in facilities for hashes and regular expressions. Ousterhout later founded Scriptics Corporation, where Tcl is now being advanced.

Some important milestones in Tcl's history include its byte-code compiler, introduced for Version 8.0, and support for Unicode (in the form of UTF-8 encoding) that began with Version 8.1. Tcl will also have a regex package comparable to Perl's by the time you read this. Before Version 8.0, the lack of a byte-code compiler had always kept Tcl slower than Perl.

Tcl is rarely used alone, but rather with its GUI (Graphical User Interface) component called Tk (standing for Tool Kit).

Other Programming Environments

While it is possible to write multiple-byte-enabled programs using all of the programming languages mentioned above, some programming environments have done all this work for you. You need not worry about multiple-byte enabling your own source code, because you depend on a module to do it for you. This may not sound terribly exciting for companies with sufficient resources and multiple-byte expertise, but may be a savior for smaller companies with limited resources.

One example of such a programming environment is Visix's Galaxy Global, a multilingual product based on their Galaxy product. (Visix Software has since gone out of business.)

Perhaps of greater interest is Basis Technology's "Rosette: C++ Library for Unicode," which is a compact, general-purpose Unicode-based source code library. Embedded into an application, this library adds Unicode text processing capabilities that are robust and efficient across a variety of platforms (MacOS, Unix, Windows, and so on). Its functions adhere to the latest Unicode specifications. Major functions include code conversion between major legacy encodings and Unicode encodings, character classification (identification of a character), and character property conversion (such as half- to full-width katakana conversion). Basis Technology also offers a general-purpose code conversion utility, called "Uniconv," built using this library. Also of interest are UniScape's Global C and Global Checker packages, Sybase's Unilib, and Alis Technologies' Batam (their own Tango web browser is an example of this library's usage in a real product).

Code Conversion Algorithms

It is very important to understand that only the encoding methods for the national character sets are mutually compatible, and work quite well for round-trip conversions. The vendor-defined character sets often include characters that do not map to anything meaningful in the national character set standards. When dealing with the Japanese ISO-2022-JP, Shift-JIS, and EUC-JP encodings, for example, algorithms are used to perform code conversion. This involves mathematical operations that are applied equally to every character represented under an encoding method, and is known as algorithmic conversion....
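As a minimal sketch of what algorithmic conversion looks like in practice, the following Python code converts EUC-JP to Shift-JIS using arithmetic alone, with no mapping table. It relies on the well-known relationship between the two encodings for JIS X 0208 characters: EUC-JP sets the high bit of each 7-bit JIS byte, and Shift-JIS folds two JIS rows into one lead byte. The function names are hypothetical, the input is assumed to be well-formed, and JIS X 0212 characters (which Shift-JIS cannot represent) are rejected.

```python
def euc_to_jis(b1, b2):
    """EUC-JP two-byte character -> 7-bit JIS bytes (clear the high bits)."""
    return b1 - 0x80, b2 - 0x80


def jis_to_sjis(j1, j2):
    """7-bit JIS X 0208 bytes -> Shift-JIS bytes, by arithmetic only."""
    row_offset = 112 if j1 < 95 else 176
    if j1 % 2:  # odd JIS row: lower half of the Shift-JIS second-byte range
        cell_offset = 32 if j2 > 95 else 31  # the range skips 0x7F
    else:       # even JIS row: upper half
        cell_offset = 126
    return ((j1 + 1) >> 1) + row_offset, j2 + cell_offset


def euc_jp_to_shift_jis(data):
    """Convert a well-formed EUC-JP byte string to Shift-JIS."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                  # ASCII/JIS-Roman passes through
            out.append(b)
            i += 1
        elif b == 0x8E:               # half-width katakana: drop the prefix
            out.append(data[i + 1])
            i += 2
        elif 0xA1 <= b <= 0xFE:       # JIS X 0208 two-byte character
            out.extend(jis_to_sjis(*euc_to_jis(b, data[i + 1])))
            i += 2
        else:                         # 0x8F (JIS X 0212) or a stray byte
            raise ValueError("byte 0x%02X has no Shift-JIS equivalent" % b)
    return bytes(out)


print(euc_jp_to_shift_jis(b"\xa4\xa2").hex())  # 82a0
```

Every two-byte character goes through the same few additions and shifts, which is what distinguishes this approach from conversions that must consult a lookup table for each character.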

Meet the Author

Ken Lunde was born in 1965 in Madison, Wisconsin, grew up in Mount Horeb, Wisconsin, and entered the University of Wisconsin-Madison in 1985 as a freshman. He graduated with a Bachelor of Arts degree in linguistics in 1987. He received his Master of Arts degree in linguistics in 1988. He finally received his Doctor of Philosophy degree in linguistics in 1994, and his dissertation was entitled "Prescriptive Kanji Simplification." He joined Adobe Systems Incorporated in 1991, and is currently Project Manager, CJK Type Development.
