Unicode CJKV character set rationalization

You've come to this page because you've propounded an incorrect definition of Han unification akin to the following (taken from Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard, ISBN 0201700522):

Han unification

The process of collecting Chinese characters from a variety of disparate sources and weeding out the characters that are really duplicates. The duplicate characters are encoded once in Unicode, not once for every source encoding in which they appear. The process of Han unification involves keeping careful track of just which characters from which sources are unified.

This is the Frequently Given Answer to such erroneous definitions.

Qin-Han unification is something that occurred in 221 BCE. The process described above is not Han unification.

For starters, the name "Han" is misleading. The word for "characters" comprises two ideographs, the first of which is the ideograph for the Han dynasty, the second dynasty in Imperial China and the second dynasty to use governmental force to compel the adoption of a standardized writing system in China. However, the word is pronounced differently in Chinese, Japanese, Korean, and Vietnamese, and transliterated differently therefrom into English. The Chinese pronunciation is "hanzi"; the Japanese pronunciation is "kanji"; the Korean pronunciation is "hanja"; and the Vietnamese term is "Chu Han".

A more correct name for the process would be Unicode CJKV character set rationalization, because that is what it actually is. Originally, when the ISO 10646 standard was first being drafted, the Chinese hanzi, Japanese kanji, Korean hanja, and Vietnamese Chu Han character sets were each assigned separate sets of code points, resulting in an exceedingly large character repertoire of several hundred thousand characters. At the same time, however, the Unicode consortium had decided upon a scheme for rationalizing these character sets by merging them into one, reducing the size of the repertoire to the tens of thousands. The people doing the ISO standardization, unhappy at the size of the originally proposed repertoire, coöpted the Unicode rationalization scheme. (National standards bodies balked at the originally intended ISO scheme of a 32-bit character code range aimed at including every past and present writing system, and the effort was instead merged with the then 16-bit Unicode scheme whose goals, discussed further on, were much more limited in scope. With 3 decades' hindsight, it seems that we've come back to the original ISO scheme, unfortunately by way of UTF-16, WTF-16, surrogate pairs, and a wacky 21-bit limit because of them; all because 16 bits was enough for anybody in the view of the Unicode people in 1988.)
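The arithmetic behind that 21-bit limit can be sketched in a few lines of Python. This is only an illustrative sketch: the helper name `to_surrogates` and the constant names are invented for the example, and U+20000 is used simply because it is the first code point of CJK Unified Ideographs Extension B, which lies outside the original 16-bit range.

```python
# Sketch of UTF-16 surrogate-pair arithmetic; all names here are hypothetical.
HIGH_BASE, LOW_BASE = 0xD800, 0xDC00   # starts of the two surrogate ranges
PER_RANGE = 0x400                      # 1,024 code units in each range


def to_surrogates(code_point):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not representable as a surrogate pair")
    offset = code_point - 0x10000      # a 20-bit value
    return HIGH_BASE + (offset >> 10), LOW_BASE + (offset & 0x3FF)


# 1,024 x 1,024 pairs on top of the original 65,536 16-bit values.
TOTAL_CODE_POINTS = 0x10000 + PER_RANGE * PER_RANGE
MAX_CODE_POINT = TOTAL_CODE_POINTS - 1

print(hex(MAX_CODE_POINT), MAX_CODE_POINT.bit_length())   # 0x10ffff 21
print([hex(unit) for unit in to_surrogates(0x20000)])     # ['0xd840', '0xdc00']
```

The ceiling of U+10FFFF is thus an artefact of how many pairs two 1,024-entry surrogate ranges can form, not of any count of the world's characters.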

The idea behind the Unicode CJKV character set rationalization was one whose apparent simplicity was appealing: Most of the various national character sets concerned were the result of borrowing from the standard Chinese writing systems, first promulgated in the Qin-Han unification. Therefore a simple way to rationalize these character sets is to trace the roots of the modern characters back to the 3rd century BCE characters, and merge the modern characters with common roots as "duplicates". After all, a similar process had effectively worked for European character sets, with the national character sets all being treated as extensions to a common Latin base character set.

One problem with this idea is that such a process hadn't worked for European character sets. The various "Latin-N" character sets of ISO 8859 do not comprise the Latin alphabet with additions for individual language augmentations. If they did, they wouldn't have the letters J, U, and W in the core alphabet. In fact, the so-called "Latin" character sets are actually based upon the Modern English alphabet (as a consequence of their being extensions to ASCII, of course). Rationalization of European character sets hasn't involved rolling the clock back 2,200 years and merging duplicates in a "Latin Unification" process. Merger has been based upon treating everything that is the same as Modern English as a duplicate.
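The claim that the "Latin-N" sets are extensions to ASCII is easy to check mechanically. The following Python sketch, which assumes only the ISO 8859 codecs that ship with the standard library, decodes the 128 ASCII byte values with each of the ten "Latin" parts and compares the results; the function and variable names are made up for the illustration.

```python
# Check that the ISO 8859 "Latin-N" parts agree with ASCII on bytes 0x00-0x7F.
ascii_bytes = bytes(range(128))
ascii_text = ascii_bytes.decode("ascii")

# ISO 8859 part numbers for Latin-1 through Latin-10.
LATIN_PARTS = (1, 2, 3, 4, 9, 10, 13, 14, 15, 16)


def lower_half_is_ascii(part):
    """True if the given ISO 8859 part decodes the ASCII range as ASCII does."""
    return ascii_bytes.decode(f"iso-8859-{part}") == ascii_text


results = {part: lower_half_is_ascii(part) for part in LATIN_PARTS}
print(results)

# J, U, and W sit in that shared lower half, not in any national extension.
print(all(letter in ascii_text for letter in "JUW"))
```

Every part shares the same lower half, Modern English letters included; only the upper 128 positions differ from one "Latin" set to the next.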

This leads to two of the reasons that people have objected to the rationalization of the CJKV character sets in Unicode. Some object on the grounds that such a rationalization ignores real semantic differences that have evolved among the various character sets over the centuries. Some object on the grounds that, as with the unification of European character sets being centred around Modern English even though it is presented as "Latin", the Unicode CJKV character set rationalization isn't actually centred around 2,200-year-old standards at all, but rather around modern Chinese systems — something that is not necessarily to the taste of the Japanese, the Vietnamese, and the Koreans.

The latter touches upon another reason that this rationalization process is not Han unification: the process very much is centred around modern systems, albeit not necessarily with the Sinocentric bias that some accuse it of having. The Unicode CJKV character set rationalization process has been erroneously described as the process of diversification from and variation upon the standard writing systems of the Qin-Han unification, which happened over hundreds of years, "seen in a mirror". But this is untrue. The actual goal was to rationalize a suite of 20th century character sets. As Joseph D. Becker explained in a 1988 paper outlining (in an echo of the old "640KiB is enough" folktale) why 16 bits would be enough, Unicode, unlike the original ISO 10646, was originally intended to cover only 20th century writing needs ("e.g. in the union of all newspapers and magazines printed in the world in 1988"). This, combined with the Unicode design principle of "convertibility", requiring that any "duplicate" characters in any of those 20th century character sets remain duplicate characters in Unicode, meant that the rationalization process could not in fact be historical development simply run backwards.

Another objection to this rationalization in Unicode, as Benjamin Peterson explains, is that it has sometimes been inexpertly performed by those who weren't actually users of the character sets concerned, necessitating subsequent revisions as actual use discovers errors and inconsistencies.

Additionally, the rationalization has given the lie to the claim that (to quote chapter 6 of Java Internationalization, ISBN 0596000197) "dealing with unification is simply a matter of choosing a font that contains the glyphs appropriate for that country". As Peterson explains with examples, dealing with the Unicode CJKV rationalization sometimes requires not just specifying a font but specifying a national language as well. Unicode has not in practice eliminated the need for specifying what language one is using in order to specify which characters the character set denotes.
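A small Python sketch shows where the language information has to live. It uses U+76F4, a commonly cited unified ideograph whose conventional printed form differs between Chinese and Japanese typography; this is an example chosen for this illustration, not one taken from Peterson. The helper `tag_language` is hypothetical, although the HTML `lang` attribute and the language tags "ja" and "zh-Hans" that it emits are real.

```python
import unicodedata
from html import escape

# One unified code point; nothing in it records Chinese versus Japanese.
IDEOGRAPH = "\u76f4"


def tag_language(text, language_tag):
    """Wrap text in an HTML span carrying a language tag (hypothetical helper)."""
    return f'<span lang="{escape(language_tag)}">{escape(text)}</span>'


as_japanese = tag_language(IDEOGRAPH, "ja")
as_chinese = tag_language(IDEOGRAPH, "zh-Hans")

print(unicodedata.name(IDEOGRAPH))       # CJK UNIFIED IDEOGRAPH-76F4
print(IDEOGRAPH.encode("utf-8").hex())   # e79bb4, whatever the language
print(as_japanese)
print(as_chinese)
```

The encoded bytes are identical in both cases; the only thing that differs is the markup wrapped around them, which is exactly the out-of-band language specification that the font-only claim says should be unnecessary.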

It is, however, also not the case that Unicode CJKV character set rationalization has been created by Americans and opposed by Asians. Indeed, it has had many Asian supporters. In part, this is a result of the Qin-Han unification. As Planning Chinese Characters: Reaction, Evolution Or Revolution? (ISBN 1402080387) explains, the Qin-Han unification had a far-reaching effect, across the centuries, upon all subsequent governments of China. This in turn has led to East Asian people thinking about character sets in certain ways. Standardization of writing systems is seen by people as not only validly within the remit of government responsibility, but as a goal that it is desirable to achieve, because it is viewed as promoting unity and social stability. Rationalization of the CJKV character sets in Unicode is thus seen by some people as a continuance of work that has been on-going ever since the Qin-Han unification.

Books to read on this subject

Ken Lunde. "Chapter 3: International Character Sets". CJKV Information Processing. O'Reilly, 1999. ISBN 1565922247. — Lunde explains why Han Unification "is not an appropriate way to describe the process that took place to create this character set", explains the "source separation rule", and lays out the detailed workings of Unicode CJKV character set rationalization.
Jukka K. Korpela. "Chapter 4: The Structure of Unicode". Unicode Explained. O'Reilly, 2006. ISBN 059610121X. — Korpela explains why some people think that Unicode discriminates against CJKV characters and why the character set rationalization "has been regarded as an artificial and even barbaric method", giving an analogy to unifying the Latin, Cyrillic, and Greek alphabets, and also points out that the issue "is not, however, a case of East Asian peoples against the Western world".

© Copyright 2007,2020 Jonathan de Boyne Pollard. "Moral" rights asserted.
Permission is hereby granted to copy and to distribute this web page in its original, unmodified form as long as its last modification datestamp is preserved.