There is no such thing as a legacy character encoding.

The notion that there is a suite of character encodings that are "legacy encodings" was entirely made up on Wikipedia on 2005-07-17 by a person using the pseudonym "Plugwash". There is no such thing in Unicode. Neither the Unicode Standard nor any of the Technical Reports define any such concept. This was noted on the article's discussion page on the day of its creation, but, two and a half years later, the article persists and the notion that there's such a thing is blithely promulgated by Wikipedia and by the multiplicity of web sites that copy its content.

At one point, just over two years after the creation of the article, it was put up for deletion. One Wikipedia editor, in the ensuing discussion, pointed to how many hits Google Web gave for the two words as a phrase. This meaningless metric was used as a justification for retaining the article, alongside a definition of what a legacy encoding is, derived from a document written by Brian Carr and Karen Watts of Basis Technology Corporation and published by IBM.¹ No one presented any other supporting documentation saying what the characteristics of legacy encodings were, what qualified an encoding to be or not to be "legacy", and so forth.

That is because there is no such documentation to be had. The "definition" by Carr and Watts was in fact no more than their general statement that, when they talked about a legacy encoding, they meant "any character encoding that was in use prior to the advent of the Unicode standard". They were, in turn, basing this upon page 423 of Ken Lunde's CJKV Information Processing, which defines "legacy" as simply "non-Unicode", and nothing else.

That is all that a "legacy encoding" has ever been: a character encoding that isn't a Unicode character encoding. This is the case, for example, on page 167 of Java Internationalization by Andrew Deitsch and David Czarnecki, where "legacy encoding methods" are any non-Unicode encodings supported by the platform that Java is running upon. There is nothing else to say. The phrase is just a shorthand used by some people for "not a Unicode character encoding".

The idea propounded by "Plugwash", using Wikipedia, that there's a solid, well-defined, umbrella concept under which all non-Unicode character encodings fall, apart from the tautologous statement that they aren't Unicode, is complete rubbish. But Wikipedia editors nonetheless resist removing the fabrication, and Wikipedia continues to promulgate the idea not only that such a concept exists, but also that it is well defined and capable of being the subject of an encyclopaedia article.

"Never mind the inaccuracy of the encyclopædia. Count the Google hits!"

1. Brian Carr and Karen Watts (1999-09-01). Processing database information using Unicode, a case study. IBM developerWorks. Basis Technology Corporation.

© Copyright 2007 Jonathan de Boyne Pollard. "Moral" rights asserted.
Permission is hereby granted to copy and to distribute this web page in its original, unmodified form as long as its last modification datestamp is preserved.