Japanese input methods in nosh user-space virtual terminals

User-space virtual terminals are one of the terminal management features of the nosh package. The design comprises separate coöperating components that are plugged together, one of which is a so-called "front-end processor" for running input methods. Input methods are driven from data files common to MacOS, xcin, gcin (GitHub), jscin (GitHub), hime (GitHub), PIME (GitHub), OpenVanilla (GitHub), OkidoKey.app (GitHub), and Chinese Open Desktop (GitHub).

The screenshots in this article are rendered in HTML+CSS rather than as images. Some WWW browsers do not correctly handle character spacing of monospaced mixed CJKVL characters, and the alignments in some examples may appear off. Your WWW browser may also "greek" the text if you do not have a font capable of the relevant Japanese glyphs from the Basic Multilingual Plane.

The user interface

Display

The user interface display is a textual one, and is thus fairly minimal compared to the graphical user interfaces presented by other front-end processors. As usual, though, it comprises an edit area where the data to be sent are composed, above a status area where information about the available conversions and current mode are displayed. These are presented over the spot, i.e. on top of the cursor position of the underlying virtual terminal (subject to being constrained to not flow over the edge of the screen if possible).

＄　
　ｒ

The front-end processor is modal; it is switchable amongst up to six different input method data tables, one of which has three sub-modes. These are nominally Chinese 1, Chinese 2, Hiragana, Katakana, Hangeul, Rōmaji 1, Rōmaji 2, and Rōmaji 3; which are signalled by a character in the status area when the list of available conversions is empty:

＄　　汉	＄　　漢	＄　　あ	＄　　カ
＄　　한	＄　　ｒ	＄　　Ｒ	＄　　ｍ

Data to be sent are constructed by typing ASCII into the edit area, which is turned into a list of conversions as one types. Spaces affect conversion by limiting the possible sequences of convertible ASCII characters, thus potentially delimiting "words", and do not normally show up in converted output.

＄ｔｙｐｅＡＳＣＩＩｈｅｒｅ　
　ｔｙｐｅＡＳＣＩＩｈｅｒｅ ↓

Conversion only proceeds up to the cursor position, and one can move the cursor left and right along the entry field, and delete and insert characters to perform corrections for reconverting. Spaces appear in unconverted output, which also shows up against a darker background. When there are conversions available, the status area character changes from a conversion mode indicator to a single-headed or double-headed arrow indicating that one can scroll up and down through a conversion list.

＄ｔｙｐｅ　ＡＳＣＩＩ　ｈｅｒｅ　
　ｔｙｐｅ　　　　　　　　　　　 ↓

＄ｔｙｐｅｄ　ＡＳＣＩＩ　ｈｅｒｅ　
　ｔｙｐｅｄ　　　　　　　　　　　 ↓

＄ｔｙｐｅ　ＡＳＣＩＩ　ｈｅｒｅ　
　ｔｙｐｅ　　　　　　　　　　　 ↓

Unconverted output shows up using symbols defined by the input method data table in use, which are not necessarily ASCII. A Hangeul input method will make unconverted ASCII display as the equivalent Jamo. The "array" Chinese input methods make the unconverted ASCII display as the various array positions. Japanese data tables generally just use the same ASCII symbols, however, although conventionally (when using the nosh toolkit's data tables at least) they use lowercase in Hiragana mode and uppercase in Katakana mode.

＄쇼ㅔㄷ　ㅁㄴㅊㅑㅑ　ㅗㄷㄱㄷ　
　쇼ㅔㄷ　　　　　　　　　　　 ↓

＄立止　1-2-3⇣8⇡8⇡　6-3⇡4⇡3⇡　
　立止　　　　　　　　　　　 ↓

＄チーペ　ＡＳＣＩＩ　ＨＥＲＥ　
　チーペ　　　　　　　　　　　 ↓
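
As a rough illustration of how one data table can supply both the conversions and the symbols shown for unconverted input, here is a minimal Python sketch. It assumes the commonly seen .cin layout of %keyname and %chardef sections delimited by begin and end lines, and ignores every other directive. The sample table and the names parse_cin and display_unconverted are hypothetical, invented for this example; none of this is taken from the nosh source or from its shipped data tables.

```python
def parse_cin(text):
    """Parse a simplified .cin-style table into (keynames, chardefs)."""
    keynames = {}   # ASCII key -> symbol displayed for unconverted input
    chardefs = {}   # ASCII spelling -> list of candidate conversions
    section = None  # name of the section currently being read, if any
    for raw in text.splitlines():
        fields = raw.split(None, 1)
        if len(fields) != 2:
            continue  # blank or malformed line
        key, value = fields[0], fields[1].strip()
        if key in ("%keyname", "%chardef") and value in ("begin", "end"):
            section = key if value == "begin" else None
        elif section == "%keyname":
            keynames[key] = value
        elif section == "%chardef":
            chardefs.setdefault(key, []).append(value)
        # Anything outside the two sections (e.g. %ename) is ignored here.
    return keynames, chardefs


def display_unconverted(ascii_text, keynames):
    """Render unconverted ASCII with the table's own display symbols."""
    return "".join(keynames.get(ch, ch) for ch in ascii_text)


# Made-up sample table, loosely in the spirit of an uppercase Katakana table.
SAMPLE = """\
%ename Katakana-sketch
%keyname begin
k Ｋ
a Ａ
%keyname end
%chardef begin
ka カ
ka ヵ
a ア
%chardef end
"""

keynames, chardefs = parse_cin(SAMPLE)
print(display_unconverted("kak", keynames))
print(chardefs["ka"])
```

A Hangeul or "array" table would differ only in the symbols listed in its %keyname section; the lookup in display_unconverted stays the same.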

Accepting the conversion sends the contents of the entry field, exactly as on-screen (i.e. in its currently converted/unconverted state), to the underlying virtual terminal as plain Unicode character input events.

＄チーペ　ＡＳＣＩＩ　ＨＥＲＥ　
　チーペ　　　　　　　　　　　 ↓

＄チーペ　ＡＳＣＩＩ　ＨＥＲＥ　
　　　　　　　　　　　　　　　カ

Keyboard input

Typed input is indeed ASCII, not Rōmaji. It is limited to the ASCII character set and so cannot include macrons or circumflex accents, as various forms of Rōmaji can.

The input method is at a layer below realizers, and so the entered input seen is whatever realizers send after doing their keyboard map processing. Thus changing between (say) QWERTY, AZERTY, Dvorak, Malt, and other keyboard layouts will affect what physical keys one has to press to spell the same romanization. (This is more important for "array-based" Chinese input methods than Japanese input methods that are always spelling-based, because changing the keyboard layout affects how the array elements are positioned on the keyboard; and it is usually required to layer a QWERTY keyboard layout on top of Chinese input methods, since the data files for the array input method assume that.) The spellings themselves remain the same; "sake" is always S · A · K · E whatever the keyboard layout is and whatever keys have those letters.

Input methods are better driven from at least a 105-key Windows keyboard; and of course a JIS 109-key Windows keyboard has dedicated keys for input method functions. These functions are not quite as the keys themselves are engraved. In part, this is because the meanings of the keys have changed over time anyway. (The 漢字 key, for example, does not despite the name switch to Kanji in modern input methods, but is nowadays an on-off switch for the input method.) In part, it is because some of the keys do not match the model with which the input method operates. (The 変換 and 無変換 keys, for example, do not match the input method's mechanism of mapping directly from ASCII to Kanji in Kanji mode, without going through an intermediate kana stage.)

For compatibility, several functions are also available (but, unlike the dedicated keys, only when the input method is switched on) as control key chords and as function keys. The control key chords are compatible with the DEC IMLIB. The function keys are compatible with Microsoft's Japanese IM editor from Windows NT 4. (Note that Microsoft's doco erroneously has F8 where it should have F10.) For a complete list of keys, see the console-input-method manual.

Some keyboard functions (assuming the default keyboard map for a 109-key keyboard)
JIS 109-key Windows keyboard	Control chord	Function key	Function
漢字			Switch the input method on and off.
変換	⎈ Control+Z		Switch to Chinese1 (i.e. Kanji) conversion mode, cycling between the two Chinese conversion modes if already in Chinese conversion mode. (Usually for Japanese, Chinese2 conversion mode is an empty input method.)
無変換			No function.
ひらがなカタカナローマ字			Unmodified (i.e. Level 1 Shift): switch to the Hiragana input method. With ⇧ Level 2 Shift: switch to the Katakana input method. With ⇮ Level 3 Shift: switch to the Rōmaji input method, cycling through the three sub-modes if already in the Rōmaji input method.
半角全角			No function.
	⎈ Control+L	F6	Switch to the Hiragana input method.
	⎈ Control+K	F7 or F8	Switch to the Katakana input method.
	⎈ Control+R	F9 or F10	Switch to the Rōmaji input method, cycling through the three sub-modes if already in the Rōmaji input method.

A superuser, or a privileged user granted access to the internals of user-space virtual terminals, can generate these events with the console-input-method-control command.
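
The mode switching described in the table above can be modelled as a small state machine. The following Python sketch is only a model of the documented behaviour, not the console-input-method implementation. The class and method names are hypothetical. The pairing of status characters with modes assumes that they are listed in the same order as the mode names earlier in this article. The initial mode, and the choice of the first sub-mode when entering Rōmaji from another mode, are likewise assumptions.

```python
from dataclasses import dataclass

# Status-area indicator per (mode, sub-mode); the pairing is assumed.
INDICATORS = {
    ("chinese", 0): "汉", ("chinese", 1): "漢",
    ("hiragana", 0): "あ", ("katakana", 0): "カ",
    ("hangeul", 0): "한",
    ("romaji", 0): "ｒ", ("romaji", 1): "Ｒ", ("romaji", 2): "ｍ",
}
# Modes that cycle when selected again; every other mode has one state.
CYCLE_LENGTH = {"chinese": 2, "romaji": 3}


@dataclass
class FrontEndModel:
    enabled: bool = False      # toggled by the 漢字 key
    mode: str = "hiragana"     # assumed starting mode
    sub: int = 0

    def toggle(self):
        """The 漢字 key: switch the input method on or off."""
        self.enabled = not self.enabled

    def select(self, mode, dedicated_key=False):
        """Switch mode, cycling sub-modes if already in that mode.

        Control chords and function keys act only while the input
        method is on; dedicated JIS keys act regardless.
        """
        if not (self.enabled or dedicated_key):
            return
        if mode == self.mode:
            self.sub = (self.sub + 1) % CYCLE_LENGTH.get(mode, 1)
        else:
            self.mode, self.sub = mode, 0

    def indicator(self):
        return INDICATORS[(self.mode, self.sub)]


fe = FrontEndModel()
fe.select("katakana")          # e.g. Control+K while off: ignored
fe.toggle()                    # 漢字
fe.select("chinese")           # e.g. Control+Z
print(fe.indicator())
fe.select("chinese")           # Control+Z again cycles to Chinese 2
print(fe.indicator())
```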

Data files

Chinese Open Desktop provides three input methods for Japanese language input: Hiragana (hiragana.cin), Katakana (katakana.cin), and Kanji (nippon.cin). These are fairly unsubtle: they provide only Kunrei-shiki, the so-called "Official System" of romanization (as opposed to the "Japanese" and "Standard" Systems); always present unconverted ASCII in lowercase; and require one to explicitly type the characters for sokuon and chōonpu.

The nosh toolset intentionally does not come with its own library of input methods, as there is already a big mess of libraries that are not in synch with one another. However, it does come with replacement Hiragana and Katakana data files, which are installed in /usr/local/etc/cin-data-tables/, augmenting what is currently available elsewhere. There is more detail in the nosh Guide and in commentary in the data table files themselves; but basically put: the replacement Hiragana data file presents unconverted ASCII in lower case and supports Kunrei-shiki and Nihon-shiki; the replacement Katakana data file presents unconverted ASCII in upper case and supports a grab-bag set of romanizations, including Hebon-shiki and some recent unofficial and semi-official non-standard stuff, mashed together along with extra spellings for symbols such as stars and brackets; and both allow gemination (letter doubling) for representing sokuon and chōonpu.

＄【パーティーローマヂ !】　
　【パーティーローマヂ !】 ↕

＄ｋａｋｋｏ　ｐａａｔｙ　ｒｏｏｍａjｉ !　ｋａｋｋｏ　
　　　　　　　　　　　　　　　　　　　　　　　　　　　カ

Normally, for Japanese input one will need just these three Hiragana, Katakana, and Kanji input method data files, and console-input-method will be invoked with the arguments --chinese1 --kana nippon.cin hiragana.cin katakana.cin. (The input method service that is set up for the head0 pre-supplied user-space virtual terminal, described in the Guide, is driven by per-service environment variables in a service's environment directory. So one simply sets the chinese1 environment variable to nippon.cin, modifying that to be wherever the Kanji data file is, and the kana environment variable to /usr/local/etc/cin-data-tables/hiragana /usr/local/etc/cin-data-tables/katakana, the two as one string, and it will translate into the command-line options for console-input-method.)

Rōmaji data files

To the Hiragana, Katakana, and Kanji data tables one can optionally add a Rōmaji data table, although the default null conversion (in the absence of a data table), in conjunction with what one can already do with a user-space virtual terminal even without an input method, fulfils most Rōmaji needs for Japanese. On the gripping hand, one might want a more unusual input method such as the nosh-supplied Rōmaji data table in /usr/local/etc/cin-data-tables/romaji-x11 that provides equivalents for those common X11 "compose key" sequences that cannot be typed on an ISO 9995 keyboard with the common secondary group (which is already available in nosh user-space virtual terminals).

Beware, however, of Rōmaji data tables like latin-letters.cin that one can find in various CIN file collections. This table gives multiple conversions to ASCII sequences of just a single character, and most characters have such sets of conversions. The resultant combinatorial explosion can consist of thousands of conversions generated by strings as short as 4 characters, which is slow to scroll through and has a noticeable delay in generation.

For example, the ASCII string daemon results in a list of just over 27,700 potential conversions with latin-letters.cin, ranging from dæmon through däëmŏŉ to đąěmœŋ. It is simpler and quicker to just type these with the ISO 9995 common secondary group. Rather than six keystrokes followed by more than a hundred uses of PgDn and ↓, däëmŏŉ is typed as those same six keystrokes plus a mere twelve more: D · ⇨ ⇮+T · A · ⇨ ⇮+T · E · M · ⇨ ⇮+E · O · ⇨ ⇮+I · N
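
The arithmetic behind such an explosion is simply a product: if every character independently has several alternatives, the number of whole-string conversions is the product of the per-character counts. The Python sketch below shows this with a small hypothetical table. The alternatives listed are made up for illustration and are not the contents of latin-letters.cin, so the count it produces is far smaller than the real figure quoted above, which also includes multi-character conversions such as æ.

```python
from itertools import product
from math import prod

# Hypothetical per-letter alternatives; NOT the real latin-letters.cin data.
ALTERNATIVES = {
    "d": ["d", "đ", "ď"],
    "a": ["a", "ä", "ą", "á"],
    "e": ["e", "ë", "ě", "é"],
    "o": ["o", "ŏ", "ö", "ó"],
    "n": ["n", "ŉ", "ŋ"],
}


def count_conversions(text, table):
    """Number of whole-string conversions: product of per-character counts."""
    return prod(len(table.get(ch, [ch])) for ch in text)


def conversions(text, table):
    """Lazily generate every whole-string conversion."""
    choices = (table.get(ch, [ch]) for ch in text)
    for combo in product(*choices):
        yield "".join(combo)


# 3 * 4 * 4 * 1 * 4 * 3 combinations for "daemon" with this toy table.
print(count_conversions("daemon", ALTERNATIVES))
```

Even with only three or four alternatives per letter, a six-letter word already yields hundreds of candidates, which is why typing the accented letters directly is the quicker route.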

Rhythm

The recommended typing rhythm for Kanji is to process one Kanji letter at a time: spell the Kanji — optionally select a different conversion — press ⮠ Carriage Return or Enter — repeat. This avoids combinatorial explosions with spellings such as "ni" (which has over 660 different possible conversions in Chinese Open Desktop's nippon.cin).

The recommended typing rhythm for Katakana and Hiragana is somewhat looser as there is far less potential for combinatorial explosion: spell one or more kana — optionally select a different conversion — press ⮠ Carriage Return or Enter — repeat. (For the kana themselves, even spellings such as "zi" and "ji" only have a handful of potential conversions; although things get a little tedious with the multiple conversions available for some sets of symbols if one is spelling more than one symbol at once.)

It is not possible to use multiple input methods simultaneously, so for mixed writing comprising both kana and Kanji, intersperse these rhythms with the keys for changing conversion mode. (Chinese Open Desktop's nippon.cin does include Hiragana conversions alongside the Kanji.)

＄　
　カ

＄ローマ　
　ローマ ↕

＄ローマ　
　　　　カ

＄ローマ　
　　　　汉

＄ローマ字　
　　　　字 ↕

＄ローマ字　
　　　　　汉

Note that pressing the spacebar is not actually necessary at all. It does, however, reduce the potential nippon.cin conversion list of "ni" to just the conversions for "n" and "i" individually if a space is placed between the ASCII letters. (It only reduces it by a little, to some 620 possible conversions, though.)
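
One way to picture why a space shrinks the list is to model conversion as trying every way of splitting a space-free run of ASCII into spellings found in the data table. A space then rules out any spelling that would span it. The Python sketch below is a model consistent with the behaviour described here, not the algorithm that console-input-method actually uses, and its tiny table is hypothetical rather than an excerpt from nippon.cin.

```python
# Hypothetical miniature table; real tables have hundreds of entries per key.
TABLE = {
    "ni": ["二", "日"],
    "n": ["ん"],
    "i": ["い", "意"],
}


def candidates(run, table):
    """All conversions of one space-free run, over every split into keys."""
    if not run:
        return [""]
    found = []
    for i in range(1, len(run) + 1):
        for head in table.get(run[:i], []):
            for tail in candidates(run[i:], table):
                found.append(head + tail)
    return found


def convert(text, table):
    """Convert each space-delimited run separately and combine the results."""
    results = [""]
    for run in text.split():
        results = [a + b for a in results for b in candidates(run, table)]
    return results


print(len(convert("ni", TABLE)))   # spellings "ni" and "n"+"i" both count
print(len(convert("n i", TABLE)))  # only "n"+"i" remains
```

In this model the space removes only the conversions of the two-letter spelling itself, which matches the modest reduction noted above.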

© Copyright 2018 Jonathan de Boyne Pollard. "Moral" rights asserted.
Permission is hereby granted to copy and to distribute this WWW page in its original, unmodified form as long as its last modification datestamp information is preserved.