5 Things You Should Know to Sort Japanese the Right Way

Posted by Doug McGowan on Mon, Feb 06, 2017 @ 03:38 PM

karuta700.png

All too often, developers who are native to Western languages approach Japanese as if it were just, well, another language. Source strings get translated, target strings get sorted, and everything falls to pieces. In this article we’ll take you on a journey through the complexities of sorting Japanese text, so you won’t have to learn about it the hard way.

And by “the hard way,” we mean through harsh feedback in reviews and user forums. Even just a few years ago, some applications sorted Japanese by raw character code and hyped the result as an index generation feature, even though the resulting index was unusable. These days, Japanese is approached with a skosh more awareness, but the challenges remain, like a latent road hazard for the careless.

dictionary700.png

1. Know that you’re dealing with multiple character sets

First, understand that Japanese is a language that has three distinct character sets: the phonetic hiragana and katakana, and the logographic kanji. Hiragana and katakana consist of 46 basic characters each, essentially having a 1-to-1 relationship with each other, and they are sorted in gojuon (“fifty sounds”) order, the Japanese counterpart of alphabetical order. This level of basic sorting is already supported on most platforms, although if you’re not Japanese you might have a little trouble telling the difference between シ (shi) and ツ (tsu), or リ (ri) and ソ (so) and ン (n). Let’s assume the computer will recognize the difference.
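The 1-to-1 relationship between the two kana sets is easy to see in Unicode, where each katakana character sits exactly 0x60 code points after its hiragana counterpart. Here is a minimal Python sketch that folds katakana into hiragana so that both can be compared on equal terms. The function name `katakana_to_hiragana` is ours, purely for illustration.

```python
# Offset between the Unicode katakana and hiragana blocks (ア U+30A2, あ U+3042).
KANA_OFFSET = ord("ア") - ord("あ")  # 0x60


def katakana_to_hiragana(text: str) -> str:
    """Fold katakana (ァ..ヶ) into hiragana; leave everything else unchanged."""
    folded = []
    for ch in text:
        if "ァ" <= ch <= "ヶ":
            folded.append(chr(ord(ch) - KANA_OFFSET))
        else:
            folded.append(ch)
    return "".join(folded)


print(katakana_to_hiragana("ライス"))  # らいす
```

Characters outside the ァ..ヶ range, such as the long-vowel mark ー, pass through untouched, so a real sort key would still need a rule for them.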

Not the simplest system, but things start getting really difficult when you consider the third character set, kanji. Japanese children learn 1,006 kanji characters while in elementary school, and by high school that grows to a total of 2,136, plus another 983 kanji found exclusively in people’s names. That’s a hefty load compared to the 26 alphabet letters you need to learn for English.

In China, the birthplace of kanji (called hànzì), average literacy requires about 3,000 characters, but what makes Japanese even more challenging is that each kanji can have multiple phonetic readings — and that will be the monkey wrench in your sorting and indexing machinery.

If you’re getting bored fast, I suggest that you watch these videos by NativLang and skip down to section 4.


 Quick introduction to the complexities of Japanese “spelling”

2. Know the tenuous relationship between kanji and gojuon

Gojuon is the golden rule of Japanese sorting. If you’re dealing with anything that is user-facing like an index of terms, you’ll need to integrate kanji strings into the gojuon order along with kana strings. 

For example, try sorting this three-word list: [生魚] [ライス] and [ご飯]. If you figured that ご飯 should come first (since こ is #10 in gojuon order), followed by 生魚 (since 生 starts with な in this case, which is #21 in gojuon order), and finally ライス (since ら is #39 in gojuon order), you’d be correct.
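The reasoning above can be sketched in a few lines of Python. The readings are typed in by hand (the code has no way of deriving them from the kanji), the names `GOJUON` and `gojuon_number` are our own, and only the first character of each reading is compared, which is enough for this three-word list but not for a real index.

```python
import unicodedata

# The 46 basic kana in gojuon order (position 1 = あ, position 46 = ん).
GOJUON = (
    "あいうえお" "かきくけこ" "さしすせそ" "たちつてと" "なにぬねの"
    "はひふへほ" "まみむめも" "やゆよ" "らりるれろ" "わをん"
)


def gojuon_number(kana: str) -> int:
    """Return the 1-based gojuon position of a single hiragana character.

    Voicing marks are stripped first, so ご is looked up as こ.
    """
    base = unicodedata.normalize("NFD", kana)[0]
    return GOJUON.index(base) + 1


# Hand-supplied readings for the three terms in the example.
readings = {"生魚": "なまざかな", "ライス": "らいす", "ご飯": "ごはん"}

ordered = sorted(readings, key=lambda term: gojuon_number(readings[term][0]))
print(ordered)  # ['ご飯', '生魚', 'ライス']
```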

But of course, nothing is as easy as it seems in Japanese. As mentioned previously, kanji often get pronounced in different ways depending on the word or phrase they’re used in.

Common (not all) pronunciations of the character 生
Term Hiragana Romanized Gojuon# Meaning
生きる いきる ikiru 2 To live
生む うむ umu 3 To give birth
生える はえる haeru 26 Grow
生命 せいめい seimei 14 Life
生涯 しょうがい shougai 12 A lifetime
生糸 きいと kiito 7 Raw silk
生魚 なまざかな namazakana 21 Raw fish
生方 うぶかた ubukata 3 (Proper name)

As a result, the same kanji character gets sent all over that neat and orderly gojuon matrix.

Common (not all) gojuon positions for the character 生: い, う, き, し, せ, な, and は
あ (a) い (i) う (u) え (e) お (o)
か (ka) き (ki) く (ku) け (ke) こ (ko)
さ (sa) し (shi) す (su) せ (se) そ (so)
た (ta) ち (chi) つ (tsu) て (te) と (to)
な (na) に (ni) ぬ (nu) ね (ne) の (no)
は (ha) ひ (hi) ふ (fu) へ (he) ほ (ho)
ま (ma) み (mi) む (mu) め (me) も (mo)
や (ya)   ゆ (yu)   よ (yo)
ら (ra) り (ri) る (ru) れ (re) ろ (ro)
わ (wa)       を (wo)
ん (n)        

To address this issue, you will need access to the correct in-context phonetic readings of your kanji strings. We recommend you do this with the help of an LSP, or a Japanese speaking friend, or…more on this later.
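To see why the readings matter, here is a small Python sketch using four of the readings of 生 listed earlier. It compares a naive sort of the written forms with a sort on hand-supplied readings. The plain string comparison of the readings works here only because none of them contains voiced or small kana. A real sort key needs more care.

```python
# (term, reading) pairs; the readings are supplied by hand, because the code
# cannot infer them from the kanji.
entries = [
    ("生える", "はえる"),
    ("生命", "せいめい"),
    ("生む", "うむ"),
    ("生きる", "いきる"),
]

# Naive sort: compares the written forms code point by code point.
by_code_point = sorted(term for term, _ in entries)

# Reading-based sort: compares the hiragana readings instead.
by_reading = [term for term, _ in sorted(entries, key=lambda e: e[1])]

print(by_code_point)  # ['生える', '生きる', '生む', '生命']
print(by_reading)     # ['生きる', '生む', '生命', '生える']
```

The naive sort keeps all four terms huddled together under 生, while the reading-based sort scatters them across the い, う, せ, and は rows where a Japanese reader would look for them.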

3. Know the tiny symbols and characters and how they relate to sorting

Maybe you’ve looked at some Japanese text and noticed little whiskers or bubbles next to the characters. Or maybe some characters are smaller than the others. These are actually symbols that change the pronunciation of the character they’re next to, so if you’re sorting the Romanized version of a word, those little marks make a big difference. For example, は (ha) can be turned into ば (ba) and ぱ (pa) as shown below.

あ (a) い (i) う (u) え (e) お (o)
か (ka) が (ga) き (ki) ぎ (gi) く (ku) ぐ (gu) け (ke) げ (ge) こ (ko) ご (go)
さ (sa) ざ (za) し (shi) じ (ji) す (su) ず (zu) せ (se) ぜ (ze) そ (so) ぞ (zo)
た (ta) だ (da) ち (chi) ぢ (ji) つ (tsu) づ (zu) て (te) で (de) と (to) ど (do)
な (na) に (ni) ぬ (nu) ね (ne) の (no)
は (ha) ば (ba) ぱ (pa) ひ (hi) び (bi) ぴ (pi) ふ (fu) ぶ (bu) ぷ (pu) へ (he) べ (be) ぺ (pe) ほ (ho) ぼ (bo) ぽ (po)
ま (ma) み (mi) む (mu) め (me) も (mo)
や (ya)   ゆ (yu)   よ (yo)
ら (ra) り (ri) る (ru) れ (re) ろ (ro)
わ (wa)       を (wo)
ん (n)        

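But (and this may come as a relief to you), when sorting in gojuon, those marks are disregarded in the first pass. So は (ha), ば (ba), and ぱ (pa) are not differentiated unless two strings are otherwise identical, in which case the unmarked form comes first.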
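One way to ignore those marks in code is to decompose each character with Unicode normalization and drop the combining voicing marks. The Python sketch below does that. The helper names are ours, and using the original reading as a tiebreaker is an assumption of this sketch to keep the order deterministic.

```python
import unicodedata

# Combining voiced (dakuten) and semi-voiced (handakuten) sound marks.
VOICING_MARKS = {"\u3099", "\u309a"}


def strip_voicing(text: str) -> str:
    """Remove dakuten/handakuten so that ば and ぱ compare like は."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if ch not in VOICING_MARKS)


def sort_key(reading: str) -> tuple:
    # Primary key ignores voicing; the raw reading is only a tiebreaker
    # (an assumption of this sketch, not a rule taken from the article).
    return (strip_voicing(reading), reading)


words = ["はな", "ばしょ", "ぱいぷ", "はり"]
print(sorted(words))                # ['はな', 'はり', 'ばしょ', 'ぱいぷ']
print(sorted(words, key=sort_key))  # ['ぱいぷ', 'ばしょ', 'はな', 'はり']
```

The first line shows what a plain code-point sort does: every ば and ぱ word is pushed behind every は word. The second line interleaves them by their second character, which is what a reader scanning the は row expects.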

The same goes for the small ゃ (ya), ゅ (yu), and ょ (yo) that are suffixed to other characters to create the following contracted sounds.

Base character + ゃ (ya) + ゅ (yu) + ょ (yo)
き (ki) きゃ (kya) きゅ (kyu) きょ (kyo)
ぎ (gi) ぎゃ (gya) ぎゅ (gyu) ぎょ (gyo)
し (shi) しゃ (sha) しゅ (shu) しょ (sho)
じ (ji) じゃ (ja) じゅ (ju) じょ (jo)
ち (chi) ちゃ (cha) ちゅ (chu) ちょ (cho)
に (ni) にゃ (nya) にゅ (nyu) にょ (nyo)
ひ (hi) ひゃ (hya) ひゅ (hyu) ひょ (hyo)
び (bi) びゃ (bya) びゅ (byu) びょ (byo)
ぴ (pi) ぴゃ (pya) ぴゅ (pyu) ぴょ (pyo)
み (mi) みゃ (mya) みゅ (myu) みょ (myo)
り (ri) りゃ (rya) りゅ (ryu) りょ (ryo)

And then you have the small っ (tsu) that can appear between two other characters to signify a double consonant (like a glottal stop). For example, まくら (makura = pillow), まっくら (makkura = pitch darkness). This small っ has a different purpose than the small ゃ, ゅ and ょ; however, the most important characteristic they all share for the purpose of this article is that they are not differentiated from their full-size versions during sorting. や and ゃ, ゆ and ゅ, よ and ょ, and つ and っ are treated as the same character.
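Treating the small characters as their full-size versions can be sketched in Python with a simple translation table. The table below covers hiragana only for brevity, and the names `SMALL_TO_FULL` and `fold_small_kana` are ours.

```python
# Map small hiragana to their full-size counterparts.
SMALL_TO_FULL = str.maketrans("ぁぃぅぇぉっゃゅょゎ", "あいうえおつやゆよわ")


def fold_small_kana(reading: str) -> str:
    """Treat small kana such as っ and ゃ as their full-size versions."""
    return reading.translate(SMALL_TO_FULL)


print(fold_small_kana("まっくら"))  # まつくら

words = ["まっち", "まつ", "まくら", "まっくら"]
print(sorted(words))                       # ['まくら', 'まっくら', 'まっち', 'まつ']
print(sorted(words, key=fold_small_kana))  # ['まくら', 'まつ', 'まっくら', 'まっち']
```

In the plain sort, small っ has a lower code point than つ, so まつ (pine) lands after both っ words. After folding, まつ takes its place ahead of the longer words that start with the same two characters.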

index700.png

Index page from an elementary school textbook

4. Know the sorting scheme and what can go wrong

At the FileMaker Community site, it is stated that after symbols, numerals, and the Roman, Greek, and Cyrillic alphabets, the Japanese character sets are sorted with hiragana and katakana combined, followed by kanji in order of Shift JIS character code, followed by more kanji supported only in Unicode. There is no kana-kanji integration at this point, and there’s a good reason for that.
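To make that scheme concrete, here is a rough Python sketch of a sort key that follows the same grouping: symbols, numerals, Roman, Greek, Cyrillic, combined kana, kanji in Shift JIS order, then kanji that exist only in Unicode. This is our own approximation built from the description above, not FileMaker’s actual implementation, and all of the names in it are hypothetical.

```python
import unicodedata

SYMBOL, NUMERAL, LATIN, GREEK, CYRILLIC, KANA, KANJI = range(7)


def script_rank(ch: str) -> int:
    """Classify a character into one of the script groups (rough heuristic)."""
    if ch.isdigit():
        return NUMERAL
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return KANJI
    for keyword, rank in (("HIRAGANA", KANA), ("KATAKANA", KANA),
                          ("LATIN", LATIN), ("GREEK", GREEK),
                          ("CYRILLIC", CYRILLIC)):
        if keyword in name:
            return rank
    return SYMBOL


def char_key(ch: str) -> tuple:
    rank = script_rank(ch)
    if rank == KANA and "ァ" <= ch <= "ヶ":
        ch = chr(ord(ch) - 0x60)  # combine katakana with hiragana
    if rank == KANJI:
        try:
            return (rank, 0, ch.encode("shift_jis"))  # Shift JIS code order
        except UnicodeEncodeError:
            return (rank, 1, ch.encode("utf-8"))      # Unicode-only kanji last
    return (rank, 0, ch.encode("utf-8"))


def scheme_key(text: str) -> list:
    return [char_key(ch) for ch in text]


terms = ["生魚", "ライス", "ご飯", "Rice", "123", "私"]
print(sorted(terms, key=scheme_key))
# ['123', 'Rice', 'ご飯', 'ライス', '私', '生魚']
```

Notice that 生魚 ends up last even though its reading starts with な. Without phonetic information there is nothing to tie the kanji back to gojuon order.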

I tested the sorting function of Microsoft Excel using a random set of strings within a narrow range beginning with し (shi). Microsoft gets it all right. All of the kanji and kana terms are sorted in gojuon order the same as their hiragana phonetic counterparts (think of them as the control group).

Sort A: Correct Gojuon Sorting

goodsort.png

But take a look at the next sort attempt — the same text strings, sorted using the same function in Excel, produce vastly different results. This time, the source text gets separated into hiragana and katakana first, followed by kanji (the same as FileMaker). Those kanji are rearranged in a way that doesn’t make sense to ordinary people, and as you can see, they no longer match the hiragana strings that were correctly sorted.

Sort B: Incorrect Gojuon Sorting

badsort.png

What happened? Why the difference? Well, actually there was a difference in the source strings between Sort A and Sort B that you can’t see.

The Sort A strings were input in Excel manually, which means they were keyed in using hiragana and converted to the final form as kanji or katakana. Somewhere in the background, phonetic information associated with the strings is stored in the data.

The Sort B strings, on the other hand, were copied and pasted from a text file, so they do not have any phonetic information accompanying the terms. Presumably, a default sorting scheme similar to FileMaker’s kicked in, since the kanji are ordered sequentially from S-JIS 8E84 (私) to 90B6 (生).
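The difference between the two sorts can be imitated with a small Python sketch. It is only a guess at the behavior described above, not Excel’s actual algorithm: each cell optionally carries a reading, and the sort key falls back to the raw Shift JIS bytes when no reading is present. The `Cell` class and `sort_key` function are hypothetical names of ours.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cell:
    text: str                      # what the user sees
    reading: Optional[str] = None  # hiragana reading, if one was stored


def sort_key(cell: Cell) -> tuple:
    if cell.reading is not None:
        # Reading available: sort phonetically (simplified comparison).
        return (0, cell.reading.encode("utf-8"))
    # No reading: fall back to the raw Shift JIS bytes of the text.
    return (1, cell.text.encode("shift_jis"))


# Same four strings, once "typed" (with readings) and once "pasted" (without).
typed = [Cell("生涯", "しょうがい"), Cell("私語", "しご"),
         Cell("しか", "しか"), Cell("しんぶん", "しんぶん")]
pasted = [Cell(c.text) for c in typed]

print([c.text for c in sorted(typed, key=sort_key)])
# ['しか', '私語', '生涯', 'しんぶん']
print([c.text for c in sorted(pasted, key=sort_key)])
# ['しか', 'しんぶん', '私語', '生涯']
```

With readings, the kanji terms slot in among the kana terms. Without them, the kana strings come first and the kanji trail behind in code order, with 私 (8E84) ahead of 生 (90B6).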

5. Know how to get help and insight

For accurate sorting based on gojuon, you need phonetic information to accompany the terms. The surest way is to consult an LSP, but if you have budgeting or timing issues that prevent you from doing this, there are other options you might consider.

For instance, if you look around you can find discussions on how to sort Japanese kanji words programmatically. Using the open-source part-of-speech and morphological analyzer MeCab to leverage IPA dictionaries to convert kanji into kana seems to be a convenient (albeit not perfect) way to deal with the situation. An alternative would be the GetPhonetic method if you are using Microsoft VBA. More discussion can be found here and here.
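As a rough idea of what the MeCab route might look like in Python, here is a sketch that assumes the third-party mecab-python3 package and an IPA-format dictionary are installed. In IPA format each token’s feature string has nine comma-separated fields, and the eighth holds the katakana reading. The helper names are ours, and the result is the analyzer’s best guess, which is not guaranteed to be the reading the author intended.

```python
def reading_from_mecab_output(output: str) -> str:
    """Join the readings found in MeCab's default output format.

    Each token line looks like "surface<TAB>feature,feature,...". Tokens
    without a reading field (unknown words) fall back to their surface form.
    """
    parts = []
    for line in output.splitlines():
        if line == "EOS" or "\t" not in line:
            continue
        surface, feature = line.split("\t", 1)
        fields = feature.split(",")
        # Index 7 is the reading in IPA-format dictionaries (an assumption
        # that does not hold for other dictionary formats).
        if len(fields) > 7 and fields[7] != "*":
            parts.append(fields[7])
        else:
            parts.append(surface)
    return "".join(parts)


def kana_reading(text: str) -> str:
    """Return a katakana reading for text using MeCab."""
    import MeCab  # third-party; imported lazily so the parser above stands alone

    tagger = MeCab.Tagger()  # assumes the default dictionary is IPA-format
    return reading_from_mecab_output(tagger.parse(text))
```

The katakana returned by `kana_reading` can then be folded to hiragana and fed into whatever gojuon sort key you are using. It is worth spot-checking the output for proper names and for kanji with many readings.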

Kanji are assigned different code values in JIS, the predominant encoding used on the internet; Shift-JIS (SJIS), the Microsoft-developed version of JIS used in both Windows and Macintosh; EUC, which is used on UNIX; and Unicode (UTF-8, UTF-16), the global standard that includes all characters of the world. A sort by raw code value can therefore come out differently depending on the encoding, so make sure you are absolutely clear on which one you are dealing with from the get-go. Here’s a handy list of the codes.
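A quick way to check how much code order depends on the encoding is to sort the same characters by their encoded bytes. The Python sketch below does this for 私 and 生 using codecs from the standard library. The function name `order_by_encoding` is ours.

```python
def order_by_encoding(chars, codec):
    """Sort characters by their raw bytes in the given encoding."""
    return sorted(chars, key=lambda ch: ch.encode(codec))


kanji = ["生", "私"]

for codec in ("shift_jis", "euc_jp", "utf-8", "utf-16-be"):
    codes = {ch: ch.encode(codec).hex() for ch in kanji}
    print(codec, order_by_encoding(kanji, codec), codes)
```

For this pair, the two JIS-derived encodings put 私 before 生, while both Unicode encodings put 生 (U+751F) before 私 (U+79C1). Two characters are a tiny sample, but they show that "sorted by character code" means nothing until you name the encoding.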

Last but not least, keep an eye out for lists where non-gojuon order is more appropriate. For example, in Japan it is customary to list prefectures from north to south, starting with Hokkaido and ending with Okinawa. The order from 1 to 47 is prescribed by the international standard ISO 3166-2:JP.
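For a list like this, the simplest approach is to sort on the code rather than the name. Here is a minimal Python sketch with just four of the 47 prefectures. The dictionary is a hand-typed subset for illustration, and the names `PREFECTURE_CODES` and `by_prefecture_code` are ours.

```python
# A few prefectures with their ISO 3166-2:JP codes (subset for illustration).
PREFECTURE_CODES = {
    "北海道": "JP-01",  # Hokkaido
    "東京都": "JP-13",  # Tokyo
    "大阪府": "JP-27",  # Osaka
    "沖縄県": "JP-47",  # Okinawa
}


def by_prefecture_code(names):
    """Sort prefecture names by their ISO 3166-2:JP code, not by reading."""
    return sorted(names, key=lambda name: PREFECTURE_CODES[name])


print(by_prefecture_code(["沖縄県", "東京都", "北海道", "大阪府"]))
# ['北海道', '東京都', '大阪府', '沖縄県']
```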

As you can see, sorting text in Japanese is a bit more challenging than in Western languages. But if you keep these characteristics in mind, and keep an open mind (that’s ready to surf the web for answers), you’ll be able to overcome the challenges and sort Japanese the right way, right away.

Top photo: Picture cards (efuda) from a deck of karuta. Karuta is a game where one person reads from a deck of phrase cards (yomifuda), and players try to grab the corresponding picture card with the correct character. Aiueo-karuta are a fun way for kids to learn their hiragana. (Fun factoid: The word “karuta” derives from the Portuguese word “carta” (card) which entered Japan in the mid-16th century along with Portuguese traders.)


If you have plans to do business in Japan or with Japan, or would like to see some examples of content localization into Japanese, please check out the Moravia Japan Blog here.
