Mobileread
More info regarding the zh-cn / zh-tw differences for AZW3 output
#1  akatsuki 12-07-2019, 01:31 PM
Hello Calibre developer,

This topic is a continuation of the older thread "When Calibre convert, the input language is zh-tw, but the output language become zh". I was not able to reply to that thread, so I opened a new one. If that is bad manners here, please let me know.

1. I am willing to help write code and debug

I understand that Calibre is an open source project and that no developer is obliged to solve any particular problem. Therefore, I am willing to help write code and debug. However, I currently have no knowledge of the AZW3 internal format or the architecture of Calibre, so I am posting here to gather information and seek help.

2. The reading difference between zh-cn and zh-tw

The main difference is that the Kindle operating system provides different fonts for them.

For zh-cn, they are: 宋体, 黑体, 楷体, 圆体.
For zh-tw, they are: 宋體, 黑體, 楷體, 圓體.
(Notice the slight difference in the names)
It seems that the Kindle does not yet support zh-{hk,mo,sg,my}. However, zh-{hk,mo} are similar to zh-tw, and zh-{sg,my} are similar to zh-cn.

3. Why does the font matter?

To save encoding space, the Unicode Consortium merged CJKV characters from different countries and territories into shared code points (Han unification). As a result, the reader must choose the correct font; otherwise, character shapes from different regions will appear mixed in the middle of a paragraph. Most shared characters have similar shapes, so the reader can guess, but somewhat less than 1% of the characters are unintelligible because their shapes differ too much.

You can learn about the Unicode same-code-point-different-shape problem from this picture on Wikipedia.
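
To make the point concrete, here is a small Python sketch (an illustration only; the helper name `describe` is made up for this example). It shows that the text itself carries no regional information: 骨, a commonly cited character whose glyph differs between mainland and Taiwan fonts, is a single code point with a single UTF-8 encoding, so only the font chosen for the book's language decides which shape is drawn.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


sample = "骨"
info = describe(sample)
utf8_bytes = sample.encode("utf-8")

# The same code point and bytes are stored whether the book is zh-cn or zh-tw.
print(info)
print(utf8_bytes.hex())
```

Nothing in these bytes says "Simplified" or "Traditional", which is why the language tag in the book metadata matters.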

4. Possible values for zh-cn and zh-tw

From previous posts, I understand that it is not clear which XML values the Kindle recognizes as zh-cn and zh-tw. I think they might be among the following:

Code
zho-cn / zho-tw
zho-hans / zho-hant
zho-sim / zho-trad (or maybe zho-tra)
zho_CN / zho_TW
...
This way we can narrow down the search, so the amount of work may be less ... probably.

5. Possible fallback method?

In case none of them work, maybe it would be possible to add an
Code
<html lang="zh-tw">
attribute to force the Kindle to use the correct font, if the Kindle uses an HTML renderer that understands it.
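
As a rough sketch of this fallback, the following Python snippet sets both `lang` and `xml:lang` on the root element of an XHTML page using only the standard library. The function name `tag_language` and the sample page are hypothetical, and whether the Kindle renderer honors these attributes is exactly the open question, so treat this as an experiment aid and not a confirmed fix.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def tag_language(xhtml: str, lang: str) -> str:
    """Set lang and xml:lang on the root <html> element of an XHTML string."""
    # Keep the XHTML namespace as the default (unprefixed) namespace on output.
    ET.register_namespace("", XHTML_NS)
    root = ET.fromstring(xhtml)
    root.set("lang", lang)
    root.set(f"{{{XML_NS}}}lang", lang)
    return ET.tostring(root, encoding="unicode")


# Hypothetical one-paragraph page, used only for demonstration.
page = '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>骨</p></body></html>'
tagged = tag_language(page, "zh-tw")
print(tagged)
```

Running this over the pages of a test book before conversion would show whether the device picks the zh-tw fonts from markup alone.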

Thank you.

#2  BetterRed 12-07-2019, 06:22 PM
Quote akatsuki
Hello Calibre developer,

This topic is a continuation of the older thread "When Calibre convert, the input language is zh-tw, but the output language become zh". I was not able to reply to that thread, so I opened a new one. If that is bad manners here, please let me know.
If you're referring to this:

[attachment]

It's a warning to deter piggyback posts to old threads, but it shouldn't prevent new posts, especially from the original poster.

Let me know if you want this thread to be merged with the old one.

BR

#3  kovidgoyal 12-07-2019, 08:58 PM
If you wish to contribute code, feel free to do so. The azw3 output plugin is in the writer8 folder. Search for lang in that folder. As far as I know, the azw3 format has no support for anything other than ISO 639-1 lang codes, but if you have an azw3 file that does specify a country code, you will have to use a hex editor to check how the country code is stored in the header and implement it in the azw3 output plugin. There is a description of the header fields of MOBI/AZW3 files in the mobileread wiki.
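
A minimal sketch of the kind of header check described above, written in Python in place of a hex editor. This is an illustration under assumptions, not Calibre's code. The offsets (PalmDB record list at byte 78, "MOBI" magic 16 bytes into record 0, a 4-byte locale field at 0x5C within record 0) are taken from my recollection of the wiki's MOBI header description and should be verified against it. The function names are hypothetical, and the demo file is synthetic, not a real book.

```python
import struct

PALMDB_NUM_RECORDS_OFFSET = 76  # 2-byte record count in the PalmDB header
PALMDB_RECORD_LIST_OFFSET = 78  # first 8-byte record entry; starts with the record offset
MOBI_MAGIC_OFFSET = 16          # "MOBI" follows the 16-byte PalmDOC header in record 0
MOBI_LOCALE_OFFSET = 0x5C       # ASSUMED position of the locale field within record 0


def read_locale(data: bytes) -> tuple[int, int]:
    """Return (language byte, region byte) from the locale field of record 0."""
    (rec0,) = struct.unpack_from(">L", data, PALMDB_RECORD_LIST_OFFSET)
    magic_at = rec0 + MOBI_MAGIC_OFFSET
    if data[magic_at:magic_at + 4] != b"MOBI":
        raise ValueError("no MOBI header found in record 0")
    (locale,) = struct.unpack_from(">L", data, rec0 + MOBI_LOCALE_OFFSET)
    return locale & 0xFF, (locale >> 8) & 0xFF


def make_fake_file(locale: int) -> bytes:
    """Build a tiny synthetic file with just enough structure for read_locale."""
    header = bytearray(PALMDB_RECORD_LIST_OFFSET)
    struct.pack_into(">H", header, PALMDB_NUM_RECORDS_OFFSET, 1)
    rec0_offset = PALMDB_RECORD_LIST_OFFSET + 8
    entry = struct.pack(">L4x", rec0_offset)  # offset plus 4 padding bytes
    record0 = bytearray(MOBI_LOCALE_OFFSET + 4)
    record0[MOBI_MAGIC_OFFSET:MOBI_MAGIC_OFFSET + 4] = b"MOBI"
    struct.pack_into(">L", record0, MOBI_LOCALE_OFFSET, locale)
    return bytes(header) + entry + bytes(record0)


# 0x0404 and 0x0804 are the Windows locale IDs for zh-TW and zh-CN.
print(read_locale(make_fake_file(0x0404)))
print(read_locale(make_fake_file(0x0804)))
```

On a real file the call would look like `read_locale(open("book.azw3", "rb").read())`, where `book.azw3` is a placeholder name. Comparing the result for a book known to display with the zh-tw fonts against one converted by Calibre would show whether the region byte is where the difference lies.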
