Mobileread
Should Chinese Fonts be Embedded in Ebooks?
#1  Tex2002ans 05-26-2020, 09:31 PM
Since I don't read/write Chinese, I was wondering if anyone on MR could help.

I know that many CJK Unicode characters can render differently depending on which language they're in (Chinese/Korean/Japanese). (See "Han unification" on Wikipedia.)

The Fonts/Sentences

The documents I'm converting used these 4 fonts in the original DOCs:

SimSun
MS Gothic
PMingLiU
MS Mincho

Here's an example sentence of each:

Code
(<i>Shujing</i>, “The Great Declaration I”, <span style='font-family:SimSun'>泰誓上</span>)
[...]
Liu E, also known as Liu Tieyun <span style='font-family:"MS Gothic"'>劉鐵雲</span>, was born in 1857 at Liuhe <span style='font-family:"MS Gothic"'>六合</span> county in what is today Nanjing <span style='font-family:"MS Gothic"'>南京</span>.
[...]
From Liu E’s<span style='font-family:"PMingLiU"'>劉鶚</span> preface to <i>The Travels of Laocan</i> (<i>Laocan youji</i> <span style='font-family:"PMingLiU"'>老殘遊記</span>).
[...]
In his <i>Historical Records</i> (<i>Shiji</i> <span style='font-family:"MS Mincho"'>史記</span>), Sima Qian quotes the philosopher Jia Yi,


(There are ~80 in total.)

I converted all to use lang="zh" + xml:lang="zh":

Code
(<i>Shujing</i>, “The Great Declaration I”, <span class="chinese" lang="zh" xml:lang="zh">泰誓上</span>)
[...]
The Questions

1. Is "zh" the proper lang to use in this case?

(I used Google Translate and it seems like all the characters are in Chinese, but I'm not sure if it's Simplified/Traditional [zh-Hans or zh-Hant].)

2. When working with these characters, would it be best to embed a Chinese/language-specific font? If so, which one?

(Free/Open font preferable.)

3. Is there any better way of handling conversion to ebook? Or should I just trust that the source document had them typed in correctly and that ereaders will render them okay?

I visually inspected some, and they seem to render similar to the source documents, but I'm not sure how they'll appear on actual ereaders.
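On question 1, here is a rough Python sketch of one way to get a first guess at Simplified vs. Traditional without reading Chinese. It is only a heuristic, and the function names are made up for illustration: it assumes that a character which encodes in the legacy Big5 charset but not in GB2312 is a Traditional form, and the reverse is a Simplified form. Characters shared by both scripts give no signal.

```python
def encodable(ch, codec):
    """Return True if the single character can be encoded in the given legacy charset."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def guess_script(text):
    """Heuristic guess: 'Hant', 'Hans', 'mixed', or 'ambiguous'.

    Assumption: Big5-only characters are Traditional forms and
    GB2312-only characters are Simplified forms.
    """
    hant_only = any(encodable(c, "big5") and not encodable(c, "gb2312") for c in text)
    hans_only = any(encodable(c, "gb2312") and not encodable(c, "big5") for c in text)
    if hant_only and hans_only:
        return "mixed"
    if hant_only:
        return "Hant"
    if hans_only:
        return "Hans"
    return "ambiguous"


if __name__ == "__main__":
    for sample in ["泰誓上", "劉鐵雲", "六合", "南京", "老殘遊記", "史記"]:
        print(sample, guess_script(sample))
```

Note that this cannot tell Traditional Chinese apart from Japanese written in the same kanji (a word like 史記 looks identical in both), so the original font is still a useful extra hint.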

The examples all look the same except for some small differences in #1 (SimSun + whatever font Sigil is rendering these in):

[Attached images, two per font: SimSun, MS Gothic, PMingLiU, MS Mincho]

Side Note: For some more CJK unicode goodness, also see:

https://meta.stackexchange.com/questions/251743/whitelist-the-span-tag-with-the-lang-attribute-in-order-to-support-han-chara
https://modelviewculture.com/pieces/i-can-text-you-a-pile-of-poo-but-i-cant-write-my-name

Seems like even many sites don't handle certain cases properly... so I can't imagine the ebook side of things. :P

#2  jhowell 05-26-2020, 10:38 PM
Are these books being produced for sale?

Do you have specific ecosystems in mind for these books?

I don't read or speak Chinese but I know that Kindles have fonts for Chinese books and have different handling for simplified vs. traditional Chinese.

#3  Tex2002ans 05-27-2020, 01:32 AM
Quote jhowell
Are these books being produced for sale?
Yes.

Quote jhowell
Do you have specific ecosystems in mind for these books?
All the usual major ones. (B&N, Kobo, Amazon, [...].)

Quote jhowell
I don't read or speak Chinese but I know that Kindles have fonts for Chinese books and have different handling for simplified vs. traditional Chinese.
I was treating it similar to how I handle Polytonic Greek. Since many of those obscure Greek characters don't show up on old devices, I embed a font (like Galatia SIL) just for that "greek" class, then subset it.

With Chinese, I previously ran across only ~2-3 characters in an entire book. In that case, I either didn't bother (2 characters likely wouldn't be missed if the reader didn't display), or I subset a font (like Droid Sans Fallback) just for those.
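For the subsetting step, a minimal Python sketch of how the needed characters could be collected from the marked-up spans. It assumes the spans carry a `lang` attribute (as in post #1) and are not nested; the helper names are hypothetical. The resulting `U+XXXX` list is the kind of input a subsetter such as fontTools' `pyftsubset` takes through its `--unicodes` option, but check that against the tool you actually use.

```python
import re

# Matches <span ... lang="xx" ...>text</span>; assumes spans are not nested.
LANG_SPAN_RE = re.compile(r'<span[^>]*\blang="([^"]+)"[^>]*>(.*?)</span>', re.DOTALL)


def chars_by_lang(xhtml):
    """Map each lang value to the set of non-space characters used under it."""
    result = {}
    for lang, text in LANG_SPAN_RE.findall(xhtml):
        result.setdefault(lang, set()).update(ch for ch in text if not ch.isspace())
    return result


def unicodes_arg(chars):
    """Format a set of characters as a comma-separated U+XXXX list."""
    return ",".join(f"U+{ord(ch):04X}" for ch in sorted(chars))


if __name__ == "__main__":
    sample = '(<i>Shujing</i>, <span class="chinese" lang="zh" xml:lang="zh">泰誓上</span>)'
    for lang, chars in chars_by_lang(sample).items():
        print(lang, unicodes_arg(chars))
```

Subsetting per `lang` value keeps each embedded font limited to the glyphs that language's spans actually need.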

In this specific case, it's 2 articles (out of ~230) that have dozens of Chinese words inside... and now that I've since learned about the language-dependent glyphs, I want this done right.

Side Note: Just now I ran across this:

https://en.wikipedia.org/wiki/List_of_CJK_fonts

which lists:

None are open-source (so definitely not embeddable).

And I may be dealing with different languages than I thought... I also wonder if Droid Sans Fallback is substitutable for all those, and will morph depending on lang... has anyone tested this across different ereaders?

Side Note #2: Here are the 2 actual PDFs if anyone wants to take a closer look:

http://libertarianpapers.org/wp-content/uploads/article/2013/lp-5-1-5.pdf
http://libertarianpapers.org/wp-content/uploads/2016/06/post/2016/06/lp-8-1-6.pdf

Everything is CC 3.0.

#4  Quoth 05-27-2020, 05:13 AM
Do the PDFs embed the required fonts? Otherwise you don't know what it should look like.

#5  Tex2002ans 05-27-2020, 06:01 AM
Quote Quoth
Do the PDFs embed the required fonts? Otherwise you don't know what it should look like.
Those are the PDFs generated years ago, and then I have the actual DOC(X)s (this is how I know all the font information + have all the correct underlying unicode characters).

But there are two parallel issues here:

1. Fonts: Since I can't use any of those 4 proprietary fonts, I'm going to have to rely on different fonts in the ebook.

On the proofing side of things, it's hard to tell if this is simple font differences (like a difference between Serif/Sans-Serif fonts)... or if stripping those fonts can cause the displayed text to now be wrong.

Side Note: It looks like "Source Han Sans" may be another potential font candidate.

2. HTML Language: There are actual language variations (different swashes and swooshes).

For example, this single character:

返 (U+8FD4)

in different languages, has at least 5 different representations:

https://en.wikipedia.org/wiki/File:Source_Han_Sans_Version_Difference.svg

In ebooks, this would require proper lang markup:

Code
<span lang="zh-Hans">返</span> (Simplified Chinese)
<span lang="zh-Hant">返</span> (Traditional Chinese)
<span lang="zh-HK">返</span> (Traditional Chinese - Hong Kong)
<span lang="ja">返</span> (Japanese)
<span lang="ko">返</span> (Korean)
All are the same Unicode character, but should display differently (like the above SVG).
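To make that concrete, here is a small Python sketch (standard library only; the collector class is just an illustration) that parses the five spans and shows that the text contains a single codepoint, so the `lang` value is the only thing a renderer has to choose the glyph variant by.

```python
from html.parser import HTMLParser


class LangSpanCollector(HTMLParser):
    """Collect (lang, text) pairs for text found inside <span lang="..."> elements."""

    def __init__(self):
        super().__init__()
        self._lang_stack = []
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._lang_stack.append(dict(attrs).get("lang"))

    def handle_endtag(self, tag):
        if tag == "span" and self._lang_stack:
            self._lang_stack.pop()

    def handle_data(self, data):
        # Only keep text that sits inside a span carrying a lang attribute.
        if self._lang_stack and self._lang_stack[-1] and data.strip():
            self.pairs.append((self._lang_stack[-1], data.strip()))


SAMPLE = """
<span lang="zh-Hans">返</span> (Simplified Chinese)
<span lang="zh-Hant">返</span> (Traditional Chinese)
<span lang="zh-HK">返</span> (Traditional Chinese - Hong Kong)
<span lang="ja">返</span> (Japanese)
<span lang="ko">返</span> (Korean)
"""

collector = LangSpanCollector()
collector.feed(SAMPLE)
langs = [lang for lang, _ in collector.pairs]
codepoints = {f"U+{ord(ch):04X}" for _, text in collector.pairs for ch in text}

if __name__ == "__main__":
    print(langs)
    print(codepoints)
```

If the `lang` attributes get stripped during conversion, nothing in the text itself records which variant was intended.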

I mean, to me, the few sample images I posted in #1 look similar, but I don't know, because it all looks Chinese to me.

Side Note: My best guess currently is that I can change anything that was in:

PMingLiU -> lang="zh-Hant" (Traditional Chinese)
SimSun -> lang="zh-Hans" (Simplified Chinese)
MS Gothic + MS Mincho -> lang="ja" (Japanese)

then substitute in a thoroughly vetted Asian font (like Source Han Sans). But then comes actual device support... has anyone meticulously tested this stuff across devices?
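The font-to-lang guess above could be applied with a small Python pass like the sketch below. The mapping is only that best guess, not something confirmed by a Chinese or Japanese reader, and the regex assumes the spans look exactly like the Word-exported samples in #1 (single-quoted `style` attribute containing only `font-family`). The function and variable names are made up for illustration.

```python
import re

# Best-guess mapping from the original DOC fonts to language tags (an assumption).
FONT_TO_LANG = {
    "PMingLiU": "zh-Hant",
    "SimSun": "zh-Hans",
    "MS Gothic": "ja",
    "MS Mincho": "ja",
}

# Matches e.g. <span style='font-family:SimSun'>...</span>
# and <span style='font-family:"MS Gothic"'>...</span>
FONT_SPAN_RE = re.compile(
    r"""<span style='font-family:"?([^"']+)"?'>(.*?)</span>""", re.DOTALL
)


def convert_spans(html):
    """Replace font-family spans with lang/xml:lang spans; leave unknown fonts untouched."""

    def repl(match):
        font, text = match.group(1), match.group(2)
        lang = FONT_TO_LANG.get(font)
        if lang is None:
            return match.group(0)
        return f'<span lang="{lang}" xml:lang="{lang}">{text}</span>'

    return FONT_SPAN_RE.sub(repl, html)


if __name__ == "__main__":
    sample = (
        "(<i>Laocan youji</i> "
        "<span style='font-family:\"PMingLiU\"'>老殘遊記</span>)"
    )
    print(convert_spans(sample))
```

Spans in fonts that are not in the mapping are left alone, so they still show up for a manual check afterwards.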

#6  jhowell 05-27-2020, 08:43 AM
I get it now. The book is primarily in English with Chinese characters here and there.

As this relates to Kindle there are language specific fonts for Simplified and Traditional Chinese, but those won't come into play since they are enabled based on the primary language of the book. The regular fonts probably won't have the characters you want and I believe that the fallback is the Code2000 font. I doubt that has any handling of language-specific character variants.

So it does appear that embedding a font with the correct language variant would need to be done. Using images instead would be more foolproof.

#7  Quoth 05-27-2020, 10:51 AM
I gave up and used an image (screen captured and reduced from source!) at first occurrence with transliteration and then just transliteration. Which may or may not have been correct. It was a few years ago and I tended to get [][][][][] on the actual ebook, but I didn't know much about Calibre or Font Embedding or CSS for language support then.

Also, if you had someone Chinese check it, would they be the "right" Chinese person? That said, the various written scripts are simple compared with the bewildering variety of spoken "Chinese" languages.

#8  Tex2002ans 05-27-2020, 03:38 PM
Quote jhowell
I get it now. The book is primarily in English with Chinese characters here and there.


Quote jhowell
As this relates to Kindle there are language specific fonts for Simplified and Traditional Chinese, but those won't come into play since they are enabled based on the primary language of the book.
Agreed.

This is an English book with the occasional Chinese/Japanese character (~80 foreign words).

Side Note: Do you know which fonts Kindles have for Simplified/Traditional Chinese?

Quote jhowell
The regular fonts probably won't have the characters you want and I believe that the fallback is the Code2000 font.
I believe so too.

Symbola is also a "fallback font" I embed whenever I'm dealing with very obscure Unicode characters (like Wingdings/Webdings, which I wrote about in 2016).

Quote jhowell
I doubt that has any handling of language-specific character variants. So it does appear that embedding a font with the correct language variant would need to be done.
Agreed. Doubt Symbola handles that either. Probably need a font specifically designed for Asian languages.

Quote Quoth
I gave up and used an image (screen captured and reduced from source!) at first occurrence with transliteration and then just transliteration. Which may or may not have been correct. It was a few years ago and I tended to get [][][][][] on the actual ebook, but I didn't know much about Calibre or Font Embedding or CSS for language support then.
I strongly recommend against inserting text as images. I wrote about some reasons why in the 2018 Greek thread.

Side Note: On many Asian font bugs and poor support across all types of programs... I recommend checking out some of these talks:

That's where I first learned about many of these Asian-specific issues.

#9  jhowell 05-27-2020, 06:07 PM
Quote Tex2002ans
Side Note: Do you know which fonts Kindles have for Simplified/Traditional Chinese?
As far as I know there are eight fonts. They are named Heiti, Kaiti, Song, and Yuan with separate ones for Traditional and Simplified Chinese. I don't know any details about these.

#10  Quoth 05-28-2020, 07:47 AM
Quote Tex2002ans
I strongly recommend against inserting text as images. I wrote about some reasons why in the 2018 Greek thread.
I'd agree.
It's a shame that these issues were largely solved at the OS level before anyone made any eink reader and that the early Kindles are so poor.

What I do now isn't the same as even four years ago.
