Mobileread
Problem with OCR
#1  ichnilatis 12-29-2020, 12:13 PM
Hi,
I have installed ell.traineddata and grc.traineddata into koreader/data/tessdata, but KOReader doesn't recognize the text of a scanned PDF I have in Ancient Greek, even though I have switched on "Forced OCR".

I would also like to ask why there are only two options for "Document Language", English and Chinese?

Thank you for your help!


P.S.: Let me wish you all a blessed new year. May the light of the newborn Christ illuminate your heart in a dark hopeless world! (sorry if it is not politically correct)

#2  Frenzie 12-29-2020, 01:06 PM
I suspect it was written by a Chinese contributor many years ago. Ideally someone would polish it a bit by making the options depend on what's in that folder, but for the moment you can set the default in persistent.defaults.lua.
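A minimal sketch of what "depending on what's in that folder" could look like, in plain Lua. This is not KOReader's code: the helper name `traineddata_codes` is hypothetical, and the list of file names is passed in by hand rather than read from koreader/data/tessdata, so the example runs on its own.

```lua
-- Hypothetical helper, not part of KOReader: given the file names found in
-- the tessdata folder, return the language codes that have training data.
local function traineddata_codes(filenames)
    local codes = {}
    for _, name in ipairs(filenames) do
        -- "grc.traineddata" -> "grc"; anything else is skipped
        local code = name:match("^([%w_]+)%.traineddata$")
        if code then
            table.insert(codes, code)
        end
    end
    table.sort(codes)
    return codes
end

-- Example with the two files mentioned in this thread plus an unrelated file
local codes = traineddata_codes({"grc.traineddata", "ell.traineddata", "readme.txt"})
print(table.concat(codes, ", "))  -- prints "ell, grc"
```

A real implementation would still need a way to map each code to a readable label for the menu.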

Incidentally, is there a document available on Archive.org or some such to test with?

#3  ichnilatis 12-29-2020, 01:26 PM
Quote Frenzie
I suspect it was written by a Chinese contributor many years ago. Ideally someone would polish it a bit by making the options depend on what's in that folder, but for the moment you can set the default in persistent.defaults.lua.

Incidentally, is there a document available on Archive.org or some such to test with?
What word should replace "Chinese"? I mean, what change should I make in persistent.defaults.lua?

I have uploaded a page of a scanned book. I noticed that the book I was reading was in DjVu format, so I converted the page to PDF for you. I believe that the problem exists for both PDF and DjVu.
[pdf] p0242.pdf (600.6 KB, 31 views)

#4  Frenzie 12-29-2020, 04:35 PM
The label text doesn't really matter; it's the three-letter code behind it that counts. In your case, grc and ell.
https://github.com/koreader/koreader/blob/fbf60f96e4e280e17f008a13d27ee4abc307b96f/defaults.lua#L115-L118

#5  Frenzie 12-29-2020, 04:43 PM
It works for me, more or less. The OCR isn't great at detecting spaces in italic text.
Screenshot_2020-12-29_21-42-26.png 

#6  ichnilatis 12-30-2020, 04:55 AM
So, do I have to make this correction?

-- document languages for OCR
DKOPTREADER_CONFIG_DOC_LANGS_TEXT = {"English", "Ancient Greek"}
DKOPTREADER_CONFIG_DOC_LANGS_CODE = {"eng", "grc"} -- language code, make sure you have corresponding training data
DKOPTREADER_CONFIG_DOC_DEFAULT_LANG_CODE = "eng" -- that have filenames starting with the language codes

From the screenshot you sent I conclude that the breathings (᾿ ῾), the circumflex (῀) and the grave accent (`) are not recognized... and some letters

Can this problem be solved?

#7  Frenzie 12-30-2020, 07:02 AM
Quote ichnilatis
So, do I have to make this correction?

-- document languages for OCR
DKOPTREADER_CONFIG_DOC_LANGS_TEXT = {"English", "Ancient Greek"}
DKOPTREADER_CONFIG_DOC_LANGS_CODE = {"eng", "grc"} -- language code, make sure you have corresponding training data
DKOPTREADER_CONFIG_DOC_DEFAULT_LANG_CODE = "eng" -- that have filenames starting with the language codes
Something like that, yes. If you want to keep it, make sure to put it in persistent.defaults.lua.

Quote
From the screenshot you sent I conclude that the breathings (᾿ ῾), the circumflex (῀) and the grave accent (`) are not recognized... and some letters

Can this problem be solved?
It's probably much less of a problem in non-italic text, but otherwise not really, unless you have an original document with a slightly higher DPI. A newer version of Tesseract might also do slightly better.

#8  ichnilatis 12-30-2020, 07:17 AM
Quote Frenzie
Something like that, yes. If you want to keep it, make sure to put it in persistent.defaults.lua.
Not just in defaults.lua? Where can I find persistent.defaults.lua?

Quote Frenzie
It's probably much less of a problem in non-italic text, but otherwise not really, unless you have an original document with a slightly higher DPI. A newer version of Tesseract might also do slightly better.
I use version 3.04, as recommended here. Can I use a newer version of Tesseract?

Thanks for your replies!

#9  Frenzie 12-30-2020, 11:29 AM
Quote ichnilatis
Not just in defaults.lua? Where can I find persistent.defaults.lua?
It's a file you have to create yourself. defaults.lua will be overwritten by updates.
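As a rough sketch, such a file could contain nothing but the three OCR settings discussed above. The variable names come from defaults.lua as quoted in this thread. The labels, the choice of "grc" as default, and keeping the list at two entries are assumptions. The file name is the one given in this thread; the exact name and location are worth checking against your KOReader version.

```lua
-- persistent.defaults.lua (name as given in this thread; you create it yourself)
-- Overrides for the OCR document-language settings from defaults.lua.
-- The labels are free text; each code must match the start of a file name in
-- koreader/data/tessdata (here grc.traineddata and ell.traineddata).
-- Assumption: labels and codes are paired by position, as in defaults.lua.
DKOPTREADER_CONFIG_DOC_LANGS_TEXT = {"Ancient Greek", "Greek"}
DKOPTREADER_CONFIG_DOC_LANGS_CODE = {"grc", "ell"}
DKOPTREADER_CONFIG_DOC_DEFAULT_LANG_CODE = "grc"
```

Because the file is plain Lua, any plain-text editor can be used to create it.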



Quote
Can I use a newer version of Tesseract?
Not in KOReader, but an update to Tesseract 4 is coming. I wouldn't count on any noticeable improvements except in some edge cases, but at the same time it's probably not getting any worse either.

#10  ichnilatis 12-30-2020, 11:44 AM
Frenzie, I have made the correction in defaults.lua, and individual words are now recognized correctly. (I tried to take a screenshot to show you, but I couldn't. I've just started a thread with that question...) But when I select more than one word and then choose the dictionary in the popup menu, nothing happens.

Also, I notice that when I highlight one or more words, the bookmark doesn't show the text as usual, only the page and the time.

Quote Frenzie
It's a file you have to create yourself. defaults.lua will be overwritten by updates.
You mean I can create the file in Notepad with just the above-mentioned OCR text?

One more question: why are there only two options for the text language? What should the second option be instead of "Chinese"? Does each user have to make the change manually in defaults.lua?

Thanks again!
