MobileRead
extra spaces in Kindle (e.g. Ganzhe i t swor t e) but not DC (e.g. Ganzheitsworte)
#1  icefusion 09-20-2018, 04:07 AM
I scanned a German document at 600dpi. Then I used Briss to split each scanned page into two PDF pages. Then I ran Acrobat DC's OCR for 600dpi output. It worked, as can be verified by copying and pasting the text.

When I send the PDF to Kindle, however, virtually every word has spaces within it. What in DC, e.g., was properly "Ganzheitsworte," when selected within Kindle is "Ganzhe i t swor t e". This renders Kindle's integrated dictionary useless. Ideas?

#2  pdurrant 09-20-2018, 04:32 AM
Use the text from Acrobat DC's OCR to create a Kindle book instead. You shouldn't expect the same results from two different OCR systems.

#3  willus 09-21-2018, 12:25 AM
I don't understand. Did you somehow use Adobe to create a new PDF with an OCR layer in it, and send that PDF to the Kindle? Or did you send the scanned PDF (after cropping with Briss) to the Kindle without having performed any OCR beforehand? I don't know enough about Adobe DC to know if it will create a PDF with an OCR layer.

#4  icefusion 09-21-2018, 01:40 AM
willus: I scanned the book as a PDF, ran it through Briss, then used Acrobat DC to add an OCR layer.

pdurrant: Exporting the text from the PDF is not an option. The document has too many quotes in foreign languages, including Greek, using the Greek alphabet. Also, the OCR made quite a few mistakes on the footnotes. I don't think that Kindle runs its own OCR but rather processes the OCR layer in the PDF, adding spaces.

#5  pdurrant 09-21-2018, 03:03 AM
Quote icefusion
I don't think that Kindle runs its own OCR but rather processes the OCR layer in the PDF, adding spaces.
That sounds unlikely. What happens if you send the original PDF?

#6  icefusion 09-24-2018, 04:40 AM
Whenever I've sent non-OCR'ed PDFs to my Kindle they lack a text layer. The same goes for this document when I use a version without the text layer.

#7  pdurrant 09-24-2018, 06:07 AM
Oh, how interesting. Could it be that the spaces are there in the text layer already?

What happens if you try to convert the PDF with text layer in calibre?

#8  icefusion 09-24-2018, 10:26 AM
When I copy text within Acrobat the spaces are absent.
I just used Calibre to export to TXT and RTF. The former only produces the document outline (but none of the document proper), which lacks the extra spaces. The latter produces the image layer, not the text.
I have posted my quandary on the Kindle forum (https://www.mobileread.com/forums/sh...d.php?t=310958), hoping that someone over there has had the same issue.

#9  willus 10-19-2018, 04:15 PM
Typically double-posting is frowned upon at MR, though they definitely need a way to cross-post questions like this to multiple forums. I downloaded the PDF sample you posted in the other thread and looked at it. There are definitely no spaces in the OCR layer (see excerpt from decompressed PDF stream below), so it's a mystery as to why they are put in by Amazon's conversion.


Code
...
0.05 Tc 9.4807 0 0 9.1 63.27 418.57 Tm
(der )Tj
9.2469 0 0 9.1 79.35 418.57 Tm
(Ganzheitsworte )Tj
9.65 0 0 9.1 146.38 418.57 Tm
(mag )Tj
/Suspect <</Conf 0 >>BDC
9.1849 0 0 9.1 167.15 418.57 Tm
(salom )Tj
...
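
For anyone who wants to repeat this check on their own file, here is a minimal Python sketch of the idea. It assumes the page's content stream has already been decompressed and pasted in as text, and it only handles simple literal strings shown with the Tj operator (no TJ arrays, hex strings, or escaped parentheses). The names STREAM, TJ_PATTERN, shown_strings and words_with_inner_spaces are made up for illustration; this is not how Amazon's converter or Acrobat reads the file.

```python
import re

# Excerpt of the decompressed content stream quoted above.
STREAM = r"""
0.05 Tc 9.4807 0 0 9.1 63.27 418.57 Tm
(der )Tj
9.2469 0 0 9.1 79.35 418.57 Tm(Ganzheitsworte )Tj9.65 0 0 9.1 146.38 418.57 Tm
(mag )Tj
/Suspect <</Conf 0 >>BDC
9.1849 0 0 9.1 167.15 418.57 Tm
(salom )Tj
"""

# A literal string operand followed by the Tj (show text) operator.
# Simplification: strings containing parentheses or backslash escapes are skipped.
TJ_PATTERN = re.compile(r"\(([^()\\]*)\)\s*Tj")


def shown_strings(stream):
    """Return the text of every simple (...)Tj operation, in stream order."""
    return TJ_PATTERN.findall(stream)


def words_with_inner_spaces(strings):
    """Return strings that still contain a space after trimming both ends."""
    return [s for s in strings if " " in s.strip()]


strings = shown_strings(STREAM)
print(strings)
print(words_with_inner_spaces(strings))
```

An empty second list means each word is stored as one unbroken string, with only the trailing space that separates it from the next word. That matches what copying from Acrobat shows, and it suggests the extra spaces are introduced later, during conversion, rather than being present in the OCR layer.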
