Mobileread
How to unlock this PDF file?
#11  willus 06-10-2020, 11:00 PM
Just as an example of some more involved processing, I've attached a conversion produced with the command below. I ended up re-running OCR on it because the original OCR layer was poorly placed. I also included the marked-up version to show how k2pdfopt is parsing the document.

k2pdfopt -cbox1-4,6-7 1.246in,1.428in,11.62in,14.23in -cbox5 1.608in,1.372in,9.792in,16.11in -as -rt 0 -g .2 -col 2 -cgr .6 -ch 2.5 -jfc- -odpi 110 -dev k2 -ocr t -ocrlang fra parquin.pdf -sm
[pdf] parquin_k2opt.pdf (1.88 MB, 56 views)
[pdf] parquin_marked.pdf (4.86 MB, 68 views)

#12  roger64 06-11-2020, 10:43 AM
@willus

Thanks for your reply. I still have to learn how to use k2pdfopt properly and will study your example.

I shall look for a better viewer on Linux... Sumatra works well with Wine.

As far as Tesseract is concerned, I get consistently better OCR results when the file is first processed with Scan Tailor (which does not work with PDF). Tesseract is a small piece of software (about 1/30 the size of ABBYY FineReader) which needs to be complemented with pre- and post-processing to optimize its results.

pre-processing: I have noticed, for example, that straightening the pages, selecting black and white mode and darkening a little with Scan Tailor very often improves the result (of course it depends on the quality of the scan).

post-processing: many "obvious" mistakes can be corrected, for example when only one letter is missing, but Tesseract does no such post-analysis. Admittedly, automatic correction also opens the door to some false positives.
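
To illustrate the single-missing-letter case, here is a minimal Python sketch, not something Tesseract or k2pdfopt provides. The function name, the word list and the sample tokens are all hypothetical. It tries every one-letter insertion and only accepts a correction when exactly one dictionary word matches, which is one simple way to limit false positives. For French text the alphabet would need to include accented letters.

```python
import string


def restore_missing_letter(token, vocabulary, alphabet=string.ascii_lowercase):
    """Return the unique vocabulary word formed by inserting one letter
    into token; return token unchanged if there is no match or more than one."""
    if token in vocabulary:
        return token  # already a known word, nothing to fix

    candidates = set()
    for position in range(len(token) + 1):
        for letter in alphabet:
            word = token[:position] + letter + token[position:]
            if word in vocabulary:
                candidates.add(word)

    # Ambiguous or unknown tokens are left alone rather than guessed at.
    if len(candidates) == 1:
        return candidates.pop()
    return token


# Hypothetical word list and OCR output, for illustration only.
words = {"document", "lecture", "page", "pale"}
for ocr_token in ["documnt", "lectre", "pae", "page"]:
    print(ocr_token, "->", restore_missing_letter(ocr_token, words))
```

Here "pae" is left untouched because both "page" and "pale" fit, which is exactly the kind of case where a blind correction would produce a false positive.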

#13  willus 06-12-2020, 06:23 AM
Just so you know, you can do all of those pre-processing steps directly in k2pdfopt. The -cmax option adjusts contrast, the -as option auto-straightens (de-skews) the pages, the -g option adjusts the gamma value, which can be used to darken the text, and the -bpc option selects bits per color. You can set this to 2 for black and white.
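
As a rough sketch of how those options might be combined, the Python snippet below only assembles a command line and prints it; it does not run k2pdfopt. The function name, the file name scan.pdf and the -cmax value are placeholders. The -g value and the -ocr t -ocrlang pair are borrowed from the command earlier in the thread, and -bpc 2 is the value suggested above. The right settings will depend on the scan.

```python
import shlex


def build_preprocess_command(pdf_path, gamma=0.2, max_contrast=2.0,
                             bits_per_color=2, auto_straighten=True,
                             ocr_lang=None):
    """Assemble a k2pdfopt argument list from the pre-processing options
    named in the post. All default values are illustrative assumptions."""
    args = ["k2pdfopt"]
    if auto_straighten:
        args.append("-as")                      # auto-straighten / de-skew
    args += ["-cmax", str(max_contrast)]        # contrast adjustment
    args += ["-g", str(gamma)]                  # gamma, used to darken text
    args += ["-bpc", str(bits_per_color)]       # bits per color
    if ocr_lang:
        # Same OCR flags as in the command posted earlier in the thread.
        args += ["-ocr", "t", "-ocrlang", ocr_lang]
    args.append(pdf_path)
    return args


# Print the command for inspection instead of executing it.
print(shlex.join(build_preprocess_command("scan.pdf", ocr_lang="fra")))
```

Keeping the arguments in a list makes it easy to pass the result to subprocess.run once the values have been checked against the k2pdfopt usage text.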

#14  roger64 06-12-2020, 10:19 AM
That's quite impressive and useful, because many "old" PDFs need some sort of pre-processing if we expect to get a suitable result with Tesseract.

My study of k2pdfopt will probably take a bit longer, but it really seems to be worth it.


#15  willus 06-12-2020, 03:18 PM
I released a new version today. I recommend it especially for French OCR.

#16  roger64 06-13-2020, 08:10 PM
Thank you for this new version, which works nicely.

Note: I just realized that KOReader also makes use of k2pdfopt.