Mobileread
Converting pdf to png images
#1  roger64 09-04-2019, 04:54 AM
Hi

In order to pre-process image files with scantailor, I may have to convert some source PDF to png files.

There are some online services that do this, I prefer doing it using imagemagick.

A second try on a 14-page PDF extract from a bigger book gave this:

Code
convert garnier.pdf garnier.png
convert: profile 'icc': 'RGB ': RGB color space not permitted on grayscale PNG `garnier.png' @ warning/png.c/MagickPNGWarningHandler/1748.
[roger@lenovo roger]$
It converted all the pages nearly instantly, which is pretty good, but I am not sure I understand the warning above. Does anybody know what it means?

Even when adding parameters like -quality 100 or -density 300, one such image is only 27k, while the same image processed with, say, the pdfcandy online service at medium resolution is 55k (see screenshot). Could this difference hinder the OCR process later?

The second image (001) comes from pdfcandy
garnier-0.png garnier_p001.png 

#2  Tex2002ans 09-04-2019, 09:28 PM
Quote roger64
In order to pre-process image files with scantailor, I may have to convert some source PDF to png files.

There are some online services that do this, I prefer doing it using imagemagick.
Good choice.

Quote roger64
Code
convert garnier.pdf garnier.png
convert: profile 'icc': 'RGB ': RGB color space not permitted on grayscale PNG `garnier.png' @ warning/png.c/MagickPNGWarningHandler/1748.
That warning can probably be completely ignored.

From what I could tell, what's happening is that ICC (color) metadata from the PDF is being embedded in the PNG... (see technical note below).

If you want the warning to go away, and don't care about the metadata, just add a -strip:

Code
convert -strip garnier.pdf garnier.png
You could continue to add whatever other adjustments you want:

Code
convert -density 300 -strip garnier.pdf garnier.png
You could also remove the transparency and make the background white:

Code
convert -density 300 -strip garnier.pdf -background white -alpha off garnier.png
or even use the mogrify command instead:

Code
mogrify -format png -density 300 -strip -background white -alpha off garnier.pdf
Side Note: For more info on mogrify and batch processing, see the ol' IMv6 Basic Usage (mogrify).
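One more batch-naming tip: with output-%d.png, ImageMagick numbers pages output-0.png, output-1.png, ... output-10.png, which sorts badly in file managers later. You can ask ImageMagick for zero-padded names directly (e.g. output-%03d.png), or rename already-generated files with a tiny helper like this sketch (pad_name and the output- prefix are just my example names):

```shell
# Sketch: zero-pad the page number in names like output-7.png -> output-007.png.
# Assumes the ImageMagick-style "prefix-N.png" pattern; adjust to taste.
pad_name() {
  n=${1##*-}      # strip everything up to the last "-"
  n=${n%.png}     # strip the .png extension, leaving the bare page number
  printf 'output-%03d.png\n' "$n"
}

# Usage (commented out so this stays a dry sketch):
# for f in output-*.png; do mv "$f" "$(pad_name "$f")"; done
pad_name output-7.png
```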

Quote roger64
It converted all the pages nearly instantly, which is pretty good, but I am not sure I understand the warning above. Does anybody know what it means?
Technical Note: I tested a PDF on my end, and got a similar "RGB color space not permitted" error. When I used:

Code
identify -verbose output.png
on it and compared the stripped/unstripped PNGs, these were the chunks of metadata that -strip removed:

Code
Resolution: 300x300
Print size: 8.5x11
[...]
icc:copyright: Copyright Artifex Software 2011
icc:description: Artifex Software sRGB ICC Profile
pdf:Version: PDF-1.5
[...]
png:bKGD: chunk was found (see Background color, above)
png:pHYs: x_res=300, y_res=300, units=0
png:text: 4 tEXt/zTXt/iTXt chunks were found
png:text-encoded profiles: 1 were found
png:tIME: 2019-09-04T23:45:09Z
[...]
Profiles:
  Profile-icc: 2576 bytes
[...]


I assume the few icc lines were what ImageMagick was warning about.

The PNG itself says it's grayscale, but the embedded ICC metadata within the PNG was trying to say it was some sort of sRGB.

Probably carryovers from the PDF metadata when the original person generated/scanned those in.

Quote roger64
Even when adding parameters like -quality 100 or -density 300, one such image is only 27k, while the same image processed with, say, the pdfcandy online service at medium resolution is 55k (see screenshot). Could this difference hinder the OCR process later?
... who knows what kinds of commands they run on that online service. With ImageMagick, you control the entire workflow.

And every PDF is going to be different, so you may need to do different kinds of tweaks for different things (DPI, speckling cleanup, etc.).

ImageMagick Note: PNG is lossless... so -quality on PNG only changes how much compression it's running on the file.
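ImageMagick's documentation describes PNG -quality as two digits: the tens digit is the zlib compression level (0-9) and the ones digit picks the filter strategy. A throwaway sketch (decode_png_quality is my own name) that splits a value into those parts:

```shell
# Decode an ImageMagick PNG -quality value into its two documented parts:
# tens digit = zlib compression level, ones digit = PNG filter strategy.
decode_png_quality() {
  q=$1
  echo "zlib-level=$((q / 10)) filter=$((q % 10))"
}

decode_png_quality 75
```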

JPG is lossy, so -quality is a sliding scale from 1-100 on how hideous you want the images to be.

ImageMagick's page on -quality for more info.

#3  roger64 09-05-2019, 03:02 AM
@Tex2002ans

Thank you so much for your comments, which reassure me about using ImageMagick and the PNG format for the task at hand. I shall trust ImageMagick's output when using the basic parameters above (quality, density) and leave aside all other options that could possibly lower the image quality (such as cleanup).

After all, the only goal of this stage is to get a png image that can be later processed with scantailor.

I did this kind of conversion with two different PDF scans from Gallica. The size of the output .png files varied widely, from 26k (1st book) to a whopping 1.7mb (2nd book)! As this difference can also be noticed using online conversion services, it can only be explained by the nature of the PDF.

Happily, these oversized files are only of temporary use, because the Scan Tailor process later outputs standardized and much smaller .tif images.

#4  Tex2002ans 09-05-2019, 03:26 AM
Quote roger64
After all, the only goal of this stage is to get a png image that can be later processed with scantailor.
And have you been using Scan Tailor Advanced?

https://github.com/4lex4/scantailor-advanced/releases

It includes all the enhancements from all the different Scan Tailor forks over the years:

https://github.com/4lex4/scantailor-advanced#description

Quote roger64
I did this kind of conversion with two different PDF scans from Gallica. The size of the output .png files varied widely, from 26k (1st book) to a whopping 1.7mb (2nd book)! As this difference can also be noticed using online conversion services, it can only be explained by the nature of the PDF.
Yeah, it'll be completely different depending on the PDFs: What DPI they were originally scanned at, vector/bitmapped, color, markings, etc.

Like one of the books I was working on (problem still not solved) had vertical lines slashed right through the middle (along with an incredibly low resolution scan).

Quote roger64
Happily, these oversized files are only of temporary use, because the Scan Tailor process later outputs standardized and much smaller .tif images.
You could also output as PDF->TIFF straight from ImageMagick, but the workflow you're using seems fine. I also prefer outputting to PNGs.

#5  roger64 09-05-2019, 02:08 PM
Yes, I tried Scan Tailor Advanced (but its last release is a year old now...) and I was disappointed: too complex for me, unstable... I am quite happy with the "experimental" version from the Arch repository.

After that, I get quite good results with Tesseract OCR.

Some PDFs, though, are beyond repair (but they're the exception): https://gallica.bnf.fr/ark:/12148/bpt6k662811.texteImage

#6  Tex2002ans 09-05-2019, 10:56 PM
Quote roger64
Yes, I tried Scan Tailor Advanced (but its last release is a year old now...) and I was disappointed: too complex for me, unstable.
Looks exactly the same as Scan Tailor to me, just has a few optional tabs/buttons in some steps.

But maybe the Linux version is less stable. The Windows version for me has been solid as a rock (and much faster than all the previous ones, since it's heavily multi-threaded).

Quote roger64
Some PDFs, though, are beyond repair (but they're the exception): https://gallica.bnf.fr/ark:/12148/bpt6k662811.texteImage
Yuck, looks about as bad as mine... but yours can be solved!

I was able to follow most of the steps here:

Removing noise from scanned text document

Step 1

Get the PDF into PNGs:

Code
convert -density 300 input.pdf output.png

Since that PDF is awful, and has enormous whitespace around it, I would suggest trimming:

Code
convert -density 300 input.pdf -trim output.png
that would focus more on the text itself.


Alternate #1: You could also use the magick.exe command:

Code
magick.exe -density 300 input.pdf -trim .\output-%d.png
ImageMagick ran out of memory on my end, so if you want to convert the PDF in pieces, you can adjust the [0-30] to fit whatever page numbers you want to export:

Code
magick.exe -density 300 input.pdf[0-30] -trim .\output-%d.png
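If memory is the bottleneck, those [0-30] selectors can also be generated in a loop. A sketch (chunk_args is my own helper; input.pdf and the 30-page chunk size are assumptions) that prints one page-range selector per chunk, which you can then feed to convert/magick one chunk at a time:

```shell
# Print ImageMagick page-range selectors (input.pdf[start-end]) in fixed-size
# chunks, so a huge PDF can be converted piecewise instead of all at once.
chunk_args() {
  total=$1; chunk=$2; start=0
  while [ "$start" -lt "$total" ]; do
    end=$((start + chunk - 1))
    [ "$end" -ge "$total" ] && end=$((total - 1))
    echo "input.pdf[${start}-${end}]"
    start=$((end + 1))
  done
}

# Usage sketch (requires ImageMagick); give each chunk its own output prefix
# so the %d numbering doesn't collide:
# i=0
# chunk_args 70 30 | while read -r sel; do
#   convert -density 300 "$sel" -trim "chunk${i}-%d.png"; i=$((i + 1))
# done
chunk_args 70 30
```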
Side Note: Not sure why it's only cropping vertically, there's probably another method to crop the left/right whitespace too. It would probably speed up the later steps too.

Step 2

Now, I followed much of that forum post above.

Code
convert output.png -connected-components 4 -threshold 0 -negate output-negate.png
Step 3

It seems like area-threshold looks for "chunks of pixels that are X pixels or less".

I tested with area-threshold=30:

Code
convert output.png -define connected-components:area-threshold=30 -connected-components 4 -threshold 0 -negate output-cc30.png
but I found that this PDF needed more. So I adjusted by 10s all the way up to 80:

Code
convert output.png -define connected-components:area-threshold=80 -connected-components 4 -threshold 0 -negate output-cc80.png
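"Adjusting by 10s" is easy to script, too. A dry-run sketch (it only echoes the commands, so it works even without ImageMagick installed; drop the echo inside sweep to actually run them):

```shell
# Print one connected-components cleanup command per candidate area-threshold,
# so the results (output-cc30.png ... output-cc80.png) can be compared side by side.
sweep() {
  for t in 30 40 50 60 70 80; do
    echo "convert output.png -define connected-components:area-threshold=$t" \
         "-connected-components 4 -threshold 0 -negate output-cc${t}.png"
  done
}

sweep
```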

Step 4

Then I was able to take the images from Step 1 + Step 3 and create a diff:

Code
convert output.png output-cc80.png -compose minus -composite output-diff.png

Step 5

Then use the images from Step 1 + Step 4 to remove the noise:

Code
convert output.png \( -clone 0 -fill white -colorize 100% \) output-diff.png -compose over -composite output-diff-composite.png
(In a Linux shell the parentheses have to be backslash-escaped as above; on Windows cmd they can be left bare.)
Here's the Original (Step 1) + Diff (Step 4) + Cleaned (Step 5):

[attachments: original, diff, and cleaned page]

Finalized

Here's a few more before/after pages out of the book:

[attachments: two before/after page pairs]

I attached a ZIP with Windows .bat files that batch convert the images using these steps. It's a giant mess, and it does create a lot of blank/duplicate images, but it chugs through everything eventually. I already spent hours writing this tutorial up, and don't feel like debugging the rest.
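For the Linux side, Steps 3-5 can be sketched as one POSIX-shell function (this is my own sketch, not the attached .bat files). It's a dry run by default, echoing the commands, since ImageMagick and the page images may not be on hand; set DRY_RUN=0 to actually execute them. The run/clean_page names and page.png are placeholders:

```shell
# Dry-run wrapper: echo the command unless DRY_RUN=0.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "$@"; else "$@"; fi; }

# Chain Steps 3-5 above for a single page image (page.png -> page-clean.png).
clean_page() {
  p=${1%.png}
  # Step 3: isolate speckle blobs of 80 px or less
  run convert "$1" -define connected-components:area-threshold=80 \
      -connected-components 4 -threshold 0 -negate "${p}-cc80.png"
  # Step 4: diff the blob mask against the original
  run convert "$1" "${p}-cc80.png" -compose minus -composite "${p}-diff.png"
  # Step 5: paint the diff back over a white canvas
  run convert "$1" \( -clone 0 -fill white -colorize 100% \) "${p}-diff.png" \
      -compose over -composite "${p}-clean.png"
}

clean_page page.png
```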

But hopefully that'll get you much cleaner input into Scan Tailor + better OCR.
[zip] Tex.BAT.ImageMagick.Cleanup.BadSpeckleLines.zip (1.3 KB)

#7  Tex2002ans 09-06-2019, 02:02 AM
Quote Tex2002ans
Code
magick.exe -density 300 input.pdf[0-30] -trim .\output-%d.png
Side Note: Not sure why it's only cropping vertically, there's probably another method to crop the left/right whitespace too. It would probably speed up the later steps too.
Actually, I just figured it out. For this specific set of images, if you add another -trim, it cuts the left/right as well:

Code
magick.exe -density 300 input.pdf[0-30] -trim -trim .\output-%d.png
And with the double-trimmed images, about 30 pages turned completely black in Step 2, so to get around that, I added a white border. This Step 1 is much better:

Code
magick.exe -density 300 input.pdf[0-30] -trim -bordercolor white -border 40x40 .\output-%d.png

#8  roger64 09-06-2019, 10:49 AM


Congratulations! That's an impressive demo, worth inserting into an ImageMagick manual. I had previously failed with a ten-page extract of this, shall we say, ...curious PDF.

I'll have to change my words: it can be done. I'm still not keen to convert the whole book (my computer has 8 GB of RAM).

#9  Tex2002ans 09-06-2019, 06:58 PM
Quote roger64
Congratulations! That's an impressive demo, worth inserting into an ImageMagick manual. I had previously failed with a ten-page extract of this, shall we say, ...curious PDF.
I'll PM you with my OCR output: still hideous... but better than what's currently there.

It's just a bad and low quality scan in the first place...

Quote roger64
I'll have to change my words: it can be done. I'm still not keen to convert the whole book (my computer has 8 GB of RAM)
No wonder Scan Tailor crashes on you, some of this image manipulation takes up tons of GBs of RAM. :P
