Mobileread
Cropping PDFs for EPUB conversion using BRISS, Ghostscript and/or Calibre
#1  fredthefork 08-08-2019, 03:30 AM
Hello! I'm new to this so please forgive me if this is basic knowledge.

I have a PDF file which is OCRed. I would like to convert it to EPUB. The main problem is that I'd like to crop my PDF so I do not have duplicate headers or page numbers in my EPUB. I first tried OS X's Preview, then Briss. I then ran the result through calibre's EPUB conversion. That didn't work. I then used Ghostscript to extract the text:
Code
gs -sDEVICE=txtwrite -o extractedText%d.txt input.pdf
But this doesn't work either; I'm still getting all the headers. Although the PDF is clearly cropped, the cropped content does not seem to have been deleted permanently.

Then I read on here that

If you run the Briss PDF output through Ghostscript to generate a new PDF, I believe it will permanently get rid of the cropped-out material so that it won't come back in calibre.

This user suggested this command:
Code
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf
Although it does produce a PDF, running that through my first Ghostscript command or through the standard calibre conversion is to no avail: I still get the headers and page numbers. I've also tried different PDFs, just to be sure.

What am I missing here? This can't be so difficult, can it?

#2  asleyam 08-09-2019, 12:36 AM
Have you tried ScanTailor? It is free and open source. I have a Mac, so I use ScanTailor via CrossOver. If you have MacPorts installed, though, ScanTailor is easy to install. Unfortunately Homebrew does not have a cask for it yet.

http://scantailor.org/

It is designed as a preprocessing tool, so it works on batches of scanned images. If you already have a PDF, simply export the pages as images and load them into ScanTailor. Then use the various settings to crop the headers and page numbers, deskew, set margins, etc. It will output in TIFF format.

I have not found an easy one-click method to batch crop extraneous material out of scanned images.

#3  dwig 08-09-2019, 01:04 PM
Quote fredthefork
Hello! I'm new to this so please forgive me if this is basic knowledge.

I have a PDF file which is OCRed. I would like to convert it to epub. ...
What am I missing here? This can't be so difficult, can it?
Yes.

One, "cropping" tools like Briss don't delete anything. They just set a new page size for viewing. The old data is still there; it's just off the page and out of view.
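A quick way to check this for yourself, as a rough sketch: it assumes poppler's pdfinfo is installed and that `pdfinfo -box` prints a `MediaBox:` line followed by a `CropBox:` line with four numbers each. The `report_crop` helper is a made-up name for illustration, not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: reads `pdfinfo -box` output on stdin and reports
# whether any CropBox differs from the MediaBox printed just before it.
# Assumption: each box line ends with four numbers (x1 y1 x2 y2).
report_crop() {
    awk '
        /MediaBox:/ { media = $(NF-3) " " $(NF-2) " " $(NF-1) " " $NF }
        /CropBox:/  { crop = $(NF-3) " " $(NF-2) " " $(NF-1) " " $NF
                      if (crop != media) differ = 1 }
        END { print (differ ? "CropBox differs from MediaBox" : "boxes match") }
    '
}

# Usage (needs poppler-utils; -f/-l select the page range to inspect):
#   pdfinfo -box -f 1 -l 3 cropped.pdf | report_crop
```

If the boxes differ, the file has only been cropped for viewing: the full page, headers included, is still described by the MediaBox.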

Two, the PDF was OCRed before it was cropped. The headers and similar "junk" are still in the text layer from the OCR process and still "visible" to the format converter, so they end up in the ePub.
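Since the text layer is still complete, one possible workaround is to extract text from only the body region of each page. This is a sketch, not something tested on your file: it assumes poppler's pdftotext, whose -x, -y, -W and -H options limit extraction to a rectangle measured from the top-left corner. The function name and the margin values are hypothetical, and the arithmetic only handles whole numbers of points.

```shell
#!/bin/sh
# Hypothetical helper: print pdftotext options that keep the page body and
# skip a header band at the top and a footer band at the bottom.
# Arguments: page width, page height, header height, footer height,
# all as integer points (shell arithmetic cannot handle 595.28 etc.).
body_region_args() {
    page_w=$1; page_h=$2; header=$3; footer=$4
    body_h=$((page_h - header - footer))
    if [ "$body_h" -le 0 ]; then
        echo "header and footer leave no body area" >&2
        return 1
    fi
    # At -r 72 one pixel equals one PDF point.
    echo "-r 72 -x 0 -y $header -W $page_w -H $body_h"
}

# Usage with made-up margins on a 612 x 792 pt (US Letter) page:
#   pdftotext $(body_region_args 612 792 60 50) input.pdf body.txt
```

The margins would need to be measured on the actual pages, and this only helps if headers and page numbers sit in the same place on every page.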

You might be more successful if you "crop" the PDF first and then do the OCR. This might prevent the OCR process from "seeing" the parts that were trimmed.
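A rough sketch of that crop-then-OCR order, under some assumptions: Ghostscript's -dUseCropBox is used so that only the cropped area gets rendered to images, and Tesseract is used for the OCR step (any OCR tool would do; Tesseract is just my pick for the example). The function names, the 300 dpi setting and the file names are all hypothetical.

```shell
#!/bin/sh
# Sketch: render only the visible (cropped) area of each page, then OCR
# the resulting images. Set RUN=echo to print the commands instead of
# running them.

rasterize_cropped() {
    # $1 = cropped PDF (e.g. Briss output), $2 = existing output directory
    ${RUN:-} gs -q -dNOPAUSE -dBATCH -sDEVICE=png16m -r300 -dUseCropBox \
        -sOutputFile="$2/page-%03d.png" "$1"
}

ocr_pages() {
    # $1 = directory of page images; tesseract writes page-NNN.pdf
    # (a searchable PDF) next to each image.
    for img in "$1"/page-*.png; do
        [ -e "$img" ] || continue
        ${RUN:-} tesseract "$img" "${img%.png}" pdf
    done
}

# Usage:
#   mkdir -p pages
#   rasterize_cropped cropped.pdf pages
#   ocr_pages pages
```

Because the images never contain the trimmed headers, the new text layer cannot contain them either. The per-page PDFs would still need to be merged before conversion.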
