remove OCR from a PDF?
#1  soondai 11-14-2010, 01:05 AM
Is there any tool for removing the OCR element from PDFs?

I have a few scanned books with it, and while it's great for reading on the PC, these files tend to be very large and often cannot be cropped to fit an e-reader.

#2  frabjous 11-14-2010, 10:07 PM
The question is a little difficult to understand.

Do you mean that it's a "Searchable Image" PDF with both a text layer and an image layer, where the text layer was generated by OCRing the image layer, and that you want to remove the text layer?

I'm sure there are ways of doing that (e.g., converting the PDF to some image format and then converting it back), but I really can't see the point. The image layer is almost certainly responsible for nearly all of the file size, and I don't see how the text layer could interfere with cropping either.

Or did you want to remove the image layer instead? That would only be worth it if the OCR was near perfect, or you were planning on cleaning it up manually, which is a huge time commitment.
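
For anyone who does want to drop one layer or the other, here is a minimal sketch rather than a tested recipe. It assumes a reasonably recent Ghostscript is installed as `gs`; its pdfwrite device accepts `-dFILTERTEXT` and `-dFILTERIMAGE`, which skip text or bitmap images while rewriting the file. The function names and file names are made up for illustration. Ghostscript rewrites the whole PDF, so check the output before deleting the original.

```python
import subprocess


def build_gs_filter_command(src, dst, drop="text"):
    """Build a Ghostscript command that rewrites src to dst without one layer.

    drop="text"  -> leave out the (OCR) text layer, keep the page images
    drop="image" -> leave out the bitmap images, keep the text layer
    """
    flags = {"text": "-dFILTERTEXT", "image": "-dFILTERIMAGE"}
    if drop not in flags:
        raise ValueError("drop must be 'text' or 'image'")
    # -o sets the output file and implies batch mode (no interactive prompt).
    return ["gs", "-o", dst, "-sDEVICE=pdfwrite", flags[drop], src]


def strip_layer(src, dst, drop="text"):
    """Run Ghostscript; raises CalledProcessError if gs reports failure."""
    subprocess.run(build_gs_filter_command(src, dst, drop), check=True)
```

Something like `strip_layer("scan.pdf", "scan-notext.pdf", drop="text")` would then write a copy without the text layer. As noted above, do not expect that to shrink the file much.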

#3  soondai 11-14-2010, 11:23 PM
I assume it's the text layer keeping soPDF from working with it.

I probably just need to hold off reading my PDF books until I have a better machine for it.

#4  frabjous 11-15-2010, 12:35 AM
As far as I know, SoPDF cannot crop scanned margins at all, text layer or no text layer. In general, it's not an ideal tool for scanned PDFs.

I'd try BRISS instead.

#5  soondai 11-15-2010, 01:49 AM
Should have known.
I was trying soPDF because I wanted to rotate it as well.

Thanks for the tip.

#6  frabjous 11-15-2010, 11:38 AM
You could use BRISS to crop it, and then run it through SoPDF afterward to rotate it. Good luck.

#7  NatalieLyda 12-07-2010, 04:51 PM
I don't know if this is exactly what you're looking for, but I often use online OCR conversion software to convert my PDF documents to MS Word or other text-style documents. My favorite converter is by Ricoh Innovations. You can try it, for free, at:

#8  alfred_doeblin 10-08-2011, 05:45 AM

I'd like to accomplish just the opposite of what soondai wants: to get rid of the image and retain only the plain text. Is that possible with some tool? And if not, does anybody know the structural details of the pages in scanned PDFs? I think it would be possible to write a small app using itextpdf.

With kind regards

Alfred D.

#9  DSpider 10-08-2011, 12:24 PM
Should've made a new topic instead of bumping a 2010 thread but whatever, I'll try to answer.

Editing PDFs is never a good idea. Best would be to go back to the original format, make the changes and export as a fresh PDF. Sure, Adobe Acrobat, Foxit Phantom (and similar) can edit PDFs if you wish to get rid of the images. Or you could just copy-paste the text (right click - "Copy Text to Clipboard" or something like that) into a Word/LibreOffice document.

For extracting text from images or protected PDFs you can use ABBYY FineReader 11. It will load the PDF as a bunch of JPG images and OCR it. For best results you'll have to proofread it, since it's not 100% accurate. There's also the issue with fonts... You can either match them with something similar or extract them from the PDF with FontForge or a similar tool.

Regarding the "structural details" of PDFs... There are two types of PDF files: plain PDF and tagged PDF. You'll find that the plain format is used in over 90% of PDFs. This is a real PITA to convert, since the content (text, images) is just a set of floating objects on a blank piece of paper. You can usually spot these right away if you highlight the text and it selects as separate letters/numbers (or groups of them). Tagged PDFs, on the other hand, use formatting tags - meaning they're usually more accurate to convert because the text is on a single line instead of each individual glyph (or group of glyphs) having its own "position" (coordinates) on the page.
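
To make the "floating objects" point concrete, here is a rough sketch of what a converter has to do with an untagged page: take text fragments that only carry coordinates and guess which ones belong on the same line. The `(x, y, text)` tuples, the `tolerance` value and the function name are assumptions for illustration only, not the output format of any particular library. PDF page coordinates normally start at the bottom-left corner, so a larger `y` means higher on the page.

```python
def group_into_lines(fragments, tolerance=2.0):
    """Group (x, y, text) fragments into lines of text, top of page first.

    Fragments whose baselines differ by at most `tolerance` points from the
    first fragment of a line are treated as part of that line.
    """
    lines = []  # each entry: [baseline_y, [(x, text), ...]]
    # Walk the fragments from the top of the page (largest y) downward.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # Within each line, order the fragments left to right and join them.
    return [" ".join(t for _, t in sorted(items)) for _, items in lines]


fragments = [
    (110, 700.5, "world"),
    (72, 686, "second"),
    (72, 700, "Hello"),
    (115, 686, "line"),
]
print(group_into_lines(fragments))  # ['Hello world', 'second line']
```

Real pages are messier (columns, footnotes, rotated text, hyphenation), which is why conversions of this kind of PDF usually need proofreading.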

#10  frabjous 10-08-2011, 12:42 PM
For something free and open source, you could try PDFreflow, which uses Poppler's pdftohtml as a backend. (Poppler also includes a pdftotext tool.)
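
If plain text is all that's needed, pdftotext can be scripted. The following is a minimal sketch, assuming Poppler's `pdftotext` is on the PATH; the function names and the file name are made up for illustration. Passing `-` as the output file sends the text to standard output, and `-layout` asks pdftotext to keep the original physical layout as best it can.

```python
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=False):
    """Build a pdftotext command that writes the extracted text to stdout."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # try to preserve the page's physical layout
    cmd += [pdf_path, "-"]     # "-" as the output file means stdout
    return cmd


def extract_text(pdf_path, keep_layout=False):
    """Return the text layer of pdf_path as a string."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        check=True,            # raise if pdftotext exits with an error
        capture_output=True,
        encoding="utf-8",      # pdftotext emits UTF-8 by default
    )
    return result.stdout
```

For a scanned book, `extract_text("book.pdf")` only returns whatever the OCR text layer contains, so the quality of the result depends entirely on the quality of the OCR.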

