Mobileread
How to unlock this PDF file?
#1  Shohreh 05-28-2020, 08:46 AM
Hello,

This PDF file is built to prevent users from selecting and copying text. I can zoom in and out cleanly, so it looks like a real (vector) PDF rather than an image. All it allows is "Select All".

I tried "qpdf.exe --decrypt", to no avail. I don't know if cpdf or mutool can help.

Why is that? Is there a way to remove this restriction?

Thank you.

#2  Shohreh 05-28-2020, 09:53 AM
Turns out the PDF seems to contain images of high-definition text, which explains why it still looks OK even when zooming in.

So the only solution would be to run it through OCR, which is too much work to get a clean layout.
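
One way to confirm that the pages are images rather than real text is to count the fonts and images embedded in the file. This is only a sketch: it assumes the pdffonts and pdfimages tools from poppler-utils are installed (nobody in this thread mentions them), that both print two header lines before their listing, and pdf_summary is a hypothetical helper name.

```shell
# Hypothetical helper: count embedded fonts and images in a PDF.
# Assumes pdffonts/pdfimages (poppler-utils) print two header lines first.
pdf_summary() (
  fonts=$(pdffonts "$1" | awk 'NR > 2 { n++ } END { print n + 0 }')
  images=$(pdfimages -list "$1" | awk 'NR > 2 { n++ } END { print n + 0 }')
  echo "fonts=$fonts images=$images"
)

# Usage: pdf_summary protected.pdf
```

No fonts and roughly one large image per page would be consistent with rasterized pages, in which case OCR is the way to get text back.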

#3  willus 05-29-2020, 07:45 PM
k2pdfopt -mode copy -odpi 200 -ocr t -ocrlang fra -ocrd p protected.pdf

Result attached.
[pdf] protected_k2opt.pdf (3.51 MB, 91 views)

#4  Shohreh 06-05-2020, 12:51 AM
Thanks very much!

https://willus.com/k2pdfopt/help/options.shtml

-mode copy: source pages are simply copied to the output file, but rendered as bitmaps. No trimming or re-sizing is done.

-odpi 200: Set pixels per inch of output screen.

-ocr t: Attempt to use optical character recognition (OCR) in order to embed searchable text into the output PDF document. If followed by t or g, specifies the ocr engine to use (tesseract or gocr).

-ocrlang <set language>: Select the Tesseract OCR Engine language. […] The default language is whatever is in your Tesseract trained data folder. […] Use -ocrlang ? to see the list of Tesseract language files in your Tesseract data folder.

-ocrd p: Set OCR detection type for k2pdfopt and Tesseract. […] For -ocrd p, k2pdfopt passes the entire output page of text to Tesseract and lets Tesseract parse it for word positions.
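
The options above can be read back into the command from post #3, one per line. This is a minimal sketch, not an official wrapper: ocr_copy is a hypothetical helper name, and the flags are only the ones already quoted in this thread.

```shell
# Wrapper around the command from post #3, one option per line.
# ocr_copy is a hypothetical helper name; the flags are the ones quoted above.
ocr_copy() (
  lang="$1"
  input="$2"
  set -- -mode copy              # copy pages as bitmaps, no trimming or resizing
  set -- "$@" -odpi 200          # output resolution in pixels per inch
  set -- "$@" -ocr t             # OCR with the Tesseract engine
  set -- "$@" -ocrlang "$lang"   # Tesseract language, e.g. fra
  set -- "$@" -ocrd p            # let Tesseract parse whole pages
  k2pdfopt "$@" "$input"
)

# Usage: ocr_copy fra protected.pdf
```

Judging by the attachment in post #3, the result for protected.pdf is written as protected_k2opt.pdf.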

#5  roger64 06-06-2020, 11:56 PM
Thanks for this interesting tip.

I am an Archlinux user.

I have been using Tesseract extensively for over a year. Usually, when I have to deal with a PDF, I batch-convert it to PNG with ImageMagick, clean the pages up in Scan Tailor, and then run the OCR.

I installed k2pdfopt from the AUR by compiling it. However, something was missing, because when I tried it I got this message:

Code
[...]
k2pdfopt v2.51 (w/DjVuLibre) (c) 2020, GPLv3, http://willus.com Compiled Jun 7 2020 with Gnu C v10.1.0 for Linux on x64.
** No OCR capability in this compile of k2pdfopt! **
I have seen here in the comments that this package has some trouble in this regard (OCR). Using a Windows version would be overkill for me, so I am regretfully giving up on this attempt for now.
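
For comparison, the manual route described at the top of this post (rasterize with ImageMagick, then OCR with Tesseract) might look roughly like the sketch below. It assumes ImageMagick 6 (the convert command, which needs Ghostscript to read PDFs) and the tesseract command-line tool; the 300 dpi density, the page-NNN file naming, and both function names are illustrative choices, and the Scan Tailor cleanup step is left out.

```shell
# Sketch of the manual route: rasterize, then OCR each page image.
# Assumes ImageMagick 6 ("convert") and the tesseract CLI are installed.
# The 300 dpi density and the page-NNN naming are illustrative choices.
pdf_to_png() {
  # $1 = input PDF, $2 = output directory
  mkdir -p "$2" && convert -density 300 "$1" "$2/page-%03d.png"
}

ocr_pages() {
  # $1 = directory of PNG pages, $2 = Tesseract language code
  for png in "$1"/page-*.png; do
    [ -e "$png" ] || continue
    # "pdf" asks tesseract for a searchable PDF named <base>.pdf
    tesseract "$png" "${png%.png}" -l "$2" pdf
  done
}

# Usage: pdf_to_png book.pdf pages && ocr_pages pages fra
```

This yields one searchable PDF per page, which still has to be merged, so the single k2pdfopt command is shorter when its OCR support is available.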

#6  willus 06-07-2020, 10:13 AM
Do the Linux binaries not work on your Linux distro?

#7  roger64 06-07-2020, 10:37 AM
Hi

I'll try it. It probably will, if I manage to download it (two failed attempts so far). I see that this version is from the 5th of January 2019.

I have a more recent and improved version of Tesseract installed on my computer (neural engine). Will k2pdfopt make use of it?

Code
[roger@lenovo ~]$ tesseract -v
tesseract 4.1.1 leptonica-1.79.0 libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.0.4) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.1.0 Found AVX2 Found AVX Found FMA Found SSE Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
[roger@lenovo ~]$
EDIT: I downloaded the x64 binary. I can launch k2pdfopt and let it record the options (in green), but I fail to point it to the "protected.pdf" folder. I'll check again tomorrow.

#8  willus 06-07-2020, 03:52 PM
k2pdfopt has the Tesseract engine compiled in, so it uses the version it was built with (e.g. v4.0.0 for the latest release) rather than the one installed on your system. The only support files it needs are the Tesseract language training files.
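
Since the training files are the only external dependency, a quick check that they can be found may save some guessing. The sketch below is an assumption-laden helper, not part of k2pdfopt: find_traineddata is a hypothetical name, and it looks in both TESSDATA_PREFIX itself and a tessdata subfolder because which layout a given build expects should be verified against its documentation.

```shell
# Hypothetical helper: report where <lang>.traineddata is found relative to
# TESSDATA_PREFIX. Both layouts are checked because which one a given build
# expects is an assumption to verify against its documentation.
find_traineddata() {
  lang="$1"
  if [ -z "${TESSDATA_PREFIX:-}" ]; then
    echo "TESSDATA_PREFIX is not set" >&2
    return 1
  fi
  prefix="${TESSDATA_PREFIX%/}"
  for candidate in "$prefix/$lang.traineddata" "$prefix/tessdata/$lang.traineddata"; do
    if [ -f "$candidate" ]; then
      echo "$candidate"
      return 0
    fi
  done
  echo "no $lang.traineddata under $TESSDATA_PREFIX" >&2
  return 1
}

# Usage (example path only): TESSDATA_PREFIX=/usr/share/tessdata find_traineddata fra
```

The options page quoted in post #4 also says that -ocrlang ? lists the language files k2pdfopt can see; in a shell the question mark probably needs quoting, as in k2pdfopt -ocrlang '?'.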

#9  roger64 06-07-2020, 11:37 PM
@willus

Thanks for your explanations and patience...

So I set up TESSDATA_PREFIX in /etc/environment and resumed testing. I thought I had succeeded, but...

Please look at the attached files: do you have any idea what went wrong? In the file "exemple", you'll find a copy of the terminal commands I used to process Parquin.pdf.

I can search the text in the _k2opt file, but I do not know how to select or extract it. Is this normal?
[pdf] Parquin.pdf (1.30 MB, 73 views)
[pdf] Parquin_k2opt.pdf (13.78 MB, 62 views)
[pdf] exemple.pdf (37.4 KB, 59 views)

#10  willus 06-10-2020, 10:31 PM
You ran OCR correctly with Tesseract, but there are a couple of things. First, you don't need to do OCR: the original document already has selectable text. Second, both documents you attached allow me to select the text with my PDF viewer (Sumatra PDF running on Windows 10).

Note that there's a bug in how k2pdfopt computes the selection size of the French accented "a". This will be resolved in the next release, which I hope to get out reasonably soon.
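
Whether a PDF already carries selectable text can be checked before reaching for OCR. A minimal sketch, assuming pdftotext from poppler-utils is installed (it is not mentioned elsewhere in this thread); has_text_layer is a hypothetical helper name, and looking at only the first three pages is an arbitrary shortcut.

```shell
# Hypothetical helper: succeed if the first pages of a PDF yield any text.
# Assumes pdftotext (poppler-utils); "-" sends the text to stdout.
has_text_layer() {
  text=$(pdftotext -l 3 "$1" - 2>/dev/null | tr -d '[:space:]')
  [ -n "$text" ]
}

# Usage: has_text_layer Parquin.pdf && echo "text is already selectable"
```

If this succeeds on the original file, OCR is unnecessary, and trouble selecting text may point to the viewer rather than the file.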