Mobileread
Batch DRM/Password detection
#1  rfog 07-12-2020, 09:38 AM
All!

Over time I have purchased a lot of PDFs, and I'd like to know if there is a way to check whether they have DRM, are password protected, or carry copy/print/other restrictions.

I know I can check them one PDF at a time, but when the number is about two thousand...

It doesn't matter whether the solution has to run on macOS, Windows or Linux.

Thanks in advance.

(And no, I'm not asking for a way to *remove* DRM; I just want to collect my DRM-protected PDFs.)

#2  JSWolf 07-12-2020, 12:11 PM
Sorry, there is no way to check for DRM on PDF in batches. You have to do it one by one.

#3  j.p.s 07-12-2020, 01:49 PM
Quote JSWolf
Sorry, there is no way to check for DRM on PDF in batches. You have to do it one by one.
Of course there is a way.

In a bash terminal in linux with pdftk installed:
Code
for file in *.pdf; do
    # Discard the normal output; append any error messages to the list.
    pdftk "$file" dump_data > /dev/null 2>> encrypted_list.txt
done
The file encrypted_list.txt will contain a list of encrypted files (and any other errors that turn up).

#4  Doitsu 07-12-2020, 01:55 PM
It's also relatively easy to check for password-protected files with the PyPDF2 Python library:

1. Install Python 3.x and the PyPDF2 library.
2. Save the following lines as a text file with a *.py extension.
(Make sure to copy it verbatim; in Python, indentations matter. Missing/extra spaces will cause the script to fail.)

Code
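#!/usr/bin/env python
import sys, os, glob
# Written for the older PyPDF2 API; PyPDF2 3.x renamed these to PdfReader and is_encrypted.
from PyPDF2 import PdfFileReader

def main():
    # Look for PDFs in the script's own folder and all of its subfolders.
    current_dir = os.path.dirname(os.path.abspath(__file__))
    pdf_files = glob.glob(os.path.join(current_dir, '**', '*.pdf'), recursive=True)
    for pdf_file in pdf_files:
        # Skip files that an earlier run has already renamed.
        if pdf_file.endswith('.encrypted.pdf'):
            continue
        with open(pdf_file, 'rb') as fh:
            reader = PdfFileReader(fh)
            encrypted = bool(reader.isEncrypted)
        # Rename only after the file is closed (Windows cannot rename open files).
        if encrypted:
            os.rename(pdf_file, pdf_file + '.encrypted.pdf')

if __name__ == "__main__":
    sys.exit(main())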
3. Copy the *.py file to a folder with *.pdf files in it and double-click it.

If the script worked, all password-protected files should have an *.encrypted.pdf extension. If it didn't, open a command prompt/terminal window, execute the file and post the error messages.

#5  j.p.s 07-12-2020, 02:03 PM
^ If renaming is acceptable, that is an elegant solution.
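
If renaming is not acceptable, a report-only variant is possible. The following is a minimal sketch, not the script from the post above: it uses only the Python standard library and relies on the assumption that an encrypted PDF carries an /Encrypt entry in its trailer, so it simply searches the raw bytes for that marker. This is a heuristic and can occasionally be wrong (for instance if the string happens to appear elsewhere in a file). The function names has_encrypt_marker and list_encrypted are made up for this example.

```python
from pathlib import Path


def has_encrypt_marker(path):
    """Heuristic: True if the file's raw bytes contain b'/Encrypt'."""
    data = Path(path).read_bytes()
    return b"/Encrypt" in data


def list_encrypted(folder):
    """Return the sorted paths of *.pdf files under folder (recursively)
    that contain the marker. Nothing is renamed or modified."""
    return sorted(
        str(p) for p in Path(folder).rglob("*.pdf") if has_encrypt_marker(p)
    )
```

Calling list_encrypted(".") from the folder that holds the PDFs returns a list of suspect files, which can then be printed or written to a text file and double-checked with one of the tools above.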

#6  rfog 07-12-2020, 04:33 PM
Wow!

Thanks a lot! I will test all of this tomorrow.

#7  j.p.s 07-16-2020, 02:48 PM
I did a bit of looking around and found a couple more ways to do it.

1. qpdf gives a bit cleaner results.
Code
for f in *.pdf; do qpdf --show-encryption "$f" > /dev/null; done
2. For those like me that find perl easier to read and write than python
Code
#!/usr/bin/perl
use PDF::API2;
while (glob "*.pdf") {
    $pdf = PDF::API2->open($_);
    print "$_ is encrypted.\n" if $pdf->isEncrypted();
}
PDF::API2 was not included by default on any of my systems, but neither was PyPDF2, even on a very large Anaconda install of Python at work.
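
For anyone who would rather drive qpdf from Python, here is a rough sketch. It assumes a qpdf version that provides the --is-encrypted option, which is documented to exit with status 0 for an encrypted file and 2 for an unencrypted one; check qpdf --help on your system before relying on it. The function name qpdf_is_encrypted and its run parameter (there only so the call can be swapped out for testing) are made up for this example.

```python
import subprocess


def qpdf_is_encrypted(path, run=subprocess.run):
    """Ask qpdf whether a PDF is encrypted.

    Returns True (exit status 0), False (exit status 2), or None for any
    other status, e.g. a damaged file. Assumes qpdf is on the PATH and
    supports --is-encrypted.
    """
    result = run(["qpdf", "--is-encrypted", path])
    if result.returncode == 0:
        return True
    if result.returncode == 2:
        return False
    return None
```

Looping over glob.glob("*.pdf") and printing the names for which this returns True gives the same kind of list as the shell loop, without having to read qpdf's messages.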

#8  rfog 07-17-2020, 02:45 PM
Wow!!!

I thought it was more complex to do.

Thanks a lot to all.

Now comes the second part: is there any way to check whether those PDFs with DRM contain real text? I've sometimes found that copying and pasting a passage for a citation produced garbage or nonsense text, and I had to type the text in manually.

Is there any automated way to detect those PDFs?

#9  willus 07-18-2020, 11:57 AM
You can do something like this for batch text extraction:

k2pdfopt -ocrout %s_text.txt -o dummy.pdf "*.pdf" -mode copy -n -dpi 100

For every file, e.g. myfile.pdf, this will create myfile_text.txt which will have the extracted text layer.

#10  rfog 07-19-2020, 04:42 AM
Quote willus
You can do something like this for batch text extraction:

k2pdfopt -ocrout %s_text.txt -o dummy.pdf "*.pdf" -mode copy -n -dpi 100

For every file, e.g. myfile.pdf, this will create myfile_text.txt which will have the extracted text layer.
Ho Ho.

Impressive. It's even faster if I add -p 10-20 (for example) to extract the text of only some pages and see whether they contain text or garbage.
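
The "text or garbage" check on the extracted files could itself be automated. The sketch below is only a heuristic under stated assumptions: it treats a token as word-like if, after stripping surrounding punctuation, it is purely alphabetic and contains an ASCII vowel, so it is tuned for English-like text and will misjudge other languages. The names wordlike_ratio and looks_like_real_text, and the 0.6 threshold, are arbitrary choices for this example.

```python
import string

VOWELS = set("aeiouyAEIOUY")


def wordlike_ratio(text):
    """Fraction of whitespace-separated tokens that look like plain words."""
    tokens = text.split()
    if not tokens:
        return 0.0
    good = 0
    for tok in tokens:
        core = tok.strip(string.punctuation)
        # Word-like: letters only, with at least one (ASCII) vowel.
        if core.isalpha() and any(ch in VOWELS for ch in core):
            good += 1
    return good / len(tokens)


def looks_like_real_text(text, threshold=0.6):
    """True if enough tokens are word-like; empty text counts as not real."""
    return wordlike_ratio(text) >= threshold
```

Reading each *_text.txt file and passing its contents to looks_like_real_text would flag the PDFs whose text layer is empty or mostly symbols; the threshold will likely need tuning on a few known-good and known-bad files.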



So many tools, and so little time...
