Mobileread
Splice PDF: A Script to improve readability by separating images from text
#1  MarjaE 09-03-2020, 06:34 PM
I've written a script to help with my pdf issues. It's written for the bash shell in the macOS Automator, so it may require tweaks for other software.

The idea is to split each pdf into 3 parts and then splice them back together: the cover, which I've rasterized; the images from each page, again rasterized; and the text from each page, blackened and inserted after the images. This makes it easier for me to read the text, and makes it easier for the Kindle to handle the images regardless of how they've been constructed. It breaks tables of contents.

P.S. This does not work with scanned pdfs. I'd suggest using k2pdfopt -mode copy for that.

I've also written a variant with -dev dx after each k2pdfopt -mode copy, and with different output file names, for a grayscale output optimized for the Kindle Dx.

By default K2 increases contrast, so if you prefer not to, that's another tweak.

It requires Ghostscript, Cpdf, K2pdfopt, and Qpdf. Cpdf should be free for non-commercial use, but I'd still prefer an open source alternative to it, and it's no longer available via Homebrew.

I've installed k2pdfopt to ~/Applications and I've installed the others using Homebrew.

Each app seems to have slightly inconsistent standards for standard output and standard input. In the end, I instructed each one to export a set filename to a "Splice" folder, or import a set filename from there. I've been able to run the whole sequence that way, first splitting, then processing, and then splicing the pdf back together.

I haven't replaced all the older code where it uses backticks (`) instead of $(); maybe eventually.
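
For comparison, here is a small sketch of the two command-substitution styles side by side. The path is a made-up example, not one from the script. Both forms should give the same result here; $() is usually preferred because it nests and quotes more predictably.

```shell
#!/bin/bash
# Hypothetical input path, used only for illustration.
f="/Users/Marja/Books/Example Book.pdf"

# Older backtick form, as still used in parts of the script.
old_style=`basename "$f" .pdf`

# Equivalent $() form.
new_style=$(basename "$f" .pdf)

echo "$old_style"
echo "$new_style"
```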

for f in "$@"
do
# Copy and Rasterize 1st page from source pdf using k2pdfopt
~/Applications/k2pdfopt -ui -mode copy -p 1 -x -o "/Users/Marja/Splice/RGBCover_copy.pdf" "$f" $@
# Copy text from source pdf file using Ghostscript, turn text black using Cpdf
# The color conversion strategy should help with the 2nd stage if I switch to Ghostscript
# - and -_ indicate standard output and input
# Due to compatibility issues, dumping to ~/Splice/Text.pdf
/usr/local/bin/gs -sDEVICE=pdfwrite -dFILTERIMAGE -dFILTERVECTOR -dCompatibilityLevel=1.4 -sColorConversionStrategy=RGB -sstdout=%sstderr -dNOPAUSE -dQUIET -dBATCH -sOutputFile="/Users/Marja/Splice/Text.pdf" "$f" &&
/usr/local/bin/cpdf "/Users/Marja/Splice/Text.pdf" -blacktext -o "/Users/Marja/Splice/Blacktext.pdf"
# Copy images from same source pdf file using Ghostscript, rasterize images using K2pdfopt
# Due to compatibility issues, dumping to ~/Splice/Images.pdf
/usr/local/bin/gs -sDEVICE=pdfimage24 -dFILTERTEXT -dCompatibilityLevel=1.4 \
-g800x1080 -r150 -dPDFFitPage \
-sstdout=%sstderr -dNOPAUSE -dQUIET -dBATCH -sOutputFile="/Users/Marja/Splice/Images.pdf" "$f" &&
~/Applications/k2pdfopt -ui -mode copy -x -o "/Users/Marja/Splice/RGBImages_copy.pdf" "/Users/Marja/Splice/Images.pdf" $@ &&
# Splice files using qpdf
suffix="-SplicedColor.pdf"
base=`basename "$f" .pdf`
outputfile=$base$suffix
/usr/local/bin/qpdf --collate "/Users/Marja/Splice/RGBCover_copy.pdf" --pages "/Users/Marja/Splice/RGBCover_copy.pdf" "/Users/Marja/Splice/RGBImages_copy.pdf" "/Users/Marja/Splice/Blacktext.pdf" -- "$outputfile"
done

#2  MarjaE 09-03-2020, 06:37 PM
If anyone with more programming experience wants to rework this, feel free. A platform-independent and cpdf-independent version would be useful.

#3  j.p.s 09-03-2020, 07:24 PM
Quote MarjaE
If anyone with more programming experience wants to rework this, feel free. A platform-independent and cpdf-independent version would be useful.
As an initial reaction, I suggest adding the following to the top of the script:
Code
export K2PDFOPT_HOME=~/Applications
export OUTDIR=/Users/Marja/Splice
and replace all instances of "~/Applications" with "$K2PDFOPT_HOME"
and of /Users/Marja/Splice with "$OUTDIR"
(or any names you prefer). That way others (and you) only have to edit a couple of lines at the top to make a change in location of k2pdfopt and output directory.
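
A minimal sketch of how that could look, assuming the variable names suggested above and falling back to $HOME-based defaults (the defaults and the cover_cmd helper are my own illustration, not part of the original script). It only prints the argument list, one argument per line, so it can be tried without k2pdfopt installed.

```shell
#!/bin/bash
# Locations are set once at the top; edit these two lines for your machine.
export K2PDFOPT_HOME="${K2PDFOPT_HOME:-$HOME/Applications}"
export OUTDIR="${OUTDIR:-$HOME/Splice}"

# Hypothetical helper: print the cover-extraction command for one input file
# instead of running it (a dry run), one argument per line.
cover_cmd() {
  printf '%s\n' "$K2PDFOPT_HOME/k2pdfopt" -ui -mode copy -p 1 -x \
    -o "$OUTDIR/RGBCover_copy.pdf" "$1"
}

cover_cmd "book.pdf"
```

To run the step for real, the printf line would be replaced by the command itself, with the same quoted variables.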

#4  MarjaE 09-03-2020, 07:38 PM
Thank you!

P.S. I'm having some trouble with the broken tables of contents and with broken scaling.

A quick test shows that k2pdfopt -mode copy -n -toc- can cut the table of contents, but not correct the scaling. A Quartz filter can cut and correct, but it's platform-specific and doubles up the text. mutool clean can't cut or correct these. Printing to a new pdf should have much the same effect as running through a Quartz filter.

P.P.S. Also, running mutool clean -d -s -z at the end of the process scrambles some text by writing one line over another, but -g -g -g doesn't seem to cause trouble. Known bug with -s: https://bugs.ghostscript.com/show_bug.cgi?id=702715

P.P.P.S. Removing text from the image pages is hit-and-miss. I suspect k2 is starting before gs has finished. So I am looking at restructuring the script to (a) run a Quartz filter at the beginning, even if it's Mac-specific, (b) then run the Ghostscript stages, (c) then cpdf and k2, and (d) finally run qpdf.

#5  MarjaE 09-05-2020, 12:57 AM
A Mac-specific implementation, optimizing for the Kindle Dx. It works in Mojave. I'm not sure if it will work in Catalina due to Apple's ongoing cuts to Automator:

1. Install BenWiggy's PDFsuite, pypy, pyobjc for python 2, ghostscript, k2pdfopt, cpdf, and qpdf.

2. Open Automator and create a new App.

3. Add a Run Shell Script action 7 times, each using Bash and passing input as arguments. By splitting this into 7 shell scripts, we can help make sure the Mac finishes each step before starting the next. You'll need to substitute your preferred location for your K2pdfopt app, for some other apps, and for your Splice folder. I don't think the export code above will be suitable with so many short scripts.

for f in "$@"
do
# Strip any table of contents and fit text to page sizes to avoid any scaling issues
/usr/local/bin/python /Users/Marja/Library/Services/quartzfilter.py "$f" "/System/Library/Filters/Lightness Increase.qfilter" "/Users/Marja/Splice/Light.pdf"
done

for f in "$@"
do
# Copy and Rasterize 1st page from source pdf using k2pdfopt
~/Applications/k2pdfopt -ui -mode copy -dev dx -p 1 -x -o "/Users/Marja/Splice/DxCover_dx.pdf" "/Users/Marja/Splice/Light.pdf" $@
done

for f in "$@"
do
# Copy images from same source pdf file using Ghostscript, rasterize images using K2pdfopt
# - and -_ indicate standard output and input
# Due to compatibility issues, dumping to ~/Splice/Images.pdf
/usr/local/bin/gs -sDEVICE=pdfimage24 -dFILTERTEXT -dCompatibilityLevel=1.4 \
-g800x1080 -r150 -dPDFFitPage \
-sstdout=%sstderr -dNOPAUSE -dQUIET -dBATCH -sOutputFile="/Users/Marja/Splice/Images.pdf" "/Users/Marja/Splice/Light.pdf"
done

for f in "$@"
do
# Copy text from source pdf file using Ghostscript, turn text black using Cpdf
# The color conversion strategy should help with the 2nd stage if I switch to Ghostscript
# Due to compatibility issues, dumping to ~/Splice/Text.pdf
/usr/local/bin/gs -sDEVICE=pdfwrite -dFILTERIMAGE -dFILTERVECTOR -dCompatibilityLevel=1.4 -sColorConversionStrategy=RGB -sstdout=%sstderr -dNOPAUSE -dQUIET -dBATCH -sOutputFile="/Users/Marja/Splice/Text.pdf" "/Users/Marja/Splice/Light.pdf"
done

for f in "$@"
do
# Rasterize the images extracted to ~/Splice/Images.pdf using K2pdfopt
~/Applications/k2pdfopt -ui -mode copy -dev dx -x -o "/Users/Marja/Splice/DxImages_dx.pdf" "/Users/Marja/Splice/Images.pdf" $@
done

for f in "$@"
do
# Turn the text extracted to ~/Splice/Text.pdf black using Cpdf
/usr/local/bin/cpdf "/Users/Marja/Splice/Text.pdf" -blacktext -o "/Users/Marja/Splice/Blacktext.pdf"
done

for f in "$@"
do
# Splice files using qpdf and date so new runs won't overwrite old ones
/usr/local/bin/qpdf --collate "/Users/Marja/Splice/DxCover_dx.pdf" --pages "/Users/Marja/Splice/DxCover_dx.pdf" "/Users/Marja/Splice/DxImages_dx.pdf" "/Users/Marja/Splice/Blacktext.pdf" -- /Users/Marja/Splice/"SplicedDx$(date "+%Y.%m.%d-%H.%M.%S").pdf"
done

The 3rd shell script can take a long while.

I've experimented with the PDFSuite 150 and 300 dpi filters, but depending on the source pdfs these often crash due to memory pressure. Even this version will occasionally crash.

I've not been able to keep the original filename as an element in the final one.
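
One possible approach, sketched under the assumption that the original path is still available as "$f" in the step that calls qpdf (which may not hold when the workflow is split across several Automator actions). The function name make_output_name and the example path are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: build an output name from the source file's base name
# plus a timestamp, so new runs don't overwrite old ones.
make_output_name() {
  local base stamp
  base=$(basename "$1" .pdf)              # strip directory and .pdf suffix
  stamp=$(date "+%Y.%m.%d-%H.%M.%S")      # same date format as the qpdf step
  printf '%s-SplicedDx-%s.pdf\n' "$base" "$stamp"
}

# Example with a made-up path.
make_output_name "/Users/Marja/Books/Some Book.pdf"
```

The result could then be used as the last argument to qpdf, for example /Users/Marja/Splice/"$(make_output_name "$f")".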

#6  MarjaE 09-10-2020, 12:05 AM
P.S. Using a single bash shell, with wait on a separate line between each pair of commands, works better than multiple shells.
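
A sketch of a related way to keep the steps in order within one shell. In bash, a foreground command finishes before the next line starts, and wait only applies to background jobs, so this version instead chains the steps with && and stops at the first failure. The run_step helper and the echo/false placeholders are hypothetical stand-ins for the gs, k2pdfopt, cpdf, and qpdf calls.

```shell
#!/bin/bash
# Hypothetical helper: run one command in the foreground and report a failure.
run_step() {
  "$@" || { echo "step failed: $1" >&2; return 1; }
}

# Placeholder pipeline: each echo stands in for a real tool invocation,
# and 'false' simulates a step that fails.
splice_demo() {
  run_step echo "stage 1: extract images" &&
  run_step echo "stage 2: extract text" &&
  run_step false &&
  run_step echo "stage 3: never reached"
}

splice_demo || echo "pipeline stopped early"
```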

#7  MarjaE 10-03-2020, 02:10 AM
Here's an updated Mac implementation. It differs from my 1st draft in 3 respects:

1. It adds a Quartz step at the beginning, to reduce the risk of losing text information, and of sizing issues.

2. It adds the wait steps.

3. It adds another k2pdfopt step at the end, to remove the now-useless tables of contents.

I don't have the programming knowledge for j.p.s's suggestions.

Requires Automator with a shell script, using bash, and passing input as arguments (Mac-specific), Quartz (Mac-specific, but other apps may accomplish the same goals in Linux and Windows), Python 3, a couple scripts from Benwiggy's PDFSuite edited to work with Python 3, ghostscript, Willus's k2pdfopt, cpdf, and qpdf.

for f in "$@"
do
# Strip any table of contents and fit text to page sizes to avoid any scaling issues
/usr/local/bin/python3 /Users/Marja/Library/Services/quartzfilter3.py "$f" "/Users/Marja/Library/Filters/Generic RGB.qfilter" "/Users/Marja/Splice/GRGB.pdf"
wait
# Copy and Rasterize 1st page from source pdf using k2pdfopt
~/Applications/k2pdfopt -ui -mode copy -p 1 -x -o "/Users/Marja/Splice/Cover_rgb.pdf" "/Users/Marja/Splice/GRGB.pdf" $@
wait
# Copy images from same source pdf file using Ghostscript, rasterize images using K2pdfopt
# Due to compatibility issues, dumping to ~/Splice/Images.pdf
/usr/local/bin/gs -sDEVICE=pdfimage24 -dFILTERTEXT -dCompatibilityLevel=1.4 \
-g800x1080 -r150 -dPDFFitPage \
-sstdout=%sstderr -dNOPAUSE -dQUIET -dBATCH -sOutputFile="/Users/Marja/Splice/Images.pdf" "/Users/Marja/Splice/GRGB.pdf"
wait
~/Applications/k2pdfopt -ui -mode copy -x -o "/Users/Marja/Splice/Images_rgb.pdf" "/Users/Marja/Splice/Images.pdf" $@
wait
# Copy text from source pdf file using Ghostscript, turn text black using Cpdf
# The color conversion strategy should help with the 2nd stage if I switch to Ghostscript
# - and -_ indicate standard output and input
# Due to compatibility issues, dumping to ~/Splice/Text.pdf
/usr/local/bin/gs -sDEVICE=pdfwrite -dFILTERIMAGE -dFILTERVECTOR -dCompatibilityLevel=1.4 -sColorConversionStrategy=RGB -sstdout=%sstderr -dNOPAUSE -dQUIET -dBATCH -sOutputFile="/Users/Marja/Splice/Text.pdf" "/Users/Marja/Splice/GRGB.pdf"
wait
/usr/local/bin/cpdf "/Users/Marja/Splice/Text.pdf" -blacktext -o "/Users/Marja/Splice/Blacktext.pdf"
wait
# Splice files using qpdf
/usr/local/bin/qpdf --collate "/Users/Marja/Splice/Cover_rgb.pdf" --pages "/Users/Marja/Splice/Cover_rgb.pdf" "/Users/Marja/Splice/Images_rgb.pdf" "/Users/Marja/Splice/Blacktext.pdf" -- /Users/Marja/Splice/SplicedRGB.pdf
wait
# Remove any table of contents, since it won't fit the spliced pdf
suffix="-SplicedRgbG.pdf"
base=`basename "$f" .pdf`
outputfile=$base$suffix
~/Applications/k2pdfopt -ui -mode copy -n -toc- -o /Users/Marja/Splice/"$outputfile" /Users/Marja/Splice/SplicedRGB.pdf $@
done

#8  MarjaE 11-07-2020, 04:01 PM
It helps to drop the trailing $@ from the k2pdfopt commands where I've included it above.
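
A small sketch of what the trailing $@ does inside the loop, using a hypothetical count_args function in place of k2pdfopt. Inside for f in "$@", the extra $@ hands every dropped file to the command again on each pass, in addition to the intended input.

```shell
#!/bin/bash
# Hypothetical stand-in for k2pdfopt: just report how many arguments it got.
count_args() { echo "$#"; }

demo() {
  for f in "$@"; do
    count_args -o out.pdf "$f" "$@"   # with trailing "$@": all files passed again
    count_args -o out.pdf "$f"        # without it: only the current file
  done
}

# Two made-up input names.
demo a.pdf b.pdf
```

With two files, the first form passes 5 arguments on each pass and the second passes 3.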
