MobileRead
ghostscript ccitt with pdfwrite
#1  icq70610 01-02-2023, 01:12 PM
Hi all

Another -- kind of specific -- question ...

I'm currently (OK, for 3-4 years now) digitizing my whole library from childhood, and I have become kind of obsessed with scanning procedures.

My current workflow looks like this:

Scan --> ScanTailor --> lots of clicking --> TIFF --> mogrify stuff --> PS --> PDF

The results are really good and normally I'm quite fond of them. For file size and quality at 300x300 dpi I would give them an 8-9 out of 10. My "trouble" starts with books from like-minded people doing similar efforts; an example doc may look like this:

-----
Creator: PDF-XChange Editor 5.5.xxx
Producer: PDF-XChange PDF Core API (5.5.xxx)
CreationDate: xxx xxx
ModDate: xxx xxx
Custom Metadata: no
Metadata Stream: yes
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 164
Encrypted: no
Page size: 372 x 559.68 pts
Page rot: 0
File size: 8100115 bytes
Optimized: no
PDF version: 1.2
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 1550 2332 rgb 3 8 jpeg no 170 0 300 300 371K 3.5%
2 1 image 1550 2332 index 1 1 ccitt no 172 0 300 300 17B 0.0%
3 2 image 1550 2332 index 1 1 ccitt no 174 0 300 300 9123B 2.0%
4 3 image 1550 2332 index 1 1 ccitt no 176 0 300 300 7332B 1.6%
5 4 image 1550 2332 index 1 1 ccitt no 178 0 300 300 36.8K 8.3%
6 5 image 1550 2332 index 1 1 ccitt no 180 0 300 300 42.6K 9.7%

----

As you can see: 371K for the cover at 300x300 dpi, and around 42K for a "traditional" grey page. The cover is a JPEG in RGB and the rest is CCITT (TIFF G4) encoded.

Does anybody have an idea how to instruct the Ghostscript command line to achieve similar encodings?

When I encode them, it looks like this:

page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 1550 2332 icc 3 8 jpeg no 9 0 300 300 389K 3.7%
2 1 image 1550 2332 index 1 1 image no 16 0 300 300 461B 0.1%
3 2 image 1550 2332 index 1 1 image no 22 0 300 300 19.7K 4.5%
4 3 image 1550 2332 index 1 1 image no 28 0 300 300 14.4K 3.3%
5 4 image 1550 2332 index 1 1 image no 34 0 300 300 64.4K 15%
6 5 image 1550 2332 index 1 1 image no 40 0 300 300 74.3K 17%
7 6 image 1550 2332 index 1 1 image no 46 0 300 300 76.3K 17%
8 7 image 1550 2332 index 1 1 image no 52 0 300 300 76.5K 17%
9 8 image 1550 2332 index 1 1 image no 58 0 300 300 75.7K 17%


As you can see, with Ghostscript I'm not reaching CCITT encoding. Does anyone know the correct parameter for gs? Even a single page encoded as CCITT with pdfwrite as the device would help.
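For what it's worth, pdfwrite exposes distiller-style parameters for its monochrome-image path; the parameter names below are from the Ghostscript pdfwrite documentation, and the dummy input page is generated only so the command is runnable as written. Note that CCITTFaxEncode applies to images Ghostscript treats as 1-bit monochrome; indexed 1-bit images (as in the listing above) may go through the colour-image path instead, which would explain the "image" encoding in the table.

```shell
# Sketch: explicitly request CCITT G4 for monochrome images in pdfwrite
# output. CCITTFaxEncode is in fact the documented default MonoImageFilter;
# setting it just makes the intent visible.
command -v gs >/dev/null || { echo "gs not installed, skipping"; exit 0; }

# create a one-page dummy input PDF so this runs as-is
gs -q -o in.pdf -sDEVICE=pdfwrite -c showpage

gs -q -dBATCH -dNOPAUSE -sDEVICE=pdfwrite \
   -dEncodeMonoImages=true \
   -dMonoImageFilter=/CCITTFaxEncode \
   -sOutputFile=out.pdf in.pdf
```

If the pages still come out as "image" rather than "ccitt", the safer route is the one discussed below: feed pdfwrite (or img2pdf) true bilevel Group 4 TIFFs instead of indexed 1-bit images.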

\Pete

#2  rkomar 01-02-2023, 06:06 PM
TIFF format is more about the wrapping than the encoding. You can compress using many different methods within a TIFF wrapper. I would suggest that you produce your black-and-white images as CCITT G4 encoded TIFF files before calling gs to create the PDF file (i.e. use the option "-compress Group4" when running mogrify/convert). I'm not familiar with ScanTailor, but maybe it offers that option out of the box.
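A minimal sketch of that suggestion, assuming ImageMagick is installed (the sample page is generated here only so the commands run as-is; Group 4 requires bilevel input, hence the -type bilevel):

```shell
command -v convert >/dev/null || { echo "ImageMagick not installed, skipping"; exit 0; }

# generate a small bilevel sample page (stand-in for a scanned b/w page)
convert -size 64x64 pattern:checkerboard -threshold 50% page01.png

# re-encode as CCITT Group 4 inside a TIFF wrapper
convert page01.png -type bilevel -compress Group4 page01.tif

# bulk variant for a whole directory:
# mogrify -format tif -type bilevel -compress Group4 *.png
```

`identify -format '%C' page01.tif` should report Group4 if the re-encode worked.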

When I was first scanning my books, I would use convert to produce the TIFF files with CCITT G4 compression. Then I would use tiffcp to combine the separate TIFF files into a single multi-page TIFF file. I would then use either tumble or tiff2pdf to convert the multi-page TIFF file into a PDF file. Then I would use gs as the last step to add PDFMARKS to the PDF file.
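Sketched end to end, with the tool names from the post (the options are illustrative assumptions; check each man page, and the pdfmark file here is just a one-bookmark example):

```shell
for t in convert tiffcp tiff2pdf gs; do
  command -v "$t" >/dev/null || { echo "$t not installed, skipping"; exit 0; }
done

# two tiny stand-in pages, CCITT G4 compressed
convert -size 64x64 xc:white -type bilevel -compress Group4 p1.tif
convert -size 64x64 xc:black -type bilevel -compress Group4 p2.tif

tiffcp p1.tif p2.tif book.tif          # combine into one multi-page TIFF
tiff2pdf -o book-raw.pdf book.tif      # wrap the multi-page TIFF as a PDF

# add a bookmark via pdfmark as the last step
printf '[ /Page 1 /Title (Cover) /OUT pdfmark\n' > marks.ps
gs -q -dBATCH -dNOPAUSE -sDEVICE=pdfwrite \
   -sOutputFile=book.pdf book-raw.pdf marks.ps
```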

Nowadays I use pdfbeads, but that has become more complicated than my old way because the program is no longer maintained and is very difficult to get working on a modern system. I use my old copy of pdfbeads within an old Linux distro running inside VirtualBox.

#3  icq70610 01-03-2023, 02:00 PM
Quote rkomar
TIFF format is more about the wrapping than the encoding. [...] I use my old copy of pdfbeads within an old linux distro running inside VirtualBox.
Thank you for the quick answer. pdfbeads -- interesting idea (VM, I get it :-) ). As for the other points above: that's exactly what I'm currently doing, and I wrote a crude bash wrapper for tryouts.

Basically it's the following:

qpdf --> explode all pdfs into single pdf pages
gs --> convert pdf to tiff (b/w tiffg4)
gs -q -dBATCH -dNOPAUSE -sDEVICE=tiffg4 -r300x300 -dFirstPage=1 -dLastPage=1 -sOutputFile=111.tif page-111.pdf
then loop over the TIFFs -> PDF with img2pdf, a "raw" wrapper without re-encoding https://gitlab.mister-muffin.de/josch/img2pdf
and then bundle all the PDFs together into a combined PDF.
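The loop above, sketched as a script. File names are illustrative, the dummy input is generated only so it runs as-is, and qpdf's %d split pattern and img2pdf's -o flag are taken from their respective docs:

```shell
for t in qpdf gs img2pdf; do
  command -v "$t" >/dev/null || { echo "$t not installed, skipping"; exit 0; }
done

# dummy one-page input so the script runs as-is; use your real scan PDF here
gs -q -o input.pdf -sDEVICE=pdfwrite -c showpage

# 1) explode into single-page PDFs
qpdf --split-pages input.pdf page-%d.pdf

# 2) render each page to a CCITT G4 TIFF, then rewrap it without re-encoding
for p in page-*.pdf; do
  gs -q -dBATCH -dNOPAUSE -sDEVICE=tiffg4 -r300x300 \
     -sOutputFile="${p%.pdf}.tif" "$p"
  img2pdf "${p%.pdf}.tif" -o "${p%.pdf}-g4.pdf"
done

# 3) bundle everything back into one combined PDF
qpdf --empty --pages page-*-g4.pdf -- output.pdf
```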

Rather crude, but I achieve good compression results on b/w images with minimal effort and quite reasonable quality. (Please be aware that the input images should be b/w already -- if they are grey, the tiffg4 encode sometimes gives funky results.)

Thought I'd share -- topic closed on my end -- but it was hell of frustrating :-) to get some grip on that.

\Pete

#4  Quoth 01-03-2023, 06:07 PM
ImageMagick, or the GIMP (import as layers)
