Hi all,
another -- kind of specific -- question ...
I'm currently (OK, for 3-4 years now) digitizing my whole library from childhood, and I have become kind of obsessed with scanning procedures.
My current workflow is like this:
Scan --> scantailor --> lots of clicking --> tiff --> mogrify stuff --> ps --> pdf
The results are really good and normally I'm quite fond of them; for file size and quality at 300x300 dpi I would give them an 8-9 out of 10. My "trouble" starts with books from like-minded people doing similar efforts, where an example doc may look like this:
-----
Creator: PDF-XChange Editor 5.5.xxx
Producer: PDF-XChange PDF Core API (5.5.xxx)
CreationDate: xxx xxx
ModDate: xxx xxx
Custom Metadata: no
Metadata Stream: yes
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 164
Encrypted: no
Page size: 372 x 559.68 pts
Page rot: 0
File size: 8100115 bytes
Optimized: no
PDF version: 1.2
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 1550 2332 rgb 3 8 jpeg no 170 0 300 300 371K 3.5%
2 1 image 1550 2332 index 1 1 ccitt no 172 0 300 300 17B 0.0%
3 2 image 1550 2332 index 1 1 ccitt no 174 0 300 300 9123B 2.0%
4 3 image 1550 2332 index 1 1 ccitt no 176 0 300 300 7332B 1.6%
5 4 image 1550 2332 index 1 1 ccitt no 178 0 300 300 36.8K 8.3%
6 5 image 1550 2332 index 1 1 ccitt no 180 0 300 300 42.6K 9.7%
----
As you can see: 371K for the cover at 300x300 dpi and around 42K for a "traditional" grey page. The cover is an RGB JPEG and the rest is CCITT (Group 4) encoded.
Does anybody have an idea how to instruct the ghostscript command line to achieve similar encodings?
When I encode them it looks like this:
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 1550 2332 icc 3 8 jpeg no 9 0 300 300 389K 3.7%
2 1 image 1550 2332 index 1 1 image no 16 0 300 300 461B 0.1%
3 2 image 1550 2332 index 1 1 image no 22 0 300 300 19.7K 4.5%
4 3 image 1550 2332 index 1 1 image no 28 0 300 300 14.4K 3.3%
5 4 image 1550 2332 index 1 1 image no 34 0 300 300 64.4K 15%
6 5 image 1550 2332 index 1 1 image no 40 0 300 300 74.3K 17%
7 6 image 1550 2332 index 1 1 image no 46 0 300 300 76.3K 17%
8 7 image 1550 2332 index 1 1 image no 52 0 300 300 76.5K 17%
9 8 image 1550 2332 index 1 1 image no 58 0 300 300 75.7K 17%
As you can see, with ghostscript I'm not reaching CCITT encoding. Does anyone know the correct parameter for gs -- even for just a single page -- to encode in CCITT with pdfwrite as the device?
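For context, the kind of invocation I mean is roughly the following -- the MonoImage* distiller parameters look like the right knobs, but I'm not sure they even apply to indexed 1-bit images (and if I read the docs right, CCITTFaxEncode is already the default mono image filter anyway):
gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dEncodeMonoImages=true -dMonoImageFilter=/CCITTFaxEncode -sOutputFile=out.pdf page-111.pdf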
\Pete
TIFF format is more about the wrapping than the encoding. You can compress using many different methods within a TIFF wrapper. I would suggest that you produce your black-and-white images as CCITT G4 encoded TIFF files before calling gs to create the PDF file (i.e. use the option "-compress Group4" when running mogrify/convert). I'm not familiar with scantailor, but maybe it offers that option out of the box.
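Roughly like this, off the top of my head (file names are just placeholders, and it assumes the pages are already 1-bit):
mogrify -compress Group4 page-*.tif                      # re-save in place with Group4
convert page-001.tif -compress Group4 page-001-g4.tif    # or one page at a time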
When I was first scanning my books, I would use convert to produce the TIFF files with CCITT G4 compression. Then I would use tiffcp to combine the separate TIFF files into a single multi-page TIFF file. I would then use either tumble or tiff2pdf to convert the multi-page TIFF file into a PDF file. Then I would use gs as the last step to add PDFMARKS to the PDF file.
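From memory, the rest of the chain looked roughly like this (treat the exact names and options as a sketch; pdfmarks.ps stands in for a hand-written file of pdfmark entries):
tiffcp page-*.tif book.tif          # combine the Group4 pages into one multi-page TIFF
tiff2pdf -o book.pdf book.tif       # wrap it as a PDF
gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=book-final.pdf book.pdf pdfmarks.ps   # merge in the PDFMARKS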
Nowadays I use pdfbeads, but that has become more complicated than my old way because the program is no longer maintained and is very difficult to get working on a modern system. I use my old copy of pdfbeads within an old Linux distro running inside VirtualBox.
Thank you for the quick answer - pdfbeads -- interesting idea -- (a VM, I get it :-) ). As for the other points above: that's exactly what I'm currently doing, and I wrote a crude bash wrapper for tryouts,
but basically it's the following.
qpdf --> explode all pdfs into single pdf pages
gs --> convert pdf to tiff (b/w tiffg4)
gs -q -dBATCH -dNOPAUSE -sDEVICE=tiffg4 -r300x300 -dFirstPage=1 -dLastPage=1 -sOutputFile=111.tif page-111.pdf
then loop over the tifs -> pdf with img2pdf, a "raw" wrapper without re-encoding: https://gitlab.mister-muffin.de/josch/img2pdf
and then bundle all the PDFs together into a combined pdf (roughly like the sketch below).
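Put together, the wrapper boils down to something like this (paths and names simplified):
qpdf --split-pages input.pdf page-%d.pdf                    # explode into single pages
for p in page-*.pdf; do
  gs -q -dBATCH -dNOPAUSE -sDEVICE=tiffg4 -r300x300 -sOutputFile="${p%.pdf}.tif" "$p"
  img2pdf -o "${p%.pdf}-g4.pdf" "${p%.pdf}.tif"             # wrap the G4 data, no re-encode
done
qpdf --empty --pages page-*-g4.pdf -- combined.pdf          # stitch back together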
Rather crude, but I achieve good compression results on b/w images with minimal effort and quite reasonable quality. (Please be aware that the input images should be b/w already -- if they are grey, the tiffg4 encode sometimes gives funky results.)
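If a page is still greyscale, something like this (file names are just examples, threshold percentage to taste) forces real 1-bit input first:
convert page-111-grey.tif -threshold 55% -type bilevel page-111-bw.tif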
Thought I'd share -- topic closed on my end -- but it was hellishly frustrating :-) to get some grip on that.
\Pete
ImageMagick, or the GIMP (import as layers)