Mobileread
Optimize PDFs from archive.org for E-Ink devices
#1  ctop 02-25-2020, 10:28 PM
The Internet Archive at archive.org has a lot of interesting books for borrowing and downloading. I have some downloads of older books that are difficult to read on E-Ink devices because they include the background of the page, which has become yellow. So the contrast is low and the text becomes unclear, also the files are quite big. So I wonder if somebody knows a good way to trim the PDFs for ereaders. I would prefer to use a commandline on a Linux based system, if such a tool is available here.
An example of the PDFs I am looking at is this:

https://archive.org/details/smtlichewer16goet/page/n8/mode/2up

This is the item page; the download link is here:

https://archive.org/download/smtlichewer16goet/smtlichewer16goet_bw.pdf

Any help appreciated, Ctop

#2  Tex2002ans 02-25-2020, 11:55 PM
Quote ctop
[...] the background of the page, which has become yellow. So the contrast is low and the text becomes unclear, [...]

I would prefer to use a commandline on a Linux based system, if such a tool is available here.
GUI-based:

Scan Tailor Advanced:

https://github.com/4lex4/scantailor-advanced

There isn't another tool like it.

If you want commandline, then there's nothing better than ImageMagick, but you'll have to come up with all the tweaks yourself.

There was also "What’s your “image rehab” routine?" from 2013 which discussed some image cleanup ideas. Although that mostly focused on cleaning up images within scans.

Side Note: Archive.org's B&W versions are usually okay. In this case, it requires lots of manual intervention. Go back to the color PDF (or like GrannyGrump mentions in the thread above, use the original JPEG2000 files), and do all your cleaning from there.

This specific file also has a lot of bleed-through between the pages, so that may make your job even harder when trying to darken text.
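
To give a rough idea of what the basic "yellow page to clean gray" tweak involves, here is a minimal sketch in plain Python. It assumes a page is just a list of (R, G, B) tuples; the function names and the black/white points (60 and 200) are made up for illustration and are not tuned for this book.

```python
def to_gray(rgb):
    """Luma-weighted grayscale value (0-255) for one (R, G, B) pixel."""
    r, g, b = rgb
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def stretch_levels(gray, black=60, white=200):
    """Map `black` and below to 0, `white` and above to 255, linear in between."""
    if gray <= black:
        return 0
    if gray >= white:
        return 255
    return round((gray - black) * 255 / (white - black))


def clean_page(pixels, black=60, white=200):
    """Convert RGB pixels to gray and push the paper tone to pure white."""
    return [stretch_levels(to_gray(p), black, white) for p in pixels]


# Hypothetical sample: yellowed paper, faded ink, and a mid-tone bleed-through.
page = [(230, 215, 170), (70, 60, 50), (180, 165, 130)]
print(clean_page(page))
```

With these sample values the paper goes to white and the ink close to black, but the bleed-through pixel stays a light gray. Lowering the white point removes it at the cost of thinning faint strokes, which is why a file like this tends to need per-book tuning.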

Quote ctop
also the files are quite big. So I wonder if somebody knows a good way to trim the PDFs for ereaders.
Scan Tailor Advanced should be able to do all the chopping/cropping/contrast adjustments for you. But if you need even more PDF tweaking beyond that, then there's k2pdfopt, by willus.

#3  ctop 02-26-2020, 01:52 AM
Quote Tex2002ans
GUI-based:

Scan Tailor Advanced:

https://github.com/4lex4/scantailor-advanced

There isn't another tool like it.
Thanks. I was somehow hoping that I could just clean the images without disturbing the text layer. I have been using scantailor (though not the advanced version, thanks for pointing that out) for books I scanned myself, and am quite pleased with the results. So it seems what you are saying is that it is best to throw away all the post-processing already done and start from the images. Sigh, with a GUI based program that is quite a lot of work...

All the best,
Ctop

#4  doubleshuffle 02-26-2020, 04:37 AM
Why not fix the epub and upload it to the MR library? Will be much nicer on your reader, and also a service to the community.

#5  ctop 02-26-2020, 05:56 AM
Quote doubleshuffle
Why not fix the epub and upload it to the MR library? Will be much nicer on your reader, and also a service to the community.
I had not even thought about that. I will have a look and see if it can be done in a reasonable timeframe.

Ctop

#6  doubleshuffle 02-26-2020, 12:13 PM
It is a lot of work, no denying that. But your pdf-fixing efforts sound pretty complicated too, so that's what gave me the idea.

I only now had a look at the book you have in mind. That's huuuge, of course, and seriously a lot of work.

BTW, there's a very nice epub edition of Goethe's works in our library, provided by pynch. But I'm not sure if the scientific writings are complete in that one.

#7  doubleshuffle 02-26-2020, 12:18 PM
Just had a look at the txt file of the book - a very clean OCR result with surprisingly few errors. Fixing the epub may really be the way to go here.

#8  Tex2002ans 02-26-2020, 06:41 PM
Quote ctop
I was somehow hoping that I could just clean the images without disturbing the text layer.
Yeah, that's the one disadvantage of Scan Tailor, it recreates/morphs the original text.

But if you're using it for personal copies, or a pre-processor for more accurate OCR, it's great.

The nice thing about it is you can also do page-by-page adjustments, and see how the final output will look. For example, speckle cleanup is fantastic, and you can see the diffs and adjust the strength if necessary.

Quote ctop
I have been using scantailor (though not the advanced version, thanks for pointing that out) for books I scanned myself, and am quite pleased with the results.
The original is no longer maintained, while the various forks added lots of functionality (like better multi-threading; you can see the entire enhancement list on GitHub).

Scan Tailor Advanced combines all the best functionality from all of them, and I believe it's the only one actively maintained.

Quote ctop
So it seems what you are saying is that it is best to throw away all the post-processing already done and start from the images.
Yes. Archive.org just does a whole host of automated conversions... and I wouldn't use them if you could help it.

I usually just stick with their:

1. B&W PDF. Usually this is decent. In the case of this specific "yellowed book", it was crap.

2. Color PDF. This matches what they show in their online reader. Helpful if working with color, drawings, or "yellowed books". (You can do your own contrast/color corrections from this, and create a better grayscale/B&W version.)

3. As a last resort, work directly from the JPEG2000 images. These are the highest resolution/quality.
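
As a rough illustration of the "create a better grayscale/B&W version" step, here is a minimal sketch of automatic thresholding using Otsu's method. This is a standard technique swapped in for the example, not what Archive.org or Scan Tailor necessarily does. It assumes the page has already been reduced to a flat list of 0-255 gray values; the function names and sample values are hypothetical.

```python
def otsu_threshold(gray_pixels):
    """Return the 0-255 cut that best separates dark ink from light paper."""
    hist = [0] * 256
    for g in gray_pixels:
        hist[g] += 1
    total = len(gray_pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels at or below the candidate threshold
    sum_dark = 0  # sum of gray values at or below the candidate threshold
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue
        w_light = total - w_dark
        if w_light == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance: large when ink and paper are well separated.
        between_var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(gray_pixels):
    """Turn a grayscale page into pure black (0) and white (255)."""
    t = otsu_threshold(gray_pixels)
    return [0 if g <= t else 255 for g in gray_pixels]


# Hypothetical sample: three ink pixels and four paper pixels.
print(binarize([30, 40, 35, 220, 210, 230, 225]))
```

A single global threshold like this tends to struggle with uneven lighting and heavy bleed-through, so it is only a starting point.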

Do not touch their "EPUBs" or any of their other "ebook" formats (they are just automatically run through OCR, no proofing or anything). You're better off working from the source files and recreating your own OCR/ebooks from that.

Plus, if you have access to newer tools, you may get even more accurate conversion (according to the metadata, FineReader 8 was used, where FineReader 12+ is probably more accurate).

PS. If you need me to run any images/PDFs (pre-processed or not) through FineReader 12, just let me know.

Quote ctop
Sigh, with a GUI based program that is quite a lot of work...
You can always automate any pre-processing steps with ImageMagick. For example, I was working on a book with scanning artifacts that ran vertically through the text:

Detecting/Removing Vertical Scanlines from Scans

So it could be used to clean up the images, then run through further corrections/tools after.

But with ImageMagick... you'll have to spend time figuring out all the commands + recreating fixes that may already exist.

For example, Scan Tailor already does a fantastic job of dewarping, detecting and cropping spines+edges-of-pages, [...].

If you go pure commandline ImageMagick... you'll have to figure out all those algorithms on your own. (Plus each book is going to have its own unique challenges.)
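
As a rough sketch of what automating such a pre-processing step can look like, the script below builds the same ImageMagick call for every page image in a folder. The helper names, the `*.jpg` pattern, and the 20%/80% levels are assumptions for illustration only; `-colorspace` and `-level` are real ImageMagick options, but the right values depend on the scan. By default it only prints the commands (dry run) instead of executing them.

```python
import subprocess
from pathlib import Path


def build_command(src, dst, black="20%", white="80%"):
    """Build one ImageMagick call: grayscale, then a levels stretch."""
    return [
        "convert",  # ImageMagick 6 name; ImageMagick 7 uses "magick"
        str(src),
        "-colorspace", "Gray",
        "-level", f"{black},{white}",
        str(dst),
    ]


def clean_folder(in_dir, out_dir, dry_run=True):
    """Print (or run) the same command for every JPEG page, in name order."""
    out_dir = Path(out_dir)
    commands = []
    for src in sorted(Path(in_dir).glob("*.jpg")):
        cmd = build_command(src, out_dir / (src.stem + ".png"))
        commands.append(cmd)
        if dry_run:
            print(" ".join(cmd))
        else:
            out_dir.mkdir(parents=True, exist_ok=True)
            subprocess.run(cmd, check=True)
    return commands


# Hypothetical file names, just to show the generated command line.
print(" ".join(build_command("page_0001.jpg", "page_0001.png")))
```

Calling `clean_folder("pages", "cleaned", dry_run=False)` would actually run the commands, assuming ImageMagick is installed and a `pages` folder of JPEGs exists. Dewarping, spine detection, and the like are not covered by something this simple.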

#9  hobnail 02-26-2020, 07:34 PM
Quote doubleshuffle
Just had a look at the txt file of the book - a very clean OCR result with surprisingly few errors. Fixing the epub may really be the way to go here.

I've also done it using the txt file and depending on the quality of the scan and the original book it can be a painful amount of work.

#10  Pajamaman 02-26-2020, 09:50 PM
I suggest you try KOReader. It can do OCR and reflow on the fly, and it also has a contrast setting.

On another note, does anyone know a PDF tool that can OCR text that curves up at the end of a line as a result of the edge of a book page not being flat when scanned?
