Extracting text with formatting from PDF
#1  nekokami 01-24-2007, 10:20 PM
Hi folks,

I have a PDF file that I'd like to get the text out of while retaining the formatting. The file is too large to simply select all the text and copy/paste. (I get a memory error when I try.) Besides, I'd rather leave out the page numbers, since they won't be relevant on the device I'll be reading on (eBw 1150). The ABC PDF converter gets the text but loses the formatting. I can't afford a full copy of Acrobat. Other extractors I've tried seem to assume one has Word installed (I don't).

I usually use a Mac, but I do have a PC available. Can anyone suggest a good, preferably low-cost program to convert PDF to something more portable, e.g. HTML or RTF? (I guess I could use the trial of Acrobat Professional for now, but I'd like a more long-term solution.)


PS - I've also tried TextLightning and Trapeze on the Mac. Neither worked, possibly because they didn't like the font. TextLightning kept crashing, and the limited output it did manage to provide didn't parse. It looked like raw PDF code. Trapeze just produced junk.

#2  jæd 01-25-2007, 04:10 AM
I would try the various command line converters for this, or write a perl/java/php program...

#3  Alexander Turcic 01-25-2007, 06:42 AM
Abbyy Transformer works well too, but it's payware. They have a demo you can try.

#4  nekokami 01-25-2007, 09:57 AM
Thanks, I'll try Abbyy Transformer, but $99 is too steep for me to use it once I'm past the demo.

It turns out that there is an additional wrinkle. Text formatting (italics and some other changes) was implemented using different fonts, rather than font styles. Copy and paste doesn't seem to preserve these different fonts, so I lose formatting even in the copy-paste-to-Word method.

@jæd, do you recommend any particular command-line converter? I write in perl and (to a lesser extent) php, but I really don't have time to write code right now.
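
A note on the font-based formatting described in this post: if an extractor can report the font name of each run of text, the styling can in principle be recovered by mapping font names to tags. Below is a minimal Python sketch, assuming the runs are already available as (font name, text) pairs. The font names, the FONT_TAGS table, and the function name runs_to_html are all hypothetical and not taken from the actual file.

```python
from html import escape

# Hypothetical mapping from embedded font names to HTML tags.
# None means body text that needs no extra markup.
FONT_TAGS = {
    "BodyFont": None,
    "BodyFont-Italic": "i",
    "BodyFont-Bold": "b",
}


def runs_to_html(runs, font_tags=FONT_TAGS):
    """Convert (font_name, text) runs to HTML, merging adjacent runs that share a tag."""
    parts = []
    buffer = []
    current_tag = None

    def flush():
        # Emit the buffered text, wrapped in the current tag if there is one.
        if not buffer:
            return
        text = escape("".join(buffer))
        parts.append(f"<{current_tag}>{text}</{current_tag}>" if current_tag else text)
        buffer.clear()

    for font_name, text in runs:
        tag = font_tags.get(font_name)  # unknown fonts fall back to plain text
        if tag != current_tag:
            flush()
            current_tag = tag
        buffer.append(text)
    flush()
    return "".join(parts)


runs = [
    ("BodyFont", "She said "),
    ("BodyFont-Italic", "never"),
    ("BodyFont-Italic", " again"),
    ("BodyFont", "."),
]
print(runs_to_html(runs))
```

The hard part in practice is getting the per-run font names out of the PDF in the first place; this sketch only covers the step after that.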

#5  nekokami 01-25-2007, 07:42 PM
The plot thickens further: I have a copy of Readiris OCR, so I tried pulling this PDF file in to see if I could just OCR it. All I see in Readiris is boxes instead of letters. I tried a different PDF file and it worked fine (well, mostly fine--usual OCR type errors). Note that in the "thumbnail preview" mode on the Mac in the Finder, I also see boxes instead of text. Also, in the "Preview" application on the Mac I see boxes. (This isn't surprising, as I strongly suspect these two bits of software use the same code.)

Does anyone here know enough about PDF to guess what's happening? Again, when I look at the fonts (in Document Properties in Acrobat Reader) I see pretty weird names, e.g. "TTE1D974C0t00 (Embedded Subset)". It's a TrueType font, but the encoding is listed as "Custom." In files that behave more normally I see recognizable font names (variations on Arial or Times New Roman) and encoding of "Ansi". Does anyone know how to work around this problem? Maybe I'm going to need the full version of Acrobat after all....
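
One plausible reading of the extraction symptoms, offered as an assumption rather than a diagnosis of this file: with a subset font under a custom encoding, the character codes stored in the page are just indexes into that font's own glyph table, so they only turn back into readable text if a code-to-character mapping is also available. The toy Python sketch below illustrates the idea with an invented mapping. It does not parse a real PDF, and it may not account for the boxes seen when the file is rendered in Preview.

```python
# Toy illustration only: the codes and the mapping are invented for this example.
# With a custom encoding, the codes need not bear any relation to ASCII.
CUSTOM_CODE_TO_CHAR = {1: "H", 2: "e", 3: "l", 4: "o"}

BOX = "\u25a1"  # placeholder shown when a code cannot be mapped


def decode(codes, code_to_char=None):
    """Decode character codes, falling back to a box when no mapping is known."""
    table = code_to_char or {}
    return "".join(table.get(code, BOX) for code in codes)


page_codes = [1, 2, 3, 3, 4]  # what the page would store for "Hello" in this toy font

print(decode(page_codes))                       # no mapping available: all boxes
print(decode(page_codes, CUSTOM_CODE_TO_CHAR))  # mapping available: readable text
```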

#6  pclewis 01-25-2007, 11:35 PM
Here is a program that is cheap, $12.95, that works OK. Always some rework required. PDF is an output file, so it is what it is.

#7  nekokami 01-26-2007, 08:37 AM
@pclewis, thanks. As I mentioned above, I've tried that one already. It does work, it just loses formatting because the text is formatted using different fonts, rather than text styles. I think I'm probably going to have to find someone with a copy of Acrobat Standard (or Pro) that I can use to change the fonts.

Thanks anyway

#8  nekokami 01-28-2007, 10:59 AM
In case anyone is curious, I did eventually determine that the problem was the custom encoding in the PDF, and I solved the problem by finding an alternate (HTML) source of the file. :/ Yet another good example of why PDF is not a good format for source files. (I'm planning to spend today learning LaTeX.)

#9  wallcraft 01-28-2007, 11:18 PM
Foxit Reader Pro Pack might be a possibility. I only have the free Reader, which seems to maintain formatting in its text view. The Pro Pack ($39) is needed for full-file text conversion.

#10  nekokami 01-29-2007, 10:18 AM
Looks to me like Foxit's output is plain text, but at least the editor might help if I have to sort out a problem like this in the future.

Meanwhile, the HTML file I have doesn't seem to want to import using the eBw Librarian software. It spits out warnings about unknown fonts, then hits a fatal error. Oddly enough, two very similar files imported just fine. Fortunately, HTML converts to RTF fairly easily, so I can try that next.
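
On the HTML-to-RTF step: for simple markup the conversion is largely mechanical. Here is a rough Python sketch using only the standard library, assuming the HTML uses nothing beyond <p>, <i>/<em>, and <b>/<strong>. The names HtmlToRtf, html_to_rtf, and rtf_escape are made up for illustration, every other tag is ignored, and whether the result imports cleanly into the eBw Librarian is untested.

```python
from html.parser import HTMLParser

# RTF control words that switch a style on and off, keyed by HTML tag.
STYLE_TAGS = {
    "i": ("\\i ", "\\i0 "),
    "em": ("\\i ", "\\i0 "),
    "b": ("\\b ", "\\b0 "),
    "strong": ("\\b ", "\\b0 "),
}

RTF_HEADER = "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Times New Roman;}}\n"


def rtf_escape(text):
    """Escape RTF special characters and write non-ASCII as \\uN? escapes."""
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # \uN takes a signed 16-bit number; "?" is the fallback character.
            units = ch.encode("utf-16-be")
            for i in range(0, len(units), 2):
                n = int.from_bytes(units[i:i + 2], "big", signed=True)
                out.append(f"\\u{n}?")
    return "".join(out)


class HtmlToRtf(HTMLParser):
    """Collect RTF fragments while walking a small subset of HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in STYLE_TAGS:
            self.parts.append(STYLE_TAGS[tag][0])

    def handle_endtag(self, tag):
        if tag in STYLE_TAGS:
            self.parts.append(STYLE_TAGS[tag][1])
        elif tag == "p":
            self.parts.append("\\par\n")  # end of paragraph

    def handle_data(self, data):
        self.parts.append(rtf_escape(data))


def html_to_rtf(html):
    parser = HtmlToRtf()
    parser.feed(html)
    parser.close()
    return RTF_HEADER + "".join(parser.parts) + "}"


print(html_to_rtf("<p>A <i>b</i> {c}</p>"))
```

This does no whitespace normalization and ignores fonts, headings, and entities beyond what the parser decodes on its own, so real files would likely need more handling.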

