Mobileread
PDF to ePub conversion
#1  BetterRed 08-19-2019, 09:42 PM
Quote DNSB
. . . since PDF is one of the worst formats to convert from . . .


In the OP's last post he wrote "I converted the book from PDF with Sigil." Is that possible? If so, how?


@joebob2 - most PDFs are created from something else - I've only known one person who wrote PostScript on a clean slate. They typically start life as WP or DTP files from programs such as Word, InDesign, Writer etc. If you can get hold of such a file, that might be a better place to start.

BR

#2  DNSB 08-19-2019, 11:37 PM
Quote BetterRed

In the OP's last post he wrote "I converted the book from PDF with Sigil." Is that possible? If so, how?
I don't think it is possible to convert PDF to epub using Sigil. I did run into one author who attempted to copy/paste pages from one of her old books into BookView as that was the only electronic format for that book she was able to obtain when the rights reverted. The output of that was a right mess.

#3  BetterRed 08-20-2019, 12:59 AM
Quote DNSB
I don't think it is possible to convert PDF to epub using Sigil. I did run into one author who attempted to copy/paste pages from one of her old books into BookView as that was the only electronic format for that book she was able to obtain when the rights reverted. The output of that was a right mess.
Ah yes, that one came up in a recent discussion. I don't think of page-by-page coffee/pasta as a conversion technique, I think of it as a "there must be a better way than this" technique.

PDF conversion must be amongst the top 5 topics at MR.

BR

#4  joebob2a 08-20-2019, 11:36 AM
Quote DNSB
I don't think it is possible to convert PDF to epub using Sigil. I did run into one author who attempted to copy/paste pages from one of her old books into BookView as that was the only electronic format for that book she was able to obtain when the rights reverted. The output of that was a right mess.
This book has come to be through a pretty roundabout process, as you might suspect. It was originally written in M$ Word/OpenOffice, sucked into QuarkXPress, and then output as PDF in print form. Many corrections had happened between the original word processing files and the Quark files. I used a web utility (https://www.online-convert.com/) to get from PDF back to Word, but then I had the page headers and footers to worry about, not to mention typesetting issues like no space after periods and embedded hyphens. I seriously don't want to go back to that! I have the original Quark source, but I haven't found a conversion tool to get it out of that format.

On the Smashwords site it talks about a "nuclear option," i.e. copy and paste the entire document into a Word document and re-convert it. I'm tinkering enough right now, I may go that direction.
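
Editor's sketch: the header/footer and missing-space problems described above can often be reduced with a small script before any hand editing. This is only a minimal illustration, assuming the converted text has already been split into one string per page; the function names and the `min_share` threshold are hypothetical and not part of any tool mentioned in this thread.

```python
import re
from collections import Counter


def _key(line):
    # Treat "Page 12" and "Page 13" as the same running line.
    return re.sub(r"\d+", "#", line.strip())


def strip_running_lines(pages, min_share=0.6):
    """Drop first/last lines that repeat on most pages (likely headers/footers).

    Assumption: `pages` is a list of strings, one per page.
    Blank lines are discarded as a side effect.
    """
    counts = Counter()
    for page in pages:
        lines = [ln for ln in page.splitlines() if ln.strip()]
        if lines:
            counts.update({_key(lines[0]), _key(lines[-1])})
    threshold = max(2, int(len(pages) * min_share))
    running = {k for k, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        lines = [ln for ln in page.splitlines() if ln.strip()]
        if lines and _key(lines[0]) in running:
            lines = lines[1:]
        if lines and _key(lines[-1]) in running:
            lines = lines[:-1]
        cleaned.append("\n".join(lines))
    return cleaned


def add_space_after_sentence_end(text):
    """Insert a space where a lowercase letter + . ! or ? runs into a capital."""
    return re.sub(r"(?<=[a-z][.!?])(?=[A-Z])", " ", text)
```

The lowercase-before, uppercase-after rule is deliberately conservative so that things like "U.S.A." or "3.14" are left alone; it will still miss some cases and should be followed by a proofread.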

#5  lumpynose 08-20-2019, 12:35 PM
Quote joebob2a
On the Smashwords site it talks about a "nuclear option," i.e. copy and paste the entire document into a Word document and re-convert it. I'm tinkering enough right now, I may go that direction.
That's what I've done when I've "transcribed" a short story from an old magazine when the PDF scans are on archive.org. Exceedingly tedious. In that case it probably has more errors, since the magazine has faded, the paper's brown, and the typesetting can be dodgy.

#6  jackie_w 08-20-2019, 12:59 PM
Quote BetterRed
I don't think of page-by-page coffee/pasta as a conversion technique
This made me smile (it's been a slow day). Are you using predictive text by any chance?

#7  Tex2002ans 08-20-2019, 07:40 PM
Quote joebob2a
This book has come to be through a pretty roundabout process, as you might suspect. It was originally written in M$ Word/OpenOffice, sucked into QuarkXPress, and then output as PDF in print form. Many corrections had happened between the original word processing files and the Quark files.
So the Quark file is the up-to-date version?

Quote joebob2a
I used a web utility [...] to get from PDF back to Word, but then I had the page headers and footers to worry about, not to mention typesetting issues like no space after periods and embedded hyphens.
A more robust OCR program (like FineReader) would avoid most of those issues.

Quote joebob2a
I have the original Quark source, but I haven't found a conversion tool to get it out of that format.
What's the file extension on the Quark file? QXD?

Do you happen to know which version of Quark it used?

(And ~ when this book was published?)

I only worked on one QXD file many years ago, and surprisingly, LibreOffice was able to open it. It still required a lot of elbow grease, but it was a huge step up from having to OCR from scratch.

Quote joebob2a
On the Smashwords site it talks about a "nuclear option," i.e. copy and paste the entire document into a Word document and re-convert it.
... no. Just no.

You lose all the important formatting information (bold/italics/superscript), and what's underneath the surface is just as important as the text itself.

And depending on how the PDF was put together, that copy/paste itself might introduce a massive number of issues as well (like the hard hyphens issue you mentioned).

You'll spend more time cleaning up all those errors than if you just worked from much cleaner OCR in the first place.
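
Editor's sketch: to illustrate the hard-hyphen problem mentioned above, here is a minimal example of rejoining words that were split across line ends. It assumes the line breaks survived the copy/paste; `dehyphenate` and the optional `known_words` set are hypothetical names for illustration, not features of FineReader, Sigil, or calibre.

```python
import re

# A word fragment, a hyphen at the end of a line, then the rest of the word.
_SPLIT_WORD = re.compile(r"(\w+)-[ \t]*\n[ \t]*(\w+)")


def dehyphenate(text, known_words=None):
    """Rejoin words hyphenated across a line break.

    If `known_words` (a set of lowercase words) is given, the halves are
    merged only when the merged word is in the set; otherwise the hyphen
    is kept, which protects real compounds such as "well-known".
    Without `known_words`, every split is merged.
    """
    # Soft hyphens (U+00AD) are invisible layout hints; drop them outright.
    text = text.replace("\u00ad", "")

    def join(match):
        left, right = match.group(1), match.group(2)
        merged = left + right
        if known_words is not None and merged.lower() not in known_words:
            return f"{left}-{right}"
        return merged

    return _SPLIT_WORD.sub(join, text)
```

Example: `dehyphenate("a conver-\nsion tool")` gives `"a conversion tool"`. If the paste has already flattened the line breaks, this approach cannot tell a line-end hyphen from a real one, which is part of why a cleaner source beats cleanup.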

#8  BetterRed 08-20-2019, 07:45 PM
Quote joebob2a
This book has come to be through a pretty roundabout process, as you might suspect. It was originally written in M$ Word/OpenOffice, sucked into QuarkXPress, and then output as PDF in print form. Many corrections had happened between the original word processing files and the Quark files. I used a web utility (https://www.online-convert.com/) to get from PDF back to Word, but then I had the page headers and footers to worry about, not to mention typesetting issues like no space after periods and embedded hyphens. I seriously don't want to go back to that! I have the original Quark source, but I haven't found a conversion tool to get it out of that format.

On the Smashwords site it talks about a "nuclear option," i.e. copy and paste the entire document into a Word document and re-convert it. I'm tinkering enough right now, I may go that direction.
I had QuarkXPress in my "Word, InDD, Writer" list - but I took it out on the basis of 'surely not'.

You can open PDF files directly in MS Word 2016/19, and the result can be surprisingly good - but I suspect that's because the documents I'm thinking of were originally typed into Word by someone who didn't regard it as a Remington portable. An ex-QuarkXPress PDF might not fare so well.

BR

#9  BetterRed 08-20-2019, 07:50 PM
Quote jackie_w
This made me smile (it's been a slow day). Are you using predictive text by any chance?
No, none of that AI crap - I turn it off. IIRC the coffee/pasta word play comes from my MIT/DEC days, along with bang, crunch, snail and hat

BR

#10  joebob2a 08-22-2019, 12:16 PM
Quote Tex2002ans
So the Quark file is the up-to-date version?
No, I've already invested significant time in cleaning up the PDF. The epub version is pretty close to where I want it, but it has all these technical issues that the validators don't like.

Quote Tex2002ans
What's the file extension on the Quark file? QXD?

Do you happen to know which version of Quark it used?
The source files are .qxd files. I know it was generated on a Mac. Unknown as to version, but it's more than ten years old.

Quote Tex2002ans
(And ~ when this book was published?)
It went to print in 2009, just as the e-book revolution was turning the corner. I'm working on an e-book version because there's a surge in demand, and I just want it out there.


Quote Tex2002ans
I only worked on one QXD file many years ago, and surprisingly, LibreOffice was able to open it. It still required a lot of elbow grease, but it was a huge step up from having to OCR from scratch.

... no. Just no.
Amen to the No. LibreOffice wanted to turn the PDF files into graphics -- each page an image. The QXP files looked like random bits in LibreOffice.

Quote Tex2002ans
And depending on how the PDF was put together, that copy/paste itself might introduce a massive number of issues as well (like the hard hyphens issue you mentioned).

You'll spend more time cleaning up all those errors than if you just worked from much cleaner OCR in the first place.
At this point, I'm just looking for something to fix the validation errors. I'm tempted to edit the html files in a text editor with group replace to correct the flagged errors, but I need to know the correct replacement for each of those errors. I had an earlier post talking about how Sigil was having trouble consolidating HTML files. Calibre was able to merge the files without breaking things, so that's now a viable option. I now have one html file for each of the eight major chapters, as opposed to dozens.

What's surprising to me is that there are all these great conversion utilities, yet nothing that addresses the validator errors.

Thanks again for all the help. I'll keep plugging on this.
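
Editor's sketch: the "group replace" idea above can be scripted so that every chapter file gets the same fixes and each change is counted. The thread never names the actual validator errors, so the two rules below (bare `<br>` tags and unescaped ampersands) are purely hypothetical placeholders; `RULES`, `apply_rules`, and `fix_folder` are illustrative names, and the `*.html` glob is an assumption about how the files are named.

```python
import re
from pathlib import Path

# Hypothetical rules: swap in patterns for the errors your validator reports.
RULES = [
    (re.compile(r"<br\s*>"), "<br/>"),       # unclosed <br> is invalid in XHTML
    (re.compile(r"&(?!#?\w+;)"), "&amp;"),   # bare & that is not an entity
]


def apply_rules(html, rules=RULES):
    """Apply each (pattern, replacement) pair; return new text and change count."""
    total = 0
    for pattern, replacement in rules:
        html, count = pattern.subn(replacement, html)
        total += count
    return html, total


def fix_folder(folder, rules=RULES, dry_run=True):
    """Run the rules over every .html file in `folder`.

    With dry_run=True (the default) nothing is written; the returned dict
    just reports how many replacements each file would receive.
    """
    report = {}
    for path in sorted(Path(folder).glob("*.html")):
        original = path.read_text(encoding="utf-8")
        fixed, count = apply_rules(original, rules)
        report[path.name] = count
        if count and not dry_run:
            path.write_text(fixed, encoding="utf-8")
    return report
```

Running with `dry_run=True` first and working on a copy of the unpacked epub keeps a bad pattern from damaging the only cleaned-up version. Regex replacement is blunt for markup, so re-run the validator after each rule is added.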
