Mobileread
Importing Open Office HTML in Sigil
#1  paulpeer 03-11-2010, 09:07 AM
In another thread a user asked how to make an ePub out of an OpenOffice file. Valloric responded "just export to HTML and import in Sigil". It's a bit more complicated than that.

- A first remark is that OpenOffice Writer exports to HTML and not to XHTML. Sigil transforms a lot of HTML elements (P, DIV, H1...) into their lower case equivalents (p, div, h1), but a lot of them are not touched.

- The first group of elements that are not touched are the A-elements. If your OpenOffice document has notes, they are exported in this way:

Code
<A CLASS="sdfootnoteanc" NAME="sdfootnote1anc" HREF="#sdfootnote1sym"><SUP>1</SUP></A>
Sigil transforms this to:

Code
<a CLASS="sdfootnoteanc" HREF="#sdfootnote1sym" NAME="sdfootnote1anc"><span><sup>1</sup></span></a>
So CLASS and HREF are still in their upper case form, and NAME is not changed to "id". Hence most readers do not understand that this is a footnote, and a first job is to search and replace all occurrences of "NAME" with "id", "CLASS" with "class" and "HREF" with "href". After doing that, you'll see that the notes suddenly become blue links and work.
Is this a job Sigil could do automatically?
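
Until then, the search and replace described above can be scripted. What follows is only a minimal sketch in Python, not anything Sigil does itself: the function name `fix_anchor_attributes` is made up for illustration, and it assumes that attribute values never contain a ">" character or the text "NAME=", "CLASS=" or "HREF=". It restricts the replacement to opening A tags so that ordinary body text is left alone.

```python
import re

# Renames described in the post: NAME becomes id, the other two are lowercased.
RENAMES = {"NAME": "id", "CLASS": "class", "HREF": "href"}

# An opening <a ...> or <A ...> tag (assumes no ">" inside attribute values).
A_TAG = re.compile(r"<a\b[^>]*>", re.IGNORECASE)
# One of the three uppercase attribute names, directly followed by "=".
ATTR = re.compile(r"\b(NAME|CLASS|HREF)(?==)")


def fix_anchor_attributes(html: str) -> str:
    """Rename NAME/CLASS/HREF inside opening <a> tags only (hypothetical helper)."""
    def fix_tag(tag_match):
        return ATTR.sub(lambda m: RENAMES[m.group(1)], tag_match.group(0))
    return A_TAG.sub(fix_tag, html)


sample = ('<a CLASS="sdfootnoteanc" HREF="#sdfootnote1sym" '
          'NAME="sdfootnote1anc"><span><sup>1</sup></span></a>')
print(fix_anchor_attributes(sample))
```

Run on the Sigil output quoted above, this should turn CLASS, HREF and NAME into class, href and id while leaving the values untouched.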

- The second problem is about images. If the original OpenOffice document has images, they are exported as different files with links from within the HTML document, e.g.

Code
<IMG SRC="../Provizore/Grafo_html_m26feaff4.jpg" NAME="Afbeeldingen4" ALIGN=LEFT WIDTH=310 HEIGHT=281 BORDER=0>
After import into Sigil, the only thing changed is "IMG" which is now "img". But even if you change "SRC" to "src", Sigil does not find the images. I haven't found an easy way to deal with this problem so far.

- Lastly, a big group of uppercase attributes remains in the Sigil file, such as DIR, LANG, ALIGN, CLASS, CONTENT, HTTP-EQUIV etc. Many of them you can just remove; for others, such as STYLE, you may want to adapt the CSS file.

I'm not complaining about Sigil. It does a great job. But it leaves a lot of work for us!

#2  Valloric 03-11-2010, 09:42 AM
Most of your problems stem from attributes being in all uppercase. I used to let Tidy convert them all to their lowercase equivalents (it does this by default), but this ended up wreaking havoc on SVG attributes. SVG has lovely case-sensitive attributes like "viewBox", and "viewbox" or "VIEWBOX" don't work. So I had to hack Tidy into leaving attributes in whatever case they came in.

I made this change months ago, and no one complained thus far. I plan on taking a look into making Tidy convert uppercase attributes to lowercase, but leaving mixed-case attributes alone. Sounds simple, but if you've ever taken a look into Tidy source code, you'd quickly realize it's not, mostly because Tidy source is a horrible mess of unreadable spaghetti C code.

#3  paulpeer 03-11-2010, 10:03 AM
Quote Valloric
I made this change months ago, and no one complained thus far. I plan on taking a look into making Tidy convert uppercase attributes to lowercase, but leaving mixed-case attributes alone.
Maybe it's easier to persuade the OpenOffice guys to make their program export to plain vanilla XHTML instead of the old HTML ...

#4  Valloric 03-11-2010, 11:25 AM
I've just fixed this in Tidy. Future versions of Sigil will fold uppercase attributes to lowercase, and mixed-case attributes will be left as is.
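
For readers who want to see the rule spelled out, here is a rough Python sketch of the folding behaviour described above. It is not Tidy's code: the names `fold_name` and `fold_attributes` are invented for illustration, only attribute names are touched (tag names and values are left alone), and it assumes attribute values contain no "<" or ">" characters. Names that are entirely uppercase are lowercased; anything mixed-case, such as "viewBox", is left as is.

```python
import re

# An opening tag; assumes no "<" or ">" inside attribute values.
TAG = re.compile(r"<[A-Za-z][^<>]*>")
# An attribute name followed by "=" and a quoted or unquoted value.
# Matching the whole value keeps text inside quotes from being rescanned.
ATTR = re.compile(r'''([^\s=<>/"']+)(\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+))''')


def fold_name(name: str) -> str:
    """Lowercase all-uppercase names; leave mixed-case names unchanged."""
    return name.lower() if name.isupper() else name


def fold_attributes(html: str) -> str:
    """Apply fold_name to every attribute name that has a value."""
    def fix_tag(tag_match):
        return ATTR.sub(lambda m: fold_name(m.group(1)) + m.group(2),
                        tag_match.group(0))
    return TAG.sub(fix_tag, html)


print(fold_attributes('<img SRC="pic.jpg" ALIGN=LEFT WIDTH=310>'))
print(fold_attributes('<svg viewBox="0 0 10 10">'))
```

The first call should give lowercase src, align and width with the values unchanged, and the second should come back untouched because "viewBox" is mixed-case.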

#5  paulpeer 03-11-2010, 11:33 AM
Quote Valloric
I've just fixed this in Tidy. Future versions of Sigil will fold uppercase attributes to lowercase, and mixed-case attributes will be left as is.
Uauuu! You're amazing. Thank you!
And have you seen the part about images in my post? This isn't just an uppercase/lowercase question, is it?

#6  Valloric 03-11-2010, 11:41 AM
Quote paulpeer
Uauuu! You're amazing. Thank you!
And have you seen the part about images in my post? This isn't just an uppercase/lowercase question, is it?
It's caused by the same issue.

#7  roger64 03-16-2010, 09:40 AM
I just tried exporting directly from OpenOffice as HTML. The results were very unsatisfactory: the text looked scrambled in Sigil, and so on. I gave up.

I then used an OpenOffice extension called "writer2xhtml", which gives the user more control and exports the .odt file as so-called "strict html". The results look far better.
http://extensions.services.openoffice.org/project/writer2xhtml

Sigil opened the file without any complaint or visible defect. I could process it easily (checking the TOC, filling in metadata,...) and save it as an epub file. But once the epub was on my PRS-505, I got an "Error Page!" and the file wouldn't load.

I tried many small changes to no avail.

#8  KevinH 03-16-2010, 10:31 AM
Hi roger64,

There seem to be many possible reasons for an "Error Page", but one of them is that you have not properly split the file into sections that are smaller than 260K bytes. That seems to be the upper limit on chapter size for the Sony family of e-readers and many others.

So make sure you have broken the single large html file into sections, 1 file for each chapter.

If you have already done that, then look for a particularly large or long chapter and split that one as well at some appropriate point.

KevinH
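
A quick way to find the oversized sections is to list the sizes of the HTML files inside the ePub, which is just a zip archive. The Python sketch below is only an illustration: the function name `oversized_documents` is made up, the 260K figure is taken from this post as an approximate limit, and comparing it against the uncompressed size of each file is an assumption.

```python
import zipfile

# Approximate per-file limit mentioned in the post (an assumption, not a spec).
LIMIT_BYTES = 260 * 1024


def oversized_documents(epub_path, limit=LIMIT_BYTES):
    """Return (filename, uncompressed size) for each HTML/XHTML file over limit."""
    with zipfile.ZipFile(epub_path) as epub:
        return [
            (info.filename, info.file_size)
            for info in epub.infolist()
            if info.filename.lower().endswith((".xhtml", ".html", ".htm"))
            and info.file_size > limit
        ]
```

Calling `oversized_documents("book.epub")`, with "book.epub" standing in for your own file, returns a list of the files that are candidates for further splitting; an empty list means every section is under the limit.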

#9  paulpeer 03-16-2010, 10:36 AM
Have you tried checking your ePub with Validator? http://threepress.org/document/epub-validate/
This often gives a good hint.

#10  roger64 03-16-2010, 10:44 AM
OK, solved. I split my file into five sections with "chapter breaks", including one for an image.
I also removed the second image, which was in .gif format.

One of these two things did the trick. BTW the end result is perfect. So I would recommend using this extension which works well with Sigil.
I use OpenOffice 3.2 with Linux.

So I did not need to try the Validator. I'll keep it for next time.

Thanks very much for your help.
