Remove the non-indexed elements?!
#1  un_pogaz 03-18-2019, 04:21 AM
When opening an ePub, Sigil only loads the indexed files into the <manifest>, and will delete any non-indexed files.
I understand this behavior which aims to remove parasitic files added by other unscrupulous and indelicate software.

However, more than once I have opened an ePub and found only the HTML pages, with no images or CSS sheets.
Yet those files are present in the ePub; they are simply not indexed in the <manifest> (probably because of unscrupulous software that confuses <manifest> and <spine>).

What I reproach Sigil for is the unilateral decision to delete non-indexed files, without warning or consultation of the user, because the risk of completely breaking an ePub is important.
It would be nice if, when opening, Sigil compared the <manifest> against the contents of the ZIP archive and, when a non-indexed file is found, asked the user whether to add it to the <manifest>.
Ideally, we would get a checklist of the files we want to keep or not.

#2  DiapDealer 03-18-2019, 06:18 AM
Sigil's every action is predicated on the notion that the opf is entirely correct. And every action Sigil undertakes ensures that the opf stays that way. Sigil can't warn you about unmanifested xhtml files when it opens an epub, because it doesn't know they exist. It's not purposely deleting them. It hasn't made a "unilateral decision." It just hasn't loaded them. Because the opf (the boss) didn't tell it to do so when it was being parsed. What you're asking for would take a complete overhaul of how Sigil opens epubs, or a complete overhaul in how they're saved. Neither would be a trivial undertaking.

#3  KevinH 03-18-2019, 09:17 AM
Sounds like you should create an input plugin that walks the contents of the epub's zip file and adds any files present but not manifested that are css, image, font, or xhtml files to the manifest, and the xhtml files to the end of the spine in some pattern (but what order?). You might want to include unmanifested javascript files as well.
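
A rough sketch of the manifest side of that idea, in plain Python and without touching Sigil's plugin API. The function name `manifest_items`, the `addedN` ids, and the extension table are all assumptions made for illustration; a real tool would also need to avoid id collisions with existing manifest entries. Spine order here is simply alphabetical, which is arbitrary.

```python
import posixpath
import xml.etree.ElementTree as ET

# Assumed extension-to-media-type table for the file kinds named above.
MEDIA_TYPES = {
    ".xhtml": "application/xhtml+xml",
    ".css": "text/css",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".svg": "image/svg+xml",
    ".ttf": "font/ttf",
    ".otf": "font/otf",
    ".woff": "font/woff",
    ".js": "text/javascript",
}


def manifest_items(hrefs):
    """Build <item> elements for hrefs given relative to the OPF.

    Returns (items, spine_ids). Files with an unknown extension are
    skipped so that the user can decide what to do with them.
    """
    items, spine_ids = [], []
    for n, href in enumerate(sorted(hrefs), start=1):
        ext = posixpath.splitext(href)[1].lower()
        media_type = MEDIA_TYPES.get(ext)
        if media_type is None:
            continue
        item_id = f"added{n}"  # hypothetical id scheme
        items.append(
            ET.Element(
                "item",
                {"id": item_id, "href": href, "media-type": media_type},
            )
        )
        # Only xhtml documents are candidates for the end of the spine.
        if media_type == "application/xhtml+xml":
            spine_ids.append(item_id)
    return items, spine_ids
```

The returned elements could then be appended to the OPF's <manifest>, and an <itemref> added to the <spine> for each id in `spine_ids`.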

It appears that these bad "epubs" are nothing more than a zip of a website that has been scraped.

Alternatively, unzip these "books" first and then use Sigil's Add Existing ... menu to pull in the pieces you want.

#4  KevinH 03-18-2019, 10:00 AM
Also, since a spine is made up from manifest ids, any file not manifested cannot be in the spine anyway. So no spine order would be relevant either.

That is not even close to being an epub. That is just a zip archive.

#5  DiapDealer 03-18-2019, 07:33 PM
Someone is clearly changing the extension of an .htmlz archive to .epub in my opinion. The metadata.opf file they often contain is woefully insufficient for the purposes of epub editing.

#6  DNSB 03-19-2019, 12:17 AM
Quote un_pogaz
What I reproach Sigil for is the unilateral decision to delete non-indexed files, without warning or consultation of the user, because the risk of completely breaking an ePub is important.
If the files in the archive don't match up to the files in content.opf, the epub is already completely broken so it's a bit late to talk about the risk of completely breaking an epub. Very rarely—if ever—does Garbage In produce Gospel Out.

A couple of quotes:

From the epub3 standard:
All Publication Resources must be referenced from the manifest, regardless of whether they are included in the EPUB Container or made available remotely.

and from Mukli Krisztián's epub boilerplate:

The content.opf file is the most important part of the EPUB package, because it defines the structure of the eBook and the metadata.

Manifest Section – This section is a list of all the content files, media, fonts, and stylesheets used in the eBook. The files can be listed in any order. However, you should not include a file in the Manifest Section that is not in the EPUB package. Also, you should not have undeclared files in the EPUB package that have not been declared in the Manifest Section.

#7  un_pogaz 03-19-2019, 04:28 AM
I agree that, theoretically, a file not indexed in the <manifest> "does not exist" in the ePub, and that technically the ePub is therefore broken.
But not entirely, because the files are actually used (in <link> and <img>), and failing to import them breaks the ePub even more.
Problems that could have been fixed become unfixable.

Is it a conversion problem? Possibly. But the point is that not every tool respects the standard, so a little caution would be good.
The most vicious thing is that sometimes some files are indexed and others are not.

I am not asking for a systematic addition, just a check, which can be done using the "file table" of the ZIP archive. There is no need to decompress any other files: just load the OPF in memory and read the entries of the <manifest> (if Sigil were in C#, I could have tried it).
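
The check described above can be sketched in a few lines. This is in Python rather than Sigil's own code, and it is only an illustration under stated assumptions: the function names `find_unmanifested` and `make_broken_epub` and the sample file names are hypothetical, the archive is assumed to contain a standard META-INF/container.xml pointing at the OPF, and hrefs are only percent-decoded and normalized, nothing more.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from urllib.parse import unquote

CONTAINER_NS = "{urn:oasis:names:tc:opendocument:xmlns:container}"
OPF_NS = "{http://www.idpf.org/2007/opf}"


def find_unmanifested(epub):
    """Return archive entries that the OPF <manifest> does not list.

    `epub` is a path or file-like object. Only the ZIP directory,
    container.xml and the OPF are read.
    """
    with zipfile.ZipFile(epub) as zf:
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find(f".//{CONTAINER_NS}rootfile")
        opf_path = rootfile.get("full-path")
        opf_dir = posixpath.dirname(opf_path)
        opf = ET.fromstring(zf.read(opf_path))

        # Manifest hrefs are relative to the OPF; make them archive paths.
        manifested = set()
        for item in opf.iter(f"{OPF_NS}item"):
            href = unquote(item.get("href", ""))
            manifested.add(posixpath.normpath(posixpath.join(opf_dir, href)))

        ignored = {"mimetype", opf_path}
        return sorted(
            name
            for name in zf.namelist()
            if not name.endswith("/")             # directory entries
            and not name.startswith("META-INF/")  # container metadata
            and name not in ignored
            and name not in manifested
        )


def make_broken_epub():
    """Build a tiny in-memory archive whose manifest omits two files."""
    container = (
        '<container xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
        '<rootfiles><rootfile full-path="OEBPS/content.opf"/></rootfiles>'
        "</container>"
    )
    opf = (
        '<package xmlns="http://www.idpf.org/2007/opf"><manifest>'
        '<item id="ch1" href="Text/ch1.xhtml"'
        ' media-type="application/xhtml+xml"/>'
        "</manifest></package>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr("META-INF/container.xml", container)
        zf.writestr("OEBPS/content.opf", opf)
        zf.writestr("OEBPS/Text/ch1.xhtml", "<html/>")
        zf.writestr("OEBPS/Styles/main.css", "")
        zf.writestr("OEBPS/Images/cover.jpg", b"")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for name in find_unmanifested(make_broken_epub()):
        print("not in manifest:", name)
```

Run on a real file with `find_unmanifested("book.epub")`; the resulting list is what a warning dialog or checklist would be built from.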

After that, Sigil works normally and creates a conforming ePub.

If it proves too complex to implement, okay, that is a pity, but I just wanted to report this problem and ask for some thought on how to solve it.

To answer KevinH: so you are asking me to 1) open each ePub I want to work on with a "ZIP opener", 2) compare the <manifest> in the OPF against the contents of the ZIP, and 3) add the "forgotten" entries? By hand?
This is possible, but isn't the goal of software to automate tedious, repetitive and sometimes complex tasks? Might as well take advantage of it.

#8  Thasaidon 03-19-2019, 05:22 AM
Have a look at the "Modify ePub" plugin for Calibre. This plugin can perform certain jobs on ePubs including "Add unmanifested file to Manifest".

If you run the "Modify ePub" plugin on your ePub before opening it in Sigil, it may very well prevent the problem you are having.

#9  un_pogaz 03-19-2019, 07:00 AM
Thanks, I'll look at that.
(But integrating this safety check would still be a plus.)

#10  DiapDealer 03-19-2019, 08:53 AM
What you're trying to open is not technically even an epub. It's just been given an .epub extension. The opf file in that kind of archive is for metadata purposes only. Sigil's not going to accommodate any-old zip archive full of html/css/images just because the extension claims it's an .epub. It must actually BE an epub (or it will be forced to be). And being an epub means certain strictures have to apply.
