Mobileread
On repairing defective apnx files.
#1  j.p.s 10-11-2019, 06:57 PM
(Please don't use this thread to vent that ebooks shouldn't have page numbers, what the numbering scheme should be, or other off topic posts.)

Three of the books I've read recently with amazon-supplied page numbers started out OK, but the page numbers got increasingly screwy near the end. My best guess is that books with extensive notes, bibliographies, etc. are prone to having HREF targets that look like page anchors to whatever tools publishers use to generate the <pageList> section in toc.ncx, and that this corrupts the pagination data in that file.

Details:
A Brief History of Everyone Who Ever Lived by Adam Rutherford Sep 2016, Weidenfeld & Nicolson

Utopia for Realists by Rutger Bregman March 2017, Little, Brown

Bad Blood by John Carreyrou May 2018, Knopf

These aren't cheap books, but sometimes they go on sale, and I assume they have been out long enough not to have long wait lists at libraries. I'm curious whether the commercial EPUB versions have page-number problems. (I don't check ebooks from libraries.) I did check out paper copies of all three to compare page numbers. The Utopia for Realists paper book had significantly lower page numbers than the amazon-supplied numbers throughout the book. Bizarrely, the page anchors in the ebook matched the paper book, as did the repaired ebook page numbers.

I thought it might be possible to repair the apnx files, but that it would be difficult to figure out how, and tedious and time consuming to do. Then I saw post#2 by Doitsu in this thread: https://www.mobileread.com/forums/sh...d.php?t=255926

I don't think kindleunpack has an option to make an epub whose toc.ncx has a <pageList> section, but it turned out to be relatively easy to use kindleunpack -> (some regex and scripting) -> kindlegen -> kindleunpack to get repaired apnx files.

The first step is to look at some of the Text/part0*.xhtml files to learn the form of the page anchors used in the book. Next, make a list of file-name/anchor-id pairs and use it to generate a <pageList> section to insert just ahead of the closing </ncx> tag in toc.ncx, after removing anything fishy that might have crept into the list. Then run kindlegen on the augmented EPUB. The only thing you need from the resulting fat mobi is the apnx file, which should be renamed to match the one supplied by amazon for the book.
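
For reference, the <pageList> entries that go into toc.ncx look roughly like this (the file name, id, and page value below are made up for illustration; the real values come from the anchors found in the first step):
Code
<pageList>
  <pageTarget id="page_3" type="normal" value="3" playOrder="3">
    <navLabel><text>3</text></navLabel>
    <content src="Text/part0004.xhtml#page_3"/>
  </pageTarget>
  <!-- one pageTarget per page anchor, in reading order -->
</pageList>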

The really good news is that the new apnx file can be copied straight to the sdr directory for the book, overwriting the existing file. (I did this with the book closed on the kindle.) This doesn't seem to faze the kindle at all. The next time the book is opened, the page numbers are correct.

Attached is a perl script to generate a <pageList> section from a list of file-name/page-id pairs.
[gz] gen_pagelist.pl.gz (392 Bytes, 21 views)
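
For anyone who doesn't want to grab the attachment, a rough sketch of such a generator is below. It is not the attached script; it assumes an input of "file id" pairs, one per line (e.g. "Text/part0004.xhtml page_3"), and that the page label is the trailing digits of the id:
Code
#!/usr/bin/perl
# Sketch only, not the attached gen_pagelist.pl.
# Reads "file id" pairs (one per line) from stdin or a file argument and
# prints a <pageList> block suitable for pasting into toc.ncx.
use strict;
use warnings;

my $order = 1;
print "<pageList>\n";
while (my $line = <>) {
    chomp $line;
    my ($file, $id) = split /\s+/, $line;
    next unless defined $id;
    my ($label) = $id =~ /(\d+)\s*$/;    # page label assumed to be the trailing digits of the id
    next unless defined $label;
    print qq{  <pageTarget id="$id" type="normal" value="$label" playOrder="$order">\n};
    print qq{    <navLabel><text>$label</text></navLabel>\n};
    print qq{    <content src="$file#$id"/>\n};
    print qq{  </pageTarget>\n};
    $order++;
}
print "</pageList>\n";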

#2  j.p.s 10-26-2019, 02:29 PM
It was suggested to me privately that Doitsu's pagelist sigil plugin, https://www.mobileread.com/forums/sh...d.php?t=265237, might be useful in repairing books with page number problems.

I tried it on Bad Blood: Secrets and Lies in a Silicon Valley Startup, which uses code like
Code
<span id="page_3" epub:type="pagebreak" title="3"></span>
to mark pages, and that worked.

Unfortunately, A Brief History of Everyone Who Ever Lived uses
Code
<a id="page_1"></a>
and Utopia for Realists uses
Code
<a id="page-1"></a>
and the pagelist plugin does not work for either.
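
One possible workaround (my suggestion, not something the plugin does) would be to normalize the bare anchors into the span form the plugin recognizes before running it, e.g. with a quick in-place edit of the unpacked Text/part0*.xhtml files. This assumes the epub: namespace is declared on the <html> element of those files; the -i.bak switch keeps .bak backups:
Code
#!/usr/bin/perl -pi.bak
# Sketch: rewrite bare page anchors such as <a id="page_1"></a> or
# <a id="page-1"></a> into pagebreak spans, keeping the original id so
# any internal links pointing at it still resolve.
# Usage (hypothetical script name): perl normalize_pages.pl Text/part0*.xhtml
s{<a id="(page[_-](\d+))">\s*</a>}
 {<span id="$1" epub:type="pagebreak" title="$2"></span>}g;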

When compatible page-number markup was used in the ebook production, the sigil pagelist plugin significantly simplifies the apnx repair workflow to kindleunpack -> sigil pagelist -> kindlegen -> kindleunpack. (Note that all that is needed from all of this is the replacement apnx file, which can be copied to the sdr directory on the kindle; the original azw3 file does not need to be changed.)

#3  KevinH 10-26-2019, 09:29 PM
You could use kindleunpack to generate the Adobe page-map.xml, then search for the custom form of id you want in order to verify and/or fix the page-map.xml file. Then add it to the unpacked epub, pass it through kindlegen, and strip out the apnx info into a separate file.
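
For reference, an Adobe page-map.xml is just a flat list of page name/target pairs, roughly like this (the names and paths below are made up):
Code
<page-map xmlns="http://www.idpf.org/2007/opf">
  <page name="1" href="Text/part0004.xhtml#page_1"/>
  <page name="2" href="Text/part0004.xhtml#page_2"/>
  <!-- one page element per page, in reading order -->
</page-map>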

#4  j.p.s 10-27-2019, 02:02 PM
Quote KevinH
You could use kindleunpack to generate the Adobe page-map.xml, then search for the custom form of id you want in order to verify and/or fix the page-map.xml file. Then add it to the unpacked epub, pass it through kindlegen, and strip out the apnx info into a separate file.
Thanks, that might lead to an optimal workflow for apnx repair. I'm currently having trouble with kindleunpack that's best dealt with in the kindleunpack thread, but I'm tied up with a bunch of other things at the moment.

But it does seem a bit ironic that the original apnx file isn't even required to make a good one. This implies that for books with page id targets transferred over USB, a "real page number" apnx file can be automatically generated without having to leave airplane mode to get one from amazon.

#5  j.p.s 11-02-2019, 05:45 PM
Quote KevinH
You could use kindleunpack to generate the Adobe page-map.xml, then search for the custom form of id you want in order to verify and/or fix the page-map.xml file. Then add it to the unpacked epub, pass it through kindlegen, and strip out the apnx info into a separate file.
I've used kindleunpack on each of the azw3 files, once with the original apnx file and once with the repaired apnx file that I generated by running kindlegen on epubs with a pageList appended to the toc.ncx. I think the resulting page-map.xml files support my speculation that the page information supplied by the publishers had multiple errors caused by HTML link targets being misidentified as page IDs.

The 3 pairs of page-maps are attached as pagemaps.zip
[zip] pagemaps.zip (12.3 KB, 13 views)

#6  KevinH 11-05-2019, 09:32 AM
Yes, the "bad" ones definitely look bad and the "fix" ones look much better. I am surprised as to why this happens as in older mobi 6 and mokbi 7 internally links are filepos info (file offsets) and in newer mobi8 they encode a base 32 file offset into a character based "id-like" equivalent. Both file offsets should be quite precise and not lead to what you are seeing.

Is it just moving in the wrong direction to get the exact link text? Are the "bad" and "fix" targets in any way close together?

That is very strange.

KevinH

#7  j.p.s 11-05-2019, 12:11 PM
Quote KevinH
Yes, the "bad" ones definitely look bad and the "fix" ones look much better. I am surprised as to why this happens as in older mobi 6 and mokbi 7 internally links are filepos info (file offsets) and in newer mobi8 they encode a base 32 file offset into a character based "id-like" equivalent. Both file offsets should be quite precise and not lead to what you are seeing.

Is it just moving in the wrong direction to get the exact link text? Are the "bad" and "fix" targets in any way close together?

That is very strange.

KevinH
Hence my theory that the publisher-generated EPUBs have a faulty page-map or pageList. Whatever automated tools they use must have been developed on plain books without many internal links and never got tested on books that have them. Maybe they should contract with Doitsu to supply a generalized version of his sigil plugin.

I don't think the problem is with the conversion to KF. The bad apnx is a case of garbage in, garbage out.

I don't have a way to get the commercial EPUBs, so I can't investigate further. (No account at EPUB retailer, library, etc. and unwillingness to have anything to do with Adobe.)

#8  j.p.s 11-09-2019, 10:35 AM
Quote KevinH
Yes, the "bad" ones definitely look bad and the "fix" ones look much better. I am surprised as to why this happens as in older mobi 6 and mokbi 7 internally links are filepos info (file offsets) and in newer mobi8 they encode a base 32 file offset into a character based "id-like" equivalent. Both file offsets should be quite precise and not lead to what you are seeing.

Is it just moving in the wrong direction to get the exact link text? Are the "bad" and "fix" targets in any way close together?

That is very strange.

KevinH
I've spent some time comparing the "bad" and "fix" page-map.xml files with each other, the part0*.xhtml files, and assembled_text.dat. I thought it would be easy to check a few of the references that don't match the page-number id pattern and see where they land relative to the actual page references. It turned out that I couldn't find any of them, so I guess that the bogus apnx files as delivered by amazon somehow cause kindleunpack to synthesize them.

I still think this is triggered by books with extensive footnotes, but I have lost some confidence in that. I'm still pretty sure a bogus pageList or page-map in the publisher-supplied EPUB is the cause, but I have no way to check.

#9  j.p.s 12-07-2019, 03:43 PM
Quote KevinH
Yes, the "bad" ones definitely look bad and the "fix" ones look much better. I am surprised as to why this happens as in older mobi 6 and mokbi 7 internally links are filepos info (file offsets) and in newer mobi8 they encode a base 32 file offset into a character based "id-like" equivalent. Both file offsets should be quite precise and not lead to what you are seeing.

Is it just moving in the wrong direction to get the exact link text? Are the "bad" and "fix" targets in any way close together?

That is very strange.

KevinH
I've played with this some more as I've found bits of time and gained a better understanding of apnx.

I paginated an EPUB by hand, inserting anchors based on a PDF scan of the book, and generated a pageList from a list of the anchors. The apnx file generated by running kindlegen on the EPUB and kindleunpack on the kindlegen output points to the opening "<" of each anchor in both the mobi7 and mobi8 raw markup. (The mobi7 markup has empty <a ></a> anchors.)

I had not previously looked into the pagination of books where I didn't notice any problems while reading. I wrote a script to dump the page table of offsets at the end of an apnx file and, optionally, 16 characters of the raw markup (assembled_text.dat) beginning at each offset. No commercial book matched perfectly at every page, but a few came close, with a couple matching on almost every page. Some were off by small amounts, others by larger ones, and sometimes the error was not a fixed amount. A few books had no anchors or spans indicating page boundaries at all, so for those I have no idea how accurate the apnx offsets are.

I'm attaching apnx_dump.pl
[pl] apnx_dump.pl (1.7 KB, 2 views)
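
For anyone who can't grab the attachment, a rough sketch of such a dumper is below. It is not the attached apnx_dump.pl and is based on my understanding of the apnx layout (two length-prefixed JSON-ish headers, then a 16-bit page count, then a table of big-endian 32-bit byte offsets into the raw text); if the real layout differs, the unpacking will need adjusting:
Code
#!/usr/bin/perl
# Sketch only, not the attached apnx_dump.pl.
# Usage: apnx_dump.pl book.apnx [assembled_text.dat]
use strict;
use warnings;

my $apnxfile = shift or die "usage: $0 book.apnx [raw markup file]\n";
my $rawfile  = shift;

open my $fh, '<:raw', $apnxfile or die "$apnxfile: $!\n";
my $data = do { local $/; <$fh> };

my $raw = '';
if (defined $rawfile) {
    open my $rfh, '<:raw', $rawfile or die "$rawfile: $!\n";
    $raw = do { local $/; <$rfh> };
}

# first header: uint32 version, uint32 offset of second header, uint32 length of first header string
my ($version, $hdr2_off, $len1) = unpack 'N3', $data;
my $hdr1 = substr $data, 12, $len1;

# second header: uint16 unknown, uint16 string length, the string itself,
# then uint16 page count, uint16 entry size in bits, then the offset table
my ($unknown, $len2) = unpack 'n2', substr $data, $hdr2_off, 4;
my $hdr2 = substr $data, $hdr2_off + 4, $len2;
my ($npages, $bits) = unpack 'n2', substr $data, $hdr2_off + 4 + $len2, 4;
my @offsets = unpack "N$npages", substr $data, $hdr2_off + 8 + $len2, 4 * $npages;

print "header 1: $hdr1\nheader 2: $hdr2\npages: $npages\n";
for my $i (0 .. $#offsets) {
    my $peek = (length($raw) > $offsets[$i]) ? substr($raw, $offsets[$i], 16) : '';
    $peek =~ s/\s+/ /g;                      # keep the peek on one line
    printf "page %4d -> offset %8d  %s\n", $i + 1, $offsets[$i], $peek;
}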
