Mobileread
The Mobipocket format: Starring Leonardo diCaprio and Kate Winslet
#1  schmidt349 11-24-2007, 04:37 PM
Long time lurker, first time poster.

So, having just come into possession of an Amazon Kindle, I thought I'd load it with some documents that I have in various text-based formats (DocBook XML being chief among them). I read that it supports the Mobipocket format, and being somewhat adept with Perl, I figured I'd whip up some conversion software with the help of CPAN.

What follows is a tale of horror such as you can't imagine.

Just for a lark, I started with the following:

$ strings KindleUsersGuide.azw

and got this:

Kindle_Users_Guide
BOOKMOBI
MOBI
EXTH
08/01/2007
Amazon.com
Amazon.com
REF000000
Reference
Q<P>An overview of all the Amazon Kindle features and how to use them.</P>
Kindle User's Guide
<html><head><guide><reference
itle="Table
ontents"
toc"
ilepos=0
232 /
.Welcome
start
8616

{snip}

No worries, I thought, it's not a ZIP or a GZIP archive, so they're probably working their own proprietary mojo with some kind of compressed container format. I can see strings up top that look like they're pretty clearly identifiers of some kind (Kindle_Users_Guide, BOOKMOBI, MOBI, and EXTH). I wasn't all that enthusiastic about reverse-engineering somebody's proprietary binary file format, so I visited Mobipocket's web site to look for a document specification.

That was my first mistake.

Nowhere do the Mobipocket people actually give you the secret sauce for their file format. No C code examples, no header or binary structure descriptions, nothing. Their Windoze-only "Mobipocket Creator," despite being marketed as "free software," is anything but -- I almost wish the FSF had a trademark on that term so they could do a legal beatdown on anyone who calls their software "free" just because it doesn't cost anything.

So, no help whatsoever from the Mobipocket crowd. I did discover, though, while browsing their forums that file extensions "prc" and "pdb" are synonyms for "mobi". So I Googled those, thinking that maybe someone somewhere had already done my homework for me.

I knew something was wrong immediately when I was redirected to a bunch of Palm OS-related websites. Imagine my horror when I found out that the mobipocket document container is actually a Palm Database file, a monstrosity that stores everything in a bizarre nonstandard record structure instead of a nice friendly POSIX-compliant directory hierarchy. The sauce on the goose: it stores data in big-endian format because it was originally designed to be used by the very first Palm Pilots, which had Motorola 68000-series microprocessors in them. Wow.
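To make the record structure a bit more concrete, here is a minimal Python sketch of reading the record list out of a Palm Database image. The field positions (a 78-byte header, a big-endian 16-bit record count at byte 76, then one 8-byte entry per record) come from the commonly documented Palm Database layout rather than from anything in this thread, and the function names and the synthetic sample are purely hypothetical.

```python
import struct

PDB_HEADER_SIZE = 78  # fixed-size Palm Database header (assumed layout)


def read_record_offsets(data: bytes) -> list:
    """Return the file offset of every record in a Palm Database image."""
    if len(data) < PDB_HEADER_SIZE:
        raise ValueError("too short to be a Palm Database")
    # Record count: unsigned 16-bit, big-endian (">H"), stored at byte 76.
    (num_records,) = struct.unpack_from(">H", data, 76)
    offsets = []
    for i in range(num_records):
        # Each 8-byte entry: 4-byte offset, 1-byte attributes, 3-byte unique ID.
        (offset,) = struct.unpack_from(">I", data, PDB_HEADER_SIZE + 8 * i)
        offsets.append(offset)
    return offsets


def build_sample() -> bytes:
    """Build a tiny synthetic two-record database (illustration only)."""
    header = bytearray(PDB_HEADER_SIZE)
    struct.pack_into("32s", header, 0, b"Kindle_Users_Guide")  # database name
    header[60:68] = b"BOOKMOBI"                                # type + creator
    struct.pack_into(">H", header, 76, 2)                      # two records
    first = PDB_HEADER_SIZE + 2 * 8  # records start right after the entry list
    entries = struct.pack(">IBBBB", first, 0, 0, 0, 0)
    entries += struct.pack(">IBBBB", first + 5, 0, 0, 0, 1)
    return bytes(header) + entries + b"hello" + b"world"
```

Calling `read_record_offsets(build_sample())` yields the two offsets where the sample's records begin; each record runs from its own offset up to the next one (or to the end of the file). The `>` in every format string is what handles the big-endian byte order.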

Thankfully CPAN has modules for everything, so I fired up Palm::PDB and Palm::Doc, almost hoping that they wouldn't be able to parse the file. However, they didn't have any problems grokking the file structure, and my worst fears were realized.

These examples of rotten HTML are drawn from the finally decoded content of the Amazon Kindle Manual, which I grabbed off the device.

Let's start at the beginning:

<p width="0em"><font face="serif">Thank you for purchasing Amazon Kindle. You are reading the Welcome section of the <i>Kindle User's Guide</i>. This guide provides an overview of Kindle and highlights a few basic features so you can start reading as quickly as possible.</font></p>

After I came to and peeled my face out of the keyboard I'd just spent five minutes banging it into, I glanced behind myself reflexively, half-expecting to see a blue police box or Billy Pilgrim or some other indication that I had been flung back in time to 1997.

The <font> element is one of those great horrors that we thankfully put to rest with HTML 4.0, XHTML 1.0, and CSS BEFORE THE END OF THE LAST CENTURY. So's the i tag. These are all examples of HTML 3.2-type mixing of document structure and formatting, which isn't supposed to happen under any circumstances in this day and age. You're supposed to use the style attribute along with the generic inline <span> container.
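For illustration, the span-plus-style rewrite could be roughed out in Python like this. It is only a sketch under a strong assumption: every <font> tag carries nothing but a double-quoted face attribute, as in the sample above. The function name is made up, and regexes are a stand-in for a real HTML parser.

```python
import re

# Assumption: every <font> tag has exactly one attribute, a double-quoted face.
FONT_OPEN = re.compile(r'<font\s+face="([^"]*)"\s*>', re.IGNORECASE)
FONT_CLOSE = re.compile(r"</font\s*>", re.IGNORECASE)


def font_to_span(markup: str) -> str:
    """Rewrite <font face="X">...</font> as <span style="font-family: X">...</span>."""
    markup = FONT_OPEN.sub(r'<span style="font-family: \1">', markup)
    # Closers are rewritten unconditionally, so a <font> that did not match
    # the pattern above would end up with a mismatched </span>.
    return FONT_CLOSE.sub("</span>", markup)
```

Feeding it `<font face="serif">Thank you</font>` gives `<span style="font-family: serif">Thank you</span>`.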

How am I supposed to convert into a format that doesn't even validate as HTML 3.2? How am I expected to use a monstrosity that doesn't conform to ANY of the ebook standards we've established over the last ten years?

The IDPF people have been working on these problems for ages. They came up with a bunch of specifications years ago that would have prevented this nightmare. But this was like the greatest hits of Netscape 2.0. I saw the <center> tag. I saw <li> tags that weren't closed. I saw illegal entities like &. Craploads of tags had the wrong capitalization for their closers (i.e., <h4></H4>). Picture references didn't comply with Dublin Core or anything even close to standard. Hell, I half-expected to run into <blink> and <marquee>.
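A quick way to tally this kind of thing is sketched below in Python, using the standard library's html.parser. The list of "deprecated" tags is just the ones named in this post, and the class and function names are hypothetical. It counts opening tags only, so unclosed <li> elements and mixed-case closers would need separate checks.

```python
from collections import Counter
from html.parser import HTMLParser

# Presentational tags named in the post; not an exhaustive list.
DEPRECATED = {"font", "center", "blink", "marquee"}


class DeprecatedTagCounter(HTMLParser):
    """Count opening tags that belong to the DEPRECATED set."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over tag names already lowercased.
        if tag in DEPRECATED:
            self.counts[tag] += 1


def count_deprecated(markup: str) -> dict:
    """Return a {tag: occurrences} mapping for deprecated tags in markup."""
    parser = DeprecatedTagCounter()
    parser.feed(markup)
    parser.close()
    return dict(parser.counts)
```

Run against the opening paragraph quoted earlier, `count_deprecated` reports a single `font` tag.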

I could not believe I was looking at document markup from the user guide to a device that's supposed to be bleeding-edge.

If someone on this forum is from Mobipocket, I want to know how in good conscience you can continue to use a completely proprietary document container and HTML that looks like I wrote it back in 5th grade. To everyone else I recommend in the strongest possible terms that this format be avoided wherever possible.

I really, really hope that Amazon adds .epub support to the Kindle sometime soon. I already tried loading a document in that format on the device but was told it's unsupported. Otherwise you really are going to have to rely on Amazon for all your content, and good luck using it anywhere else even without DRM.

#2  Nate the great 11-24-2007, 04:47 PM
You know, a large number of people try to claw their eyes out when they first see the insides of a Mobipocket file. I'm glad it didn't happen to you.

Welcome to MobileRead.

#3  HarryT 11-24-2007, 04:50 PM
Why not just use the tools that the rest of us use to generate Mobi files - Book Designer or the Mobi Desktop Reader - instead of beating yourself up over something you have no control over?

Yes, Mobi files are Palm Resource (PRC) files; the "BOOKMOBI" string you found is what identifies the specific data contained in the file. "TEXTREAD" is an alternate that you'll find in other Mobi files.
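As a small sketch of how that identifier can be checked, the Python below pulls the 8-byte type and creator tag from a file's header. The offset (bytes 60 to 67) is taken from the commonly documented Palm Database header layout, not from this thread, and the names are hypothetical. Tags are compared case-insensitively here, on the assumption that the capitalization stored on disk may differ from how it is written above.

```python
# Tags mentioned in the thread; descriptions are informal labels only.
KNOWN_TAGS = {
    "BOOKMOBI": "Mobipocket book",
    "TEXTREAD": "alternate tag mentioned in the thread",
}


def pdb_type_creator(data: bytes) -> str:
    """Return the 8-character type+creator tag at bytes 60-67 of the header."""
    if len(data) < 68:
        raise ValueError("too short to hold a Palm Database header")
    return data[60:68].decode("ascii", errors="replace")


def describe(data: bytes) -> str:
    """Map the tag to a label, ignoring case; 'unknown' if unrecognised."""
    return KNOWN_TAGS.get(pdb_type_creator(data).upper(), "unknown")
```

In practice only the first 68 bytes of the file need to be read, e.g. `describe(open(path, "rb").read(68))` with `path` pointing at the book.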

Mobi is pretty much a de facto eBook industry standard - it's certainly available on far more devices than any other format, and has more books available than any other format. There's no point in having hysterics over the fact that it's a rather elderly file format; it won't change and you'll give yourself a heart attack.

#4  Steven Lyle Jordan 11-24-2007, 04:54 PM
Welcome, schmidt349!

Yeah, okay, Mobi is old. Really old. And they're not especially helpful to programmers and hackers, pretty much leaving them to their own devices. On the other hand, it has the benefit of being able to run on almost any platform imaginable, and already has readers for said platforms. Sometimes there's value to being "old."

For your purposes, you should still be able to do the job of converting... even if you have to make it a 2-step process and convert your DocBook to some other format (like Word DOC or HTML) that Mobipocket Creator, or some other established conversion software, can work from... right? Or was there something I'm missing in your post here (other than your obvious shock and indignation at Mobi)?

Glad you're trying out a Kindle... quite a few of us are experimenting with it, me getting my e-books onto Amazon, for instance. You'll have to let us know how well it reads the DocBook files (once you get them in there).

#5  schmidt349 11-24-2007, 05:04 PM
Well, just a couple of comments, and please don't take them personally.

You can't justify the mobipocket format's intractability by saying "it supports old devices" or "it's a de facto standard." That's the same logic that Microsoft uses to keep us locked into their proprietary binary document formats (and now proprietary XML document formats). If there are industry standards that everyone else has agreed to it makes absolutely no sense not to follow them except if they're trying to monopolize the market by using standards lockout.

It's incidentally the same logic that gave us HTML hell back in the nineties. Neither Netscape nor Microsoft wanted to play by a common set of rules; instead they just did whatever they pleased with HTML as a standard, and by the end it really wasn't one.

The good news is that the Kindle uses the Netfront browser (v3.3) to render HTML. That should make it XHTML and CSS compliant, so if I fiddle things properly I may be able to find a way to go Docbook via XSLT -> XHTML/CSS. Wish me luck.
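To give a feel for the DocBook-to-XHTML direction, here is a very rough Python sketch. It is not XSLT: it is a hand-rolled tag mapping over the standard library's ElementTree, swapped in only because it runs without extra dependencies. The tag map is a tiny made-up subset, and it assumes un-namespaced (DocBook 4 style) input.

```python
import xml.etree.ElementTree as ET

# Hypothetical, deliberately tiny DocBook -> XHTML element mapping.
TAG_MAP = {"article": "div", "title": "h1", "para": "p", "emphasis": "em"}


def docbook_to_xhtml(source: str) -> str:
    """Rename known DocBook elements to XHTML ones; unknown ones become <span>."""
    root = ET.fromstring(source)
    for elem in root.iter():
        elem.tag = TAG_MAP.get(elem.tag, "span")
        elem.attrib.clear()  # drop DocBook attributes; styling is left to CSS
    return ET.tostring(root, encoding="unicode")
```

For example, `<para>Read the <emphasis>guide</emphasis>.</para>` comes out as `<p>Read the <em>guide</em>.</p>`. A real stylesheet would handle far more of DocBook's vocabulary than this.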

#6  tompe 11-24-2007, 05:19 PM
Quote HarryT
Why not just use the tools that the rest of us use to generate Mobi files - Book Designer or the Mobi Desktop Reader - instead of beating yourself up over something you have no control over?
Oh, they have released a Linux version of these tools then. Or are you just assuming that everybody uses Windows?

#7  schmidt349 11-24-2007, 05:42 PM
Yeah, that was a sticking point for me as well-- I don't do Windows, it gives me heartburn.

#8  Hadrien 11-24-2007, 05:59 PM
They're still supposed to release mobigen on Linux, sometime in the future, in a galaxy far far away...

#9  jasonkchapman 11-24-2007, 07:00 PM
Quote schmidt349
You can't justify the mobipocket format's intractability by saying "it supports old devices" or "it's a de facto standard."
I don't think anyone's trying to justify the format. It's just that your complaints are a decade old. Everyone knows what a horror the format is internally.

From a technological view, the format is a travesty. From a marketing view, it's the closest thing to a de facto standard that there is in the commercial e-book market. From a technological view, Amazon probably should have developed a new format, or better, worked from an existing open standard. From a marketing view, Amazon's use of the Mobipocket format at least gives the hint of a promise of interoperability in the future.

Personally, I'm willing to bet that to Amazon's target market, the ones who are going to decide if e-books really matter or remain a niche market, things like HTML, XML, OEBP, etc. are just strings of meaningless technobabble.

#10  igorsk 11-24-2007, 08:42 PM
I'm thinking they're just following the good old rule: "If it ain't broken, don't fix it".
