KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files
#1  adamselene 11-12-2009, 01:50 PM
Most of this post now by pdurrant.

KindleUnpack is a set of Python scripts that takes a Kindle/Mobipocket ebook, extracts the HTML, images and metadata it contains, and puts them in a form suitable for passing to KindleGen.

For KF8 files and combined Mobipocket and KF8 files, it can also produce separate Mobipocket and KF8 files, plus the original source files if those are included in the ebook. In addition, for KF8 files it can produce an 'ePub', although if the HTML isn't compliant with ePub standards, the 'ePub' won't be either.

For Amazon's .azw4 files, it will extract the PDF that's been wrapped up in Amazon's .azw4 file format.

Downloads available:
Version 0.81 of the Python scripts (including a .pyw graphical front end)
Version 0.81 of a drag&drop AppleScript version.
Version 0.81 of a drag&drop 64-bit AppleScript version for Mac OS X 10.6 and later.
A calibre plugin version of the scripts is available in this thread.

For anyone not interested in KindleGen and KF8, there's a copy of the last version of the single-file script, mobiunpack 0.32.

The name of the script was changed to KindleUnpack with version 0.6.1.

The Python scripts are released under GPLv3. The AppleScript wrapper is released under the Unlicense.

The 0.81 version includes all bug fixes made at the git repository up until 1st December, 2018.

Many thanks to adamselene for the base code which has been built on by many of the participants of this thread.


[Original Post:]
I reimplemented huff/cdic compression in Python, and did a few other things while I was at it. The new script:

* decompresses about 25x faster than
* uses much less memory (about 16x less on my largest test file)
* implements conversion of uncompressed and Palmdoc-compressed files
* handles trailing data correctly in all cases
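
For anyone curious what the Palmdoc case involves, here is a minimal sketch of the standard PalmDoc (LZ77-style) decompression scheme. It is not the code from the script itself, and the function name palmdoc_decompress is made up for illustration. It also ignores the trailing-data handling mentioned above, so it assumes the record has already been trimmed.

```python
def palmdoc_decompress(data: bytes) -> bytes:
    """Decompress one PalmDoc-compressed text record (illustrative sketch)."""
    out = bytearray()
    i = 0
    n = len(data)
    while i < n:
        c = data[i]
        i += 1
        if 1 <= c <= 8:
            # 0x01-0x08: copy the next c bytes through unchanged
            out += data[i:i + c]
            i += c
        elif c < 0x80:
            # 0x00 and 0x09-0x7F: a literal byte
            out.append(c)
        elif c >= 0xC0:
            # 0xC0-0xFF: a space followed by (c XOR 0x80)
            out.append(0x20)
            out.append(c ^ 0x80)
        else:
            # 0x80-0xBF: two-byte back-reference,
            # 11 bits of distance and 3 bits of (length - 3)
            if i >= n:
                raise ValueError("truncated back-reference")
            pair = (c << 8) | data[i]
            i += 1
            distance = (pair >> 3) & 0x07FF
            length = (pair & 0x07) + 3
            if distance == 0 or distance > len(out):
                raise ValueError("back-reference points outside the output")
            # Copy byte by byte so overlapping references repeat correctly
            for _ in range(length):
                out.append(out[-distance])
    return bytes(out)
```

The byte-by-byte copy is slow but keeps overlapping back-references correct; a faster version would copy slices when the distance is at least the length.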

Check it out:

PLEASE NOTE that this tool is only for decompressing unencrypted Mobipocket files. It does not decrypt DRMed files. Do not ask me for help breaking DRM.
[zip] mobiunpack (18.4 KB, 17703 views)
[zip] KindleUnpack (442.4 KB, 761 views)
[zip] KindleUnpack 64 (437.5 KB, 1303 views)
[zip] (124.7 KB, 2419 views)

#2  adamselene 11-13-2009, 11:22 PM
The latest version (0.07, same location) is even faster—now about 50x as fast as

#3  quocsan 11-14-2009, 07:20 AM
Great job!
Thank you, Adamselene.

#4  HansTWN 11-15-2009, 09:04 PM
time to get working on those Topaz files! Wink, wink!

#5  pdurrant 02-05-2010, 12:37 PM
Quote adamselene
PLEASE NOTE that this tool is only for decompressing unencrypted Mobipocket files. It does not decrypt DRMed files. Do not ask me for help breaking DRM.
Many thanks for this. I have moved the latest versions into the first post in this thread now. (Being a moderator has some advantages.)

#6  soalla 02-05-2010, 01:33 PM
thanks to both of you!!

#7  pdurrant 02-05-2010, 06:05 PM
I've now tweaked the script to also output the images.

Note that the HTML file is the raw contents of the Mobipocket file, and so the img attributes in it aren't proper HTML, and don't point to the extracted images. To get working images in the HTML, a bit of search/replace will be needed, although it should be possible to do it with a single grep, as I've tried to make the file names easy to use with what's in the HTML file.
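
As a rough idea of the search/replace involved, here is a small Python sketch. It assumes the raw markup refers to images with a recindex attribute on the img tag, and that the extracted files are named like images/image00001.jpg. That file name pattern and the function name link_images are assumptions for illustration, so adjust them to whatever the script actually writes out.

```python
import re

# Matches a whole <img ...> tag so we only touch attributes inside it
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
# Matches recindex="00001" (quotes optional) and captures the number
RECINDEX = re.compile(r"recindex=[\"']?(\d+)[\"']?", re.IGNORECASE)


def link_images(html: str, name_pattern: str = "images/image{:05d}.jpg") -> str:
    """Replace recindex attributes with src attributes (hypothetical naming)."""

    def fix_tag(tag_match):
        def to_src(attr_match):
            index = int(attr_match.group(1))
            return 'src="{}"'.format(name_pattern.format(index))

        return RECINDEX.sub(to_src, tag_match.group(0))

    return IMG_TAG.sub(fix_tag, html)
```

Restricting the replacement to img tags avoids rewriting any stray text in the book that happens to contain "recindex=".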

#8  Jellby 02-06-2010, 04:04 AM
Pssst, remove the __MACOSX directory

#9  pdurrant 02-06-2010, 10:54 AM
Quote Jellby
Pssst, remove the __MACOSX directory
OK, done. Saved 48 bytes!

#10  pdurrant 02-09-2010, 03:03 PM
Tweaked again, mostly by some_updates from the Dark Reverser's blog comments, to output some of the metadata from the file.

I've added to his work by getting the metadata output as an opf file resembling the original file used to generate the Mobipocket file.

However, the raw 'html' output from the Mobipocket file needs a fair bit of work yet before it'll be possible to regenerate the file using Mobipocket Creator or KindleGen.

That's my eventual aim with this, however.

