Mobileread
azw3r highlight and note extraction info
#1  j.p.s 07-27-2019, 01:52 PM
I've figured out enough of the azw3r format to extract personal highlights, notes, and maybe bookmarks. (All strictly by inspection.) I've also written a C program to extract highlights and notes (in a text format possibly most suitable as an intermediate stage) and a perl script that uses the extracted highlights and notes to mark up the rawml for the book. azw3r.pl is a perl alternative to the C program which takes the same arguments and produces the same output. Both of these can now extract highlighted text from the book's rawml file. Both might also be used with yjr files from KFX books, but without the capability to extract highlighted text.

Since jhowell's KRDS parser krds.py https://www.mobileread.com/forums/sh...d.php?t=322172 is general and complete, I've put the details of my partial reverse engineering in spoiler tags.
Spoiler Warning below

As I write this up, I see that the structures are saved AVL interval trees, which is meaningless to me, and the results of a web search don't look interesting. This particular file is a strange mix of binary and text. (Of course the notes are in text, but see the following.)

Each highlight begins (for my purposes) with the string "annotation.personal.highlight" followed by 4 bytes. The first byte is always 0x03 (^C), followed by 3 bytes that seem to give the length of the text string that comes next, which denotes the rawml byte offset of the beginning of the highlight. The same pattern then repeats to give the byte offset of the end of the highlight, which is followed by a couple dozen or so bytes of (as far as I am concerned) junk.
Code
annotation.personal.highlight^C^@^@^G1191325^C^@^@^G1191337^B^@^@^A...
                              3 0 0 7         3 0 0 7
((0*256) + 0)*256 + 7 = 7
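
The layout described above can be sketched in a few lines of Python. This is only an illustration under the stated guesses: the 3 length bytes are treated as big-endian (as the arithmetic in the example suggests), the offsets are assumed to be ASCII digit strings, and the trailing "junk" is simply skipped. The names read_offset_field and find_highlights are hypothetical and are not taken from azw3r.c or azw3r.pl.

```python
MARKER = b"annotation.personal.highlight"


def read_offset_field(buf, pos):
    """Read one field: a 0x03 tag, a 3-byte length, then that many ASCII digits.

    Returns (offset_value, position_after_field).
    Assumption: the 3 length bytes are big-endian.
    """
    if buf[pos:pos + 1] != b"\x03":
        raise ValueError("expected 0x03 tag at position %d" % pos)
    length = int.from_bytes(buf[pos + 1:pos + 4], "big")
    start = pos + 4
    digits = buf[start:start + length]
    if len(digits) != length:
        raise ValueError("field at position %d is truncated" % pos)
    return int(digits.decode("ascii")), start + length


def find_highlights(buf):
    """Return a list of (begin, end) rawml byte offsets, one per highlight."""
    highlights = []
    pos = buf.find(MARKER)
    while pos != -1:
        cursor = pos + len(MARKER)
        begin, cursor = read_offset_field(buf, cursor)
        end, cursor = read_offset_field(buf, cursor)
        highlights.append((begin, end))
        # Whatever follows (the "junk") is ignored; just look for the next marker.
        pos = buf.find(MARKER, cursor)
    return highlights


# The bytes from the example above, written out explicitly.
sample = (
    MARKER
    + b"\x03\x00\x00\x07" + b"1191325"
    + b"\x03\x00\x00\x07" + b"1191337"
    + b"\x02\x00\x00\x01"
)
print(find_highlights(sample))  # [(1191325, 1191337)]
```

For a real file, the buffer would come from something like open(path, "rb").read(). Whether the end offset is inclusive is not settled here, so slicing the rawml with these values may need an adjustment of one byte.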

Personal notes are similar to highlights. They begin with the string "annotation.personal.note", followed by the rawml byte offset of the highlight associated with the note. This is followed by more "junk", then the length of the note (in binary only, with no text form), then the text of the note itself.

Bookmarks look similar to highlights, but I have not investigated.

The C code and perl scripts are on GitHub at https://github.com/jps-e/azw3r and attached here along with a sed script to make the rawml viewable in a web browser.

ETA: The C and perl have been updated

ETA: New release attached as azw3r-0.1.7.zip to this post. See post #29 for details of added features.
[gz] notes_insert.pl.gz (492 Bytes, 37 views)
[gz] unxml.sed.gz (78 Bytes, 33 views)
[gz] azw3r.pl.gz (822 Bytes, 10 views)
[gz] azw3r.c.gz (1.0 KB, 11 views)
[zip] azw3r-0.1.7.zip (4.2 KB, 5 views)

#2  NiLuJe 07-28-2019, 01:47 PM
Not being a Java guy at all, I've always wondered if those (and a few other things) weren't some weird Java binary storage/serialized format...

#3  lumpynose 07-28-2019, 02:06 PM
Quote NiLuJe
Not being a Java guy at all, I've always wondered if those (and a few other things) weren't some weird Java binary storage/serialized format...
Why Java? (I don't know enough about kindle files to know what the connection might be.)

I would doubt that it's Java serialization since that is rather fragile; a slight change to a class could break compatibility. But for other binary encodings, who knows.

#4  NiLuJe 07-28-2019, 02:42 PM
Because most of the Kindle backend is in Java.

#5  j.p.s 07-28-2019, 04:38 PM
Quote NiLuJe
Not being a Java guy at all, I've always wondered if those (and a few other things) weren't some weird Java binary storage/serialized format...
Also not a java guy, and wondered the same thing for the same reasons.

#6  PoP 08-09-2019, 02:31 PM
Quote j.p.s
I've figured out enough of the azw3r format to extract personal highlights, notes, and maybe bookmarks. ...
Just to note that (similar in content to *.azw3r and *.azw3f), for *.kfx books, *.yjr and *.yjf files are created in the *.sdr folder.

#7  ilovejedd 08-10-2019, 02:17 AM
Awesome work! Question though, how do you use this (syntax)? I'm assuming Linux only? Will this work on a Linux LiveUSB?

Thanks!

#8  PoP 08-10-2019, 08:41 AM
En passant, I've also been using Kindle Mate to store notes, highlights and vocabulary builder words (but not bookmarks).

#9  ilovejedd 08-10-2019, 09:37 AM
Quote PoP
En passant, I've also been using Kindle Mate to store notes, highlights and vocabulary builder words (but not bookmarks).
Afaik, that uses "My Clippings.txt", so it only works for highlights created on the e-ink devices. It doesn't work for highlights created via the iOS/Android app and synced to an e-ink Kindle.

#10  j.p.s 08-10-2019, 12:26 PM
Quote ilovejedd
Same sdr folder. Just named mbp1 for MOBI and yjr for KFX
Quote PoP
Just to note that (similar in content to *.azw3r and *.azw3f), for *.kfx books, *.yjr and *.yjf files are created in the *.sdr folder.
Thanks to you both.

And it looks like on older firmware, mbp for MOBI.

I played around with them a bit, and their formats for highlights and notes are all different and deserve their own threads. After they are all sorted out, maybe someone can start a thread for an application that automatically handles all of them. In all cases, there is "junk" between the note's header and the text of the note.

I might edit this post later to show excerpts inside spoiler tags.
