Mobileread
azw3r highlight and note extraction info
#21  j.p.s 08-17-2019, 06:20 PM
When I first started playing with this I thought Calibre could not convert a kindleunpack rawml file to PDF, but when I tack an .html extension onto the input file name, Calibre makes a usable PDF, even if the XML has not been commented out. It has the added advantage over html2ps that the font is larger. (However, the Calibre-generated PDF does not have a clickable TOC. Also, for me, using xpdf to view it, clicking on a link internal to the PDF causes an empty web browser window to open.)
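A minimal sketch of the renaming trick described above, assuming Calibre's `ebook-convert` command-line tool is installed and on the PATH. The function name `rawml_to_pdf_command` and the output file naming are illustrative only and are not part of kindleunpack, Calibre, or azw3r.

```python
import shutil
import subprocess
from pathlib import Path


def rawml_to_pdf_command(rawml_path, run=False):
    """Copy a rawml file to a name ending in .html and build the
    ebook-convert command that turns the copy into a PDF."""
    src = Path(rawml_path)
    html_copy = src.with_name(src.name + ".html")  # book.rawml -> book.rawml.html
    pdf_out = src.with_name(src.stem + ".pdf")     # book.rawml -> book.pdf
    if src.exists():
        shutil.copyfile(src, html_copy)            # leave the original untouched
    cmd = ["ebook-convert", str(html_copy), str(pdf_out)]
    if run:
        subprocess.run(cmd, check=True)            # needs Calibre on the PATH
    return cmd
```

Calling `rawml_to_pdf_command("book.rawml", run=True)` would perform the conversion; with the default `run=False` it only makes the copy and returns the command for inspection.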

#22  rzikaou 08-18-2019, 10:56 AM
Hi j.p.s,

First, I'd like to thank you for the work you've done so far.

I was in the process of reverse engineering azw3r files and then found your project and it's been super helpful.

I'm not sure if this is a known issue, but I tried using your tool to extract highlights (with no notes) from a book, but the highlight text it is extracting is not correct.

I suspect the rawml file I'm providing might not be the right one.

I used KindleUnpack and it gave me 2 different rawml files: mobi7/book.rawml and mobi8/book.rawml.

I tried running `azw3r -i book.azw3r -h -r book.rawml` with both of them and the extracted highlight text is incorrect.

Any ideas?

Thanks!

#23  j.p.s 08-18-2019, 12:18 PM
hi rzikaou,

I'm sorry to hear that it is not working for you.

Can you post the exact sequence of commands you are using?

It would also be good if you can pick some public domain book (preferably short with few or no images) in KF8 format (azw3) and make some highlights in it and post the output of the azw3r program here along with the book.azw3r file used to extract the highlights. Then I can try to reproduce your problem.

#24  rzikaou 08-18-2019, 02:39 PM
The document I'm using is an html document that I've converted to `epub`, then to `mobi` (using `kindlegen`) and then I e-mailed it to my Kindle with the special `@kindle.com` email which resulted in the `azw3` document that is on my kindle.

I've made the following test highlights on the first page of the article (without any notes):

"On a bright Monday in January"

"a thousand"

"They packed themselves into a cheerful courtyard outside"

But this doesn't seem to be what the tool is returning.

I'm trying to remember how I generated the rawml because now running `kindleunpack` doesn't give me any `.rawml` files.
[zip] test-article.zip (50.8 KB, 15 views)

#25  j.p.s 08-18-2019, 04:10 PM
Thanks rzikaou for uploading your example. It turns out that was important rather than my suggestion to use a book. I've never understood why there is a 14 byte difference between Amazon's offset into the rawml and where the text actually is. It turns out that for your azw3 the offset is 166 bytes. I have added an -o option to the azw3r.c and azw3r.pl attached to the first post and made a new github release.

I don't know whether it is possible to tell in advance what the offset is. I had to get yours experimentally. I moved your azw3r and rawml files into the same directory so that the command would merely be way too long instead of impossibly long.
Code
azw3r -h -o 166 -i "Test article_CTD7HH6AE5BVXFGTFOTOV54NOREZMUWNa1bd4a78ed253ba5271d0cb7df407fda.azw3r" -r "Test article_CTD7HH6AE5BVXFGTFOTOV54NOREZMUWN.rawml"
1259 1269 Highlight: 'a thousand '
1184 1220 Highlight: 'On a bright</span> Monday in January '
1462 1518 Highlight: 'They packed themselves into a cheerful courtyard outside '
I have shown the "-o 166" at the beginning of the command for clarity. During experimentation it is more convenient at the end of the command, where it is easiest to edit.
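The relationship between the reported positions and the -o value can be sketched as below. This is an illustration only, not the code of azw3r.c or azw3r.pl. It assumes the offset is added to both positions and that the end position is inclusive, which matches the output above (1259 to 1269 covers the 11 bytes of 'a thousand '). The function name `extract_highlight` is hypothetical.

```python
def extract_highlight(data: bytes, start: int, end: int, offset: int = 0) -> str:
    """Return the bytes between two reported positions, shifted by offset.

    Assumption: the end position is inclusive and the offset is added to
    both positions before slicing.
    """
    lo = start + offset
    hi = end + offset + 1  # +1 because the end position is treated as inclusive
    return data[lo:hi].decode("utf-8", errors="replace")


# Synthetic data: 166 bytes of extra header material in front of the text.
body = b"0123456789a thousand people"
rawml_like = b"H" * 166 + body

print(repr(extract_highlight(rawml_like, 10, 20, offset=166)))
print(repr(extract_highlight(body, 10, 20)))
```

Both calls print the same slice: once from the padded data with an offset of 166, and once from the unpadded data with no offset.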

#26  jhowell 08-18-2019, 06:03 PM
Quote j.p.s
I don't know whether it is possible to tell in advance what the offset is. I had to get yours experimentally.
"kindleunpack -d" will produce a file named "assembled_text.dat" in the mobi8 folder containing a subset of the rawml corresponding to the actual book content (flow 0). I think you will find that the position number offsets are indexed into this data without any correction needed.

#27  rzikaou 08-18-2019, 07:46 PM
Thanks for debugging this j.p.s.

jhowell as far as I can tell, you are correct. Using the offsets against the "assembled_text.dat" gives the expected result!

#28  j.p.s 08-18-2019, 07:49 PM
Quote jhowell
"kindleunpack -d" will produce a file named "assembled_text.dat" in the mobi8 folder containing a subset of the rawml corresponding to the actual book content (flow 0). I think you will find that the position number offsets are indexed into this data without any correction needed.
And so it does. But, it seems a bit magic. Somewhat early on, the rawml has the 14 byte string "</body></html>" not in assembled_text.dat, then somehow the two files have unaligned sets of opening and closing html and body tags which somehow do not affect the byte offsets of book text.

rzikaou's rawml file has extra header tags not in the assembled_text.dat file.

So the C azw3r and the perl azw3r.pl can be used as is with
-r assembled_text.dat -o 0
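For anyone who still needs to work with a <book>.rawml file, the experimentally determined offset can be estimated by searching for a phrase that is known to be highlighted. The sketch below is a hedged illustration and is not part of the azw3r release; the function name `estimate_offset` is hypothetical. It uses the first occurrence of the phrase, so the phrase should be unique in the file and should not span inline tags such as `</span>`.

```python
def estimate_offset(data: bytes, known_text: str, reported_start: int):
    """Estimate the difference between where a highlighted phrase really
    starts in the file and the start position reported for it.

    Returns None if the phrase is not found.
    """
    pos = data.find(known_text.encode("utf-8"))  # first occurrence only
    if pos < 0:
        return None
    return pos - reported_start


# Synthetic data: the phrase starts at byte 10 of the text, after 166 extra bytes.
body = b"0123456789a thousand people"
rawml_like = b"H" * 166 + body

print(estimate_offset(rawml_like, "a thousand", 10))
print(estimate_offset(body, "a thousand", 10))
```

With data that matches the reported positions exactly, as assembled_text.dat is said to above, the estimate comes out as 0.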

Quote odamizu
Thank you jhowell! As always, you are a wonderful source of enlightenment
Ditto.

#29  j.p.s 09-07-2019, 06:18 PM
There is a new release at github that makes the default rawml offset 0 in the C and perl utilities, so kindleunpack -d should be used to make assembled_text.dat instead of kindleunpack -r to make <book>.rawml.

Also, there is a new utility named krdsJSON2notes.pl that processes the <book>.json file produced by jhowell's KRDS parser https://www.mobileread.com/forums/sh...d.php?t=322172 into the same format used by notes_insert.pl to highlight and insert notes into a rawml file (assembled_text.dat) suitable for converting to PDF.

So now human readable personal notes can be extracted from all Kindle books and personal highlights can be extracted from KF8 (azw3) and probably mobi books.

The current latest release is attached as azw3r-0.1.7.zip to post #1 in this thread.

#30  Luca2903 10-09-2019, 05:17 PM
Quote j.p.s
Luca2903,

you have not said what kindle format your books are in. If they are KF8 (azw3), then I think my scripts are in good enough shape for just about anyone to extract both notes and highlights as text as separate files and/or insert them into the text of the book for context.
Hello man, the format is .Kfx.

Thanks.
