Pocketbook dictionary format revisited
#1  Markismus 11-30-2019, 06:47 AM
I am maintaining the script PocketBookDic to convert dictionaries to xdxf-, Pocketbook dic- and Stardict ifo-format. It's at the github repository PocketBookDic.

Conversion to dic-format needs a Windows program, converter.exe, and language configuration files. They can be found at the github repository LanguageFilesPocketbookConverter.

Nowadays, I am looking into more heuristic approaches to convert free-format dictionaries, such as mobi-files and Kindle dictionaries (azw-, azw3-files). Typically the conversion passes through an intermediate html-stage, which has to be interpreted and converted to the central xdxf-format before it can be converted to other formats such as Stardict optimized for Koreader, Pocketbook dic-format or mdict-format.*

Over 20 dictionaries in xdxf- (human readable), Stardict- and Pocketbook's dic- (binary) format can be found here. (The xdxf-files can be converted to dic-files with converter.exe. If you want to tweak your dictionary, this is the place to do it.)

16th November 2021: Getkey just did some testing, and the script for the Pocketbook format has been updated to handle Unicode characters better. All Pocketbook dictionaries have been recreated.

For those that can't be charmed by the tinkerings needed for conversion, post a request and link to your dictionary files and I'll try and convert them.

11th November 2022:
Due to an excessive amount of traffic (more than 50GB this month), pCloud is restricting access. As it is a moving total, access should be restored within a few days. (The last 2 days in the graph show restricted access. Apparently, nobody is going to pay for pCloud.)

November 13th, 2022:
pCloud access is indeed restored.
*Only implemented to check whether there would be a speed improvement over Stardict on Onyx Boox systems. It didn't improve.
Screenshot from 2022-11-11 12-35-29.png 

#2  Markismus 12-01-2019, 05:05 PM
I spent yesterday trying to guess the restrictions of Pocketbook's dictionary converter.exe* to get the whole of the Oxford Dictionary 2nd Edition into dic-format. The Oxford dictionary has entries of up to 115k characters, so it is not odd that converter.exe crashes, just irritating. Duden (de-de) and Oxford Learners Dictionary 8th Ed. (en-en) work with a little tweaking of the xdxf-files.**

Wish I had a clue of that format so I could skip the program converter.exe: The Perl script already runs up to 250 lines!
Does anyone have or know a link to the source code of converter.exe? Does anyone know the format of Pocketbook's dic-format, so I can generate it straight from xdxf- or csv-format?

The restrictions known of converter.exe are
  1. A line should not be >4096 bytes. It cuts the line after this length and messages that the XML is missing closing tags.
  2. If '&' or '<' is found in the XML content outside of tags, etc., it quits with a message about malformed XML.
  3. If a dictionary entry definition (a block enclosed by <def> and </def> tags) exceeds 100kB, it crashes without a message. (103916 bytes works, but 104992 bytes already crashes.)***

Possible resolutions are:
  1. Split the dictionary entry at the tags or use something like prettify or auto-indent.
  2. '&' and '<' should be replaced with '&amp;' and '&lt;'.
  3. I can resolve this by splitting an entry into multiple entries with identical lemmas.
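
The first two resolutions could look roughly like the Python sketch below. It is not taken from the PocketBookDic script (which is Perl); the function names are made up for illustration, and the 4096-byte limit is the one observed above. It assumes the text handed to escape_text is already known to sit outside of tags.

```python
import re

MAX_LINE_BYTES = 4096  # line limit observed for converter.exe in this post

# Matches '&' that does not already start an entity such as &amp; or &#945;
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_text(text: str) -> str:
    """Escape '&' and '<' in text that sits outside of tags (hypothetical helper)."""
    return BARE_AMP.sub("&amp;", text).replace("<", "&lt;")


def split_at_tags(line: str, limit: int = MAX_LINE_BYTES) -> list:
    """Break a long XML line at tag boundaries so pieces stay within `limit` bytes.

    A single tag or text run that is longer than `limit` on its own is
    passed through unchanged; this sketch does not cut inside text.
    """
    # Split into tags and text runs, keeping the tags themselves.
    tokens = [t for t in re.split(r"(<[^>]*>)", line) if t]
    pieces, current = [], ""
    for token in tokens:
        if current and len((current + token).encode("utf-8")) > limit:
            pieces.append(current)
            current = token
        else:
            current += token
    if current:
        pieces.append(current)
    return pieces
```

Writing each returned piece on its own line keeps every line under the limit. Whether converter.exe treats the added line breaks as insignificant whitespace is something to check on a small test dictionary first.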

If someone has tinkered with this before and has pointers for me, I would be much obliged.

* I used DictionaryConverter-neu 171109. Search this forum or look here for more info.
** For the conversion of dictionaries to xdxf-format I used linguae. Search this forum or look here for more info.
*** This is different from @Rkomar's post that states that he converted a dictionary with 33283 lines. It seems to be the limit on one dictionary entry.

I just removed all the lines > 4096 bytes. The result was:
Loading collates...
Loading morphems...
Loading keyboard...
Loading dictionary file...
140407 words loaded
Sorting dictionary...
Searching for equal words...
Packing dictionary...

maximum block count reached

So it doesn't crash anymore, however, it still can't pack it.
It is slightly larger than Rkomar's claim of 33283 lines: 1,185,340 lines. That's why I wanted it! Maybe if I split the dictionary into 6 parts for Pocketbook, instead of the 2 parts it now has for Stardict... crappy.

#3  Markismus 12-03-2019, 03:26 PM
I have a working Perl script and it's on github. It converts mobi- (KindleUnpacked html), csv-, Stardict- and xdxf-format to Pocketbook dic-format and Stardict formats.

I've successfully converted Liddell-Scott-Jones, Oxford's Learners dictionary, Duden (de-de), a Latin-English dictionary, Nouveau Littre 2011, the Oxford English Dictionary 2nd Ed. and Wordnet.

The results in both xdxf- (human readable) and dic- (binary) format are here. (The xdxf-files can be converted to dic-files with converter.exe. If you want to tweak your dictionary, this is the place to do it.)

You will also need
  1. pocketbook converter binary and its language configuration files. I've zipped them in the uploaded
  2. Install Perl
  3. Install Stardict-tools (if you want to convert from Stardict ifo-, dict- and idx-files).
See github for further info.

The zip-file attached contains the newest converter.exe patched by ezdiy from post #6.
[zip] (73.8 KB, 1432 views)

#4  ezdiy 12-03-2019, 04:46 PM
I've patched the binary to remove block count limit (I'm using it for small 200k word dict though, not sure if it really works with larger dicts) and seems to work for me (TM). I've also tried to remove the 4kbyte entry limit, though not sure if successfully (I don't have dicts with defs this long to test).

#5  Markismus 12-03-2019, 05:16 PM
How did you patch that? Do you have the source code?

No luck. Still crashes on the Oxford dictionary part 1.

#6  ezdiy 12-03-2019, 06:37 PM
Quote Markismus
No luck. Still crashes on the Oxford dictionary part 1.
This one works:

Turns out the "100kb limit" is actually 64k (after removal of tags). This is a hard limit of the DIC format. I've patched the binary so that it no longer crashes, but truncates and reports the offending line that is over the limit. But there's not much more that can be done: you'll have to abbreviate the entry or split it via Perl. Out of the whole dict there's only one such entry though. Further, the chunks between each < are still limited to 4k, I think, though that can be easily fixed with some re-formatting from Perl with no information loss.
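
Checking an entry against the 64k limit and splitting it into several entries with the same lemma might be sketched like this in Python. The helper names are hypothetical, and counting the limit in UTF-8 bytes is an assumption; the thread does not say in which unit the 64k is measured. The definition is expected as a list of already separated blocks (for example one per sense), so no block is cut in the middle.

```python
import re

DIC_LIMIT = 64 * 1024  # "64k" from the post; counting it in bytes is an assumption

TAG = re.compile(r"<[^>]*>")


def stripped_size(xml_fragment: str) -> int:
    """Size in UTF-8 bytes once all tags are removed."""
    return len(TAG.sub("", xml_fragment).encode("utf-8"))


def split_entry(lemma: str, blocks: list, limit: int = DIC_LIMIT) -> list:
    """Group the blocks of one definition into several (lemma, definition) pairs.

    Each pair stays within `limit` after tag removal, unless a single
    block is already larger than `limit`; such a block is kept whole.
    """
    entries, current, size = [], [], 0
    for block in blocks:
        block_size = stripped_size(block)
        if current and size + block_size > limit:
            entries.append((lemma, "".join(current)))
            current, size = [], 0
        current.append(block)
        size += block_size
    if current:
        entries.append((lemma, "".join(current)))
    return entries
```

A block that is too large on its own would still have to be abbreviated by hand, as suggested above.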

How did you patch that? Do you have the source code?

image »

#7  Markismus 12-04-2019, 01:53 AM
@ezdiy Great! Thank you!
Turns out the "100kb limit" is actually 64k (after removal of tags).
What tags are retained in the conversion? Are color-tags removed? Blockquote, ex, abr?
you'll have to abbreviate the entry or split it via perl. Out of the whole dict there's only one such entry though.
The maximum article- and line-lengths are already implemented, so I'll tune them in the script. That's why the reconstructed xdxf-file had only one entry left that was too long. The original is teeming with them.

What is the limiting entity, precisely? With Greek letters I saw that it isn't bytes: some accepted entries stayed below 3500 chars while being 7500 bytes. But the limit is not exactly 4k chars either; it is somewhat less.
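
To narrow down which unit the limit is counted in, both measures can be printed for the entries that pass and the ones that fail. A minimal Python sketch (the function name is made up for illustration):

```python
def measure(text: str) -> tuple:
    """Return (characters, UTF-8 bytes) for a piece of text."""
    return len(text), len(text.encode("utf-8"))


# Basic Greek letters take 2 bytes each in UTF-8,
# polytonic letters such as U+1F00 take 3 bytes.
basic = "\u03bb\u03cc\u03b3\u03bf\u03c2"  # 5 basic Greek letters
polytonic = "\u1f00"                      # alpha with smooth breathing

print(measure(basic))      # (5, 10)
print(measure(polytonic))  # (1, 3)
```

A ratio a little above 2 bytes per character, as in the 3500 chars versus 7500 bytes observed above, would fit text that is mostly basic Greek with some polytonic letters mixed in.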

Is there a way to encode for resources? Audio tags for pronunciation? I know Stardict-tools can convert Lingvo audio resources to Stardict format, however, I have no idea how to implement them in xdxf-format, yet. Would be great to use the audio feature of the pocketbook!

Image resources would be nice, too. Maybe with bbencode? I encoded fonts that way into xml when further processing needed it.

That looks a bit like the disassembler I used as a kid. (I had to hack CGA games to work on my dad's monochrome Hercules graphics card.) What could I look into for that, nowadays?

#8  nhedgehog 12-04-2019, 04:37 AM
Nice, someone is working on the pocketbook dictionary format.
Do you guys know this program?

#9  Markismus 12-04-2019, 04:41 AM
@nhedgehog Yes, I used it to get the first xdxf-formatted files. It crashes rather neatly and was not unproblematic to install. You wouldn't have to use it anymore with the script. (See the second footnote of the first post in this thread.)

#10  nhedgehog 12-04-2019, 04:43 AM
This may be interesting too (from a Russian Forum)
The name that a Pocketbook displays for any *.dic dictionary can be corrected in the following way:
1. Create a text file and write the desired name of the dictionary in it.
2. Using the wu8.exe program from Alex_None, convert this file to UTF-8.
3. Open the converted file containing the desired name in a hex viewer.
4. Open the *.dic dictionary in a hex editor.
5. Starting at offset 0x40, replace the unreadable name with the required one, byte by byte.
There is a limit on the length of the name: a maximum of 31 characters (other data already starts at offset 0x80). The name must be terminated with two zero bytes (at the latest at offsets 0x7e and 0x7f).
Step 2 can also be done with the usual Notepad; in that case, when viewing the file in hex mode, ignore the first 3 bytes, EF BB BF.
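
The manual hex-editor steps could be scripted roughly as follows. This Python sketch only follows the description above (name field from 0x40 up to 0x80, UTF-8, at most 31 characters, two terminating zero bytes); it is not derived from dicrename.exe or any format documentation. Zero-filling the remainder of the field is an assumption, and the function names are hypothetical.

```python
NAME_OFFSET = 0x40  # start of the name field, per the steps above
NAME_END = 0x80     # other data starts here, per the steps above
MAX_CHARS = 31      # character limit quoted above


def strip_bom(data: bytes) -> bytes:
    """Drop the EF BB BF marker that Notepad puts in front of UTF-8 text."""
    return data[3:] if data.startswith(b"\xef\xbb\xbf") else data


def rename_dic(dic: bytes, name: str) -> bytes:
    """Return a copy of the .dic bytes with the name field overwritten."""
    encoded = name.encode("utf-8")
    field_len = NAME_END - NAME_OFFSET
    # Leave room for the two terminating zero bytes.
    if len(name) > MAX_CHARS or len(encoded) > field_len - 2:
        raise ValueError("name too long for the name field")
    if len(dic) < NAME_END:
        raise ValueError("file too short to contain a name field")
    # Assumption: the rest of the field is filled with zero bytes,
    # which also provides the two terminating zeros.
    field = encoded.ljust(field_len, b"\x00")
    return dic[:NAME_OFFSET] + field + dic[NAME_END:]
```

To apply it, read a copy of the dictionary with pathlib's Path.read_bytes(), pass it through rename_dic, and write the result back with Path.write_bytes(). Keeping the original file untouched is advisable until the renamed copy has been tested on the device.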

There is an app (dicrename.exe) in one of the converter folders.
