Mobileread
Export list of words in spellcheck
#11  Tex2002ans 07-09-2019, 04:42 AM
Quote KevinH
BTW, Calc, like Excel, will parse most text files if they are delimited in some way (need not be commas and quotes) or if field aligned.
Yep, tab-delimited is usually my favorite. Commas are just too common, and make manually reading the file in a text editor a chore.

Whenever you import a CSV into LibreOffice Calc, a nice window pops up giving you lots of import options.
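
Outside a spreadsheet, a tab-delimited file is just as easy to read from a script. A minimal Python sketch, where the sample rows and the helper name `read_delimited` are made up for illustration:

```python
import csv
import io

def read_delimited(text, delimiter="\t"):
    """Parse delimited text into a list of rows (each row is a list of fields)."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))

# Hypothetical sample: commas and apostrophes inside a field need no quoting,
# because the tab is the only separator.
sample = "word\tcount\nDon't, stop\t3\ncolour\t2\n"
rows = read_delimited(sample)
for row in rows:
    print(row)
```

Because the tab is the separator, the comma in the second row stays inside its field without any quoting.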

Quote elchamaco
That CSV export from Reports is something like what I was searching for, but for the missing words. But playing with the calibre editor, I see it has an option to copy all the words from spellcheck to the clipboard so you can paste them into Excel. This also works for me.
You can also use the Spellcheck Lists in non-standard ways. Like in this thread, I explained how to use it to find a list of "foreign-language" words:

https://www.mobileread.com/forums/sh...59#post3812859

and go marking them up with xml:lang.

I've also done something similar when trying to normalize a collection of various articles between American/British spellings. You could:
  1. Mark ebook as English (US).
  2. Export CSV of "misspelled words".
  3. Mark ebook as English (UK).
  4. Export CSV of "misspelled words".

Compare both CSVs together, look at differences, and you can see:
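
A rough Python sketch of that comparison, assuming each export has a header row and the word in its first column. The column layout, the function names, and the sample words are all hypothetical, so check them against your own files:

```python
import csv
import io

def words_from_csv(lines, column=0, has_header=True):
    """Collect the non-empty values found in one column of a CSV export."""
    reader = csv.reader(lines)
    if has_header:
        next(reader, None)  # skip the header row (assumed to exist)
    return {
        row[column].strip()
        for row in reader
        if len(row) > column and row[column].strip()
    }

def compare_exports(us_lines, uk_lines):
    """Split the flagged words by which language setting flagged them."""
    us = words_from_csv(us_lines)
    uk = words_from_csv(uk_lines)
    return {
        # Flagged under English (US) only: likely British spellings.
        "flagged_only_as_us": sorted(us - uk),
        # Flagged under English (UK) only: likely American spellings.
        "flagged_only_as_uk": sorted(uk - us),
        # Flagged under both: typos, names, or words neither dictionary knows.
        "flagged_in_both": sorted(us & uk),
    }

# Hypothetical exports; real ones would be opened with open(path, newline="").
us_export = io.StringIO("Word,Count\ncolour,3\nteh,1\n")
uk_export = io.StringIO("Word,Count\ncolor,2\nteh,1\n")
result = compare_exports(us_export, uk_export)
for group, words in result.items():
    print(group, words)
```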

#12  DiapDealer 07-09-2019, 06:24 AM
Quote BetterRed
If Kovid had an Open with in the calibre editor could the PageEdit gadget be used from within it?
I suppose so, but that's a big IF.

And I retract my comment about your previous idea being a "bad, bad" one. Your logic for wanting it was sound. It's just not feasible/practical is all.

#13  elchamaco 07-09-2019, 11:23 AM
Quote Tex2002ans
Yep, tab-delimited is usually my favorite. Commas are just too common, and make manually reading the file in a text editor a chore.

Whenever you import a CSV into LibreOffice Calc, a nice window pops up giving you lots of import options.

You can also use the Spellcheck Lists in non-standard ways. Like in this thread, I explained how to use it to find a list of "foreign-language" words:

https://www.mobileread.com/forums/sh...59#post3812859

and go marking them up with xml:lang.

I've also done something similar when trying to normalize a collection of various articles between American/British spellings. You could:
  1. Mark ebook as English (US).
  2. Export CSV of "misspelled words".
  3. Mark ebook as English (UK).
  4. Export CSV of "misspelled words".

Compare both CSVs together, look at differences, and you can see:
Yes, you can do a lot of stuff. I want to use it to upgrade dictionaries with missing words, and not only hunspell ones but also StarDict/mobi dictionaries. I'll create a hunspell dictionary from a StarDict one, and find missing words in different books to improve the main dictionary: main definitions and inflected forms.

Probably the best choice will be to create a script that checks all the words from an EPUB book against a hunspell dictionary and exports the missing words, but to begin with, the manual method can work.
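
A minimal sketch of such a script in Python, using only the standard library. It assumes the dictionary is available as a plain, already expanded word list, so it does not apply hunspell affix rules itself. The names `epub_words` and `missing_words` and the sample book are made up for illustration:

```python
import io
import re
import zipfile
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes and ignore markup (this also picks up <title>/<style> text)."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

# Runs of letters, optionally joined by an apostrophe (e.g. "don't").
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")

def epub_words(epub_file):
    """Return the set of lowercased words in the (X)HTML files of an EPUB."""
    words = set()
    with zipfile.ZipFile(epub_file) as zf:
        for name in zf.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = TextExtractor()
                parser.feed(zf.read(name).decode("utf-8", errors="replace"))
                parser.close()
                text = " ".join(parser.chunks)
                words.update(w.lower() for w in WORD_RE.findall(text))
    return words

def missing_words(epub_file, known_words):
    """Words in the book that are absent from the (assumed plain) word list."""
    known = {w.lower() for w in known_words}
    return sorted(epub_words(epub_file) - known)

# Build a tiny in-memory EPUB-like zip so the sketch runs without any files.
book = io.BytesIO()
with zipfile.ZipFile(book, "w") as zf:
    zf.writestr("OEBPS/ch1.xhtml",
                "<html><body><p>El gato come pescadito.</p></body></html>")
book.seek(0)
missing = missing_words(book, ["el", "gato", "come"])
print(missing)
```

Lowercasing everything keeps the comparison simple but loses the distinction between proper nouns and ordinary words, which a real script would probably want to keep.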

#14  KevinH 07-09-2019, 11:47 AM
Please note that for Hunspell dictionaries that properly use affix detection and compression, you should not add unflagged words to the dictionary. The proper way to handle that for en is to expand the dictionary (by reversing affix flag usage) to recreate a plain word list, add your new words, being sure to add all versions of each word with prefixes and suffixes, and then re-crunch the wordlist.
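
To illustrate the "expand" half of that round trip, here is a toy Python sketch. It handles only suffix rules with single-character flags and ignores prefixes, cross products, and everything else a real unmunch tool deals with. The rule table imitates the style of an English plural rule and is an illustrative assumption, not copied from any particular .aff file:

```python
import re

# Toy suffix rules in the style of a .aff file: flag -> list of
# (strip, add, condition). "0" means "strip nothing"; the condition is
# matched against the end of the word.
SUFFIX_RULES = {
    "S": [
        ("y", "ies", "[^aeiou]y"),
        ("0", "s", "[aeiou]y"),
        ("0", "es", "[sxzh]"),
        ("0", "s", "[^sxzhy]"),
    ],
}

def expand(entry, rules=SUFFIX_RULES):
    """Expand one 'word/FLAGS' dictionary entry into the plain words it covers."""
    word, _, flags = entry.partition("/")
    forms = [word]
    for flag in flags:  # assumes one character per flag
        for strip, add, condition in rules.get(flag, []):
            if not re.search(condition + "$", word):
                continue
            if strip == "0":
                stem = word
            elif word.endswith(strip):
                stem = word[: -len(strip)]
            else:
                continue
            forms.append(stem + add)
    return forms

for entry in ["pony/S", "box/S", "cat/S", "epub"]:
    print(entry, "->", expand(entry))
```

One flagged entry such as `pony/S` stands for both `pony` and `ponies`, which is why an unflagged word added by hand ends up without its inflected forms.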

This process seems to have been lost over the years as people do not understand the affix rules and affix compression.

For example the en US dict that Sigil used to use had no affix compression used at all. Being the original author of MySpell (predecessor of hunspell) and one-time head of OpenOffice's lingucomponent project, it is sad to see information on how to properly create dictionaries that are not giant wordlists has been lost.

In addition, the role of a spellcheck dictionary is not the same as that of an online dictionary or a real dictionary. Spellcheck dictionaries should be designed to focus on the "working set" of a language and NOT try to be all-encompassing, as this actually leads to fewer incorrect words being detected: common mistakes turn out to be real but not typically used words, or slang, or abbreviations, or whatnot.

You are better off creating additional user dictionaries that catch common words you use that are not covered by the spellcheck dictionaries, to expand your personal "working set" of the language.

#15  elibrarian 07-09-2019, 03:01 PM
Quote elchamaco
Probably the best choice will be to create a script that checks all the words from an EPUB book against a hunspell dictionary and exports the missing words, but to begin with, the manual method can work.
You might find the "linguist" extension for LibreOffice Writer useful. One of the things it does is make a list of unrecognized words in the active document. It's rather old, but since it's Python and not LibreOffice Basic, it still works, and it's quite fast too:

https://extensions.libreoffice.org/extensions/linguist/1.5.1

Regards,

Kim

#16  Doitsu 07-09-2019, 03:47 PM
Quote KevinH
This process seems to have been lost over the years as people do not understand the affix rules and affix compression.
IMHO, the main problem is that there aren't any user-friendly tools for editing/generating dictionary and suffix files.

#17  BetterRed 07-09-2019, 05:48 PM
↑ ↑ ↑ ✔️

Several years ago I tried, and failed, to edit the Kracked Press en GB hunspell dictionary. I also tried and failed to create a domain specific dictionary. I was surprised there were no tools specific to the task - no demand I guess.

Today, I could possibly create an epub from scratch with notepad and pkzip, but only because of what I've learnt from using Sigil. On reflection, that's a big 'possibly' if they were all I had.

BR

#18  KevinH 07-09-2019, 10:20 PM
Unfortunately, MySpell 2 or 3 had both munch and unmunch tools that worked for the dictionaries used at that time (including en, German, French, Spanish, etc.), but Hunspell needed compound prefixes, compound suffixes, and compound words to handle Hungarian and other languages. The standard munch and unmunch tools were never really modified for those changes and nothing was ever documented.

MySpell dictionaries still work in Hunspell and work for most western languages. I can probably dig up a copy of MySpell-3 source someplace and walk anyone through it.

#19  elchamaco 07-10-2019, 11:41 AM
Quote KevinH
Please note that for Hunspell dictionaries that properly use affix detection and compression, you should not add unflagged words to the dictionary. The proper way to handle that for en is to expand the dictionary (by reversing affix flag usage) to recreate a plain word list, add your new words, being sure to add all versions of each word with prefixes and suffixes, and then re-crunch the wordlist.

This process seems to have been lost over the years as people do not understand the affix rules and affix compression.

For example the en US dict that Sigil used to use had no affix compression used at all. Being the original author of MySpell (predecessor of hunspell) and one-time head of OpenOffice's lingucomponent project, it is sad to see information on how to properly create dictionaries that are not giant wordlists has been lost.

In addition, the role of a spellcheck dictionary is not the same as that of an online dictionary or a real dictionary. Spellcheck dictionaries should be designed to focus on the "working set" of a language and NOT try to be all-encompassing, as this actually leads to fewer incorrect words being detected: common mistakes turn out to be real but not typically used words, or slang, or abbreviations, or whatnot.

You are better off creating additional user dictionaries that catch common words you use that are not covered by the spellcheck dictionaries, to expand your personal "working set" of the language.

Some time ago I created a Spanish hunspell dict. I needed to dig to create a good one, and now it's used with Sigil by a lot of people. Now the idea is to improve it.

I also want to improve a real dict with definitions.

It's hard to find documentation about dictionaries, or a good program to edit them and export to different formats.

#20  KevinH 07-10-2019, 12:00 PM
I will grab a copy of the Spanish hunspell dictionary and take a look to see what features are being used. If they stick to things that MySpell groks, we can use the MySpell tools to expand the Spanish dictionary and then remunch it for use in hunspell. If it uses any of the newer Hunspell features, the older munch and unmunch tools will not be of any help.
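
A quick way to get a first impression is to list the directive keywords a .aff file uses. The Python sketch below does that. The baseline set of "MySpell-era" directives is an assumption made for illustration, not a checked list, and should be adjusted to what the old tools really accept. The sample lines are likewise made up:

```python
# Directives ASSUMED to be understood by the old MySpell tools (adjust as needed).
MYSPELL_DIRECTIVES = {"SET", "TRY", "PFX", "SFX", "REP"}

def non_myspell_directives(aff_lines, baseline=MYSPELL_DIRECTIVES):
    """Count directive keywords in a .aff file that are outside the baseline set."""
    found = {}
    for line in aff_lines:
        line = line.lstrip("\ufeff").strip()  # drop a possible BOM and whitespace
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        keyword = line.split()[0]
        if keyword.isupper() and keyword not in baseline:
            found[keyword] = found.get(keyword, 0) + 1
    return found

# Hypothetical .aff content; a real file would be read with open(path, encoding=...).
sample_aff = [
    "SET UTF-8",
    "TRY aeio",
    "FLAG long",
    "COMPOUNDFLAG Cf",
    "# plural rule",
    "SFX S Y 1",
    "SFX S 0 s .",
]
extras = non_myspell_directives(sample_aff)
print(extras)
```

An empty result suggests the old munch and unmunch tools have a chance of working. Anything listed needs a closer look before trusting them.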

KevinH




Quote elchamaco
Some time ago I created a Spanish hunspell dict. I needed to dig to create a good one, and now it's used with Sigil by a lot of people. Now the idea is to improve it.

I also want to improve a real dict with definitions.

It's hard to find documentation about dictionaries, or a good program to edit them and export to different formats.
