Mobileread
Export list of words in spellcheck
#21  KevinH 07-10-2019, 12:54 PM
Okay, the Spanish dictionary shipped inside Sigil on Windows and Mac is a straight MySpell-level dictionary, so the munch and unmunch tools will work with it.

I found an old copy of MySpell-3 stored on a Google Code archive and was able to easily build and run it on my Mac. It includes the munch and unmunch tools as well.

So with unmunch, I can take the es.aff file (which describes the prefixes and suffixes commonly used in Spanish, along with the rules for when they apply) and the es.dic file, and expand them into one long list of every word form the dictionary recognizes.
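To make the expansion idea concrete, here is a minimal Python sketch of what an unmunch-style tool does for suffixes. It is not the real unmunch code: the `SuffixRule` type, the `expand` helper, and the two toy rules under flag `S` are all made up for illustration and are not taken from the actual es.aff.

```python
import re
from typing import NamedTuple


class SuffixRule(NamedTuple):
    flag: str       # flag letter that appears after "/" in a .dic entry
    strip: str      # characters removed from the end of the word ("" for none)
    add: str        # characters appended after stripping
    condition: str  # regex the end of the word must match ("." for any)


# Hypothetical toy rules (assumption: NOT copied from the real es.aff).
RULES = [
    SuffixRule("S", "", "s", "[aeiou]"),    # vowel ending: add "s"
    SuffixRule("S", "", "es", "[^aeiou]"),  # consonant ending: add "es"
]


def expand(entry: str, rules=RULES) -> list[str]:
    """Expand one 'word/FLAGS' entry into all the forms the rules allow."""
    word, _, flags = entry.partition("/")
    forms = {word}
    for rule in rules:
        if rule.flag not in flags:
            continue
        if not re.search(rule.condition + "$", word):
            continue
        if rule.strip and not word.endswith(rule.strip):
            continue
        stem = word[: len(word) - len(rule.strip)] if rule.strip else word
        forms.add(stem + rule.add)
    return sorted(forms)


print(expand("casa/S"))   # base form plus the vowel-ending plural
print(expand("papel/S"))  # base form plus the consonant-ending plural
print(expand("sol"))      # no flags, so only the base form
```

The real tool also handles prefixes and combinations of prefix and suffix, which is a large part of why the expanded list grows so quickly.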

You can then add lots of new words to that list, or even create a new prefix or suffix flag if you know which ones might be missing and the rules for applying them.

Once we have that, we can run munch to create the new .dic file. We can also add charmaps and replacement tables, along with phonetic sound-alike rules, to help improve the suggestions generated.
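The munch direction can be sketched the same way. This is only a toy, assuming a single hypothetical rule (flag `S` appends "s" with no condition); the real munch works from the full .aff file, and the `munch` function below is my own illustration, not its actual algorithm.

```python
def munch(words, flag="S", suffix="s"):
    """Fold 'word' + 'word+suffix' pairs into a single 'word/FLAG' entry."""
    kept = {}  # base word -> True if it needs the flag
    # Visit shorter words first so a base is seen before its suffixed form.
    for w in sorted(set(words), key=lambda w: (len(w), w)):
        base = w[: -len(suffix)]
        if w.endswith(suffix) and base in kept:
            kept[base] = True   # suffixed form is covered by base + flag
        else:
            kept[w] = False     # stands on its own (so far)
    return [f"{w}/{flag}" if flagged else w for w, flagged in sorted(kept.items())]


print(munch(["casa", "casas", "sol", "libro", "libros"]))
```

Five word forms collapse to three entries here, which is the same compression that makes a .dic file so much shorter than the unmunched list.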

So if this is something you would like to do, I would be happy to help. Once you get into Hunspell-only features, munch and unmunch will no longer work and you are on your own, so to speak.

#22  KevinH 07-10-2019, 01:31 PM
Just for laughs, I ran unmunch on the en_US.dic and en_US.aff files, and the 62,074 base words with affix flags expanded to a word list of 152,469 unique words.

I tried the same thing for es.dic and es.aff, and the 58,154 base words with affix flags expanded to a word list of 689,751 unique words.

So Spanish must make use of prefixes and suffixes much more than English!
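For a rough sense of scale, the expansion factor for each language can be worked out from the counts above. The variable names are mine; the numbers are the ones reported in this post.

```python
# Base-word and expanded-word counts reported above for each dictionary.
counts = {
    "en_US": (62_074, 152_469),
    "es": (58_154, 689_751),
}

# Expansion factor: expanded forms per base entry.
ratios = {lang: expanded / base for lang, (base, expanded) in counts.items()}

for lang, ratio in ratios.items():
    print(f"{lang}: about {ratio:.2f} word forms per base entry")
```

That works out to roughly 2.5 forms per base word for English and roughly 12 for Spanish.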

Also, if you look at the working-set vocabulary used by Shakespeare, for example, it was something like 35,000 words. Most average people have working sets of 10,000 to 20,000 words.

Any way you look at it, having 689,751 unique words seems to be huge coverage.

Has anyone validated the universe of words the Spanish dictionary already covers?

#23  KevinH 07-11-2019, 12:34 PM
@elchamaco
If I were to zip up the unmunched Spanish wordlist and post it here, would you be willing to download it and look at it to see if it makes sense at all? Having over 600,000 unique letter combinations that a spellcheck dictionary would deem correct just seems too big to be true without compound words.

Thanks,

KevinH

#24  elchamaco 07-18-2019, 04:01 AM
The one I created had nearly 1 million words in the base list (980-990k), and 234k in the munched list. I used the .aff from the LibreOffice Spanish dictionary, if I remember correctly.

#25  KevinH 07-18-2019, 10:22 AM
The problem is that more words do not necessarily make a spellcheck dictionary better (unlike an online dictionary).

As I tried to explain earlier, a spellcheck dictionary is meant to cover the "working set" of a language. It is not meant to be exhaustive the way an online or paper dictionary attempts to be.

The reason is that common mistakes and typos often turn out to be actual but very infrequently used "words" rather than what the author intended, so an exhaustive dictionary lets them pass unflagged. An exhaustive dictionary also suggests replacement words that the author would never use. Both lower the effectiveness of the spellchecker.
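A small Python sketch shows the effect. The two word sets are toy examples I made up in English: "wether" (a castrated ram) and "calender" (a pressing machine) are genuine but rare words that are far more often typos for "weather"/"whether" and "calendar".

```python
# Toy "working set" dictionary versus one that also includes rare words.
WORKING_SET = {"the", "weather", "whether", "calendar", "is", "on"}
EXHAUSTIVE = WORKING_SET | {"wether", "calender"}


def flagged(text, dictionary):
    """Return the words a spellchecker using this dictionary would flag."""
    return [w for w in text.lower().split() if w not in dictionary]


text = "the wether is on the calender"
print(flagged(text, WORKING_SET))  # both typos are caught
print(flagged(text, EXHAUSTIVE))   # nothing is flagged; the typos slip through
```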

The idea is that more rarely used or more esoteric words can and should be looked up in online dictionaries.

One of the nice features of spellcheck dictionaries is that authors can add their own list of more unusual words that they actually use to augment the "working set", making the spellcheck function fine-tuned to that particular person and their writing.

That was and continues to be the concept behind the design of spell check dictionaries.

Hope something here helps.
