conversion pyglossary pdf
#1  pzack 09-06-2022, 05:11 PM
Good afternoon,

I need help converting a pdf stand alone dictionary that I use on my e-reader to a stardict dictionary for use under koreader.

I have tried converting a full text file of this pdf (I did not create it), but pyglossary gives me a boatload of no-tab errors, and the stardict files that it creates from this txt file are empty. I also have an xml file, but I cannot get pyglossary to convert it to stardict even though pyglossary is supposed to support .xml.

Can anyone suggest ways to convert this pdf to a stardict dictionary? I know of no way to convert pdf to stardict in pyglossary. Perhaps there is another conversion tool; I am fishing for a way to do this.
The dictionary would be much more useful to me under stardict.


#2  Doitsu 09-07-2022, 06:16 PM
Quote pzack
I need help converting a pdf stand alone dictionary that I use on my e-reader to a stardict dictionary for use under koreader.
You can't directly convert a PDF file to another dictionary file format, because the converter wouldn't be able to reliably identify headwords and definitions.

You might want to ask about converting the xml version in the Index of Custom Dictionaries for Kobo eReader thread.

MR member Markismus might be able to help you, because he often converts non-standard dictionary files to Pocketbook dictionaries.

#3  Markismus 09-08-2022, 02:24 PM
XML is a language for data storage, not a dictionary format. So you can't expect pyglossary to support any XML whatsoever. However, you can put the XML-file online and post a link to it. Maybe you're lucky.

The Pdf-(2-epub-)2-html-2-stardict tool isn't there, yet. Probably never. The problem is that the nice styling of a PDF puts a lot of extra code in there, that has to be differentiated from the words&definitions. Optically easy, but not code-wise. You could try to get ABBYY Finereader to recognize it and specify the output format as a spreadsheet or CSV-format. However, even ABBYY's output will still have a lot of noise, that you'll have to deal with.

What is the name of the dictionary? Maybe it's already present in a nicer format than PDF.

#4  pzack 09-09-2022, 01:46 PM
Good morning M. Markismus,

Thank you for taking the time to respond to my query. I mentioned .xml because it is a supported format in pyglossary (according to github) for conversion; however, the xml file that I have is not converting. I am not sure if this file actually contains the whole dictionary anyway.

The only thing that I can think of is to convert the full text file that I have, but it is not tab delimited. When I look at this file in notepad I see that the headword is not separated out: it is the leading word, but it is part of the definition, which is a paragraph.

Pyglossary asks for a tab delimited file, citing no-tab errors as it was converting; it produced the three stardict files from my text file but they were empty. I did not create the text and xml files.

If there is a way to do a mass conversion of the text file, that is, to get the leading head word separated out (and I think that this is what is meant by tab-delimiting a file), then pyglossary may correctly convert the text file. It is almost there but needs the head word separated from the definition. However, I admit that I don't fully understand the structure of a tab delimited file.

I have seen something about dumping the text into excel or another spreadsheet to build a tab-delimited file but, unfortunately, I have zero experience and knowledge of spreadsheets.

The dictionary has over 100,000 words and I certainly cannot do it manually.

And then there is the file converter "penelope" but I don't know if there is any help in that direction.


#5  Markismus 09-09-2022, 02:04 PM
@pzack Why don't you post a link to the non-tab-delimited file?

If what you're saying turns out to be correct, then all you would need is to prefix each line with a repetition of the 1st word and a delimiter.
Sed could do that on Linux, any pattern-substitution in Perl, Python, Awk or Lua could do that.
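
As a rough illustration of that prefixing step, here is a minimal Python sketch. It assumes one entry per line with the headword as the first whitespace-separated word; the function names and the sample words are hypothetical, and the actual file has not been posted, so its layout may differ.

```python
def prefix_headword(line: str, delimiter: str = "\t") -> str:
    """Return 'headword<delimiter>line', where headword is the first word.

    Assumption: the headword is the first whitespace-separated word.
    Blank lines yield an empty string.
    """
    stripped = line.strip()
    if not stripped:
        return ""
    headword = stripped.split(maxsplit=1)[0]
    return f"{headword}{delimiter}{stripped}"


def convert(lines):
    """Yield tab-delimited entries, skipping blank lines."""
    for line in lines:
        entry = prefix_headword(line)
        if entry:
            yield entry


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from the real dictionary.
    sample = ["apple a round fruit", "", "pear another fruit"]
    for entry in convert(sample):
        print(entry)
```

To run it on a real file, one could read the lines with open(..., encoding="utf-8") and write each yielded entry to a new .txt file, then point pyglossary at that file.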

You could even do it in Excel. First column your line, second column the LEFT-function, third column a concatenation of both column-values with a delimiter in between.

#6  pzack 09-09-2022, 04:58 PM
Good afternoon M. Markismus,

Thank you for your quick response.

As I indicated, I don't know how to work with excel and spreadsheets. However, you have suggested some other possibilities of tab-delimiting the text file.

May I impose upon you to give me an example of how I may do this with the apps that you listed? If you would, choose the one that may be the simplest to work with. Please understand that I am not a programmer and I am shaky when working with scripts. But I can work in a linux terminal. Your example could be short and sweet.

I figured that there may be a way to do this and I did see a script for converting this file to tab-delimited but I can't find it; it was a short script for use in linux.

Please let me see what you come up with before I try a new thread on a tab delimited conversion.

I think, thanks to you, that we may be headed in the right direction. And here's hoping that once converted (if it can be done), pyglossary will cooperate and give me a stardict dictionary!


#7  Markismus 09-09-2022, 05:00 PM
I already wrote it out with the Excel example. What prevents you from posting a link to the text file? If it's small, you could even zip it and upload it here.

#8  pzack 09-09-2022, 05:16 PM
M. Markismus,

I want to add to my just-sent reply to you that, though I don't understand fully the structure of tab-delimited text files, I assume that pyglossary needs the head word as a hook on which to hang the definition.

My sense is that the tab delimiting isolates or sets apart the headword so that pyglossary sees it as the headword and can build its index or pointers to the headword.

This is how I understand it, but this is purely conjecture on my part. If I am correct, then I need an app, maybe among the apps that you have provided for me, to isolate (or tab?) the headword, which is the first word of each of the paragraphs that include the headword and definition. There are blank lines between the paragraphs of text. There are no illustrations in the text file.

I would need the syntax to instruct the app to tab-delimit the first word which is the head word.
Maybe this helps to clarify things.
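
The layout described here (one paragraph per entry, blank lines between paragraphs, headword first) could be turned into one tab-delimited line per entry roughly as follows. This is only a sketch under those stated assumptions; the function name and sample text are hypothetical, and it puts the tab right after the first word rather than repeating the headword.

```python
import re


def paragraphs_to_entries(text: str) -> list[str]:
    """Turn blank-line-separated paragraphs into 'headword<TAB>definition' lines.

    Assumptions: each paragraph is one entry and its first word is the headword.
    """
    # Split wherever one or more blank lines separate two paragraphs.
    blocks = re.split(r"\n\s*\n", text.strip())
    entries = []
    for block in blocks:
        # Collapse line breaks inside a paragraph so the entry fits on one line.
        one_line = " ".join(block.split())
        if not one_line:
            continue
        headword, _, definition = one_line.partition(" ")
        entries.append(f"{headword}\t{definition}")
    return entries


if __name__ == "__main__":
    # Hypothetical sample text, not taken from the real dictionary.
    sample = "apple a round\nfruit\n\npear another fruit\n"
    for entry in paragraphs_to_entries(sample):
        print(entry)
```

Each output line then has exactly one headword, one tab and one definition, which is the shape a tab-delimited glossary file is expected to have.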


#9  Doitsu 09-09-2022, 08:00 PM
Quote pzack
I would need the syntax to instruct the app to tab-delimit the first word which is the head word.
Maybe this helps to clarify things.
You also might want to look into using StarDict Editor, which you can use to compile and decompile StarDict dictionaries.
It also supports compiling and decompiling Babylon BGL dictionaries.

The Babylon glossary source file syntax, which supports inflections, is very simple:

#bookname=Spanish-English Dictionary
<p>single line definition of 'libro' (may contain html 3.2 tags, e.g <br>)</p>
<p>single line definition of 'rana' (may contain html 3.2 tags, e.g <br>)</p>

#10  pzack 09-09-2022, 08:42 PM
Good evening, Doitsu

Thank you for responding. My dictionary is not in a bgl format; thus, I don't think that the stardict editor is useful here. Actually, I tried this editor and, like pyglossary, it threw up countless no-tab errors on the full text file and gave me empty stardict files.

Thank you for the excel example, but I don't understand excel.

In looking again at the text file it is like this:

headword space [pronunciation of headword] space definition. In other words, the bracketed pronunciation (in the international alphabet) is what separates what follows from the next headword and bracket. So what follows the bracketed pronunciation of the headword will pertain to that headword until the next bracketed pronunciation with its headword just before it. Now, where would one set the tab that would separate headword and bracket from the next headword and bracket?

Again, I am trying to understand the workings of a tab delimited file and what pyglossary is looking for.
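
One way to read the layout described above is that the tab belongs between the headword and the opening bracket, with everything from the bracket onward forming the definition. The Python sketch below illustrates that idea only; the regular expression, the function names and the sample entry are hypothetical, and it assumes each entry has already been joined into a single string.

```python
import re

# Assumed layout: headword, then a [bracketed pronunciation], then the definition.
ENTRY_RE = re.compile(
    r"^(?P<headword>[^\[\]]+?)\s*(?P<pron>\[[^\]]*\])\s*(?P<definition>.*)$",
    re.DOTALL,
)


def split_entry(entry: str):
    """Return (headword, body) or None if no bracketed pronunciation is found.

    The body keeps the pronunciation in front of the definition text.
    """
    match = ENTRY_RE.match(entry.strip())
    if match is None:
        return None
    body = f"{match.group('pron')} {match.group('definition')}".strip()
    return match.group("headword"), body


def to_tab_line(entry: str):
    """Return 'headword<TAB>body', or None for entries that do not fit the layout."""
    parts = split_entry(entry)
    if parts is None:
        return None
    headword, body = parts
    return f"{headword}\t{body}"


if __name__ == "__main__":
    # Hypothetical sample entry; "[pron]" stands in for a real transcription.
    print(to_tab_line("rana [pron] frog"))
```

Entries that return None do not match the assumed layout and would need to be checked by hand or handled with a different rule.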

