Mobileread
Editor plugin: problem with regex and special characters
#1  EbookMakers 11-14-2019, 04:18 AM
Inside an editor plugin I'm running regexes loaded from a JSON file, like saved searches.
Everything works fine except for Unicode characters with high code points. For example, I have:

Code
{ "case_sensitive": false, "dot_all": false, "find": "(‘)", "mode": "regex", "name": "LEFT SINGLE QUOTATION MARK REPLACE", "replace": "'" },
Problem: this character is never found, even if I replace it with \u2018.
My JSON file is UTF-8 encoded. I extract the pattern with:
Code
pattern=unicode(searches["find"])
I even tried ur'pattern', but nothing works.
I'm using the regex module, and my compilation flags are: regex.VERSION1 | regex.WORD | regex.FULLCASE | regex.MULTILINE | regex.UNICODE

Same problem with all Unicode characters above \u2000.

Any idea how to get it working?
Thanks

#2  kovidgoyal 11-14-2019, 05:10 AM
Hard to say without looking at your code.

#3  EbookMakers 11-14-2019, 07:12 AM
The code is rather long, but I can show the crucial points.
I extract the editor text with:
Code
data=current_container.raw_data(file, decode=True, normalize_to_nfc=True)
Then I apply the search to it:
Code
pattern = regex.compile(unicode(search['find']), flags)
match = pattern.search(data)
I get no error, unless I replace ‘ or \u2018 with \xE2\x80\x98.
Tell me if you want more.
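For later readers, a minimal sketch of how the three spellings of the character behave once they pass through a JSON parser. It is written for Python 3 with the standard json and re modules instead of the Python 2 unicode() call and the regex module used in the plugin, and the JSON strings are made-up stand-ins for the real file. The thread does not say which error appeared, so the \x case below is only one plausible explanation: JSON has no \x escape.

```python
import json
import re

# Literal character and \u escape: both decode to the same one-character group.
literal = json.loads('{"find": "(\u2018)"}')["find"]
escaped = json.loads(r'{"find": "(\u2018)"}')["find"]
same_pattern = literal == escaped

# JSON only knows \uXXXX and a few single-letter escapes, so \x is rejected.
try:
    json.loads(r'{"find": "(\xE2\x80\x98)"}')
    bad_escape_error = None
except ValueError as err:  # json.JSONDecodeError is a ValueError subclass
    bad_escape_error = err

# Even as a regex escape, \xE2\x80\x98 means three separate code points
# (U+00E2, U+0080, U+0098), not the UTF-8 bytes of U+2018.
byte_style = re.compile(r"\xE2\x80\x98")
finds_quote = byte_style.search("\u2018") is not None
```

With decoded text on both sides, the pattern has to be written in code points, never in UTF-8 byte values.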

#4  kovidgoyal 11-14-2019, 07:54 AM
Looks fine to me. Check whether data actually contains the character you are looking for, using the in operator. Also check what is in search['find'].
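The two checks suggested here could look roughly like this. It is only a sketch in Python 3: data and search are hypothetical stand-ins for the text returned by raw_data() and for one entry of the JSON file, and describe() is a helper invented for this example.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs so look-alike characters can be told apart."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch, "?")) for ch in text]

# Hypothetical stand-ins for the plugin's real values.
data = "He said \u2018hello\u2019."
search = {"find": "(\u2018)"}

# Check 1: is the character really in the text being searched?
char_present = "\u2018" in data
print(char_present)

# Check 2: what does the pattern actually contain after loading?
print(repr(search["find"]))
print(describe(search["find"]))
```

If both checks look right, the pattern and the data are probably fine and the problem lies in the code that runs after the search.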

#5  EbookMakers 11-14-2019, 08:52 AM
Damn! Everything is fine and works.
The only problem was that in my real code I have replace: "\\1", and I was only counting a match when match != replace.

Since \1 just puts the matched text back, the two were always equal, so that condition could never be true.
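A small reproduction of that pitfall, as a sketch only. It assumes the check compared the matched text with the expanded replacement, which the post does not spell out, and it uses Python 3 with the standard re module in place of the regex module. The sample text is made up.

```python
import re

pattern = re.compile("(\u2018)")
data = "He said \u2018hello\u2019."   # made-up sample text
replace = r"\1"                       # what the JSON value "\\1" decodes to

match = pattern.search(data)
matched = match.group(0)
replacement = match.expand(replace)   # \1 expands to the matched text itself

# Flawed test: treats "nothing changed" as "nothing found".
changed = matched != replacement

# Safer test: a match exists whenever search() returns a match object.
found = match is not None
```

Here changed is False even though found is True, so detecting hits from the match object keeps identity replacements from hiding them.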

Thank you, Kovid, for your tips and for pointing me in the right direction.
Sorry for the inconvenience.