Mobileread
Regex examples
#11  DiapDealer 02-20-2012, 11:14 PM
Sorry theducks. That's some nice lookaheads/lookbehinds you have going on there, but I need it to include multi-word phrases that may have punctuation (“”‘’.,?!:;-—…).

Tilia's takes me to a lot more of what I'm looking for, but seems to skip multi-word phrases that have an apostrophe in the very first word.

Basically, I've accidentally blown up a lot of the italic spans in a document of mine. I'd like to be able to reliably find every instance of <span class="italics">(.*?)</span> EXCEPT instances where there's only one word in the span. That would narrow what I need to check from 700 instances down to... well... I'm not exactly sure of the number (obviously), but it would be a heck of a lot less than 700, anyway.

#12  theducks 02-21-2012, 12:17 AM
The trick I found:
There are one to n cases of a word followed by a space, and then a single word with no space after it. I don't know whether [:punct:] will match an em dash or an ellipsis.
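On the [:punct:] question, here is a rough check in Python rather than in Sigil's own engine (that swap is an assumption on my part, so treat it as a hint only). It looks up the Unicode category of each character from the punctuation list earlier in the thread and compares that with the ASCII-only punctuation set. Whether a given engine's [:punct:] follows the Unicode categories or the ASCII set depends on how it was built and which mode is active.

```python
import string
import unicodedata

# Characters from the punctuation list quoted earlier in the thread.
chars = "“”‘’.,?!:;-—…"

for ch in chars:
    category = unicodedata.category(ch)   # Unicode general category, e.g. 'Pd'
    in_ascii = ch in string.punctuation   # ASCII-only punctuation set
    print(f"{ch!r}: category={category}, ascii_punct={in_ascii}")

em_dash, ellipsis = "—", "…"
# Unicode punctuation categories all start with 'P'.
is_unicode_punct = [unicodedata.category(c).startswith("P") for c in (em_dash, ellipsis)]
is_ascii_punct = [c in string.punctuation for c in (em_dash, ellipsis)]
```

Both characters are punctuation as far as Unicode is concerned, but neither is in the ASCII set, so an ASCII-only [:punct:] would likely miss them.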

#13  davidfor 02-21-2012, 12:35 AM
How about:

Code
<span class="italics">\w+\s+.*</span>
That seems to work in my tests. There is an issue with greediness, as I happened to have a paragraph with two multi-word italic sections in my test book. The search worked, but it selected the two italic sections and everything between them. It didn't find any of the single-word italics, though.
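A small sketch of that greediness issue, written for Python's re module rather than Sigil's engine (an assumption; the sample paragraph is made up). As far as I know Python has no PCRE-style switch to flip greediness for the whole pattern, so the lazy variant here uses .*? instead.

```python
import re

# Made-up paragraph with two multi-word italic spans.
html = ('<p><span class="italics">first phrase</span> plain text '
        '<span class="italics">second phrase</span></p>')

# Pattern as posted: the greedy .* runs on to the LAST </span> on the line.
greedy = re.compile(r'<span class="italics">\w+\s+.*</span>')
# Lazy variant: .*? stops at the FIRST </span> it can reach.
lazy = re.compile(r'<span class="italics">\w+\s+.*?</span>')

greedy_hits = greedy.findall(html)
lazy_hits = lazy.findall(html)

print(len(greedy_hits), len(lazy_hits))
```

The greedy pattern returns one long match that swallows both spans and the text between them; the lazy one returns each span separately.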

#14  Timur 02-21-2012, 01:45 AM
@davidfor: Add (?U) in front of your regexp for lazy matching.

#15  DiapDealer 02-21-2012, 08:13 AM
Correct me if I'm wrong, davidfor, but won't that still exclude instances where the very first word contains an apostrophe or any other non-word character? It gets me very close, certainly, but I just don't think what I'm looking for is going to be based on "\w". The potential for non-word characters in the words (including the first one) is just too great.

I'm not so much concerned with the greediness of the expression (as I'm not blindly replacing anything with it) as I am with seeing every single multi-word occurrence, regardless of whether it contains non-word characters.
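To illustrate the concern about "\w", here is a quick sketch in Python (not Sigil's engine, and the sample spans are invented). It runs the \w+\s+ pattern, with a lazy tail, against a few spans whose first word does or does not contain a non-word character.

```python
import re

pattern = re.compile(r'<span class="italics">\w+\s+.*?</span>')

# Invented sample spans; only the first word differs in kind.
samples = {
    "plain":      '<span class="italics">so it goes</span>',
    "apostrophe": '<span class="italics">don\'t look back</span>',
    "quote":      '<span class="italics">“Never again,” she said</span>',
    "single":     '<span class="italics">word</span>',
}

# True where the pattern finds the span, False where it skips it.
results = {name: bool(pattern.search(text)) for name, text in samples.items()}
print(results)
```

Only the "plain" span is found. The apostrophe and opening-quote cases are multi-word but get skipped, because \w+ has to be followed directly by whitespace.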

#16  Jellby 02-21-2012, 08:26 AM
So you want something like this?

Code
<span class="italics">[^<]*\s.*</span>
It might not be the right regex dialect, but [^<] is intended to mean "any character not <". That won't match instances where there is something nested in the <span> before the first space, but those should be rare, and can be looked for afterwards.
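A hedged sketch of how this pattern behaves, again in Python rather than Sigil, with invented sample spans. It covers the apostrophe case, single words, and the nested-tag limitation mentioned above.

```python
import re

# [^<]* can never cross a tag, so the required \s must occur in the
# span's text before any nested element starts.
pattern = re.compile(r'<span class="italics">[^<]*\s.*</span>')

samples = {
    "apostrophe first word": '<span class="italics">don\'t look back</span>',
    "single word":           '<span class="italics">word</span>',
    "single word + punct":   '<span class="italics">word!</span>',
    "nested after space":    '<span class="italics">some <b>bold</b> text</span>',
    "nested before space":   '<span class="italics"><b>bold</b> and more</span>',
}

results = {name: bool(pattern.search(text)) for name, text in samples.items()}
for name, found in results.items():
    print(f"{name}: {found}")
```

The last sample is the case that slips through: the nested tag comes before the first space, so [^<]* stops immediately and \s has nothing to match.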

#17  DiapDealer 02-21-2012, 09:07 AM
That seems to be the ticket. Thanks!

I had a mishap where an ill-thought-out global replace (because of nested spans and greedy expressions) left me with a boatload of long, incorrectly italicized passages. And it got saved before I caught it. I could've backed up a few revisions and started over, but I didn't want to (maybe not always straight, but ever forward).

Anyway, since I know the one-word occurrences aren't mistakes, I can safely skip those. And in this particular document, that little regex knocks the number of occurrences I have to manually proof against the original text from 700+ down to around 150.

Thanks again!

#18  Timur 02-21-2012, 09:08 AM
@DiapDealer: Does this narrow down your set enough? This one should match anything with at least one non-word (Unicode) character in italics, including contractions but excluding empty spans (which should be easy enough to remove beforehand or afterwards).

Code
(*UCP)(?U)<span class="italics">[^<]*\W[^<]*</span>
If you do not want to miss anything at all (like nested spans), use .* instead of [^<]*. But you will probably also get some unwanted matches that run across multiple spans.
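For anyone who wants to see what the Unicode part buys you, here is a rough Python approximation (an assumption: Python's re is not PCRE, so (*UCP) and (?U) are replaced by Python's default Unicode-aware \W and by lazy *? quantifiers). It uses the "italics" class name from the earlier posts, and the sample spans are made up.

```python
import re

# str patterns in Python are Unicode-aware by default, so \W treats
# accented letters as word characters.
unicode_pat = re.compile(r'<span class="italics">[^<]*?\W[^<]*?</span>')
# re.ASCII limits \w to [a-zA-Z0-9_], so an accented letter counts as \W.
ascii_pat = re.compile(r'<span class="italics">[^<]*?\W[^<]*?</span>', re.ASCII)

samples = {
    "single":      '<span class="italics">word</span>',
    "contraction": '<span class="italics">don\'t</span>',
    "multi":       '<span class="italics">two words</span>',
    "accented":    '<span class="italics">caf\u00e9</span>',
}

unicode_results = {k: bool(unicode_pat.search(v)) for k, v in samples.items()}
ascii_results = {k: bool(ascii_pat.search(v)) for k, v in samples.items()}
print(unicode_results)
print(ascii_results)
```

Single-word contractions are matched either way, and without Unicode-aware classes a single accented word gets flagged as well.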

#19  DiapDealer 02-21-2012, 09:25 AM
@Timur: a non-word character in the target isn't required, but any multi-word instances that happen to contain non-word characters need to be included by the search, too. (Also, that expression crashes my Sigil 0.5.1.)

I've yet to find an instance where Jellby's expression skips something that I wanted included. And I've given it a hell of a workout so far.

This is the non-beta version of what is now working outstandingly well for me.

Code
(?U)<span class="italics">[^<]*\s.*</span>
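If someone wants to build the same review list outside of Sigil, here is a minimal Python sketch under a few assumptions: the (?U) prefix is replaced by a lazy .*?, the helper name review_list is made up, and the sample text is invented. It counts every italic span and lists the multi-word ones with their line numbers.

```python
import re

# Every italic span, for the overall count.
ALL_SPANS = re.compile(r'<span class="italics">.*?</span>')
# Multi-word spans only; .*? stands in for the (?U) prefix.
MULTI_WORD = re.compile(r'<span class="italics">[^<]*\s.*?</span>')

text = """<p>He said <span class="italics">no</span> twice.</p>
<p><span class="italics">It's over</span>, she wrote.</p>
<p>A <span class="italics">very</span> long <span class="italics">and winding road</span>.</p>"""

def review_list(text):
    """Return (line number, matched span) for each multi-word italic span."""
    hits = []
    for m in MULTI_WORD.finditer(text):
        line_no = text.count("\n", 0, m.start()) + 1  # 1-based line of match start
        hits.append((line_no, m.group(0)))
    return hits

total = len(ALL_SPANS.findall(text))
to_check = review_list(text)
print(f"{len(to_check)} of {total} spans need checking")
for line_no, span in to_check:
    print(line_no, span)
```

On this sample, two of the four spans end up on the list, and the one-word spans are skipped.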

#20  Timur 02-21-2012, 09:53 AM
Strange that my pattern causes a crash; I use 0.5.1 here too and it works. Anyway, I am glad that you have found a regexp that works for you.
