Mobileread
Regex examples
#1  meme 02-05-2012, 01:56 PM
I'd like to see if I can collect Regular Expressions (PCRE format as introduced in Sigil 0.5.0) used for common or difficult issues, and maybe add them to the FAQ, etc. Partly so I can have a list to refer to when needed, but also to collect what has probably already been mentioned in this forum, and maybe to find out whether there is a way to do a particular replacement.

For instance, is there a regex to do other types of replacement but only inside body tags?

Is there one only for the actual text - words not part of a tag name or attribute? Words that are only part of a tag name or attribute?

If you have any suggestions for the above cases, or any other useful Regex expressions please post them.

#2  Timur 02-05-2012, 08:15 PM
Matches regex inside body element and inside character data only.
(The first negative lookahead, which enforces the character-data requirement, works even though the specification does not require the greater-than sign to be escaped in character data. But you have to save the epub at least once in Sigil and then reload it so that all greater-than signs are escaped to &gt; , or else you might miss some matches.)
(The second negative lookahead will not work if your document has more than one body element. Sigil allows this, but the W3C validator reports an error for such documents. I do not know the strict specifications for multiple body elements.)
Code
(?s)regex(?![^<>]*>)(?!.*<body[^>]*>)
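As a rough check outside Sigil, the expression above can be tried in Python's `re` module, which is assumed here to handle these particular lookaheads the same way PCRE does. The sample markup and the search word `cat` (standing in for the `regex` placeholder) are made up for illustration.

```python
import re

# "cat" stands in for the "regex" placeholder in the expression above.
pattern = re.compile(r'(?s)cat(?![^<>]*>)(?!.*<body[^>]*>)')

# Made-up sample: "cat" appears in the head, in attribute values, and in body text.
doc = ('<html><head><title>cat tales</title></head>'
       '<body class="cat"><p id="cat">The cat sat.</p></body></html>')

# Only the occurrence in the body's character data should be replaced.
result = pattern.sub('dog', doc)
print(result)
```

The title text is rejected by the second lookahead (a body start tag still lies ahead), and the two attribute values are rejected by the first (a `>` is reached before any `<`).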
Matches regex only inside attribute values.
(If your document uses single quotes (apostrophes) anywhere as the attribute value delimiter instead of double quotes, again, save and reload to change them all to double quotes, so that this regex works reliably. Saving and reloading also escapes all quotes inside attribute values to &quot; , so that your elements stay well-formed. Reloading also escapes all greater-than signs; otherwise you risk matching something inside character data.)
Code
regex(?=[^<]*>)(?!(?:[^<"]*"[^<"]*")+\s*/?>)
Edit: Typo.
Edit 2: Added clarification in bold.
Edit 3: Slight simplification in the second code.
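The attribute-value expression in this post can be exercised the same way. This is only a sketch in Python's `re` module (assumed to agree with PCRE for these constructs), with made-up markup and `cat` standing in for the `regex` placeholder.

```python
import re

# "cat" stands in for the "regex" placeholder in the second expression.
pattern = re.compile(r'cat(?=[^<]*>)(?!(?:[^<"]*"[^<"]*")+\s*/?>)')

# Made-up sample: "cat" occurs in attribute values, in an attribute name,
# and in character data.
doc = '<p class="cat big" data-cat="1">cat <span title="cat">cat</span></p>'

# Only the two occurrences inside attribute values should change.
result = pattern.sub('dog', doc)
print(result)
```

The idea is a quote count: inside an attribute value an odd number of double quotes remains before the closing `>`, so the second lookahead cannot pair them all up and the match is allowed.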

#3  JeremyR 02-05-2012, 08:32 PM
This one uses the old format of Sigil Regex, but I find it very useful. Basically, I use it on a document that started as plain text, so it is not broken up into chapters but does have them labeled. It finds the chapter labels, highlights them as headings, and adds the break marks.

For books with Chapter I, Chapter II, and so on (Roman Numerals) or with Chapter 1, Chapter 2. (And of course, use it in code view)

Search for

CHAPTER [0-9XVI]+

And replace with

<hr class="sigilChapterBreak" /><h3>\0</h3>

On occasion it will find a phrase like "chapter in" in the text, but that's pretty rare (and just check the TOC before having it split)
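The search and replace above can be tried in Python's `re` module as a sketch. Two things here are assumptions rather than part of the post: Python writes the whole-match reference as `\g<0>` where the post uses `\0`, and the trailing `\b` is only a suggested way to avoid the "chapter in" false positive when the search is case-insensitive. The sample text is made up.

```python
import re

text = "CHAPTER I\nIt was dark.\nCHAPTER 12\nA new chapter in his life."

# Python's spelling of the whole-match backreference is \g<0> (the post uses \0).
replacement = r'<hr class="sigilChapterBreak" /><h3>\g<0></h3>'
marked = re.sub(r'CHAPTER [0-9XVI]+', replacement, text)
print(marked)

# A case-insensitive search also catches "chapter i" inside "chapter in".
loose = re.findall(r'CHAPTER [0-9XVI]+', text, flags=re.IGNORECASE)

# Assumption: a word boundary after the numeral rejects that false positive.
strict = re.findall(r'CHAPTER [0-9XVI]+\b', text, flags=re.IGNORECASE)
print(loose, strict)
```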

#4  WS64 02-06-2012, 07:26 AM
Quote meme
Is there one only for the actual text - words not part of a tag name or attribute?
F: (>[^<]*)old
R: \1new
(If the text contains > or < this will go wrong, but Sigil cleans the code up so it should work.)
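One thing worth checking with this find/replace pair is how many occurrences a single pass changes. The sketch below uses Python's `re` module (assumed to behave like PCRE here) on a made-up paragraph. Because `[^<]*` is greedy and each match consumes text from the `>` onward, one pass only changes the last `old` in each run of text, so the replace may need to be repeated.

```python
import re

doc = '<p class="old">The old house, old and grey.</p>'
find = re.compile(r'(>[^<]*)old')

# One pass: [^<]* is greedy, so only the last "old" in the text node is hit.
first_pass = find.sub(r'\1new', doc)
print(first_pass)

# Repeat until nothing changes (capped, in case the replacement re-creates a match).
final = doc
for _ in range(20):
    final, count = find.subn(r'\1new', final)
    if count == 0:
        break
print(final)
```

The `class="old"` attribute is left alone in every pass, since no `>` precedes it inside the tag.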

#5  WS64 02-06-2012, 07:29 AM
I'm collecting the regex expressions I often use here, I guess at least some of them are interesting for others too:
http://ws64.com/regex/ (this page is a living document, so it may change at any time, and of course, use at your own risk!)

#6  DiapDealer 02-20-2012, 06:10 PM
I can usually bang my head against something long enough to figure it out, but I'm giving up and looking for help...

I'm looking for an expression that will locate span tags (of a specific class) that enclose more than one word. Let's just say the span class is "italics".

So I'm looking for an expression that will find:
Code
<span class="italics">This is three words</span>
or:
Code
<span class="italics">This is a great big bunch of words, maybe even a whole dang paragraph</span>
but not:
Code
<span class="italics">one</span>
I know I'm probably missing something obvious and am going to feel like a complete idiot when I see the answer. I don't even really care what, exactly, gets highlighted, as I won't be using it to replace anything... only to manually eyeball the occurrence.

#7  theducks 02-20-2012, 06:23 PM
Quote DiapDealer
I'm looking for an expression that will locate span tags (of a specific class) that enclose more than one word. Let's just say the span class is "italics".
Keyword Quantifier
Code
<span class="italics">(\w+){2,}</span>
2 or more

#8  DiapDealer 02-20-2012, 07:48 PM
Quote theducks
Keyword Quantifier
Code
<span class="italics">(\w+){2,}</span>
2 or more
That seems to be finding all occurrences of <span class="italics"></span> that enclose 2 or more word characters. And it's still returning one-word instances, while skipping things like:
Code
<span class="italics">Three weeks later</span></p>
And definitely skipping multiple word instances that contain punctuation and/or quotes:
Code
<span class="italics">Well, dammit, it’s been two days.</span>
What else ya got?
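The behaviour described in this post can be reproduced in Python's `re` module (assumed to treat this pattern like PCRE does). The short sketch shows why `(\w+){2,}` counts word characters rather than words: the group is free to split a single word into two repetitions, and a space is never a `\w` character. The one-letter sample is made up.

```python
import re

pattern = re.compile(r'<span class="italics">(\w+){2,}</span>')

one_word = '<span class="italics">one</span>'
one_letter = '<span class="italics">a</span>'
three_words = '<span class="italics">Three weeks later</span></p>'

# "one" still matches: the group can repeat twice by splitting it as "on" + "e".
one_word_hit = pattern.search(one_word) is not None

# A single letter cannot be split into two repetitions, so it is skipped.
one_letter_hit = pattern.search(one_letter) is not None

# The spaces are not \w characters, so the multi-word span is skipped too.
three_words_hit = pattern.search(three_words) is not None

print(one_word_hit, one_letter_hit, three_words_hit)
```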

#9  tilia 02-20-2012, 09:06 PM
What about:

Code
<span class="italics">"?\w+,?\.?\s
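A quick way to see what this suggestion catches is to run it over a few spans in Python's `re` module (assumed to behave like PCRE here). It only matches the opening tag plus the first word and the whitespace after it, which is enough for eyeballing. The last sample is a made-up edge case, not one from the thread: a first word containing an apostrophe is not covered.

```python
import re

pattern = re.compile(r'<span class="italics">"?\w+,?\.?\s')

samples = [
    '<span class="italics">This is three words</span>',
    '<span class="italics">Well, dammit, two days.</span>',
    '<span class="italics">one</span>',
    '<span class="italics">Don\'t stop now</span>',  # made-up edge case
]

# True where the span starts with a word followed by whitespace.
hits = [bool(pattern.search(s)) for s in samples]
print(hits)
```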

#10  theducks 02-20-2012, 09:23 PM
Code
(?<=<span class="italics">)(\w+ ){1,}(\w+)(?=</span>)
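Run against the samples from earlier in the thread, this expression finds the plain multi-word spans but still skips spans with punctuation between words. The sketch below uses Python's `re` module (assumed to match PCRE for these constructs). The `looser` pattern is a hypothetical alternative, not something proposed in the thread: it only asks for two runs of non-space text separated by whitespace, and it will not match spans that contain nested tags.

```python
import re

# Expression from the post above: whole words separated by single spaces.
words_only = re.compile(r'(?<=<span class="italics">)(\w+ ){1,}(\w+)(?=</span>)')

# Hypothetical alternative: two non-space runs separated by whitespace.
looser = re.compile(r'<span class="italics">\s*[^<\s]+\s+[^<\s][^<]*</span>')

samples = [
    '<span class="italics">This is three words</span>',
    '<span class="italics">Three weeks later</span></p>',
    '<span class="italics">Well, dammit, it\'s been two days.</span>',
    '<span class="italics">one</span>',
]

words_only_hits = [bool(words_only.search(s)) for s in samples]
looser_hits = [bool(looser.search(s)) for s in samples]
print(words_only_hits)
print(looser_hits)
```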
