Mobileread
regex newbie search end of string char problem
#1  michaelbr 10-10-2020, 11:30 AM
I have a text file with several paragraphs, I'd like to search for paragraphs ending with *[a-zA-Z]</p>, here is an example:
paragraph 1: .....
Code
.’</p>
paragraph 2: .....
Code
.</p>
paragraph 3: ......
Code
</p>
The .... can be either a letter or a number. I'd like to find only paragraph 3. I tried this regex
Code
([^.]|[^.’])<\/p>$
, but it's not working, can someone please tell me what's the best way to search for this string?

#2  theducks 10-10-2020, 12:28 PM
I prefer to do my Joins individually by type. I also only use Replace ALL for these 2 (I have a number of others for special instances that I step thru and Skip false positives)
(The code was snipped from my saved Search file, so things shown are 'escaped'. They also take into consideration valid punctuation marks.)
Code
74\Name=Cleanup/Joins/Join to upper
74\Find="([[:alpha:],][\"\x201d\xe2\x80\x9d]*)</p>\\s*<p\\b[^>]*>([A-Z\xe2\x80\x9c\"])"
74\Replace=\\1 \\2
75\Name=Cleanup/Joins/To Lower
75\Find="\\s*([a-z],*)</p>\\s+<p class=\"calibre1\">([a-z])"
75\Replace=\\1 \\2

#3  michaelbr 10-10-2020, 01:19 PM
Thanks for the tips, it's solved.

#4  Tex2002ans 10-12-2020, 03:35 PM
Quote michaelbr
I tried this regex
Code
([^.]|[^.’])<\/p>$
, but it's not working, can someone please tell me what's the best way to search for this string?
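The . is a very special symbol in Regex. Outside of square brackets, it stands for "any character". If you want to look for an actual period there, you'll want to add a \ before it (inside a character class like [^.], the . already means a literal period):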

. = any character
\. = a period
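
To see what the pattern from the question actually does, here is a small Python check. It is only a sketch: the three sample strings are made-up stand-ins for paragraphs 1 to 3, and the `fixed` pattern is one possible correction, not something posted in the thread. In `([^.]|[^.’])` the first alternative `[^.]` happily accepts the ’, so paragraph 1 still matches. Putting both characters into a single negated class avoids that.

```python
import re

# Made-up stand-ins for the three paragraph endings in the question.
samples = {
    "paragraph 1": "<p>‘Stop.’</p>",
    "paragraph 2": "<p>It ended.</p>",
    "paragraph 3": "<p>It just stopped</p>",
}

# The pattern from the question: [^.] also accepts ’, so paragraph 1 matches.
original = re.compile(r"([^.]|[^.’])<\/p>$")

# Assumed fix: one negated class that excludes both the period and the ’.
fixed = re.compile(r"[^.’]</p>$")


def matching(pattern):
    """Return the names of the samples in which the pattern finds a match."""
    return [name for name, text in samples.items() if pattern.search(text)]


print(matching(original))
print(matching(fixed))
```

The first line printed lists paragraphs 1 and 3, the second only paragraph 3.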

Quote michaelbr
I have a text file with several paragraphs, I'd like to search for paragraphs ending with *[a-zA-Z]</p>, [...]
Can you try to explain, in words, what's the issue you're trying to solve? And give a few more examples of before/after?

From what I can tell, I think you're trying to find paragraphs without a closing punctuation mark. (aka, paragraphs that end in a letter.)

Like if you're taking an OCRed book, and trying to combine broken lines together:

Code
<p>This is a copied and</p>
<p>pasted paragraph from the</p>
<p>book.</p>
<p>And true paragraph 2.</p>
After:

Code
<p>This is a copied and pasted paragraph from the book.</p>
<p>And true paragraph 2.</p>
* * *

Here are the 3 sets of Regex I personally use:

Note: DO NOT do a "Replace All". Replace most of these on a case-by-case basis. Also, make sure to save a backup copy of your file.

Regex #1 (Hyphens)

This searches for a hyphen at the end of a paragraph:

Search: -</p>\s+<p>
Replace: (LEAVE THIS COMPLETELY BLANK)

OR alternate:

Search: -</p>\s+<p>
Replace: -

Example:

Code
<p>This example is where the pre-</p>
<p>split occurs.</p>
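
For anyone scripting this outside an editor, Regex #1 could be tried in Python roughly like this. It is a sketch under assumptions: the function names are made up for illustration, and `sub` replaces every match at once, so it only suits a snippet that has already been checked by hand.

```python
import re

# Regex #1 from the post: a hyphen right before a paragraph break.
HYPHEN_BREAK = re.compile(r"-</p>\s+<p>")


def join_dropping_hyphen(html):
    """Blank replacement: removes the hyphen and the paragraph break."""
    return HYPHEN_BREAK.sub("", html)


def join_keeping_hyphen(html):
    """Alternate replacement: removes the break but keeps the hyphen."""
    return HYPHEN_BREAK.sub("-", html)


example = "<p>This example is where the pre-</p>\n<p>split occurs.</p>"

print(join_dropping_hyphen(example))
print(join_keeping_hyphen(example))
```

The first call gives "presplit", the second "pre-split", which is why each hit is worth deciding individually.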
Regex #2 (Not Closing Punctuation)

This searches for paragraphs ending in anything that's NOT closing punctuation (a period, exclamation point, question mark, closing quote, etc.):

Search: ([^>”\?\!\.])</p>\s+<p>
Replace: \1 <---- Make sure you put a space after.

Example:

Code
<p>This is an example</p>
<p>sentence where the person,</p>
<p>places, and things occur.</p>
Note: You can easily add different "valid" punctuation endings as needed. A colon, for example, may or may not count as one:

In Fiction, colons likely occur within sentences.
In Non-Fiction, colons likely occur at the end of paragraphs.
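
A rough Python version of Regex #2, as a sketch only. The helper names are hypothetical, and the replacement used here is \1 followed by a space, since the joined words would otherwise run together. `candidates` lists the hits first so they can be reviewed before anything is changed.

```python
import re

# Regex #2 from the post: a character that is not >, ”, ?, ! or .
# sitting right before a paragraph break.
NOT_CLOSING = re.compile(r"([^>”\?\!\.])</p>\s+<p>")


def candidates(html):
    """Return the last character of each paragraph the pattern would join."""
    return [m.group(1) for m in NOT_CLOSING.finditer(html)]


def join_all(html):
    """Join every candidate, keeping the captured character plus one space."""
    return NOT_CLOSING.sub(r"\1 ", html)


example = (
    "<p>This is an example</p>\n"
    "<p>sentence where the person,</p>\n"
    "<p>places, and things occur.</p>"
)

print(candidates(example))
print(join_all(example))
```

To treat a colon as a valid ending, it would be added inside the square brackets, e.g. `[^>”\?\!\.:]`.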

Regex #3 (Lowercase Start)

This searches for a lowercase letter at the very beginning of the paragraph:

Search: <p>[a-z]

I make sure to run this after #1 and #2 to catch any strays, then decide these on a case-by-case basis.

Example:

Code
<p>The fishy “car dealership”</p>
<p>was called Mr. X’s Emporium.</p>
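
Since Regex #3 is search-only, a script version would just report the hits. A minimal sketch, assuming one paragraph per line and using a made-up helper name:

```python
import re

# Regex #3 from the post: a paragraph that opens with a lowercase letter.
LOWERCASE_START = re.compile(r"<p>[a-z]")


def lowercase_starts(html):
    """Return (line number, line) for each line containing a lowercase start."""
    return [
        (number, line)
        for number, line in enumerate(html.splitlines(), start=1)
        if LOWERCASE_START.search(line)
    ]


example = (
    "<p>The fishy “car dealership”</p>\n"
    "<p>was called Mr. X’s Emporium.</p>"
)

for number, line in lowercase_starts(example):
    print(number, line)
```

Here Regex #2 skips the break because the first paragraph ends in ”, and this check flags line 2 as the stray.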

#5  michaelbr 10-13-2020, 03:30 PM
Quote Tex2002ans
From what I can tell, I think you're trying to find paragraphs without a closing punctuation mark. (aka, paragraphs that end in a letter.)

Hi Tex2002ans, thanks so much for your detailed explanation, that's exactly what I'm trying to do, I used your solution Regex #2 (partially, searching for small letters at the end), but yours is much better, I'll use yours instead. Again thanks so much for sharing.

#6  Tex2002ans 10-13-2020, 04:48 PM
Quote michaelbr
Hi Tex2002ans, thanks so much for your detailed explanation, that's exactly what I'm trying to do,
Glad to see I guessed correctly.

Quote michaelbr
I used your solution Regex #2 (partially, searching for small letters at the end), but yours is much better, I'll use yours instead.
If you're looking for lowercase letters at the end, you could also use something like this:

Search: ([a-z])</p>\s+<p>
Replace: \1 <---- Make sure you put a space after.

Code
<p>This is an example</p>
<p>sentence. But THIS LINE</p>
<p>won't match.</p>
but I think my Regexes are better. :P
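
The lowercase-only variant can be checked the same way in Python. Again just a sketch with a hypothetical function name, and the replacement includes the trailing space:

```python
import re

# Lowercase-ending variant from the post.
LOWERCASE_END = re.compile(r"([a-z])</p>\s+<p>")


def join_lowercase_endings(html):
    """Join paragraphs that end in a lowercase letter, adding one space."""
    return LOWERCASE_END.sub(r"\1 ", html)


example = (
    "<p>This is an example</p>\n"
    "<p>sentence. But THIS LINE</p>\n"
    "<p>won't match.</p>"
)

print(join_lowercase_endings(example))
```

The first two lines get joined, while the break after "THIS LINE" is left alone because E is uppercase.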

#7  michaelbr 10-15-2020, 02:54 PM
Yes, certainly, yours are much better, thanks for sharing.
