Regex Match to closest string.

     
4:12 pm on Nov 13, 2015 (gmt 0)

Junior Member from US 

10+ Year Member

joined:Feb 27, 2004
posts:104
votes: 0


I hope this is OK to post here. I have been trying to do this for a couple of days now and have read tons of info, but I just can't nail it down.
I've actually been testing this in Notepad++, figuring that if I can do it there I can migrate it to the Perl (CGI) script that I use.

So I have a repeating HTML tag that I want the match to start at, i.e. <p>.
The match must contain XXXX and end with YYYY.

I know the following isn't Perl, and I can't even get it to work in Notepad++,
but you get the concept of what I am trying to do; the end result would be the same pattern rewritten for Perl.

(?=<p>).*?(XXXX).*(YYYY)


It just grabs the very first <p> tag, but I want it to match the <p> tag closest to my other criteria.
I can't isolate the particular starting <p> I want from the others.

Is this something easy to do? I can't find an answer on how to match the closest <p>.

Basically... I am trying to remove a chunk like this...
<p><strong><span style="color: #0000ff;"><strong>This XXXX To Remove YYYY Starting With "<p>"</strong></p>


From this mess....

<p><strong><span style="color: #0000ff;"><strong>BlahBlahBlah</strong></p>
<br>
<p><strong><span style="color: #0000ff;"><strong>BlahBlahBlah</strong></p>
<br>
<p><strong><span style="color: #0000ff;"><strong>This XXXX To Remove YYYY Starting With "<p>"</strong></p>
<br>
<p><strong><span style="color: #0000ff;"><strong>BlahBlahBlah</strong></p>
<br>
<br>
<p><strong><span style="color: #0000ff;"><strong>This XXXX To Remove YYYY Starting With "<p>"</strong></p>
<br>
<p><strong><span style="color: #0000ff;"><strong>BlahBlahBlah</strong></p>
<br>

===========================
Thanks group... my apologies but I just haven't found a clear easy bit of Regex to do this....
4:27 pm on Nov 13, 2015 (gmt 0)

Administrator from US 

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 21, 1999
posts:38257
votes: 115


Regexes in Perl are "greedy" by default. I think you are running into that. Here are a couple of tutorials on the subject:
[ultraedit.com...]
[itworld.com...]
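
For illustration, here is a minimal sketch of greedy versus lazy matching. The sample string is hypothetical, not the poster's real page. It also shows why a lazy quantifier alone does not find the "closest" <p>: the engine tries start positions from left to right, so the match still begins at the first <p> that can lead to a match.

```perl
use strict;
use warnings;

# Hypothetical sample: three paragraphs, only the second holds XXXX ... YYYY.
my $html = '<p>Blah</p><p>This XXXX To Remove YYYY</p><p>Blah YYYY</p>';

# Greedy .* runs on to the LAST YYYY it can reach.
my ($greedy) = $html =~ /(<p>.*XXXX.*YYYY)/s;

# Lazy .*? stops at the FIRST YYYY, but the match still begins at the
# leftmost <p>, because start positions are tried from left to right.
my ($lazy) = $html =~ /(<p>.*?XXXX.*?YYYY)/s;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

Both captures start at the very first <p>; only the end point differs. Keeping the match from starting too early takes something more than switching .* to .*?.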

You can go after this a couple of ways: with a tricky regex, or by looping through the page in smaller bits.
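
One possible shape for the "tricky regex" route is sketched below. It is an assumption about what would work, not a tested fix for the real page: it relies on bare <p> tags with no attributes, and the sample data is a cut-down copy of the question's markup. The negative lookahead (?!</?p>) stops the stretch between the opening <p> and XXXX from crossing into another paragraph.

```perl
use strict;
use warnings;

# Cut-down copy of the markup from the question (hypothetical test data).
my $html = join "\n",
    '<p><strong>BlahBlahBlah</strong></p>',
    '<br>',
    '<p><strong>This XXXX To Remove YYYY Starting With "<p>"</strong></p>',
    '<br>',
    '<p><strong>BlahBlahBlah</strong></p>';

# <p>, then characters that never begin <p> or </p>, up to XXXX;
# then characters that never begin </p>, up to YYYY;
# then the shortest run to the closing </p>.
my $re = qr{<p>(?:(?!</?p>).)*?XXXX(?:(?!</p>).)*?YYYY.*?</p>}s;

# Copy $html and delete every matching paragraph from the copy.
(my $clean = $html) =~ s/$re//g;

print $clean, "\n";
```

The paragraph containing XXXX ... YYYY is removed and the BlahBlahBlah paragraphs are left alone. An empty line remains where the paragraph used to be.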

This would be my approach: split it into an array and then loop through the array.
Start with your webpage or html content in $html

First split the html on the <p> tags:

my @fragments = split(/<p/, $html); # every element after the first is now the rest of one <p tag; element 0 is whatever came before the first <p

Now loop through the fragments:

print shift(@fragments); # text before the first <p goes out unchanged, with no "<p" prepended
foreach my $f (@fragments) {
$f =~ s/<span style="color: #0000ff;">//gi; # use the substitution operator to strip out the span style you want gone (tweak to suit)
# now print what is left of the fragment out to the browser
print "<p$f";
}

That code would fail if there are multiple "spans" in the same paragraph fragment that you want to remove. If that is the case, then maybe split the fragments on the SPANS instead of the paragraphs.
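
The same split-and-loop idea can be pointed at the original goal of dropping every paragraph that contains XXXX followed by YYYY. This is only a sketch under assumptions: the sample data is a cut-down copy of the question's markup, the paragraphs use bare <p> tags, and none of them are nested.

```perl
use strict;
use warnings;

# Cut-down copy of the markup from the question (hypothetical test data).
my $html = join "\n",
    '<p><strong>BlahBlahBlah</strong></p>',
    '<br>',
    '<p><strong>This XXXX To Remove YYYY Starting With "<p>"</strong></p>',
    '<br>',
    '<p><strong>BlahBlahBlah</strong></p>';

# Split on whole paragraphs. The capturing group makes split() return each
# <p>...</p> as its own element, alongside the text between paragraphs.
my @pieces = split m{(<p>.*?</p>)}s, $html;

my @kept;
foreach my $piece (@pieces) {
    # Skip a piece only if it is a paragraph containing XXXX ... YYYY.
    next if $piece =~ /\A<p>/ && $piece =~ /XXXX.*?YYYY/s;
    push @kept, $piece;
}

my $clean = join '', @kept;
print $clean, "\n";
```

Because each paragraph is tested on its own, there is no question of which <p> the match starts from. The trade-off is a few more lines than a single substitution.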

[edited by: Brett_Tabke at 5:27 pm (utc) on Nov 13, 2015]

4:42 pm on Nov 13, 2015 (gmt 0)

Junior Member from US 

10+ Year Member

joined:Feb 27, 2004
posts:104
votes: 0


Wow Brett, thanks for the speedy reply!
If I were more experienced I would try to help out here more, but alas...

I was afraid you were going to suggest a loop. I guess that explains why, after a couple of days of reading, I couldn't find a quick and easy "lookbehind" regex that matches back to the nearest item, like I am trying to do.

Thanks so much for your time. Geesh, hard to believe I've been around here 10+ years!
When I saw your name I felt honored by your response today.
Have a great Thanksgiving, and thanks for the forum for so many years.
Rob.
7:24 pm on Nov 13, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15936
votes: 889


I can't isolate the particular "starting" <p> I want from the others..

Wouldn't it be fun if you looked at your source text and discovered that the letter "p" never occurs, so you could express all those .* as [^p]* instead? No such luck, of course; you've even got "span" in there. But with some judicious use of [^\n] you can pick out the patterns.
<p><strong><span style="color: #0000ff;"><strong>BlahBlahBlah</strong></p>
<br>

At the risk of a fatal digression: you might be able to make the whole problem go away just by deploying your CSS more effectively. Then each group becomes
<p class = "myclass">blahblah</p>
and your RegEx becomes (I don't speak CGI so I don't know if anything needs escaping)
(<p class = "myclass">[^<\n]*)(XXXX[^<\n]*YYYY)([^<\n]*</p>)
I captured three pieces: everything before the X-to-Y element, the X-to-Y element itself, and everything after. It wasn't clear from your post which bits you're changing. The ^\n is just for insurance; once you've said ^< it should be redundant.
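
In Perl the three-group pattern could be used roughly as follows. This is a sketch, assuming the markup has already been converted to the class-based form; the sample lines are made up. With {} as the regex delimiters, nothing in the pattern needs escaping.

```perl
use strict;
use warnings;

# Hypothetical input that already uses the suggested class-based markup.
my $html = join "\n",
    '<p class = "myclass">BlahBlahBlah</p>',
    '<p class = "myclass">This XXXX To Remove YYYY here</p>';

# The three-group pattern: before the X-to-Y element, the element itself, after.
my $re = qr{(<p class = "myclass">[^<\n]*)(XXXX[^<\n]*YYYY)([^<\n]*</p>)};

# Option 1: drop the whole paragraph.
(my $no_para = $html) =~ s/$re//g;

# Option 2: keep the paragraph and drop only the X-to-Y element.
(my $no_xy = $html) =~ s/$re/$1$3/g;

print "$no_para\n---\n$no_xy\n";
```

Which option fits depends on which bits are being changed; the captures make either one a one-line substitution.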
2:27 pm on Feb 12, 2016 (gmt 0)

Junior Member from US 

10+ Year Member

joined:Feb 27, 2004
posts:104
votes: 0


Thanks for all your time on this topic, folks!
I didn't want to leave this hanging, but I found a workaround for my request.
My data was also available as JSON, so I turned it into an XML file
and was able to successfully separate out the information. And I learned a bunch (more) along the way.

I just got all the kinks worked out a few minutes ago and couldn't wait to share my jubilation!
Thanks again group !