preg match - searching backwards? - PHP Server Side Scripting forum at WebmasterWorld

Forum Moderators: coopster


preg match - searching backwards?

search for content and strip it and the tag around it

Josefu

6:19 pm on Feb 20, 2007 (gmt 0)

I'm at a dead-end here. I've got thousands of pages containing a no-longer-functioning hard-coded ad that I'd like to remove - and its wrapping <div> with it. The ad is easy enough to locate - I just target the url - but how do I remove everything, even line breaks, from the targeted url back to the last opening <div> tag before that url?

Or, in other words, is there any way to search backwards using grep?

Thank you for any advice.

[edited by: Josefu at 6:25 pm (utc) on Feb. 20, 2007]

mcibor

11:53 am on Feb 21, 2007 (gmt 0)

Hi Josefu!

Is this code exactly the same on each page, or can it differ?
If it's the same, then I would preg_replace the whole code with an empty string; otherwise it's hard to do what you require.

If it's something like:

$file = '<div .......>Ad</div>';
then you could

$ad = 'Ad';

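while (($ad_pos = strpos($file, $ad)) !== false) {
    // last '<' before the ad: search only the text preceding it
    $begin = strrpos(substr($file, 0, $ad_pos), '<');
    // first '>' after the ad
    $end = strpos($file, '>', $ad_pos);
    if ($begin === false || $end === false) break; // no enclosing tag
    // cut it out
    $file = substr($file, 0, $begin) . substr($file, $end + 1);
}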

Hope this helps.
I don't know how to do that with preg_match or preg_replace
Michal

Josefu

2:55 pm on Feb 21, 2007 (gmt 0)

Thank you for your reply. Unfortunately the ad does vary throughout the website, as some time ago its owner made the switch from HTML to CSS/XHTML.

I think what I'll have to do is first check the page for the presence of "$ad" (with a quickie strpos), and if it is there, go through the page's <div> tags one by one - using a preg_match_all - to see which one the ad is in - then strip the offending one. But what if the ad appears in a table, or a p tag? Onerous.

Another possibility would be to strip the offending ad, then do a search and replace for empty tags. This should catch every occurrence... but what of the empty "clear" divs used to expand content? Again onerous.

I think there is a way to do a "negative lookback" using perl, but I'm not sure that this is recognised by php. It is also a very slow process.

coopster

4:33 pm on Feb 21, 2007 (gmt 0)

The technique is called "assertion" and yes, PHP uses the Perl-Compatible regular expression engine which includes assertion pattern matching [php.net]. However, you are correct in that if the ad might be contained in something other than the element you expect, your pattern is going to be quite difficult to develop. You might have to make a few passes at it with different patterns.
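As a rough illustration of the assertion approach, here is a minimal sketch that uses a negative lookahead to keep the match from crossing into an earlier <div>. The host name `ads.example.com` and the sample markup are made up for the example (the thread never gives the real URL), and the pattern assumes the ad's wrapper contains no nested <div> of its own.

```php
<?php
// Hypothetical ad host; stands in for the real target url.
$adHost = 'ads.example.com';

// Made-up sample page: an ad wrapper nested inside a content div.
$html = "<div id=\"content\">\n<p>Keep me</p>\n<div class=\"ad\">\n"
      . "<a href=\"http://ads.example.com/x\">Buy</a>\n</div>\n</div>";

// An opening <div ...>, then any characters that do NOT begin another <div
// (the negative lookahead assertion), then the ad host, then lazily on to
// the first closing </div> and any whitespace that follows it.
$pattern = '~<div\b[^>]*>(?:(?!<div\b).)*?'
         . preg_quote($adHost, '~')
         . '.*?</div>\s*~is';

$clean = preg_replace($pattern, '', $html);

echo $clean, "\n";
```

Because the lookahead refuses to step over a second `<div`, the match can only begin at the last opening <div> before the url, which is the "searching backwards" effect asked about. Other wrappers (p, table cells) would need their own pass with a different tag name.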

Josefu

5:09 pm on Feb 21, 2007 (gmt 0)

Thanks. I may have singled out a method similar to the above: once the ad is detected, do a preg_match_all to collect the tags (either p or div, maybe) into arrays, then search the resulting array for the entry containing the target url, take that entry and do a search and replace on it (with nothing)... quite a bother!

I'll let you know how it works.

Josefu

8:54 pm on Feb 21, 2007 (gmt 0)

Success. But what a workaround. All compiled into a function, now working for all pages. Thanks for your help! The function:

a) parses the page for the offending target, and if present:
b) sets up an array of tags that may be around the target (<div>, <p>, etc)
c) does a preg_match_all using first tag in above array
d) walks through the first part of the preg_match_all searching for the offending target and if it is found, puts the entire "between tag" content into a variable, else runs the preg_match_all again with the next tag in the tag array
e) searches the entire page for the above variable and replaces it with nothing (an empty string)
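
The function itself was not posted, so the following is only a sketch of steps a) to e) as described. The function name `strip_ad_block`, the default tag list, and the regular expression are all assumptions; in particular the pattern only matches elements that do not contain another opening tag of the same name.

```php
<?php
// Hypothetical reconstruction of the steps above; not the poster's actual code.
function strip_ad_block(string $page, string $target, array $tags = ['div', 'p', 'td']): string
{
    // a) nothing to do if the target is not on the page
    if (strpos($page, $target) === false) {
        return $page;
    }

    // b) + c) try each candidate wrapper tag in turn
    foreach ($tags as $tag) {
        $t = preg_quote($tag, '~');
        // Match <tag ...> ... </tag> without crossing another opening <tag
        $pattern = '~<' . $t . '\b[^>]*>(?:(?!<' . $t . '\b).)*?</' . $t . '>~is';

        if (!preg_match_all($pattern, $page, $matches)) {
            continue; // no such elements (or a regex error): try the next tag
        }

        // d) look for the element that contains the target
        foreach ($matches[0] as $block) {
            if (strpos($block, $target) !== false) {
                // e) search and replace with an empty string
                return str_replace($block, '', $page);
            }
        }
    }

    return $page; // target present, but not inside any of the listed tags
}

// Made-up sample page and target url for illustration.
$page = "<div id=\"content\">\n<p>Keep me</p>\n<div class=\"ad\">\n"
      . "<a href=\"http://ads.example.com/x\">Buy</a>\n</div>\n</div>";

$result = strip_ad_block($page, 'ads.example.com');
```

The order of the tag list matters: the first tag whose element contains the target wins, so listing `div` before `p` removes the whole wrapper rather than just a paragraph inside it.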

Voila. But really. Wouldn't a "find "this" and the "that" just before it" function be useful? I would think so.

[edited by: Josefu at 8:57 pm (utc) on Feb. 21, 2007]

phranque

12:40 am on Feb 22, 2007 (gmt 0)

awk would probably be the correct *nix tool to do this but it's been a while since i've used it...