Forum Moderators: coopster

preg_replace - exclude text in html and between anchor tags

Avoiding nested <a>...</a> tags

         

bobs12

12:26 pm on May 23, 2006 (gmt 0)

10+ Year Member



I'm trying to get certain keywords linked to a glossary on a site. I can get to an expression that keeps preg_replace() out of html tags, but can't refine it to exclude anchor text - this is important to avoid nesting anchor tags!

What I have so far (after many sleepless nights) is this (unescaped):

(?<=[^a-z0-9])(KEYWORD)(?=[^a-z0-9])(?=[^>]*<)(?!.*?</a>)

If I leave out the last (?!.*?</a>) it will nest anchor tags. With it, it only links keywords AFTER the last anchor in the search string.

Can anybody help?

Another problem is that if there are angle brackets in the anchor tags, this also causes grief...
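The behaviour described above (only keywords after the last anchor get linked) can be reproduced with a small test. This is only a sketch: the sample HTML and the keyword "cache" are made up for illustration, and the ~ delimiters and /i flag are added so the pattern runs. The final (?!.*?</a>) rejects a keyword whenever ANY </a> follows it on the same line, not just the closing tag of an anchor the keyword sits in.

```php
<?php
// Hypothetical sample input; "cache" stands in for KEYWORD.
$html = '<p>cache one <a href="/x">cache two</a> cache three</p>';

// The pattern from the post, with delimiters and the /i flag added.
$pattern = '~(?<=[^a-z0-9])(cache)(?=[^a-z0-9])(?=[^>]*<)(?!.*?</a>)~i';

// Mark each match with [brackets] so it is easy to see what was hit.
$result = preg_replace($pattern, '[$1]', $html);

echo $result, "\n";
// Only the keyword after the last </a> is marked:
// <p>cache one <a href="/x">cache two</a> [cache] three</p>
```

Note that without the /s flag the dot does not cross newlines, so on multi-line text the lookahead only sees the rest of the current line and the results can look inconsistent.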

eelixduppy

2:44 am on May 24, 2006 (gmt 0)



Since I'm not all that familiar with regular expressions (I've never had a case where I needed them), the best I can do at the moment is refer you to a thread in our library that may help you out: [webmasterworld.com...]

Good luck!

ahmedtheking

8:41 am on May 24, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This is a tough one, been there! Ok, you should know about this piece of regex:

[^\>] (escaped).

The little hat ^ at the start of the brackets negates the class, so it matches any single character except >. So before, where you were using the star * after a dot, that was probably counting the < and > at the ends of your tags!
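One way the negated-class idea might be applied to the original pattern is to swap the last lookahead for (?![^<]*</a>), which reads as "the next tag after the keyword is not </a>". This is a sketch, not a tested production pattern: the sample HTML, the keyword "cache" and the /glossary URL are assumptions made up for the example.

```php
<?php
// Hypothetical sample input; "cache" stands in for KEYWORD.
$html = '<p>cache one <a href="/cache/">cache two</a> cache three</p>';

// Same pattern as in the first post, except the final lookahead:
// [^<]* cannot run past the next tag, so only an immediately
// following </a> blocks the match.
$pattern = '~(?<=[^a-z0-9])(cache)(?=[^a-z0-9])(?=[^>]*<)(?![^<]*</a>)~i';

// The glossary URL is an assumption for illustration.
$result = preg_replace($pattern, '<a href="/glossary#$1">$1</a>', $html);

echo $result, "\n";
```

With this input the first and third "cache" get linked, while the one in the href and the one in the anchor text are left alone. It still nests links when the anchor text contains other tags, e.g. <a href="/x"><b>cache</b></a>, because the next tag after the keyword is then </b> rather than </a>.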

bobs12

9:41 pm on May 24, 2006 (gmt 0)

10+ Year Member



Thanks folks! That's a good resource for regex syntax, I'll put up a few links to it when I'm done messing around.

I think I got round the problem. It's not pretty but it seems to work well enough for what I'm doing - it goes a little something like this:

Put all anchor text into nonsense <##..##> tags:

$text=preg_replace('/(<a([^>]+)>)(.*?)(<\/a>)/is', "$1<##$3##>$4", $text);

Replace keywords not between <..> tags. <##..##> should be skipped too:

$search = "/(>)([^<]*)([^#a-z]+)($keys)([^#a-z]+)/is";
$replace = "\$1\$2\$3***\$4***\$5";

$text = preg_replace($search, $replace, $text);

Get rid of <##..##>

$text=str_replace("<##", "", $text);
$text=str_replace("##>", "", $text);

That could be done by preg too, but I'm guessing str_replace() might use less resources?
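For what it's worth, str_replace() also accepts an array of search strings, so the two clean-up calls can be folded into one. A minimal sketch, with a made-up sample string:

```php
<?php
// Hypothetical text as it looks after the <##..##> markers were added.
$text = 'See <a href="/x"><##anchor text##></a> here.';

// An array of search strings with a single replacement string
// removes both markers in one call.
$text = str_replace(array('<##', '##>'), '', $text);

echo $text, "\n";
// See <a href="/x">anchor text</a> here.
```

For fixed strings like these, str_replace() is generally the lighter choice since no pattern has to be compiled, though on typical page sizes the difference is unlikely to matter.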

I'm sure it could be improved on but I'm not a programmer :) If anyone has any ideas to improve it I'd be grateful.

I haven't found a specific weakness yet but I'll keep testing as I'm sure there will be flaws. I think it might get screwed up if there are tags in the anchor text.
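One possible way around the "tags in the anchor text" weakness is to split the markup into tags and text runs and keep track of whether an <a> is currently open. What follows is a sketch under assumptions, not a drop-in fix: link_keywords() and its parameters are hypothetical names, the /glossary URL is made up, and it assumes no stray > inside attribute values or comments (a DOM parser would be more robust for that). Tracing the posted $search pattern by hand, it also appears to link at most one keyword per text run, because every match has to start at a >; handling each text run separately avoids that.

```php
<?php
// Hypothetical helper: link glossary keywords in text outside <a>...</a>.
function link_keywords($html, array $keywords, $glossaryUrl)
{
    // Escape each keyword so it is safe inside the pattern.
    $alternation = implode('|', array_map(function ($k) {
        return preg_quote($k, '~');
    }, $keywords));
    $keywordPattern = '~(?<![a-z0-9])(' . $alternation . ')(?![a-z0-9])~i';

    // Split into text and tags. With one capture group the result
    // alternates: even indexes are text, odd indexes are tags.
    $parts = preg_split('~(<[^>]*>)~', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    $anchorDepth = 0;
    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            // A tag: only opening and closing <a> change the state.
            if (preg_match('~^<a[\s>]~i', $part)) {
                $anchorDepth++;
            } elseif (preg_match('~^</a\s*>~i', $part)) {
                $anchorDepth = max(0, $anchorDepth - 1);
            }
            continue;
        }
        if ($anchorDepth === 0) {
            // Text outside any anchor: link every keyword in it.
            $parts[$i] = preg_replace_callback(
                $keywordPattern,
                function ($m) use ($glossaryUrl) {
                    return '<a href="' . $glossaryUrl . '#' . strtolower($m[1]) . '">'
                        . $m[1] . '</a>';
                },
                $part
            );
        }
    }
    return implode('', $parts);
}

// Usage with made-up input: the keyword inside <b> within the anchor is skipped.
$html = '<p>Cache one <a href="/cache/"><b>cache</b> two</a> cache three</p>';
$result = link_keywords($html, array('cache'), '/glossary');
echo $result, "\n";
```

Because the tags themselves are never passed to the keyword replacement, attribute values such as href="/cache/" are left untouched without needing the <##..##> markers.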

[edited by: coopster at 10:03 pm (utc) on May 24, 2006]
[edit reason] removed url per TOS [webmasterworld.com] [/edit]