Help with RegEx

         

Slapyo

5:23 pm on Feb 21, 2008 (gmt 0)

10+ Year Member



sorry to bring back an old thread [webmasterworld.com] ... but this one deals with exactly what i was looking for except for one thing. within the HREF for some links the keyword exists and is being replaced. for instance, someone links to google search with the keyword. i'd like it to leave the HREF alone for all links too.

[edited by: eelixduppy at 5:26 pm (utc) on Feb. 21, 2008]
[edit reason] added link to thread [/edit]

PHP_Chimp

7:54 pm on Feb 21, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Can you give us an example of what you are trying to achieve?

Are you saying that you want -

this contains a keyword here and keyword here

to turn into

this contains a <a href="/keyword">keyword</a> here and <a href="keyword">keyword</a> here

but you want

this is a google link [google.co.uk...] keyword &btnG=Google+Search&meta= this

not to end up as

this is a google link [google.co.uk...] <a href="/keyword">keyword</a> &btnG=Google+Search&meta= this

?
Or are you after something different?

Slapyo

8:00 pm on Feb 21, 2008 (gmt 0)

10+ Year Member



yes, exactly. the google links end up with a link in a link.

i want keyword to be replaced by a link only if it is not in or between A tags. shouldn't touch HREF and shouldn't touch the anchor text. just the keywords that are out in the rest of the text. also, it could be possible for the keyword to be wrapped in like B or I tags ... those are ok. i just don't want to be placing a link within a link, and i don't want to alter the HREF tags that may contain the keyword.

Slapyo

8:21 pm on Feb 21, 2008 (gmt 0)

10+ Year Member



i pieced together a solution ... whether or not there is an easier way to do it i don't know.

i used this script for the initial replace:
<?php
/**
* Perform a simple text replace
* This should be used when the string does not contain HTML
* (off by default)
*/
define('STR_HIGHLIGHT_SIMPLE', 1);

/**
* Only match whole words in the string
* (off by default)
*/
define('STR_HIGHLIGHT_WHOLEWD', 2);

/**
* Case sensitive matching
* (off by default)
*/
define('STR_HIGHLIGHT_CASESENS', 4);

/**
* Overwrite links if matched
* This should be used when the replacement string is a link
* (off by default)
*/
define('STR_HIGHLIGHT_STRIPLINKS', 8);

/**
* Highlight a string in text without corrupting HTML tags
*
* @author Aidan Lister <aidan@php.net>
* @version 3.1.1
* @link [aidanlister.com...]
* @param string $text Haystack - The text to search
* @param array|string $needle Needle - The string to highlight
* @param bool $options Bitwise set of options
* @param array $highlight Replacement string
* @return Text with needle highlighted
*/
function str_highlight($text, $needle, $options = null, $highlight = null)
{
    // Default highlighting
    if ($highlight === null) {
        $highlight = '<strong>\1</strong>';
    }

    // Select pattern to use
    if ($options & STR_HIGHLIGHT_SIMPLE) {
        $pattern = '#(%s)#';
        $sl_pattern = '#(%s)#';
    } else {
        $pattern = '#(?!<.*?)(%s)(?![^<>]*?>)#';
        $sl_pattern = '#<a\s(?:.*?)>(%s)</a>#';
    }

    // Case sensitivity
    if (!($options & STR_HIGHLIGHT_CASESENS)) {
        $pattern .= 'i';
        $sl_pattern .= 'i';
    }

    $needle = (array) $needle;
    foreach ($needle as $needle_s) {
        // Escape regex metacharacters, including the # delimiter used above
        $needle_s = preg_quote($needle_s, '#');

        // Optional whole word check
        if ($options & STR_HIGHLIGHT_WHOLEWD) {
            $needle_s = '\b' . $needle_s . '\b';
        }

        // Strip links
        if ($options & STR_HIGHLIGHT_STRIPLINKS) {
            $sl_regex = sprintf($sl_pattern, $needle_s);
            $text = preg_replace($sl_regex, '\1', $text);
        }

        $regex = sprintf($pattern, $needle_s);
        $text = preg_replace($regex, $highlight, $text);
    }

    return $text;
}

?>

this script doesn't touch the HREF attribute of the A tag. then i used the 2nd step from the script i found here to remove the links that were made inside anchor text.

$step2 = preg_replace('{(<a[^<]*)<a href="link">keyword</a>([^<]*</a>)}', '$1keyword$2', $step1);
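
To show what that second step does on its own, here is a minimal sketch. The $step1 string is a made-up example of what the first pass might leave behind (a link inside anchor text), not real output from the script above, and example.com is a placeholder.

```php
<?php
// Hypothetical result of step 1: the keyword inside existing anchor text
// has been wrapped in a second link, giving a link inside a link.
$step1 = 'see <a href="http://example.com/">my <a href="link">keyword</a> page</a> for more';

// Step 2: find an outer <a ...> whose text contains the inner link
// and put the bare keyword back in its place.
$step2 = preg_replace(
    '{(<a[^<]*)<a href="link">keyword</a>([^<]*</a>)}',
    '$1keyword$2',
    $step1
);

echo $step2, "\n";
// see <a href="http://example.com/">my keyword page</a> for more
```

Because both groups use [^<]*, this only un-nests the link when no other tag sits between the outer <a> and the inner one, so anchor text wrapped in B or I tags would probably need a looser pattern.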


PHP_Chimp

11:48 pm on Feb 21, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



A little smaller version.

<?php
$test['ok'] = 'this contains a keyword and another keyword';
$test['not_ok'] = 'this contains a linked <a href="http://google.co.uk/">keyword</a> here';
$keyword = 'keyword';
foreach ($test as $subject) {
    $pattern = "%($keyword)(?!.*</a>)%i";
    $replacement = '<a href="/$1">$1</a>';
    $out = preg_replace($pattern, $replacement, $subject);
    echo "ALL: $out<br />\n";
}
?>

Gives -
ALL: this contains a <a href="/keyword">keyword</a> and another <a href="/keyword">keyword</a><br />
ALL: this contains a linked <a href="http://google.co.uk/">keyword</a> here<br />

This is not perfect, as it only checks whether a </a> appears somewhere after the keyword on the same line. However, this should get you what you want most of the time.

Your solution is more definite, just longer...not that there is any problem with that ;)
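
To make the "not perfect" part concrete, here is a small sketch using the same pattern on a made-up input where a plain keyword comes before an unrelated link. The $subject string is an assumption for illustration only.

```php
<?php
$keyword = 'keyword';
$pattern = "%($keyword)(?!.*</a>)%i";
$replacement = '<a href="/$1">$1</a>';

// A plain keyword that happens to come before an unrelated link,
// followed by another plain keyword after the link.
$subject = 'a keyword here, then <a href="http://google.co.uk/">a link</a> and a keyword';

$out = preg_replace($pattern, $replacement, $subject);
echo $out, "\n";
// a keyword here, then <a href="http://google.co.uk/">a link</a> and a <a href="/keyword">keyword</a>
```

The first keyword is left unlinked because a </a> appears later on the same line, even though that keyword is not inside a link. Only the last keyword, which has no </a> after it, gets replaced.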


Slapyo

4:47 am on Feb 22, 2008 (gmt 0)

10+ Year Member



thanks, that should cover most things ... but i think it still modifies the HREF

<a href="http://www.keyword.com">keyword</a> gets turned into

<a href="http://www.<a href="/keyword">keyword</a>">keyword</a>

but this should give me a very good start at working towards something that covers everything. thanks for the help, i appreciate it.

PHP_Chimp

5:32 pm on Feb 22, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hmm, there must be something different in our PCRE libraries then, as I have just tried -

$test['not_ok'] = 'this contains a linked <a href="http://google.co.uk/keyword">keyword</a> here';

and got -

ALL: this contains a <a href="/keyword">keyword</a> and another <a href="/keyword">keyword</a><br />
ALL: this contains a linked <a href="http://google.co.uk/keyword">keyword</a> here<br />

when looking at the source. So the linked keyword and url are getting left alone.
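The regex performs a negative lookahead, so if the keyword is followed anywhere later on the same line by </a> then it will not match.

I did wonder whether the lookahead block should really be (?!.*?</a>), but greedy or lazy makes no difference to which keywords match here: either way the lookahead fails as soon as any </a> can be found after the keyword. The real catch is that a plain keyword sitting before a later, unrelated link on the same line is skipped as well, so on input with many links a large share of keywords may be missed.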
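
One possible way around the lookahead seeing everything after the keyword is to match whole links and tags first and hand them back unchanged, so only keywords in plain text are ever replaced. The sketch below assumes a reasonably recent PHP with closures; link_keyword() is a hypothetical helper name, not something from this thread, and regexes are still no substitute for a real HTML parser on messy markup.

```php
<?php
/**
 * Hypothetical helper (not from the thread): link a keyword only where it
 * appears in plain text, skipping existing links and tag attributes.
 */
function link_keyword($text, $keyword)
{
    $quoted = preg_quote($keyword, '~');

    // Alternatives are tried left to right at each position:
    //   1. a whole <a ...>...</a> element (href and anchor text included)
    //   2. any other single tag, e.g. <b> or <img alt="keyword">
    //   3. the keyword itself, captured in group 1
    $pattern = '~<a\b[^>]*>.*?</a>|<[^>]+>|(' . $quoted . ')~is';

    return preg_replace_callback(
        $pattern,
        function ($m) {
            // Group 1 is only set when the keyword alternative matched.
            if (isset($m[1]) && $m[1] !== '') {
                return '<a href="/' . $m[1] . '">' . $m[1] . '</a>';
            }
            // Links and other tags are returned untouched.
            return $m[0];
        },
        $text
    );
}

$in = 'a keyword, <b>keyword</b>, <a href="http://www.keyword.com">keyword</a> and a keyword';
echo link_keyword($in, 'keyword'), "\n";
// a <a href="/keyword">keyword</a>, <b><a href="/keyword">keyword</a></b>, <a href="http://www.keyword.com">keyword</a> and a <a href="/keyword">keyword</a>
```

Because the existing link is consumed as one match, neither its HREF nor its anchor text is touched, while keywords before the link and inside B tags are still linked.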


Slapyo

6:02 pm on Feb 22, 2008 (gmt 0)

10+ Year Member



i must have not uploaded the latest one with your code because i saved and uploaded now and it works perfect. that's so much better than having to do preg_replace 2 times and having a much longer script.

thank you!