As far as the first x words go, what I usually do is explode on spaces and then take the first x elements of the array. It may not be the fastest, but it works. If anybody out there has a faster way, I'd like to know - I'm working on a site right now that's using my way and am always looking for tweaks.
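A minimal sketch of the explode approach, for illustration only. The function name first_words() is made up here, and passing a limit to explode() is one possible tweak rather than anything measured: it stops splitting once enough pieces exist instead of breaking up the whole string.

```php
<?php
// Hypothetical helper: return the first $count space-separated words of $text.
function first_words($text, $count) {
    // Limit of $count + 1: the last element holds the unsplit remainder,
    // which array_slice() then discards.
    $words = explode(' ', $text, $count + 1);
    return implode(' ', array_slice($words, 0, $count));
}

echo first_words('The quick brown fox jumps', 3), "\n";
```

Note that explode() treats every single space as a separator, so runs of consecutive spaces produce empty "words" that still count toward the total.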
If you want the first x characters, it's a little quicker. Take the left x characters and then backtrack until you hit a space, so that you end on a whole word.
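Here is a rough sketch of the character-based version. The name first_chars() is hypothetical, and it counts bytes with strlen()/substr(), so it assumes single-byte text; multibyte strings would need the mb_* equivalents.

```php
<?php
// Hypothetical helper: cut $text to at most $limit characters, ending on a whole word.
function first_chars($text, $limit) {
    if (strlen($text) <= $limit) {
        return $text;
    }
    // Take one extra character so a cut that lands exactly on a
    // word boundary keeps the word before it.
    $cut = substr($text, 0, $limit + 1);
    $lastSpace = strrpos($cut, ' ');
    if ($lastSpace === false) {
        // No space to backtrack to: fall back to a hard cut.
        return substr($text, 0, $limit);
    }
    return substr($cut, 0, $lastSpace);
}

echo first_chars('The quick brown fox', 12), "\n";
```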
- Ryan
Since there are a theoretically limitless number of HTML tags, I'd suggest keeping a "white list" of OK HTML tags and getting rid of everything else. I think the following will work for that:
$string = preg_replace("/<(?!\/?(b|i|ul|li)>)[^>]+>/", "", $string);
When you take the first part of your string, you may end up with multiple unclosed HTML tags, like so:
<ul><li><b>Code carefully
Be careful to rebuild them back to front: here the closing tags would be </b></li></ul>.
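One way to do the back-to-front rebuild is to keep a stack of open tags and append the closers in reverse. This is only a sketch: close_open_tags() is a made-up name, the default whitelist simply mirrors the four tags above, and the tag regex is deliberately simple rather than a full HTML parser.

```php
<?php
// Hypothetical helper: append closing tags for whitelisted tags left open.
function close_open_tags($html, array $allowed = array('b', 'i', 'ul', 'li')) {
    $stack = array();
    preg_match_all('/<(\/?)([a-z][a-z0-9]*)[^>]*>/i', $html, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $name = strtolower($m[2]);
        if (!in_array($name, $allowed, true)) {
            continue; // not a tag we track
        }
        if ($m[1] === '/') {
            // Closing tag: drop the most recent matching opener, if any.
            for ($i = count($stack) - 1; $i >= 0; $i--) {
                if ($stack[$i] === $name) {
                    array_splice($stack, $i, 1);
                    break;
                }
            }
        } else {
            $stack[] = $name;
        }
    }
    // Innermost tag first, i.e. back to front.
    foreach (array_reverse($stack) as $name) {
        $html .= "</$name>";
    }
    return $html;
}

echo close_open_tags('<ul><li><b>Code carefully'), "\n";
```

For the whitelist step itself, PHP's built-in strip_tags() also accepts a string of allowed tags as its second argument, which may be simpler than a hand-written regex.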
How close to 500 characters do you want the count to be? That is, do you want to exclude the HTML tags before you measure the characters, or just guess that x% of the characters will be HTML tags?
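To see how much the tags skew the count, you can compare the raw length with the length after stripping tags. A small sketch, where visible_length() is a hypothetical helper built on PHP's strip_tags():

```php
<?php
// Hypothetical helper: length of the text with all tags removed.
function visible_length($html) {
    return strlen(strip_tags($html));
}

$html = '<ul><li><b>Code</b> carefully</li></ul>';
echo strlen($html), ' raw, ', visible_length($html), " visible\n";
```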
I haven't tested this extensively, but it seems to work pretty well. One caveat is that it will rewrite whitespace to suit its own needs. Linefeeds are not preserved, which may be an issue if you use <pre> types of constructs.
The _recordTag() function tracks the tags. As written it is a hybrid XHTML/HTML detector and should be tuned to your needs. If you're running all XHTML through this, you can simplify the function to ignore all self-closing tags (<br />) and add anything else to the stack (<div>). If you've got HTML, you may want to make it case-insensitive and expand the list of nested tags (div, p, b, em, etc.).
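As an illustration of the XHTML-only simplification, something along these lines might do. It is a sketch under the assumption that every non-self-closing tag has a matching closer; the name _recordTagXhtml() is made up, and it keeps the ($stack, $tag, $args) signature of _recordTag().

```php
<?php
// Hypothetical XHTML-only variant of _recordTag().
function _recordTagXhtml(&$stack, $tag, $args) {
    // Self-closing: "<br />" leaves the slash in $args, "<br/>" leaves it in $tag.
    if (substr($args, -1) === '/' || substr($tag, -1) === '/') {
        return;
    }
    if (strncmp($tag, '/', 1) === 0) {
        // Closing tag: remove the most recent matching opener.
        $name = substr($tag, 1);
        for ($i = count($stack) - 1; $i >= 0; $i--) {
            if ($stack[$i] === $name) {
                array_splice($stack, $i, 1);
                return;
            }
        }
        return;
    }
    // Any other tag is assumed to need closing later.
    $stack[] = $tag;
}

$stack = array();
_recordTagXhtml($stack, 'div', '');
_recordTagXhtml($stack, 'br', "\x01/");
_recordTagXhtml($stack, 'em', '');
_recordTagXhtml($stack, '/em', '');
print_r($stack);
```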
// Return roughly the first $num words of $text, closing any tracked tags left open.
// Named abstractText() because "abstract" is a reserved word in PHP 5 and later.
function abstractText($text, $num = 500) {
    // Already short enough: hand the text back unchanged.
    if (preg_match_all('/\s+/', $text, $junk) <= $num) return $text;
    // Mask whitespace inside tags so strtok() keeps each tag in one token.
    $text = preg_replace_callback('/(<\/?[^>]+\s+[^>]*>)/',
        '_abstractProtect', $text);
    $words = 0;
    $out = array();
    $stack = array();
    $tok = strtok($text, "\n\t ");
    while ($tok !== false and strlen($tok)) {
        if (preg_match_all('/<(\/?[^\x01>]+)([^>]*)>/',
                $tok,
                $matches,
                PREG_SET_ORDER)) {
            foreach ($matches as $tag) _recordTag($stack, $tag[1], $tag[2]);
        }
        $out[] = $tok;
        // Tokens made up only of tags do not count as words.
        if (! preg_match('/^(<[^>]+>)+$/', $tok)) ++$words;
        if ($words == $num) break;
        $tok = strtok("\n\t ");
    }
    $abstract = _abstractRestore(implode(' ', $out));
    // Close the tags still open, innermost first.
    foreach (array_reverse($stack) as $tag) { $abstract .= "</$tag>"; }
    return $abstract;
}

// Replace whitespace inside a tag with \x01 placeholders.
function _abstractProtect($match) {
    return preg_replace('/\s/', "\x01", $match[0]);
}

// Turn the \x01 placeholders back into spaces.
function _abstractRestore($strings) {
    return preg_replace('/\x01/', ' ', $strings);
}

// Track open tags: push openers, remove the matching opener on a closer.
function _recordTag(&$stack, $tag, $args) {
    // XHTML self-closing tag such as <br />: nothing to track.
    if (strlen($args) and $args[strlen($args) - 1] == '/') {
        return;
    }
    else if ($tag[0] == '/') {
        $tag = substr($tag, 1);
        // Search from the top of the stack for the most recent opener.
        for ($i = count($stack) - 1; $i >= 0; $i--) {
            if ($stack[$i] == $tag) {
                array_splice($stack, $i, 1);
                return;
            }
        }
        return;
    }
    else if (in_array($tag, array('p', 'li', 'ul', 'ol', 'div', 'span', 'a'))) {
        $stack[] = $tag;
    }
    else {
        // no-op: tags outside the list are not tracked
    }
}
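The protect/restore trick used by _abstractProtect() and _abstractRestore() can be tried in isolation. This is a sketch with made-up helper names (protect_tag_whitespace(), restore_tag_whitespace()), and it splits with preg_split() instead of strtok() just to keep the demo short. Whitespace inside a tag is swapped for \x01, so a tag with attributes survives as a single token.

```php
<?php
// Hypothetical stand-ins for the protect/restore helpers.
function protect_tag_whitespace($html) {
    return preg_replace_callback('/<\/?[^>]+\s+[^>]*>/', function ($m) {
        // Mask whitespace inside the tag so it is not treated as a word break.
        return preg_replace('/\s/', "\x01", $m[0]);
    }, $html);
}

function restore_tag_whitespace($html) {
    return str_replace("\x01", ' ', $html);
}

$html = '<a href="/x" class="y">two words</a>';
$protected = protect_tag_whitespace($html);
$tokens = preg_split('/\s+/', $protected, -1, PREG_SPLIT_NO_EMPTY);

echo count($tokens), " tokens\n";
echo restore_tag_whitespace($protected), "\n";
```

Without the protect step, the same split would break the opening tag into three pieces and inflate the word count.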
[edited by: coopster at 2:16 am (utc) on Feb. 15, 2005]