How to get first 500 words of a paragraph

Forum Moderators: coopster

...and the paragraph may have HTML tags

iProgram

1:10 pm on Feb 14, 2005 (gmt 0)

If you have your own blog, you may have noticed this feature: you write a long article and add it to your blog, and on the blog homepage visitors see a summary of the article plus a "Read More" link. The summary is the first, let's say, 500 words of your article. It's a piece of cake if you use plain text only. However, you've already used HTML tags like <img>, <p>, <ul>, <li>, <a href>... So the question is: how do you remove some tags (e.g. <img>) and get the first 500 words without breaking the other tags (<ul>, <li> and <a>)?

Nutter

2:03 pm on Feb 14, 2005 (gmt 0)

I'm thinking a preg_replace would work if you only wanted to strip out certain tags. Something like '/<img[^>]*>/i'. (Don't quote me on this, I'm pretty bad at regex; it's my next learning project :)
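A minimal sketch of the preg_replace idea, for anyone who wants to try it. The helper name stripImgTags is made up for illustration, and the pattern assumes no attribute value contains a literal ">" character:

```php
<?php
// Hypothetical helper: remove every <img ...> tag and leave other markup alone.
function stripImgTags($html) {
    // [^>]* swallows the attributes up to the closing bracket;
    // the "i" flag also catches <IMG ...>.
    return preg_replace('/<img\b[^>]*>/i', '', $html);
}

echo stripImgTags('<p>Hello <img src="a.png" alt="x"> world</p>');
// prints: <p>Hello  world</p>
```

Note that the whitespace on both sides of the removed tag stays behind, so you may want to collapse double spaces afterwards.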

As far as the first x words, what I usually do is explode on spaces and then take the first x members of the array. It may not be the fastest, but it works. If anybody out there has a faster way, I'd like to know - I'm working on a site right now that's using my way and am always looking for tweaks.
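For plain text, the explode approach can be sketched roughly like this. The function name firstWords is an assumption, and it treats a single space as the only word separator. Passing a limit to explode() is one possible tweak: PHP then stops splitting after that many pieces and leaves the remainder of the string in the last element, so a long article is not split all the way through.

```php
<?php
// Hypothetical helper: first $count space-separated words of plain text.
function firstWords($text, $count) {
    // With a limit of $count + 1, the last element holds the unsplit remainder.
    $parts = explode(' ', $text, $count + 1);
    // Keep only the first $count pieces and glue them back together.
    return implode(' ', array_slice($parts, 0, $count));
}

echo firstWords('one two three four five', 3);
// prints: one two three
```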

If you want the first x characters, it's a little quicker: take the leftmost x characters and then backtrack until you reach a space, to make sure you end on a whole word.
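The character-based variant might look something like the sketch below. The name firstChars is hypothetical, and it only makes sense for plain text, since it knows nothing about tags:

```php
<?php
// Hypothetical helper: at most $max characters, cut back to the last whole word.
function firstChars($text, $max) {
    if (strlen($text) <= $max) {
        return $text;               // nothing to trim
    }
    // Take one extra character so a cut that lands exactly on a
    // word boundary keeps the whole word.
    $cut = substr($text, 0, $max + 1);
    $pos = strrpos($cut, ' ');      // last space inside the cut
    // No space at all: fall back to a hard cut at $max characters.
    return $pos === false ? substr($text, 0, $max) : substr($cut, 0, $pos);
}

echo firstChars('The quick brown fox', 12);
// prints: The quick
```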

- Ryan

iProgram

2:32 pm on Feb 14, 2005 (gmt 0)

This_is_<a href="http://www.google.com">Google_Homepage</a>,

First 20 chars are: This_is_<a href="htt

This is the problem I want to solve:(

timster

2:48 pm on Feb 14, 2005 (gmt 0)

A few years back I did something like this, and yes, it was tricky.

Since there is a theoretically limitless number of HTML tags, I'd suggest keeping a "white list" of OK HTML tags and getting rid of everything else. I think the following will work for that:

$string = preg_replace("/<\/?(?!(b|i|ul|li)>)[^>]+>/","", $string);

When you take the first part of your string, you may end up with multiple unclosed HTML tags, like so:

<ul><li><b>Code carefully

Be careful to close them back to front, innermost first: </b></li></ul>.
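One way to do the closing is to walk through the tags, keep a stack of the ones still open, and append the closers in reverse. This is only a sketch: closeOpenTags and its short list of void tags are assumptions for illustration, and it expects a fragment that has already been cut so that no tag is left half-written.

```php
<?php
// Hypothetical helper: append closing tags for anything left open.
function closeOpenTags($html) {
    $void = array('br', 'hr', 'img', 'input');   // tags that never get a closer
    $stack = array();
    preg_match_all('/<(\/?)([a-z0-9]+)[^>]*>/i', $html, $matches, PREG_SET_ORDER);
    foreach ($matches as $tag) {
        $name = strtolower($tag[2]);
        // Skip void and self-closing tags such as <br> or <br />.
        if (in_array($name, $void) || substr($tag[0], -2) === '/>') {
            continue;
        }
        if ($tag[1] === '/') {
            // Remove the most recently opened tag with this name, if any.
            $pos = array_search($name, array_reverse($stack, true), true);
            if ($pos !== false) {
                array_splice($stack, $pos, 1);
            }
        } else {
            $stack[] = $name;
        }
    }
    // Last opened, first closed.
    foreach (array_reverse($stack) as $name) {
        $html .= "</$name>";
    }
    return $html;
}

echo closeOpenTags('<ul><li><b>Code carefully');
// prints: <ul><li><b>Code carefully</b></li></ul>
```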

How close to 500 characters do you want the count to be? That is, do you want to exclude the HTML tags before you measure the characters, or just take a guess that x% of the characters will be HTML tags?

timster

4:34 pm on Feb 14, 2005 (gmt 0)

This is the problem I want to solve

Ooh, I'm guessing that's only the beginning of your headaches.

How about...

$string = substr($string, 0, $max_string_length);
$string = preg_replace('/<[^>]+$/', '',$string);
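Here is how those two lines behave on the example string from earlier in the thread, wrapped in a function so they are easy to try. The name truncateChars is hypothetical. The second call shows the limits of the approach: a complete opening tag before the cut is left unclosed, and the tag's own characters count toward the length.

```php
<?php
// Hypothetical wrapper around the two lines above.
function truncateChars($string, $max_string_length) {
    $string = substr($string, 0, $max_string_length);
    // Drop a tag that was cut off before its closing ">".
    return preg_replace('/<[^>]+$/', '', $string);
}

$s = 'This_is_<a href="http://www.google.com">Google_Homepage</a>,';

echo truncateChars($s, 20), "\n";
// prints: This_is_

echo truncateChars($s, 45), "\n";
// prints: This_is_<a href="http://www.google.com">Googl
```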

jusdrum

9:12 pm on Feb 14, 2005 (gmt 0)

Or use strip_tags(): the first parameter is the string, and the optional second parameter lists the allowed HTML tags.

$Content = strip_tags($Content,"<a><p>");
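A quick illustration of what that call keeps and drops, using a made-up sample string. Allowed tags keep their attributes, while disallowed tags are removed but their text content stays. strip_tags() does not shorten anything by itself, so it would still need to be combined with one of the truncation approaches.

```php
<?php
// Made-up sample markup for illustration.
$content = '<p>See <a href="/post">this <img src="x.png"> post</a></p><ul><li>one</li></ul>';

// Keep only <a> and <p>; <img>, <ul> and <li> are stripped.
$result = strip_tags($content, '<a><p>');

echo $result;
// prints: <p>See <a href="/post">this  post</a></p>one
```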

jollymcfats

10:36 pm on Feb 14, 2005 (gmt 0)

Here's a function that will pull out the first x words of a chunk of markup (Not characters, words.) It maintains a stack of unclosed HTML tags, and closes whatever needs closing when the word count is met. It does not consider tags themselves words, and should behave properly for closely-nested tags like " <a>foo</a> ".

I haven't tested this extensively, but it seems to work pretty well. One caveat is that it will rewrite whitespace to suit its own needs. Linefeeds are not preserved, which may be an issue if you use <pre> types of constructs.

The _recordTag() function tracks the tags. As written it is a hybrid XHTML/HTML detector and should be tuned to your needs. If you're running only XHTML through this, you can simplify the function to ignore all self-closing tags (<br />) and add everything else to the stack (<div>). If you've got HTML, you may want to make it case-insensitive and expand the list of nested tags (div, p, b, em, etc.).

Apologies for the lack of indenting, the forum software isn't very code-friendly.


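// Return the first $num words of $text, closing any tags left open.
// Named makeAbstract because "abstract" is a reserved word in PHP 5 and later.
function makeAbstract($text, $num = 500) {
  // Short enough already: return the text unchanged.
  if (preg_match_all('/\s+/', $text, $junk) <= $num) return $text;
  // Hide whitespace inside tags so a tag is never split across tokens.
  $text = preg_replace_callback('/(<\/?[^>]+\s+[^>]*>)/',
    '_abstractProtect', $text);
  $words = 0;
  $out = array();
  $stack = array();
  $tok = strtok($text, "\n\t ");
  while ($tok !== false and strlen($tok)) {
    // Record every tag found in this token on the open-tag stack.
    if (preg_match_all('/<(\/?[^\x01>]+)([^>]*)>/',
        $tok, $matches, PREG_SET_ORDER)) {
      foreach ($matches as $tag) _recordTag($stack, $tag[1], $tag[2]);
    }
    $out[] = $tok;
    // A token made up only of tags does not count as a word.
    if (! preg_match('/^(<[^>]+>)+$/', $tok)) ++$words;
    if ($words == $num) break;
    $tok = strtok("\n\t ");
  }
  $abstract = _abstractRestore(implode(' ', $out));
  // Close the open tags innermost first.
  foreach (array_reverse($stack) as $tag) { $abstract .= "</$tag>"; }
  return $abstract;
}
// Replace whitespace inside a tag with a \x01 placeholder.
function _abstractProtect($match) {
  return preg_replace('/\s/', "\x01", $match[0]);
}
// Turn the \x01 placeholders back into spaces.
function _abstractRestore($strings) {
  return preg_replace('/\x01/', ' ', $strings);
}
// Push opening tags onto $stack and remove them again when they close.
function _recordTag(&$stack, $tag, $args) {
  // XHTML self-closing tag such as <br />: nothing to track.
  if (strlen($args) and $args[strlen($args) - 1] == '/') {
    return;
  }
  else if ($tag[0] == '/') {
    // Closing tag: drop the most recent matching opener.
    $tag = substr($tag, 1);
    for ($i = count($stack) - 1; $i >= 0; $i--) {
      if ($stack[$i] == $tag) {
        array_splice($stack, $i, 1);
        return;
      }
    }
    return;
  }
  else if (in_array($tag, array('p', 'li', 'ul', 'ol', 'div', 'span', 'a'))) {
    // Opening tag we care about: remember it.
    $stack[] = $tag;
  }
  else {
    // no-op
  }
}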

[edited by: coopster at 2:16 am (utc) on Feb. 15, 2005]
[edit reason] disabled graphic smilies [/edit]