I decided on a brute force approach, but my eregi_replace is not working.
The regex is:
\s(class|style|id)[\s=]+["']?.*["'\s]?
Now, as I read it, that says:
1. Find a pattern starting with one or more spaces, \s
2. Followed by "class" or "style" or "id", (class|style|id)
3. Followed by 1 or more spaces or equal signs, [\s=]+
4. Followed by zero or one single or double quotes, ["']?
5. Followed by any combination of characters, .*
6. Ending with zero or one double quote, single quote or space, ["'\s]?
So, I should be able to run this:
$text = eregi_replace('\s(class|style|id)[\s=]+["\']?.*["\'\s]?', '', $text);
and strip every class, style and id attribute in the entire document - right?
Doesn't work:(
So, what am I doing wrong here?
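For anyone following along, here is a minimal sketch of two likely culprits, written against the PCRE functions rather than ereg. First, as far as I know the POSIX extended syntax used by eregi_replace has no \s shorthand (it wants [[:space:]] instead), so the pattern probably never means what the reading above assumes. Second, even where \s does work, the greedy .* swallows everything up to the end of the line. The sample string and variable names are made up for illustration.

```php
<?php
// Made-up sample input, not taken from the thread.
$html = '<p class="a" id=\'b\'>Hello</p>';

// The original pattern ported to PCRE: the greedy .* runs to the end of the line.
$greedy = '/\s(class|style|id)[\s=]+["\']?.*["\'\s]?/i';
$greedyResult = preg_replace($greedy, '', $html);

// A tighter sketch: spell out what a value may look like instead of using .*
$tight = '/\s(?:class|style|id)\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s>]+)/i';
$tightResult = preg_replace($tight, '', $html);

echo $greedyResult, "\n"; // <p
echo $tightResult, "\n";  // <p>Hello</p>
```

Note that the tighter pattern does not check that it is inside a tag, so body text such as " id=5" would be removed as well.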
That said, I still don't understand where my logic in the provided expression was wrong, or why it didn't work, and thought it might be a good learning tool for me.
For the record, I first ran a strip_tags($text) to get rid of spurious stuff like spans, fonts, and the ubiquitous <o:> (whatever that is - but there sure are a lot of them), leaving only p, ul, ol, li, b and i tags. That cleaned up a whole lot of garbage.
Then I was trying to strip out all the attributes with that little regex.
What I ended up with was this:
$text = ereg_replace("<([^>]*)(class|lang|style|size|face)=(\"[^\"]*\"|'[^']*'|[^>]+)([^>]*)>","<\\1>",$text);
$text = ereg_replace("<([^>]*)(class|lang|style|size|face)=(\"[^\"]*\"|'[^']*'|[^>]+)([^>]*)>","<\\1>",$text);
which appears to work. Why the identical ereg_replace needs to be called twice I have not a clue, and the pragmatist in me doesn't have the energy to figure it out at this point.
The complete function looks like this, and seems to be doing the intended job:
function cleanHTML($text)
{
    $text = stripslashes($text);
    $text = stristr($text, '<body>');
    // convert divs to p's
    $text = eregi_replace("div>", "p>", $text);
    // discard tags
    $text = strip_tags($text, '<p><b><i><ol><ul><li>');
    // remove class, lang, style, size and face attributes
    $text = ereg_replace("<([^>]*)(class|lang|style|size|face)=(\"[^\"]*\"|'[^']*'|[^>]+)([^>]*)>","<\\1>",$text);
    $text = ereg_replace("<([^>]*)(class|lang|style|size|face)=(\"[^\"]*\"|'[^']*'|[^>]+)([^>]*)>","<\\1>",$text);
    $text = ereg_replace(" >",">",$text);
    // strip non-breaking spaces
    $text = eregi_replace("&nbsp;", "", $text);
    // get rid of empty p's
    $text = eregi_replace("<p></p>", "", $text);
    return $text;
}
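Since the ereg functions are gone from current PHP versions, here is a rough sketch of the same cleanup using the PCRE functions. It is not the function above: the name cleanHtmlPcre is hypothetical, it matches the body tag loosely so that a body tag with attributes still counts, it skips stripslashes (which only matters if the input was escaped), and it swaps in a different technique for the attributes, reducing every surviving tag to its bare name in one pass instead of removing listed attributes one at a time.

```php
<?php
// Hypothetical PCRE-based sketch of the cleanup steps; names are illustrative.
function cleanHtmlPcre(string $text): string
{
    // Keep everything from the opening body tag onward, if there is one.
    $body = stristr($text, '<body');
    if ($body !== false) {
        $text = $body;
    }

    // Convert divs to paragraphs, then discard every tag not on the allow list.
    $text = preg_replace('/<(\/?)div\b/i', '<$1p', $text);
    $text = strip_tags($text, '<p><b><i><ol><ul><li>');

    // Reduce each remaining tag to its bare name, dropping all attributes.
    $text = preg_replace('/<(\/?)([a-z][a-z0-9]*)\b[^>]*>/i', '<$1$2>', $text);

    // Strip non-breaking space entities, then empty paragraphs.
    $text = str_replace('&nbsp;', '', $text);
    $text = preg_replace('/<p>\s*<\/p>/i', '', $text);

    return $text;
}

// Made-up Word-style input for illustration.
$sample = "<html><body lang=EN-US><div class=MsoNormal style='margin:0'>"
        . '<span>Hi</span>&nbsp;<b id="x">there</b></div><p class="y"></p></body></html>';
$cleaned = cleanHtmlPcre($sample);
echo $cleaned, "\n"; // <p>Hi<b>there</b></p>
```

An attribute value containing a literal ">" would still trip this up; a real HTML parser would be the safer route for anything beyond pasted Word output.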
Like I said, I was willing to take a brute force approach on this one. Nothing elegant about the above.
What is missing is a way to convert things like long hyphens, curly quotes, etc. into UTF-8 encoding. I used to have something bookmarked about how to do that conversion in a couple of lines, but I'll be danged if I can find it :(
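One possible couple-of-lines approach, sketched under a big assumption: that the pasted text arrives as Windows-1252 bytes, which is common for Word output but is not stated anywhere in this thread. In that case iconv can re-encode the curly quotes and dashes as UTF-8. The sample string is made up.

```php
<?php
// Assumption: the input is Windows-1252 encoded (typical of Word, not confirmed here).
// \x93 and \x94 are curly double quotes, \x96 an en dash, \x92 a curly apostrophe.
$cp1252 = "\x93quoted\x94 \x96 it\x92s";

// Re-encode the bytes as UTF-8; iconv() returns false if the conversion fails.
$utf8 = iconv('Windows-1252', 'UTF-8', $cp1252);

echo $utf8, "\n";
```

If the text is already UTF-8, running it through this conversion would mangle it, so the source encoding needs checking first.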
was only capturing the first character or two of the string
I don't usually use the POSIX regular expressions, so I cannot truthfully offer any insight as to why. I'm partial to the Perl-compatible regular expression engine. Reading this and trying to see why you are having issues brought up a whole new topic of discussion, though! I'll start a new thread regarding why you should no longer use the Regular Expression (POSIX Extended) [webmasterworld.com] functions, such as ereg_replace.
and the ubiquitous <o:> (whatever that is - but there sure are a lot of them)
Looks like an invalid "ordered list" element's opening tag, except instead of hitting the letter "L", somebody fat-fingered it and hit the colon ... <- no jokes please :)
Why the identical ereg_replace needs to be called twice I have not a clue, and the pragmatist in me doesn't have the energy to figure it out at this point.
I'm guessing it is a temporary workaround. Try your code against an element that has more than two of those attributes enclosed in its tags, and I'm guessing the process fails to remove the third one in line within the element.
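That guess is easy to try out. Below is a small sketch using the same pattern ported to preg_replace, on the assumption that it behaves like ereg_replace here. Because the leading [^>]* is greedy, each pass matches a whole tag once and drops only the last listed attribute in it, so a tag with three attributes needs three passes. The test element is made up.

```php
<?php
// The thread's pattern in PCRE syntax; assumed to behave like the ereg version.
$pattern = '/<([^>]*)(class|lang|style|size|face)=("[^"]*"|\'[^\']*\'|[^>]+)([^>]*)>/';

// Made-up element carrying three of the listed attributes.
$text = '<p class="a" style="b" lang="c">x</p>';

$once   = preg_replace($pattern, '<$1>', $text);
$twice  = preg_replace($pattern, '<$1>', $once);
$thrice = preg_replace($pattern, '<$1>', $twice);

echo $once, "\n";   // <p class="a" style="b" >x</p>
echo $twice, "\n";  // <p class="a" >x</p>
echo $thrice, "\n"; // <p >x</p>
```

After two passes the class attribute is still there, which matches the guess above. The leftover space before ">" is what the separate " >" replacement in the function cleans up.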