Forum Moderators: coopster

eregi help

stripping attributes not working

         

willybfriendly

2:19 am on Jan 21, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



OK, I have always struggled with regex. I have a bunch of documents generated by Outlook and Word sitting in a DB that need to be displayed in a page. (We all know about MS HTML garbage, but that is not a part of this post.)

I decided on a brute force approach, but my eregi_replace is not working.

The regex is:

\s(class|style|id)[\s=]+["']?.*["'\s]?

Now, as I read that it says

1. Find a pattern starting with one or more spaces, \s
2. Followed by "class" or "style" or "id", (class|style|id)
3. Followed by 1 or more spaces or equal signs, [\s=]+
4. Followed by zero or one single or double quotes, ["']?
5. Followed by any combination of characters, .*
6. Ending with zero or one double or single quotes or spaces, ["'\s]?

So, I should be able to run this:

$text = eregi_replace('\s(class|style|id)[\s=]+["\']?.*["\'\s]?', '', $text);

and strip every class, style and id attribute in the entire document - right?

Doesn't work :(

So, what am I doing wrong here?

coopster

11:41 pm on Jan 21, 2009 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



One issue is that you may have multiple space separated class values, no?

willybfriendly

2:29 pm on Jan 22, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



True enough, Coopster, but I can clean the empty spaces up easy enough. Problem was that the expression in question was only capturing the first character or two of the string. I ended up going in search and found an expression that worked out on the net.

That said, I still don't understand where my logic in the provided expression was wrong, or why it didn't work, and thought it might be a good learning tool for me.

For the record, I first ran a strip_tags($text) to get rid of spurious stuff like spans, fonts, and the ubiquitous <o:> (whatever that is - but there sure are a lot of them), leaving only p, ul, ol, li, b and i tags. That cleaned up a whole lot of garbage.

Then I was trying to strip out all the attributes with that little regex.

What I ended up with was this:

$text = ereg_replace("<([^>]*)(class|lang|style|size|face)=(\"[^\"]*\"|'[^']*'|[^>]+)([^>]*)>","<\\1>",$text);
$text = ereg_replace("<([^>]*)(class|lang|style|size|face)=(\"[^\"]*\"|'[^']*'|[^>]+)([^>]*)>","<\\1>",$text);

which appears to work. Why the identical ereg_replace needs to be called twice I have not a clue, and the pragmatist in me doesn't have the energy to figure it out at this point.

The complete function looks like this, and seems to be doing the intended job:

function cleanHTML($text)
{
    $text = stripslashes($text);
    // keep everything from the <body> tag onward
    $text = stristr($text, '<body>');
    // convert div's to p's
    $text = eregi_replace("div>", "p>", $text);
    // discard all tags except the ones listed
    $text = strip_tags($text, '<p><b><i><ol><ul><li>');
    // remove class, lang, style, size and face attributes (run twice)
    $text = ereg_replace("<([^>]*)(class|lang|style|size|face)=(\"[^\"]*\"|'[^']*'|[^>]+)([^>]*)>","<\\1>",$text);
    $text = ereg_replace("<([^>]*)(class|lang|style|size|face)=(\"[^\"]*\"|'[^']*'|[^>]+)([^>]*)>","<\\1>",$text);
    // tidy the space left behind before the closing bracket
    $text = ereg_replace(" >",">",$text);
    // strip non-breaking spaces
    $text = eregi_replace("&nbsp;", "", $text);
    // get rid of empty p's
    $text = eregi_replace("<p></p>", "", $text);
    // hand the cleaned markup back to the caller
    return $text;
}

Like I said, I was willing to take a brute force approach on this one. Nothing elegant about the above.

What is missing is a way to convert things like long hyphens, curly quotes, etc. into UTF-8 encoding. I used to have something bookmarked about how to do that conversion in a couple of lines, but I'll be danged if I can find it :(
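One possible sketch for that conversion, assuming the stored documents are Windows-1252 encoded (common for older Outlook and Word output, though nothing in this thread confirms it) and that the iconv extension is available. The function name toUtf8 is made up for illustration.

```php
<?php
// Hypothetical helper: re-encode Windows-1252 text as UTF-8.
// Assumption: $text really is Windows-1252. Running this on text that is
// already UTF-8 would double-encode its multi-byte characters.
function toUtf8($text)
{
    return iconv('CP1252', 'UTF-8', $text);
}

// "\x93" and "\x94" are the CP1252 curly double quotes, "\x97" is the long dash.
$sample = "\x93quoted\x94 \x97 done";
$utf8 = toUtf8($sample);
```

If the data turns out to be in some other encoding, only the first argument to iconv should need changing.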

coopster

7:03 pm on Jan 22, 2009 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



was only capturing the first character or two of the string

I don't usually use the POSIX regular expressions, so I cannot truthfully offer any insight as to why. I'm partial to the Perl-compatible regular expression engine. Reading this and trying to see why you are having issues brought up a whole new topic of discussion, though! I'll start a new thread regarding why you should no longer use the Regular Expression (POSIX Extended) [webmasterworld.com] functions, such as ereg_replace.
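For comparison, here is a minimal sketch of the same idea using the PCRE functions. The function name stripAttributes is hypothetical, and the attribute list is just the three from the first post. The value is matched with negated character classes rather than .*, because a greedy .* runs well past the closing quote.

```php
<?php
// Hypothetical helper (not from the thread): strip class, style and id
// attributes with preg_replace instead of the POSIX ereg_* functions.
function stripAttributes($html)
{
    // \s+                    whitespace in front of the attribute name
    // (?:class|style|id)     one of the attribute names, case-insensitive
    // \s*=\s*                equals sign, optionally padded with whitespace
    // "[^"]*" | '[^']*'      a quoted value, stopping at the matching quote
    // [^\s>]+                or an unquoted value, stopping at space or >
    $pattern = '/\s+(?:class|style|id)\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s>]+)/i';
    return preg_replace($pattern, '', $html);
}

$html = '<p class="MsoNormal" style=\'margin:0\' id=x1>Hello</p>';
$clean = stripAttributes($html);
```

Because preg_replace replaces every match, one call handles any number of attributes in a tag. The catch is that this simple pattern does not check that it is inside a tag, so body text that happens to look like style=something would be removed as well.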

and the ubiquitous <o:> (whatever that is - but there sure are a lot of them)

Looks like an invalid "ordered list" element's opening tag, except instead of hitting the letter "L", somebody fatfingered it and hit the colon ... <- no jokes please :)

Why the identical ereg_replace needs to be called twice I have not a clue, and the pragmatist in me doesn't have the energy to figure it out at this point.

I'm guessing it is a temporary workaround. Try your code against an element that has more than two of those attributes enclosed in its tags and I'm guessing the process fails to remove the third one in line within the element.
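A rough way to check that guess, using a PCRE translation of the pattern (my own translation, with the /i flag added, so treat it as an assumption). The first group, ([^>]*), is greedy, so each pass finds the last listed attribute in a tag and keeps only what came before it. Looping until nothing is replaced handles any number of attributes. The function name stripListedAttributes is hypothetical, and the $count argument needs PHP 5.1 or later.

```php
<?php
// PCRE translation of the ereg pattern from the thread (an assumption, not
// the original code). Each pass drops one listed attribute per tag, along
// with anything that follows it inside that tag.
$pattern = '/<([^>]*)(?:class|lang|style|size|face)=(?:"[^"]*"|\'[^\']*\'|[^>]+)[^>]*>/i';

// Hypothetical helper: repeat the replacement until no tag matches any more.
function stripListedAttributes($text, $pattern)
{
    do {
        $text = preg_replace($pattern, '<$1>', $text, -1, $count);
    } while ($count > 0);
    // tidy the space left behind before the closing bracket
    return str_replace(' >', '>', $text);
}

$tag = '<p class="a" style="b" lang="c">x</p>';
$onePass = preg_replace($pattern, '<$1>', $tag);     // only lang is removed
$allPasses = stripListedAttributes($tag, $pattern);  // all three are removed
```

With three attributes in the tag, calling the replacement twice would still leave the first one behind, which matches the guess above.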

willybfriendly

10:29 pm on Jan 22, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



So, given your new thread it looks like I will need to rework this code a bit. The site in question is still running on some 4.x.x version of PHP, but I suspect that an upgrade is not too far off in the future.

Thanks for the help. I might be back!