Was eregi replace help - PHP Server Side Scripting forum at WebmasterWorld - WebmasterWorld

Forum Moderators: coopster

Was eregi replace help

now preg_replace_callback

willybfriendly

5:12 am on Jan 23, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Over here [webmasterworld.com] I was looking for help with an eregi_replace function. Thanks to coopster I dropped that and started looking at preg_replace.

For those who may tread this path in the future, this seems to do the job:

$text = preg_replace_callback("/(<\w+)(.*?)(>)/s", function ($matches) { return $matches[1] . $matches[3]; }, $text);

New ground for me.

I finally identified the problem with my previous attempts: the odd line breaks that Word/Outlook were inserting here and there.

Going to preg_replace_callback (which I had never used before) opened new possibilities, with even newer complexities.

In the end, out of frustration, I made that a lazy dot-star by adding the "?". With the dotall flag on, the greedy dot-star was matching far too much. That was the piece that made it work after some 3 hours of head banging.
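For anyone who wants to see the greedy/lazy difference described above, here is a minimal sketch. The sample string is made up for illustration (it only imitates the line-broken tags Outlook produces), and the variable names are hypothetical.

```php
<?php
// Hypothetical sample: an attribute list broken across two lines.
$html = "<P class=MsoNormal><B\nstyle=\"x\">Financials:</B></P>";

// Greedy + dotall: .* runs to the LAST ">" in the whole input,
// so everything after the first tag name is swallowed.
$greedy = preg_replace('/(<\w+)(.*)(>)/s', '$1$3', $html);

// Lazy + dotall: .*? stops at the FIRST ">" after each tag name.
$lazy = preg_replace('/(<\w+)(.*?)(>)/s', '$1$3', $html);

echo $greedy, "\n"; // <P>
echo $lazy, "\n";   // <P><B>Financials:</B></P>
```

The dotall flag itself is not what makes the match greedy; it only lets the dot cross line breaks, which gives a greedy quantifier much more text to consume.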

So, I have really messy code generated by Outlook like this:

<P class=MsoNormal style="MARGIN: 0in 0in 6pt; LINE-HEIGHT: normal"><B
style="mso-bidi-font-weight: normal"><SPAN style="FONT-SIZE: 12pt"><FONT
face=Arial>Financials:<o:p></o:p></FONT></SPAN></B></P>

I run it through this function:

function cleanUpHTML($text)
{
    //strip header stuff
    $text = stristr($text, '<body>');
    //convert div's to p's, since Outlook sometimes appears to use div's rather than p's
    $text = preg_replace("/div>/i", "p>", $text);
    //discard unwanted tags (font, img, a, etc.)
    $text = strip_tags($text, '<p><b><i><ol><ul><li>');
    //remove all style, id, class attributes, etc. (\w+ keeps the whole tag name, e.g. <ol>)
    $text = preg_replace_callback("/(<\w+)(.*?)(>)/s", function ($matches) { return $matches[1] . $matches[3]; }, $text);
    //get rid of useless non breaking spaces
    $text = preg_replace("/&nbsp;/", "", $text);
    //get rid of empty p's
    $text = preg_replace("/<p><\/p>/i", "", $text);

    return $text;
}

And I end up with this:

<P><B>Financials:</B></P>

I am still looking for a simple way to convert those extended MS characters (long dashes, curly quotes, ellipses, etc.) into UTF-8 characters, but the above little routine is doing a smash-up job cleaning up (most of) the MS garbage.
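One possible starting point for the character conversion mentioned above, offered as a sketch rather than a finished answer. It assumes the incoming text is single-byte Windows-1252 and not already UTF-8; the function name and the choice of which characters to map are hypothetical.

```php
<?php
// Hypothetical helper. Assumes $text is Windows-1252 bytes, NOT UTF-8:
// running it on UTF-8 input would corrupt multi-byte characters.
function msPunctuationToUtf8(string $text): string
{
    $map = [
        "\x85" => "\xE2\x80\xA6", // ellipsis (U+2026)
        "\x91" => "\xE2\x80\x98", // left single quote (U+2018)
        "\x92" => "\xE2\x80\x99", // right single quote (U+2019)
        "\x93" => "\xE2\x80\x9C", // left double quote (U+201C)
        "\x94" => "\xE2\x80\x9D", // right double quote (U+201D)
        "\x96" => "\xE2\x80\x93", // en dash (U+2013)
        "\x97" => "\xE2\x80\x94", // em dash (U+2014)
    ];

    // strtr() with an array does not rescan text it has already replaced,
    // so the bytes it writes are never translated a second time.
    return strtr($text, $map);
}
```

This only covers the handful of punctuation marks listed; other high bytes (accented letters, for example) are left untouched. Where the iconv or mbstring extension is available, a full Windows-1252 to UTF-8 conversion is probably the more robust route.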

Thanks for the help coopster...

coopster

2:41 pm on Jan 23, 2009 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

I'm wondering if you truly need to be using a callback function though? It seems as if you should be able to just use preg_replace without the overhead of creating a lambda function and invoking it.

$text = preg_replace("/<(\w+)[^>]*?>/s", "<$1>", $text);
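A quick side-by-side of the callback form and the plain backreference form, as a sketch on a made-up sample string. One detail worth checking in either form: `\w` on its own matches a single character, so the tag name needs `\w+` if multi-letter tags such as `<ol>` or `<li>` are to survive.

```php
<?php
// Hypothetical sample containing multi-letter tag names.
$html = '<ol start=1><li style="x">One</li></ol>';

// Callback form: rebuild each opening tag from the captured pieces.
$viaCallback = preg_replace_callback(
    '/(<\w+)(.*?)(>)/s',
    function ($m) {
        return $m[1] . $m[3];
    },
    $html
);

// Plain form: a backreference in the replacement string does the same job.
$viaBackref = preg_replace('/<(\w+)[^>]*>/', '<$1>', $html);

// With a single \w the tag name is cut down to its first letter.
$truncated = preg_replace('/<(\w)[^>]*>/', '<$1>', $html);

echo $viaCallback, "\n"; // <ol><li>One</li></ol>
echo $viaBackref, "\n";  // <ol><li>One</li></ol>
echo $truncated, "\n";   // <o><l>One</li></ol>
```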

willybfriendly

7:35 pm on Jan 23, 2009 (gmt 0)

Ah, all of the power and frustration of regular expressions revealed in a mere 14 characters.

Your solution works like a charm!

Thanks again coopster.

coopster

8:17 pm on Jan 23, 2009 (gmt 0)

Don't get frustrated! You're learning and regular expressions can take some time to wrap your head around. It's all worth it, trust me.

I figured you really didn't want to capture anything else there except the opening tag name of the element, skip all the attributes and find the closing tag marker. Since you weren't using any of the attributes, why capture additional subpatterns? Therefore, I put the less than and greater than signs (the tag boundary markers) in the replacement expression as they were just static characters anyway. The real timesaver now in that updated expression is the middle part,

[^>]

Rather than match anything, we'll match anything that is not the closing marker. It will run much quicker and more efficiently.
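A small sketch of why the negated class is such a good fit here, using a made-up sample with a line break inside a tag. A negated character class matches newlines by itself, so neither the dotall flag nor a lazy quantifier is strictly required.

```php
<?php
// Hypothetical sample: the first tag is split across two lines.
$html = "<B\nstyle=\"mso-bidi-font-weight: normal\">"
      . "<SPAN style=\"FONT-SIZE: 12pt\">Financials:</SPAN></B>";

// [^>]* happily crosses the line break and can never run past a ">".
$clean = preg_replace('/<(\w+)[^>]*>/', '<$1>', $html);

// For comparison: a lazy dot WITHOUT the s flag cannot cross the
// line break, so the first tag is left untouched.
$dotOnly = preg_replace('/<(\w+).*?>/', '<$1>', $html);

echo $clean, "\n"; // <B><SPAN>Financials:</SPAN></B>
```

Because `[^>]*` has a hard stop at the first `>`, the engine has nothing to back out of, which is where the speed-up comes from.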

Keep at it, you'll be an old regex pro in no time!

willybfriendly

1:55 am on Jan 24, 2009 (gmt 0)

Thanks for the morale boost coopster.

The frustration with regular expressions is directly related to their power. Not only are they cryptic (which can be overcome), but a single misplaced character can mess everything up, without any of the warnings or explanations that a missing semicolon or parenthesis in more mundane code might kick up.

To wit, I had tried a negation [^>] but had placed it after the dot-star rather than before. Of course that didn't work.
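The exact pattern from that failed attempt is not shown, so the following is only a guess at its shape, run on a made-up sample. It illustrates why a negation placed after the dot-star does not rein it in.

```php
<?php
// Hypothetical sample.
$html = '<b class=x>Hi</b>';

// Negation AFTER the dot-star: .* is still greedy, and [^>] merely
// requires one non-">" character before the final ">".
$misplaced = preg_replace('/<(\w+).*[^>]>/', '<$1>', $html);

// Negation INSTEAD of the dot-star: the match ends at the first ">".
$negated = preg_replace('/<(\w+)[^>]*>/', '<$1>', $html);

echo $misplaced, "\n"; // <b>
echo $negated, "\n";   // <b>Hi</b>
```

In the first pattern the text and the closing tag are swallowed along with the attributes, and PHP raises no warning because the pattern is perfectly valid.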

I think a person would need to take about a week doing nothing but regex (with a cheat sheet) to get their head around it, and then do weekly booster sessions for about 6 months.

It is not like learning a foreign language, it is like learning an alien language consisting of squeaks, chirps and grunts.

That said, I have every intention of continuing to expand my ability to use these little gems. Perhaps someday I can near the fluency of folks like you and jdmorgan - people that seem to read regex like the rest of us read Dr. Seuss.

jdMorgan

8:55 pm on Jan 26, 2009 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

It seems a common experience to have 'a eureka moment' when learning regular expressions, after which things become much clearer.

However, a hard fact is that it is *much* easier to write regex when you know the goal than it is to read regex written by someone else --or even something you yourself wrote six months ago-- and especially if you don't know (or remember) exactly what the goal was.

Because of this the learning curve is rather steep, and it does take a lot of experience to get comfortable. As in most things, nothing helps so much at learning regex as a compelling need to solve a number of problems in an important project.

And as a further ray of hope, I'd like to offer the observation that regular expressions are like Legos or Tinker-Toys: There are really not very many different kinds of 'pieces,' it is just that they can be combined in many, many ways to build a huge variety of things.
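On the point about regex being harder to read than to write, one thing that may help is PCRE's extended mode (the `x` modifier), which ignores whitespace in the pattern and allows `#` comments. A sketch using the tag-cleaning pattern from this thread, with a made-up sample string:

```php
<?php
// The x modifier lets each "piece" sit on its own line with a note.
$pattern = '/
    <        # opening tag marker
    (\w+)    # capture the tag name
    [^>]*    # skip attributes: anything except the closing marker
    >        # closing tag marker
/x';

$clean = preg_replace($pattern, '<$1>', '<FONT face=Arial>Financials:</FONT>');

echo $clean, "\n"; // <FONT>Financials:</FONT>
```

The comments must not contain the pattern delimiter (here `/`), since PHP would treat it as the end of the pattern.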

"I will not match it with a star
That makes the parser scan too far
Instead I'll negate with a caret
I would go on, but I can't bear it"

Jim

coopster

12:32 am on Jan 27, 2009 (gmt 0)

LOL!
Well, that little Dr. Seuss moment was nice. It was much easier to read than a regex. Thanks for the light reading, jdSeuss ;)