Forum Moderators: coopster
For those who may tread this path in the future, this seems to do the job:
$text = preg_replace_callback('/(<\w+)(.*?)(>)/s', function ($matches) { return $matches[1] . $matches[3]; }, $text);
New ground for me.
I finally identified the problem with my previous attempts: the odd line breaks that Word/Outlook were inserting here and there, sometimes in the middle of a tag.
Going to preg_replace_callback (which I had never used before) opened new possibilities, with even newer complexities.
In the end, out of frustration, I made that a lazy dot-star by adding the "?". With the dotall flag on, the dot also matches line breaks, so the greedy dot-star was swallowing everything up to the last ">" in the text. That was the piece that made it work after some 3 hours of head banging.
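To make the greedy versus lazy difference concrete, here is a minimal sketch. The sample string and the variable names are made up for illustration; only the pattern shape comes from the post above.

```php
<?php
// Hypothetical sample: a tag broken across lines, as Outlook might emit it.
$html = "<B\nstyle=\"x\">Bold</B> and <I class=\"y\">italic</I>";

// Callback that keeps the tag opener and the closing bracket, dropping the middle.
$keepEnds = function ($m) {
    return $m[1] . $m[3];
};

// Greedy: (.*) with the s flag runs to the LAST ">" in the string,
// so everything between the first "<B" and the final ">" is thrown away.
$greedy = preg_replace_callback('/(<\w+)(.*)(>)/s', $keepEnds, $html);

// Lazy: (.*?) stops at the FIRST ">" after each tag opener.
$lazy = preg_replace_callback('/(<\w+)(.*?)(>)/s', $keepEnds, $html);

echo $greedy, "\n"; // <B>
echo $lazy, "\n";   // <B>Bold</B> and <I>italic</I>
```

Without the s flag the dot would stop at the line break inside the first tag, which is presumably why the earlier attempts missed those tags.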
So, I have really messy code generated by Outlook like this:
<P class=MsoNormal style="MARGIN: 0in 0in 6pt; LINE-HEIGHT: normal"><B
style="mso-bidi-font-weight: normal"><SPAN style="FONT-SIZE: 12pt"><FONT
face=Arial>Financials:<o:p></o:p></FONT></SPAN></B></P>
I run it through this function:
function cleanUpHTML($text)
{
    //strip header stuff (search for '<body' so a body tag with attributes is found too)
    $body = stristr($text, '<body');
    if ($body !== false) {
        $text = $body;
    }
    //convert div's to p's, since Outlook sometimes appears to use div's rather than p's
    $text = preg_replace('/<(\/?)div\b/i', '<${1}p', $text);
    //discard unwanted tags (font, img, a, etc.)
    $text = strip_tags($text, '<p><b><i><ol><ul><li>');
    //remove all style, id, class attributes, etc.
    $text = preg_replace_callback('/(<\w+)(.*?)(>)/s', function ($matches) {
        return $matches[1] . $matches[3];
    }, $text);
    //get rid of useless non breaking spaces
    $text = preg_replace('/&nbsp;/', '', $text);
    //get rid of empty p's
    $text = preg_replace('/<p><\/p>/i', '', $text);
    return $text;
}
And I end up with this:
<P><B>Financials:</B></P>
I am still looking for a simple way to convert those extended MS characters (long dashes, curly quotes, ellipses, etc.) into UTF-8 characters, but the above little routine is doing a smash-up job cleaning up (most of) the MS garbage.
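One possible approach to the character problem, offered only as a sketch: if the pasted text really is in the single-byte Windows-1252 encoding (an assumption, it depends on how the form data arrives), PHP's iconv() can convert it in one call. The helper name below is hypothetical.

```php
<?php
// Assumption: $text arrives as single-byte Windows-1252 (CP1252).
// If it is already UTF-8, running this would double-encode it.
function cp1252ToUtf8($text) // hypothetical helper name
{
    $converted = iconv('CP1252', 'UTF-8', $text);
    // iconv() returns false on failure; fall back to the untouched input.
    return $converted === false ? $text : $converted;
}

// In CP1252: \x93 and \x94 are curly double quotes, \x97 is a long dash,
// and \x85 is an ellipsis.
$sample = "\x93Financials\x94 \x97 see below\x85";
echo cp1252ToUtf8($sample), "\n";
```

This needs the iconv extension, and it is worth checking the real encoding of the incoming text before relying on it.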
Thanks for the help coopster...
I figured you really didn't want to capture anything else there except the opening tag name of the element, skip all the attributes and find the closing tag marker. Since you weren't using any of the attributes, why capture additional subpatterns? Therefore, I put the less-than and greater-than signs (the tag boundary markers) in the replacement expression, as they were just static characters anyway. The real timesaver now in that updated expression is the middle part,
[^>]
Rather than match anything, we'll match anything that is not the closing marker. It will run much quicker and be much more efficient, because the engine never has to scan past a ">" and back up.
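For anyone reading along, the approach described above might look something like the sketch below. The exact pattern is a reconstruction from the description, not a quote, and the helper name is hypothetical.

```php
<?php
// Assumed form of the expression described above: capture only the tag
// name and put the static "<" and ">" back in the replacement string.
function stripAttributes($html) // hypothetical helper name
{
    // <      literal tag opener (not captured)
    // (\w+)  the element name, the only captured subpattern
    // [^>]*  any run of characters that are not the closing marker
    // >      literal tag closer (not captured)
    return preg_replace('/<(\w+)[^>]*>/', '<$1>', $html);
}

$sample = "<P class=MsoNormal style=\"MARGIN: 0in\"><B\nstyle=\"x\">Financials:</B></P>";
echo stripAttributes($sample), "\n"; // <P><B>Financials:</B></P>
```

Because [^>] matches line breaks as well, no dotall flag is needed, and a plain preg_replace does the job without a callback. It would still trip on an attribute value that contains a literal ">".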
Keep at it, you'll be an old regex pro in no time!
The frustration with regular expressions is directly related to their power. Not only are they cryptic (which can be overcome), but a single misplaced character can mess everything up without any warning or explanation, unlike a missing semicolon or parenthesis in more mundane code, which at least kicks up an error.
To wit, I had tried a negation [^>] but had placed it after the dot-star rather than before. Of course that didn't work.
I think a person would need to take about a week doing nothing but regex (with a cheat sheet) to get their head around it, and then do weekly booster sessions for about 6 months.
It is not like learning a foreign language; it is like learning an alien language consisting of squeaks, chirps and grunts.
That said, I have every intention of continuing to expand my ability to use these little gems. Perhaps someday I can near the fluency of folks like you and jdmorgan - people that seem to read regex like the rest of us read Dr. Seuss.
However, a hard fact is that it is *much* easier to write regex when you know the goal than it is to read regex written by someone else (or even something you yourself wrote six months ago), especially if you don't know (or remember) exactly what the goal was.
Because of this the learning curve is rather steep, and it does take a lot of experience to get comfortable. As in most things, nothing helps so much at learning regex as a compelling need to solve a number of problems in an important project.
And as a further ray of hope, I'd like to offer the observation that regular expressions are like Legos or Tinker-Toys: There are really not very many different kinds of 'pieces,' it is just that they can be combined in many, many ways to build a huge variety of things.
"I will not match it with a star
That makes the parser scan too far
Instead I'll negate with a caret
I would go on, but I can't bear it"
Jim