Tough Regular Expressions string replace

replace string patterns

Garoun

6:11 pm on Oct 14, 2003 (gmt 0)

I'll use two examples to show my problem. Basically, I'm writing a text-editor-style application that sends its output to the database as plain text on submit. What I need is a regular expression that removes all of the special HTML tag formatting that MS Word is so kind as to paste in.

(Side note: I've also thought about using the inner references, which are another option 'currently', so suggestions along those lines are welcome too.)

String 1 Example: (assume Str1 = )


Str1 = "<p class=MsoNormal><b style='mso-bidi-font-height:normal'>Glossary of assumed
terms:<o:p></o:p></b></p>"

String 2 Example: (assume Str2 = )


Str2 = "<li class=MsoNormal style='mso-list:l0 level1 lfo1;tab-stops:list .5in'>Desktop</li>"

Currently I don't care about lost formatting; I simply want all of the tags gone and the plain text output. Hopefully the regular expressions are pretty straightforward. Any ideas would be greatly appreciated ASAP.

Thanks

claus

6:27 pm on Oct 14, 2003 (gmt 0)

>> I simply want all of the tags gone. and the plain text output

$someString = "<li class=MsoNormal style='mso-list:...";
$someString =~ s/<.*>//igm;

This is Perl regexp syntax: line 2 is the substitution you need, and lines 1 and 2 together show how to apply it. (igm = ignore case, global, multiline)

/claus

Welcome to WebmasterWorld Garoun :)

Garoun

6:49 pm on Oct 14, 2003 (gmt 0)

My fault on the terminology. You definitely gave me a good starting point for deciphering what I need :) Thanks a lot.

I shouldn't need the multiline flag, but I'll keep it in mind if my data isn't coming out quite right, which will most likely be down to the lovely carriage returns.

Now I just need to convert your Perl regexp over to JavaScript for use in HTML/JavaScript, which won't be very hard... I'll be sure to come up with a harder question later. :D

Thank you for the prompt reply, I was crossing my fingers.

I will get this regexp stuff figured out eventually; it's far too powerful an option not to learn.

timster

3:36 pm on Oct 15, 2003 (gmt 0)

$someString =~ s/<.*>//igm ;

That could remove more than the doctor intended (i.e., everything between the first "<" and the last ">" on a line).

These might be better:

$someString =~ s/<.*?>//gis ;
$someString =~ s/<[^>]*>//gis ;
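
For anyone following along in JavaScript, the difference between the three patterns can be sketched roughly like this. The sample string and variable names are made up for illustration and are not taken from the application discussed above.

```javascript
// Hypothetical sample: two pairs of tags on one line with text in between.
const sample = "<b>Glossary</b> of <i>terms</i>";

// Greedy: .* runs to the last ">" on the line, so the text goes too.
const greedy = sample.replace(/<.*>/g, "");

// Lazy: .*? stops at the first ">" after each "<".
const lazy = sample.replace(/<.*?>/g, "");

// Negated class: [^>]* can never run past a ">".
const negated = sample.replace(/<[^>]*>/g, "");

console.log(JSON.stringify(greedy));  // ""
console.log(JSON.stringify(lazy));    // "Glossary of terms"
console.log(JSON.stringify(negated)); // "Glossary of terms"
```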

Garoun

4:11 pm on Oct 15, 2003 (gmt 0)

Yeah, the original piece Claus gave was doing a bit much once I tested it, so my 'current' implementation uses:


temp2_field_val = temp_field_val.replace(/<.*?>/ig," ");

Where temp_field_val is my text string I want stripped
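
One thing to watch with that pattern: in JavaScript the "." does not match line breaks, so a tag that wraps onto a second line would be left in place. A rough sketch of a variation using the negated class instead, assuming it is fine to collapse the leftover whitespace (the stripTags name and the wrapped sample are hypothetical):

```javascript
// Strip anything that looks like a tag, then tidy the whitespace left behind.
function stripTags(html) {
  return html
    .replace(/<[^>]*>/g, " ") // [^>] also matches line breaks, unlike "."
    .replace(/\s+/g, " ")     // collapse runs of spaces and line breaks
    .trim();
}

// Str1 from the first post, including its line break.
const str1 = "<p class=MsoNormal><b style='mso-bidi-font-height:normal'>Glossary of assumed\nterms:<o:p></o:p></b></p>";
console.log(stripTags(str1)); // Glossary of assumed terms:

// Made-up variation of Str2 with the opening tag broken across two lines.
const wrapped = "<li class=MsoNormal\nstyle='tab-stops:list .5in'>Desktop</li>";
console.log(wrapped.replace(/<.*?>/g, " ").includes("MsoNormal")); // true (tag was missed)
console.log(stripTags(wrapped)); // Desktop
```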

Storyteller

2:42 am on Oct 17, 2003 (gmt 0)

Garoun, for this particular task you're much better off using HTML::TreeBuilder or even W3C's HTML Tidy utility (if all you want is to scratch MS Word artifacts).

With just the regular expressions you'll have a hard time covering numerous special cases that will totally screw such a straightforward "parser".
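
A few of the special cases in question, sketched in JavaScript with made-up inputs (none of these come from Garoun's data, and whether Word ever produces them is a separate question):

```javascript
// The one-liner under discussion, wrapped in a helper for the demo.
const strip = (html) => html.replace(/<[^>]*>/g, "");

// A ">" inside a quoted attribute ends the match too early.
const attr = strip('<a title="a > b">link</a>');
console.log(JSON.stringify(attr)); // " b\">link"

// Script content is text, not markup, so it survives the strip.
const script = strip("<script>var x = 1;</script>Hello");
console.log(JSON.stringify(script)); // "var x = 1;Hello"

// A bare "<" in ordinary text is taken as the start of a tag.
const bare = strip("if a < b then <b>stop</b>");
console.log(JSON.stringify(bare)); // "if a stop"
```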

Garoun

1:49 pm on Oct 17, 2003 (gmt 0)

Thanks for the info, Storyteller. I'll be working more with my parser and such later this month. I will definitely check out the items you mentioned to see how they could help with the issues I encounter as the application advances.

timster

2:19 pm on Oct 17, 2003 (gmt 0)

Very true Storyteller. Still, I've never seen MS Word create any HTML that the one-liner can't handle.

Garoun

3:52 pm on Oct 17, 2003 (gmt 0)

Well, it's already been asked of me to attempt to 'save' the formatting and convert it to a format I can use,
e.g.:
<UL class=asfawref.sdfsd sdfasd sadfasd>Text</UL> -> <UL>Text</UL>

It won't be easy at all due to the many variations Word adds, but if it were easy there'd be more people doing what we do :D.
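
As a first rough pass at that conversion, something like the following might work: keep a short list of wanted tags with their attributes dropped, and remove every other tag. This is only a sketch. The KEEP list and the simplifyTags name are assumptions for illustration, and the same caveats about odd markup raised earlier in the thread still apply.

```javascript
// Tags to keep (hypothetical allow-list; adjust to taste).
const KEEP = new Set(["b", "i", "p", "ul", "ol", "li"]);

function simplifyTags(html) {
  // Group 1 = optional "/", group 2 = tag name (may contain ":" as in <o:p>).
  return html.replace(/<(\/?)([a-zA-Z][\w:-]*)[^>]*>/g, (match, slash, name) =>
    // Kept tags are rebuilt without attributes; everything else is dropped.
    KEEP.has(name.toLowerCase()) ? "<" + slash + name + ">" : ""
  );
}

const list = simplifyTags("<UL class=asfawref.sdfsd sdfasd sadfasd>Text</UL>");
console.log(list); // <UL>Text</UL>

const para = simplifyTags("<p class=MsoNormal><b style='mso-bidi-font-height:normal'>Glossary<o:p></o:p></b></p>");
console.log(para); // <p><b>Glossary</b></p>
```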

Garoun