Need help with RegExp (I think)

         

Lance

9:53 pm on Oct 30, 2004 (gmt 0)

10+ Year Member



I'm trying to provide a "view source" feature with HTML syntax highlighting.

So far, I have the source code captured via an ActiveXObject('WinHttp.WinHttpRequest.5.1') and I have the <'s and >'s replaced with entities. I can then place this string on the destination page using "document.body.innerHTML = CapturedHTML;". This all works fine. What I'd like to do, however, is parse my CapturedHTML and wrap the various elements (which are now just part of a string, not actual elements any more) in appropriately styled <span> tags.

I think RegExp is the way to do this, but my knowledge only goes as far as being able to spell it. I learn well by example, so if some kind soul could show me how to parse, for example, this code:


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>Example Page</title>
<meta http-equiv="Content-Type" content="text/xhtml; charset=ISO-8859-1" />
</head>

<body>
<p class="bodytext">
</body>
</html>

I'd very greatly appreciate it.

Or perhaps I'm barking up the wrong tree altogether and there is a much easier way to do it. That would be great too.

Thanks in advance,
Lance
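
For readers following along, the entity-replacement step described above can be sketched as below. This is only a minimal illustration: the function name escapeHtml is hypothetical and not from the post, and escaping & as well (before < and >) is an added assumption so that entities already present in the captured source show up literally instead of being rendered.

```javascript
// Hypothetical helper; the name escapeHtml is illustrative, not from the thread.
function escapeHtml(source) {
  return source
    .replace(/&/g, '&amp;')  // first, so the entities added below are not re-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

var CapturedHTML = escapeHtml('<p class="bodytext">Fish &amp; chips</p>');
// CapturedHTML is now:
// &lt;p class="bodytext"&gt;Fish &amp;amp; chips&lt;/p&gt;
```

The resulting string can then be assigned to innerHTML as described, and the browser displays the markup as text.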

kaled

9:36 am on Oct 31, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you use ActiveX objects
1) It will only work on IE
2) IE users often have ActiveX disabled anyway.

I would suggest using Perl. In its simplest form, the script would simply have to read the file and supply an alternative MIME type (I think) to cause the browser to render the page as plain text.

Beyond that, you could use regular expressions/html/css to provide highlighting.

There is probably a script out there that does all this, but if not, it would be suitable as a programming exercise to learn Perl (or maybe PHP).

Kaled.

Lance

12:58 pm on Oct 31, 2004 (gmt 0)

10+ Year Member



Thanks but...

It's the regular expressions, or some other parsing technique, that I need help with. I am happy with the rest of the methods I have chosen to implement, and your suggestions just get me back to the same place I already am.

FWIW, I'm writing an IE equivalent to the Firefox developer's toolbar. Everything needs to run on the client so server side technologies are not a choice. And you can permit ActiveX in the "My Computer" zone, which is all that's required. Since the source page is being stripped of elements and being converted to text, ActiveX on the source page doesn't matter.

Regarding it only working in IE, well yeah, that's the point.

jollymcfats

3:57 pm on Oct 31, 2004 (gmt 0)

10+ Year Member



Why not lift the regular expressions from an existing code highlighter? Having personally spent time writing highlighting regexes, that's what I would suggest. :)

There are lots of open source highlighters out there, and you can probably find one that has a license compatible with your project. I would probably start my search for a good project at Freshmeat.

Lance

4:45 pm on Oct 31, 2004 (gmt 0)

10+ Year Member



jollymcfats:

That's a good idea, and something I've actually tried already. Problem is, regular expressions are so foreign to me that I don't understand what I'm looking at/for.

As it turns out, I've just about got it anyway. Quite a bit of RTFM and trial and error has me pretty much done with it. The last thing I need to do is make sure that an attribute name is only matched inside an element tag. Otherwise it highlights the 2 in "3 + 2=5" as an attribute name, even if it's in the middle of a paragraph.

If you're interested, here's what I have so far:

HTML = HTML.replace(/</g, '&lt;');
HTML = HTML.replace(/>/g, '&gt;');

This one still needs some work to determine that it is within an element tag:
HTML = HTML.replace(/([ :])([A-Za-z0-9]*)=/g, '$1<span class=Attribute>$2</span>=');

These could probably be combined, but I'm not good enough yet and getting too complex would confuse me later:
HTML = HTML.replace(/&lt;([A-Za-z0-9]*)&gt;/g, '&lt;<span class=Element>$1</span>&gt;');
HTML = HTML.replace(/&lt;([A-Za-z0-9]*) /g, '&lt;<span class=Element>$1</span> ');
HTML = HTML.replace(/&lt;\/([A-Za-z0-9]*)&gt;/g, '&lt;\/<span class=Element>$1</span>&gt;');

HTML = HTML.replace(/(&lt;!.*&gt;)/gi, '<span class=Doctype>$1</span>');
HTML = HTML.replace(/&lt;!--(.*)--&gt;/g, '<span class=Comment>&lt;!--$1--&gt;</span>');

HTML = "<pre>" + HTML + "</pre>";

I have to give credit to a Great Resource for Regular Expressions [regular-expressions.info] I found.
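
One possible way to keep the attribute pattern from firing outside tags is a two-pass replace: match each escaped tag first, then run the attribute pattern only on the matched text. The sketch below is an assumption-laden illustration, not the poster's code. The name highlightAttributes is hypothetical, it uses + instead of * so that empty names are never wrapped, and it treats anything between &lt; and the next &gt; as a tag, which literal < or > characters in body text or attribute values would defeat.

```javascript
// Hypothetical helper; highlightAttributes is not from the original post.
// Expects a string in which < and > were already replaced by &lt; and &gt;.
function highlightAttributes(escaped) {
  // Pass 1: find each escaped tag, from &lt; to the nearest &gt; (may span lines).
  return escaped.replace(/&lt;[\s\S]*?&gt;/g, function (tag) {
    // Pass 2: inside that tag only, wrap names that directly precede "=".
    return tag.replace(/([ :])([A-Za-z0-9]+)=/g,
      '$1<span class=Attribute>$2</span>=');
  });
}

var sample = '&lt;p class="bodytext"&gt;3 + 2=5&lt;/p&gt;';
var highlighted = highlightAttributes(sample);
// highlighted is now:
// &lt;p <span class=Attribute>class</span>="bodytext"&gt;3 + 2=5&lt;/p&gt;
```

Text such as "3 + 2=5" between tags is left alone because the inner replace never sees it. A name=value lookalike inside a quoted value (for example charset= in the meta tag's content attribute) would still be wrapped, so this narrows the problem without fully solving it.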

jollymcfats

5:24 pm on Oct 31, 2004 (gmt 0)

10+ Year Member



Something along these lines might get you started. The idea below is to anchor the search to something that starts with &lt;tag (the escaped form of <tag). This works in a global replace, but the span will wrap the whole attribute="value" run rather than just the attribute name as you had originally.

/&lt;([A-Za-z0-9]+)((\s+[a-zA-Z0-9]+(?:=(?:[A-Za-z0-9]|"[^"]+")))+)/
->
&lt;$1<span class="Attributes">$2</span>
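
As a rough sketch of how the pattern above might be applied in JavaScript, it can be passed to String.replace with the g flag. The wrapper name highlightAttributeRuns is hypothetical, and the alternation is written with | here on the assumption that the broken-bar character shown in the forum's rendering stands for a pipe.

```javascript
// Hypothetical wrapper; highlightAttributeRuns is not a name from the thread.
// Expects a string in which < and > were already replaced by &lt; and &gt;.
function highlightAttributeRuns(escaped) {
  // $1 = tag name, $2 = the whole run of name=value pairs after it.
  var pattern =
    /&lt;([A-Za-z0-9]+)((\s+[a-zA-Z0-9]+(?:=(?:[A-Za-z0-9]|"[^"]+")))+)/g;
  return escaped.replace(pattern, '&lt;$1<span class="Attributes">$2</span>');
}

var line = '&lt;p class="bodytext"&gt;3 + 2=5&lt;/p&gt;';
var result = highlightAttributeRuns(line);
// result is now:
// &lt;p<span class="Attributes"> class="bodytext"</span>&gt;3 + 2=5&lt;/p&gt;
```

Because the match has to begin at &lt; followed by a tag name, "3 + 2=5" in body text is not touched. Note that [a-zA-Z0-9]+ does not cover names containing a hyphen or colon, so a tag whose first attribute is http-equiv or xml:lang is left unhighlighted. Adding those characters to the class would be one way to extend it.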