So far, I have the source code captured via an ActiveXObject('WinHttp.WinHttpRequest.5.1') and I have the <'s and >'s replaced with entities. I can then place this string object on the destination page using "document.body.innerHTML = CapturedHTML;". This is all well and good and working fine. What I'd like to do however, is parse my CapturedHTML and apply the appropriately styled <span> tags to the various elements (which are now all part of a string, and not actual elements any more).
I think RegExp is the way to do this, but my knowledge only goes as far as being able to spell it. I learn well by example, so if some kind soul could show me how to parse, for example, this code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>Example Page</title>
<meta http-equiv="Content-Type" content="text/xhtml; charset=ISO-8859-1" />
</head><body>
<p class="bodytext">
</body>
</html>
I'd very greatly appreciate it.
Or perhaps I'm barking up the wrong tree altogether and there is a much easier way to do it. That would be great too.
Thanks in advance,
Lance
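For reference, the escaping step described in the post above can be sketched as a small function. This is only an illustration: escapeHtml is a hypothetical name, and escaping the ampersand is an addition the post does not mention; it is included so that entities already present in the captured source show up literally instead of being decoded by the browser.

```javascript
// Hypothetical helper: make captured markup safe to display as text via innerHTML.
function escapeHtml(source) {
  return source
    .replace(/&/g, '&amp;') // run first, so the entities added below are not escaped again
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Example input is made up for illustration.
var CapturedHTML = escapeHtml('<p class="bodytext">Fish &amp; chips</p>');
// CapturedHTML is now '&lt;p class="bodytext"&gt;Fish &amp;amp; chips&lt;/p&gt;'
```

Any highlighting done afterwards then has to look for &lt; and &gt; in the string, not for the raw angle brackets.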
I would suggest using Perl. In its simplest form, the script would simply have to read the file and supply an alternative MIME type (I think) to cause the browser to render the page as plain text.
Beyond that, you could use regular expressions/html/css to provide highlighting.
There is probably a script out there that does all this, but if not, it would be suitable as a programming exercise to learn Perl (or PHP maybe).
Kaled.
It's the regular expressions, or some other parsing technique, that I need help with. I am happy with the rest of the methods I have chosen to implement, and your suggestions just get me back to the same place I already am.
FWIW, I'm writing an IE equivalent to the Firefox developer's toolbar. Everything needs to run on the client so server side technologies are not a choice. And you can permit ActiveX in the "My Computer" zone, which is all that's required. Since the source page is being stripped of elements and being converted to text, ActiveX on the source page doesn't matter.
Regarding it only working in IE, well yeah, that's the point.
There are lots of open source highlighters out there, and you can probably find one that has a license compatible with your project. I would probably start my search for a good project at Freshmeat.
That's a good idea, and something I've actually tried already. Problem is, regular expressions are so foreign to me that I don't understand what I'm looking at/for.
As it turns out, I've just about got it anyway. Quite a bit of RTFM and trial and error has me pretty much done with it. The last thing I need to do is make sure that an attribute name is only matched inside an element tag. Otherwise it changes "3 + 2=5" into "3 + <span class=Attribute>2</span>=5" even if it's in the middle of a paragraph.
If you're interested, here's what I have so far:
HTML = HTML.replace(/</g, '&lt;');
HTML = HTML.replace(/>/g, '&gt;');
This one still needs some work to determine that it is within an element tag:
HTML = HTML.replace(/([ :])([A-Za-z0-9]*)=/g, '$1<span class=Attribute>$2</span>=');
These could probably be combined, but I'm not good enough yet and getting too complex would confuse me later:
HTML = HTML.replace(/&lt;([A-Za-z0-9]*)&gt;/g, '&lt;<span class=Element>$1</span>&gt;');
HTML = HTML.replace(/&lt;([A-Za-z0-9]*) /g, '&lt;<span class=Element>$1</span> ');
HTML = HTML.replace(/&lt;\/([A-Za-z0-9]*)&gt;/g, '&lt;\/<span class=Element>$1</span>&gt;');
HTML = HTML.replace(/(&lt;!.*&gt;)/gi, '<span class=Doctype>$1</span>');
HTML = HTML.replace(/&lt;!--(.*)--&gt;/g, '<span class=Comment>&lt;!--$1--&gt;</span>');
HTML = "<pre>" + HTML + "</pre>"
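One possible way to keep the attribute rule from firing in ordinary text is to match each whole opening tag first and only then look for attribute names inside the matched text, using a function as the second argument to replace. The sketch below assumes the string has already had its angle brackets turned into &lt; and &gt;; highlightAttributes and both pattern names are hypothetical. It will be confused by a > that appears inside an attribute value, because that character also becomes &gt; and ends the match early.

```javascript
// Hypothetical helper; expects text in which < and > are already &lt; and &gt;.
function highlightAttributes(escaped) {
  // One opening tag: "&lt;", a tag name, then as little as possible up to the next "&gt;".
  var tagPattern = /&lt;([A-Za-z][A-Za-z0-9]*)([\s\S]*?)&gt;/g;
  // One attribute: whitespace, a name, "=", then a quoted or unquoted value.
  // The value is consumed too, so a "name=" inside a quoted value is not matched again.
  var attrPattern = /(\s)([A-Za-z0-9:-]+)=("[^"]*"|'[^']*'|[^\s"']+)/g;

  return escaped.replace(tagPattern, function (whole, name, rest) {
    // Only the inside of the tag is searched for attributes.
    var marked = rest.replace(attrPattern, '$1<span class=Attribute>$2</span>=$3');
    return '&lt;' + name + marked + '&gt;';
  });
}

var sample = '&lt;p class="bodytext"&gt;3 + 2=5&lt;/p&gt;';
var highlighted = highlightAttributes(sample);
// highlighted is '&lt;p <span class=Attribute>class</span>="bodytext"&gt;3 + 2=5&lt;/p&gt;'
```

The "2=5" in the paragraph text is left alone because it is never handed to the attribute pattern.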
I have to give credit to a Great Resource for Regular Expressions [regular-expressions.info] I found.
/&lt;([A-Za-z0-9]+)((\s+[a-zA-Z0-9]+(?:=(?:[A-Za-z0-9]|"[^"]+")))+)/
->
&lt;$1<span class="Attributes">$2</span>
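A quick way to try the pattern above against already-escaped text is sketched here. The assumptions are that the alternation inside the value group is an ordinary |, that the tag opener is the entity &lt; and not a raw angle bracket, and that a g flag is wanted so every tag is processed; wrapAttributeRuns and attributeRun are hypothetical names.

```javascript
// Assumed reading of the suggested pattern: "|" for the alternation, "&lt;" as the tag opener, "g" added.
var attributeRun = /&lt;([A-Za-z0-9]+)((\s+[a-zA-Z0-9]+(?:=(?:[A-Za-z0-9]|"[^"]+")))+)/g;

// Hypothetical helper: wrap the whole run of attributes after a tag name in a single span.
function wrapAttributeRuns(escaped) {
  return escaped.replace(attributeRun, '&lt;$1<span class="Attributes">$2</span>');
}

var wrapped = wrapAttributeRuns('&lt;p class="bodytext"&gt;');
// wrapped is '&lt;p<span class="Attributes"> class="bodytext"</span>&gt;'
```

Because the pattern is anchored on the tag opener, text such as "3 + 2=5" in a paragraph is not touched. As written, though, attribute names may only contain letters and digits, so a tag whose first attribute is http-equiv is skipped and a run stops before xml:lang, and the unquoted-value branch matches a single character.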