Forum Moderators: open
Is it possible for a browser to interpret non-English tags? For example, suppose the German word for "script" were "foo": would <foo src="badthings.js"></foo> be able to load? I've visited a few foreign websites and all the tags appear to be in English, but I don't want an oversight on my part to allow for insecurity.
On that topic: other than event attributes on other tags (e.g. <a onClick="javascript:foo()">blah</a>), are there any tags that allow JavaScript to be loaded? I've tried <embed> and <object>, and neither of those seems to work.
Thanks
The only safe way to accept user input is to sanitize it by HTML-encoding it. That means changing every < into &lt; and every & into &amp;, etc.
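A minimal sketch of that encoding step, in Python purely for illustration (any server-side language has an equivalent):

```python
import html

def sanitize(user_input):
    # Turn &, <, >, and quotes into HTML entities so the browser
    # renders the input as text instead of parsing it as markup.
    return html.escape(user_input, quote=True)

encoded = sanitize('<script src="badthings.js"></script>')
# encoded == '&lt;script src=&quot;badthings.js&quot;&gt;&lt;/script&gt;'
```

Once there are no raw angle brackets left, nothing the user typed can be parsed as a tag, so nothing executes.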
That's why so many Web 2.0 apps use alternative markup languages like BBCode or wiki syntax. A hacker can't get a page to execute a "[script]" no matter how hard they try. Or you can invent your own markup language, the way Facebook did with its <fb:> tags, which serve essentially the same purpose.
Then not only are you protected against "<script>hack()</script>", you are also protected against "<img src='.' onerror='hack()' />" and "<div onmouseover='hack()'>" and "<span style='background:url(javascript:hack())'>" and a hundred other variations which someone can use to execute scripts.
Rather than trying to blacklist specific known XSS patterns, sanitize the input completely and whitelist the markup you do want to support. You can, for instance, go through your content after HTML-encoding and turn any &lt;b&gt; back into a <b>.
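As a sketch of that encode-then-whitelist idea (Python here only for illustration; the tag list is my own example, not a recommendation):

```python
import html

SAFE_TAGS = ('b', 'i', 'em', 'strong')  # example whitelist; choose your own

def sanitize_with_whitelist(text):
    # Step 1: HTML-encode everything, so no raw markup survives.
    encoded = html.escape(text, quote=True)
    # Step 2: turn the handful of harmless tags back into real markup.
    # Exact-match replacement means '<b onclick=...>' stays encoded,
    # because only the bare '&lt;b&gt;' form is restored.
    for tag in SAFE_TAGS:
        encoded = encoded.replace(f'&lt;{tag}&gt;', f'<{tag}>')
        encoded = encoded.replace(f'&lt;/{tag}&gt;', f'</{tag}>')
    return encoded
```

Because the replacement only matches the bare tag, anything carrying attributes (like <b onmouseover=...>) remains harmless encoded text.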
That doesn't answer my question at all, though.
And on another note, thinking that you can't remove JavaScript with regular expressions is a bit naive. There are many products available that do just that. It might not be as easy as matching <script[^>]*>, but it's definitely feasible. If you haven't looked into it before, check out the browser DOM: there are indeed many circumstances that allow for JavaScript, but it's a list that one developer who knows what they're doing could exhaust in a day. There is a fine line to walk when developing web applications: the line between giving your clients freedom and allowing for exploitation. Your solution puts the client squarely on one side of that line, but it also eliminates many of the benefits of the other.
Also, I'm interested in this line: <span style='background:url(javascript:hack())'>. How does this work? As far as I can tell, it does nothing. Has anyone else had any luck with this?
I've visited a few foreign websites and all the tags appear to be in English
If you see a document that contains non-HTML elements (like <foo>), you're likely viewing XHTML, a variant of XML that allows a developer to include arbitrarily named elements, or elements belonging to other namespaces.
Is it possible for a browser to interpret non-English tags
Absolutely. XHTML is accepted by most browsers, and some will allow JavaScript events on any element, even if the document is not declared as XHTML.
For example, paste this into a new document:
<html>
<head>
<style>
zingpop{display:block;}
</style>
</head>
<body id="body">
<zingpop onclick="alert('hello!')">test</zingpop>
</body>
</html>
In the example above, the onclick event of the <zingpop> element fired in Firefox, but not in IE7. Note that I didn't declare the document as XHTML with a DOCTYPE definition.
<span style='background:url(javascript:hack())'>
That technique was exploited by the famous Samy worm. It doesn't work in all browsers; when Samy brought MySpace to its knees (October 2005), it was shown to work in IE and Safari.
thinking that you can't remove javascript with regular expressions is a bit naive
A fair comment. Blacklisting all known exploitation techniques is possible. But what about the unknown ones? It will take you many hundreds of lines of code, and you'll never be sure whether the next version of IE, Opera, Firefox, or Konqueror will enable some wacky thing you've never heard of. Whitelisting known patterns after HTML-encoding is like pasteurization: proven, safe, easy, and future-proof.
I don't really think XHTML is a problem for regular expressions, because it has to be formatted in a way the browser can understand, and is therefore formatted in a way regular expressions can find; it also still uses a standard set of events.
For example: if you searched only for <a onclick="...">, <span onclick="...">, and <div onclick="...">, you'd miss <foo onclick="...">. But if you write a good regex that starts at a < and reads until the closing >, you know you've found a tag; then parse out the onclick and you're set. Each browser supports a limited number of events on any given element, so parse all of those and you should be covered. If browsers allow script to be executed in things like styles, that's a problem I wasn't aware of, though I'm not 100% sure it's within my place to police. I remember that a few years ago, entering a certain character sequence would cause IE to crash; there's no way a designer can be held responsible for a user doing something like that.
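A rough sketch of that tag-scanning approach (Python, purely illustrative; the regexes here are my own and deliberately simple):

```python
import re

# Match anything that looks like a tag: from '<' up to the next '>'.
TAG_RE = re.compile(r'<[^>]*>')
# Inside a tag, match on* event attributes (onclick, onmouseover, ...),
# whether the value is double-quoted, single-quoted, or bare.
EVENT_ATTR_RE = re.compile(
    r'\son\w+\s*=\s*("[^"]*"|\'[^\']*\'|[^\s>]+)', re.IGNORECASE)

def strip_event_handlers(fragment):
    # Within each tag-like span, delete every on* attribute.
    def clean_tag(match):
        return EVENT_ATTR_RE.sub('', match.group(0))
    return TAG_RE.sub(clean_tag, fragment)
```

Because it never needs the tag name, this catches <foo onclick=...> just as well as <div onclick=...>. The usual counterargument is the edge cases: attribute values containing >, malformed markup that browsers "helpfully" repair, and non-event vectors like style attributes all slip past a scanner this simple.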
One issue XHTML might raise for me is this: say I wanted to make a tag that behaved exactly like an <object> tag; is there a way to do that with XHTML definitions? If so, it would make it a touch harder to outlaw those pesky ones.
Thanks again for answering my question, I suspected that was the case, but I wasn't sure.
Lastly, I work for a large company and am not part of the decision-making process. So even if I don't agree that we should allow users to enter HTML, it doesn't really matter what I think.