DDOS by Doctype: W3C burdened with excessive DTD traffic

         

encyclo

1:42 am on Feb 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



From the W3C Systems Team: W3C's Excessive DTD Traffic [w3.org]
If you view the source code of a typical web page, you are likely to see something like this near the top:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">


and/or

<html xmlns="http://www.w3.org/1999/xhtml" ...>


These refer to HTML DTDs and namespace documents hosted on W3C's site.

Note that these are not hyperlinks; these URIs are used for identification. This is a machine-readable way to say "this is HTML". In particular, software does not usually need to fetch these resources, and certainly does not need to fetch the same one over and over! Yet we receive a surprisingly large number of requests for such resources: up to 130 million requests per day, with periods of sustained bandwidth usage of 350Mbps, for resources that haven't changed in years.

That's roughly 1,500 requests a second, with close to 100% of them being totally unnecessary. Note that normal web browsers such as IE or Firefox don't fetch the DTD, and if they did they would cache it.
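As a quick sanity check on that figure, here is the arithmetic in a few lines of Python. It assumes the 130 million requests are spread evenly over a 24-hour day, which the W3C post does not claim; real traffic would have peaks well above the average.

```python
# Peak daily request count quoted by the W3C Systems Team.
REQUESTS_PER_DAY = 130_000_000

# Assumption: requests are spread evenly across the day.
SECONDS_PER_DAY = 24 * 60 * 60

requests_per_second = REQUESTS_PER_DAY / SECONDS_PER_DAY
print(f"{requests_per_second:.0f} requests per second on average")
```

So the "1500 a second" figure is the daily peak averaged out, not a burst rate.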

I believe this is a good example of the effect that rogue crawlers are having on websites. A large number of crawlers grab pages and fetch everything which looks like a link, and they are not sophisticated enough to analyse the HTML and realize that the URIs in the doctype declaration and the xmlns attribute should be ignored.
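To illustrate the difference, here is a rough sketch in Python (standard library only) contrasting a naive "grab anything that looks like a URL" scan with a parser that only collects href and src attributes. The sample page, the function names and the choice of attributes are my own assumptions for the example, not how any particular crawler works.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page using the doctype and namespace quoted above.
PAGE = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<body><a href="http://example.com/page">a real link</a></body>
</html>'''


def naive_urls(text):
    """Return every string that looks like a URL, wherever it appears."""
    return re.findall(r'https?://[^\s"\'<>]+', text)


class LinkCollector(HTMLParser):
    """Collect only attribute values that are meant to be followed."""

    LINK_ATTRS = {"href", "src"}

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # xmlns is just another attribute here and is skipped;
        # the doctype never reaches this method at all.
        for name, value in attrs:
            if name in self.LINK_ATTRS and value:
                self.links.append(value)


def real_links(text):
    collector = LinkCollector()
    collector.feed(text)
    collector.close()
    return collector.links


print(naive_urls(PAGE))  # includes the DTD and namespace URIs
print(real_links(PAGE))  # only the anchor's href
```

The naive scan would hit w3.org twice for every page like this one; the parser-based version never touches it.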

So, as there are millions of documents out there which declare a "full" doctype, we are all contributing to a permanent distributed denial-of-service (DDOS) attack on the W3C!

The W3C are partially to blame: they placed the DTDs under their primary website (www.w3.org) instead of a dedicated subdomain, and they encouraged the use of their DTDs for all (X)HTML documents, despite the fact that the XML folks all know that DTDs Don’t Work on the Web [hsivonen.iki.fi]. Perhaps it's time for me to update the Doctype FAQ [webmasterworld.com] and suggest some of the (several) doctypes which preserve standards-compliance mode but don't include the DTD link...

Solution1

7:12 am on Feb 10, 2008 (gmt 0)

10+ Year Member Top Contributors Of The Month



The HTML 5 doctype would solve this problem:

<!DOCTYPE html>

JAB Creations

8:52 pm on Feb 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



HTML 5 is not acceptable, as its lack of a version passively declares it to be the final version of HTML, which frankly I find laughable and inexcusable. I will stay with XHTML 1.1 and, when the time comes that support exists for useful HTML 5 features, use them through the beauty of XML with the proper media type: application/xhtml+xml.

I agree with encyclo on the idea of a subdomain. Bots are just plain stupid and screw up access logs with endless 404s, which would make it difficult to find and repair legitimate 404s if it weren't for interesting counter-tactics and intelligent statistics scripts.

A good question: what browsers have a legitimate reason to fetch doctypes? I presume browsers that don't keep an internal copy?
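For what it's worth, keeping an internal copy doesn't take much. Below is a minimal Python sketch of a client that only goes to the network the first time it sees a given DTD URI. The class name, the fetch callback and the fake fetcher are all hypothetical; a real XML toolchain would more likely use a local catalog than an in-memory dictionary.

```python
from typing import Callable, Dict


class CachedDtdResolver:
    """Fetch each DTD at most once and serve repeats from memory."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch                   # hypothetical network fetcher
        self._cache: Dict[str, bytes] = {}
        self.network_requests = 0             # counts real fetches only

    def resolve(self, system_id: str) -> bytes:
        if system_id not in self._cache:
            self.network_requests += 1
            self._cache[system_id] = self._fetch(system_id)
        return self._cache[system_id]


def fake_fetch(uri: str) -> bytes:
    # Stand-in for an HTTP request so the example runs offline.
    return b"<!-- DTD body for " + uri.encode("ascii") + b" -->"


resolver = CachedDtdResolver(fake_fetch)
DTD_URI = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"

# Processing 1000 documents with the same doctype costs one fetch.
for _ in range(1000):
    resolver.resolve(DTD_URI)

print(resolver.network_requests)
```

Since these DTDs haven't changed in years, even a cache like this that never expires would be fine for them.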

- John

encyclo

9:29 pm on Feb 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



HTML5 versioning is a whole other debate ;) You're right that the HTML5 pseudo-doctype is a solution. However, it currently has the disadvantage that you can't validate against a draft specification, and validation helps fix many problems when developing a site.

Like I said, the traffic problem doesn't come from browsers, as they don't actually take any notice of the doctype as such apart from using it as a rendering-mode switch. However, many bots are simple downloaders which just parse for anything that looks like a link without analyzing the page content at all, and this is probably why they are requesting the DTDs for every page which includes them.

There is no real possibility of the W3C moving the DTDs. They are the biggest proponents of "Cool URIs don't Change" after all, and moving them would break many thousands of documents and applications.