If you view the source code of a typical web page, you are likely to see something like this near the top:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
and/or
<html xmlns="http://www.w3.org/1999/xhtml" ...>
These refer to HTML DTDs and namespace documents hosted on W3C's site.
Note that these are not hyperlinks; these URIs are used for identification. This is a machine-readable way to say "this is HTML". In particular, software does not usually need to fetch these resources, and certainly does not need to fetch the same one over and over! Yet we receive a surprisingly large number of requests for such resources: up to 130 million requests per day, with periods of sustained bandwidth usage of 350 Mbps, for resources that haven't changed in years.
That's roughly 1,500 requests a second, with close to 100% of them totally unnecessary. Note that normal web browsers such as IE or Firefox don't fetch the DTD at all, and even if they did, they would cache it.
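For what it's worth, any decent XML toolchain can avoid these fetches entirely by resolving the DTDs locally. Here's a minimal sketch using Python's lxml, assuming you keep local copies of the W3C DTDs on disk (the /usr/local/share/xhtml1/ directory is made up for the example):

from lxml import etree

class LocalDTDResolver(etree.Resolver):
    # Serve W3C DTDs from local copies instead of hammering www.w3.org.
    def resolve(self, url, id, context):
        if url and url.startswith("http://www.w3.org/TR/xhtml1/DTD/"):
            filename = "/usr/local/share/xhtml1/" + url.rsplit("/", 1)[1]
            return self.resolve_filename(filename, context)
        return None  # anything else falls through to default resolution

parser = etree.XMLParser(load_dtd=True, no_network=True)  # never hit the network
parser.resolvers.add(LocalDTDResolver())
doc = etree.parse("page.xhtml", parser)

The same idea goes by the name "XML catalogs" in most other toolchains.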
I believe this is a good example of the effect that rogue crawlers are having on websites. There are a large number of crawlers which grab pages and fetch everything that looks like a link, and they are not sophisticated enough to analyse the HTML and realize that the URIs in the DOCTYPE and xmlns declarations should be ignored. So, as there are millions of documents out there which declare a "full" doctype, we are all contributing to a permanent distributed denial-of-service (DDoS) attack on the W3C!
The W3C are partially to blame: they placed the DTDs under their primary website (www.w3.org) instead of a dedicated subdomain, and they encouraged the use of their DTDs for all (X)HTML documents, despite the fact that the XML folks all know that DTDs Don't Work on the Web [hsivonen.iki.fi]. Perhaps it's time for me to update the Doctype FAQ [webmasterworld.com] and suggest some of the (several) doctypes which preserve standards-compliance mode but don't include the DTD link...
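For instance, if memory serves, both of these trigger standards mode in the major browsers without offering any DTD URI for a bot to chase:

<!DOCTYPE html>

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">

(The first is the short HTML5-style doctype; the second is HTML 4.01 Strict with the system identifier, i.e. the DTD URL, dropped.)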
I agree with encyclo about the subdomain idea. Bots are just plain stupid: they fill access logs with endless 404s, and it would be difficult to find and repair legitimate 404s if it weren't for clever counter-tactics and intelligent statistics scripts.
A good question: what browsers have a legitimate reason to fetch doctypes? I presume browsers that don't keep an internal copy?
- John
Like I said, the traffic problem doesn't come from browsers: they don't actually take any notice of the doctype apart from using it as a rendering-mode switch. However, many bots are simple downloaders which just parse for anything that looks like a link without analyzing the page content at all, and this is probably why they are requesting the DTDs for every page which includes them.
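To illustrate the difference: a crawler only needs to follow real hyperlinks, not every URI-shaped string in the file. A rough sketch of the distinction using Python's standard html.parser (just an illustration of the principle, not any particular bot):

from html.parser import HTMLParser

class HrefExtractor(HTMLParser):
    # Collect only genuine hyperlinks; DOCTYPE and xmlns URIs never match.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

extractor = HrefExtractor()
extractor.feed('<html xmlns="http://www.w3.org/1999/xhtml">'
               '<body><a href="/page2">next</a></body></html>')
print(extractor.links)  # ['/page2'] -- the xmlns URI is ignored

A naive regex scan for http://... strings, by contrast, would happily pick up the DTD and namespace URIs and fetch them for every single page.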
There is no real possibility of the W3C moving the DTDs: they are the biggest proponents of "Cool URIs don't change", after all, and moving them would break many thousands of documents and applications.