jdMorgan - 3:43 pm on Oct 17, 2003 (gmt 0)
As far as I know, *all* search engine spiders must occasionally generate a "bad" URL in order "to characterise [server] behavior for non-existent pages: do they send a 404 response, a 200 response with an error page, a redirect to an error page."

There are a very large number of misconfigured servers on the Web, many of which will not generate a 404-Not Found response under *any* circumstances. This can be caused by a very common and simple mistake in an Apache ErrorDocument directive, or by script-based sites that *always* return a page and a 200-OK response, no matter what URL is requested. Some sites intentionally redirect all nonexistent pages to the home page or to a site map. There are many other examples, too.

But the point is that if a server does not return a 404-Not Found under any circumstances, then the URL-space of that server is practically infinite: every URL that can be requested results in a 200-OK response, along with a page of content. The search engine spiders need to characterize each domain they index to avoid these infinite URL spaces; otherwise, the spider could spend its entire lifetime trying to index a single such site.

If all sites and servers were perfectly configured, and no sites used dynamically-generated pages, this 404-testing would not be necessary. But with the Web the way it is, 404-testing is a necessary survival tool for search engine spiders.

All that said, search engine spiders should 404-test conservatively, and it wouldn't hurt to use an "obvious" initial URL like "/404verification/404test.htmx" for testing whenever possible, or even to use a special user-agent variant to do it, the way that several of the search engines now uniquely identify their "fresh bots" and "daily-update bots". Webmasters are less likely to react negatively to spider behaviour that they are informed about and can understand.

Jim
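For anyone who wants to see roughly what such a 404-test looks like, here is a minimal sketch in Python. It is not how any particular search engine does it; the function names, the "ExampleBot-404check" user-agent string, and the random-token naming scheme are all made up for illustration. Only the "/404verification/" path idea comes from the post above. Treating 410-Gone the same as 404 is also an assumption.

```python
import http.client
import uuid
from urllib.parse import urlsplit


def classify_probe(status):
    """Classify the response to a deliberately nonexistent URL."""
    if status in (404, 410):
        return "hard-404"   # server reports missing pages honestly
    if 300 <= status < 400:
        return "redirect"   # missing pages are redirected elsewhere
    if 200 <= status < 300:
        return "soft-404"   # a page and 200-OK: URL space is effectively infinite
    return "unknown"        # server errors, auth challenges, and so on


def probe_path():
    """Build an "obvious" test URL that should not exist on any real site."""
    # Hypothetical scheme: the path from the post plus a random token.
    return "/404verification/404test-%s.htmx" % uuid.uuid4().hex


def probe_site(base_url, user_agent="ExampleBot-404check/0.1"):
    """Request one bad URL from base_url and classify how the server answers."""
    parts = urlsplit(base_url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        # http.client does not follow redirects, so a 3xx status is seen as-is.
        conn.request("GET", probe_path(), headers={"User-Agent": user_agent})
        return classify_probe(conn.getresponse().status)
    finally:
        conn.close()
```

A crawler built along these lines might call probe_site("http://www.example.com") once per domain and treat anything other than "hard-404" as a warning that missing pages cannot be detected from the status code alone.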
mvl22,