I received a response which said:
"Slurp follows URLs from href anchors in published pages. These nonsense URLs are probably being generated from someone's casual PHP on the server."
Aside from the fact that I resent the suggestion that PHP is inherently a language for spammers, I doubt these links actually exist: neither Google nor AllTheWeb shows any backlinks to them, and there are no referers to them in my logs. So I replied, and they then responded:
"The junk URL requests from wm3024 were part of a task checking web servers to characterise their behavior for non-existent pages: do they send a 404 response, a 200 response with an error page, a redirect to an error page, ...? We have cut back greatly on the schedule for this task so you should be seeing far fewer such requests in future, but we do plan to re-check server behavior occasionally."
Does anyone else feel that this is unacceptable? I agree there's practically nothing I can do about it, but deliberately sending out junk requests is not on. Google doesn't need to do this, so why should Inktomi?
As far as I know, *all* search engine spiders must occasionally generate a "bad" URL in order "to characterise [server] behavior for non-existent pages: do they send a 404 response, a 200 response with an error page, a redirect to an error page."
There are a great many misconfigured servers on the Web, many of which will not generate a 404-Not Found response under *any* circumstances. This can be caused by a very common and simple mistake in an Apache ErrorDocument directive, or by script-based sites that *always* return a page and a 200-OK response, no matter what URL is requested. Some sites intentionally redirect all nonexistent pages to the home page or to a site map. There are many other examples, too.
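To illustrate the common Apache mistake: if the ErrorDocument directive is given a full URL instead of a local path, Apache issues a redirect to the error page rather than serving it with a 404 status (the paths here are invented for the example):

```apache
# Broken: a full URL makes Apache send a redirect to the error page,
# so clients (and spiders) never see a 404 status at all.
ErrorDocument 404 http://www.example.com/errors/notfound.html

# Correct: a local path lets Apache serve the same error page
# while still returning the proper 404 status code.
ErrorDocument 404 /errors/notfound.html
```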
But the point is that if a server does not return a 404-Not Found under any circumstances, then the URL-space of that server is practically infinite... There is no URL that can be requested that will not result in a 200-OK response, along with a page of content.
The search engine spiders need to characterize each domain they index to avoid these infinite URL spaces; otherwise, the spider could spend its entire lifetime trying to index a single such site.
If all sites and servers were perfectly configured, and no sites used dynamically-generated pages, this 404-testing would not be necessary. But with the Web the way it is, 404-testing is a necessary survival tool for search engine spiders.
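As a rough sketch of the kind of classification a spider has to make per host (the function name, labels, and logic here are my own guess at the idea, not anything Inktomi has published):

```python
def classify_404_behavior(status_code, was_redirected):
    """Rough classification of how a server answers a request
    for a page that does not exist."""
    if status_code == 404:
        return "hard-404"          # well-behaved: safe to crawl normally
    if was_redirected:
        return "redirects-errors"  # e.g. bounces unknown URLs to the home page
    if status_code == 200:
        return "soft-404"          # 200-OK plus content for *any* URL
    return "other"                 # 403, 500, custom behaviour, etc.

# A spider that sees "soft-404" knows the host's URL space is effectively
# infinite and must cap how many URLs it is willing to fetch from that host.
print(classify_404_behavior(404, False))  # hard-404
print(classify_404_behavior(200, True))   # redirects-errors
```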
All that said, search engine spiders should 404-test conservatively, and it wouldn't hurt to use an "obvious" initial URL like "/404verification/404test.htmx" for testing whenever possible, or even to use a special user-agent variant to do it... like the way that several of the search engines now uniquely identify their "fresh bots" and "daily-update bots". Webmasters are less likely to react negatively to spider behaviour that they are informed about and can understand.
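A self-identifying probe along those lines might look like this. The "/404verification/" prefix comes from the suggestion above; the random token, hostname, and user-agent string are invented for illustration:

```python
import random
import string

def make_obvious_probe(host):
    """Build a 404-test request whose purpose a webmaster reading
    the access log can recognise at a glance."""
    # Random token so the path is effectively guaranteed not to exist.
    token = "".join(random.choices(string.ascii_lowercase + string.digits, k=12))
    path = f"/404verification/404test-{token}.htmx"
    headers = {
        # A variant user-agent, analogous to the "fresh bot" naming some
        # engines already use, so the probe explains itself in the logs.
        "User-Agent": "ExampleCrawler/1.0 (404-behavior-check)",
    }
    return f"http://{host}{path}", headers

url, headers = make_obvious_probe("www.example.com")
print(url)
```

The point is not the exact strings but that both the path and the user-agent tell the webmaster *why* the request was made.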