Forum Library, Charter, Moderators: Brett Tabke

Paid Inclusion Engines and Topics Forum

Inktomi crawling deliberately requesting non-existent URLs
Does anyone else feel this is unacceptable?
mvl22




msg:27663
 1:22 pm on Oct 17, 2003 (gmt 0)

I got a pile of 404s from Inktomi last week. The requests basically took snippets of other URLs and mixed them with mine, e.g.:

/matrices/sprilib/malawi_profile.htm
/perb/france/handk_weightlifter.htm
/household_charges/spriant/reklam_ver.htm
/mothers-day-covered-box/rfn.htm
/just-married.htm
/opie-login_0.9.9-20030720_arm/rfn.htm
/basics/contacts/just-tell-santa.htm

I asked Inktomi about this and received a response which said:

"Slurp follows URLs from href anchors in published pages. These nonsense URLs are probably being generated from someone's casual PHP on the server."

Aside from the fact that I resent the suggestion that PHP is inherently a language for spammers, I doubt these links actually exist: there are no backlinks to them on Google or AllTheWeb, and no instances of referers to them. So I replied, and they then responded:

"The junk URL requests from wm3024 were part of a task checking web servers to characterise their behavior for non-existent pages: do they send a 404 response, a 200 response with an error page, a redirect to an error page, ...? We have cut back greatly on the schedule for this task so you should be seeing far fewer such requests in future, but we do plan to re-check server behavior occasionally."

Does anyone feel that this is not acceptable? I agree there's practically nothing I can do about it, but sending out junk deliberately is not on. Google doesn't need to do this, so why should Inktomi?

 

Bobby_Davro




msg:27664
 1:37 pm on Oct 17, 2003 (gmt 0)

I'd settle for any kind of spidering from Inktomi. Count yourself lucky that it even bothers to turn up. I watch numerous sites and none of them are spidered properly (just a couple of pages each).

jdMorgan




msg:27665
 3:43 pm on Oct 17, 2003 (gmt 0)

mvl22,

As far as I know, *all* search engine spiders must occasionally generate a "bad" URL in order "to characterise [server] behavior for non-existent pages: do they send a 404 response, a 200 response with an error page, a redirect to an error page."

There is a very large number of misconfigured servers on the Web, many of which will not generate a 404-Not Found response under *any* circumstances. This can be caused by a very common and simple mistake in an Apache ErrorDocument directive (for example, pointing it at a full URL, which makes Apache send a redirect instead of the 404 status), or by script-based sites that *always* return a page and a 200-OK response, no matter what URL is requested. Some sites intentionally redirect all nonexistent pages to the home page or to a site map. There are many other examples, too.

But the point is that if a server does not return a 404-Not Found under any circumstances, then the URL-space of that server is practically infinite... There is no URL that can be requested that will not result in a 200-OK response, along with a page of content.
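To make the three cases concrete, here is a minimal Python sketch of how a response to a known-bad URL might be sorted into them. The function name and the labels ("hard-404", "soft-404", "redirect") are my own, chosen for illustration; nothing here is how Slurp or any other spider actually does it.

```python
from typing import Optional


def classify_missing_page_response(status: int, location: Optional[str] = None) -> str:
    """Classify how a server answered a request for a URL known not to exist.

    The labels are hypothetical names for the cases discussed above:
      "hard-404"  -> the server correctly reports the page as missing
      "redirect"  -> the server redirects (e.g. to an error page or the home page)
      "soft-404"  -> the server returns a success status with some page of content
      "other"     -> anything else (server errors, auth challenges, ...)
    """
    if status in (404, 410):
        return "hard-404"
    if 300 <= status < 400 and location:
        return "redirect"
    if 200 <= status < 300:
        return "soft-404"
    return "other"
```

Only the "hard-404" answer tells the spider that the URL-space has an edge; the other answers leave it guessing.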

The search engine spiders need to characterize each domain they index to avoid these infinite URL spaces; otherwise, the spider could spend its entire lifetime trying to index a single such site.
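As a rough illustration of that characterization step, a spider could request a few made-up paths and see whether any of them is ever reported as missing. This is only a sketch under my own assumptions: `fetch_status` is a hypothetical callable supplied by the caller (path in, HTTP status out), and the choice of three probes is arbitrary.

```python
import uuid
from typing import Callable


def random_probe_path() -> str:
    """Build a path that is vanishingly unlikely to exist on any real site."""
    return f"/{uuid.uuid4().hex}/{uuid.uuid4().hex}.htm"


def has_unbounded_url_space(fetch_status: Callable[[str], int], probes: int = 3) -> bool:
    """Return True if every made-up path is answered with a 2xx or 3xx status.

    fetch_status is a hypothetical helper: it takes a path and returns the
    HTTP status code the server sent for it (without following redirects).
    """
    return all(200 <= fetch_status(random_probe_path()) < 400 for _ in range(probes))
```

A host for which this returns True would need to be crawled with some other limit in place, since the status codes alone will never say "stop".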

If all sites and servers were perfectly configured, and no sites used dynamically-generated pages, this 404-testing would not be necessary. But with the Web the way it is, 404-testing is a necessary survival tool for search engine spiders.

All that said, search engine spiders should 404-test conservatively, and it wouldn't hurt to use an "obvious" initial URL like "/404verification/404test.htmx" for testing whenever possible, or even to use a special user-agent variant to do it, much as several of the search engines now uniquely identify their "fresh bots" and "daily-update bots". Webmasters are less likely to react negatively to spider behaviour that they are informed about and can understand.
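For what it's worth, a self-identifying probe of that kind is easy to build. The sketch below uses Python's standard urllib; the user-agent string is a made-up placeholder, not the name of any real spider, and the request is only constructed here, not sent.

```python
import urllib.request

# Path suggested above; a webmaster reading the logs can see what it is for.
PROBE_PATH = "/404verification/404test.htmx"

# Hypothetical user-agent variant, invented for this example.
PROBE_USER_AGENT = "ExampleBot-404check/0.1"


def build_probe_request(host: str) -> urllib.request.Request:
    """Build (but do not send) a clearly labelled 404-test request for a host."""
    return urllib.request.Request(
        f"http://{host}{PROBE_PATH}",
        headers={"User-Agent": PROBE_USER_AGENT},
    )
```

Sending it would be a matter of passing the result to `urllib.request.urlopen` and handling the `HTTPError` that a proper 404 raises.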

Jim

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved