Inktomi spider slurp went nuts - grabbing robots.txt over and over

slurp from inktomi can't understand a 404 robots.txt

         

jeremy goodrich

5:57 pm on Sep 13, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Seems they are having trouble writing their spiders these days, too.

Inktomi, any comment on why this is happening? Or does the spider not understand a 404? :)

Though just so I understand, could this be caused by the web host, or is it strictly a spidering issue?

Excerpt from my log file:
66.196.65.12 - - [11/Sep/2002:06:52:44 -0500] "GET /robots.txt HTTP/1.0" 404 133 "-" "Mozilla/5.0 (Slurp/si; slurp@inktomi.com; [inktomi.com...]
66.196.65.14 - - [11/Sep/2002:07:03:22 -0500] "GET /robots.txt HTTP/1.0" 404 133 "-" "Mozilla/5.0 (Slurp/si; slurp@inktomi.com; [inktomi.com...]
66.196.65.11 - - [11/Sep/2002:07:05:06 -0500] "GET /robots.txt HTTP/1.0" 404 133 "-" "Mozilla/5.0 (Slurp/si; slurp@inktomi.com; [inktomi.com...]
66.196.65.28 - - [11/Sep/2002:07:07:56 -0500] "GET /robots.txt HTTP/1.0" 404 133 "-" "Mozilla/5.0 (Slurp/si; slurp@inktomi.com; [inktomi.com...]
66.196.65.24 - - [11/Sep/2002:07:10:28 -0500] "GET /robots.txt HTTP/1.0" 404 133 "-" "Mozilla/5.0 (Slurp/si; slurp@inktomi.com; [inktomi.com...]
66.196.65.22 - - [11/Sep/2002:07:11:01 -0500] "GET /robots.txt HTTP/1.0" 404 133 "-" "Mozilla/5.0 (Slurp/si; slurp@inktomi.com; [inktomi.com...]
66.196.65.13 - - [11/Sep/2002:07:13:51 -0500] "GET /robots.txt HTTP/1.0" 404 133 "-" "Mozilla/5.0 (Slurp/si; slurp@inktomi.com; [inktomi.com...]
66.196.65.23 - - [11/Sep/2002:07:16:19 -0500] "GET /robots.txt HTTP/1.0" 404 133 "-" "Mozilla/5.0 (Slurp/si; slurp@inktomi.com; [inktomi.com...]
66.196.65.21 - - [11/Sep/2002:07:18:07 -0500] "GET /robots.txt HTTP/1.0" 404 133 "-" "Mozilla/5.0 (Slurp/si; slurp@inktomi.com; [inktomi.com...]
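
For anyone wanting to check their own logs for the same pattern, here is a rough Python sketch that tallies /robots.txt requests from one crawler by status code and protocol. It assumes the same Apache-style log layout as the excerpt above; the names `LOG_LINE` and `tally_robots_hits` are made up for illustration, and the regex tolerates a user-agent field that is cut off, as it is in the pasted lines.

```python
import re
from collections import Counter

# Assumed layout: Apache common/combined log format, as in the excerpt above.
# The trailing user-agent field is matched loosely because it may be truncated.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>.*)$'
)


def tally_robots_hits(lines, agent_substring="Slurp"):
    """Count /robots.txt requests per (status, protocol) for one crawler."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed format
        if match["path"] != "/robots.txt":
            continue
        if agent_substring not in match["agent"]:
            continue
        counts[(int(match["status"]), match["proto"])] += 1
    return counts
```

Feeding it the lines of an access log (for example `tally_robots_hits(open("access.log"))`, where the file name is only a placeholder) shows at a glance whether the crawler keeps getting 404s and which HTTP version it is using.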

Key_Master

10:03 pm on Sep 13, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This is what I get when I hit your robots.txt using HTTP/1.0:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML><HEAD>
<TITLE>302 Found</TITLE>
</HEAD><BODY>
<H1>Found</H1>
The document has moved <A HREF="http://www.genesis(.*?).com/www.domain-in-your-profile.com/robots.txt">here</A>.<P>
<HR>
<ADDRESS>Apache/1.3.22 Server at www.genesis(.*?).com Port 80</ADDRESS>
</BODY></HTML>

I get the correct robots.txt when I use HTTP/1.1.
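
A minimal Python sketch of how one might repeat that comparison. The relevant difference is that HTTP/1.0 does not require a Host header while HTTP/1.1 does, so a server hosting several sites on one IP address may not know which site an HTTP/1.0 request without Host is meant for. The redirect above is consistent with that, but nothing in this thread confirms it is the cause. The function names and `www.example.com` are placeholders, not anything from the original posts.

```python
import socket


def build_request(path, version, host=None):
    """Build a minimal GET request; the Host header is sent only if given."""
    lines = [f"GET {path} HTTP/{version}"]
    if host is not None:
        lines.append(f"Host: {host}")
    lines.append("Connection: close")
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def fetch_status_line(server, request, port=80, timeout=10):
    """Send raw request bytes and return the first line of the response."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(request)
        data = b""
        # Read until the status line is complete or the server closes.
        while b"\r\n" not in data:
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.split(b"\r\n", 1)[0].decode("latin-1")


# Example comparison (placeholder host name, not run automatically):
#   old = fetch_status_line("www.example.com",
#                           build_request("/robots.txt", "1.0"))
#   new = fetch_status_line("www.example.com",
#                           build_request("/robots.txt", "1.1",
#                                         host="www.example.com"))
```

If the two status lines differ, the server is treating requests with and without a Host header differently, which would point at the hosting setup rather than the spider.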

jeremy goodrich

10:13 pm on Sep 13, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



he he he...

not the site in my profile! too funny, Key_Master, way too funny.

the site in my profile is only there for giggles - the site in question was an actual SEO site which gets traffic other than the occasional webmasterworld user :)

Key_Master

10:15 pm on Sep 13, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



the site in my profile is only there for giggles - the site in question was an actual SEO site which gets traffic other than the occasional webmasterworld user

You need to post more info if you want a more realistic answer. :)

jdMorgan

3:03 am on Sep 14, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



jeremy,

I don't think Slurp understands 404s... It's been trying to request a file on one of my sites that has been gone for more than a year. My server responds with 404-Not Found. A few hours, days, or weeks later, Slurp comes back and tries again. While I appreciate that they don't drop files instantly, allowing for outages and occasional webmaster errors, I figure after a week, maybe they should drop it.

And before anyone pounces on me, there are no links to this page on the web anywhere that I have been able to find. It always was linked only from another deep page of my site, and has always had at least a <meta robots noindex> on it.

I've thought about trying a 410-Gone to break the logjam...

Jim
<edited for typo>
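
For illustration, a small Python sketch of the 410 idea: answer 410 Gone for pages that were removed on purpose and plain 404 for everything else that is missing. This only shows the status-code logic. `GONE_PATHS`, `status_for` and `GoneAwareHandler` are hypothetical names, the paths are invented, and whether Slurp actually treats a 410 differently from a 404 is not established anywhere in this thread.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical list of pages that were deliberately removed for good.
GONE_PATHS = {"/old-page.html"}


def status_for(path, existing_paths):
    """Pick 410 for deliberately removed pages, 404 for unknown ones."""
    if path in GONE_PATHS:
        return 410  # Gone: removed on purpose, not expected to return
    if path in existing_paths:
        return 200
    return 404  # Not Found: says nothing about whether it is permanent


class GoneAwareHandler(BaseHTTPRequestHandler):
    """Toy handler that only reports the chosen status code."""

    existing_paths = {"/", "/index.html"}  # placeholder site contents

    def do_GET(self):
        code = status_for(self.path, self.existing_paths)
        body = f"{code}\n".encode("ascii")
        self.send_response(code)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, something like `HTTPServer(("", 8000), GoneAwareHandler).serve_forever()` (with `HTTPServer` imported from `http.server`) should be enough. On a production server the same effect would normally be configured in the web server itself rather than in a script.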

gsbread

11:15 am on Sep 16, 2002 (gmt 0)



Hi,
Inktomi has dropped almost all of about 3500 pages in one domain.
It has been crawled regularly for 4 years by Slurp and now it is not. Funny thing is, I just found 3 URLs of the domain (not the home page) in its index but no others. It has also rendered my Looksmart listings less relevant in MSN, where they have dropped from position one to position 45.
Should I care? Is MSN going to go with Google or Fast, or develop its own, as back in 96-99?
Any relevant comments would be appreciated. I have never paid for inclusion, and from what I have read in this forum, people who have paid are having problems with their listings in Inktomi's index as well. I also understand Inktomi may be having general indexing trouble, and this may be the cause.
Other than that, I know absolutely nothing.
Thank you for your responses.
GB