Sitemaps, Meta Data, and robots.txt Forum

    
Robots.txt Question?
How often does Ink check for the file?
MarkHutch · msg:1527148 · 12:05 am on Mar 10, 2002 (gmt 0)

I noticed a few weeks ago that Inktomi was trying to spider /_vti_cnf/ directories within one of our domains.

We have never used FrontPage extensions, but a previous web host installed them on their server and somehow enabled them for everyone's account. At that time we had open sub-directories (no index page), and I assume Inktomi found the FrontPage directories by spidering those open directories. When that host ran a backup, they backed up ALL of our current files and added all of our files and sub-directories to their FrontPage extensions. This created a spidering nightmare...

About one week ago, I wrote a robots.txt file asking ALL robots to quit trying to find files in those directories. All search engines have stopped trying to spider pages within these FrontPage directories except Inktomi.
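For reference, a robots.txt along these lines would block the directories described above. This is a sketch only: the /_vti_cnf/ path comes from the post, while the other _vti_ directories are assumed from a typical FrontPage extensions layout, not from the poster's actual file:

```
# Sketch only -- /_vti_cnf/ is from the post; the others are
# typical FrontPage extension directories and are assumptions.
User-agent: *
Disallow: /_vti_cnf/
Disallow: /_vti_bin/
Disallow: /_vti_pvt/
```

Robots.txt Disallow rules are prefix matches, so each line above also covers everything beneath that directory.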

I went through my server logs, and it appears that each IP Inktomi uses gets its own copy of a web site's robots.txt file. Several of their IPs have requested and received the robots.txt file and have stopped trying to spider those directories. However, some Inktomi IPs haven't checked for my robots.txt file in several days, and they continue to try to spider those non-existent directories.
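One way to confirm that per-IP behavior is to pull every request for /robots.txt out of the access log and group it by client IP. A minimal Python sketch, assuming the Apache common/combined log format; the sample lines and IP addresses below are made up for illustration:

```python
import re
from collections import defaultdict

# Matches the start of an Apache common/combined log line:
# ip ident user [timestamp] "METHOD path ...
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)')

def robots_fetches_by_ip(lines):
    """Return {ip: [timestamps]} for every request for /robots.txt."""
    fetches = defaultdict(list)
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue
        ip, ts, method, path = m.groups()
        if path == "/robots.txt":
            fetches[ip].append(ts)
    return dict(fetches)

# Fabricated sample lines for illustration only.
sample = [
    '66.196.90.1 - - [10/Mar/2002:00:05:12 +0000] "GET /robots.txt HTTP/1.0" 200 120',
    '66.196.90.2 - - [10/Mar/2002:00:06:02 +0000] "GET /_vti_cnf/page.htm HTTP/1.0" 404 290',
]
print(robots_fetches_by_ip(sample))
# → {'66.196.90.1': ['10/Mar/2002:00:05:12 +0000']}
```

Crawler IPs that never show up in the result but do appear elsewhere in the log are the ones spidering without refreshing the rules.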

Does anyone out there know the default number of days an Inktomi IP waits before it requests another copy of a robots.txt file? It seems Google, AltaVista, Ask Jeeves and most of the others check for the robots.txt file at least once per day, or before they start spidering. However, Inktomi doesn't seem to follow that pattern. Anyone with useful information, please post a reply and let me know Inktomi's schedule for updating its robots.txt information. Thank you...

(edited by: MarkHutch at 6:54 pm (utc) on Mar. 26, 2002)

 

Brett_Tabke · msg:1527149 · 1:13 pm on Mar 11, 2002 (gmt 0)

It usually takes 7 to 31 days for Ink to recognize a robots.txt update. Why it takes that long isn't known (it sure shouldn't).

cfel2000 · msg:1527150 · 1:15 pm on Mar 11, 2002 (gmt 0)

Like all search engines, it checks for a new robots.txt file when it recrawls your site. Otherwise it uses the last recorded one.

MarkHutch · msg:1527151 · 4:54 pm on Mar 11, 2002 (gmt 0)

Thanks for the replies. It sure seems like they are spending a lot of time and bandwidth trying to spider pages that aren't there anymore. However, maybe they have their reasons for not checking more often for a robots.txt file. Maybe most sites don't use one, and they just don't want to waste time checking too often, or something like that...

cfel2000 · msg:1527152 · 4:57 pm on Mar 11, 2002 (gmt 0)

What do you mean by 'checking for pages which are no longer there'? A robots.txt file doesn't contain a list of your site's pages. It just tells the spider which pages you don't want crawled.
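That exclusion behavior is easy to demonstrate with Python's standard urllib.robotparser, a later-era sketch of how a well-behaved spider applies the rules. The /_vti_cnf/ rule is the one from this thread, "Slurp" is Inktomi's crawler name, and example.com stands in for the poster's domain:

```python
from urllib import robotparser

# Feed the rules to the parser directly instead of fetching a live
# robots.txt, so the example is self-contained.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /_vti_cnf/",
])

# A compliant crawler asks can_fetch() before requesting each URL.
print(rp.can_fetch("Slurp", "http://example.com/_vti_cnf/page.htm"))  # False
print(rp.can_fetch("Slurp", "http://example.com/index.html"))         # True
```

The file never lists what exists on the site; the crawler only learns which URL prefixes are off limits.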

MarkHutch · msg:1527153 · 5:11 pm on Mar 11, 2002 (gmt 0)

If you'll read my original post, you'll see that I added directories that no longer exist on my server to the robots.txt file. Why waste the spider's time trying to re-crawl pages and only get 404 errors on them? It's working great for all search engines except Inktomi. Thanks for the reply...

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved