Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
Site friendly spiders : the list
Brett_Tabke
msg:401723
5:41 pm on Jun 12, 2001 (gmt 0)

Friendly Spider:
Obeys robots.txt.
Does not request more than one page per minute.
Visits in low-traffic hours or reduces requests during peak traffic hours.
Has info on its site about the spider.
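A crawler that honors the first rule can be steered with a plain robots.txt at the site root. A minimal sketch — the "FriendlyBot" token is hypothetical, and Crawl-delay is a nonstandard extension that only some crawlers honor:

```
# Hypothetical example - replace FriendlyBot with the spider's real user-agent token.
User-agent: FriendlyBot
Crawl-delay: 60        # nonstandard extension; one request per minute, where supported
Disallow: /cgi-bin/

User-agent: *
Disallow:
```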

The List
Slurp : Inktomi.com
GoogleBot : google.com
Scooter : altavista.com
DirectHit : directhit.com
Fast : alltheweb.com
teoma : teoma.com
ArchitextSpider : excite.com
Gulliver : northernlight.com
T-Rex : Lycos.com

 

littleman
msg:401724
6:22 pm on Jun 12, 2001 (gmt 0)

I'd take Slurp out, especially the Japanese-based bots. They will rip through a CNAME-style subdomain layout very aggressively. Here is an example. All page requests look like this:
category1.domain.com
category2.domain.com

The bot is 202.212.5.32 -> goo311.inktomi.com
The requests come in like this:
at 1:59:58 PM on Monday, June 9, 2001
at 2:00:00 PM on Monday, June 9, 2001
at 2:00:01 PM on Monday, June 9, 2001
at 2:00:02 PM on Monday, June 9, 2001
at 2:00:02 PM on Monday, June 9, 2001
at 2:00:03 PM on Monday, June 9, 2001
at 2:00:04 PM on Monday, June 9, 2001
at 2:00:06 PM on Monday, June 9, 2001
at 2:00:06 PM on Monday, June 9, 2001
at 2:00:08 PM on Monday, June 9, 2001
at 2:00:09 PM on Monday, June 9, 2001
at 2:00:10 PM on Monday, June 9, 2001
at 2:00:11 PM on Monday, June 9, 2001
at 2:00:12 PM on Monday, June 9, 2001
at 2:00:14 PM on Monday, June 9, 2001
at 2:00:15 PM on Monday, June 9, 2001
at 2:00:16 PM on Monday, June 9, 2001
and on, and on...

Adding up to tens of thousands of requests per day per server.
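The "one page per minute" rule from the first post can be checked against a burst like the one above. A minimal Python sketch, using hypothetical timestamps that mirror the log:

```python
from datetime import datetime

# Hypothetical: timestamps parsed from an access log for one crawler IP,
# mirroring the burst shown above (6 requests in a 4-second window).
timestamps = [
    "2001-06-09 14:00:00", "2001-06-09 14:00:01", "2001-06-09 14:00:02",
    "2001-06-09 14:00:02", "2001-06-09 14:00:03", "2001-06-09 14:00:04",
]

def requests_per_minute(stamps):
    """Average request rate over the observed window, in requests per minute."""
    times = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S") for s in stamps]
    span = (max(times) - min(times)).total_seconds()
    if span == 0:
        return float("inf")  # every hit landed in the same second
    return len(times) * 60.0 / span

rate = requests_per_minute(timestamps)
aggressive = rate > 1.0  # the "friendly" threshold from the list above
```

Here the sample burst works out to 90 requests per minute — well past friendly.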

msgraph
msg:401725
6:58 pm on Jun 12, 2001 (gmt 0)

I'm going to have to second littleman. I've been getting swamped by Inktomi as well. I went ahead and added robots.txt files on a few domains, and that stopped most of the Slurps except one: Slurp/cat.

This version is like a virus. Sometimes it will grab pages on one domain with only 2 seconds in between, but the Slurp/cat requests come from multiple IPs on the same Inktomi C-block. Just goes to show how much they coordinate with one another - unless they are running off separate lists of URLs from their dozen or so databases. But even so, that could still clog up a server.
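Spotting that kind of same-C-block coordination in a log can be sketched in a few lines of Python — the IP list here is hypothetical:

```python
from collections import Counter

# Hypothetical request IPs pulled from a day's access log.
ips = [
    "202.212.5.32", "202.212.5.33", "202.212.5.34",
    "202.212.5.32", "10.0.0.7",
]

def c_block(ip):
    """Collapse an IPv4 address to its /24 (classic 'C-block') prefix."""
    return ".".join(ip.split(".")[:3])

# Count hits per C-block; several distinct IPs inside one block with heavy
# combined traffic suggests a coordinated crawler like the one described.
hits = Counter(c_block(ip) for ip in ips)
```

With the sample list, "202.212.5" accounts for 4 of the 5 hits, so the crawler shows up even though no single IP dominates.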

So for almost two weeks or more this lil bugger wouldn't even glance at the robots.txt file. After that, if you don't have it disallowed in robots.txt, it takes a lil break for a week and starts all over again.

Brett_Tabke
msg:401726
7:56 pm on Jun 12, 2001 (gmt 0)

OK, debatable - on the gray list, since we don't have a lot of choice if we want Inktomi traffic.

Who else is in on the Friendly list?

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
© Webmaster World 1996-2014 all rights reserved