
Search Engine Spider and User Agent Identification Forum

    
Site friendly spiders : the list
Brett_Tabke (WebmasterWorld Administrator)
Msg#: 766 posted 5:41 pm on Jun 12, 2001 (gmt 0)

Friendly Spider:
Obeys robots.txt.
Does not request more than one page per minute (see the robots.txt sketch at the end of this post).
Visits in low traffic hours or reduces requests during peak traffic hours.
Has info on its site about the spider.

The List
Slurp : Inktomi.com
GoogleBot : google.com
Scooter : altavista.com
DirectHit : directhit.com
Fast : alltheweb.com
teoma : teoma.com
ArchitextSpider : excite.com
Gulliver : northernlight.com
T-Rex : Lycos.com
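
For the one-page-per-minute point, a robots.txt along these lines at least asks a spider to slow down. Crawl-delay is a non-standard extension that only some crawlers honor (Slurp eventually did; most others ignore it), so take this as a sketch rather than a guarantee:

User-agent: Slurp
Crawl-delay: 60

User-agent: *
Disallow: /cgi-bin/

The value is in seconds, so 60 roughly matches the one-request-per-minute guideline above.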

 

littleman (WebmasterWorld Senior Member)
Msg#: 766 posted 6:22 pm on Jun 12, 2001 (gmt 0)

I'd take Slurp out, especially the Japanese-based bots. They will rip through a CNAME (subdomain) layout very aggressively. Here is an example. All page requests look like this:
category1.domain.com
category2.domain.com

The bot is 202.212.5.32 -> goo311.inktomi.com
The requests come in like this:
at 1:59:58 PM on Monday, June 9, 2001
at 2:00:00 PM on Monday, June 9, 2001
at 2:00:01 PM on Monday, June 9, 2001
at 2:00:02 PM on Monday, June 9, 2001
at 2:00:02 PM on Monday, June 9, 2001
at 2:00:03 PM on Monday, June 9, 2001
at 2:00:04 PM on Monday, June 9, 2001
at 2:00:06 PM on Monday, June 9, 2001
at 2:00:06 PM on Monday, June 9, 2001
at 2:00:08 PM on Monday, June 9, 2001
at 2:00:09 PM on Monday, June 9, 2001
at 2:00:10 PM on Monday, June 9, 2001
at 2:00:11 PM on Monday, June 9, 2001
at 2:00:12 PM on Monday, June 9, 2001
at 2:00:14 PM on Monday, June 9, 2001
at 2:00:15 PM on Monday, June 9, 2001
at 2:00:16 PM on Monday, June 9, 2001
and on, and on...

Adding up to tens of thousands of requests per day per server.
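
A quick way to put numbers on this is to tally your own logs. A rough Python sketch (the log path and the combined-format regex are assumptions; adjust them to whatever your server actually writes):

import re
from collections import Counter

LOG_PATH = "access_log"  # hypothetical path; point it at your server's log
# Matches the start of a common/combined-format line: client IP, then the [timestamp]
line_re = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')

hits_per_minute = Counter()
with open(LOG_PATH) as log:
    for line in log:
        m = line_re.match(line)
        if not m:
            continue
        ip, stamp = m.groups()
        minute = stamp[:17]  # "09/Jun/2001:14:00" - truncated to the minute
        hits_per_minute[(ip, minute)] += 1

# Worst offenders first; anything far above one request per minute fails the "friendly" test
for (ip, minute), count in hits_per_minute.most_common(10):
    print(ip, minute, count, "requests")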

msgraph (WebmasterWorld Senior Member)
Msg#: 766 posted 6:58 pm on Jun 12, 2001 (gmt 0)

I'm going to have to second littleman. I've been getting swamped by Inktomi as well. I went ahead and added robots.txt files on a few domains, and that stopped most of the Slurps except one: Slurp/cat.

This version is like a virus. Sometimes it will grab pages on one domain with only 2 seconds in between, BUT they (Slurp/cat) come from multiple IPs on the same Inktomi C-block. Just goes to show how much they coordinate with one another. Unless they are running off separate lists of URLs from their dozen or so databases. But even so, that could still clog up a server.

For two weeks or more this lil bugger wouldn't even glance at the robots.txt file. After that, if you don't have it disallowed in robots.txt, it takes a lil break for a week and starts all over again.
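
If you want to double-check that the file you published really shuts the door, Python's standard-library robots.txt parser can tell you what a given user-agent is allowed to fetch. A small sketch (the Disallow rule here is just the obvious one for this case, not anything Inktomi publishes):

from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that shuts Slurp out entirely
ROBOTS_TXT = """\
User-agent: Slurp
Disallow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("Slurp", "/category1/page.html"))      # False - Slurp is disallowed
print(rp.can_fetch("Googlebot", "/category1/page.html"))  # True - everyone else is still welcome

Of course, that only helps if the bot actually reads the file, which is the whole complaint here.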

Brett_Tabke (WebmasterWorld Administrator)
Msg#: 766 posted 7:56 pm on Jun 12, 2001 (gmt 0)

Ok, debatable - it goes on the gray list, since we don't have a lot of choice if we want Ink traffic.

Who else is in on the Friendly list?
