I need help building a list

I am building a list of spiders that we should let in and would like help

         

anoryu

8:46 pm on Dec 26, 2002 (gmt 0)

10+ Year Member



I have been tasked with building a list of spiders that my company wants to allow. I have looked for such a list out there in the great expanse of the internet, but have come up dry. So I thought I would call on my fellow SEOs to help create this list.

A little background information:
My company has developed self-protecting web sites, meaning that if one IP or USER_AGENT hits our pages more than 20 times in X amount of time, it is blocked from our sites. We do this to prevent scraping of our sites and to stop e-mail harvesters. The system works well... too well. We have noticed that as more sites are added, the search engines have to hit more pages and are beginning to be blocked. For obvious reasons this is not good. We can't increase the number of hits or the time period because that would defeat the purpose; we can, however, add the search engines to the allow list. We would like to allow them all at once instead of waiting for them to be blocked, because they might not come back.
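For readers trying to picture the blocking scheme described above, here is a minimal sketch in Python. It is not the poster's actual system, which is not shown in this thread. The class name `HitLimiter`, the 60-second window, the exact-match allow list, and the choice to keep a client blocked once it trips the limit are all assumptions made for illustration.

```python
import time
from collections import defaultdict, deque


class HitLimiter:
    """Hypothetical sketch: block a client key (an IP or a user-agent
    string) once it makes more than max_hits requests inside a window."""

    def __init__(self, max_hits=20, window_seconds=60.0, allowed=()):
        self.max_hits = max_hits
        self.window_seconds = window_seconds  # assumed value; the post only says "X"
        self.allowed = set(allowed)           # keys that are never counted or blocked
        self.hits = defaultdict(deque)        # key -> timestamps of recent hits
        self.blocked = set()                  # keys that have tripped the limit

    def allow(self, key, now=None):
        """Return True if the request should be served, False if blocked."""
        if key in self.allowed:
            return True
        if key in self.blocked:
            return False  # assumption: a block is permanent once set
        now = time.monotonic() if now is None else now
        window = self.hits[key]
        window.append(now)
        # Discard hits that have aged out of the time window.
        while window and now - window[0] > self.window_seconds:
            window.popleft()
        if len(window) > self.max_hits:
            self.blocked.add(key)
            return False
        return True
```

With this shape, a spider on the allow list is served no matter how many pages it fetches, while any other client is cut off on its 21st hit inside the window. A real deployment would more likely match spiders by pattern or IP range than by exact string.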

So my challenge to the SEO world is to help come up with a list of all the USER_AGENTs and/or IPs for all the spiders (at least the top 10).

I would like to thank everyone in advance for helping with this. I will include you in the credits, and if you would like a copy of the completed "compiled list", let me know and I will be sure to get you one.

sun818

8:50 pm on Dec 26, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



hi anoryu, have you looked at WebmasterWorld's robots.txt?

[webmasterworld.com...]

anoryu

10:10 pm on Dec 26, 2002 (gmt 0)

10+ Year Member



No, I had not seen that one. It is a great list of what not to allow, but I need a list of the engines to allow. I will be adding the ones from that list to our block list, though. Thank you for pointing me to it.

jdMorgan

11:56 pm on Dec 26, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



anoryu,

Here is an old list extracted from my old .htaccess file. The syntax is for mod_rewrite, and it is set up as an exclusion list; that is, the rewrite rule that follows this list is a block, and the following agents are excluded from being blocked.

Some members here will not agree that all of these user agents should be allowed. However, this list is as complete as I could make it at the time, and allows as many legitimate robots as possible - while I admit that some of them are annoying, or only marginally useful.

This list is also intended for the "North American market" and may omit some very desirable European and Asian spiders - I list only those that have visited my sites.

HTH,
Jim

# SEARCH ENGINE ROBOTS & SPIDERS
#
# Alexa/Wayback Machine spider
RewriteCond %{HTTP_USER_AGENT} !^ia_archiver$
#
# Almaden IBM crawler
RewriteCond %{HTTP_USER_AGENT} !^http\://www\.almaden\.ibm\.com/cs/crawler
#
# Ask Jeeves robot
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/[1-9][0-9]?\.[0-9]{1,2}\ \(compatible\;\ Ask\ Jeeves\)$
#
# DMOZ ODP robot
RewriteCond %{HTTP_USER_AGENT} !^Robozilla/[1-9][0-9]?\.[0-9]{1,2}$
#
# DMOZ ODP editor
RewriteCond %{HTTP_USER_AGENT} !^Tulipchain
#
# Excite spider (may be out of business)
RewriteCond %{HTTP_USER_AGENT} !^ArchitextSpider$
#
# ExactSeek spider
RewriteCond %{HTTP_USER_AGENT} !^ExactSeek\ Crawler/[1-9][0-9]?\.[0-9]{1,2}$
#
# Fast robot
# FAST-WebCrawler/3.6 (atw-crawler at fast dot no; [fast.no...]
# FAST-WebCrawler/3.6/FirstPage (crawler@fast.no; [fast.no...]
RewriteCond %{HTTP_USER_AGENT} !^FAST\-WebCrawler/[1-9][0-9]?\.[0-9]{1,2}.*\ \(.*fast.*\)$
#
# Google robot
# Googlebot/2.1 (+http://www.googlebot.com/bot.html)
# Googlebot Image
RewriteCond %{HTTP_USER_AGENT} !^Googlebot/[1-9][0-9]?\.[0-9]{1,2}\ \(.*google.*\)$
RewriteCond %{HTTP_USER_AGENT} !^Googlebot.Image
#
# GigaBlast robot
RewriteCond %{HTTP_USER_AGENT} !^Gigabot/[1-9][0-9]?\.[0-9]{1,2}$
#
# Inktomi robot
# Mozilla/5.0 (Slurp/si; slurp@inktomi.com; [inktomi.com...]
# Mozilla/5.0 (Slurp/cat; slurp@inktomi.com; [inktomi.com...]
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/[1-9][0-9]?\.[0-9]{1,2}\ \(Slurp/[a-z]{2,3}\;\ .*inktomi.*\)$
#
# Looksmart robot
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/[1-9][0-9]?\.[0-9]{1,2}\ compatible\ ZyBorg/[1-9][0-9]?\.[0-9]{1,2}\ \(.*looksmart.*\)$
RewriteCond %{HTTP_USER_AGENT} !^MARTINI$
#RewriteCond %{HTTP_USER_AGENT} !^.*Zealbot\ [1-9][0-9]?\.[0-9]{1,2}$
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/[0-9]{1,2}\.[0-9]{1,2}.*\(compatible\;\ Zealbot\ [1-9][0-9]?\.[0-9]{1,2}\)$
#
# Lycos spiders
RewriteCond %{HTTP_USER_AGENT} !^Lycos_Spider_\(.*\)$
#
# Mercator-2.0 robot (from atrax2.pa-x.dec.com)
RewriteCond %{HTTP_USER_AGENT} !^Mercator\-[1-9][0-9]?\.[0-9]{1,2}$
#
# Microsoft link checker libwww-perl/5.51
RewriteCond %{REMOTE_HOST} !^.*\.microsoft\.com$
#
# NationalDirectory WebSpider 1.3
RewriteCond %{HTTP_USER_AGENT} !^NationalDirectory\-WebSpider/[1-9][0-9]?\.[0-9]{1,2}$
#
# Openfind spider
RewriteCond %{HTTP_USER_AGENT} !^Openfind\ data\ gatherer\,\ Openbot/[1-9][0-9]?\.[0-9]{1,2}\+\(.*openfind.*\)$
#
# Polybot robot from NY Polytechnical
RewriteCond %{HTTP_USER_AGENT} !^polybot\ [1-9][0-9]?\.[0-9]{1,2}\ \(.*cis\.poly\.edu/polybot/\)$
#
# ScrubTheWeb spider
RewriteCond %{HTTP_USER_AGENT} !^Scrubby/[1-9][0-9]?\.[0-9]{1,2}\ \(.*scrubtheweb.*\)$
#
# SearchHippo spider
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/[1-9][0-9]?\.[0-9]{1,2}\ \(compatible\;\ Fluffy\ the\ spider\;\ .*searchhippo.*\)$
#
# Teoma robots
RewriteCond %{HTTP_USER_AGENT} !^Teoma [NC]
#
# Thunderstone
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/[1-9][0-9]?\.[0-9]{1,2}\ \(compatible\;\ T\-H\-U\-N\-D\-E\-R\-S\-T\-O\-N\-E\)
#
# Altavista robots
RewriteCond %{HTTP_USER_AGENT} !^Scooter.*[1-9][0-9]?\.[0-9]
#
# Vagabondo/2.0 MT (webagent@NOSPAMwise-guys.nl)
RewriteCond %{HTTP_USER_AGENT} !^Vagabondo/[1-9][0-9]?\.[0-9]{1,2}\ MT\ \(webagent.*wise\-guys\.nl\)$
#
# WebRing.com robot
RewriteCond %{HTTP_USER_AGENT} !^Jonzilla/[0-9]
#
# Yahoo directory checker
RewriteCond %{REMOTE_HOST} !^.*\.corp\.yahoo\.com$
#
# appie 1.1 (www.walhello.com)
# BunnySlippers (from tide.microsoft.com)
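
To make the exclusion logic of the list above concrete: consecutive RewriteCond lines are ANDed by default, and every condition here is negated with "!", so the blocking rule that follows them (not shown in the post; something like `RewriteRule .* - [F]` would be a typical choice, but that is an assumption) only fires when the user agent matches none of the allowed patterns. The Python sketch below mimics that behavior with a handful of patterns copied from the list. The function names are hypothetical, and Python's `re` module is only an approximation of the regex engine mod_rewrite uses.

```python
import re

# A few patterns copied from the list above, without the leading "!".
ALLOWED_PATTERNS = [
    re.compile(r"^ia_archiver$"),
    re.compile(r"^Googlebot/[1-9][0-9]?\.[0-9]{1,2}\ \(.*google.*\)$"),
    re.compile(r"^Googlebot.Image"),
    re.compile(r"^Gigabot/[1-9][0-9]?\.[0-9]{1,2}$"),
    re.compile(r"^Teoma", re.IGNORECASE),  # stands in for the [NC] flag
]


def is_excluded_from_block(user_agent):
    """True if the user agent matches at least one allowed pattern."""
    return any(p.search(user_agent) for p in ALLOWED_PATTERNS)


def would_be_blocked(user_agent):
    """Mimic the ANDed, negated conditions: the block applies only
    when no allowed pattern matches."""
    return not is_excluded_from_block(user_agent)


if __name__ == "__main__":
    for ua in (
        "Googlebot/2.1 (+http://www.googlebot.com/bot.html)",
        "ia_archiver",
        "EmailSiphon",
    ):
        print(ua, "->", "blocked" if would_be_blocked(ua) else "allowed")
```

Note how strict the anchored patterns are: a user agent that merely begins with "Googlebot/2.1" but lacks the parenthesized part would not be excluded from the block.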

anoryu

2:35 pm on Dec 27, 2002 (gmt 0)

10+ Year Member



Thank you, JD. This will go a long way toward building my list.

carfac

6:10 am on Dec 29, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



anoryu:

Hey- I do not have a good list, just a bad one (but I would share, if you like...)

BUT.... I am very interested in an accesses-per-time-period block. Can you PM me with any details? Is it for public release?

Thanks!

Dave