A little background information:
My company has developed self-protecting web sites, meaning that if one IP or USER_AGENT hits our pages more than 20 times in X amount of time, it is blocked from our sites. We do this to prevent scraping of our sites and to stop e-mail harvesters. The system works well... too well. We have been noticing that as more sites are added, the search engines have to hit more pages and are beginning to be blocked. For obvious reasons this is not good. We can't increase the number of hits or the time period because that would defeat the purpose; we can, however, add the search engines to the allow list. We would like to allow them all at once instead of waiting for them to be blocked, because they might not come back.
So my challenge to the SEO world is to help come up with a list of all the USER_AGENTs and/or IPs for all the spiders (at least the top 10).
I would like to thank everyone in advance for helping with this. I will include you in the credits, and if you would like a copy of the completed "compiled list", let me know and I will be sure to get you one.
[webmasterworld.com...]
Here is an old list extracted from my old .htaccess file. The syntax is for mod_rewrite, and it is set up as an exclusion list; that is, the rewrite rule that follows this list is a block, and the following agents are excluded from being blocked.
Some members here will not agree that all of these user agents should be allowed. However, this list is as complete as I could make it at the time, and allows as many legitimate robots as possible - while I admit that some of them are annoying, or only marginally useful.
This list is also intended for the "North American market" and may omit some very desirable European and Asian spiders - I list only those that have visited my sites.
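For anyone unfamiliar with how an exclusion list like this hangs together, here is a minimal sketch, not my actual rule. Consecutive RewriteCond lines are ANDed, so the RewriteRule at the end fires only when the user agent matches none of the listed patterns. The RATE_LIMITED environment variable is purely hypothetical; it stands in for whatever your own system uses to flag a visitor that has exceeded the hit limit, and the 403 response is likewise an assumption.

```apache
RewriteEngine On
#
# Hypothetical trigger: only consider blocking requests that some other
# part of the system has already flagged (variable name is an assumption)
RewriteCond %{ENV:RATE_LIMITED} =1
#
# Exclusions: each negated condition must hold for the rule to fire,
# so an agent matching any one of these patterns is never blocked
RewriteCond %{HTTP_USER_AGENT} !^ia_archiver$
RewriteCond %{HTTP_USER_AGENT} !^Googlebot/[1-9][0-9]?\.[0-9]{1,2}\ \(.*google.*\)$
# ... the remaining exclusion conditions go here ...
#
# The block itself: return 403 Forbidden (assumed response)
RewriteRule .* - [F]
```

Without a positive condition like the first one, a rule of this shape would forbid every ordinary browser as well, since none of them match the spider patterns.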
HTH,
Jim
# SEARCH ENGINE ROBOTS & SPIDERS
#
# Alexa/Wayback Machine spider
RewriteCond %{HTTP_USER_AGENT} !^ia_archiver$
#
# Almaden IBM crawler
RewriteCond %{HTTP_USER_AGENT} !^http\://www\.almaden\.ibm\.com/cs/crawler
#
# Ask Jeeves robot
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/[1-9][0-9]?\.[0-9]{1,2}\ \(compatible\;\ Ask\ Jeeves\)$
#
# DMOZ ODP robot
RewriteCond %{HTTP_USER_AGENT} !^Robozilla/[1-9][0-9]?\.[0-9]{1,2}$
#
# DMOZ ODP editor
RewriteCond %{HTTP_USER_AGENT} !^Tulipchain
#
# Excite spider (may be out of business)
RewriteCond %{HTTP_USER_AGENT} !^ArchitextSpider$
#
# ExactSeek spider
RewriteCond %{HTTP_USER_AGENT} !^ExactSeek\ Crawler/[1-9][0-9]?\.[0-9]{1,2}$
#
# Fast robot
# FAST-WebCrawler/3.6 (atw-crawler at fast dot no; [fast.no...]
# FAST-WebCrawler/3.6/FirstPage (crawler@fast.no; [fast.no...]
RewriteCond %{HTTP_USER_AGENT} !^FAST\-WebCrawler/[1-9][0-9]?\.[0-9]{1,2}.*\ \(.*fast.*\)$
#
# Google robot
# Googlebot/2.1 (+http://www.googlebot.com/bot.html)
# Googlebot Image
RewriteCond %{HTTP_USER_AGENT} !^Googlebot/[1-9][0-9]?\.[0-9]{1,2}\ \(.*google.*\)$
RewriteCond %{HTTP_USER_AGENT} !^Googlebot.Image
#
# GigaBlast robot
RewriteCond %{HTTP_USER_AGENT} !^Gigabot/[1-9][0-9]?\.[0-9]{1,2}$
#
# Inktomi robot
# Mozilla/5.0 (Slurp/si; slurp@inktomi.com; [inktomi.com...]
# Mozilla/5.0 (Slurp/cat; slurp@inktomi.com; [inktomi.com...]
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/[1-9][0-9]?\.[0-9]{1,2}\ \(Slurp/[a-z]{2,3}\;\ .*inktomi.*\)$
#
# Looksmart robot
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/[1-9][0-9]?\.[0-9]{1,2}\ compatible\ ZyBorg/[1-9][0-9]?\.[0-9]{1,2}\ \(.*looksmart.*\)$
RewriteCond %{HTTP_USER_AGENT} !^MARTINI$
#RewriteCond %{HTTP_USER_AGENT} !^.*Zealbot\ [1-9][0-9]?\.[0-9]{1,2}$
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/[0-9]{1,2}\.[0-9]{1,2}.*\(compatible\;\ Zealbot\ [1-9][0-9]?\.[0-9]{1,2}\)$
#
# Lycos spiders
RewriteCond %{HTTP_USER_AGENT} !^Lycos_Spider_\(.*\)$
#
# Mercator-2.0 robot (from atrax2.pa-x.dec.com)
RewriteCond %{HTTP_USER_AGENT} !^Mercator\-[1-9][0-9]?\.[0-9]{1,2}$
#
# Microsoft link checker libwww-perl/5.51
RewriteCond %{REMOTE_HOST} !^.*\.microsoft\.com$
#
# NationalDirectory WebSpider 1.3
RewriteCond %{HTTP_USER_AGENT} !^NationalDirectory\-WebSpider/[1-9][0-9]?\.[0-9]{1,2}$
#
# Openfind spider
RewriteCond %{HTTP_USER_AGENT} !^Openfind\ data\ gatherer\,\ Openbot/[1-9][0-9]?\.[0-9]{1,2}\+\(.*openfind.*\)$
#
# Polybot robot from NY Polytechnical
RewriteCond %{HTTP_USER_AGENT} !^polybot\ [1-9][0-9]?\.[0-9]{1,2}\ \(.*cis\.poly\.edu/polybot/\)$
#
# ScrubTheWeb spider
RewriteCond %{HTTP_USER_AGENT} !^Scrubby/[1-9][0-9]?\.[0-9]{1,2}\ \(.*scrubtheweb.*\)$
#
# SearchHippo spider
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/[1-9][0-9]?\.[0-9]{1,2}\ \(compatible\;\ Fluffy\ the\ spider\;\ .*searchhippo.*\)$
#
# Teoma robots
RewriteCond %{HTTP_USER_AGENT} !^Teoma [NC]
#
# Thunderstone
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/[1-9][0-9]?\.[0-9]{1,2}\ \(compatible\;\ T\-H\-U\-N\-D\-E\-R\-S\-T\-O\-N\-E\)
#
# Altavista robots
RewriteCond %{HTTP_USER_AGENT} !^Scooter.*[1-9][0-9]?\.[0-9]
#
#Vagabondo/2.0 MT (webagent@NOSPAMwise-guys.nl)
RewriteCond %{HTTP_USER_AGENT} !^Vagabondo/[1-9][0-9]?\.[0-9]{1,2}\ MT\ \(webagent.*wise\-guys\.nl\)$
#
# WebRing.com robot
RewriteCond %{HTTP_USER_AGENT} !^Jonzilla/[0-9]
#
# Yahoo directory checker
RewriteCond %{REMOTE_HOST} !^.*\.corp\.yahoo\.com$
#
# appie 1.1 (www.walhello.com)
# BunnySlippers (from tide.microsoft.com)