The site's IT department would use user-agent delivery to allow spidering of the pages. They're not experienced in agent delivery, and neither am I. The plan is that spiders would go straight to the pages, while visitors would be redirected to a registration form.
Here's the check they'd be doing for crawlers (Apache server):
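# Treat the request as a crawler if the User-Agent contains any of these names.
# /i ignores case; /x lets the pattern span several lines (so the space is escaped).
if ( $browser =~
/robot|slurp|crawl|scooter|googlebot|libwww|JennyBot|polybot|
ferret|spider|psbot|openbot|zyborg|webstream\.net|archiver|
internetseer|pompos|ask\ jeeves|teleportpro|mercator|
python-urllib|webzip|slysearch|netsweeper/ix ) {
return;
}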
If it's not a crawler, they'd do a redirect for the visitor to the appropriate registration page:
print $cgi->redirect(-uri =>
"http://$ENV{'HTTP_HOST'}/cgi-bin/getReg/$section/$topic");
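Putting the two pieces together, here is a minimal sketch of the decision the script would be making. The helper names `is_crawler` and `registration_url`, the sample User-Agent strings, and the host, section, and topic values are all made up for illustration; only the crawler names and the URL layout come from the snippets above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the User-Agent looks like a known crawler.
# The names are the ones from the check above; /i ignores case, /x allows layout.
sub is_crawler {
    my ($ua) = @_;
    return 0 unless defined $ua;
    return $ua =~ /robot|slurp|crawl|scooter|googlebot|libwww|jennybot|polybot|
                   ferret|spider|psbot|openbot|zyborg|webstream\.net|archiver|
                   internetseer|pompos|ask\ jeeves|teleportpro|mercator|
                   python-urllib|webzip|slysearch|netsweeper/ix ? 1 : 0;
}

# Hypothetical helper: builds the registration URL used in the redirect above.
sub registration_url {
    my ($host, $section, $topic) = @_;
    return "http://$host/cgi-bin/getReg/$section/$topic";
}

# Sample User-Agent strings and URL parts (illustrative only).
for my $ua ('Googlebot/2.1 (+http://www.googlebot.com/bot.html)',
            'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)') {
    if (is_crawler($ua)) {
        print "serve page:  $ua\n";
    } else {
        print "redirect to ", registration_url('www.example.org', 'news', 'today'), "\n";
    }
}
```

Note that this is a plain substring match, so any visitor who sends a User-Agent containing one of these names would skip the registration form as well.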
I'm not qualified to comment on whether the above would work, but I'd appreciate all thoughts I can pass on.
I want to make sure that the pages will get spidered, and also would value opinions on whether the procedure would withstand manual inspection by the engines. I'm assuming it would, and that the engines would manually inspect before they'd ban. The site is highly respected, clearly non-profit, and the content delivered to the user after registration is exactly what the spider sees.
Gotta run, but here's a quick cut and paste with a few more you may want to include - depends on your market and what directories you may be listed in. This is straight out of .htaccess, so the format's all wrong, but you can pick out any user-agents you might want.
Jim
# Excite spider (may be out of business)
RewriteCond %{HTTP_USER_AGENT} !^ArchitextSpider$
#
# ExactSeek spider
RewriteCond %{HTTP_USER_AGENT} !^ExactSeek\ Crawler/[1-9][0-9]?\.[0-9]{1,2}$
#
# Fast robot
RewriteCond %{HTTP_USER_AGENT} !^FAST\-WebCrawler/[1-9][0-9]?\.[0-9]{1,2}.*\ \(.*fast.*\)$
#
# GigaBlast robot
RewriteCond %{HTTP_USER_AGENT} !^Gigabot/[1-9][0-9]?\.[0-9]{1,2}$
#
# Looksmart robots
RewriteCond %{HTTP_USER_AGENT} !^MARTINI$
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/[0-9]{1,2}\.[0-9]{1,2}.*\(compatible\;\ Zealbot\ [1-9][0-9]?\.[0-9]{1,2}\)$
#
# Lycos spiders
RewriteCond %{HTTP_USER_AGENT} !^Lycos_Spider_\(.*\)$
#
# Microsoft link checker libwww-perl/5.51
RewriteCond %{REMOTE_HOST} !^.*\.microsoft\.com$
#
# NationalDirectory WebSpider 1.3
RewriteCond %{HTTP_USER_AGENT} !^NationalDirectory\-WebSpider/[1-9][0-9]?\.[0-9]{1,2}$
#
# Openfind spider
RewriteCond %{HTTP_USER_AGENT} !^Openfind\ data\ gatherer\,\ Openbot/[1-9][0-9]?\.[0-9]{1,2}\+\(.*openfind.*\)$
#
# Polybot robot from NY Polytechnical
RewriteCond %{HTTP_USER_AGENT} !^polybot\ [1-9][0-9]?\.[0-9]{1,2}\ \(.*cis\.poly\.edu/polybot/\)$
#
# ScrubTheWeb spider
RewriteCond %{HTTP_USER_AGENT} !^Scrubby/[1-9][0-9]?\.[0-9]{1,2}\ \(.*scrubtheweb.*\)$
#
# SearchHippo spider
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/[1-9][0-9]?\.[0-9]{1,2}\ \(compatible\;\ Fluffy\ the\ spider\;\ .*searchhippo.*\)$
#
# Teoma robots
RewriteCond %{HTTP_USER_AGENT} !^Teoma [NC]
#
# Thunderstone
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/[1-9][0-9]?\.[0-9]{1,2}\ \(compatible\;\ T\-H\-U\-N\-D\-E\-R\-S\-T\-O\-N\-E\)
#
#Vagabondo/2.0 MT (webagent@NOSPAMwise-guys.nl)
RewriteCond %{HTTP_USER_AGENT} !^Vagabondo/[1-9][0-9]?\.[0-9]{1,2}\ MT\ \(webagent.*wise\-guys\.nl\)$
#
# Yahoo directory checker
RewriteCond %{REMOTE_HOST} !^.*\.corp\.yahoo\.com$
#
# appie 1.1 (www.walhello.com)
# BunnySlippers (from tide.microsoft.com)
#
# DMOZ ODP robot
RewriteCond %{HTTP_USER_AGENT} !^Robozilla/[1-9][0-9]?\.[0-9]{1,2}$
#
# DMOZ ODP editor
RewriteCond %{HTTP_USER_AGENT} !^Tulipchain
#
# SITE CHECKING TOOLS
#
# W3C_Validator W3C_Validator/1.183 libwww-perl/5.64
RewriteCond %{HTTP_USER_AGENT} !^W3C\_Validator/[1-9][0-9]?\.[0-9]{1,4}\ libwww\-perl/[1-9][0-9]?\.[0-9]{1,3}$
#
# Search Engine World Robots.txt Validator
RewriteCond %{HTTP_USER_AGENT} !^Search\ Engine\ World\ Robots\.txt\ Validator
#
# Xenu Link Sleuth 1.2c
RewriteCond %{HTTP_USER_AGENT} !^Xenu\ Link\ Sleuth\ [1-9][0-9]?\.[0-9]{1,2}
#
RewriteCond %{HTTP_USER_AGENT} !^LinkScan/[0-9]{1,2}\.[0-9]{1,2}\ Unix$
I don't know... Thought you guys would. ;) I'm not even sure where it's used on the server. I'm pretty much a beginner here. I think I understand the principles... I know I don't understand the specifics.
The string "modperl" appears in some of the fake directories that were created when we made the URLs more user- and search-engine-friendly, so the language might be Perl.
Jim - Thanks for the additional user agents. Is there a complete list on WebmasterWorld of bots they should allow? And where are updates to the spider list available? I assume keeping the list current is one of the costs of maintaining such a system. Also, I wonder, does Freshbot have a separate ID, or is it simply googlebot?
I cannot tell for sure from the code snippets you posted but I believe it is ok. Where does the return [perldoc.com] statement return to?
Since this is just UA delivery you can easily check by using a fake UA.
Andreas
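A rough sketch of that fake-UA check, assuming Perl with the core HTTP::Tiny module. The helper names, the script name `fake_ua.pl`, and the two sample User-Agent strings are made up for illustration. Redirects are deliberately not followed, so the redirect to the registration page shows up as a 3xx status.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Tiny;

# Hypothetical helper: describe what a status code implies for this setup.
sub describe_status {
    my ($status) = @_;
    return 'redirected (treated as a visitor)' if $status >= 300 && $status < 400;
    return 'served directly (treated as a crawler)' if $status == 200;
    return "unexpected status $status";
}

# Fetch $url with the given User-Agent without following redirects,
# so a 3xx response is reported instead of the page it points to.
sub fetch_as {
    my ($url, $agent) = @_;
    my $http = HTTP::Tiny->new(agent => $agent, max_redirect => 0);
    my $res  = $http->get($url);
    return ($res->{status}, $res->{headers}{location});
}

# Usage: perl fake_ua.pl http://www.example.org/some/page
if (my $url = shift @ARGV) {
    for my $agent ('Googlebot/2.1', 'Mozilla/4.0 (compatible; MSIE 6.0)') {
        my ($status, $location) = fetch_as($url, $agent);
        printf "%-40s %s%s\n", $agent, describe_status($status),
            defined $location ? " -> $location" : '';
    }
}
```

If both agents get the same result, the UA check is probably not being reached at all.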
Good point. I'm guessing that using the "noarchive" tag would trigger a manual inspection at some point.
I came across a Google News thread about getting subscription pages indexed [webmasterworld.com]. The general reaction to the idea of indexing such content was negative:
I would find it incredibly annoying if I came across a site doing this. You are wasting my time and reducing the quality of the search results.
Anyone have any idea whether the engines would react this way too?