IP: 207.36.47.237 (CyberGate, Inc., US)
UA1: eseek-crawler-larbin-2.63 (crawler@exactseek.com)
UA2: eseek-crawler-larbin-2.63 crawler@exactseek.com
UA1 grabbed robots.txt; UA2 grabbed the index page two minutes later from the same IP.
It's the exactseek.com search engine - didn't know that one. It has an interesting feature: check your ranking. I was glad to discover I was #1 for my most important keyword, so that bot's welcome anytime :) :)
/claus
Thanks for the heads-up. Here's some more related info for the SESI forum group:
Like many others, I had a complete block on Larbin, using an unanchored regex pattern too (see the sketch below).
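Roughly, the blanket rule went like this (a minimal sketch from memory; the exact pattern may have differed):

RewriteEngine On
# Block any UA containing "larbin" anywhere (unanchored, case-insensitive)
RewriteCond %{HTTP_USER_AGENT} larbin [NC]
RewriteRule .* - [F]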
Since I decided to allow ExactSeek, I had to change my code to make an exception:
# Permit ExactSeek to use larbin
# Condition 1: UA contains "larbin" anywhere (case-insensitive)...
RewriteCond %{HTTP_USER_AGENT} larbin [NC]
# Condition 2: ...but does not start with "eseek-larbin"
RewriteCond %{HTTP_USER_AGENT} !^eseek-larbin
# Forbid everything except the custom 403 pages and robots.txt itself
RewriteRule !^(403.*\.html|robots\.txt)$ - [F]
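In case it helps anyone, the same exception can be expressed without mod_rewrite - a sketch using mod_setenvif, though note it doesn't carve out robots.txt and the 403 pages the way the RewriteRule above does:

# Flag any larbin UA, then un-flag ExactSeek's version
SetEnvIfNoCase User-Agent larbin bad_bot
SetEnvIfNoCase User-Agent ^eseek-larbin !bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot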
ExactSeek would be well-advised to dump the "larbin" in their UA, I think, and call it "ExactSeek spider" or something...
Jim
[google.com...]
As far as my German can tell, the GMX site is offering spam-free email, so I guess they're not exactly harvesting, and their site search is now Overture-driven, so they've got no reason to spider either.
The Freshmeat Larbin project page changelog also has these interesting notes:
V2.6.2 (2002-04-14): Rewrite the robots.txt and html parser
V2.6.1 (2002-03-09): Improve robots.txt parser
Perhaps it's time to re-evaluate the larbin rule?
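If the new parser really does honour robots.txt, something as simple as this might do the job instead of a 403 (just a sketch, and it assumes the bot matches itself on the "eseek-larbin" token; the /private/ path is only an example):

# Let ExactSeek's larbin crawl everything
User-agent: eseek-larbin
Disallow:

# Keep everything else out of the example directory
User-agent: *
Disallow: /private/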
Then again, of course there might still be (older) versions around being used for whatever purpose.
/claus
207.36.47.237 - - [15/Jul/2003:17:42:38 -0400] "GET /robots.txt HTTP/1.0" 403 891 "-" "eseek-larbin_2.6.2 (crawler@exactseek.com)"
So, that changes the code a bit:
# Permit ExactSeek to use larbin
# Condition 1: UA contains "larbin" anywhere (case-insensitive)...
RewriteCond %{HTTP_USER_AGENT} larbin [NC]
# Condition 2: ...but does not start with "eseek-" (covers both UA formats seen so far)
RewriteCond %{HTTP_USER_AGENT} !^eseek-
# Forbid everything except the custom 403 pages and robots.txt itself
RewriteRule !^(403.*\.html|robots\.txt)$ - [F]
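To check that the exception actually fires, the rewrite log is handy while testing (Apache 1.3/2.0 directives; they go in the server config rather than .htaccess, and the path is just an example):

# Temporarily log rewrite processing to verify the conditions match as intended
RewriteLog /var/log/apache/rewrite.log
RewriteLogLevel 3

Remember to turn it off again afterwards - level 3 gets verbose fast.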
Jim