IP: 207.36.47.237 (CyberGate, Inc., US)
UA1: eseek-crawler-larbin-2.63 (crawler@exactseek.com)
UA2: eseek-crawler-larbin-2.63 crawler@exactseek.com
UA1 grabbed robots.txt; UA2 grabbed the index page two minutes later from the same IP.
It's the exactseek.com search engine - didn't know that one. It has an interesting feature: check your ranking. I was glad to discover I was #1 for my most important keyword, so that bot's welcome anytime :) :)
/claus
Thanks for the heads-up. Here's some more related info for the SESI forum group:
Like many others, I had a complete block on Larbin, using an unanchored regex pattern too (see the sketch below).
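Roughly, the blanket rule went like this (a minimal sketch from memory; the exact pattern may have differed):

RewriteEngine On
# Block any UA containing "larbin" anywhere (unanchored, case-insensitive)
RewriteCond %{HTTP_USER_AGENT} larbin [NC]
RewriteRule .* - [F]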
Since I decided to allow ExactSeek, I had to change my code to make an exception:
# Permit ExactSeek to use larbin
# Condition 1: UA contains "larbin" anywhere (case-insensitive)...
RewriteCond %{HTTP_USER_AGENT} larbin [NC]
# Condition 2: ...but does not start with "eseek-larbin"
RewriteCond %{HTTP_USER_AGENT} !^eseek-larbin
# Forbid everything except the custom 403 pages and robots.txt itself
RewriteRule !^(403.*\.html|robots\.txt)$ - [F]
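In case it helps anyone, the same exception can be expressed without mod_rewrite - a sketch using mod_setenvif, though note it doesn't carve out robots.txt and the 403 pages the way the RewriteRule above does:

# Flag any larbin UA, then un-flag ExactSeek's version
SetEnvIfNoCase User-Agent larbin bad_bot
SetEnvIfNoCase User-Agent ^eseek-larbin !bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot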
ExactSeek would be well-advised to dump the "larbin" in their UA, I think, and call it "ExactSeek spider" or something...
Jim
[google.com...]
As far as my German can tell, the GMX site is offering spam-free email, so I guess they're not exactly harvesting, and their site search is now Overture-driven, so they've got no reason to spider either.
The Freshmeat Larbin project page changelog also has these interesting notes:
V2.6.2 (2002-04-14): Rewrite the robots.txt and html parser
V2.6.1 (2002-03-09): Improve robots.txt parser
Perhaps it's time to re-evaluate the larbin rule?
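If the new parser really does honour robots.txt, something as simple as this might do the job instead of a 403 (just a sketch, and it assumes the bot matches itself on the "eseek-larbin" token; the /private/ path is only an example):

# Let ExactSeek's larbin crawl everything
User-agent: eseek-larbin
Disallow:

# Keep everything else out of the example directory
User-agent: *
Disallow: /private/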
Then again, of course there might still be (older) versions around being used for whatever purpose.
/claus
207.36.47.237 - - [15/Jul/2003:17:42:38 -0400] "GET /robots.txt HTTP/1.0" 403 891 "-" "eseek-larbin_2.6.2 (crawler@exactseek.com)"
So, that changes the code a bit:
# Permit ExactSeek to use larbin
# Condition 1: UA contains "larbin" anywhere (case-insensitive)...
RewriteCond %{HTTP_USER_AGENT} larbin [NC]
# Condition 2: ...but does not start with "eseek-" (covers both UA formats seen so far)
RewriteCond %{HTTP_USER_AGENT} !^eseek-
# Forbid everything except the custom 403 pages and robots.txt itself
RewriteRule !^(403.*\.html|robots\.txt)$ - [F]
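To check that the exception actually fires, the rewrite log is handy while testing (Apache 1.3/2.0 directives; they go in the server config rather than .htaccess, and the path is just an example):

# Temporarily log rewrite processing to verify the conditions match as intended
RewriteLog /var/log/apache/rewrite.log
RewriteLogLevel 3

Remember to turn it off again afterwards - level 3 gets verbose fast.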
Jim