Crawler, Spider, and User Agent ID forum at WebmasterWorld

mattie

4:44 pm on Oct 24, 2004 (gmt 0)

10+ Year Member

I'm new to WebmasterWorld, and would appreciate your help distinguishing legitimate robots from malicious ones.

I've already identified a wide variety of robots that misbehave, but there are some Urchin statistics that I'm having trouble with.

Specifically, two of the listings for robots that have visited my sites are described as "Mozilla Compatible Agent" and "Googlebot."

The specifics for the Mozilla compatible agent are:

Mozilla Compatible Agent:
* Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]
* Mozilla/3.01 (compatible;)
* Mozilla/4.7 [en](Exabot@exava.com)
* Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
* Mozilla/4.0 (compatible; grub-client-2.3)
* Mozilla/3.0 (compatible; Indy Library)
* Mozilla/3.0 (compatible)

The specifics for the Googlebot listing are:

Googlebot:
* Googlebot/2.1 (+http://www.google.com/bot.html)

I have already banned the grub-client and Indy Library bots, but I'm unsure which Googlebot is legitimate. Also, which of the other various Mozilla compatible bots are suspect?

wilderness

8:46 pm on Oct 24, 2004 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

See bull's collated bot list.
Message #'s 16 and 17 are the most recent update.
[webmasterworld.com...]

claus

10:11 pm on Oct 24, 2004 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Wow, when did this forum come back? Great :) :)

Both Googlebots could be legitimate. You would have to check their IP address to be sure, though.

Btw, welcome to WebmasterWorld mattie :)
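
A note on the IP check: one way to do it, sketched here as an assumption rather than anything spelled out in this thread, is a reverse DNS lookup on the visiting IP followed by a forward lookup on the hostname that comes back. Google's published guidance describes this kind of check for hostnames under googlebot.com or google.com; verify the suffix list against their current documentation. The function name, the injectable resolver parameters and the idea of passing in fake resolvers for testing are all hypothetical.

```python
import socket
from typing import Callable, List

# Assumption: hostname suffixes used by genuine Google crawlers.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse_dns(ip: str) -> str:
    """Return the PTR hostname for an IP address."""
    return socket.gethostbyaddr(ip)[0]


def _forward_dns(host: str) -> List[str]:
    """Return the IPv4 addresses a hostname resolves to."""
    return socket.gethostbyname_ex(host)[2]


def is_real_googlebot(
    ip: str,
    reverse: Callable[[str], str] = _reverse_dns,
    forward: Callable[[str], List[str]] = _forward_dns,
) -> bool:
    """Reverse lookup, suffix check, then forward lookup must return the same IP."""
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:  # no PTR record or lookup failure
        return False
    if not host.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        return ip in forward(host)
    except OSError:
        return False
```

The forward lookup matters because anyone who controls the reverse zone for their own IP block can make it claim a googlebot.com hostname; only the real owner of that hostname can make it resolve back to the same IP. A user agent string alone proves nothing either way.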

fiestagirl

10:44 pm on Oct 24, 2004 (gmt 0)

10+ Year Member

Exavabot:
[webmasterworld.com...]

wilderness

1:37 am on Oct 25, 2004 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

> Wow, when did this forum come back?

Mid June, Claus.
[webmasterworld.com...]

Welcome back.

mattie

1:56 pm on Oct 25, 2004 (gmt 0)

10+ Year Member

> See bull's collated bot list.
> Message #'s 16 and 17 are the most recent update.
> [webmasterworld.com...]

What a wealth of information! I'll add this to my resources. Thanks, Wilderness.

> Both Googlebots could be legitimate. You would have to check their
> IP address to be sure, though.

Oy. Both of the sites I'm working with now have log files I don't have access to, and, unfortunately, the Urchin stats don't seem to correlate the robot statistics with the IP address. With this info, however, maybe I can talk the host into giving me access to them.

> Btw, welcome to WebmasterWorld mattie :)

Thanks.

I've done extensive research about this over the last week, and consistently found the answers I needed here.

I appreciate your input FiestaGirl, Wilderness and Claus!

mattie

8:59 pm on Oct 25, 2004 (gmt 0)

10+ Year Member

Given Bull's superb bot list at [webmasterworld.com...] I'm tempted to write code to stuff this info into a database, and automatically generate code for all of us to stuff into our htaccess files.

With this application, of course, we'd be able to add the latest Web cretins to the list.

Or has someone already done this?

Thanks all,
Mattie
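
A minimal sketch of the generator idea above, assuming the bot list has already been reduced to plain user agent substrings (the database part is left out). The function names are hypothetical, and the output simply mirrors the RewriteCond/RewriteRule style posted elsewhere in this thread; it is not taken from any existing tool.

```python
import re
from typing import Iterable


def escape_for_rewritecond(text: str) -> str:
    """Backslash-escape every non-alphanumeric character so it is matched literally.

    This also escapes spaces, which would otherwise split the RewriteCond arguments.
    """
    return re.sub(r"([^A-Za-z0-9])", r"\\\1", text)


def build_block(user_agents: Iterable[str]) -> str:
    """Build an .htaccess fragment that returns 403 for the listed UA substrings."""
    patterns = [escape_for_rewritecond(ua) for ua in user_agents if ua.strip()]
    if not patterns:
        # Emit nothing rather than a bare RewriteRule that would block everyone.
        return ""
    lines = ["RewriteEngine On"]
    for index, pattern in enumerate(patterns):
        # Every condition except the last needs [OR]; the last one must not have it.
        flags = "[NC]" if index == len(patterns) - 1 else "[NC,OR]"
        lines.append(f"RewriteCond %{{HTTP_USER_AGENT}} {pattern} {flags}")
    lines.append("RewriteRule .* - [F]")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # The two agents already banned earlier in this thread.
    print(build_block(["grub-client", "Indy Library"]))
```

Leaving [OR] off the final condition is the detail most likely to go wrong in generated output: a trailing [OR] chains the last condition into whatever rule follows it.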

wilderness

11:01 pm on Oct 25, 2004 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

> Or has someone already done this?

A Close to Perfect Htaccess (this old thread should keep you busy for a week or so ;)

[webmasterworld.com...]

volatilegx

4:29 am on Oct 26, 2004 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

> With this application, of course, we'd be able to add the latest Web cretins to the list.
> Or has someone already done this?

Somebody beat you to it: [joseluis.pellicer.org...]

mattie

2:13 pm on Oct 26, 2004 (gmt 0)

10+ Year Member

I'd already seen the close to perfect htaccess thread, but hadn't seen Jose's application. Beautiful!

opiesilver

1:16 am on Oct 28, 2004 (gmt 0)

10+ Year Member

Wow. That is truly awesome.

guitaristinus

10:47 am on Nov 1, 2004 (gmt 0)

10+ Year Member

I've just added the following to my .htaccess file. I got it from one of jdMorgan's posts, and I'm trying to block malicious Mozillas with it. It's supposed to block requests that have a blank referrer and a bogus UA consisting of nothing but Mozilla/x.x or Mozilla/x.xx.

RewriteCond %{HTTP_REFERER} ^$
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/[0-9]\.[0-9]{1,2}$
RewriteRule .* - [F]

Jim had RewriteRule !^403i?\.html$ - [F,L]
"This allows access only to custom 403 error and 'help' pages, which were not subsequently fetched." I just put the lines at the end of my long RewriteCond...[OR] list.

guitaristinus

12:58 pm on Nov 2, 2004 (gmt 0)

10+ Year Member

I deleted

RewriteCond %{HTTP_REFERER} ^$
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/[0-9]\.[0-9]{1,2}$

because an MSIECrawler was getting in.

62.118.153.11 - - [01/Nov/2004:11:41:53 -0500] "GET /mydomain/page.htm HTTP/1.1" 200 2511 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322; MSIECrawler)"

Another domain I didn't add the RewriteConds to gave the MSIECrawler a 403 because of other RewriteConds I have in the htaccess files on both domains.

The following MSIECrawler got a 403. Difference is that it has "SV1" instead of ".NET CLR 1.1.4322".

204.119.21.25 - - [01/Nov/2004:02:23:01 -0500] "GET /mydomain/ HTTP/1.1" 403 217 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MSIECrawler)"

I sure didn't understand it to begin with. I'm doing trial and error; copying, pasting and deleting; and learning what works along the way.
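
For anyone doing the same trial and error, here is a rough offline check of what the two conditions above actually match. It is only a sketch: Python's `re` stands in for Apache's regex engine (they agree on this simple pattern), and the function name is made up. The point is that the pattern is anchored with `^` and `$`, so it only matches a user agent that is nothing but Mozilla/x.x or Mozilla/x.xx.

```python
import re

# Same pattern as the RewriteCond on %{HTTP_USER_AGENT} above.
BARE_MOZILLA = re.compile(r"^Mozilla/[0-9]\.[0-9]{1,2}$")


def would_be_blocked(referer: str, user_agent: str) -> bool:
    """Mimic the two conditions: blank referrer AND a bare Mozilla/x.xx UA.

    A "-" in the access log means the header was absent, so treat it as blank.
    """
    if referer == "-":
        referer = ""
    return referer == "" and BARE_MOZILLA.search(user_agent) is not None


if __name__ == "__main__":
    samples = [
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322; MSIECrawler)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MSIECrawler)",
        "Mozilla/3.0 (compatible)",
        "Mozilla/4.0",
        "Mozilla/3.01",
        "Mozilla",
    ]
    for ua in samples:
        print(would_be_blocked("-", ua), ua)
```

Neither MSIECrawler line matches, because both have text after the version number. So these two conditions were never going to stop MSIECrawler, and removing them would not change what happens to it either; the 403 on the other domain came from the other RewriteConds, as noted above. A separate condition on the substring MSIECrawler would be needed to catch it.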

Umbra

4:33 pm on Nov 2, 2004 (gmt 0)

10+ Year Member

I have seen hits from these user agents:

Mozilla
Mozilla/4.0
Mozilla/5.0
MSIE5.5

These appear in the log files exactly as is (i.e., "Mozilla/4.0") and I didn't see them on bull's list. Can these be legitimate user agents?

volatilegx

3:59 pm on Nov 3, 2004 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

> Can these be legitimate user agents?

In the sense that they are actual web browsers used by humans? Doubtful.

Umbra

8:36 am on Nov 4, 2004 (gmt 0)

10+ Year Member

Why do these email harvesters (or whatever they are) use incorrect syntax for their user agents? Why not just use a proper user agent and thus escape detection altogether?

<suspiciously narrowing my eyes> Surely some of Them are here on WWW right now, looking for ways to foil our Rewrites. Maybe it's even the guy who started this thread! If only the Internet allowed us to form an angry mob with pitchforks and a good scapegoat.

mattie

2:39 pm on Nov 5, 2004 (gmt 0)

10+ Year Member

> Maybe it's even the guy who started this thread!

Umm ... the person who started this thread is a woman. :)

I also share your sentiments about cretins who don't play nicely. I've occasionally wanted to brandish a pitchfork myself.