Two more slow spiders

         

Spearmaster

1:54 pm on Nov 27, 2002 (gmt 0)

10+ Year Member



Just in case anyone else sees these:

Mozilla la2@unspecified.mail
larbin_2.6.2 vitalbox1@hotmail.com

They have been slowly crawling my sites for the past couple of days, almost non-stop, picking up a page every 20-30 minutes, and are obviously unable to recognize when they have been banned.

jdMorgan

4:35 pm on Nov 27, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



... and obviously unable to recognize when it has been banned.

Do you mean you are returning a 403-Forbidden response, or that you Disallowed them in robots.txt?

I've got a 403-block on "larbin", but I've never heard of the other one.

Thanks,
Jim

wilderness

5:04 pm on Nov 27, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Jim,
I've seen both of these with a variety of beginnings:
SetEnvIf User-Agent "unspecified\.mail$" keep_out
SetEnvIf User-Agent "sender\.com>\)$" keep_out
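A pattern on its own does nothing until the keep_out variable is tied to an access rule. A minimal .htaccess sketch (assuming Apache with mod_setenvif and mod_access loaded; the patterns here are illustrative, not an exact copy of anyone's block list):

```apache
# Mark suspect user-agents (dots escaped so they match literally)
SetEnvIf User-Agent "unspecified\.mail$" keep_out
SetEnvIf User-Agent "^larbin" keep_out

# Refuse any request that carries the mark with a 403
Order Allow,Deny
Allow from all
Deny from env=keep_out
```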

Spearmaster

1:53 am on Nov 28, 2002 (gmt 0)

10+ Year Member



Blocked the specific IPs in .htaccess. I really hate to play with robots.txt :) Unfortunately, Turnitinbot forced me to update robots.txt anyway.
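For anyone wanting to do the same, an IP block in .htaccess looks something like this (the addresses below are placeholders, not the spiders' actual IPs):

```apache
Order Allow,Deny
Allow from all
# Block a single host
Deny from 192.0.2.15
# Block an entire range by truncating the address
Deny from 198.51.100.
```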

Wilderness, I've seen the unspecified one as well - but that hasn't bothered me much.

By any chance, has anyone seen a browser that returns a null string? I can't ban this in .htaccess because Apache logs "-" for a null string, and some browsers legitimately send a hyphen...

jdMorgan

2:11 am on Nov 28, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Spearmaster,

I suggest you read deeply in this forum before implementing blocking. You can block user-agents starting with a specified string, user-agents ending with a specified string, user-agents containing a specified string, or user-agents which exactly match a specified string.

That's what all the hats and dollars ( "^" and "$" ) are for in the block lists you see here. These are present because the string-matching uses regular expressions [etext.lib.virginia.edu] - a pattern-matching specification language, if you will.
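To illustrate the four cases (using "larbin" as a stand-in for any bot name):

```apache
SetEnvIf User-Agent "^larbin"  keep_out   # starts with "larbin"
SetEnvIf User-Agent "larbin$"  keep_out   # ends with "larbin"
SetEnvIf User-Agent "larbin"   keep_out   # contains "larbin" anywhere
SetEnvIf User-Agent "^larbin$" keep_out   # is exactly "larbin"
```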

You really don't want to block null user-agents! There are many reasons a user-agent might be null, and many are beyond the control of the person using the browser. Corporate proxies often strip the UA header, and Norton Internet Security and similar products also blank the UA in their out-of-the-box default configuration.
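For reference only - the usual way people deny blank (and hyphen-only) user-agents is mod_rewrite, since an absent header tests as an empty string there. Per the warning above, deploying this will also shut out those legitimate visitors:

```apache
RewriteEngine On
# Matches an empty User-Agent, or a bare "-"
RewriteCond %{HTTP_USER_AGENT} ^-?$
# Return 403 Forbidden for any URL
RewriteRule .* - [F]
```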

Jim

wilderness

3:11 am on Nov 28, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



<snip>You really don't want to block null user-agents!</snip>

Jim,
I ALMOST agree with you here, 100%.
I have seen many devious attempts to access with a null UA.

The only valid spidering I've lost as a result of denying blank UAs was, I believe, Lycos.
I emailed the bot's website and they were totally baffled by my explanation that the only reason their bot was being denied was its null UA.

The amazing thing was that each time they spidered there were two log entries: one with both a null Referer and a null UA, then a second where robots.txt was read and the UA was identified.
The logic of this defies me. Yet they wouldn't continue the spider after the 403 from the initial null request.

Spearmaster

5:33 pm on Nov 28, 2002 (gmt 0)

10+ Year Member



I had implemented it for about 10 seconds before I realized I was potentially cancelling out lots of legitimate browser agents.

By the same token, I don't want someone ripping my site without any possible identification as to who it might be (other than an IP, of course) - and since I read my logs twice a day (I can hear you all laughing LOL), by the time I've discovered a rogue bot it's too late. But personal analysis often makes the difference when determining whether or not a visit was legitimate.

wilderness

2:21 am on Nov 29, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



<snip>But personal analysis often makes the difference when determining whether or not the visit was legitimate.</snip>

Spearmaster,
I like your thinking :-)
I've had a visitor today, at four different times, from a university. They enter the main page only - no robots.txt request or anything.
These types of visits have taught me to be suspicious.

I'm sure the computer labs at many universities run 24/7, but I'm not sure they are out making probes on a holiday?