Forum Moderators: DixonJones

Message Too Old, No Replies

If a bot has no User Agent, can I ban them by

User-Agent: i.e. nothing?


diddlydazz

12:50 pm on Feb 8, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Would that work?

This bot leaves no User Agent and nothing else. An example:

[01/Feb/2002:15:30:17 +0000] "GET / HTTP/1.0" 200 20253 "-" "-"

I can only use robots.txt; will this work:

User-Agent:
Disallow: /

or

User-Agent: -
Disallow: /

Any ideas? This bot is really starting to get on my nerves.

Thanks in advance as always

Dazz

mark_roach

2:21 pm on Feb 8, 2002 (gmt 0)

10+ Year Member



Dazz

I think that:

User-Agent:
Disallow: /

will disallow all spiders. It's also likely that this "spider" doesn't check robots.txt anyway.

diddlydazz

5:04 pm on Feb 8, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I suspected as much, Mark.

Thanks anyway.

Dazz

bird

7:33 pm on Feb 8, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



According to my references, the correct form to ban all robots looks like this:

User-agent: *
Disallow: /

(note the wildcard "*")
[robotstxt.org...]

diddlydazz

7:41 pm on Feb 8, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I realise that, bird, but I don't think I'll risk it :)

Dazz

bird

8:40 pm on Feb 8, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Oh, now I understand. The empty User-Agent entry is supposed to mean "the robot without a UA string"? I can see the logic behind that, but I doubt the robot would.

Like apparently everybody else, I'd be surprised if a robot whose maintainer has neglected to configure a UA string would bother to read robots.txt anyway. After all, the same informal standard that defines robots.txt also requires that each robot identify itself.

diddlydazz

10:45 pm on Feb 8, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yeah, sounds about right, bird :)

Thanks

Dazz

littleman

11:03 pm on Feb 8, 2002 (gmt 0)



If you have an Apache server and the ability to use a .htaccess file, this will deny all access to bots coming in without a UA:
# flag any request whose User-Agent header is empty
SetEnvIf User-Agent ^$ keep_out
# allow everyone except flagged requests
order allow,deny
allow from all
deny from env=keep_out

diddlydazz

11:11 pm on Feb 8, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks, littleman, but I don't have .htaccess rights for this site (it's a hobby site with cheap hosting).

Thanks

Dazz

Key_Master

11:24 pm on Feb 8, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've been following a very crafty bot that has been crawling the web for a couple of months. It has no user agent or referrer string, and all its IPs begin with 64. It is released from one domain IP, returns the info to another domain IP, and continues back and forth, switching from domain IP to domain IP. Most here probably haven't detected it yet.

There is a problem with banning a visitor just because the request carries no user agent. With some proxy services, for example, you may inadvertently ban a visitor when the browser requests a JavaScript or CSS file; those requests often come through a different IP that sends no user agent.

littleman

11:42 pm on Feb 8, 2002 (gmt 0)



That's too bad, Diddlydazz.

Key_Master, you could work around that by putting your JavaScript and CSS files in a separate sub-folder with a .htaccess that doesn't have the above restriction.

Or:
SetEnvIf User-Agent ^$ keep_out
<Files ~ "(\.html|\.jpg|\.gif|\.what_ever_else)$">
order allow,deny
allow from all
deny from env=keep_out
</Files>

Key_Master

11:58 pm on Feb 8, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That's a good workaround, littleman. The only problem I see is that some innocent users don't send a user agent and will be banned. The looksmart spider doesn't use one either; who knows who else. A better method would be a PHP/Perl script that assigns the visitor a user agent (e.g. their own IP address). That's what I have done.
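
In PHP, a minimal sketch of that idea might look like this (just illustrative; the fallback value and whatever logging or ban logic you hang off it are up to you):

<?php
// If the request carries no User-Agent header, substitute the
// visitor's IP address so later filtering and log analysis
// have something to key on.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? trim($_SERVER['HTTP_USER_AGENT']) : '';
if ($ua == '') {
    $ua = $_SERVER['REMOTE_ADDR']; // assign the IP as a stand-in agent
}
?>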

littleman

12:03 am on Feb 9, 2002 (gmt 0)



No doubt you could do a lot more with a script.

>The looksmart spider doesn't use one either...
It won't matter in six months when they are out of business.

bird

1:13 pm on Feb 9, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have always seen the looksmart spider with a UA of "Mozilla/4.5 [en] (Win95; I)", and recently they seem to have upgraded to NT: "Mozilla/4.04 [en] (WinNT; I)".

Key_Master

3:14 pm on Feb 9, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Bird, we've had previous discussions here about the different agents coming out of looksmart's IP block.

[webmasterworld.com...]

Now you know they have one that doesn't use an agent. What's it for? I have no idea. Perhaps it checks for cloaked sites or possibly even ad listings on free submit sites.

bird

5:05 pm on Feb 9, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You live and learn... :)

Looks like they dumped the UA string some time between August and November. Since it only fetches URLs that are listed in the directory, I assume it is still the same link checker as before. Another interesting detail I noted is that they switched from HEAD to GET requests between July and August last year.

I serve empty pages (without an error) to visitors without a UA, so far without any negative side effects on my LookSmart listings. Guess I'll start making an exception for them, just to be on the safe side.
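
In a script, that could look roughly like this (a PHP sketch only; the IP prefix shown is a placeholder, not LookSmart's actual range):

<?php
// Serve an empty page (normal 200 status, no error) to visitors
// with no User-Agent, except for a trusted crawler's range.
// The 192.0.2. prefix is a documentation placeholder, not a real block.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? trim($_SERVER['HTTP_USER_AGENT']) : '';
$trusted = (strpos($_SERVER['REMOTE_ADDR'], '192.0.2.') === 0);
if ($ua == '' && !$trusted) {
    exit; // empty body, request otherwise succeeds
}
?>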