Forum Moderators: DixonJones
I have "GET / HTTP/1.1" 200 19764 "-" "-" (an empty referrer and an empty user-agent string) in today's log files.
Tracing the IP number through SpamCop leads to "abuse@Futuresoft.com", so I went to their website only to find they are a software company.
In the back of my mind I seem to recall some negative connotations with respect to "-" "-", but I can't find it.
Is there any reason to suspect the software producer of anything?
Most importantly, is "-" "-" something I should put in my .htaccess file or just ban by IP Number?
Thanks.
Pendanticist.
...my site reflects my views so when this happens they get taken to a page explaining why they got 403'd and what they can do to fix the problem.
The only thing to watch out for in this approach is that you always serve up robots.txt properly as at least one crawler requests it without any user-agent.
- Tony
...my site reflects my views so when this happens they get taken to a page explaining why they got 403'd and what they can do to fix the problem.
For my solution I'm thinking:
User-Agent: "-" "-"
Disallow: /path/
with a blank line in between each succeeding record:
User-Agent: other bot
Disallow: /path/
User-Agent: other bot
Disallow: /path/
User-Agent: still other bot
Disallow: /path/
and so on....
Sound about right?
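For reference, a standard robots.txt record takes the shape below (the bot names here are placeholders, not real crawlers). One caveat worth noting: records are matched against the name a crawler announces in its user-agent header, so a request with a blank user-agent has no name for a "-" "-" record to match against - which is why that case usually gets handled at the server level instead:

```
# One record per crawler name, separated by a blank line.
# "ExampleBot" and "AnotherExampleBot" are placeholders.
User-agent: ExampleBot
Disallow: /path/

User-agent: AnotherExampleBot
Disallow: /other-path/
```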
As an aside: Do you do this on a per-UA/bot/spider basis, so that each one gets its own set of specific instructions/remedies on an individually customized 403 page? Seems complicated.
Pendanticist.
The reason I ask is that robots.txt is optional meaning that normally there is nothing in place to enforce its rules, and most of the bots you really want to block just ignore it anyway!
My reference to the robots.txt was just to point out that you shouldn't deny *every* request from these bots because as a bare minimum you want to always be able to serve robots.txt to at least give them a hint that you don't want them to touch your site...
I'm an ASP person so my technical solution was {sound of Tony rummaging through code};
Sub AccessDenied()
    'Code to generate access denied status based on certain criteria
    Dim bIsAccessDenied, sExtraInfo
    bIsAccessDenied = False
    sExtraInfo = vbNullString

    'Treat a missing or one-character user-agent as a denial trigger
    If Len( Trim( Request.ServerVariables("HTTP_USER_AGENT") ) ) <= 1 Then
        bIsAccessDenied = True
        sExtraInfo = "Turn on your user-agent if you want to browse the site, otherwise go home - things that browse without user-agents are mostly bad web-spiders looking for e-mail addresses."
    End If

    If bIsAccessDenied Then
        Response.Status = "403 Site Access Denied"
%>
<html>
<head>
<title>Access Denied</title>
<meta name="robots" content="noindex" />
</head>
<body>
<h1>Access Denied</h1>
Access to the requested resource has been denied.
<p></p>
<%
        If sExtraInfo <> vbNullString Then
            Response.Write sExtraInfo
        End If
%>
</body>
</html>
<%
        Response.End
    End If
End Sub
I put this inside my core include and have it fire before the on-page code gets run - this way if they do come-a-calling all they get is a 403 with a useful message but nothing from the actual page...
Essentially it checks the length of the user-agent, minus any leading/trailing spaces; if this is less than two characters it serves a simple, tidy & unindexable page in place of the actual content, then stops the rest of the page from running (Response.End).
<added>because tony doesn't read stuff too thoroughly when it's late</added>
Do I deny them individually or using some sort of code?
At the moment there are two ways someone can get a 403 from my site;
1) giving me no user-agent or one which I think is gibberish
2) asking for a page such as formmail
Currently each trigger is programmed/loaded into the site individually, and whenever it could possibly not be the user's fault I try to give out a useful message (nb the formmail 403 has no nice explanation to go with it).
Okay, tired now. I'm off to bed!
- Tony
Is that a robots.txt?
The reason I ask is that robots.txt is optional meaning that normally there is nothing in place to enforce its rules, and most of the bots you really want to block just ignore it anyway!
Ok. Then what's the intended purpose of robots.txt...in the general scheme of things?
Most importantly, is "-" "-" something I should put in my .htaccess file or just ban by IP Number?
Should I ban in .htaccess or individually by IP Numbers?
I'm an ASP person...
From there on down you lost me. :( Don't know anything about ASP.
Pendanticist.
As explained earlier it's very much an "honour system" thing in that there is no physical mechanism linking robots.txt to how the server interprets/handles requests.
Now nice crawlers read robots.txt and obey it because they understand it's in their best interests, however there's always going to be a minority which ignore it.
At this point you can either just leave them be or you can identify & block them at the webserver level - which is where the 403 stuff comes in.
Should I ban in .htaccess or individually by IP Numbers?
If it were me I'd block the UA via .htaccess, but I'd make sure that *whatever* happened robots.txt always gets served.
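For the .htaccess side of that, a minimal sketch using Apache's mod_rewrite (assuming it's available on your server - treat this as a starting point, not a tested ruleset) might look like this. It returns a 403 to blank-user-agent requests while always letting robots.txt through:

```
# Sketch only - assumes mod_rewrite is enabled.
RewriteEngine On
# Never block robots.txt itself, so blocked crawlers can still read it
RewriteCond %{REQUEST_URI} !^/robots\.txt$
# Match an empty (or literal "-") user-agent header
RewriteCond %{HTTP_USER_AGENT} ^-?$
# Forbid anything that matches both conditions (403)
RewriteRule .* - [F]
```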
- Tony