
Can I *really* ban aggressive robots without upsetting Google accidentally?

corpuscle

4:12 pm on Aug 12, 2002 (gmt 0)

10+ Year Member



Hello,

A Forum Newbie here! I keep a close eye on this forum, as it is by far the most worthwhile read when it comes to keeping up to speed with Google. So much so that this is my first question in 8 months of reading the forum:

My site has relatively good positions in search results, lots of content and lots of pages.

As a result, I regularly have non-SE robots crawling my site; I don't know what they do with the data, but at worst they could replicate my site content.

I manually analyse my logs and "ban" (through .htaccess) client IPs that seem to be making an excessive number of requests to my server. This works, but it is a never-ending task.
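For reference, the bans themselves are just standard Apache allow/deny entries in .htaccess - the addresses below are made up:

Order allow,deny
Allow from all
# one line per offending IP (or partial network, e.g. "198.51.100.")
Deny from 192.0.2.45
Deny from 198.51.100.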

I want to automate this procedure. I was thinking of writing a script to analyse my logs daily and to "ban" any IP address that makes more than a certain number of requests. I was going to limit this "auto-ban" to clients with undefined or generic Mozilla-like user-agent headers characteristic of suspicious robots.
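To give an idea, here is a rough (untested) sketch of the kind of script I mean - it only counts requests per IP and ignores the user-agent filtering for now, and the log path, threshold and whitelisted prefix are placeholders I made up:

<?php
// Untested sketch: count requests per client IP in an Apache
// combined-format log and print "Deny from" lines for heavy users.
$logfile   = "/var/log/apache/access_log";   // placeholder path
$threshold = 2000;                           // placeholder daily limit
$whitelist = array("216.239.");              // placeholder "never ban" prefixes

$counts = array();
$fp = fopen($logfile, "r");
if (!$fp) exit("cannot open log\n");
while (!feof($fp)) {
    $line = fgets($fp, 4096);
    $ip = strtok($line, " ");                // the client IP is the first field
    if ($ip === false || $ip === "") continue;
    if (!isset($counts[$ip])) $counts[$ip] = 0;
    $counts[$ip]++;
}
fclose($fp);

foreach ($counts as $ip => $n) {
    if ($n < $threshold) continue;
    $banned = 1;
    foreach ($whitelist as $prefix) {
        if (strpos($ip, $prefix) === 0) $banned = 0;   // never ban whitelisted ranges
    }
    if ($banned) echo "Deny from $ip\n";               // lines to paste into .htaccess
}
?>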

Now, I can see two problems:

1) I obviously don't want to exclude acceptable robots (such as Google). That's easy enough, since googlebot's UA and IP range are well known (the simple check I have in mind is sketched further down). From reading this forum, however, I understand that Google may make requests from a non-Google user-agent client (from some IP range that I don't know). I don't want to inadvertently ban such a client, because it would then look like I am cloaking, which I am not.

2) A large number of users may be coming from a single proxy. I don't want to ban that proxy just because it looks like an invasive client!

Now, 2) is not a Google-specific problem, but 1) is very important, since Google traffic feeds much of my site - I can't afford any issues with Google.
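For the easy half of 1), the check I have in mind is something like the function below: skip the ban for anything that calls itself Googlebot or whose reverse DNS looks like a Google host. The hostname suffixes are my own guess rather than a confirmed list, and of course this still doesn't cover a Google client with an unknown UA and IP:

<?php
// Sketch only: decide whether a client should be exempt from auto-banning.
// The hostname suffixes are assumptions, not a verified list of Google hosts.
function looks_like_google($ip, $ua) {
    if (strpos($ua, "Googlebot") !== false) return 1;  // advertised crawler UA
    $host = gethostbyaddr($ip);                        // slow; returns the IP itself on failure
    if (strpos($host, ".googlebot.com") !== false) return 1;
    if (strpos($host, ".google.com") !== false) return 1;
    return 0;
}
// In the log script above, an IP would only be banned when this returns 0.
?>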

Other than writing a very intelligent script which looks carefully at the log patterns for any invasive clients, I can't see a good solution to this problem.

Have any of you tried to tackle this, and if so, could you make some general recommendations? For example, do you know whether Google's non-Googlebot UA client makes fewer than X requests per day?

Thank you for any suggestions you might have!

Key_Master

5:30 pm on Aug 12, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You'll be ok. Just don't ban the Googlebot user agent or any part of the Google IP block.

I think you may be partly referring to a previous thread where GG and I were dealing with a 403 issue. It's worth noting that even Google blocks the Wget agent and many other bad bots. It won't be held against you if you do the same.
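If you'd rather handle the obvious bad agents in .htaccess than in a script, something along these lines does the job (mod_setenvif; the agent names are only examples):

SetEnvIfNoCase User-Agent "Wget" bad_bot
SetEnvIfNoCase User-Agent "WebCopier" bad_bot
SetEnvIfNoCase User-Agent "Teleport" bad_bot
Order allow,deny
Allow from all
Deny from env=bad_bot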

cYbErDaRk

8:40 am on Aug 13, 2002 (gmt 0)

10+ Year Member



Hi

I keep a list of these kinds of robots that I ban from my site outright. Well, not exactly robots, but offline crawling programs such as wget or Teleport. It's easy in PHP:

// Grab the user-agent header, defaulting to an empty string if it is absent.
$HTTP_USER_AGENT = isset($HTTP_SERVER_VARS["HTTP_USER_AGENT"]) ? $HTTP_SERVER_VARS["HTTP_USER_AGENT"] : "";

// strpos() returns the match position (an integer, possibly 0) or false, so
// test against false - comparing === true never matches. It is also
// case-sensitive; wget, for example, identifies itself as "Wget/...".
if (strpos($HTTP_USER_AGENT,"Teleport") !== false) exit;
if (strpos($HTTP_USER_AGENT,"Wget") !== false) exit;
if (strpos($HTTP_USER_AGENT,"WebCopier") !== false) exit;
if (strpos($HTTP_USER_AGENT,"WebReaper") !== false) exit;
if (strpos($HTTP_USER_AGENT,"Website eXtractor") !== false) exit;
if (strpos($HTTP_USER_AGENT,"www.plagiarism.org") !== false) exit;

and so on.

Regards