Forum Moderators: DixonJones
I have a problem with traffic on my site. I have a 15 GB/month traffic quota, which is enough under normal use, but some people use software for off-line browsing like "Teleport", WebZIP, WebSiteExtractor and others. These programs identify themselves as regular browsers, so I can't detect them using $HTTP_USER_AGENT.
I also don't want to make password-protected areas or captchas, because I want to allow Google to crawl my pages (my site is 700 MB of text).
I wrote this code to prevent leeching, but I'm not sure whether it will help.
(some error checking omitted)
PHP Code:
$sql = "insert into visitors (`ip`,`count`) values ('".$_SERVER['REMOTE_ADDR']."',0);";
$res = mysql_query($sql) ;
$sql = "update visitors set count=count+1 where ip='".$_SERVER['REMOTE_ADDR']."';";
$res = mysql_query($sql) ;
$sql = "select * from visitors where ip='".$_SERVER['REMOTE_ADDR']."';";
$res = mysql_query($sql) ;
$item=mysql_fetch_array($res);
if($item['count'] > 500 )
die;
I put this code at the beginning of my header include file so it runs on every page, and the counter is zeroed every day.
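(For completeness, the daily reset is just a small script run from cron; the file name, path and schedule below are only examples, and db_connect.php stands in for whatever opens the database connection.)
PHP Code:
// reset_visitors.php -- run once a day from cron, e.g.
//   0 0 * * * /usr/bin/php /path/to/reset_visitors.php
require 'db_connect.php';                 // hypothetical include that opens the mysql connection
mysql_query("TRUNCATE TABLE visitors;");  // wipe all per-IP counters for the new day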
My questions are:
Is this approach right? I'm not sure about using $_SERVER['REMOTE_ADDR'] and the die() at the end.
Maybe I can detect leechers by the time interval between page requests? If so, how do I make sure it's a leecher and not the Google spider?
P.S.: I already tried to catch leechers with "nofollow" and invisible links and banning their IPs, but this doesn't work.
Thank you for any suggestions.
Sergey.
[gumer.info...]
Your problem is one of individual readers accessing too many pages in a day. Whether they are using an automatic system or browsing manually is beside the point.
By limiting the number of pages requested, you may find you offend genuine readers. Perhaps you might consider increasing your monthly transfer allowance?
Matt
If you absolutely positively must penalize an IP address, then resetting once per day is certainly a good idea -- but could still, at least theoretically, end up banning quite a few normal humans each day (if you happen to hit an IP address that a major ISP is shuffling around to different visitors pretty frequently). I would rather penalize an IP address for a shorter period, like 10 minutes.
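For what it's worth, here is a minimal sketch of that shorter window, assuming a `last_seen` TIMESTAMP column is added to the poster's visitors table and that `ip` is a unique key (the 10-minute window and 500-request limit are only placeholders):
PHP Code:
$ip     = mysql_real_escape_string($_SERVER['REMOTE_ADDR']);
$window = 600;   // penalty window in seconds (10 minutes)
$limit  = 500;   // requests allowed per window

// start a fresh window if there is no row yet or the old window has expired
mysql_query("INSERT INTO visitors (`ip`, `count`, `last_seen`) VALUES ('$ip', 0, NOW())
             ON DUPLICATE KEY UPDATE
                 `count`     = IF(`last_seen` < NOW() - INTERVAL $window SECOND, 0, `count`),
                 `last_seen` = NOW();");

// count the current request
mysql_query("UPDATE visitors SET `count` = `count` + 1 WHERE ip = '$ip';");

$res  = mysql_query("SELECT `count` FROM visitors WHERE ip = '$ip';");
$item = mysql_fetch_array($res);

if ($item['count'] > $limit) {
    die();   // or serve the empty page described below
}
This way a shared or recycled IP address is only locked out for ten minutes after it goes over the limit, not for the rest of the day.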
Assuming the bots are not hyper-intelligent, you might consider the problem as a) detect the bot and b) serve the bot fake pages that have no outgoing links. The point of the latter, of course, is to give the bot no reason to think it should just keep trying harder to connect to your website. Instead, make it think that it's done, that there are no more pages to fetch.
To avoid smacking Google, MSN, Yahoo!, etc., you pretty much have to white-list the IP addresses they are known to use: you are already dealing with bots that lie about themselves, so there's nothing to keep them from claiming to be GoogleBot.
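One way to keep that white-list current without hard-coding IP ranges is a reverse-then-forward DNS check, which the major engines support; here is a rough PHP sketch (the accepted domain list is only an example, and the function name is made up):
PHP Code:
// returns true if $ip really belongs to a known search engine crawler
function is_real_search_bot($ip) {
    $host = gethostbyaddr($ip);              // reverse DNS lookup
    if ($host === false || $host === $ip) {
        return false;                        // no reverse record at all
    }
    // accept only hostnames under the engines' own domains
    if (!preg_match('/\.(googlebot\.com|google\.com|search\.msn\.com|crawl\.yahoo\.net)$/i', $host)) {
        return false;
    }
    // the forward lookup must point back at the same IP, otherwise the name is spoofed
    return gethostbyname($host) === $ip;
}
Anything that passes this check is exempt from the limits; everything else is fair game. Since the two DNS lookups are slow, it makes sense to cache the verdict per IP rather than repeating them on every request.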
You can also try the following bot-detecting approach (a rough sketch of the whole thing follows the list).
a) add an entry to robots.txt that tells all bots to not fetch "/fredflintstone.html" (or whatever)
b) add a hidden link from all pages (or maybe just the home page will do) to "/fredflintstone.html" -- e.g., a hot-linked 1-pixel white spot would qualify as "hidden" in this context.
c) when an IP address does a fetch of "/fredflintstone.html", add it to your badbot list and, for the next ten minutes (or whatever), serve it only "/nolink.html", instead of whatever page it asks for, where that file looks like "<html><head></head><body></body></html>".
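Putting those three steps together, the trap could look roughly like this; the file names, the badbots table (with `ip` as a unique key) and the ten-minute figure are all placeholders:
PHP Code:
// robots.txt already contains:
//   User-agent: *
//   Disallow: /fredflintstone.html
//
// step b) every page carries a hidden link such as
//   <a href="/fredflintstone.html"><img src="dot.gif" width="1" height="1" alt=""></a>
//
// This goes at the top of the shared header include.
$ip = mysql_real_escape_string($_SERVER['REMOTE_ADDR']);

// step c) anyone fetching the trap page gets flagged
if ($_SERVER['REQUEST_URI'] == '/fredflintstone.html') {
    mysql_query("INSERT INTO badbots (`ip`, `banned_at`) VALUES ('$ip', NOW())
                 ON DUPLICATE KEY UPDATE `banned_at` = NOW();");
}

// while the flag is fresh, serve the empty page instead of whatever was requested
$res = mysql_query("SELECT 1 FROM badbots
                    WHERE ip = '$ip' AND banned_at > NOW() - INTERVAL 10 MINUTE;");
if (mysql_num_rows($res) > 0) {
    readfile($_SERVER['DOCUMENT_ROOT'] . '/nolink.html');
    exit;
}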
I would explicitly set the cache expiration header to an already expired date when transmitting the faked-up files. You don't want some giganto-caching server (like AOL) caching the fake HTML files and then serving them up to any "real" users who request them shortly thereafter.
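In PHP, those headers for the faked-up files would look something like this, sent before any output:
PHP Code:
// mark the fake page as already expired so proxies and browsers won't keep it
header('Expires: Thu, 01 Jan 1970 00:00:00 GMT');             // a date in the past
header('Cache-Control: no-store, no-cache, must-revalidate'); // HTTP/1.1 caches
header('Pragma: no-cache');                                   // older HTTP/1.0 proxies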