
bandwidth leech protect script needed

         

bigsmile

9:18 am on Feb 19, 2006 (gmt 0)



Hi all!
I'm new here. Sorry if I'm posting in the wrong thread.

I have a problem with traffic on my site. I have a 15 GB/month traffic quota, which is fine for normal use, but some people use offline-browsing software like "Teleport", WebZIP, WebSiteExtractor and others. These programs identify themselves as regular browsers, so I can't detect them with $HTTP_USER_AGENT.
I also don't want to make password-protected areas or captchas, because I want to let Google crawl my pages (my site is 700 MB of text).
I wrote this code to prevent leeching, but I'm not sure it will help.
(some error check omitted)
PHP Code:
// `ip` needs a UNIQUE key so this INSERT fails silently for IPs already seen
$ip = mysql_real_escape_string($_SERVER['REMOTE_ADDR']);
$res = mysql_query("INSERT INTO visitors (`ip`, `count`) VALUES ('$ip', 0)");
$res = mysql_query("UPDATE visitors SET count = count + 1 WHERE ip = '$ip'");
$res = mysql_query("SELECT `count` FROM visitors WHERE ip = '$ip'");
$item = mysql_fetch_array($res);
if ($item['count'] > 500)
    die();

I put this code at the beginning of the header include file, so it runs on every page, and the counter is zeroed once a day.

My questions are:
Is this approach right? I'm not sure about using $_SERVER['REMOTE_ADDR'] and the die() at the end.
Could I detect leechers from the time interval between page requests instead? If so, how do I make sure it's a leecher and not the Google spider?

p.s.: I already tried to catch leechers with "nofollow" and invisible links and ban their IPs, but that didn't work.

Thank you for any suggestions.

Sergey.
[gumer.info...]

Matt Probert

9:33 am on Feb 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The eternal question!

Your problem is one of individual readers accessing too many pages in a day. Whether they are using an automatic system, or browsing manually is beside the point.

By limiting the number of pages requested, you may find you offend genuine readers. Perhaps you might consider increasing your monthly transfer allowance?

Matt

bigsmile

9:08 pm on Feb 19, 2006 (gmt 0)



My problem is not "individual readers accessing too many pages in a day" but bad people downloading 2 GB in a single day to burn through my traffic quota and take the site down.

ronburk

1:26 am on Feb 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's a problem. For example, suppose you implement your solution and I'm a bad guy who is coming in via an AOL dial-up. Your code detects me and effectively bans my IP address and kills my wget (or whatever), so I hang up. Now someone else gets that IP address and says "Huh! What a hosed website -- it won't let me read anything there."

If you absolutely positively must penalize an IP address, then resetting once per day is certainly a good idea -- but could still, at least theoretically, end up banning quite a few normal humans each day (if you happen to hit an IP address that a major ISP is shuffling around to different visitors pretty frequently). I would rather penalize an IP address for a shorter period, like 10 minutes.
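A 10-minute rolling penalty could be sketched like this (an assumption-laden sketch, not the poster's script: the in-memory array stands in for the MySQL table, with count, window-start and banned-until columns keyed by IP):

```php
<?php
// Sketch of a short rolling ban instead of a once-a-day counter reset.
const MAX_HITS   = 500;   // hits allowed inside one window
const WINDOW     = 600;   // seconds before the counter restarts
const BAN_LENGTH = 600;   // penalize for 10 minutes, not a whole day

function allow_request(array &$visitors, $ip, $now)
{
    if (!isset($visitors[$ip])) {
        $visitors[$ip] = array('count' => 0, 'start' => $now, 'banned_until' => 0);
    }
    $v = &$visitors[$ip];

    if ($now < $v['banned_until']) {
        return false;                      // still serving the penalty
    }
    if ($now - $v['start'] > WINDOW) {
        $v['count'] = 0;                   // window expired: start fresh
        $v['start'] = $now;
    }
    $v['count']++;
    if ($v['count'] > MAX_HITS) {
        $v['banned_until'] = $now + BAN_LENGTH;
        return false;
    }
    return true;
}
```

An AOL dial-up IP that gets reassigned is then only locked out for ten minutes, not the rest of the day.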

Assuming the bots are not hyper-intelligent, you might consider the problem as a) detect the bot and b) serve the bot fake pages that have no out-going links. The point of the latter, of course, is to give the bot no reason to think that they should just keep trying harder to connect to your website. Instead, make them think that they're done, that there are no more pages to fetch.

To avoid smacking Google, MSN, Yahoo!, etc., you pretty much have to white-list the IP addresses they are known to use; you are already dealing with bots that lie about themselves, so there's nothing to keep one from claiming to be GoogleBot.
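An alternative to a hand-maintained IP list is the double DNS lookup: reverse-resolve the requesting IP, check that the hostname ends in the engine's domain, then forward-resolve that hostname and confirm it maps back to the same IP. A sketch (the resolver callbacks are an assumption I've added so the logic can be exercised without network access; in a live script you would pass 'gethostbyaddr' and 'gethostbyname'):

```php
<?php
// Verify a claimed crawler by double DNS lookup.  $reverse/$forward
// are injectable resolvers; in production use the built-in
// gethostbyaddr()/gethostbyname().
function is_real_crawler($ip, array $domains, $reverse, $forward)
{
    $host = call_user_func($reverse, $ip);
    if ($host === false || $host === $ip) {
        return false;                          // no PTR record at all
    }
    $ok = false;
    foreach ($domains as $domain) {
        // hostname must END with ".googlebot.com" etc., not just contain it
        $suffix = '.' . $domain;
        if (substr($host, -strlen($suffix)) === $suffix) {
            $ok = true;
            break;
        }
    }
    if (!$ok) {
        return false;
    }
    // forward-confirm: the hostname must resolve back to the caller's IP
    return call_user_func($forward, $host) === $ip;
}
```

A bot can fake its User-Agent, but it cannot fake the PTR record for its IP plus the matching forward lookup.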

You can also try the following bot-detecting approach.

a) add an entry to robots.txt that tells all bots to not fetch "/fredflintstone.html" (or whatever)
b) add a hidden link from all pages (or maybe just the home page will do) to "/fredflintstone.html" -- e.g., a hot-linked 1-pixel white spot would qualify as "hidden" in this context.
c) when an IP address does a fetch of "/fredflintstone.html", add it to your badbot list and, for the next ten minutes (or whatever), serve it only "/nolink.html", instead of whatever page it asks for, where that file looks like "<html><head></head><body></body></html>".
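Steps a) through c) might be sketched like this (untested on a live site; the in-memory array stands in for a real badbot table, and the file names are the placeholders from the steps above):

```php
<?php
// Honeypot trap: robots.txt already says "Disallow: /fredflintstone.html",
// so only a robots-ignoring bot ever requests it.
define('TRAP_URL', '/fredflintstone.html');
define('BAN_SECS', 600);    // ten minutes, as suggested above

// Returns the dead-end page to serve, or null to serve the real page.
function handle_request(array &$badbots, $ip, $url, $now)
{
    if ($url === TRAP_URL) {
        $badbots[$ip] = $now + BAN_SECS;   // it took the bait
    }
    if (isset($badbots[$ip]) && $now < $badbots[$ip]) {
        // serve /nolink.html instead of whatever was asked for
        return '<html><head></head><body></body></html>';
    }
    return null;
}
```

Because the dead-end page has no outgoing links, a well-behaved mirroring tool concludes it has fetched everything and stops.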

I would explicitly set the cache expiration header to an already expired date when transmitting the faked-up files. You don't want some giganto-caching server (like AOL) caching the fake HTML files and then serving them up to any "real" users who request them shortly thereafter.
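One way to do that from PHP (a sketch; the function just builds the header lines so they can be inspected before being passed to header()):

```php
<?php
// Headers marking the faked-up page as already stale, so a big caching
// proxy (AOL etc.) won't store it and hand it to real visitors later.
function expired_cache_headers()
{
    return array(
        'Expires: Thu, 01 Jan 1970 00:00:00 GMT',             // a date in the past
        'Cache-Control: no-store, no-cache, must-revalidate',
        'Pragma: no-cache',                                   // for HTTP/1.0 caches
    );
}

foreach (expired_cache_headers() as $h) {
    header($h);
}
```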

AlexK

1:43 am on Feb 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Since you are using PHP, this problem has already been solved for you [webmasterworld.com]. The script can even be used on HTML files [webmasterworld.com] (msg58), as long as you have access to PHP.