Anti-Leech Script Questions
Answers to a stickymail on Bot-Blocking Script
AlexK
msg:3210225 - 7:12 am on Jan 6, 2007 (gmt 0)

I got a sticky from someone about the Bot-Blocking script that I champion, asking some obvious questions. Since questions in one person's mind are likely to be questions in many other minds, here are public answers to the private sticky.

First: Obtain Bad-Bot-Blocking script here [modem-help.freeserve.co.uk].
Second: The previous post is here [webmasterworld.com].

These are the questions:

  1. Is the final code linked to from your main first post the newest and most up to date?
  2. Does that have within it the avoidance of blocking good bots?
  3. You have a majority of the document commented out and only a small snippet at the start active. Is that all that is needed?

These are the answers:

1 Is the final code linked to from your main first post the newest and most up to date?
Yes. The link at top is (at the time of writing) the latest available.

The script is also available from the Downloads pages on my site. Changes at the free-ISP that hosts the top link mean that I can no longer update that script, so for future changes you will need to search my home site.

3 You have a majority of the document commented out and only a small snippet at the start active. Is that all that is needed?
It depends on what you want!

The snippet of "active code" that lies between the 'Start(Stop) blocking badly-behaved bots : top code' comment lines does exactly what it says: it blocks badly-behaved bots.

You also need to pay attention to the very first comment block. The part of this that cannot be avoided (before the snippet gets processed) is to:

  1. define _B_DIRECTORY
  2. define _B_LOGFILE
  3. define _B_LOGMAXLINES
I hope that it goes without saying that _B_DIRECTORY also needs to pre-exist on the server, with the correct permissions. The previous post [webmasterworld.com] contains an extended discussion of this. In particular, chu2117 had a problem with the server backup-process screwing up the bot-block algorithm. The fix for that is here [webmasterworld.com].
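
By way of illustration only (the paths and limits below are placeholders, not the script's defaults):

    <?php
    // Illustrative set-up only: adjust paths and limits to your own server.
    define( '_B_DIRECTORY',   '/home/example/botblock/' );          // must pre-exist, writable by PHP
    define( '_B_LOGFILE',     _B_DIRECTORY . 'blocked-bots.log' );  // where blocks get recorded
    define( '_B_LOGMAXLINES', 100 );                                // trim the log beyond this many lines
    ?>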

The rest of the comments answer common questions and try to anticipate possible problems in particular situations. They also offer additional options; in particular:

2 Does that have within it the avoidance of blocking good bots?
The top snippet of code does not implement a whitelist.

The comments offer three means of implementing a whitelist, any or all of which may be used - your choice.
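
By way of illustration only, the simplest variant looks something like this (the array and its contents are placeholders, not a quote from the script's comments):

    // A user-agent whitelist sketch (illustrative; the script's comments
    // offer three variants, and this reproduces none of them exactly):
    $whitelist = array( 'Googlebot', 'Slurp', 'msnbot' );
    foreach ( $whitelist as $goodBot ) {
        if ( stripos( $_SERVER['HTTP_USER_AGENT'], $goodBot ) !== false ) {
            return;   // whitelisted bot: skip the blocking checks (the script is include()d)
        }
    }

Bear in mind that user-agent strings are trivially spoofed, which is one reason the whitelist is optional rather than built in.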

I do not use a whitelist on my site. My experience of so-called "good" bots is that they frequently go bad. I therefore tune the script variables--in particular $bTotVisit, $bStartOver and $ipLength--to match my site visitor patterns. If any bot goes wild, it gets banned, whoever it is, "good" or "bad".

 

Decius
msg:3210623 - 7:22 pm on Jan 6, 2007 (gmt 0)

I think it would be helpful to clearly explain the general mechanism that this script uses, so that those reading the code can follow along. It seems it doesn't use IP logging in a simple way, but rather some form of file/directory time-date checking. This is important for tweaking. For example, I do not know why "Max visits allowed within $bInterval (MUST be > $bInterval)".

Thanks Alex.

AlexK
msg:3211917 - 9:25 am on Jan 8, 2007 (gmt 0)

Decius:
I think it would be helpful to clearly explain the general mechanism that this script uses

Sorry for the delay in replying - I have just returned from visiting my son's family. His soon-to-be-5-years-old daughter Michaela took part in a ballet presentation (Nutcracker, plus street-dance by some of the elder girls) by her ballet-class. Totally delightful.

The info that you ask for is within the comments (just above the Changelog), although it only makes sense if you have read through all of the referenced threads (!).

Your point is valid, of course - the mass of comments has been added piece-by-piece following each thread on this routine, and has grown to the point where it becomes tiring to make sense of it all. I'll make an attempt at a précis:

General mechanism & Routine logic:

The routine uses zero-byte files ($ipFile) within _B_DIRECTORY to track individual users by IP-address. An algorithmic reduction--controlled by $ipLength--is used to keep the number of these files within manageable proportions.

Twin timestamps (mtime, atime) on each $ipFile are used to track both the gross number and the frequency of accesses, either of which can trip a block on further access:
  • gross access (slow-scraper, controlled by $bTotVisit)
  • frequency (fast-scraper, controlled by {$bMaxVisit / $bInterval})
A roll-over period ($bStartOver) prevents permanent exclusion.
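
Very roughly, the flow is as follows. This is a sketch only, NOT the script itself: it assumes a simple substr() reduction for $ipLength, glosses over how the real code packs its counters into the twin timestamps, and omits the frequency test entirely:

    // Simplified illustration of the mechanism described above.
    $now    = time();
    $ipKey  = substr( $_SERVER['REMOTE_ADDR'], 0, $ipLength );  // reduced IP key (assumed reduction)
    $ipFile = _B_DIRECTORY . $ipKey;                            // zero-byte tracking file

    if ( !file_exists( $ipFile ) ) {
        touch( $ipFile, $now, $now );            // first visit: mtime marks the window start
    } else {
        $first  = filemtime( $ipFile );          // start of this visitor's window
        $visits = fileatime( $ipFile ) - $first; // visit-count packed as an atime offset
        if (( $now - $first ) > $bStartOver ) {
            touch( $ipFile, $now, $now );        // roll-over: start counting afresh
        } elseif ( $visits >= $bTotVisit ) {
            exit( 'Blocked: too many visits' );  // gross-access (slow-scraper) trip
        } else {
            touch( $ipFile, $first, $first + $visits + 1 );  // record one more visit
        }
    }
    // (The fast-scraper / frequency test is omitted here for brevity.)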

Fast-Scraper reset logic:
The fast-scraper test is: (( $visits / $duration ) > ( $bMaxVisit / $bInterval )). Blocking therefore needs to stop when (( $visits / $duration ) == ( $bMaxVisit / $bInterval )). Since $duration should equal $bPenalty, that equation can be solved for $visits. As the test does not begin until ( $visits >= $bMaxVisit ), it is imperative that ( $bMaxVisit / $bInterval ) >= 1.
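
With some illustrative numbers (mine for this post, not the script's defaults), that works out as:

    // Illustrative values only; note ( $bMaxVisit / $bInterval ) >= 1, as required.
    $bMaxVisit = 30;    // max visits allowed within $bInterval
    $bInterval = 20;    // seconds
    $bPenalty  = 600;   // seconds for which a blocked visitor stays blocked

    // Blocking stops when ( $visits / $duration ) == ( $bMaxVisit / $bInterval );
    // with $duration == $bPenalty, solving for $visits gives:
    $visits = $bPenalty * $bMaxVisit / $bInterval;   // 600 * 30 / 20 = 900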


The comment at the declaration of each control-variable (together with its name) should make the use of each one self-evident. The only exception may be $ipLength; the best advice there is to use the default, and increase it by one notch if users are being wrongly blocked due to conflation of IP-addresses.

I hope that all of the above is clear. I am normally a night-worker, and have time-shifted by 12 hours to be able to visit my son (as at top). At this instant I'm still suffering the side-effects of that, so my explanations may not be totally clear.

All the best.

selomelo
msg:3216214 - 4:55 pm on Jan 11, 2007 (gmt 0)

Thank you for this great script. I have been using it since July last year, and it works well, although I did make a mistake initially that led the search engines to deindex my 8,000 pages.

However, I think a MySQL version (instead of thousands of files for tracking) would be a great feature.

AlexK
msg:3216858 - 1:30 am on Jan 12, 2007 (gmt 0)

selomelo:
However, I think a MySQL version ... would be a great feature.

Hmm. I think that I shall leave you to implement that, selomelo.

The current implementation is blindingly fast. I suspect that the involvement of MySQL could add unacceptable delays. I know that the use of many thousands of files may raise eyebrows, but as zero-byte files they consume vanishingly-small resources.

...although I did make a mistake initially that led the search engines to deindex my 8,000 pages.

What on earth did you do?

selomelo
msg:3217107 - 11:03 am on Jan 12, 2007 (gmt 0)

AlexK:
What on earth did you do?

I did a stupid thing. But before telling the details of my story, let me say that I am not a programmer; I cannot even call myself a novice. I know no PHP coding, and have only some basic knowledge of algorithms from my university years.

Now, the fastest way to deindex pages:

I combined two different versions of your code (bot + IP check):

It was something like this: Check if it is a bot. If yes, exit. Else check the IP.

What led to deindexing was the simple line:

If bot = true then exit;

Before implementing this code, my site was well indexed (some 8,000 pages). After the implementation, the site started being gradually deindexed by Google, and within one month the whole site was deindexed: no supplementals, no cache, nothing, as if my site were nonexistent. I even asked for help here: [webmasterworld.com...]

At first sight, everything seemed OK. Google was crawling my site as always. But there was a strange observation: the log file showed a typical pattern: Http Code: 200 Date: #*$!xx Http Version: HTTP/1.1 Size in Bytes: 5. That is to say, Google (as well as Yahoo and MSN) was fetching only 5 bytes!

In other words, the bots were visiting the site only to receive an instruction to exit, and were leaving with essentially empty content. And since there was nothing to index, all the search engines were deindexing the pages! Simple as that!

After some desperate and wrong attempts, I noticed my mistake and modified the code to "If not bot then check IP." Luckily, both Google and Yahoo (with the help of sitemaps) quickly re-indexed the pages.
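
In rough pseudo-PHP (isBot() and checkIP() are just stand-in helpers, not names from the actual script), the difference was:

    // The mistake: whitelisted bots exited before any content was sent,
    // so the engines fetched effectively empty pages.
    if ( isBot() ) {
        exit;           // bot gets a near-empty response: pages deindexed
    }
    checkIP();

    // The fix: bots skip the IP check but still receive the page.
    if ( !isBot() ) {
        checkIP();      // only non-bots face the blocking checks
    }
    // ...normal page content is served from here...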

Can I recommend this for those who want to quickly remove their pages from the indexes? :)

AlexK
msg:3217306 - 2:59 pm on Jan 12, 2007 (gmt 0)

selomelo:
(some 8000 pages) ... within one month, the whole site was deindexed

Cripes!

I've had to raise $bTotVisit to 8,000 (visits within 24 hours, $ipLength=4) because of Google. The G bot and the Adsense bot tend to browse on the same IP. I refuse to use the Whitelist, and such a high figure was the only way to stop those bots getting blocked.

My site uses a Content-Negotiation class, so the bots get plenty of 304s (something normal PHP pages miss out on) and bandwidth stays low. All the same... 8,000 visits in one day from a bot, on a 15,000-a-day site.
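
For anyone unfamiliar with conditional GETs, the bare bones of the idea look like this (a sketch only, not my Content-Negotiation class; $contentFile is a placeholder for whatever file holds your page's content):

    // Minimal conditional-GET handling: well-behaved bots re-requesting an
    // unchanged page get a bodyless 304, saving the bandwidth of a full fetch.
    $lastModified = filemtime( $contentFile );   // $contentFile: your page's source/cache file
    header( 'Last-Modified: ' . gmdate( 'D, d M Y H:i:s', $lastModified ) . ' GMT' );

    if ( isset( $_SERVER['HTTP_IF_MODIFIED_SINCE'] ) &&
         strtotime( $_SERVER['HTTP_IF_MODIFIED_SINCE'] ) >= $lastModified ) {
        header( 'HTTP/1.1 304 Not Modified' );
        exit;                                    // no body sent
    }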

selomelo
msg:3217523 - 5:52 pm on Jan 12, 2007 (gmt 0)

I am aware of how vulnerable this "if not bot check IP" approach is. But I also know that those trying to scrape my site(s) are as much novices as I am! At least for now. Before I implemented your code, they were trying to grab all the content with HTTrack and similar programs, and I was on the watch to ban their IPs from cPanel.

AlexK
msg:3219909 - 4:12 pm on Jan 15, 2007 (gmt 0)

selomelo:
I am aware of how vulnerable this "if not bot check IP" approach is

Please do not misunderstand me - the bot whitelist is fine, if that is the way that you want to go (and also the reason that the code is in there).

I'm lazy, and simply do not want to maintain more code than I have to. I also get upset at the way these so-called good bots often run wild, and take a perverse pleasure in seeing them banned. I'm not actually proud of that latter emotion, since I'm probably 'cutting off my own nose'.
