Forum Moderators: coopster
First: Obtain Bad-Bot-Blocking script here [modem-help.freeserve.co.uk].
Second: The previous post is here [webmasterworld.com].
These are the questions:
1 Is the final code linked to from your main first post the newest and most up to date?
Yes. The link at the top is (at the time of writing) the latest available.
The script is also available from the Downloads pages on my site. Changes at the free-ISP that hosts the top link mean that I can no longer update that script so, for future changes, you will need to search my home site.
3 You have a majority of the document commented out and only a small snippet at the start active. Is that all that is needed?
It depends on what you want!
The snippet of "active code" that lies between the 'Start(Stop) blocking badly-behaved bots : top code' comment lines does exactly what it says: it blocks badly-behaved bots.
You also need to pay attention to the very first comment block. The part that you cannot avoid is setting the following (before the snippet gets processed):
_B_DIRECTORY
_B_LOGFILE
_B_LOGMAXLINES
_B_DIRECTORY also needs to pre-exist on the server, with the correct permissions. The previous post [webmasterworld.com] contains an extended discussion of this. In particular, chu2117 had a problem with the server backup-process screwing up the bot-block algorithm. The fix for that is here [webmasterworld.com].
The rest of the comments answer common questions and try to anticipate possible problems in particular situations. They also offer additional options; in particular:
2 Does that have within it the avoidance of blocking good bots?
The top snippet of code does not implement a whitelist.
The comments offer 3 means of implementing a whitelist, any or all of which may be used - your choice.
I do not use a whitelist on my site. My experience of so-called "good" bots is that they frequently go bad. I therefore tune the script variables--in particular $bTotVisit, $bStartOver and $ipLength--to match my site's visitor patterns. If any bot goes wild, it gets banned, whoever it is, "good" or "bad".
Thanks Alex.
I think it would be helpful to clearly explain the general mechanism that this script uses.
The info that you ask for is within the comments (just above the Changelog), although it only makes sense if you have read through all of the referenced-threads (!).
Your point is valid, of course - the mass of comments has been added piece-by-piece following each thread on this routine, and has grown to the point where it becomes tiring to make sense of it all. I'll attempt a precis:
General mechanism & Routine logic:
The routine uses zero-byte files ($ipFile) within _B_DIRECTORY to track individual users by IP-address. An algorithmic reduction--controlled by $ipLength--is used to keep the number of these files within manageable proportions.
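As a sketch of that tracking mechanism: the script itself is PHP, but the idea can be modelled in a few lines of Python. The exact semantics of $ipLength are an assumption here - I take it to keep the first N characters of the dotted-quad, so that nearby addresses share one zero-byte file - and the directory name is a stand-in for _B_DIRECTORY:

```python
import os
import tempfile

B_DIRECTORY = os.path.join(tempfile.gettempdir(), "bot-block")  # stand-in for _B_DIRECTORY
IP_LENGTH = 4                                                   # stand-in for $ipLength

def ip_file(ip: str, ip_length: int = IP_LENGTH) -> str:
    """Reduce an IP to a tracking filename.

    Assumption: $ipLength keeps the first N characters of the
    dotted-quad, so e.g. 192.168.1.23 and 192.168.1.99 map to
    the same zero-byte file when ip_length=4 ("192.")."""
    return os.path.join(B_DIRECTORY, ip[:ip_length])

def touch(ip: str) -> str:
    """Create or update the zero-byte tracking file for this visitor."""
    os.makedirs(B_DIRECTORY, exist_ok=True)
    path = ip_file(ip)
    with open(path, "a"):
        pass                 # create if missing; never write content
    os.utime(path, None)     # bump the timestamps; file stays 0 bytes
    return path
```

Because the files hold no content, many thousands of them cost almost nothing beyond directory entries, which is the point made later in the thread.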
Twin timestamps (mtime, atime) on each $ipFile are used to track both the gross number and the frequency of accesses, either of which can trip a block on further access:
- gross access (slow-scraper, controlled by $bTotVisit)
- frequency (fast-scraper, controlled by {$bMaxVisit / $bInterval})
A roll-over period ($bStartOver) prevents permanent exclusion.
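The two tests above can be sketched as one decision function. This is an illustrative Python model, not the PHP script itself: the variable names come from the thread, the values are invented for the example, and "visit count plus first-seen time" is one plausible reading of what the twin timestamps encode.

```python
# Stand-ins for the script's tunables (names from the original post,
# values are illustrative assumptions only).
B_TOT_VISIT = 8000      # $bTotVisit : max visits per roll-over period
B_MAX_VISIT = 20        # $bMaxVisit : fast-scraper numerator
B_INTERVAL = 10.0       # $bInterval : fast-scraper denominator (seconds)
B_START_OVER = 86400.0  # $bStartOver: roll-over period (seconds)

def should_block(visits: int, first_seen: float, now: float) -> bool:
    """Block on gross total (slow scraper) or on rate (fast scraper).

    The roll-over period wipes the slate, so no ban is permanent."""
    duration = now - first_seen
    if duration > B_START_OVER:
        return False                    # rolled over: start counting afresh
    if visits > B_TOT_VISIT:
        return True                     # slow scraper: sheer volume
    if visits >= B_MAX_VISIT and duration > 0:
        # fast scraper: average rate exceeds the allowed rate
        return (visits / duration) > (B_MAX_VISIT / B_INTERVAL)
    return False
```

Note that the fast-scraper test only engages once the visit count reaches $bMaxVisit, matching the constraint discussed below.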
Fast-Scraper reset logic:
The fast-scraper test is: (( $visits / $duration ) > ( $bMaxVisit / $bInterval )). Blocking therefore needs to stop when (( $visits / $duration ) == ( $bMaxVisit / $bInterval )). Since $duration should be == $bPenalty at that point, the equation can be solved for $visits. As the test does not begin until ( $visits >= $bMaxVisit ), it is imperative that ( $bMaxVisit / $bInterval ) >= 1.
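Working that solve through with concrete numbers may help. The values here are hypothetical (they are not from the thread), chosen so that the ( $bMaxVisit / $bInterval ) >= 1 constraint holds:

```python
# Hypothetical tunings: allow $bMaxVisit=20 visits per $bInterval=10 s,
# with a penalty window $bPenalty=600 s.
b_max_visit = 20
b_interval = 10.0
b_penalty = 600.0

# Solve ( visits / duration ) == ( b_max_visit / b_interval )
# for visits, with duration pinned to b_penalty:
visits_at_unblock = b_penalty * b_max_visit / b_interval   # 1200.0

# The fast-scraper test cannot start before visits >= b_max_visit,
# so the allowed rate must be at least one visit per second here:
rate_limit = b_max_visit / b_interval                      # 2.0 per second
```

So with these numbers the block lifts once the visit count, averaged over the full penalty window, has fallen back to the permitted rate - at 1200 recorded visits.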
I hope that all of the above is clear. I am normally a night-worker, and have time-shifted by 12 hours to be able to visit my son (as at top). At this instant I'm still suffering the side-effects of that, so my explanations may not be totally clear.
All the best.
However, I think a mysql version ... would be a great feature.
The current implementation is blindingly fast. I suspect that the involvement of MySQL could add unacceptable delays. I know that the use of many thousands of files may raise eyebrows, but as zero-byte files they consume vanishingly-small resources.
...although I did make a mistake initially that led the search engines to deindex my 8000 pages.
What on earth did you do?
I did a stupid thing. But before telling the details of my story, let me say that I am not a programmer. I cannot even call myself a novice. I know no PHP coding. I have some basic knowledge of algorithms from my university years.
Now, the fastest way to deindex pages:
I combined two different versions of your code (bot + IP check):
It was something like this: Check if it is a bot. If yes, exit. Else check the IP.
What led to the deindexing was this simple line:
If bot = true then exit;
Before implementing this code, my site was well indexed (some 8000 pages). After the implementation, site started being gradually deindexed by google. And within one month, the whole site was deindexed: No supplementals, no cache, nothing, as if my site was nonexistent. I even asked help here: [webmasterworld.com...]
At first sight, everything seemed OK. Google was crawling my site as always. But there was a strange observation: the log file showed a typical pattern: Http Code: 200 Date: #*$!xx Http Version: HTTP/1.1 Size in Bytes: 5. That is to say, Google (as well as Yahoo and MSN) was fetching only 5 bytes!
In other words, the bots were visiting the site only to get an instruction to exit, and were leaving with effectively zero-byte content. And since there was nothing to index, they (all the search engines) were deindexing the pages! Simple as that!
After some desperate and wrong attempts, I noticed my mistake and modified the code to "If not bot then check IP." Luckily, both Google and Yahoo (with the help of sitemaps) quickly re-indexed the pages.
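The broken and fixed orderings can be contrasted in a short sketch. This is an illustrative Python model of the control flow described above, not the poster's actual PHP; check_ip and render_page are hypothetical stand-ins:

```python
def check_ip() -> str:
    """Hypothetical stand-in for the IP-throttle: returns a block
    message when the visitor should be banned, else an empty string."""
    return ""          # assume this visitor is not blocked

def render_page() -> str:
    """Hypothetical stand-in for serving the real page content."""
    return "<html>real, indexable page content</html>"

def handle_request_wrong(is_bot: bool) -> str:
    # BROKEN ordering: "If bot = true then exit". Every crawler is
    # sent away with an HTTP 200 and an empty body, so the search
    # engines see nothing to index and gradually drop the pages.
    if is_bot:
        return ""                      # exit: (near-)zero-byte body
    blocked = check_ip()
    return blocked if blocked else render_page()

def handle_request_fixed(is_bot: bool) -> str:
    # FIXED ordering: "If not bot then check IP". Bots skip the
    # throttle but still receive the full page to index.
    if not is_bot:
        blocked = check_ip()
        if blocked:
            return blocked
    return render_page()
```

The only change is which branch the bots fall into, which is exactly why the mistake was so easy to make and so destructive.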
Can I recommend this for those who want to quickly remove their pages from the indexes? :)
(some 8000 pages) ... within one month, the whole site was deindexed
I've had to raise $bTotVisit to 8,000 (visits within 24 hours, $ipLength=4) because of Google. The G bot and the AdSense bot tend to browse from the same IP. I refuse to use the whitelist, and such a high figure was the only way to stop those bots getting blocked.
My site uses a Content-Negotiation class, so the bots get plenty of 304s (normal PHP pages miss out on this) and bandwidth stays low. All the same... 8,000 visits in one day from a bot, on a 15,000-a-day site.
I am aware how this "if not bot check IP" thing is vulnerable
I'm lazy, and simply do not want to have to maintain more code than I have to. I also get upset over the way that these so-called good bots often run wild, and get a perverse pleasure in seeing them banned. I'm not actually proud of this latter emotion, since I'm probably 'cutting off my own nose'.