I want to stop competitors from crawling my pages ... until they have my entire site. I'm looking into a modification that will track that behavior and just shut them down after a few hundred pages.
Hmmm. The routine as written creates an empty file ($ipFile) and uses the mod/access time of this file to activate a block. The same file could also contain content...
As written, the routine pre-processes a page. What you want would require splitting it so that it both pre- and post-processes each page (see the sketch after the list):
1. Use ob_start() at the beginning of the page.
2. Use ob_get_contents() and strlen() at the end.
3. Increment the integer stored in $ipFile.
4. Adjust the mod/access time.
5. Check against your limits during pre-processing.
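A minimal sketch of that split. Only $ipFile comes from the original routine; every other name, and the 50 MB cap, are illustrative assumptions:

```php
<?php
// Sketch of the pre/post split; $ipFile is from the original routine,
// everything else (names, the byte cap) is illustrative.

$ipFile    = '/tmp/blk_' . md5($_SERVER['REMOTE_ADDR']);
$byteLimit = 50 * 1024 * 1024;

// -- pre-processing (step 5): check the running total before serving --
$served = is_file($ipFile) ? (int) file_get_contents($ipFile) : 0;
if ($served > $byteLimit) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}
ob_start();   // step 1: buffer the page as it is generated

// ... normal page generation happens here ...

// -- post-processing (steps 2-4): record what was just served --
$bytes = strlen(ob_get_contents());            // step 2: size of this page
file_put_contents($ipFile, $served + $bytes);  // step 3: increment the integer
touch($ipFile);                                // step 4: refresh mod/access time
ob_end_flush();                                // send the buffered page
```

Whether you cap on bytes or on pages is up to you; strlen() gives you bytes served, but per my note above, the same file could just as well hold a page counter.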
You owe me a beer; a pint of ice-cold Guinness, please.
[added]: Y'know, after preening myself on my clever, clever reply, it has occurred to me that there is an even simpler solution...
The routine tracks the number of visits in $visits. It would be very easy to add a check on the gross number of visits. $visits is reset to zero whenever the mod-time falls behind the current time, i.e. a hit-rate of less than one per second (which, admittedly, still allows 86,400 hits in 24 hours). My experience over the last 18 months, however, is that *no-one* scraping a site has the patience to hit it that slowly (look at msg#9). Hence my original comment.