Blocking badly behaved runaway WebCrawlers
PHP solution that doesn't need a bad bot list - Identifies them on the fly.
xlcus

10+ Year Member



 
Msg#: 119 posted 11:47 pm on Jan 11, 2003 (gmt 0)

I have a fairly processor intensive script on one of my sites which is fine most of the time, but when it gets many repeated hits in quick succession from badly behaved webcrawlers which don't honour my robots.txt file, it brings my server to its knees.

I needed a way to block these inconsiderate bots, many of which identified themselves as standard browsers, so an .htaccess black list wasn't helping. Besides, this would need to be kept up to date every time a bad bot was spotted.

I came up with a small bit of PHP code to put at the start of a script that detects rapid multiple accesses from a particular ip address, and then blocks that ip until the bombardment stops...

$itime = 10; // Minimum number of seconds between visits
$ipenalty = 60; // Seconds before visitor is allowed back
$imaxvisit = 42; // Maximum visits
$iplogdir = "/sites/my.site.com/iplog/";

$ipfile = substr(md5($_SERVER["REMOTE_ADDR"]), -2);
$oldtime = 0;
if (file_exists($iplogdir.$ipfile)) $oldtime = filemtime($iplogdir.$ipfile);

$time = time();
if ($oldtime < $time) $oldtime = $time;
$newtime = $oldtime + $itime;

if ($newtime >= $time + $itime*$imaxvisit)
{
touch($iplogdir.$ipfile, $time + $itime*($imaxvisit-1) + $ipenalty);
header("HTTP/1.0 503 Service Temporarily Unavailable");
header("Connection: close");
header("Content-Type: text/html");
echo "<html><body><p><b>Server under heavy load</b><br>";
echo "Please wait $ipenalty seconds and try again</p></body></html>";
exit();
}
touch($iplogdir.$ipfile, $newtime);

Notes...

  • $iplogdir needs to be a directory that's writable by the web server.
  • $itime is the minimum number of seconds between visits on average over $itime*$imaxvisit seconds. So in the above example, a visitor isn't blocked if they visit the script multiple times in the first 10 seconds, as long as they don't visit more than 42 times within 420 seconds.
  • If the limit is reached, $ipenalty is the number of seconds a visitor has to wait before they are allowed back.

How it works...

For each visitor, an MD5 hash is made of their IP address and the last 2 hex digits of it are taken to generate one of 256 possible filenames. If this is a new visitor, or a visitor who hasn't been seen for a while, the timestamp of the file is set to the current time; otherwise they must have been a recent visitor, and the timestamp is increased by $itime. If they start loading the script more rapidly than once every $itime seconds, the timestamp on their IP's hashed filename will increase faster than the actual time does. If the timestamp gets too far ahead of the current time, they're branded as a bad visitor and the penalty is applied by pushing the timestamp on their file even further into the future.

$itime, $ipenalty, $imaxvisit can be tweaked to fit your own traffic patterns.

Hope someone else finds my script useful. :) If you have any questions, ask away...

 

jdMorgan

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 119 posted 1:34 am on Jan 12, 2003 (gmt 0)

xlcus,

What is an effective technique to "trim" all of the files this will produce - cron job?

Thanks for the post! I'm interested in this "access throttling" subject area, but I have no experience with it.

Thanks,
Jim

andreasfriedrich

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 119 posted 2:23 am on Jan 12, 2003 (gmt 0)

What is an effective technique to "trim" all of the files this will produce - cron job?

If I understand the script correctly, then there is no need to remove them at all, since there will only ever be 256 files at most.

I tried the script. It is a nice alternative if you cannot get Apache::Speedlimit to work.

Andreas

xlcus

10+ Year Member



 
Msg#: 119 posted 2:30 am on Jan 12, 2003 (gmt 0)

What is an effective technique to "trim" all of the files this will produce

The script will only produce 256 ip tracking files as only the last two characters of the ip MD5 hash are used to generate the file name. That's the beauty of this truncated hash method.

256 separate hash files are enough that it's very unlikely you'll get two visitors with the same hash at exactly the same time, but not so many that you have to keep tidying up the files.

And even if you do get more than one visitor with the same hash file at the same time, it's no great disaster. They'll just approach the throttle limit a little faster, which in most cases won't matter as the limits I've used in the example are quite generous.
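
If you did still want to clear the hash files out now and then (they hold no content, only a timestamp), a small PHP script run from cron would do it. This is just a sketch, not something from the thread: $iplogdir must match the directory used in the script above, and the one-day cutoff is arbitrary.

<?php
// Remove iplog hash files whose timestamp is more than a day old.
// Run it from cron with the PHP command-line binary, e.g. once a night.
$iplogdir = "/sites/my.site.com/iplog/";
$cutoff = time() - 86400;
$dir = opendir($iplogdir);
if ($dir)
{
    while (($file = readdir($dir)) !== false)
    {
        if ($file == "." || $file == ".." || strlen($file) != 2) continue; // only the 2-character hash files
        if (filemtime($iplogdir.$file) < $cutoff) unlink($iplogdir.$file);
    }
    closedir($dir);
}
?>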

jdMorgan

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 119 posted 4:20 am on Jan 12, 2003 (gmt 0)

Doh!

I need to increase my font size again!

Next question: Is it possible (and/or advisable) to create a PERL version of this script? I don't have PHP available, and the only sticking points would appear to be the MD5 hash and the touch function - not sure if I can get to those from PERL. Any other suggestions for porting this?

How about recommendations for testing methods on a "public" server?

Thanks,
Jim

andreasfriedrich

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 119 posted 3:21 pm on Jan 12, 2003 (gmt 0)

Digest::MD5 [perldoc.com]

Your touch implementation could look like this:

$t = time; 
utime $t, $t, $file or open 'NEW', ">$file";

Andreas

jdMorgan

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 119 posted 5:10 pm on Jan 12, 2003 (gmt 0)

Thanks, Andreas! That'll get me going.

Jim

xlcus

10+ Year Member



 
Msg#: 119 posted 5:55 pm on Jan 12, 2003 (gmt 0)

If you're interested in logging which visitors and user agents have been blocked, you might find these improvements to the script useful... (The changes are the new $iplogfile setting and the logging block just before the exit() call.)

$itime = 10; // Minimum number of seconds between visits
$ipenalty = 60; // Seconds before visitor is allowed back
$imaxvisit = 42; // Maximum visits
$iplogdir = "/sites/my.site.com/iplog/";
$iplogfile = "iplog.dat";

$ipfile = substr(md5($_SERVER["REMOTE_ADDR"]), -2);
$oldtime = 0;
if (file_exists($iplogdir.$ipfile)) $oldtime = filemtime($iplogdir.$ipfile);

$time = time();
if ($oldtime < $time) $oldtime = $time;
$newtime = $oldtime + $itime;

if ($newtime >= $time + $itime*$imaxvisit)
{
touch($iplogdir.$ipfile, $time + $itime*($imaxvisit-1) + $ipenalty);
header("HTTP/1.0 503 Service Temporarily Unavailable");
header("Connection: close");
header("Content-Type: text/html");
echo "<html><body><p><b>Server under heavy load</b><br>";
echo "Please wait $ipenalty seconds and try again</p></body></html>";
$fp = fopen($iplogdir.$iplogfile, "a");
if ($fp)
{
$useragent = "<unknown user agent>";
if (isset($_SERVER["HTTP_USER_AGENT"])) $useragent = $_SERVER["HTTP_USER_AGENT"];
fputs($fp, $_SERVER["REMOTE_ADDR"]." ".date("d/m/Y H:i:s")." ".$useragent."\n");
fclose($fp);
}

exit();
}
touch($iplogdir.$ipfile, $newtime);


andreasfriedrich

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 119 posted 6:43 pm on Jan 12, 2003 (gmt 0)

Please post your Perl version, Jim, since I'm interested in using it as well.

Logging certainly is a nice feature xlcus. Thanks for sharing this with us.

Andreas

jdMorgan

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 119 posted 4:23 am on Jan 13, 2003 (gmt 0)

Will do, but it will be a while... Lots of other work to do this week.

Thanks xlcus, I've been looking for just such a method.

Jim

antirack

10+ Year Member



 
Msg#: 119 posted 12:55 am on Jan 25, 2003 (gmt 0)

Hi xlcus

Thanks a lot for giving us this code. I came across the thread about using mod_rewrite to ban crawlers, a 15-page read ;-)

I've used the mod_rewrite code discussed there, and additionally installed your script.

After about 18h, I've got the following in my iplog.dat:

80.95.97.252 24/01/2003 19:39:27 Web Downloader/4.9
80.95.97.252 24/01/2003 19:39:27 Web Downloader/4.9
80.95.97.252 24/01/2003 19:39:28 Web Downloader/4.9
80.95.97.252 24/01/2003 19:39:28 Web Downloader/4.9
80.95.97.252 24/01/2003 19:39:28 Web Downloader/4.9
80.95.97.252 24/01/2003 19:39:28 Web Downloader/4.9
80.95.97.252 24/01/2003 19:39:28 Web Downloader/4.9
80.95.97.252 24/01/2003 19:39:28 Web Downloader/4.9
80.95.97.252 24/01/2003 19:39:28 Web Downloader/4.9
80.95.97.252 24/01/2003 19:39:28 Web Downloader/4.9
80.95.97.252 24/01/2003 19:39:28 Web Downloader/4.9
80.95.97.252 24/01/2003 19:39:28 Web Downloader/4.9
80.95.97.252 24/01/2003 19:39:28 Web Downloader/4.9
80.95.97.252 24/01/2003 19:39:28 Web Downloader/4.9
80.95.97.252 24/01/2003 19:39:28 Web Downloader/4.9
80.95.97.252 24/01/2003 19:39:28 Web Downloader/4.9
80.95.97.252 24/01/2003 19:39:28 Web Downloader/4.9
217.34.225.185 24/01/2003 21:26:20 Microsoft URL Control - 6.00.8862
217.34.225.185 24/01/2003 21:26:20 Microsoft URL Control - 6.00.8862
217.34.225.185 24/01/2003 21:26:20 Microsoft URL Control - 6.00.8862
217.34.225.185 24/01/2003 21:26:21 Microsoft URL Control - 6.00.8862
217.34.225.185 24/01/2003 21:26:21 Microsoft URL Control - 6.00.8862
217.34.225.185 24/01/2003 21:26:21 Microsoft URL Control - 6.00.8862
217.34.225.185 24/01/2003 21:26:22 Microsoft URL Control - 6.00.8862
217.34.225.185 24/01/2003 21:26:22 Microsoft URL Control - 6.00.8862
217.34.225.185 24/01/2003 21:26:23 Microsoft URL Control - 6.00.8862
217.34.225.185 24/01/2003 21:26:23 Microsoft URL Control - 6.00.8862
194.158.104.35 24/01/2003 21:26:53 Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)
194.158.104.35 24/01/2003 21:27:10 Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)
194.158.104.35 24/01/2003 21:27:26 Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)
194.158.104.35 24/01/2003 21:28:08 Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)
194.158.104.35 24/01/2003 21:28:24 Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)
194.158.104.35 24/01/2003 21:29:04 Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)
194.158.104.35 24/01/2003 21:29:25 Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)
130.192.76.147 24/01/2003 23:35:08 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
4.65.218.43 24/01/2003 23:35:21 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)
130.192.76.147 24/01/2003 23:35:21 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
4.65.218.43 24/01/2003 23:35:23 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)
130.192.76.147 24/01/2003 23:35:23 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
4.65.218.43 24/01/2003 23:35:28 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)
130.192.76.147 24/01/2003 23:35:29 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
130.192.76.147 24/01/2003 23:35:30 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
130.192.76.147 24/01/2003 23:35:30 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
130.192.76.147 24/01/2003 23:35:31 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
4.65.218.43 24/01/2003 23:35:33 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)
130.192.76.147 24/01/2003 23:35:35 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
4.65.218.43 24/01/2003 23:35:43 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)
130.192.76.147 24/01/2003 23:38:09 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
130.192.76.147 24/01/2003 23:38:11 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
130.192.76.147 24/01/2003 23:38:14 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
130.192.76.147 24/01/2003 23:38:14 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
130.192.76.147 24/01/2003 23:38:16 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)

I've used your standard settings and tried it myself by hitting refresh and navigating very quickly around my site. As you say, the timing should be sufficient for normal people browsing the site.

I am just wondering if those are real browsers, or programs with a changed UA.... I know there is no way to find out, but maybe users can share their experience a bit. I am going to put an email address and a "what happened?" screen there instead of the 'under heavy load' message at some point, but I don't have the time today...

Thanks again
Alex

xlcus

10+ Year Member



 
Msg#: 119 posted 10:10 am on Jan 27, 2003 (gmt 0)

I am just wondering if those are real browsers, or programs with a changed UA.... I know there is no way to find out, but maybe users can share their experience a bit.

Hi antirack,

There's no 100% sure way to work out if it's a bot or a human, but you can usually have a good guess. Yes, naughty bots quite often disguise their UA to look like a normal browser, but take the IP address from the iplog.dat file and find the corresponding entries in your log files (a rough log-scanning sketch follows the list below)...

  • Is the visitor from that ip address downloading images too? If they're not, then most likely they're a bot.
  • Look at the path they followed through the site. Does it look like they're doing a logical traverse of your link structure? (probably a bot) Or are they following a more random path? (probably a human)
  • Look at the number of seconds between accesses. Is it regular? (probably a bot) Or more irregular? (probably human)
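
Here's a rough sketch (not from the thread) of the kind of log check described above: it scans an Apache access log for one IP address, prints the request timestamps so the spacing between hits can be eyeballed, and counts how many of the requests were for images, CSS or JS. The log path and the IP are placeholders you'd replace with your own.

<?php
$logfile = "/var/log/apache/access_log"; // adjust to your server
$ip = "80.95.97.252";                    // an address taken from iplog.dat

$hits = 0;
$supporthits = 0;
$fp = fopen($logfile, "r");
if (!$fp) die("Cannot open $logfile\n");
while (($line = fgets($fp, 8192)) !== false)
{
    if (strpos($line, $ip." ") !== 0) continue; // common/combined log lines start with the IP
    $hits++;
    // Grab the request path so we can tell pages from images/CSS/JS
    if (preg_match('/"(?:GET|POST|HEAD) (\S+)/', $line, $m)
        && preg_match('/\.(gif|jpe?g|png|css|js)(\?|$)/i', $m[1])) $supporthits++;
    // Print the [timestamp] part so the gaps between requests can be inspected
    if (preg_match('/\[([^\]]+)\]/', $line, $m)) echo $m[1]."\n";
}
fclose($fp);

echo "$hits requests from $ip, $supporthits of them for images/CSS/JS\n";
if ($supporthits == 0) echo "No support files fetched - probably a bot\n";
?>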

antirack

10+ Year Member



 
Msg#: 119 posted 12:30 am on Jan 28, 2003 (gmt 0)

I have now realized that this also bans GoogleBot from our site.

64.68.86.59 28/01/2003 02:36:56 Googlebot/2.1 (+http://www.googlebot.com/bot.html)
64.68.86.79 28/01/2003 02:36:57 Googlebot/2.1 (+http://www.googlebot.com/bot.html)
64.68.86.79 28/01/2003 02:36:57 Googlebot/2.1 (+http://www.googlebot.com/bot.html)
64.68.86.59 28/01/2003 02:36:57 Googlebot/2.1 (+http://www.googlebot.com/bot.html)
64.68.86.59 28/01/2003 02:36:59 Googlebot/2.1 (+http://www.googlebot.com/bot.html)
64.68.87.43 28/01/2003 02:36:59 Googlebot/2.1 (+http://www.googlebot.com/bot.html)
64.68.87.42 28/01/2003 02:37:00 Googlebot/2.1 (+http://www.googlebot.com/bot.html)
64.68.87.43 28/01/2003 02:37:01 Googlebot/2.1 (+http://www.googlebot.com/bot.html)
64.68.86.59 28/01/2003 02:37:02 Googlebot/2.1 (+http://www.googlebot.com/bot.html)
64.68.87.42 28/01/2003 02:37:02 Googlebot/2.1 (+http://www.googlebot.com/bot.html)

It's the real Googlebot, as the IP is theirs. I have therefore added the following lines just at the beginning of the function, to avoid banning Google. Googlebot seems to be quite busy: I have had 11767 entries in iplog.dat for Googlebot within just about 3 days, and the IPs are all Google's.


// if it is google, or somebody pretending to be, return
if ( eregi("Googlebot", $_SERVER["HTTP_USER_AGENT"]) )
{
return;
}

ruserious

10+ Year Member



 
Msg#: 119 posted 1:07 am on Jan 31, 2003 (gmt 0)

Excluding the Googlebot user agent is not a good idea, because people in the know would just spoof their User-Agent.

Google itself states that it does not fetch more than one page every 6 seconds, so reducing the interval would do the trick.

Another thing that worried me a bit: why only use the last two hex digits of the hash and not a few more? Would a bot that is running over a proxy (with rotating IPs) effectively start a DoS attack with that script in place? Several hashes might get blocked, making other people who share those hashes get the same 503 page. 256 possibilities isn't all that much.
Is there an advantage to using only 2 rather than, say, the last 4 hex digits of the hash?

I really like the idea, and I think it's a very good script, but I am always looking for possible drawbacks; I hope that doesn't come across as rude. ;)

xlcus

10+ Year Member



 
Msg#: 119 posted 10:40 am on Jan 31, 2003 (gmt 0)

Why only use the last two hex numbers of the hash and not a few more?

You could use a few more if you wanted. The only drawback would be that you'd have more temporary files...
  • 2 digits = 256 files
  • 3 digits = 4096 files
  • 4 digits = 65536 files
  • etc...

Would a bot that is running over a Proxy (with rotating IPs) effectively start a DOS attack with that script in place?

If you have that situation, then yes, using more than 2 digits of the hash would be a good idea. Just be aware that you're going to have a lot of temporary files unless you implement some other way (Database maybe) of keeping track of the times.
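
For anyone who does go the database route, here's a minimal sketch of the same throttle backed by a MySQL table instead of timestamp files, with one row per full IP address so there are no hash collisions at all. None of this comes from the thread: the table name, column names and connection details are all made up for illustration, and the snippet is meant to replace the body of the original script above.

// CREATE TABLE iplog (ip VARCHAR(15) NOT NULL PRIMARY KEY, stamp INT NOT NULL);

$itime = 10; // Minimum number of seconds between visits
$ipenalty = 60; // Seconds before visitor is allowed back
$imaxvisit = 42; // Maximum visits

mysql_connect("localhost", "dbuser", "dbpass") or die("db error");
mysql_select_db("mysite") or die("db error");

$ip = $_SERVER["REMOTE_ADDR"]; // set by the server, so safe to place in the query
$time = time();

$res = mysql_query("SELECT stamp FROM iplog WHERE ip = '$ip'");
$row = $res ? mysql_fetch_row($res) : false;
$oldtime = $row ? (int)$row[0] : 0;
if ($oldtime < $time) $oldtime = $time;
$newtime = $oldtime + $itime;

if ($newtime >= $time + $itime*$imaxvisit)
{
    // Same penalty logic as the file version, just stored in the table
    mysql_query("REPLACE INTO iplog (ip, stamp) VALUES ('$ip', ".($time + $itime*($imaxvisit-1) + $ipenalty).")");
    header("HTTP/1.0 503 Service Temporarily Unavailable");
    header("Connection: close");
    header("Content-Type: text/html");
    echo "<html><body><p><b>Server under heavy load</b><br>";
    echo "Please wait $ipenalty seconds and try again</p></body></html>";
    exit();
}
mysql_query("REPLACE INTO iplog (ip, stamp) VALUES ('$ip', $newtime)");

Stale rows can then be pruned occasionally with a single DELETE FROM iplog WHERE stamp < UNIX_TIMESTAMP() - 86400.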

antirack

10+ Year Member



 
Msg#: 119 posted 2:39 am on Feb 2, 2003 (gmt 0)

Unfortunately, accessing your site 6 times per second can hurt your server very much if the pages are rather complex, with lots of database accesses. We have a database-driven web site and it is optimized as well as possible. We have a second dedicated Linux box with dual 1G CPUs, lots of RAM, RAID 5 SCSI drives, etc., but Googlebot brought that machine (and only the database server, MySQL) nearly to its knees last week, with a load average sometimes up to 10 and on average around 4 to 5. The "attack" lasted for a day. Looking at our statistics, it also generated huge traffic.

We have a very busy site, and usually the load average stays below 1.

What's generally faster: keeping the data in files, or in a database? As I said, we run everything out of a database; even our session data is stored in MySQL tables instead of /tmp. But I am actually not sure which would be faster and healthier for the system, MySQL or files, especially if we go to using 3 digits.

antirack

10+ Year Member



 
Msg#: 119 posted 2:43 am on Feb 2, 2003 (gmt 0)

BTW: since I posted the message about Googlebot being banned, I have refreshed (deleted) the iplog.dat file and now I have 2800 new entries. If somebody is interested in looking at it, please let me know. I thought this might be a good example for people interested in using this sooner or later.

-Alex

DerekT

10+ Year Member



 
Msg#: 119 posted 8:46 pm on Mar 5, 2003 (gmt 0)

I have taken the code and made a few improvements, primarily ensuring that Googlebot can index the pages.

The code will check for 3 conditions:

Googlebot for USER_AGENT
or
IP of 64.68.8*.*
or
IP of 216.239.46.*

If one or more of these conditions are met, the code is not run.


if ( eregi("Googlebot", $_SERVER["HTTP_USER_AGENT"]) || (substr($_SERVER["REMOTE_ADDR"],0,7) == "64.68.8") || (substr($_SERVER["REMOTE_ADDR"],0,11) == "216.239.46.") )
{
}
else {
$itime = 10; // Minimum number of seconds between visits
$ipenalty = 60; // Seconds before visitor is allowed back
$imaxvisit = 45; // Maximum visits
$iplogdir = "/path/to/logs/";
$iplogfile = "iplog.dat";
$ipfile = substr(md5($_SERVER["REMOTE_ADDR"]), -2);
$oldtime = 0;
if (file_exists($iplogdir.$ipfile)) $oldtime = filemtime($iplogdir.$ipfile);

$time = time();
if ($oldtime < $time) $oldtime = $time;
$newtime = $oldtime + $itime;

if ($newtime >= $time + $itime*$imaxvisit)
{
touch($iplogdir.$ipfile, $time + $itime*($imaxvisit-1) + $ipenalty);
header("HTTP/1.0 503 Service Temporarily Unavailable");
header("Connection: close");
header("Content-Type: text/html");
echo "<html><head><title></title><body>MESSAGE TEXT HERE</body></html>";
$fp = fopen($iplogdir.$iplogfile, "a");
if ($fp)
{
$useragent = "<unknown user agent>";
if (isset($_SERVER["HTTP_USER_AGENT"])) $useragent = $_SERVER["HTTP_USER_AGENT"];
fputs($fp, $_SERVER["REMOTE_ADDR"]." ".date("d/m/Y H:i:s")." ".$useragent."\n");
fclose($fp);
}
exit();
}
touch($iplogdir.$ipfile, $newtime);
}

daisho

10+ Year Member



 
Msg#: 119 posted 9:52 pm on Mar 5, 2003 (gmt 0)

My site is very database intensive also. I took things to the other extreme and created a file caching system to improve my performance without losing any of my flexibility with databases.

I use output buffering and run my program through it. At the end I cache the file to disk and display it to the user.

I also use timestamp files that I can touch in order to force my program to refresh from database. That way if nothing changes for days/weeks on a page then the database is never even connected to. The script sees that there is a valid timestamp file and then simply outputs the cache file.

If something changes I touch the timestamp file. If the timestamp of the file is greater than the cache page then the cache page is regenerated.

The advantage to that is the fact that if the database is down, I display the cache page even if it should be regenerated. In my opinion, an old page is better than an error page.

The downside to this is that I now have 3.7gigs of cache files. The upside is that there is much less impact on my database. Google can crawl at full force and my server stays happy.
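
For anyone wanting to try the same approach, here's a very rough sketch of the idea daisho describes (this is not his code; the paths and file names are invented). The page is built inside an output buffer, the result is written to a cache file, and a separate timestamp file is touch()ed whenever the content changes, to force a rebuild.

<?php
$cachefile = "/sites/my.site.com/cache/".md5($_SERVER["REQUEST_URI"]).".html";
$stampfile = "/sites/my.site.com/cache/refresh.stamp"; // touch() this to force a refresh

$havecache = file_exists($cachefile);
$stale = !$havecache
    || (file_exists($stampfile) && filemtime($stampfile) > filemtime($cachefile));

if (!$stale)
{
    readfile($cachefile); // no database work at all
    exit();
}

ob_start(); // buffer the normal, database-driven page build

// ... expensive page generation goes here ...

$html = ob_get_contents();
ob_end_flush(); // send the page to the visitor

// Save the output for next time. If the database had been down, we could
// instead have fallen back to the old cache file, as daisho suggests.
$fp = @fopen($cachefile, "w");
if ($fp) { fputs($fp, $html); fclose($fp); }
?>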

DerekT

10+ Year Member



 
Msg#: 119 posted 5:11 am on Mar 8, 2003 (gmt 0)

daisho

Is the caching solution you use something created in house or was it a commercial offering? I could use a similar performance boost since all pages are PHP and mySQL generated.

Hester

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 119 posted 10:44 am on Mar 10, 2003 (gmt 0)

I came up with a small bit of PHP code to put at the start of a script that detects rapid multiple accesses from a particular ip address, and then blocks that ip until the bombardment stops...

xlcus: sorry to be naive, but where does the script go? At the start of every page? Can it go into a single file such as index.php which then leads to the rest of your site? Or would bots get round that?

xlcus

10+ Year Member



 
Msg#: 119 posted 12:20 pm on Mar 10, 2003 (gmt 0)

where does the script go? At the start of every page?

Yeah, you need to put it at the start of every page you want to protect.
Put the code in a separate file, and then all you need to do is 'include' it at the top of each page.
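
For example, the very first line of every page could be something like this (the path and file name here are just placeholders):

<?php include("/sites/my.site.com/includes/throttle.php"); ?>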

Hester

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 119 posted 12:36 pm on Mar 10, 2003 (gmt 0)

If your site has a lot of static pages, won't that add to the server load?

DrDoc

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 119 posted 5:38 am on Mar 27, 2003 (gmt 0)

You don't have to include the file in every page at all. Just add this to your .htaccess file:

<IfModule mod_php4.c>
php_value auto_prepend_file "/path/to/file/block_bad.php"
</IfModule>

And, no, it won't add too much to the server load. It's only a couple of lines of code. Besides, if it stops bad bots, then that traffic will no longer take server resources ;)

Gonzalez

10+ Year Member



 
Msg#: 119 posted 9:42 am on Apr 21, 2003 (gmt 0)

Hi there.

I started using the script and I think it works fine (I also use the .htaccess blacklist).

Based on your experience, don't you think it should let other spiders through as well (not just Googlebot)? For example, I also let FAST-Webcrawler pass straight through.
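
One way to do that is to keep a small whitelist of user-agent substrings and skip the throttle for any of them, along the lines of this sketch (the list entries are only examples, and the early return assumes the check sits at the top of the included throttle file, as in antirack's snippet). Bear in mind that anyone can fake a User-Agent string, so IP checks like those in DerekT's version are the safer test.

// Let known, well-behaved crawlers bypass the throttle
$friendly = array("Googlebot", "FAST-Webcrawler", "Slurp");
$ua = isset($_SERVER["HTTP_USER_AGENT"]) ? $_SERVER["HTTP_USER_AGENT"] : "";
foreach ($friendly as $bot)
{
    if (stristr($ua, $bot)) return; // skip the rest of the throttle code
}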

Digger

10+ Year Member



 
Msg#: 119 posted 9:24 pm on May 28, 2003 (gmt 0)

Hi all--

I am a newbie and I would like to use this script, but I don't understand how to implement it. I am on an Apache server that can run PHP.

Do I copy the script to notepad then save it as a php file? Is there anything in the code I need to change to match my site? Where should I save this file? In my public_html directory?

I saw the last post regarding adding to .htaccess file.

I'm getting rapid hits on my website like this, 800 or more, and it's really driving my bandwidth usage up.

Giacomo

10+ Year Member



 
Msg#: 119 posted 11:28 pm on May 28, 2003 (gmt 0)

Digger,

You can use this modified version of the script, which is independent of your directory structure. I also changed the variable names so that they will not conflict with other variables in your PHP scripts.

<?php
$botblocker_itime = 10; // Minimum number of seconds between visits
$botblocker_ipenalty = 60; // Seconds before visitor is allowed back
$botblocker_imaxvisit = 45; // Maximum visits
$botblocker_iplogdir = $_SERVER["DOCUMENT_ROOT"]."/botblocker/";
$botblocker_iplogfile = "iplog.dat";
$botblocker_ipfile = substr(md5($_SERVER["REMOTE_ADDR"]), -2);
$botblocker_oldtime = 0;
if (file_exists($botblocker_iplogdir.$botblocker_ipfile)) $botblocker_oldtime = filemtime($botblocker_iplogdir.$botblocker_ipfile);
$botblocker_time = time();
if ($botblocker_oldtime < $botblocker_time) $botblocker_oldtime = $botblocker_time;
$botblocker_newtime = $botblocker_oldtime + $botblocker_itime;
if ($botblocker_newtime >= $botblocker_time + $botblocker_itime*$botblocker_imaxvisit)
{
touch($botblocker_iplogdir.$botblocker_ipfile, $botblocker_time + $botblocker_itime*($botblocker_imaxvisit-1) + $botblocker_ipenalty);
header("HTTP/1.0 503 Service Temporarily Unavailable");
header("Connection: close");
header("Content-Type: text/html");
echo "<html><body><p><b>Server under heavy load</b><br>";
echo "Please wait $botblocker_ipenalty seconds and try again</p></body></html>";
$botblocker_fp = fopen($botblocker_iplogdir.$botblocker_iplogfile, "a");
if ($botblocker_fp)
{
$botblocker_useragent = "<unknown user agent>";
if (isset($_SERVER["HTTP_USER_AGENT"])) $botblocker_useragent = $_SERVER["HTTP_USER_AGENT"];
fputs($botblocker_fp, $_SERVER["REMOTE_ADDR"]." ".date("d/m/Y H:i:s")." ".$botblocker_useragent."\n");
fclose($botblocker_fp);
}
exit();
}
touch($botblocker_iplogdir.$botblocker_ipfile, $botblocker_newtime);
?>

Instructions:

1. Copy the above code, including the <?php and ?> delimiters, paste it into an empty text file and save it to your web site's document root folder (normally, public_html/) as "botblocker.php" (or any other name you like).

2. Create a new directory inside the public_html folder and name it "botblocker"; make sure the directory is writable by the server (chmod it 777). In case you do not know how to chmod a directory, use the following script:
<?php
chown ($_SERVER["DOCUMENT_ROOT"]."/botblocker/", "nobody");
chmod ($_SERVER["DOCUMENT_ROOT"]."/botblocker/", 0777);
?>

3. Make sure you include the script on top of all of your PHP pages, like this:
<?php require_once($_SERVER["DOCUMENT_ROOT"]."/botblocker.php"); ?>

Alternatively, as suggested by DrDoc, you may add the following to your .htaccess file:
<IfModule mod_php4.c>
php_value auto_prepend_file "/path/to/file/botblocker.php"
</IfModule>
The latter method requires that you replace "/path/to/file" with the actual path to your web site's document root folder. You can easily get the path with the following PHP script:
<?php echo $_SERVER["DOCUMENT_ROOT"]; ?>

That's all. ;-)


Storyteller

10+ Year Member



 
Msg#: 119 posted 11:42 pm on May 28, 2003 (gmt 0)

It's strange no one mentioned mod_throttle. It can easily be used to block/limit bots based on request rate. It's available from [modules.apache.org...]

DerekT

10+ Year Member



 
Msg#: 119 posted 11:45 pm on May 28, 2003 (gmt 0)

Giacomo

Your script will block most friendly web spiders. Look at the above modifications to allow Google to be exempted from this limiting. The code can also be modified for any other crawlers by IP or agent.

Giacomo

10+ Year Member



 
Msg#: 119 posted 11:46 pm on May 28, 2003 (gmt 0)

Storyteller,
Unfortunately not everyone can get Apache modules installed on their box, especially those on shared hosting.
