Home / Forums Index / WebmasterWorld / Webmaster General
Forum Library, Charter, Moderators: phranque

Webmaster General Forum

Neat little unix/linux command to list top IPs accessing your site
Useful if you are getting hit by a bad bot

 1:35 am on Feb 1, 2010 (gmt 0)

Nice little shell command to get a real-time list of the top IPs hitting your site at the moment...

[b]tail -50000 access_log | awk '{print $1}' | sort | uniq -c | sort -n | tail[/b]

The -50000 tells tail to look at the last 50,000 lines of the access_log file. You can tweak that if you need to look at more or fewer entries.
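For anyone who wants to see it in action before pointing it at a real log, here is a self-contained run against a throwaway five-line log (the IPs are fabricated):

```shell
# Build a throwaway log in access_log style (fabricated IPs)
cat > /tmp/access_log.sample <<'EOF'
10.0.0.1 - - [01/Feb/2010:01:00:00 +0000] "GET /a.html HTTP/1.1" 200 512
10.0.0.2 - - [01/Feb/2010:01:00:01 +0000] "GET /b.html HTTP/1.1" 200 512
10.0.0.2 - - [01/Feb/2010:01:00:02 +0000] "GET /c.html HTTP/1.1" 200 512
10.0.0.3 - - [01/Feb/2010:01:00:03 +0000] "GET /d.html HTTP/1.1" 200 512
10.0.0.2 - - [01/Feb/2010:01:00:04 +0000] "GET /e.html HTTP/1.1" 200 512
EOF

# Same pipeline as above: hit count per IP, busiest IP on the last line
tail -50000 /tmp/access_log.sample | awk '{print $1}' | sort | uniq -c | sort -n | tail
```

The last line of output is `3 10.0.0.2` (uniq -c prefixes each IP with its hit count), so the worst offender always lands at the bottom.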

Helped me today when I was getting bombed by "omgilibot", which nearly brought my server down. I was able to pinpoint the little sucker in seconds and add it to my firewall block list.



 1:41 am on Feb 1, 2010 (gmt 0)

Just thought of a neat script I'm going to try to put together.

1. Have cron run this, say, every 15 minutes and dump the output into a file.
2. Have a perl script look at the output and run an IP lookup on the top culprits.
3. If there are excessive hits and they are not one of the big-dog search bots, issue an IP ban and email myself a notification with the results (so I can confirm it was a bot I wanted blocked).

Wonder if this is going to cause me more headaches than it solves? Maybe I'll give it a whirl anyway... =)
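The whitelist check in step 3 can be sketched in a few lines of shell. This is a rough illustration, not anything from this thread: the `suspects` helper name and the 200-hit threshold are made up, and the input is assumed to be "count hostname" pairs with the lookup already done:

```shell
# Print any host over the hit threshold that doesn't look like a
# major search engine's crawler. Input: "count hostname" pairs.
suspects() {
    awk -v t="$1" '
        ($1 + 0) > (t + 0) && $2 !~ /(google|msn|yahoo|ask|amazon)/ {
            print "suspect:", $2, "(" $1 " hits)"
        }'
}

# Googlebot is whitelisted; the DSL host trips the threshold
printf '6197 crawl-66-249-65-246.googlebot.com\n1008 dsl-host.example.net\n' | suspects 200
```

The `$1 + 0` forces awk to compare numerically rather than as strings, which matters once counts have different digit lengths.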


 6:08 pm on Feb 1, 2010 (gmt 0)

Ok, I know I'm just talking to myself at this point, but I've made some progress on my little monitoring script.

My perl script now does the following:

1. Queries my access logs, looks at the recent activity from the past few hours, and tallies up the most common IP addresses accessing my content.
2. It then checks the top 10 entries against a GEO database to determine the country of origin, and does a DNS name lookup on the IP.
3. A report is generated with all this info that looks like this:

Lookup: crawl02.exabot.com
Bot IP:
Reads: 751
Country: France

Lookup: crawl-66-249-65-225.googlebot.com
Bot IP:
Reads: 771
Country: United States

Lookup: katy-dsl-76-164-108-162.consolidated.net
Bot IP:
Reads: 1008
Country: United States

Lookup: spider38.yandex.ru
Bot IP:
Reads: 4492
Country: Russian Federation

Lookup: crawl-66-249-65-246.googlebot.com
Bot IP:
Reads: 6197
Country: United States

Lookup: b3091256.crawl.yahoo.net
Bot IP:
Reads: 6397
Country: United States

I am now adding some logic to filter the list based on white-listed bots (google, yahoo, msn, ask, etc). Then adding an email notification if a non-white-listed bot is looking at too many pages.

Have not decided yet if I'll auto-block based on the above rules using my firewall. Just not comfortable with doing that.

Anyway, just a status update. When I'm done I'll be sure to post my code and all my scripts in case it helps someone else.


 6:23 pm on Feb 1, 2010 (gmt 0)

Ok, I know I'm just talking to myself at this point

no..some of us are following your musings :)


 6:42 pm on Feb 1, 2010 (gmt 0)

Thanks. Good to know I am at least entertaining a few folks... ;-)


 9:51 pm on Feb 1, 2010 (gmt 0)

I am in too. I am liking the functionality of it. Would love to have something like this for my server even!


 12:03 pm on Feb 2, 2010 (gmt 0)

yes, do continue to post - i'm sure you have several interested, if silent, observers.
good stuff you are doing which could be applied to several similar issues.


 7:40 pm on Feb 2, 2010 (gmt 0)

I use something similar:

netstat -a -n | grep :80 | cut -d : -f2 | awk '{print $2}' | sort | uniq -c | sort

This shows the connections active on port 80 right now (or still within the timeout window).

This method doesn't need to scan the log file, so you should be able to cron it at a faster interval.
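To see what that chain pulls out before running it on a live box, you can feed it a few canned `netstat -an` style lines (addresses fabricated):

```shell
# Fake netstat -an output: two port-80 connections from one remote IP,
# plus an SSH connection that the :80 filter should drop
cat > /tmp/netstat.sample <<'EOF'
tcp        0      0 192.0.2.10:80           198.51.100.7:51234      ESTABLISHED
tcp        0      0 192.0.2.10:80           198.51.100.7:51240      ESTABLISHED
tcp        0      0 192.0.2.10:22           203.0.113.9:40000       ESTABLISHED
EOF

# Same filter chain as above: remote IPs connected to port 80, counted
grep :80 /tmp/netstat.sample | cut -d : -f2 | awk '{print $2}' | sort | uniq -c | sort
```

The `cut -d : -f2` grabs the chunk between the first and second colons ("80 198.51.100.7"), and the `awk '{print $2}'` keeps just the remote IP, so the output here is one line: `2 198.51.100.7`.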


 8:20 pm on Feb 2, 2010 (gmt 0)

Nice one Frank. I'm going to work that one into my setup. It looks like a great way of catching something that is bogging down my server, which in most cases is a bad bot.

I got hung up for a bit with some technical issues. Apparently you can't use variables in aliases when run in the background unless you use a function as the alias. Long story short, I'm back on track.
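For anyone hitting the same wall: an alias is plain text substitution and can't take positional arguments, but a function can. A small sketch (the `topips` name and the log path in the usage comment are mine):

```shell
# alias topips='tail -50000 $1 | ...'   # won't work: $1 is NOT the alias argument

# A function accepts arguments like a script would:
topips() {
    # $1 = log file, $2 = number of lines to scan (default 50000)
    tail -"${2:-50000}" "$1" | awk '{print $1}' | sort | uniq -c | sort -n | tail
}

# usage: topips /var/log/httpd/access_log 7000
```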


 2:48 pm on Feb 25, 2010 (gmt 0)

I know it has been a while, but I thought I would check back in.

I have not automated my scripts yet, but they did come in handy yesterday when my server was getting hit by a DDoS attack.

I actually didn't realize what was going on until I ran one of my new IP reports and looked at the top 50 IP addresses hitting my site. There was a nice, even distribution of international (and a few US-based) machines/servers all hitting my site.

According to my reports, they had "tested" the attack a few days prior.

Now that I can see some patterns emerging, I'm hopeful that I can put together some of these queries/reports into an early detection system, not only for rogue bots/scrapers but also for DDoS attacks.

I ended up blocking all international traffic for the day and the attack seemed to subside (for now).

Anyway, if I get the system automated, I'll report back!


 4:01 pm on Feb 25, 2010 (gmt 0)

alexk wrote a superb script a few years ago which he posted here,
you might like to take a look at it, it's php but is designed to block bad bots and suchlike in their tracks.

it's one of the most useful postings i've ever read here, although i know it takes a different approach to what you are doing


ps. as others have noted i've also followed your thread with interest, thanks!


 6:29 pm on Feb 25, 2010 (gmt 0)

Thanks topr8. Looks very interesting. A lot to read! =)


 9:18 pm on Mar 12, 2010 (gmt 0)

Alright. I'm finally done! My early warning system is now up and running... Here is the final product:

1. Checks memory usage every 15 minutes; looks for swapping or low memory (both signs of problems) and sends a text message alert.

2. Checks web server logs (access_log) every 15 minutes for some abuse patterns:

a. If I see an unusual amount of international traffic, I'm alerted (text msg) of a potential DDoS attack (a pattern I noticed in my last attack). I've built in a switch that can stop all international traffic to my site when turned on. It is not ideal performance-wise, but better than letting them hit my database over and over.

b. I check for any IPs that are not Google, MSN, Yahoo, Ask, Amazon, etc. that are pulling more than their fair share of pages on my site. If I find one, it gets auto-blocked (via APF) and then I get a text message. The paper trail is left in the deny_hosts file so I can undo any errors later if needed.

So have I gone overboard? Am I too paranoid? Maybe. =)

It is all compiled in a few shell and perl scripts and then scheduled via cron. So far so good, it has stopped a few scrapers since yesterday... =)

If anyone is interested in more details on the source code for the pieces of the puzzle, let me know!


 2:53 am on Mar 13, 2010 (gmt 0)

Is it anything you could share here?


 4:12 am on Mar 13, 2010 (gmt 0)

it would be awesome if you could post some of the perl scripts in the "Perl Server Side CGI Scripting" [webmasterworld.com] forum and the shell scripts in the "Linux, Unix, and *nix like Operating Systems" [webmasterworld.com] forum and link all those threads from this thread!

by the way i just used a modified version of your shell command from the OP today.
the way i was doing this before (using a for loop) was so inefficient i'm embarrassed to say how long it ran.


 5:26 am on Mar 15, 2010 (gmt 0)

I just check my adsense account every 15 minutes to see if traffic spiked. /shrug.

All kidding aside, this thread has been helpful. Thank you.


 8:10 pm on Mar 15, 2010 (gmt 0)

Here is the meat of it:

Cronjob runs the following command (every 15 minutes for me):

tail -50000 /etc/httpd/logs/access_log | grep 'GET /filename1.cgi\|GET /filename2.cgi\|GET /filename3.cgi' | awk '{print $1}' | sort | uniq -c | sort -n | tail -50 | /root/jobs/ip_scan_nomail.pl | /root/jobs/ddos_scan.pl

The -50000 says to look at the last 50,000 lines of the access_log. You can tweak it to your tastes.

The filename1.cgi, filename2.cgi, etc. are the web pages I wanted to filter on in the log files. I wasn't interested in hits to non-content pages, so I specified the big-ticket pages that scrapers are always after.

The tail -50 says we want to analyze the top 50 IP addresses found in the log files.

Since I already had my ip_scan script, I used it to generate the output for my ddos_scan script. You can easily combine both scripts if you wish. Both sources are included below (minus the sendmail functions, which are typically server-specific). I get the GEO IP file from maxmind.com and have a cron job run monthly to pull the latest file.
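To see the grep stage in isolation: on a toy log mixing "big ticket" pages with other requests, the `\|` alternation keeps only the pages named (IPs fabricated, filenames as in the command above):

```shell
cat > /tmp/access_log.pages <<'EOF'
10.0.0.5 - - [01/Feb/2010:01:00:00 +0000] "GET /filename1.cgi HTTP/1.1" 200 512
10.0.0.5 - - [01/Feb/2010:01:00:01 +0000] "GET /style.css HTTP/1.1" 200 512
10.0.0.6 - - [01/Feb/2010:01:00:02 +0000] "GET /filename2.cgi HTTP/1.1" 200 512
10.0.0.5 - - [01/Feb/2010:01:00:03 +0000] "GET /filename1.cgi HTTP/1.1" 200 512
EOF

# Only the .cgi requests survive the filter; the stylesheet hit is dropped
grep 'GET /filename1.cgi\|GET /filename2.cgi' /tmp/access_log.pages | awk '{print $1}' | sort | uniq -c | sort -n
```

Note the `\|` alternation inside a basic regex is a GNU grep extension; with `grep -E` you'd write a plain `|` instead.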



#-- take input of top IPs and do a lookup... report suspicious activity via e-mail alert

require '/var/www/html/constants.pl';

# DNS lookup:
use Socket;

# Display some geo info
use Geo::IP;

my $gi = Geo::IP->open("/usr/local/share/GeoIP/GeoLiteCity.dat", GEOIP_STANDARD);

while (defined($line = <STDIN>)) {

    $line =~ s/^\s+//;  # remove leading spaces
    $line =~ s/\s+$//;  # remove trailing spaces
    @data = split(/ /, $line);  # $data[0] = hit count, $data[1] = IP

    my $record = $gi->record_by_name($data[1]);

    $iaddr = inet_aton($data[1]);
    $hostname = gethostbyaddr($iaddr, AF_INET) || "";

    $output = "Lookup: $hostname\nBot IP: $data[1]\nReads: $data[0]\nCountry: "
            . ($record ? $record->country_name : "") . "\n\n";
    print $output;
}



#-- take the ip_scan report on STDIN and block/alert on suspicious activity via email...

require '/var/www/html/constants.pl';

$cnt = 0;
$bad_lookup_cnt = 0;

while (defined($line = <STDIN>)) {

    $line =~ s/^\s+//;   # remove leading spaces
    $line =~ s/\s+$//;   # remove trailing spaces
    $line =~ s/\s+/ /g;  # collapse internal whitespace
    @data = split(/: /, $line);

    # Track the number of non-US/Canada sources...
    if ($data[0] eq "Country" && $data[1] ne "United States" && $data[1] ne "Canada") {
        $cnt = $cnt + 1;
        $country_list = $country_list . $data[1] . "\n";
    }

    # Also check for scrapers... store the lookup from this batch
    if ($data[0] eq "Lookup") {
        $tmp_lookup = $data[1];
        $tmp_lookup =~ tr/A-Z/a-z/;

        # If by chance our IP lookup service is acting up, let's not ban IPs
        # right now... so keep track of bad lookups
        if (!$data[1] || $data[1] eq " ") {
            $bad_lookup_cnt = $bad_lookup_cnt + 1;
        }
    }

    if ($data[0] eq "Bot IP") {
        $tmp_ip = $data[1];
    }

    # If not a big search bot, block and warn when reads are high...
    if ($data[0] eq "Reads" && $data[1] > 200 && $bad_lookup_cnt < 30) {
        if ($tmp_lookup !~ /(google)|(msn)|(yahoo)|(amazon)|(ask)/) {

            # Block the scraper for now... and email admin
            system("/usr/local/sbin/apf", "-d", "$tmp_ip");
            if ($? == -1) {
                $result = "APF Command failed: $!\n";
            } else {
                $result = "APF block executed: $tmp_ip";
            }

            # Send text msg
        }

        $tmp_lookup = "";
        $result = "";
        $tmp_ip = "";
    }
}

# If a high number of non-US sources are hitting the site, send an alert... potential problem/DDoS
if ($cnt > 20) {
    $msg = "High number ($cnt) of international bots hitting the server right now...";
}

[edited by: phranque at 6:27 am (utc) on Mar 16, 2010]
[edit reason] disabled graphic smileys ;) [/edit]


 11:19 am on Mar 21, 2010 (gmt 0)

I find iftop quite useful but thanks for sharing max, this train of thought makes for interesting reading...


 1:29 pm on Mar 21, 2010 (gmt 0)

Thanks. I'll check that command out.

My "system" has nabbed over 15 scrapers this week alone. I've had to tweak the limits it checks for, and how much of the log file it reviews, to be sure I'm not catching a heavy page-viewing visitor... Right now I've bumped my cron job interval down to 5 minutes, and I only look at 7000 lines of the log file each time at that interval. Then I have a job that runs at a longer interval and does a broader check of the log files in case there is a slow scraper...

This morning it blocked one that had taken 150 pages in 90 seconds... who knows how many of my 100,000 pages it would have gotten if my program hadn't caught it.

I go back and forth on this issue. On one hand, my site has survived 10 years without such a system in place. On the other hand, a few years ago I had a new competitor in my niche scrape 15,000 pages of content from my site to kick start their site... and they got away with it (long story).

So if I had this system in place, it may have stopped them. I guess it helps me sleep better at night knowing there is a good chance it will thwart the next copy-cat site that tries to steal my pages.

I know a sophisticated person can get around it, but then again, there is not much any of us can do against the most advanced tactics.


 4:55 pm on Mar 25, 2010 (gmt 0)

maximillianos, sure we are reading, and so are the hackers and automated bot writers. I have written a few security scripts, BUT made sure no one wrote them the way I did. It's not that I don't want to share; it's that if I share, I have to write the whole thing differently all over again. Some code breakers and reverse engineers are among the most successful software engineers in the world today.

Many had to cut their teeth learning their dirty trade hands-on and illegally this way, learning how to break webmasters' and sysadmins' defenses. They end up with well sought-after expertise, then EITHER work for government/private security firms officially in a legal manner (some are employed to do covert operations with the backing of the law for intelligence requirements, etc.) OR work for themselves in illegal consultancy, or as private brokers of highly sensitive info and data....

Brave of you to share. Though I know it's not high-risk knowledge, it can nonetheless be added to those people's list of "methods to be aware of" when coding their bots and scamware, etc.


 5:13 pm on Mar 25, 2010 (gmt 0)

Max, thanks for this little gem


 8:02 pm on Mar 25, 2010 (gmt 0)

sure we are reading, so are the hackers and automated bots writers and coders, I have written few security scripts...

this is not a security script - it's simply a filter and processor for a standard format web server access log.
there is not much a hacker or bot can do to avoid being logged when a request is made.


 12:42 am on Mar 26, 2010 (gmt 0)

You are welcome Demaestro.

Phranque is correct. This is not rocket science (I'm no rocket scientist... ;-). It is just a technique to analyze your logs programmatically. Yes, some common patterns of abuse are discussed, but they are common knowledge (in my opinion).

I am not worried that I helped out the bad guys by posting this. I am hopeful that I helped out some good guys though.

© Webmaster World 1996-2014 all rights reserved