
Search Engine Spider and User Agent Identification Forum

    
Best method of blocking?
Apache conf, iptables or other
jmccormac




msg:4605792
 2:43 pm on Aug 29, 2013 (gmt 0)

Is it better to block potential scrapers at IP level using iptables (Linux box) with a large (approx 40K IP ranges) list of IP ranges or would it be better to use a set of deny statements in Apache's httpd.conf or .htaccess? Would such a large set of deny statements have an effect on Apache? The ranges would be considered data centre/hoster IP ranges rather than user/dialup ranges.

Regards...jmcc

 

incrediBILL




msg:4605915
 9:41 pm on Aug 29, 2013 (gmt 0)

Not a big fan of playing with iptables, although I've done it many times. Unless you specify particular ports in iptables you're blocking all access (email, etc.), whereas doing it in Apache usually only impacts port 80.
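A minimal sketch of that port distinction, purely for illustration: generate iptables rules that drop only web traffic from a list of CIDR ranges, leaving mail and everything else reachable. The input file name is hypothetical, not anything from this thread.

# Emit iptables rules that drop only web traffic (ports 80/443) from a
# list of CIDR ranges; other services (SMTP, SSH, ...) stay reachable.
# "blocked_ranges.txt" is a hypothetical one-CIDR-per-line file.
import ipaddress

def web_only_rules(path="blocked_ranges.txt"):
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            net = ipaddress.ip_network(line, strict=False)  # validates the range
            for port in (80, 443):
                yield f"iptables -A INPUT -s {net} -p tcp --dport {port} -j DROP"

if __name__ == "__main__":
    for rule in web_only_rules():
        print(rule)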

Apache pre-processes and caches stuff so it's pretty fast even with 10s of thousands of DENY statements. There are tricks to make it faster, some not in .htaccess, including RewriteMaps and indexed DBM files.

My preferred method is to do the same in MySQL and PHP at the beginning of all your scripts or files and keep all the data crunching out of Apache because it's flaky at best.
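A rough sketch of that script-level idea, in Python purely for illustration (the post describes PHP and MySQL), assuming the deny list has already been pulled out of the database into memory:

# Script-level blocking in the spirit of the approach described above.
# Python stands in for PHP; the example ranges are documentation networks.
import ipaddress

DENY = [ipaddress.ip_network(c) for c in ("203.0.113.0/24", "198.51.100.0/24")]

def is_blocked(remote_addr):
    """Return True if the client address falls inside any denied range."""
    ip = ipaddress.ip_address(remote_addr)
    return any(ip in net for net in DENY)

# At the top of every page or script you would bail out early, e.g.:
#   if is_blocked(client_ip):
#       send a 403 (or nothing at all) and stop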

The real problem is that blocking IP by IP as you encounter them is just a waste of your time, because they have new IPs as fast as you've blocked the old ones. That's why blocking entire countries and data centers cuts to the chase and gets to the core of the problem.

jmccormac




msg:4606024
 5:30 am on Aug 30, 2013 (gmt 0)

The iptables approach is useful because it can be used to block ports 80/443 but it also reduces unwanted traffic on the site and stops it getting to Apache. The main worry is the upper limit of IP ranges for iptables before it starts impacting the server performance. I've blocked China and a few other problem countries on a relatively low powered server with no major impact.

Using Apache with deny statements would generally result in a 403 page unless it is a customised minimal result page. Many scrapers tend to be quite braindead and ignore 403s and keep right on hammering sites. The indexed DBM files approach is one I hadn't thought of, though.

The IP range list is essentially a nuclear option as it blocks most data centres. It is a range list rather than individual IPs, so while there may be some collateral damage for people using data centre IPs for web proxies, it should kill about 98% of scrapers. The main disadvantage with an iptables approach is on a multi-site webserver, where each site might need a separate set of iptables rules. That's where iptables, for me at least, begins to change from a simple solution to a complex one.

Regards...jmcc

wilderness




msg:4606105
 11:21 am on Aug 30, 2013 (gmt 0)

approx 40K IP ranges


Is this 40k in ranges or 40k in file size?

If 40k in ranges, then you have the IP tables compiled incorrectly (at least in regard to IP ranges).

I have nearly every country outside of the US & Canada denied (as well as other ranges, and custom solutions, within those boundaries), plus data centers within the same boundaries, and I don't have anywhere near 40k lines in my htaccess or 40k IP ranges.

The main worry is the upper limit of IP ranges for iptables before it starts impacting the server performance.


Your concerns about server performance are primarily related to how, and with what, the rest of your site(s) runs (PHP, MySQL, Java, etc.).

bhukkel




msg:4606109
 11:47 am on Aug 30, 2013 (gmt 0)

My preferred method is to do the same in MySQL and PHP at the beginning of all your scripts or files and keep all the data crunching out of Apache because it's flaky at best.


This is also my method. When you download IP info from ARIN, RIPE, etc. and import it into the database, you are more on autopilot for blocking countries. When you combine this with known blocklists like StopForumSpam, the protection gets even better.

Of course this is also scriptable for iptables.
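For what it's worth, the RIRs publish "delegated" statistics files, and assuming that is the data being imported, the step might look roughly like this Python sketch. The pipe-delimited layout shown (registry|cc|type|start|value|date|status) is the classic delegated-stats format; treat the details and the file path as assumptions.

# Rough sketch: parse an RIR "delegated" statistics file (e.g.
# delegated-ripencc-latest) into per-country CIDR lists, ready to import
# into a database or feed to an iptables/Deny generator.
import ipaddress
from collections import defaultdict

def ranges_by_country(path):
    out = defaultdict(list)
    with open(path) as fh:
        for line in fh:
            fields = line.strip().split("|")
            if len(fields) < 7 or fields[2] != "ipv4":
                continue  # skips comments, header, summary lines, ipv6/asn records
            cc, start, count = fields[1], fields[3], int(fields[4])
            first = int(ipaddress.IPv4Address(start))
            # "value" is the number of addresses allocated from `start`;
            # summarize_address_range turns start/end into minimal CIDR blocks
            for net in ipaddress.summarize_address_range(
                    ipaddress.IPv4Address(first),
                    ipaddress.IPv4Address(first + count - 1)):
                out[cc].append(net)
    return out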

wilderness




msg:4606115
 12:15 pm on Aug 30, 2013 (gmt 0)

When you download IP info from ARIN, RIPE, etc. and import it into the database, you are more on autopilot for blocking countries.


Potentially speaking, simply downloading and importing country-designated IPs into anything explains how you end up with 40k ranges.

Theoretically, you could use about two dozen Class A blocks alone and reduce that 40k by 50-60%, perhaps even more.
Of course, the Class A reduction is all dependent on the specifics of your website(s) and what visitors you cater to.

jmccormac




msg:4606116
 12:19 pm on Aug 30, 2013 (gmt 0)

Is this 40k in ranges or 40k in file size?

If 40k in ranges, then you have the IP tables compiled incorrectly (at least in regard to IP ranges).
40K in IP ranges. I have the website:ip address mapping of websites in com/net/org/biz/info/mobi/asia/us and approximately another fifteen million sites in various ccTLDs. It is part of a survey that I run. I suppose I could optimise the ranges. What I have noticed is a lot of Chinese and Indian subnets using US and CA IP ranges.

Regards...jmcc

wilderness




msg:4606120
 12:34 pm on Aug 30, 2013 (gmt 0)

40K in IP ranges. I have the website:ip address mapping of websites in com/net/org/biz/info/mobi/asia/us and approximately another fifteen million sites in various ccTLDs.


ABSURD!

You could reduce those numbers by leaps and bounds with a few lines of mod_rewrite

Deny from 1. 2. 5.
RewriteCond %{REMOTE_ADDR} ^12[0-6]\. [OR]
RewriteCond %{REMOTE_ADDR} ^14\. [OR]
RewriteCond %{REMOTE_ADDR} ^141\. [OR]
RewriteCond %{REMOTE_ADDR} ^150\. [OR]
RewriteCond %{REMOTE_ADDR} ^17[5-9]\. [OR]
RewriteCond %{REMOTE_ADDR} ^18[0-9]\.
RewriteCond %{REMOTE_ADDR} ^(190|19[3-6])\. [OR]
RewriteCond %{REMOTE_ADDR} ^(20[0-3]|209|21[0-3])\. [OR]
RewriteCond %{REMOTE_ADDR} ^(5[789]|6[012])

Just some to get you rolling.
Of course, you may need to modify these for your httpd.conf.

What I have noticed is a lot of Chinese and Indian subnets using US and CA IP ranges.


Perhaps they're just not interested in "My widgets"?

Please note: the forum has gone back to modifying the pipe character.

wilderness




msg:4606125
 12:41 pm on Aug 30, 2013 (gmt 0)

What I have noticed is a lot of Chinese and Indian subnets using US and CA IP ranges.


Additionally, those IPs likely fall into the "server farm" category, of which there are multiple threads in this forum.

jmccormac




msg:4606127
 12:54 pm on Aug 30, 2013 (gmt 0)

You could reduce those numbers by leaps and bounds with a few lines of mod_rewrite
That approach (starting with A ranges) might be ok for country level blocks but sorting genuine human users from data centres and hosting ranges might require a bit more precision. Blocking at the upstream provider IP range might deal with server farms. Those CN/IN subnets are a common thing. Where a country's internet infrastructure isn't well developed a high percentage (possibly 50% or more) of that country's websites might be hosted on IP ranges outside that country. The US, CA, DE and UK tend to be the most popular. Using an A approach is, perhaps, like using a chainsaw when a scalpel is required. That said, I do have a few countries blocked on some of my sites.

Regards...jmcc

[edited by: jmccormac at 1:27 pm (utc) on Aug 30, 2013]

bhukkel




msg:4606130
 1:17 pm on Aug 30, 2013 (gmt 0)

Reducing the numbers is more important for the mod_rewrite and iptables blocking methods than it is for storing the ranges in a MySQL database and blocking in PHP, on the condition that the table has the right structure and indexes.
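To illustrate the "right structure and indexes" point, here is a rough sketch (sqlite3 standing in for MySQL, with made-up table and column names): store each range as integer start/end columns, index the start, and resolve a client IP with a single indexed query.

# Sketch only: ranges as integer start/end, indexed on the start column.
# Assumes the stored ranges do not overlap.
import ipaddress
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE blocked (ip_start INTEGER, ip_end INTEGER)")
db.execute("CREATE INDEX idx_blocked_start ON blocked (ip_start)")

def add_range(cidr):
    net = ipaddress.ip_network(cidr)
    db.execute("INSERT INTO blocked VALUES (?, ?)",
               (int(net.network_address), int(net.broadcast_address)))

def is_blocked(addr):
    ip = int(ipaddress.ip_address(addr))
    # The candidate row is the one with the largest ip_start <= the client IP.
    row = db.execute(
        "SELECT ip_end FROM blocked WHERE ip_start <= ? "
        "ORDER BY ip_start DESC LIMIT 1", (ip,)).fetchone()
    return row is not None and ip <= row[0]

add_range("203.0.113.0/24")          # example (documentation) range
print(is_blocked("203.0.113.10"))    # True
print(is_blocked("198.51.100.5"))    # False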

jmccormac




msg:4606132
 1:27 pm on Aug 30, 2013 (gmt 0)

Publishing out Apache deny statements or iptables rules is probably a better use of MySQL.

Regards...jmcc

wilderness




msg:4606159
 3:21 pm on Aug 30, 2013 (gmt 0)

You could reduce those numbers by leaps and bounds with a few lines of mod_rewrite


Using an A approach is, perhaps, like using a chainsaw when a scalpel is required.


You'll find out over time that it's far easier and less time consuming to keep the blade sharp on your chainsaw than it is to keep your scalpel sharp. And the long range benefits of allowing the wider IP ranges are not that great unless your sales are in the millions.

brotherhood of LAN




msg:4606197
 6:28 pm on Aug 30, 2013 (gmt 0)

Would the "allow" list be smaller?

wilderness




msg:4606200
 6:32 pm on Aug 30, 2013 (gmt 0)

Would the "allow" list be smaller?


Likely not, considering how precise he wants to be with the IPs; however, it's certainly worth exploring.

dstiles




msg:4606208
 7:13 pm on Aug 30, 2013 (gmt 0)

I have a little over 8000 IP ranges, large and small (more than /16 to less than /24), listed as Always Ban (ie: server farms and aggressive DSL sub-ranges).

I further ban according to UA and other header fields which currently blocks about 4700 single IPs on a short-term (temporary) basis (that's just this year).

As to DSL ranges: acceptable countries 5300 and undesirable countries 2300.

This is obviously not the entire internet, even for ipv4. I add a few on a daily basis as new ones insinuate themselves (mostly due to DSL and servers contracting viruses!) but it's not unmanageable; certainly fewer than 40,000 ranges.

motorhaven




msg:4606292
 2:45 am on Aug 31, 2013 (gmt 0)

I use a handy Perl script which takes a list of ranges, merges overlaps and adjacent ranges, and spits out a smaller, optimal list. My iptables entries numbered about 25,000 before the Perl script, about 19,000 after.

As I add ranges, my iptables list actually becomes smaller. The additional ranges tend to be adjacent ranges or fill in the gap between two non-adjacent ranges.
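The same optimisation can be sketched in a few lines of Python (not motorhaven's Perl script, just the same idea): ipaddress.collapse_addresses merges overlapping and adjacent networks into the smallest equivalent list. The input file name is hypothetical.

# Collapse a list of CIDR ranges into the minimal equivalent set.
import ipaddress

def optimise(path="blocked_ranges.txt"):
    nets = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#"):
                nets.append(ipaddress.ip_network(line, strict=False))
    return list(ipaddress.collapse_addresses(nets))

# Example: two adjacent /24s plus a redundant /25 collapse to a single /23.
nets = [ipaddress.ip_network(n) for n in
        ("192.0.2.0/24", "192.0.3.0/24", "192.0.2.0/25")]
print(list(ipaddress.collapse_addresses(nets)))  # [IPv4Network('192.0.2.0/23')]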

incrediBILL




msg:4606302
 3:45 am on Aug 31, 2013 (gmt 0)

collateral damage for people using data centre IPs for web proxies


That's not collateral damage, that's a bonus.

Never had anything but trouble out of web proxies including 302 hijacking back in the day.

IP range blocks are like a firewall, and you can still punch holes through it for various services by whitelisting them ahead of the IP range rules.

The country IP range database can be downloaded for free from MaxMind, which easily allows countries to be blocked by country code with just a couple of lines of code. It's very fast and avoids the bazillion lines of Deny statements.
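A hedged sketch of that combination in Python, using the geoip2 package and a GeoLite2-Country database file (both assumptions on my part; the post predates GeoLite2, but the idea is the same). The whitelist is checked first, i.e. holes are punched ahead of the country block; the country list and the Googlebot range are only examples drawn from elsewhere in this thread.

# Country-code blocking via a MaxMind database, with a whitelist checked first.
import ipaddress
import geoip2.database
import geoip2.errors

BLOCKED_COUNTRIES = {"CN", "VN", "NG", "RU"}            # example list only
WHITELIST = [ipaddress.ip_network("66.249.64.0/19")]    # e.g. Googlebot's crawl range

reader = geoip2.database.Reader("GeoLite2-Country.mmdb")  # path is an assumption

def is_blocked(addr):
    ip = ipaddress.ip_address(addr)
    if any(ip in net for net in WHITELIST):
        return False                  # hole punched ahead of the country block
    try:
        cc = reader.country(addr).country.iso_code
    except geoip2.errors.AddressNotFoundError:
        return False
    return cc in BLOCKED_COUNTRIES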

jmccormac




msg:4606325
 5:53 am on Aug 31, 2013 (gmt 0)

Would the "allow" list be smaller?
Possibly. It is certainly worth considering.

And the long range benefits of allowing the wider IP ranges are not that great unless your sales are in the millions.
It depends on the audience for the website as much as the sales. Blocking entire countries might be acceptable with some sites, especially if there is no financial argument for allowing traffic from that particular country. However a site that has a localised, country-level market and only sells to that country could benefit from blocking countries outside its market. The important point is that there is no one-size-fits-all approach to blocking.

You'll find out over time that it's far easier and less time consuming to keep the blade sharp on your chainsaw than it is to keep your scalpel sharp.
If I was simply basing the approach on detecting problem ranges as they hit my sites, then the A approach might make sense. However I don't use that approach. As part of the work I do on hoster statistics and domain name tracking, the IPs for about 3.6 million DNSes have to be checked (simple country level resolution in most cases) and that produces a list of approximately 3.3 million distinct IP addresses each month. That's separate from the surveys of the website IPs of com/net/org/biz/info/mobi/asia/us/etc. The website IP survey is part of a full web mapping project and it does produce a lot of IP data. There may be a vast difference between this relatively industrialised approach and the "block on detection" approach.

Regards...jmcc

jmccormac




msg:4606326
 6:03 am on Aug 31, 2013 (gmt 0)

The country IP range database can be downloaded for free from MaxMind, which easily allows countries to be blocked by country code with just a couple of lines of code. It's very fast and avoids the bazillion lines of Deny statements.
Maxmind data can be useful for country level blocks but beyond that it has granularity problems.

Regards...jmcc

phranque




msg:4606350
 10:00 am on Aug 31, 2013 (gmt 0)

just on general principles regarding using the web server to block - you definitely don't want to put 40K IP ranges in .htaccess since the entire file will be interpreted for every request made in that directory or any subdirectories.

in httpd.conf those would be interpreted once upon server startup.

lucy24




msg:4606380
 11:59 am on Aug 31, 2013 (gmt 0)

or would it be better to use a set of deny statements in Apache's httpd.conf or .htaccess

That makes it sound as if config vs htaccess is a trivial and secondary decision. Config vs. some-other-method is one question; htaccess vs. some-other-method is a completely different one.

jmccormac




msg:4606406
 1:45 pm on Aug 31, 2013 (gmt 0)

Thanks Phranque,
It might be better to use httpd.conf for the bulk blocks but use .htaccess for the temporary blocks (unless they require an IP-level block).

Yep Lucy24. The iptables option is probably the more instinctive solution because it just drops the packets and doesn't return a 403. I was just wondering what was the best method of blocking.

Regards...jmcc

motorhaven




msg:4606430
 4:24 pm on Aug 31, 2013 (gmt 0)

I'd rather just drop the packets. I've seen some pretty stubborn and/or poorly coded bots which will keep slamming the server when they receive a 403 instead of going away. I'd rather they simply get silence than waste any more machine cycles.

incrediBILL




msg:4606620
 9:30 pm on Sep 1, 2013 (gmt 0)

The iptables option is probably the more instinctive solution because it just drops the packets and doesn't return a 403.


There's a difference between blocking a DDoS, hackers, spammers and bad bots versus eliminating all the rest of the noise, because some of the stuff being blocked might report on the status of your site, and each should be treated appropriately. What might be fine for fighting a DDoS or hackers, i.e. using iptables and returning no status, no robots.txt, etc., isn't good for handling the rest of the bots with no feedback whatsoever. The crawlers, and sometimes users, don't know if your site is down, has a bad internet connection, etc., and that can lead to sites incorrectly listing you as offline or dropping your site altogether from directories, link lists, etc.

Remember, if you do accidentally block a human, the "403 Forbidden" page can give them instructions on how to unblock their IP if you offer that option, but iptables can't.

I run a directory, and if I can't connect to the site the listing gets dropped automatically, which is a shame: someone pays just to get listed but instead gets immediately dumped when there's no response to my link checker.

Usually this is because the host (and there are a couple) now includes bot blocking at the firewall level, which obviously thwarts link checkers as well. Most are nice enough to give me a 403 Forbidden, which means it's probably being done at the Apache level.

Don't forget to specify port 80 or 443 in your iptables rules or you can kiss email g'bye.

I just think it's a risky way to go unless you're doing something like firewalling hackers and spammers from China, Vietnam, Nigeria, Russia, etc. and in those cases blocking all ports in IPtables is exactly what I do.

Out of site, out of mind.

But for content protection only, I stick with blocking in Apache or using a PHP script with a table of IP ranges.

Obviously some of this is just philosophical differences in how to handle the situation, but I like to avoid as much collateral damage as possible, and doing it at the Apache and script level is the only way to really avoid collateral damage IMO.

YMMV

brotherhood of LAN




msg:4606625
 10:00 pm on Sep 1, 2013 (gmt 0)

Great point re: inadvertent blocking & directories, Bill. Apache blocking seems to be the more cautious route.

I personally don't have much experience with changing iptables, but if it's possible to map port to port then I'd map all blocked requests to port 80/443 to a new port, say 1111.

Have a dozen-line C program sitting there just dishing out 403 HTTP headers and a short response. It'd take 100K of memory and would be very quick. You wouldn't need to read what the clients are asking... just keep serving the same response.
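That tiny responder can be sketched in Python just as easily as C (illustration only; port 1111 comes from the post, and an iptables REDIRECT rule in the nat table would be one way to steer the blocked traffic to it):

# Minimal always-403 responder for redirected (blocked) traffic.
import socket

RESPONSE = (b"HTTP/1.1 403 Forbidden\r\n"
            b"Content-Type: text/plain\r\n"
            b"Content-Length: 8\r\n"
            b"Connection: close\r\n\r\n"
            b"Blocked\n")

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("0.0.0.0", 1111))
srv.listen(128)
while True:
    conn, _ = srv.accept()
    try:
        conn.sendall(RESPONSE)   # never bother reading the request
    finally:
        conn.close()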

motorhaven




msg:4606632
 10:34 pm on Sep 1, 2013 (gmt 0)

I have found the vast majority of directories are garbage, with all due respect to IncrediBill. The link worthiness from them tends to be garbage, along with all the "domain info" and "seo info" sites. There are exceptions of course, and I'm careful to make note of the ranges of important things and punch holes in the firewall for them. Frankly, I work at getting my domains out of many of those, so if they drop my site that's a bonus.

I declare port 80 for the Americas, Europe, parts of Africa and Australia. Most of Asia I block all ports - it cuts down on the noise in my daily server log notification (there's an RBL lookup for email which spits out status).

I use a combination of both: iptables for the vast majority of it, because the vast majority is garbage and not human, and .htaccess for ranges which are iffy and not 100% confirmed.

slipkid




msg:4607320
 7:02 pm on Sep 4, 2013 (gmt 0)

@ wilderness

I note that your class A blocks reference 209 which includes the Google IP span 209.85.128.0 - 209.85.255.255.

Was this intended? If so, why?

wilderness




msg:4607322
 7:08 pm on Sep 4, 2013 (gmt 0)

Unless you have some specific benefit in requiring all of Google's tools (and/or AdWords), all you really need to allow is 66.249.64-95; that is where their bot crawls from.

There are many, many old references in the archives which have discussed this.

lucy24




msg:4607361
 9:56 pm on Sep 4, 2013 (gmt 0)

I have found the vast majority of directories are garbage, with all due respect to IncrediBill

The bare fact that his directories are actively maintained -- which was the point of his post -- surely puts him in the 5% :)

Bill, do you have a recognizable UA that isn't subject to blocking by an ordinary responsible webmaster? An obvious example is the w3 link checker. Every time I run it I have to comment-out one line of UA blocks.

:: wandering off to figure out if I can code an exception within mod_setenvif ::
