
Search Engine Spider and User Agent Identification Forum

This 94-message thread spans 4 pages; this is page 2.
Block non-North American Traffic for Dummies Like Me
Reducing the size of your blocking list.
webcentric - msg:4663917 - 6:48 pm on Apr 17, 2014 (gmt 0)

First off, this subject has been discussed before, but I feel there's enough current interest on this board (and on other boards here at WebmasterWorld) to warrant a fresh top-down discussion of the subject. We'll see if our moderators agree.

The list of CIDRs below was compiled from the IANA IPv4 Address Space Registry report [iana.org]. The list is a compact version of all Allocated non-ARIN /8 blocks (from APNIC, RIPE NCC, AFRINIC, and LACNIC). For example, 58.0.0.0/7 merges 58.0.0.0/8 and 59.0.0.0/8 into a single CIDR. The largest block in this list is 80.0.0.0/4, which covers the entire 80.0.0.0 through 95.255.255.255 address range.

Some of the CIDRs below merge blocks from different registries, e.g. combining blocks from both RIPE NCC and APNIC. As such, this does not in any way represent an approach surgical enough to differentiate blocks in one RIR from blocks in another (let alone blocks representing specific countries). The goal here is to arrive at a blocking strategy that keeps people and bots from outside North America off your site.

It should also be noted that the list below is only intended as a good first step where blocking is concerned. There are many holes in the Legacy blocks that this step does not address, and proxies are a whole other avenue of ingress. The intention here is to narrow the scope of the task with as little effort as possible.

One tangible benefit of this approach can be seen in the 176.0.0.0/5 range, which blocks 176.0.0.0 to 183.255.255.255. This CIDR contains some AWS and Rackspace ranges (and probably other server farms as well). Blocking this range means you don't have to identify and separately block those server-farm ranges.

1.0.0.0/8
2.0.0.0/8
5.0.0.0/8
14.0.0.0/8
27.0.0.0/8
31.0.0.0/8
36.0.0.0/7
39.0.0.0/8
41.0.0.0/8
42.0.0.0/8
46.0.0.0/8
49.0.0.0/8
58.0.0.0/7
60.0.0.0/7
62.0.0.0/8
77.0.0.0/8
78.0.0.0/7
80.0.0.0/4
101.0.0.0/8
102.0.0.0/7
105.0.0.0/8
106.0.0.0/8
109.0.0.0/8
110.0.0.0/7
112.0.0.0/5
120.0.0.0/6
124.0.0.0/7
126.0.0.0/8
175.0.0.0/8
176.0.0.0/5
185.0.0.0/8
186.0.0.0/7
189.0.0.0/8
190.0.0.0/8
193.0.0.0/8
194.0.0.0/8
195.0.0.0/8
197.0.0.0/8
200.0.0.0/7
202.0.0.0/7
210.0.0.0/7
212.0.0.0/7
217.0.0.0/8
218.0.0.0/7
220.0.0.0/7
222.0.0.0/7
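
(A quick sanity check, not part of the blocking list itself: the merging described above can be verified with Python's standard ipaddress module. The covers() helper below is just an illustrative sketch, with a name of my own choosing.)

import ipaddress

def covers(merged, parts):
    # True if `merged` equals the union of the /8 blocks in `parts`.
    collapsed = list(ipaddress.collapse_addresses(ipaddress.ip_network(p) for p in parts))
    return collapsed == [ipaddress.ip_network(merged)]

print(covers("58.0.0.0/7", ["58.0.0.0/8", "59.0.0.0/8"]))                # True
print(covers("80.0.0.0/4", [f"{n}.0.0.0/8" for n in range(80, 96)]))     # True
print(covers("176.0.0.0/5", [f"{n}.0.0.0/8" for n in range(176, 184)]))  # True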

So, I'm hoping that

1. This list is helpful to those looking for a starting point.
2. If there's a mistake in the list above, the moderators will see fit to correct it when the mistake is identified, so that the first post can reflect accurate and up-to-date information.
3. This discussion can move forward with new ranges outside the Allocated blocks to help expand this list even further. Anyone want to block the UK Ministry of Defence (sic)? That /8 block and others are omitted from this initial list because they are Legacy blocks.

And last for now: it is possible to further reduce the above list to a series of Regular Expressions, which would be even more condensed than the list above. For those with access to a rewrite module (Apache or IIS) such a list would be valuable, but I'll leave it up to an expert in that arena to post it if they care to. I hope this helps someone and saves them the time I (and many others) have spent whittling down the world a bit.

Comments and corrections are most welcome!

 

webcentric - msg:4664285 - 9:52 pm on Apr 18, 2014 (gmt 0)

Here's a preliminary assessment of RIPE...

These blocks have no US or Canadian allocations per assigned country codes.

025/8 -- RIPE NCC LEGACY
051/8 -- RIPE NCC LEGACY
141/8 -- RIPE NCC LEGACY single exception 141.0.8.0
145/8 -- RIPE NCC LEGACY
151/8 -- RIPE NCC LEGACY
188/8 -- RIPE NCC LEGACY

And here are the final 41 exceptions (plus one Canadian) that I believe will agree with bhukkel's list of 53.

109.70.88.0
130.26.0.0
141.0.8.0
149.154.0.0
165.218.0.0
176.120.16.0
176.67.80.0
185.40.156.0
185.46.120.0
185.47.84.0
185.51.4.0
185.52.0.0
185.52.136.0
193.138.72.0
193.164.220.0
193.34.36.0
193.58.216.0
194.153.155.0
194.42.216.0
195.190.24.0
195.200.84.0
195.216.225.0
195.230.108.0
195.42.132.0
195.66.102.0
195.66.132.0
212.1.208.0
213.137.64.0
31.170.160.0
37.18.176.0
46.231.240.0
5.152.184.0
87.239.136.0
87.76.16.0
88.151.224.0
91.205.100.0
91.209.57.0
91.209.6.0
91.225.248.0
93.183.0.0
93.188.128.0

and one for Canada
193.28.87.0

Of the 54 exceptions, several that I've checked are server farms, so I wouldn't even consider them collateral damage. Again, the above are range starting IPs.

I appreciate the perspectives posted here regarding the wisdom of blocking traffic from these ranges. In my particular situation, blocking these ranges does several things in the context of my business strategy.

Most of my traffic now (like 99.9%) is from the US. Many advertisers see this as a plus. Why pay for impressions in Indonesia, for example, when they can get their ads viewed exclusively by people in America?

Since implementing the above blocks, I've seen a negligible reduction in traffic which has since been replaced by the type of traffic that I do want.

I run a server with 2 processors and 6 gigs of RAM (for one website), and since blocking the above (and a few major US server farms), I've seen it go from using virtually every bit of available resources on a constant basis to using less than 3 gigs of RAM at peak hours, with the processors off playing checkers together because they're so bored.

My point is that this blocking strategy didn't stop much actual human traffic at all; what it did greatly reduce was robotic traffic. My hosting provider couldn't explain why my server was so busy at the traffic levels I have. The exorbitant use of resources is what led me down this path, and for me, the proof is in the pudding.

I'm going to take a closer look at the exceptions above and see if there is some stuff I should be legitimately letting in so we'll see what this second phase of the process produces for a blocking list.

As for CIDR being any less capable of dealing with this problem than RegEx: I would say that using only /8 blocks is definitely a non-surgical approach, but who says you have to stick to /8s? Any IP range you can express with RegEx can also be expressed with CIDR notation or with subnet notation. They're all capable; some are just more efficient than others depending on the situation. The original list would definitely get longer after adding the LEGACY blocks mentioned above, and even longer if you decide to poke holes in the ranges for the aforementioned exceptions. Still, it would be a pretty efficient list in any situation, and quite effective as I've described.
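
To make the CIDR-vs-RegEx equivalence concrete, here is a rough sketch (Python's standard ipaddress module; range_to_cidrs is just an illustrative name) showing that any start-to-end range can be rewritten as a minimal set of CIDR blocks:

import ipaddress

def range_to_cidrs(first, last):
    # Minimal list of CIDR blocks covering first..last inclusive.
    return [str(net) for net in ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last))]

print(range_to_cidrs("58.0.0.0", "61.255.255.255"))    # ['58.0.0.0/7', '60.0.0.0/7']
print(range_to_cidrs("176.0.0.0", "183.255.255.255"))  # ['176.0.0.0/5']

Both results match entries in the opening list.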

p.s. The techniques and resources already mentioned in this thread could be used to build a country-blocking strategy for specific countries. It just so happens that I'm interested in a North American strategy and have seen interest from others as well. More to come I'm sure.

103.


?

Samizdata - msg:4664292 - 11:32 pm on Apr 18, 2014 (gmt 0)

regarding the wisdom of blocking traffic from these ranges

The question is whether there is any wisdom at all in blocking innocent people.

It would be easy enough to serve them a page explaining why the rest of your site is inaccessible.

People are not robots, and shouldn't be treated as such.

Any "collateral damage" is to your reputation.

...

webcentric - msg:4664293 - 11:57 pm on Apr 18, 2014 (gmt 0)

@Samizdata -- How a request from any "blocked" IP is handled is separate from deciding which IPs to block. Serve a 404, a 403, or a pretty page saying our site is only available to people in the US and Canada. How the request is handled wasn't really the target of this line of reasoning. The term "blocking" can mean different things to different webmasters in different scenarios.

My site might contain information or technology that is illegal to export outside the United States. I'd say that's a definitive reason for not allowing people from Iran on my site, for example. I don't really care if it hurts my reputation in Iran. That population is not a customer base I can cater to, so why expose myself to a visit from the Justice Department, let alone a potential attack from a country that will never make me a cent and will most likely cost me a significant amount of grief and money over the long haul? You can rightfully say that there are "innocent" people in Iran, but that still doesn't mean I can sell to them. I contend that there are legitimate reasons for taking these kinds of actions, but again, every webmaster has to make these kinds of decisions based on their own business requirements.

lucy24 - msg:4664295 - 12:43 am on Apr 19, 2014 (gmt 0)

103.

No, I really did mean 104. 103 is simply APNIC's answer to 185. And speaking of 185, I wouldn't bother poking holes for /22s that appear to be US. I'm pretty sure those are really just Dutch servers.

And psst!

\b(\d?\d)\b
>>
0$1
Repeat once, sort, and then undo using
\b0+(\d+)\b
>>
$1

Or, of course, \1 depending on your text editor.

wilderness - msg:4664308 - 2:09 am on Apr 19, 2014 (gmt 0)

And psst!


<snip>

There ya are, speaking in tongues again ;)

lucy24 - msg:4664312 - 2:19 am on Apr 19, 2014 (gmt 0)

If you pad each CIDR element with leading zeros to make exactly three digits, an automated sort will put everything in correct numerical order. When done, remove the extraneous zeros.
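
For anyone who would rather script it than run the editor substitutions, here is a rough Python equivalent of the pad-sort-unpad trick (function names are mine; a single zfill pass does what the editor regex needs two passes for):

import re

def pad(line):
    # Left-pad every 1- or 2-digit number to three digits.
    return re.sub(r"\b(\d{1,2})\b", lambda m: m.group(1).zfill(3), line)

def unpad(line):
    # Strip the extra leading zeros again.
    return re.sub(r"\b0+(\d+)\b", r"\1", line)

cidrs = ["109.70.88.0", "31.170.160.0", "5.152.184.0", "91.209.6.0"]
print([unpad(c) for c in sorted(pad(c) for c in cidrs)])
# ['5.152.184.0', '31.170.160.0', '91.209.6.0', '109.70.88.0']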

webcentric - msg:4664319 - 2:57 am on Apr 19, 2014 (gmt 0)

Thanks Lucy. I'm working in several different files at the moment (DB and spreadsheet). In some of the lists I've padded the octets (converted to base 10, of course) with zeros, and in others not. That was a quick list straight out of a raw sort. The next iteration of the master list, with exceptions noted, should be sorted correctly. I'm on Windows and my text editor is an idiot. Well, I think I have jEdit around here somewhere, but I generally like working with data in a database. I can barely abide spreadsheets, but sometimes they can be useful. I think most of us love it when you speak in tongues, BTW. I think I can safely say that, but I shouldn't presume, I guess. ;)

Samizdata - msg:4664320 - 2:57 am on Apr 19, 2014 (gmt 0)

How the request is handled wasn't really the target of this line of reasoning

The line of reasoning has so far been entirely negative.

It cannot increase sales and will probably annoy some potential or existing customers.

That is why I suggested a redirect rather than a block - it is damage limitation.

My site might contain information or technology that is illegal to export outside the United States.

Putting such material on a public website would seem rather negligent.

I'd say that's a definitive reason for not allowing people from Iran on my site, for example.

There is a difference between "people from Iran" and "people in Iran".

People travel between countries all the time; there are plenty of foreigners in the USA right now.

Interesting that you also cited Indonesia as an example to block all traffic from.

I understand that the current US President lived there for a few years.

I contend that there are legitimate reasons for taking these kinds of actions

You are legitimately entitled to block anything you want.

That doesn't mean it is good for business.

Others might prefer to spend their time and energy on something positive.

...

webcentric - msg:4664328 - 3:54 am on Apr 19, 2014 (gmt 0)

@Samizdata --

The line of reasoning has so far been entirely negative.


Really? I take it as an affirmative process to define my target audience as I see fit. Perhaps you see this as xenophobic, but I assure you that this is not about hating the rest of the world. It's about defining the scope of where I want to focus my energies in my day-to-day business dealings. Many small businesses in America don't even have websites because they don't care about business from the next state, let alone business from another country.

It cannot increase sales and will probably annoy some potential or existing customers.


Who says I'm selling anything? That's an assumption on your part. In the year prior to taking these steps (which I'm now trying to refine) I didn't get a single ad click from any foreign visitor in my AdSense account (except a few from Canadian visitors). The content on my site is virtually irrelevant to the rest of the world (with the exception of scrapers, hackers, spammers and bots), and the ads are too. This is not an overstatement. This site traditionally attracts US visitors, and more than 95% of the traffic from outside North America falls into the nefarious category. It's the nature of the site. There's also plenty of nefarious activity coming from inside the US, but that's another set of problems. I made a decision to draw a line in the sand, and the result is that earnings are increasing and expenses are going down.

That is why I suggested a redirect rather than a block - it is damage limitation.


It's a good suggestion.

Putting such material on a public website would seem rather negligent.


It's done every day. Downloads of software are often restricted by country.

There is a difference between "people from Iran" and "people in Iran".


Touché!

Interesting that you also cited Indonesia as an example to block all traffic from.

I understand that the current US President lived there for a few years.


And my brother is currently in Africa.

Others might prefer to spend their time and energy on something positive.


Me too! That's why I chose to take these steps. Keeps me much more focused on the part of my business that makes money as opposed to the part that drains the bank.

You know, perhaps this line of reasoning should be outlined so you can see where it's headed because it's far from complete at this stage.

Step 1-2: Block non-ARIN /8 blocks.
Step 3: See if we need to poke holes in any of those blocks for legitimate North American traffic.
Step 4: Sift through ARIN for any more stuff we want to block.
Step 5: Consider whether we want to open up any specific countries or regions such as Australia, the UK, South Africa or whatever.

This /8 block scheme is just a framework that can be tinkered with as you see fit. The approach is best suited to someone who wants to block whole regions. It's a top-down approach and can get more granular as the structure is refined. In other scenarios, starting in a granular fashion may be more appropriate (such as blocking server farms as they pop up). You'll find that the granular approach inevitably leads to merging of ranges. I just decided to start with the largest blocks possible and intend to open things up as I see fit, rather than constantly shutting down new ranges every day.

Anyway, anyone considering this approach will be able to read the pros and cons as mentioned in this thread. Your considerations are valid, and I wouldn't suggest anyone implement this without giving it some serious thought first. Hopefully this thread can provide folks with the ammunition necessary to make an intelligent decision in these regards. That's my real intention here. This kind of research takes time, and if this thread can save someone some time without simply handing them a plug-and-play list (absent any real explanation as to how it came about), then I hope it can benefit them (even if that benefit derives from an informed decision not to use this particular strategy). More to come, but I think that's the last I have to say on the merits of this approach. I'd prefer to let the end result speak for itself.

p.s. Others in this thread have already mentioned a number of valuable techniques and resources which are highly relevant to today's webmaster. That, IMHO, is a very positive aspect of this thread.

wilderness - msg:4664349 - 6:37 am on Apr 19, 2014 (gmt 0)

How the request is handled wasn't really the target of this line of reasoning


The line of reasoning has so far been entirely negative.


Samizdata,
Would it please you more if the subject line read "Block non-UK Traffic"?


For more than a decade, there has been one consistent premise in this forum!
"Each webmaster must determine what is beneficial or detrimental to their own website (s)"

(Bill, my apologies in advance).

Even my difference of opinion with Bill regarding the priority of white-listing vs. black-listing falls within the same premise.
"Each webmaster must determine what is beneficial or detrimental to their own website (s)"

I've been using these same methods (non-US) for more than a decade, and it is very well known here (in this forum and other forums at WebmasterWorld), and yet you're not attempting to antagonize me for the same lack of agreement.

Don

lucy24 - msg:4664359 - 7:50 am on Apr 19, 2014 (gmt 0)

"Who" and "how" are probably different questions. One reason for setting up my old-browser handling as a redirect instead of something in 4 or 5 is that it lets me see which users actually follow up on the redirect. Robots generally don't-- although browsers on infected human machines probably do.

Samizdata - msg:4664386 - 1:15 pm on Apr 19, 2014 (gmt 0)

"Each webmaster must determine what is beneficial or detrimental to their own website (s)"

I am well aware of that Don, and acknowledged the point in my first post.

I am also aware that there can sometimes be good reasons to restrict access.

What I would like to see in this thread is some explanation of how all this time and effort put into blocking real people (rather than bots) actually benefits a website.

I am grateful to Webcentric for trying above, but the point remains that people - particularly high-spending people - travel around the world a lot, and use the world wide web when they do so.

Simply serving them a 403 is entirely negative and does nothing useful.

Hopefully this thread can provide folks with the ammunition necessary to make an intelligent decision in these regards.

Thank you Webcentric, that is also my concern.

I routinely block bots myself, but I prefer to treat people with more respect.

...

wilderness - msg:4664402 - 3:10 pm on Apr 19, 2014 (gmt 0)

What I would like to see in this thread is some explanation of how all this time and effort put into blocking real people (rather than bots) actually benefits a website.


Samizdata,
My own widget data is very unique and focused on a small market.
The majority of my pages are NOT available anywhere else on the www, nor are the same articles available in libraries or museums.

I've a detailed TOS that has been in place since my sites were first created (1999). TOS violators are not treated kindly, and I don't make exceptions for my closest friends or associates. Rather, all visitors are treated equally (something I pride myself on).

In most instances, when a visitor violates my TOS, it's merely a matter of time before I'm presented with an identity (generally via a referral from another friend/associate).
Once the TOS has been violated, it doesn't matter to me whether they are in China, Norway, or the same US city that I reside in; the visitor is denied.
In addition, and more importantly, my websites are less than 1% of my archived widget data, and the identified person is also denied future inquiries into the other 99+% of the data.

wilderness - msg:4664404 - 3:19 pm on Apr 19, 2014 (gmt 0)

What I would like to see in this thread is some explanation of how all this time and effort put into blocking real people (rather than bots) actually benefits a website.


Samizdata,
It's simple, plagiarism.

In this day and age of multiple forums (WordPress and all the others), as well as social media (FB and all the others), most of the general public doesn't even bother reading webmasters' TOS. Furthermore, this breed of visitor generally believes everything is "fair game". Most don't have a clue about the perils of inline linking, or even outright, complete copying and pasting.

I have many IPs and UAs (combined) in rules that are specifically targeted at people (rather than bots). It's simply a requirement of the times and the transition of the www.

FWIW, the general use of the term "www", also includes variations of the same (see intranet and extranet).

webcentric - msg:4664405 - 3:19 pm on Apr 19, 2014 (gmt 0)

I accidentally blocked myself one time with a firewall on the server. Had to get my host to undo it because I couldn't connect to the server to undo the change. Embarrassing to say the least but informative as well. ;) More on the current topic shortly.

tangor - msg:4664412 - 3:44 pm on Apr 19, 2014 (gmt 0)

but the point remains that people - particularly high-spending people - travel around the world a lot, and use the world wide web when they do so.

That argument sidesteps the basic premise under discussion: blocking traffic (any) from specific regions/countries. It makes no difference if that traveler might usually live next door; if they are in another country, they won't get in.

One way to deal with that is for the webmaster to be up front in the TOS, or with a notice -- "This site is available only in..." -- and say where. A custom 404 can do the same... but that means some traffic/bandwidth is still being expended (cost).

Some of my sites have this restrictive blacklisting by country; others don't. It depends on the scope of the site.

So far I've been alerted to several ranges I'd overlooked, so I find this thread very useful.

webcentric - msg:4664425 - 5:55 pm on Apr 19, 2014 (gmt 0)

OK, so let's see if I can get this on the page without making a complete mess of it. The following now contains both ALLOCATED and LEGACY blocks outside of ARIN (with the exception of Legacy blocks assigned to specific companies such as Level 3 Communications). I did add 25/8 and 51/8, as they are UK blocks and outside the scope of my target area. The list is exploded to the /8 level so we can see the aforementioned exceptions. Each webmaster can consider the merits of poking a hole in a /8 block from here to make room for any given exception. Oh, and hopefully the sort is better this time around ;)

1.0.0.0/8
2.0.0.0/8
5.0.0.0/8
-- 5.152.184.0/21 -- AppRiver, LLC -- Server
14.0.0.0/8
25.0.0.0/8
27.0.0.0/8
31.0.0.0/8
-- 31.170.160.0/21 -- Hostinger International Limited -- Server
36.0.0.0/8
37.0.0.0/8
-- 37.18.176.0/21 - Hostgator.com LLC -- Hostings
39.0.0.0/8
41.0.0.0/8
42.0.0.0/8
43.0.0.0/8
46.0.0.0/8
-- 46.231.240.0/21 - Cisco Media Solutions, Inc. -- Networking
49.0.0.0/8
51.0.0.0/8
58.0.0.0/8
59.0.0.0/8
60.0.0.0/8
-- 60.254.128.0/18 -- Akamai Technologies, Inc. -- cloud
61.0.0.0/8
62.0.0.0/8
77.0.0.0/8
78.0.0.0/8
79.0.0.0/8
80.0.0.0/8
81.0.0.0/8
82.0.0.0/8
83.0.0.0/8
84.0.0.0/8
85.0.0.0/8
86.0.0.0/8
87.0.0.0/8
-- 87.239.136.0/21 -- ELXSI-Security Networking Services -- Hosting
-- 87.76.16.0/20 -- Future Hosting LLC -- maybe hosting ;)
88.0.0.0/8
-- 88.151.224.0/21 -- ITC Global -- Satellite Communications
89.0.0.0/8
90.0.0.0/8
91.0.0.0/8
-- 91.205.100.0/22 - FTEN Inc -- Financial
-- 91.209.57.0/24 -- Cloud Communications and Computing Corp. -- Server
-- 91.209.6.0/24 -- Zix Corporation -- Email encryption and related services
-- 91.225.248.0/22 -- LINKEDIN --
92.0.0.0/8
93.0.0.0/8
-- 93.183.0.0/18 -- Nokia Solutions and Networks Oy -- Seems related to Finland
-- 93.188.128.0/24 -- CDNetworks Inc -- CDN and Cloud services
94.0.0.0/8
95.0.0.0/8
101.0.0.0/8
102.0.0.0/8
103.0.0.0/8
-- 103.246.248.0/24 -- Neodelphi Limited also trading as 'QuickWeb Hosting Solutions' -- Hosting
105.0.0.0/8
106.0.0.0/8
109.0.0.0/8
-- 109.70.88.0/21 -- NexGen Networks Corp -- Networking
110.0.0.0/8
111.0.0.0/8
112.0.0.0/8
113.0.0.0/8
-- 113.29.0.0/17 -- Level 3 Communications, Inc -- multinational telecommunications and ISP provider
114.0.0.0/8
115.0.0.0/8
116.0.0.0/8
117.0.0.0/8
118.0.0.0/8
119.0.0.0/8
120.0.0.0/8
121.0.0.0/8
122.0.0.0/8
123.0.0.0/8
124.0.0.0/8
125.0.0.0/8
126.0.0.0/8
133.0.0.0/8
141.0.0.0/8
-- 141.0.8.0/21 -- Opera Software ASA
145.0.0.0/8
150.0.0.0/8
151.0.0.0/8
153.0.0.0/8
154.0.0.0/8
163.0.0.0/8
-- 163.60.0.0/16 -- AMADA AMERICA, INC. -- Manufacturing, International
171.0.0.0/8
175.0.0.0/8
176.0.0.0/8
-- 176.120.16.0 -- Bill Me Later, Inc -- No thanks
-- 176.67.80.0 -- OverPlay.NET LP -- Hosting etc
177.0.0.0/8
178.0.0.0/8
179.0.0.0/8
-- 179.60.192.0/22 -- Edge Network Services Ltd -- Cloud, Managed IT Services
180.0.0.0/8
181.0.0.0/8
182.0.0.0/8
183.0.0.0/8
185.0.0.0/8
-- 185.40.156.0/22 -- Eris Network Service LLC -- Servers
-- 185.46.120.0/22 -- IHNetworks, LLC -- IP transit in the Los Angeles area ?
-- 185.47.84.0/22 -- KVH Industries, Inc -- mobile satellite communications
-- 185.51.4.0/22 -- Latisys-Denver, LLC -- Hosting, colo, etc.
-- 185.52.0.0/22 -- RamNode LLC -- Hosting
-- 185.52.136.0/22 -- Peak Web LLC -- Hosting
186.0.0.0/8
187.0.0.0/8
188.0.0.0/8
189.0.0.0/8
190.0.0.0/8
-- 190.103.184.0/22 -- LAUREN -- Ralph Lauren ?
191.0.0.0/8
193.0.0.0/8
-- 193.138.72.0/24 -- ITXC Global Deutschland GmbH. -- I don't read German
-- 193.164.220.0/23 -- Fotolia LLC -- Image provider
-- 193.34.36.0/22 -- Alfred Karcher GmbH & Co. KG -- International -- Pressure washers
-- 193.58.216.0/21 -- Allianz Managed Operations & Services SE -- in-house services for Allianz companies
-- 193.28.87.0/24 -- Porta One-Chernihiv Ltd -- hosting?
194.0.0.0/8
-- 194.153.155.0/25 -- Mail*Select USA, Inc --
-- 194.42.216.0/24 -- Fortis Bank N.V. --
195.0.0.0/8
-- 195.190.24.0/24 -- AmRest Sp. z o.o. -- fast-food and casual dining
-- 195.200.84.0/23 -- Base IP B.V. -- Network services
-- 195.216.225.0/24 -- Shire Pharmaceuticals LLC --
-- 195.230.108.0/24 -- Aquatix IT-Services e.K.? -- IT Services ?
-- 195.42.132.0/23 -- Belgacom International Carrier -- telecommunications company in Belgium
-- 195.66.102.0/24 -- Keynote Systems Inc -- Web and mobile testing/monitoring
-- 195.66.132.0/23 -- Webair Internet Development Inc -- hosting
196.0.0.0/8
197.0.0.0/8
200.0.0.0/8
-- 200.49.248.0/21 -- Telmex USA -- telecom
201.0.0.0/8
202.0.0.0/8
-- 202.72.96.0/20 -- Intelsat Global Services Corporation -- Satellite communications and related networking
203.0.0.0/8
-- 203.144.48.0/20 -- VeriSign, Inc -- SSL provider
-- 203.187.128.0/19 -- BT-Infonet, Internet Service Provider -- ISP
210.0.0.0/8
211.0.0.0/8
212.0.0.0/8
-- 212.1.208.0/21 -- Hostinger International Limited -- hosting
213.0.0.0/8
-- 213.137.64.0/19 -- Delta3 Communications -- VOIP Services
217.0.0.0/8
218.0.0.0/8
219.0.0.0/8
220.0.0.0/8
221.0.0.0/8
222.0.0.0/8
223.0.0.0/8

Notes: There are five additional exceptions from the RIR queries that don't apply to the above ranges; I've set them aside for now. Plus I've added one related to Canada, if you're counting. Anyway, the above now reflects some exceptions for consideration. Many are server related or about corporate infrastructure, and I would tend to dismiss almost all of them, but that's me, and that would be for just one very specific website to which I'm applying this strategy.
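
For anyone who does decide to poke a hole in one of these /8s for a particular exception, the arithmetic can be scripted. A small sketch using Python's ipaddress module and the LinkedIn exception from the list above (purely as an illustration, not a recommendation):

import ipaddress

block = ipaddress.ip_network("91.0.0.0/8")
keep = ipaddress.ip_network("91.225.248.0/22")  # the LINKEDIN exception above

# Everything in 91/8 except the exception, as a set of still-deniable CIDRs.
remainder = sorted(block.address_exclude(keep))
print(len(remainder))  # 14 blocks: one sibling at each prefix length from /9 to /22
for net in remainder:
    print(net)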

wilderness - msg:4664428 - 6:08 pm on Apr 19, 2014 (gmt 0)

webcentric,
Many thanks for this very time-consuming effort and thread.

webcentric - msg:4664457 - 9:48 pm on Apr 19, 2014 (gmt 0)

My pleasure (even if this most recent step did leave me with a pounding headache ;)

From here it's about finding routes of ingress available in the ARIN blocks. Not sure how far to take that, because at some point the research is sure to cross over into server farm and proxy territory. I will be taking a look, though, and seeing what can be gleaned from the ARIN database. At this stage, one could also take the approach of sitting back and letting the holes in the dyke expose themselves as water starts leaking through. I could probably put forth a condensed version of the above list (like in the opening post, but with all the new CIDRs added in). We'll see. Any final form of the list really depends on which exceptions you want to allow for.

Samizdata - msg:4664475 - 12:46 am on Apr 20, 2014 (gmt 0)

That argument sidesteps the basic premise under discussion: blocking traffic (any) from specific regions/countries. It makes no difference if that traveler might usually live next door; if they are in another country, they won't get in.

It does not sidestep the premise, it counters it.

The premise excludes everyone who is not in a particular country at a given moment, even if they are citizens.

It also includes everyone in the country at that same moment even if they are foreign nationals.

In the case of USA, both categories number several million people.

The goal here is to arrive at a blocking strategy that keeps people and bots from outside North America off your site.

So several million people you might want on the site are served a 403.

And several million people you definitely don't want are freely allowed in.

The benefits of such a strategy are presumably self-evident.

You are all entitled to do it if you want.

....

webcentric - msg:4664490 - 2:37 am on Apr 20, 2014 (gmt 0)

Try suing someone in a foreign country for scraping your entire website and republishing it. Operating in a restricted jurisdictional sphere offers a great many potential benefits and protections.

diberry - msg:4664549 - 2:41 pm on Apr 20, 2014 (gmt 0)

Another reason to block a country is if you expend a fair amount of server resources fending off hackers and spammers from a particular country, and you've only received maybe 4 legitimate visitors from that country, ever.

Look, I have seen people get really upset to learn some big chain simply does not ship products to their country. I'm sure the big chain would love to have those individuals' business. But everyone understands that it's not personal, it's not "I hate France, so we won't ship there", it's a matter of law, health and safety, or the business' bottom line.

The fact is, unfortunately, we don't have a more granular solution in every case. Until we do, this is the choice we have to make.

If hacker/spammer visits from SomeCountry are costing me $20/month to fight, and my few legitimate SomeCountry visitors have only ever made me $5 the whole time my site's been online, it's a no-brainer from a profit-and-loss perspective. It's exactly the same kind of consideration as a big chain wondering, "Should we even ship to SomeCountry, with their extra VAT fees and the fact that 80% of our dubious returns come from there?"

Furthermore, with countries continually updating their privacy laws, what if your site just can't comply with the newest law in, say, Germany? By continuing to let Germans visit your non-compliant site, you could be creating a serious legal liability for your business. So even if you get lots of German visitors you have no desire to lose, you might feel compelled by issues of law to block them. While I'm not sure anything this severe has actually happened yet, there was a lot of chatter about some new UK laws a couple of years ago, and many of us weren't sure how to make our sites comply. I would hate to block the UK - I get a decent little chunk of totally legitimate and profitable traffic from there - but if I couldn't comply with a law they passed, then I think I would have to.

dstiles - msg:4664576 - 7:26 pm on Apr 20, 2014 (gmt 0)

Scrapers: IP ranges are not the entire picture here. It is fairly easy to scrape the contents of web sites by using proxies or botnets. Remember to also block those, many of which are in America and include G and Y proxies and any number of clouds. Header fields are useful in this context - which I know at least most of this thread's participants are aware of. :)

From my own perspective, being in the UK, I find a major source of bad traffic comes from the USA. Admittedly it is a large country, but it's still a hugely annoying fact. So allowing only ARIN countries to access a web site is NOT fixing the real problem, only reducing the attack surface.

I block a few countries only on a few sites. I block all server farms I can find and most proxies, and do the best I can through client header fields for the rest.

Chacun a son gout, as the saying goes - roughly translated: each to his own goat. :)

webcentric - msg:4664592 - 9:43 pm on Apr 20, 2014 (gmt 0)

Part of the reason for this thread is that, after spending loads of time in the server farm thread on this board, I realized I was blocking only bits and pieces of the parts of the world that have caused me nothing but grief and are costing me money. I could perhaps try to find a way to make money off (pick a country) traffic, and I'm sure many do, but when 99% of the traffic from a country is bot, hacker and spammer related, and the other one percent is just consuming resources I pay for, the business conclusion is a simple one.

Yes, a great deal of nefarious traffic comes from the US (or through US routes of ingress). You'll get no argument from me on that subject. Proxies and server farms in the US are definitely a big part of the problem. Having said that, with a blocking scheme such as the one above, the range of server farms and proxies blocked world-wide is substantial enough to have cut my server resource needs in half and my proxy and server farm problem is substantially more manageable than it was prior to the measure. All with scant impact on any real people with a legitimate need to be on my website in the first place.

I stated at the beginning that this was a sledgehammer approach. Try to compile all the information in the server farm thread and put it to use if you're just beginning to tackle these problems, and you'll quickly see that the surgical approach involves digging ever deeper into a never-ending can of worms. This is much quicker, even when doing it from scratch, if you need a quick fix to an overwhelming problem. There's no reason it has to be permanent or draconian in its implementation. Give people a nice 404 page if you feel obliged. Personally, I don't feel obliged to answer my door just because someone has the gumption to knock on it uninvited, or answer the phone just because it's ringing, or read every piece of junk mail that lands in my mailbox. Life is too short to spend constantly dealing with unsolicited distractions that can only lead to a reduction in my net worth.

This whole topic may be skewed because we hail from different parts of the world. I can see how operating a UK website would require a different approach for example. I can also see the value that webmasters around the globe place on US traffic. It's valuable. If you have lots of it you can probably do well but, as mentioned, you'll have to deal with the trouble-makers we have here too. Still, it's probably worth dealing with the trouble for a shot at the revenue opportunities. The same cannot be said for every country on earth.

I didn't really see this as such a touchy subject when I started out, but the sensitivity to the matter is enlightening. I think deep down everyone agrees that there are bad travelers on the information superhighway and that avoiding them is a good idea. Some simply prefer to stay off the road as much as possible, others like to drive armor-plated vehicles, and others prefer the back roads to the Autobahn. I think we all just want to get to the destination safely and can agree to disagree on the best methods. If there were a bullet-proof solution to this problem, I don't think we'd be having this discussion.

Added: if there's a better way to scale my business to my target market, I'm all ears. Please enlighten me.

"Chacun a son gout"

lucy24 - msg:4664598 - 11:49 pm on Apr 20, 2014 (gmt 0)

Give people a nice 404 page if you feel obliged.

Typo? Nice 403 page, right?

This is important. Nobody but humans ever looks at a 403 page. The server always sends one out, but a robot doesn't care what it says. Only humans will read it. So you may as well make a page that falls all over itself apologizing for the inconvenience.

Like a brick-and-mortar store. You can explain an annoying policy on the grounds that some people are crooks and we need to make sure you're not one of them. Or you can make up an explanation that allows everyone to save face and nobody looks evil. Which one is more likely to make people want to come back?

The tricky part is that most humans never meet a bona fide lockout-- based on IP, UA, referer or whatnot. All they will ever see is the "this directory doesn't have an index page" block that you meet if you're experimentally traveling backward through a long URL. If you don't happen to have a website of your own, you may not even realize that you're meeting the identical 403 that would be served to a malign Ukrainian.* The server just sees "access denied" and acts accordingly.


* There exist humans in the Ukraine. Some of them may read this forum. One time I even met a well-behaved Ukrainian robot. This is a problem.

webcentric - msg:4664672 - 1:10 pm on Apr 21, 2014 (gmt 0)

Typo? Nice 403 page, right?


Well, in my experience, the tool you're using to block with has something to say about the type of response returned. And again, I'm using the term "block" in a rather loose fashion here. A firewall gives you certain options whereas various modules may restrict you to others. Heck, you can write code to 301 or 302 someone to a 200, right?

I'm a fan of security through obfuscation. Your average surfer doesn't know the difference between 404.x, 403.x, 200, 301, 302, 500 or whatever, but a hacker most likely does and can learn things from specific status codes. I think handling traffic after you've determined that you don't want it is a worthy discussion. I can't say I've always handled it correctly (either from a best-practices perspective or a moral perspective), so I'm very open to learning to do it better if I can. After all, I am the dummy who started this thread, and I did so to learn something (if not to attempt to compile some information into a summarized form once and for all).

lucy24 - msg:4664708 - 3:53 pm on Apr 21, 2014 (gmt 0)

Your average surfer doesn't know the difference between 404.x, 403.x, 200, 301, 302, 500 or whatever

Exactly. Until I had my own site I didn't know what a 403 was. I just thought of 404 vs 403 as "no page" vs "no directory". And I've got a long-standing suspicion that many robots will go away faster if they meet a 404, because then they're not prompted to wonder "what doesn't he want me to see"?

If you were looking purely at server load, probably the lightest possible response is rewriting to a one-pixel gif. The second-lightest is a redirect, because you're just sending out the response header without any content. But that's only efficient if you're redirecting to somewhere other than your own site, so there's never a followup request. And you can't redirect globally to something like google or Interpol, because that's Not Nice.

Any explicitly coded response-- regardless of status-- is more efficient than a natural 404, because the server doesn't have to go physically look for the file. But, of course, you can return a 404 response manually as well. (I'm currently doing this for one error document that a major search engine somehow learned about and asks for by name. Normally I'd use 410 ... only this one is, itself, a 410 document!)

But some responses only work if you're absolutely certain there will be no collateral damage. Or if you genuinely don't care. I can't think of anything other than
175.45.176.0/22
where that would apply universally. That is, every site in the world can say in unison: I don't know and I don't care what's at the other end of that request, because there is zero possibility that it's someone I want to see.

webcentric - msg:4664737 - 6:40 pm on Apr 21, 2014 (gmt 0)

So, handling the response in a more granular fashion implies the need to interpret the nature of the request in a better way than can be managed with a country or regional blocking list and I will certainly concede that point.

So, having thrown out some details related to the IPv4 address space and a simple strategy for addressing a large chunk of it, the question then arises: is there a better way? Perhaps people who ask themselves the question posed in the title of this thread actually want a better solution than the sledgehammer offered thus far. Keeping in mind that not everyone is a programmer, not everyone can read and write a Regular Expression, and not everyone has access to their server configuration (or even knows what that is), how does one go from a state of constant (and costly) assault to a state of relative security in the most efficient manner possible? Start picking off server farms? Deal with obvious request header anomalies? Where would you start if you were just starting to tackle this problem and wanted to get the most payoff for the time you had to spend dealing with it?

lucy24 - msg:4664761 - 9:35 pm on Apr 21, 2014 (gmt 0)

... and anything other than a one-size-fits-all response will require some further work on the server level-- which is exactly what you're trying to avoid, right?

You may have to weigh bandwidth against server memory/cpu usage. If you're on the kind of shared hosting that charges for memory instead of bandwidth,* it may well be most cost-effective to continue serving up the 403s.

A "Deny from..." list that's all expressed in units of /8 and up is undeniably efficient.

Deny from 79.0.0.0/4
Deny from 112.0.0.0/4


224-255 (that's /3) is just sitting there, isn't it? Will it ever be used for anything?


* This seems to be the trend. It's a sweet deal for those of us whose pages are 95% static html ;) Not so nice if you're on a cms that's all done in the server. I've never heard of anyone whose price got jacked up purely because their htaccess was too complicated.
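
For anyone who wants to double-check what a short deny list of large CIDRs actually catches before deploying it, a rough sketch (mine, not part of any post above, using three CIDRs taken from the opening list):

import ipaddress

deny = [ipaddress.ip_network(c) for c in ("80.0.0.0/4", "112.0.0.0/5", "176.0.0.0/5")]

def is_denied(ip):
    # True if the visitor address falls inside any of the denied CIDRs.
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in deny)

print(is_denied("93.188.128.10"))  # True -- inside 80.0.0.0/4
print(is_denied("8.8.8.8"))        # False

covered = sum(net.num_addresses for net in deny)
print(f"{covered / 2**32:.1%} of the IPv4 address space")  # 12.5%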

webcentric - msg:4664777 - 1:06 am on Apr 22, 2014 (gmt 0)

Actually, this is related to a site on a dedicated server and involves querying a fairly large relational database for the majority of page loads. The cost of robotic visits alone can be staggering in terms of server resources required to query the size of database involved (even with a variety of caching mechanisms in place).

I regularly see 50,000 to 75,000 crawler hits a day (or more) from Google, Yahoo and Bing and that number is (or was) dwarfed by the rest of the robotic traffic on the site. It's a dynamic site so, as Lucy intimated, there is a real downside to triggering a query against a table with 7,000,000+ records in it (which becomes trivial in itself when compared to a query on the same table and several other joins) unless you absolutely have to. Caching helps a bit but when you have millions of possible query combinations, the memory required for caching can represent substantial overhead in and of itself.

And yes, that /3 is a wonderment isn't it? IP pool exhaustion, really?

lucy24 - msg:4664792 - 2:01 am on Apr 22, 2014 (gmt 0)

Edit: Oops, didn't see we were onto a new page. I was responding to:
Actually, this is related to a site on a dedicated server and involves querying a fairly large relational database for the majority of page loads. The cost of robotic visits alone can be staggering in terms of server resources required to query the size of database involved (even with a variety of caching mechanisms in place).

I regularly see 50,000 to 75,000 crawler hits a day (or more) from Google, Yahoo and Bing and that number is (or was) dwarfed by the rest of the robotic traffic on the site. It's a dynamic site so, as Lucy intimated, there is a real downside to triggering a query against a table with 7,000,000+ records in it

Whole new issue then. If the act of building a page consumes far more server resources than evaluating undesired visitors, then you can definitely afford to be granular. And then it becomes more of a "how to..." question. You want to divide your unwanted visitors into various categories. At a minimum:

-- the ones who will never be allowed to set foot in your site, ever,
vs.
-- the ones who look suspicious but might conceivably be legitimate or even desirable.

Dedicated server. Does that mean you have-- or could set up-- a firewall? The unconditional lockouts can stay outside the firewall where they never have to bother the server. The ones who are allowed inside the door are subject to further filtering. Here there's really no limit to what you can do. Even if you've got the world's kindest and most helpful 403 page, even if you redirect all the maybes to a "I'm really sorry, but..." page, even if dubious requests are subjected to a whole script of their own, it's still less work than building a complete page from a vast database.

You can track requests, for example. Did the same IP ask for two consecutive different pages with no intervening request for supporting files? Have they asked for five pages in three seconds? Do all requests come in with an auto-referer? (These are impossible to deal with in mod_rewrite-- except in a very narrow, targeted way-- but become trivial in php. Or language of your choice.)
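
A minimal sketch of that kind of per-IP tracking -- in Python rather than PHP, with made-up names and thresholds -- might look like this:

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 3        # "five pages in three seconds"
MAX_PAGES_IN_WINDOW = 5

recent_hits = defaultdict(deque)  # ip -> timestamps of recent page requests

def looks_robotic(ip, now=None):
    # Record this page request and report whether the IP exceeded the threshold.
    now = time.time() if now is None else now
    hits = recent_hits[ip]
    hits.append(now)
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    return len(hits) > MAX_PAGES_IN_WINDOW

Call looks_robotic(request_ip) before doing the expensive database work, and serve the cheap apology page (or a challenge) when it returns True.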

I realize the original question was simply about identifying North America vs. the rest of the world. But with a massive database at stake, it's worth sifting more carefully.

It works at any level. Sometimes I think about simply de-indexing all my images because, heck, what's the point? Nobody ever ends up on the page, and I don't gain anything from them just looking at the picture. And then I remember the email I got from an Australian who was searching for some wildly improbable combination of furry green shape-shifting widgets, and landed on a page of mine that nobody ever visits. It was exactly what she was looking for.

Everyone gets these, in some form or other. Some of them may even send you money.
