Home / Forums Index / Search Engines / Search Engine Spider and User Agent Identification
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

This 36 message thread spans 2 pages.
More Chinanet
119.128.0.0
not2easy




msg:4521841
 10:24 am on Nov 23, 2012 (gmt 0)

119.130.19.83 - - [20/Nov/2012:20:16:03 -0600] "GET / HTTP/1.1" 200 9816 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us; rv:1.9.2.3) Gecko/20100401 YFF35 Firefox/3.6.3"
119.130.19.83

119.128.0.0 - 119.143.255.255
More Chinanet Telecom services (Guangdong)

Robots - NO
It tried another 50+ requests, all 403'd, after it was blocked. Requests included the nonexistent: 119.130.19.83 - - [20/Nov/2012:20:27:33 -0600] "GET /self.location HTTP/1.1"
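A quick way to confirm that kind of post-block hammering from a raw log is a small tally script. This is a generic sketch, not anything posted in this thread; the regex and the function name are mine, and the pattern assumes the combined log format shown above:

```python
import re
from collections import Counter

# Tally 403 responses per client IP in a combined-format access log,
# to spot crawlers that keep hammering after they've been blocked.
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[.*?\] "[^"]*" (\d{3})')

def count_403s(lines):
    hits = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m and m.group(2) == "403":
            hits[m.group(1)] += 1
    return hits

sample = ['119.130.19.83 - - [20/Nov/2012:20:27:33 -0600] '
          '"GET /self.location HTTP/1.1" 403 213 "-" "Mozilla/5.0"']
print(count_403s(sample))  # Counter({'119.130.19.83': 1})
```

Anything that racks up dozens of 403s in a few minutes is a candidate for a range-level block rather than a single-IP deny.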

 

incrediBILL




msg:4521968
 7:25 pm on Nov 23, 2012 (gmt 0)

Unless you have legit customers in China, the only solution I've ever found to the China IP problem is to simply block the whole country.

keyplyr




msg:4521970
 7:58 pm on Nov 23, 2012 (gmt 0)


Here's a good list:

[parkansky.com...]

not2easy




msg:4521972
 8:11 pm on Nov 23, 2012 (gmt 0)

Thank you for that suggestion and the blocklist source. Doing these one at a time as they show up in the traps is a pain.

incrediBILL




msg:4521975
 8:20 pm on Nov 23, 2012 (gmt 0)

You can easily build your own lists with fresh data direct from the source:

ftp://ftp.afrinic.net/pub/stats/afrinic/delegated-afrinic-latest
ftp://ftp.apnic.net/pub/apnic/stats/apnic/delegated-apnic-latest
ftp://ftp.arin.net/pub/stats/arin/delegated-arin-latest
ftp://ftp.lacnic.net/pub/stats/lacnic/delegated-lacnic-latest
ftp://ftp.ripe.net/ripe/stats/delegated-ripencc-latest

Just like everyone else, I use these to create the country-block IP lists on my site.

Just remember, they're rough approximations: IP allocations can straddle country borders, so there will be overlaps, such as along the Canada/US or Mexico/US borders. The same happens in Europe, Asia, etc., but it's close enough, with minimal collateral damage, unless you want to pay for a more refined source from a GeoIP provider.

Maybe if I lose my mind later I might post the PHP code used to generate a list for any country.

[edited by: incrediBILL at 8:21 pm (utc) on Nov 23, 2012]
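For anyone who doesn't want to wait for that PHP, here's a minimal Python sketch of the same idea (my own illustration, not incrediBILL's code). It assumes the pipe-separated record format the RIR delegated files use — registry|cc|type|start|count|date|status — and the function name is hypothetical:

```python
import ipaddress

def delegated_to_cidrs(lines, country):
    """Return CIDR strings for one country's IPv4 space from RIR
    'delegated' records (registry|cc|type|start|count|date|status)."""
    cidrs = []
    for line in lines:
        fields = line.strip().split("|")
        if len(fields) < 7 or fields[1] != country or fields[2] != "ipv4":
            continue
        start = ipaddress.IPv4Address(fields[3])
        end = start + int(fields[4]) - 1
        # Counts are not always powers of two, so a single allocation
        # may need more than one CIDR to cover it exactly.
        cidrs += [str(n) for n in ipaddress.summarize_address_range(start, end)]
    return cidrs

sample = ["apnic|CN|ipv4|1.0.1.0|256|20110414|allocated"]
print(delegated_to_cidrs(sample, "CN"))  # ['1.0.1.0/24']
```

Feed it a downloaded delegated-apnic-latest file line by line and write each resulting CIDR out as a deny line.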

keyplyr




msg:4521977
 8:21 pm on Nov 23, 2012 (gmt 0)

Be careful; some of the larger ranges *may* include other Pacific actors in Japan, Australia, etc.

incrediBILL




msg:4521979
 8:26 pm on Nov 23, 2012 (gmt 0)

Obviously, but the lists you link to use the exact same source.

If APNIC says it's China it's good enough for me.

Mainly because I get scraped and spammed from their neighbors as well :)

The only way to really refine it would be to make direct APNIC inquiries for all the ranges listed per country and examine them at a macro level, and they'd most likely block me like LACNIC did for making too many requests.

Lucky for me I was using a proxy IP at the time...

not2easy




msg:4521984
 8:35 pm on Nov 23, 2012 (gmt 0)

A few of the sites I run do not do any business outside the US, but these look like huge chunks to add in htaccess files, so I am trying to stick to IPs that are actually being abusive. I don't get access to httpd.conf. I am blocking by CIDR, but cringe at the number of lines it takes.

Long ago I blocked myself with overly generous blocking, so I do check "iffy" ranges; that is good advice.

incrediBILL




msg:4521986
 8:42 pm on Nov 23, 2012 (gmt 0)

I am trying to stick to IPs that are actually being abusive


Useless. Most IPs you see coming out of China are using IP pools, fast-flux nonsense, or something equally hard to stop, so instead of playing whack-a-mole one IP at a time, which would only produce a much bigger list in short order, I just blocked the country and never looked back.

They're the source of the bulk of my scrapers, spam and server attacks so I should've dropped them in the firewall itself but I only blocked them in HTACCESS and in my mail gateway so I could block them yet still monitor their activity.

Kind of like putting up a fence to keep the neighbor kids out of the yard but sitting on the lawn to watch them play.

wilderness




msg:4521993
 9:18 pm on Nov 23, 2012 (gmt 0)

A few of the sites I run do not do any business outside the US, but these look like huge chunks to add in htaccess files,


"Huge chunks" in a properly formatted htaccess do not present any real issue, provided there is no excessive server load.


Useless as most IPs you see coming out of China are using IP pools or some fast flux nonsense or something equally as hard to stop so instead of playing whack-a-mole one IP at a time, which would result in a much bigger list in short time, I just blocked the country and never looked back.


In order to "fit in" all you need to do is expand the same registrar's numbers ;)

lucy24




msg:4521999
 9:27 pm on Nov 23, 2012 (gmt 0)

these look like huge chunks to add in htaccess

Huge chunks are good. One sweeping /10 and a whole section of the country is gone. The maddening ones are when a Ukrainian robot is perched on a /24 between two perfectly innocuous pieces of France.

:: quick detour to CIDR segment of htaccess ::

Yup. Out of an almost-900-line file, about 2/3 is China. Add in my site-specific htaccess and the total drops to about half. Oh well.

wilderness




msg:4522014
 10:12 pm on Nov 23, 2012 (gmt 0)

The maddening ones are when a Ukrainian robot is perched on a /24 between two perfectly innocuous pieces of France.


mod-rewrite ;)

Although the lines keyplyr provided are useful, IMO using deny from is an organizational nightmare. mod_rewrite is more compact and more readable.

incrediBILL




msg:4522023
 10:38 pm on Nov 23, 2012 (gmt 0)

Another method is to block the language such as:

RewriteCond %{HTTP:Accept-Language} ^zh [NC]
RewriteRule .* - [F,L]

keyplyr




msg:4522024
 10:42 pm on Nov 23, 2012 (gmt 0)

IMO using deny from is an organizational nightmare. mod_rewrite is more compact and more readable.

And I feel mod_rewrite puts more of a resource strain on the server, especially if you use a lot of conditions that have to be read redundantly. IMO mod_access is cleaner, faster and more manageable, although I guess it depends on what you're used to working with.

Done correctly, mod_access can be just as surgical as any other method. I switched to it from mod_rewrite a few years ago and like it much better. I still use mod_rewrite for UA and referrer management, and for my whitelist conditions.

UPDATE - After a few months of letting all Chinese ranges in, it has again proven to be a headache, so I have just added ALL the Chinese ranges from my above link to mod_access. I don't notice any slowness in server response time, though it's difficult to measure because I block almost all the online speed tools. When I use the Linux command-line test from my machine, it shows 0.0136, which is great, but I'm on the very same backbone as my web server; still, it shows a generally fast time for server processing.

Continued tests show server processing time: 0.01-0.02

wilderness




msg:4522054
 12:17 am on Nov 24, 2012 (gmt 0)

IMO using deny from is an organizational nightmare. mode_rewrite is more compact and more readable.


And I feel mod_rewrite puts more of a resource strain on the server, especially if you use a lot of conditions that have to be read redundantly.


The diversity of preferences by webmasters combined with the diversity of htaccess is what continues to make this forum a thriving community.

"There's more than one way to skin a cat".

keyplyr




msg:4522057
 12:39 am on Nov 24, 2012 (gmt 0)


"There's more than one way to skin a cat".

...or Chinaman :)

lucy24




msg:4522079
 2:40 am on Nov 24, 2012 (gmt 0)

Done correctly, mod_access can be just as surgical as any other method.

I hope that's Voice Of Habit and not Voice Of Current Apache Installation ;)

I looked it up just a few days ago after my latest mod check (shared hosting, so I can't just ask): mod_access is 1.3 and earlier. The current equivalent is mod_authz_host. I tend to think of it as core because it's just about the last thing to execute but technically it's just another mod.

And I feel mod_rewrite puts more of a resource strain on the server

I've always assumed that a formula like
... 12.34.128.0/20
is less resource-greedy than
... %{REMOTE_ADDR} ^12\.34\.1(2[89]|3\d|4[0-3])\.
But if you backed me into a corner and put bamboo shoots under my nails I would have to admit that I have absolutely no factual basis for this assumption. Just gut feeling.

keyplyr




msg:4522083
 2:58 am on Nov 24, 2012 (gmt 0)

Agreed. The way Jim Morgan explained it to me back in the day was...

The server processes ... %{REMOTE_ADDR} ^12\.34\.1(2[8...
then ... %{REMOTE_ADDR} ^12\.34\.1(2[9...
and so on...so it has to process that one line about 12 times, then goes on to the next line.

Whereas in mod_authz_host (yes, old habit calling it mod_access)
... 12.34.128.0/20 is processed once.
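The equivalence of the two notations being compared here is easy to check mechanically. This is just an illustrative Python snippet (nothing Apache-specific) confirming that the regex and the /20 describe the same third-octet range:

```python
import ipaddress
import re

# The two notations from the posts above, side by side
net = ipaddress.IPv4Network("12.34.128.0/20")
pattern = re.compile(r"^12\.34\.1(2[89]|3\d|4[0-3])\.")

# A /20 starting at .128 spans third octets 128-143; check that the
# regex agrees with the network object on both sides of the boundary.
for third in range(120, 150):
    ip = f"12.34.{third}.1"
    assert bool(pattern.match(ip)) == (ipaddress.IPv4Address(ip) in net)
print("regex and CIDR cover the same range")
```

Which form the server evaluates faster is a separate question, but at least the two are provably interchangeable in coverage.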

incrediBILL




msg:4522107
 6:06 am on Nov 24, 2012 (gmt 0)

I've always assumed that a formula like
... 12.34.128.0/20
is less resource-greedy than
... %{REMOTE_ADDR} ^12\.34\.1(2[89]|3\d|4[0-3])\.


You are correct.

The DENY syntax is way more efficient than RewriteRule processing, just from the perspective of parsing alone. I know it would be had I written it, because that's just the way it works. Whether that's the actual reality in execution speed would need to be tested, because not all programmers are created equal. :)

incrediBILL




msg:4522132
 7:14 am on Nov 24, 2012 (gmt 0)


[httpd.apache.org...]
Note that all Allow and Deny directives are processed, unlike a typical firewall, where only the first match is used. The last match is effective (also unlike a typical firewall). Additionally, the order in which lines appear in the configuration files is not significant -- all Allow lines are processed as one group, all Deny lines are considered as another, and the default state is considered by itself.


This little tidbit, if it works as documented, would imply that a very large list could be really slow UNLESS it's preprocessed, cached and indexed at which point it would be very fast and a moot point.

Again, performance testing appears to be appropriate with either method.
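As a rough starting point for such testing, here's a Python micro-benchmark sketch. It's illustrative only — it times Python's own regex and network objects, not Apache's parsers — so treat the numbers as directional at best:

```python
import ipaddress
import re
import timeit

net = ipaddress.IPv4Network("119.128.0.0/12")        # Chinanet range from this thread
rx = re.compile(r"^119\.(12[89]|13\d|14[0-3])\.")    # same range as a regex

ip = "119.130.19.83"
addr = ipaddress.IPv4Address(ip)
assert addr in net and rx.match(ip)   # sanity: both methods agree on a hit

# Time 100k membership tests each way; numbers are directional only.
t_cidr = timeit.timeit(lambda: addr in net, number=100_000)
t_re = timeit.timeit(lambda: rx.match(ip), number=100_000)
print(f"CIDR: {t_cidr:.3f}s  regex: {t_re:.3f}s per 100k checks")
```

Whichever wins in isolation, the caching and preprocessing point above is likely to dominate once the list grows to hundreds of lines.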

wilderness




msg:4522160
 11:16 am on Nov 24, 2012 (gmt 0)

This little tidbit, if it works as documented, would imply that a very large list could be really slow


Recent practices of some folks [webmasterworld.com] in this forum will lead to exactly that slow processing.

Eventually the quantity of denies becomes so excessive and so unorganized that a more efficient method becomes a necessity (as it did for me, in very short order, almost a decade ago).

wilderness




msg:4522162
 11:39 am on Nov 24, 2012 (gmt 0)

keyplyr,
It was Jim who encouraged me to make the transition to mod_rewrite, so there must have been some benefit that outweighed the processing deficiency?

How many times would this line be processed (note: single line broken to prevent width display):

deny from 203.100.32.0/20 203.100.80.0/20 203.100.96.0/19
203.100.192.0/20 203.110.160.0/19 203.118.192.0/19
203.119.24.0/21 203.119.32.0/22 203.128.32.0/19
203.128.96.0/19 203.128.128.0/19 203.130.32.0/19
203.132.32.0/19 203.134.240.0/21
203.135.96.0/19 203.135.160.0/20 203.148.0.0/18
203.152.64.0/19 203.156.192.0/18 203.158.16.0/21
203.161.192.0/19 203.166.160.0/19 203.171.224.0/20
203.174.7.0/24 203.174.96.0/19 203.175.128.0/19
203.175.192.0/18 203.176.168.0/21 203.184.80.0/20
203.187.160.0/19 203.190.96.0/20 203.191.16.0/20
203.191.64.0/18 203.191.144.0/20 203.192.0.0/19 203.196.0.0/22

BTW, we should stress that there is a difference in processing speed between a website with its own server (httpd.conf) and a website on shared hosting.

blend27




msg:4522212
 4:04 pm on Nov 24, 2012 (gmt 0)

@incrediBILL

Maybe if I lose my mind later I might post the PHP code used to generate a list for any country.

There is a GeoIP plugin for CI (CodeIgniter) on GitHub (https://github.com/EllisLab/CodeIgniter/wiki/GeoIP-plugin-for-CI) that could easily be molded to your needs even if you don't use the framework.

p.s. incrediBILL, I've seen the sticky, will reply a bit later.

keyplyr




msg:4522235
 8:11 pm on Nov 24, 2012 (gmt 0)

@ wilderness, the answer is once.

When you use a mod_rewrite to do the same task, each appearance of a nested condition starts the process back to the start of the line, so it may need to be read many times = greater load on server = greater processing time.

All this is theoretical, but if you have dozens of nested conditions (as I used to have before I switched) it could make a noticeable difference.

And yes, Jim was a big proponent of mod_rewrite, but later switched to mod_authz_host for blocking IP ranges. Somewhere there's a long discussion about this from a few years ago.

not2easy




msg:4522248
 9:50 pm on Nov 24, 2012 (gmt 0)


This little tidbit, if it works as documented, would imply that a very large list could be really slow

Exactly my concern, as these are shared-hosting sites. I do have all the denies in one place and in numerical order, but I'm wondering how many individual lines I can replace by using the many CIDRs in the country blocking lists, whether combined lines process any faster, and if not, whether there is a length limit. The host does pull IPs from the htaccess into a separate IP-block feature in CP, but it would take forever to add them there one at a time. The host told me that once the IPs are blocked in CP I don't need them in every domain in that hosting account, so what I am doing is amending the htaccess for the main domain and skipping those lists in the subdomains' htaccess.

So many scrapers, so little time.

keyplyr




msg:4522269
 1:30 am on Nov 25, 2012 (gmt 0)

@ not2easy see my test results above (msg:4522024). The site is on a shared server. YMMV

wilderness




msg:4522351
 1:30 pm on Nov 25, 2012 (gmt 0)

@ wilderness, the answer is once.

When you use a mod_rewrite to do the same task, each appearance of a nested condition starts the process back to the start of the line, so it may need to be read many times = greater load on server = greater processing time.

All this is theoretical, but if you have dozens of nested conditions (as I used to have before I switched) it could make a noticeable difference.

And yes, Jim was a big proponent of mod_rewrite, but later switched to mod_authz_host for blocking IP ranges. Somewhere there's a long discussion about this from a few years ago.


keyplyr,
I've explained this previously, but here goes.
My site(s) are simple HTML.
No JavaScript, no MySQL, no PHP.

The solitary script I have in place is only run as part of a contact form.

The solitary RewriteCond %{REMOTE_HOST} that was in place was recently removed.

Thus, where's my excess server load?
My htaccess utilizing mod_rewrite does not present any excessive or slow load, due to the simplicity of the pages.

As a result, my antiquated mod_rewrite is only detrimental to others who are using options (JavaScript, PHP, MySQL, host lookups) that I'm not using.

In addition, I recently explained that my own longevity as a webmaster means maintaining the status quo for approximately four more years. After that, I'd care less if the internet blew up.

In summary and as a benefit to other participants of this forum, I'll confine my participation to heads up notifications of IP's and/or UA's.

Don

not2easy




msg:4522376
 4:17 pm on Nov 25, 2012 (gmt 0)

@ keyplyr I appreciate that you have gone through the time and trouble to actually time the processing and share the results. I am using mod_access and have not seen any noticeable delay.

My htaccess has two sections for this process. At the top are entries added by a spider trap PHP script that sets the environment variable "getout", as in:
SetEnvIf Remote_Addr ^38\.105\.83\.12$ getout
That list is sometimes over 100 lines long. I remove those lines when I add a CIDR line to the list below it, which is set up as:
deny from 5.39.216.0/21
That list is well over 200 lines, before even looking at the country blocking list.

I'm adding the country blocking CIDRs to a color-coded spreadsheet to check against my existing lines, and since I can't just paste them into htaccess without trying to eliminate overlap, I had those general questions about whether the multi-CIDR line format
deny from 49.50.4.0/22 49.50.8.0/22 110.136.176.0/20 110.139.0.0/16 114.79.18.0/24 etc.
is preferable to single deny lines. If that processes any faster, it would make sense to combine all my single lines into fewer combined lines (to me anyway). And if I do that, are there limits to line length? I don't think I want to bunch everything into one long line, but it would be easier to maintain if I had a single line for Hetzner, for example, with everything on one line. If I do it that way, wouldn't it be slower if each line starts a new process? Since I have to reorganize things anyway, I thought maybe someone had experience, or is it just try and see what happens? I have looked around (a lot) at the Apache docs site and have not found a definitive answer.
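For readers following along: the SetEnvIf trap entries described above only set a flag; a companion block is still needed to act on it. A typical Apache 2.2-era pairing looks roughly like this (a sketch, not not2easy's actual file — adjust to your own Order/Allow layout):

```apache
# Hypothetical companion block (Apache 2.2 mod_authz_host syntax),
# acting on the "getout" flag set by the SetEnvIf trap entries:
Order Allow,Deny
Allow from all
Deny from env=getout
Deny from 5.39.216.0/21
```

With Allow,Deny ordering, any request matching a Deny line (or carrying the getout flag) gets a 403 even though Allow from all matched first.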

keyplyr




msg:4522413
 8:10 pm on Nov 25, 2012 (gmt 0)

where's my excess server load?

Don, I'm not saying you have excess server load :)

All I'm saying is, that using mod_rewrite with nested conditions makes the server go back to the start of the line each time until the entire line is processed. Thus, if you have many of these, it would be prudent to use mod_authz_host (allow, deny) to block ip ranges instead.

I use about 20 lines of mod_rewrite myself for various tasks. It's when these become complex that the processing time could become an issue.

@ not2easy Yes, combining lines is good. I do this per class A range for easier management. Theoretically, you could combine thousands on a single line.
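On the overlap-elimination question raised above, one way to de-duplicate a merged list before pasting it into htaccess is to let a script collapse subsumed and adjacent ranges. A small Python sketch using only the standard library (the sample ranges are taken from the post above, plus one deliberately redundant /24 of my own):

```python
import ipaddress

# Sample ranges from the thread, plus one deliberately
# redundant /24 that sits inside the /16.
raw = ["49.50.4.0/22", "49.50.8.0/22", "110.139.0.0/16",
       "110.139.12.0/24"]

nets = [ipaddress.IPv4Network(c) for c in raw]
merged = list(ipaddress.collapse_addresses(nets))
print(" ".join(str(n) for n in merged))
# 49.50.4.0/22 49.50.8.0/22 110.139.0.0/16
```

collapse_addresses drops subsumed blocks and merges adjacent ones only when the result is a valid aligned network, so the output can be pasted straight into a deny from line with no overlap.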

Bewenched




msg:4522416
 8:40 pm on Nov 25, 2012 (gmt 0)

some of the larger ranges *may* include other Pacific actors in Japan, Australia


especially Australia. We'd blocked some of the China ranges and accidentally blocked an Australian customer, who was nice enough to call us about it.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved