Forum Moderators: phranque

Blocking Amazonaws in htaccess

Another crawler Bixolabs

         

grandma genie

11:22 pm on Aug 13, 2010 (gmt 0)

10+ Year Member



Hello,

I was checking my server logs today and found a series of requests from this IP:

184.72.133.4 - - [13/Aug/2010:06:15:04 -0400] "GET /osc/?cPath=145 HTTP/1.1" 200 60720 "-" "Mozilla/5.0 (compatible; bixolabs/1.0; +http://bixolabs.com/crawler/general; crawler@bixolabs.com)"
184.72.133.4 - - [13/Aug/2010:06:15:11 -0400] "GET /bugs/bugs.html HTTP/1.1" 200 11722 "-" "Mozilla/5.0 (compatible; bixolabs/1.0; +http://bixolabs.com/crawler/general; crawler@bixolabs.com)"
184.72.133.4 - - [13/Aug/2010:06:15:11 -0400] "GET /osc/?cPath=70 HTTP/1.1" 200 13639 "-" "Mozilla/5.0 (compatible; bixolabs/1.0; +http://bixolabs.com/crawler/general; crawler@bixolabs.com)"
184.72.133.4 - - [13/Aug/2010:06:15:16 -0400] "GET /osc/?cPath=130 HTTP/1.1" 200 59477 "-" "Mozilla/5.0 (compatible; bixolabs/1.0; +http://bixolabs.com/crawler/general; crawler@bixolabs.com)"

The IP is from amazonaws. I've had them visit my site before and would prefer that they stay away. Who is bixolabs.com? Is there a way to block all amazonaws visitors from my site using htaccess?

Thank you.

Jeannie

jdMorgan

2:43 am on Aug 14, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've used something like this:

# Block Amazon Elastic Compute Cloud except ia_archiver, botmobi, & thumbnailers (Ask Jeeves?)
# Allow Ask Jeeves page thumbnailer
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/5\.0\ \(X11;\ U;\ Linux\ i686;\ en-US;\ rv:1[.0-9]+\)\ Gecko/20[0-9]{6,8}\ Firefox/[2-9][.0-9]+$
# Allow ia_archiver
RewriteCond %{HTTP_USER_AGENT}>%{HTTP:FROM} !^ia_archiver\ \(\+http://www\.alexa\.com/site/help/webmasters;\ crawler@alexa\.com\)>crawler@alexa\.com$
# Allow botmobi
RewriteCond %{HTTP_USER_AGENT} !\(botmobi\ find\.mobi/bot\.html\ find@mtld\.mobi\)/?$
# Allow xMarks/FoxMarks
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/5\.0\ \(compatible;\ XmarksFetch/[1-9]\.[0-9]+;\ \+http://www\.xmarks\.com/about/crawler;\ info@xmarks\.com\)$
# 67.202.0.0-67.202.63.255, 72.21.192.0-72.21.223.255, 72.44.32.0-72.44.63.255, 75.101.128.0-75.101.255.255,
# 79.125.0.0-79.125.127.255, 174.129.0.0-174.129.255.255, 184.72.0.0-184.73.255.255, 204.177.154.0-204.177.155.255,
# 204.236.128.0-204.236.255.255, 207.171.160.0-207.171.191.255, 216.137.32.0-216.137.63.255, 216.182.224.0-216.182.239.255
RewriteCond %{REMOTE_ADDR} ^(67\.202\.([1-5]?[0-9]|6[0-3])|72\.21\.(19[2-9]|2[01][0-9]|22[0-3])|72\.44\.(3[2-9]|[45][0-9]|6[0-3])|75\.101\.(12[89]|1[3-9][0-9]|2[0-5][0-9])|79\.125\.([1-9]?[0-9]|1[01][0-9]|12[0-7])|174\.129|184\.7[23]|204\.177\.15[45]|204\.236\.(12[89]|1[3-9][0-9]|2[0-5][0-9])|207\.171\.1([678][0-9]|9[01])|216\.137\.(3[2-9]|[45][0-9]|6[0-3])|216\.182\.2(2[4-9]|3[0-9]))\.
RewriteRule ^ - [F]

Jim
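A pattern this long is easy to get subtly wrong, so it may be worth checking it against the ranges listed in the comments before deploying it. The following is a minimal Python sketch of such a check, not part of the rule itself. It assumes the REMOTE_ADDR pattern with the dots in "204.236." escaped, and the names `EC2_PATTERN`, `RANGES`, `is_blocked` and `check_ranges` are made up for illustration. Python's `re` syntax is close to, but not identical to, the PCRE engine Apache uses, so treat a clean result as a sanity check only.

```python
import ipaddress
import re

# REMOTE_ADDR pattern from the rule above (dots in "204.236." escaped).
EC2_PATTERN = re.compile(
    r"^(67\.202\.([1-5]?[0-9]|6[0-3])"
    r"|72\.21\.(19[2-9]|2[01][0-9]|22[0-3])"
    r"|72\.44\.(3[2-9]|[45][0-9]|6[0-3])"
    r"|75\.101\.(12[89]|1[3-9][0-9]|2[0-5][0-9])"
    r"|79\.125\.([1-9]?[0-9]|1[01][0-9]|12[0-7])"
    r"|174\.129|184\.7[23]|204\.177\.15[45]"
    r"|204\.236\.(12[89]|1[3-9][0-9]|2[0-5][0-9])"
    r"|207\.171\.1([678][0-9]|9[01])"
    r"|216\.137\.(3[2-9]|[45][0-9]|6[0-3])"
    r"|216\.182\.2(2[4-9]|3[0-9]))\."
)

# (first, last) address of each range named in the rule's comments.
RANGES = [
    ("67.202.0.0", "67.202.63.255"),
    ("72.21.192.0", "72.21.223.255"),
    ("72.44.32.0", "72.44.63.255"),
    ("75.101.128.0", "75.101.255.255"),
    ("79.125.0.0", "79.125.127.255"),
    ("174.129.0.0", "174.129.255.255"),
    ("184.72.0.0", "184.73.255.255"),
    ("204.177.154.0", "204.177.155.255"),
    ("204.236.128.0", "204.236.255.255"),
    ("207.171.160.0", "207.171.191.255"),
    ("216.137.32.0", "216.137.63.255"),
    ("216.182.224.0", "216.182.239.255"),
]


def is_blocked(ip: str) -> bool:
    """Return True if the dotted-quad string matches the block pattern."""
    return EC2_PATTERN.search(ip) is not None


def check_ranges() -> list:
    """Return a list of boundary addresses the pattern handles incorrectly."""
    problems = []
    for first, last in RANGES:
        # Both ends of each range should be blocked.
        for ip in (first, last):
            if not is_blocked(ip):
                problems.append(("not blocked", ip))
        # The addresses just outside each range should not be blocked.
        below = str(ipaddress.IPv4Address(first) - 1)
        above = str(ipaddress.IPv4Address(last) + 1)
        for ip in (below, above):
            if is_blocked(ip):
                problems.append(("wrongly blocked", ip))
    return problems


if __name__ == "__main__":
    print(check_ranges() or "all boundary checks passed")
    print(is_blocked("184.72.133.4"))  # the address from the log excerpt
```

Only the edges of each range are tested here, which is where an off-by-one in a numeric sub-pattern would normally show up.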

grandma genie

3:27 am on Aug 14, 2010 (gmt 0)

10+ Year Member



Hi Jim,

Nope, this caused an internal server error, so I took it off. I have some of the Amazonaws.com IPs blocked in htaccess. Could that cause the error?

Jeannie

wilderness

6:27 am on Aug 14, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hey Jim,
Does the following imply something more than 10-59?

[1-5]?[0-9]

TIA

Don

jdMorgan

11:35 am on Aug 14, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It matches 0-59, because the pattern specifies 0-9 or 10-59.

The leading [1-5] (tens digit) is made optional by the "?" quantifier.

Equivalent to, but more efficient than ([0-9]|[1-5][0-9])

Jim
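The 0-59 claim is easy to confirm by brute force. Here is a small Python sketch that lists every integer each form accepts; the names `COMPACT`, `EXPANDED` and `matched_numbers` are invented for this example. `fullmatch` stands in for the literal dots that bound the sub-pattern on both sides in the htaccess rule.

```python
import re

COMPACT = re.compile(r"[1-5]?[0-9]")            # form used in the rule
EXPANDED = re.compile(r"([0-9]|[1-5][0-9])")    # longer equivalent


def matched_numbers(pattern, limit=300):
    """Return the integers below `limit` whose decimal form the pattern fully matches."""
    return [n for n in range(limit) if pattern.fullmatch(str(n))]


if __name__ == "__main__":
    compact = matched_numbers(COMPACT)
    expanded = matched_numbers(EXPANDED)
    print(compact[0], compact[-1], len(compact))
    print(compact == expanded)
```

Both patterns accept the same set of values, so the choice between them is about length and matching effort rather than meaning.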

kkrugler

4:23 pm on Aug 14, 2010 (gmt 0)

10+ Year Member



Hi Jeannie,

I can't help with conditional blocking of bots from Amazon's EC2, but I can tell you how to block the Bixolabs crawler. As per details on the [bixolabs.com...] page (in the user agent string) you should add the following to your robots.txt:

User-agent: bixolabs
Disallow: /

-- Ken
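To see how a parser reads those two lines, here is a short Python sketch using the standard library's `urllib.robotparser`. It only shows how a compliant parser would interpret the rules; whether any given crawler honors robots.txt is up to that crawler. The helper name `allowed` and the second agent name "examplebot" are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# The two robots.txt lines suggested above.
ROBOTS_TXT = """\
User-agent: bixolabs
Disallow: /
"""


def allowed(agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    print(allowed("bixolabs", "/osc/?cPath=145"))    # the crawler named in the rule
    print(allowed("examplebot", "/osc/?cPath=145"))  # hypothetical other agent
```

Agents that are not named in the file are unaffected, since there is no `User-agent: *` record.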

grandma genie

5:31 pm on Aug 14, 2010 (gmt 0)

10+ Year Member



Hi Ken,

Thank you. I have done that now.

Jeannie

SevenCubed

2:43 pm on Aug 15, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This comment isn't searching for an answer; it's just sharing an observation. I was picking this stuff apart yesterday to try to digest it, and it is difficult enough to understand on its own. Then a little seed dropped into my thoughts: wait until you try to do something like that with the new IPv6 format looming on the horizon!

IPv4:
67.202.0.0 - 67.202.63.255 = (67\.202\.([1-5]?[0-9]|6[0-3])

IPv6:
3ffe:ffff:101::230:6eff:fe04:d9ff - 6eef:dddd:314::320:7fee:df06:f3ee = ?!

jdMorgan

1:47 am on Aug 19, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There is no insurmountable problem, except that the number of potential IP addresses is astronomically bigger. However, that doesn't translate into a necessarily-bigger number of "problematic IP addresses."

As far as the regular expressions go, the IPv4 periods become colons, and the numeric groups ("octets"), which are now 3-digit decimal [1-9][0-9]{0,2} representing 0-255 in IPv4, are changed to 4-digit ranges in hexadecimal ([1-9a-f][0-9a-f]{0,3})? representing 0-65535 in IPv6. The parentheses and "?" quantifier are used because in IPv6, zero-value 16-bit "words" are omitted, as in the "::" part of your example.

Your example range is both too large (representing just under 3,458,764,513,820,540,928 times the size of all currently-definable IPv4 space) and too "power-of-two-misaligned" to be realistic, but it could be coded, given more time than I have to answer a purely-hypothetical coding question... :)

Jim
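The "::" shorthand means the same address can be written several ways, which is awkward for text patterns. One way to see this is to expand an address to its full eight-group form. The Python sketch below does that with the standard `ipaddress` module, and also shows prefix matching as an alternative to regular expressions. This is a different technique from the htaccess approach in this thread, and the prefix `2001:db8::/32` is only the reserved documentation range, used here as a placeholder; `BLOCKED_PREFIXES`, `expand` and `is_blocked` are hypothetical names.

```python
import ipaddress

# Placeholder prefix list (documentation range), not a real block list.
BLOCKED_PREFIXES = [ipaddress.ip_network("2001:db8::/32")]


def expand(addr: str) -> str:
    """Return the full eight-group form, with '::' and leading zeros written out."""
    return ipaddress.IPv6Address(addr).exploded


def is_blocked(addr: str, prefixes=BLOCKED_PREFIXES) -> bool:
    """Return True if the address falls inside any of the given prefixes."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in prefixes)


if __name__ == "__main__":
    sample = "3ffe:ffff:101::230:6eff:fe04:d9ff"  # address from the post above
    print(expand(sample))
    print(is_blocked("2001:db8::1"))
    print(is_blocked(sample))
```

Expanding first gives every address a single fixed-width spelling, so a pattern (or a prefix comparison) does not have to account for omitted zero groups.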