
Forum Moderators: Ocean10000 & incrediBILL


odd UA

     

wilderness

12:05 am on Feb 11, 2012 (gmt 0)




Nice to see that Level3 still loves me after a nearly three year absence ;)

8.218.202.1.static.bjtelecom.net - - [10/Feb/2012:18:25:47 +0000] "GET / HTTP/1.0" 200 1512 "-" "\"Mozilla/5.0"

No idea (and I'm in a bit of a hurry) whether the quotes are special characters or not?

Could certainly do it on two lines and likely even one.

BTW, my logs have changed back to that repulsive format.

keyplyr

2:22 am on Feb 11, 2012 (gmt 0)




I've seen this one and a few other UAs using the backward slash; possibly an attempt to make it difficult to ban.

adrian20

3:28 am on Feb 11, 2012 (gmt 0)



Speaking of backward slash, take a look at this one.

92.249.127.111 - - [21/Jan/2012:18:06:29 -0600] "GET / HTTP/1.1" 403 25 "http://excel2010.ru/" "\xef\xbb\xbfMozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Those quotes I've never seen in my logs. I need to prepare to receive them.

MxAngel

9:21 am on Feb 11, 2012 (gmt 0)



Those are fake Googlebots; I've seen quite a few of them lately, always with a .ru referer.

wilderness

9:48 am on Feb 11, 2012 (gmt 0)




This takes care of the fakers:

RewriteCond %{REMOTE_ADDR} !^66\.249\.(6[4-9]|[78][0-9]|9[0-5])\.
RewriteCond %{HTTP_USER_AGENT} Googlebot
RewriteRule .* - [F]
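For anyone who wants to double-check that IP pattern, a quick brute-force test (Python used purely for illustration; the rule itself is Apache mod_rewrite):

```python
import re

# The address pattern from the RewriteCond above, without the leading
# "!" (mod_rewrite's "!" negates the match; the regex itself is plain).
googlebot_net = re.compile(r"^66\.249\.(6[4-9]|[78][0-9]|9[0-5])\.")

# It should accept exactly third octets 64-95, i.e. 66.249.64.0/19.
matched = [n for n in range(256) if googlebot_net.match("66.249.%d.1" % n)]
print(matched == list(range(64, 96)))  # → True
```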

lucy24

11:24 am on Feb 11, 2012 (gmt 0)




Woo Hoo, it's my Ukrainians! I was going to mention them in another thread. If you decode it the right way you'll find that EF BB BF is the UTF-8 byte-order mark, a zero-width (no-break) space, presumably meant to be invisible. They've only recently started spoofing the googlebot; the leading \xef\xbb\xbf is a still newer variation.
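That identification checks out: the three bytes are the UTF-8 encoding of U+FEFF, the byte-order mark / zero-width no-break space. A two-line check for the curious:

```python
# EF BB BF decodes to the single invisible character U+FEFF.
bom = b"\xef\xbb\xbf"
char = bom.decode("utf-8")
print(len(char), hex(ord(char)))  # → 1 0xfeff
```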

:: shuffling papers ::

I get 'em from
92.249.0-127, 109.120.128-191, 178.136-137, 193.106.136-139, 213.110.128-159

That's (cut&paste):
Deny from 92.249.0.0/17
Deny from 109.120.128.0/18
Deny from 178.136.0.0/15
Deny from 193.106.136.0/22
Deny from 213.110.128.0/19

Go ahead and block them by IP but don't stress over them. They used to make me absolutely livid but I've grown accustomed to them.
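If you want to confirm those CIDR blocks line up with the dotted ranges above, Python's stdlib ipaddress module (used here purely as a checking tool) makes it mechanical:

```python
import ipaddress

# Each CIDR block should span exactly the dotted range quoted above.
blocks = {
    "92.249.0.0/17":    ("92.249.0.0",    "92.249.127.255"),
    "109.120.128.0/18": ("109.120.128.0", "109.120.191.255"),
    "178.136.0.0/15":   ("178.136.0.0",   "178.137.255.255"),
    "193.106.136.0/22": ("193.106.136.0", "193.106.139.255"),
    "213.110.128.0/19": ("213.110.128.0", "213.110.159.255"),
}
for cidr, (first, last) in blocks.items():
    net = ipaddress.ip_network(cidr)
    assert (str(net[0]), str(net[-1])) == (first, last)
print("all five blocks match the quoted ranges")
```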

MxAngel

12:36 pm on Feb 11, 2012 (gmt 0)



A little variant with a question mark:

Host: 133-221.sunnet.com.ua
IP: 213.110.133.221
Country Code: UA
User Agent: ?Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

lucy24

10:05 pm on Feb 11, 2012 (gmt 0)




Are you positive it's a question mark and not an "I can't display this character" sign? Either way, look in the preceding post (list of IPs) and you'll see it's the same guys. Pull them out of your logs, look at them in isolation and you'll start finding a very distinctive pattern within each visit. In my case it focuses on a page that's pointless without its images-- which they have never tried to get.

adrian20

3:35 am on Feb 12, 2012 (gmt 0)



MxAngel, I was wondering if anyone else gets this fake UserAgent.

wilderness, I'm copying the RewriteCond. By the way, sometimes I like to use this combination:

!^66\.249\.(6[4-9]|[78][0-9]|9[0-5])\. instead of 66.249.64.0/19. The way you've done it, do you work it out manually every time? I'd like to know if there is a program that does the conversion.

lucy24, ahh yes, that is where my little website is popular. I'm copying these IPs, including the one (213...) added by MxAngel.

Coming back to wilderness's issue: I think the quotes were generated at the protocol level -- some strange combination the client attempted that HTTP/1.0 didn't understand. It could also have been a mangling by some proxy.

keyplyr

5:43 am on Feb 12, 2012 (gmt 0)




I was wondering if anyone else gets this fake UserAgent.

Absolutely, Googlebot is perhaps the most widely spoofed UA besides various browsers. I block them on a daily basis:

RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteCond %{REMOTE_ADDR} !^66\.249\.[6-9][0-9]\.
RewriteRule .* - [F]

Catches about a dozen imposters per day. The same filter is also used on Yandex, Slurp & Bingbot.

adrian20

11:42 am on Feb 12, 2012 (gmt 0)



keyplyr, hehehe; I've been racking my brain looking for a way to isolate Google's IPs along with the Googlebot UA. Thanks for this RewriteCond, I needed it.

wilderness

12:23 pm on Feb 12, 2012 (gmt 0)




I would like to know if there is any program that does the conversion.


adrian,
I do it manually.
There are some online tools; however, they are generally pitiful and create bloated expressions.

Many moons ago, for the 200-255 range, many of us at Jim's suggestion began using:
2[0-5][0-9]
as opposed to 2[0-4][0-9]|25[0-5]

the former being shorter.

Google even has an IP-to-regex converter [google.com]; HOWEVER, I'd caution that nobody use the converted output as-is.
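Since the question of automating the conversion came up: the decade trick generalizes, and a small script can emit exactly the hand-built style used in this thread. This is a hypothetical sketch (not one of the online tools), limited to two-digit octet sub-ranges like the Googlebot one:

```python
def octet_range_regex(lo, hi):
    """Regex alternation for a two-digit octet range, e.g. 64-95,
    in the compact decade style used in this thread."""
    parts = []
    n = lo
    while n <= hi:
        if n % 10 == 0 and n + 9 <= hi:
            # Collect consecutive whole decades into one class.
            first_tens = n // 10
            while n % 10 == 0 and n + 9 <= hi:
                n += 10
            last_tens = (n - 10) // 10
            tens = str(first_tens) if first_tens == last_tens \
                else "[%d-%d]" % (first_tens, last_tens)
            parts.append(tens + "[0-9]")
        else:
            # Partial decade: fixed tens digit, ranged units digit.
            d, end = n // 10, min(hi, n // 10 * 10 + 9)
            units = str(n % 10) if n % 10 == end % 10 \
                else "[%d-%d]" % (n % 10, end % 10)
            parts.append(str(d) + units)
            n = end + 1
    return "|".join(parts)

print(octet_range_regex(64, 95))  # → 6[4-9]|[7-8][0-9]|9[0-5]
```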

wilderness

12:34 pm on Feb 12, 2012 (gmt 0)




keyplyr,
You've been around here as long as me.
I'm sure you're aware that Google only goes through the 95 Class C.

Just a heads up, in case you're not.

Don

wilderness

12:59 pm on Feb 12, 2012 (gmt 0)




Here's the Google-generated syntax:

^66\.249\.(6[4-9]|[7-8][0-9]|9[0-5])\.([0-9]|[1-9][0-9]|1([0-9][0-9])|2([0-4][0-9]|5[0-5]))$

the entire third section (including the "$" end anchor) is pure bloat
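That section really does buy nothing: on its own it matches every value from 0 through 255, i.e. any legal final octet. A quick demonstration (Python, for illustration only):

```python
import re

# The third section of the Google-generated pattern, isolated and anchored.
last_octet = re.compile(r"^([0-9]|[1-9][0-9]|1([0-9][0-9])|2([0-4][0-9]|5[0-5]))$")

# It accepts exactly 0-255 -- so constraining the final octet with it
# is equivalent to just ending the pattern at the preceding "\.".
matches = [n for n in range(300) if last_octet.match(str(n))]
print(matches == list(range(256)))  # → True
```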

MxAngel

7:12 pm on Feb 12, 2012 (gmt 0)



Lucy, yes, I'm positive it's the ? character. Btw, a while ago I found them in the logs with three "can't display this character" signs, just before they started using the \xef\xbb\xbf ones.

I've been blocking those ranges for ages.

In my case they go after popular topics / posts.

adrian, just like keyplyr I get them almost daily.

lucy24

11:27 pm on Feb 12, 2012 (gmt 0)




Many moons ago for 200-255 range many of us at Jim's suggestion began using:
2[0-5][0-9]
as opposed to 2[0-4][0-9]|25[0-5]


Since life as we know it ends at 255, a simple 2\d\d should work fine too.
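All three spellings agree everywhere it matters, since a real octet never exceeds 255; the shorthands only diverge on impossible values like 260 or 299. A quick comparison (Python regexes, which match PCRE syntax here):

```python
import re

long_form  = re.compile(r"^(2[0-4][0-9]|25[0-5])$")  # exact 200-255
short_form = re.compile(r"^2[0-5][0-9]$")            # 200-259
dd_form    = re.compile(r"^2\d\d$")                  # 200-299

# Identical verdicts on every genuine octet value in 200-255.
for n in range(200, 256):
    s = str(n)
    assert long_form.match(s) and short_form.match(s) and dd_form.match(s)

# The looser forms only "overmatch" on values no dotted quad can contain.
print(short_form.match("260") is None, dd_form.match("299") is not None)  # → True True
```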

wilderness

12:18 am on Feb 13, 2012 (gmt 0)




a simple 2\d\d should work fine too.


lucy,
What exactly does that translate to?
^66\.249\.(6[4-9]|[7-8][0-9]|9[0-5])\.([0-9]|[1-9][0-9]|1([0-9][0-9])|2\d\d)$

Note: your ending added.

lucy24

2:03 am on Feb 13, 2012 (gmt 0)




2\d\d = 2[0-9][0-9]

\d = digit
\w = "word character" (alphanumerics and lowline, but not hyphen)

I haven't done exhaustive testing, but so far htaccess has recognized every RegEx construction I've fed it. I wouldn't try anything like \p{Alpha} or \p{Punct} because there are approximately 700 dialect-specific variants. Same for \x{blahblah}.

But lookaheads/behinds (?=blahblah) (?<!blahblah) etc. and non-capturing groups (?:blahblah) both work-- and both of those can be very useful.

Among things I haven't personally tested, but should:

\s for space (could be either useful or disastrous, depending on whether it counts line-ends as spaces)
\h for hexadecimal (i.e. same as [0-9a-fA-F])
\W \D and so on: negating by capitalizing, so "non-word", "non-digit" and so on.
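Most of that list holds up, with one caveat: in PCRE (the engine Apache uses), \h is horizontal whitespace (space and tab), not hexadecimal; for hex digits you still need [0-9a-fA-F]. The rest can be spot-checked quickly (Python's re agrees with PCRE on these):

```python
import re

assert re.match(r"\d", "7")                               # digit
assert re.match(r"\w", "_") and not re.match(r"\w", "-")  # lowline yes, hyphen no
assert re.match(r"\s", "\n")                              # \s DOES count line-ends
assert re.match(r"\D", "x") and re.match(r"\W", "-")      # capitals negate
assert re.match(r"[0-9a-fA-F]", "c")                      # hex digit, spelled out
print("all shorthand classes behave as described")
```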

wilderness

12:59 pm on Feb 18, 2012 (gmt 0)




Nice to see that Level3 still loves me after a nearly three year absence ;)

8.218.202.1.static.bjtelecom.net - - [10/Feb/2012:18:25:47 +0000] "GET / HTTP/1.0" 200 1512 "-" "\"Mozilla/5.0"


In my previous, I was confused by the log format.

This is NOT Level3, but rather 1.202.218.8,
Beijing Telecom.

And this bugger is quite persistent. He's been returning about every 90 minutes to eat a 403.
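For anyone else stuck reading that reversed format: the resolved hostname leads with the four octets in reverse order, so flipping the first four dot-separated fields recovers the real address. A throwaway helper (hypothetical, just to illustrate the transformation):

```python
def unreverse(hostname):
    """Recover the dotted-quad IP from a hostname whose first four
    labels are the octets reversed, e.g. 8.218.202.1.static... -> 1.202.218.8"""
    parts = hostname.split(".")
    return ".".join(reversed(parts[:4]))

print(unreverse("8.218.202.1.static.bjtelecom.net"))  # → 1.202.218.8
```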

dstiles

9:43 pm on Feb 18, 2012 (gmt 0)




49 hits in 5 days from that one. I've blocked 1.202.216.0 - 1.202.223.255 as having received several other unwanted hits. It's a static range (at least, some is).

wilderness

12:55 am on Feb 19, 2012 (gmt 0)




49 hits in 5 days from that one. I've blocked 1.202.216.0


You're more tolerant than I!
I added the Class A's 1 & 2.

keyplyr

1:59 am on Feb 19, 2012 (gmt 0)





If it's China, it's blocked:

deny from 1.202.0.0/15

keyplyr

2:02 am on Feb 19, 2012 (gmt 0)




keyplyr,
You've been around here as long as me.
I'm sure you're aware that Google only goes through the 95 Class C

Yes, Googlebot does but other Google utilities use the remainder of that range and others. The code I posted is only part of a larger rule with other conditions and UAs.

lucy24

2:21 am on Feb 19, 2012 (gmt 0)




:: detour to htaccess ::

... Yup, I'm on the same Intolerance Level as keyplyr. Mine says 1.202.0.0/15. In fact, I think it was one of the first China ranges I ever blocked.

8.218.202.1.static.bjtelecom.net
<snip>
I was confused by the log format.

This is NOT Level3, but rather 1.202.218.8


I can't find the RegExes I used last summer when my own logs temporarily did the same thing, but there are a few basic patterns that will cover most of them.

^\d+\.\d+\.\d+\.\d+\.\p{Alpha}
is tricky because it can go either way-- that is, \1.\2.\3.\4 OR \4.\3.\2.\1

Looking back, it's mostly
^\p{Alpha}+-\d+-\d+-\d+-\d+([-.]\S+)? (with trailing space)

and those are easy to pull apart. This is direct cut & paste from my text editor, which speaks a different RegEx dialect than htaccess.
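That second pattern translates cleanly out of the \p{...} dialect: in Python's re (or any engine without \p support), [A-Za-z] stands in for \p{Alpha}. The sample hostname below is invented for illustration:

```python
import re

# lucy24's pattern, with \p{Alpha} rewritten as [A-Za-z];
# note the significant trailing space.
reversed_host = re.compile(r"^[A-Za-z]+-\d+-\d+-\d+-\d+([-.]\S+)? ")

print(bool(reversed_host.match("cpe-76-178-99-1.res.example.net ")))   # → True
print(bool(reversed_host.match("8.218.202.1.static.bjtelecom.net ")))  # → False
```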

dstiles

8:07 pm on Feb 19, 2012 (gmt 0)




Wilderness, keyplyr - I block China (and a few others) on a site-by-site basis. Some of my clients want traffic from such places. :(

I still block undesirable ranges within the countries, though. If they rack up a lot of hits within an IP range, that range gets blocked.

Igal Zeifman

9:46 am on Jul 23, 2012 (gmt 0)



[If it's China, it's blocked: deny from 1.202.0.0/15 ]

I'm from Incapsula, and we just finished a Googlebot study (see link below).
While going over the data we noticed a verified Google Images bot visit from a Chinese IP. We are still investigating this, but so far it seems legit.

<snip>

[edited by: incrediBILL at 5:18 pm (utc) on Jul 23, 2012]
[edit reason] no blog URLs please [/edit]

 
