Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
odd UA
wilderness

WebmasterWorld Senior Member wilderness us a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



 
Msg#: 4416476 posted 12:05 am on Feb 11, 2012 (gmt 0)

Nice to see that Level3 still loves me after a nearly three year absence ;)

8.218.202.1.static.bjtelecom.net - - [10/Feb/2012:18:25:47 +0000] "GET / HTTP/1.0" 200 1512 "-" "\"Mozilla/5.0"

No idea (and I'm in a hurry) whether the quotes count as special characters that need escaping.

Could certainly do it on two lines and likely even one.

BTW, my logs have changed back to that repulsive format.

 

keyplyr

WebmasterWorld Senior Member keyplyr us a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



 
Msg#: 4416476 posted 2:22 am on Feb 11, 2012 (gmt 0)

I've seen this one and a few other UAs using the backward slash; possibly an attempt to make it difficult to ban.

adrian20



 
Msg#: 4416476 posted 3:28 am on Feb 11, 2012 (gmt 0)

Speaking of backward slash, take a look at this one.

92.249.127.111 - - [21/Jan/2012:18:06:29 -0600] "GET / HTTP/1.1" 403 25 "http://excel2010.ru/" "\xef\xbb\xbfMozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

I've never seen those quote characters in my logs before. I should prepare to receive them.

MxAngel



 
Msg#: 4416476 posted 9:21 am on Feb 11, 2012 (gmt 0)

Those are fake Googlebots. I've seen quite a few of those lately, always with a .ru referer.

wilderness

Msg#: 4416476 posted 9:48 am on Feb 11, 2012 (gmt 0)

This takes care of the fakers:

RewriteCond %{REMOTE_ADDR} !^66\.249\.(6[4-9]|[78][0-9]|9[0-5])\.
RewriteCond %{HTTP_USER_AGENT} Googlebot
RewriteRule .* - [F]
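A quick way to sanity-check an octet alternation like the one above is to test it against all 256 possible values. A Python sketch (the variable names are mine, not from the thread):

```python
import re

# The third-octet alternation from the RewriteCond above,
# anchored so it must consume the whole value.
octet_re = re.compile(r"^(?:6[4-9]|[78][0-9]|9[0-5])$")

# Collect every value in 0-255 that the pattern accepts.
matched = [n for n in range(256) if octet_re.match(str(n))]

# It should accept exactly 64 through 95, the range discussed here.
assert matched == list(range(64, 96))
```

The same loop works for any hand-built octet pattern before it goes into a live rule.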

lucy24

Msg#: 4416476 posted 11:24 am on Feb 11, 2012 (gmt 0)

Woo Hoo, it's my Ukrainians! Was going to mention them in another thread. If you decode it you'll find that EF BB BF is the UTF-8 byte-order mark (U+FEFF, a zero-width no-break space), presumably meant to be invisible. They've only recently started spoofing the googlebot; the leading \xef\xbb\xbf is a still newer variation.
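Those three bytes are easy to verify directly, assuming a UTF-8 log; a small Python check:

```python
import unicodedata

# \xef\xbb\xbf is the UTF-8 encoding of U+FEFF, the byte-order mark.
bom = b"\xef\xbb\xbf"
char = bom.decode("utf-8")

assert char == "\ufeff"
# Its formal Unicode name reflects the "invisible" behavior:
assert unicodedata.name(char) == "ZERO WIDTH NO-BREAK SPACE"
```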

:: shuffling papers ::

I get 'em from
92.249.0-127, 109.120.128-191, 178.136-137, 193.106.136-139, 213.110.128-159

That's (cut&paste):
Deny from 92.249.0.0/17
Deny from 109.120.128.0/18
Deny from 178.136.0.0/15
Deny from 193.106.136.0/22
Deny from 213.110.128.0/19

Go ahead and block them by IP but don't stress over them. They used to make me absolutely livid but I've grown accustomed to them.
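For anyone wanting to double-check that those Deny CIDRs really cover the dotted ranges listed above, Python's ipaddress module does the conversion:

```python
import ipaddress

# Each Deny CIDR from the post, paired with the first and last
# address it should cover, per the range list in the same post.
blocks = {
    "92.249.0.0/17":    ("92.249.0.0",    "92.249.127.255"),
    "109.120.128.0/18": ("109.120.128.0", "109.120.191.255"),
    "178.136.0.0/15":   ("178.136.0.0",   "178.137.255.255"),
    "193.106.136.0/22": ("193.106.136.0", "193.106.139.255"),
    "213.110.128.0/19": ("213.110.128.0", "213.110.159.255"),
}

for cidr, (first, last) in blocks.items():
    net = ipaddress.ip_network(cidr)
    assert str(net.network_address) == first
    assert str(net.broadcast_address) == last
```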

MxAngel



 
Msg#: 4416476 posted 12:36 pm on Feb 11, 2012 (gmt 0)

A little variant with a question mark:

Host: 133-221.sunnet.com.ua
IP: 213.110.133.221
Country Code: UA
User Agent: ?Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

lucy24

Msg#: 4416476 posted 10:05 pm on Feb 11, 2012 (gmt 0)

Are you positive it's a question mark and not an "I can't display this character" sign? Either way, look in the preceding post (list of IPs) and you'll see it's the same guys. Pull them out of your logs, look at them in isolation and you'll start finding a very distinctive pattern within each visit. In my case it focuses on a page that's pointless without its images-- which they have never tried to get.

adrian20



 
Msg#: 4416476 posted 3:35 am on Feb 12, 2012 (gmt 0)

MxAngel, I was wondering if anyone else gets this fake user agent.

wilderness, I am copying the RewriteCond. By the way, sometimes I like to use this combination:

!^66\.249\.(6[4-9]|[78][0-9]|9[0-5])\. instead of 66.249.64.0/19. How do you work these out, manually every time? I would like to know if there is a program that does the conversion.

lucy24, ohh yes, that is where my little website is popular. I am copying these IPs, including the one (213...) added by MxAngel.

Coming back to wilderness's issue: I think those quotes were generated at the protocol level, some strange combination the client attempted that "HTTP/1.0" did not understand. It could also have been a proxy mangling the request.

keyplyr

Msg#: 4416476 posted 5:43 am on Feb 12, 2012 (gmt 0)

I was wondering if anyone else gets this fake UserAgent.

Absolutely, Googlebot is perhaps the most widely spoofed UA besides various browsers. I block them on a daily basis:

RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteCond %{REMOTE_ADDR} !^66\.249\.[6-9][0-9]\.
RewriteRule .* - [F]

Catches about a dozen imposters per day. The same filter is also used on Yandex, Slurp & Bingbot.

adrian20



 
Msg#: 4416476 posted 11:42 am on Feb 12, 2012 (gmt 0)

keyplyr, hehehe; I've been racking my brain looking for a way to isolate Google IPs along with the Googlebot UA. Thanks for this RewriteCond, I needed it.

wilderness

Msg#: 4416476 posted 12:23 pm on Feb 12, 2012 (gmt 0)

I would like to know if there is any program that does the conversion.


adrian,
I do it manually.
There are some online tools however they are generally pitiful and create bloated ranges.

Many moons ago, for the 200-255 range, many of us at Jim's suggestion began using:
2[0-5][0-9]
as opposed to 2[0-4][0-9]|25[0-5]

the former being shorter.

Google even has an IP regex converter [google.com], HOWEVER I'd caution anybody NOT to use the converted output as-is.
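Short of a trusted converter, a hand-built alternation can at least be verified mechanically. A Python sketch (the `covers` helper is my own name, not an existing tool):

```python
import re

def covers(pattern: str, lo: int, hi: int) -> bool:
    """True if `pattern` matches exactly the octet values lo..hi."""
    rx = re.compile(rf"^(?:{pattern})$")
    return [n for n in range(256) if rx.match(str(n))] == list(range(lo, hi + 1))

# The hand-built pattern from earlier in the thread, for 64-95:
assert covers(r"6[4-9]|[78][0-9]|9[0-5]", 64, 95)

# Jim's shorthand for 200-255: technically it also matches 256-259,
# but those values never occur in a real octet, so it is safe.
assert covers(r"2[0-5][0-9]", 200, 255)
assert covers(r"2[0-4][0-9]|25[0-5]", 200, 255)
```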

wilderness

Msg#: 4416476 posted 12:34 pm on Feb 12, 2012 (gmt 0)

keyplyr,
You've been around here as long as me.
I'm sure you're aware that Googlebot only goes up through the 95 Class C.

Just a heads up, in case you're not.

Don

wilderness

Msg#: 4416476 posted 12:59 pm on Feb 12, 2012 (gmt 0)

Here's the Google-generated syntax:

^66\.249\.(6[4-9]|[7-8][0-9]|9[0-5])\.([0-9]|[1-9][0-9]|1([0-9][0-9])|2([0-4][0-9]|5[0-5]))$

The entire third section (including the ends-with anchor) is pure bloat.
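That claim checks out: the final group matches every value an octet can take, so it excludes nothing. A quick Python confirmation:

```python
import re

# The trailing group from the Google-generated pattern, isolated.
third = re.compile(r"^([0-9]|[1-9][0-9]|1([0-9][0-9])|2([0-4][0-9]|5[0-5]))$")

# It accepts all 256 possible octet values, so as a filter it is a no-op.
assert all(third.match(str(n)) for n in range(256))
assert not third.match("256")  # a valid octet matcher, just pointless here
```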

MxAngel



 
Msg#: 4416476 posted 7:12 pm on Feb 12, 2012 (gmt 0)

Lucy, yes, I'm positive it's a ? character. Btw, a while ago I found them in the logs with three "can't display this character" marks, just before they started using the \xef\xbb\xbf ones.

I've been blocking those ranges for ages.

In my case they go after popular topics / posts.

adrian, just like keyplyr I get them almost daily.

lucy24

Msg#: 4416476 posted 11:27 pm on Feb 12, 2012 (gmt 0)

Many moons ago for 200-255 range many of us at Jim's suggestion began using:
2[0-5][0-9]
as opposed to 2[0-4][0-9]|25[0-5]


Since life as we know it ends at 255, a simple 2\d\d should work fine too.

wilderness

Msg#: 4416476 posted 12:18 am on Feb 13, 2012 (gmt 0)

a simple 2\d\d should work fine too.


lucy,
What exactly does that translate to?
^66\.249\.(6[4-9]|[7-8][0-9]|9[0-5])\.([0-9]|[1-9][0-9]|1[0-9][0-9]|2\d\d)$

Note: your ending added.

lucy24

Msg#: 4416476 posted 2:03 am on Feb 13, 2012 (gmt 0)

2\d\d = 2[0-9][0-9]

\d = digit
\w = "word character" (alphanumerics and lowline, but not hyphen)

I haven't done exhaustive testing, but so far htaccess has recognized every RegEx construction I've fed it. I wouldn't try anything like \p{Alpha} or \p{Punct} because there are approximately 700 dialect-specific variants. Same for \x{blahblah}.

But lookaheads/behinds (?=blahblah) (?<!blahblah) etc. and non-capturing groups (?:blahblah) both work-- and both of those can be very useful.

Among things I haven't personally tested, but should:

\s for space (could be either useful or disastrous; note that in PCRE it does count line-ends, since \s includes newline)
\h (note: in PCRE this actually matches horizontal whitespace such as tabs and spaces, not hexadecimal; for hex digits use [0-9a-fA-F])
\W, \D and so on: negating by capitalizing, so "non-word", "non-digit" and so on.
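Python's re module shares these basic constructs with the PCRE engine Apache uses, so it's a convenient place to experiment (the sample strings below are mine):

```python
import re

# \d is exactly [0-9] for ASCII digits:
assert re.fullmatch(r"2\d\d", "249")
assert re.fullmatch(r"2[0-9][0-9]", "249")

# Negative lookahead: match "Googlebot" only when NOT followed by "/2.1".
ua_fake = re.compile(r"Googlebot(?!/2\.1)")
assert ua_fake.search("Googlebot/2.0")
assert not ua_fake.search("Googlebot/2.1")

# Non-capturing group: alternation without creating a backreference.
m = re.fullmatch(r"(?:6[4-9]|9[0-5])", "95")
assert m and m.groups() == ()
```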

wilderness

Msg#: 4416476 posted 12:59 pm on Feb 18, 2012 (gmt 0)

Nice to see that Level3 still loves me after a nearly three year absence ;)

8.218.202.1.static.bjtelecom.net - - [10/Feb/2012:18:25:47 +0000] "GET / HTTP/1.0" 200 1512 "-" "\"Mozilla/5.0"


In my previous, I was confused by the log format.

This is NOT Level3, rather 1.202.218.8: Beijing Telecom.

And this bugger is quite persistent. It's been returning about every 90 minutes to eat a 403.

dstiles

Msg#: 4416476 posted 9:43 pm on Feb 18, 2012 (gmt 0)

49 hits in 5 days from that one. I've blocked 1.202.216.0 - 1.202.223.255 as having received several other unwanted hits. It's a static range (at least, some is).

wilderness

Msg#: 4416476 posted 12:55 am on Feb 19, 2012 (gmt 0)

49 hits in 5 days from that one. I've blocked 1.202.216.0


You're more tolerant than I!
I added the entire Class A's 1 & 2.

keyplyr

Msg#: 4416476 posted 1:59 am on Feb 19, 2012 (gmt 0)


If it's China, it's blocked:

deny from 1.202.0.0/15

keyplyr

Msg#: 4416476 posted 2:02 am on Feb 19, 2012 (gmt 0)

keyplyr,
You've been around here as long as me.
I'm sure you're aware that Googlebot only goes up through the 95 Class C

Yes, Googlebot does but other Google utilities use the remainder of that range and others. The code I posted is only part of a larger rule with other conditions and UAs.

lucy24

Msg#: 4416476 posted 2:21 am on Feb 19, 2012 (gmt 0)

:: detour to htaccess ::

... Yup, I'm on the same Intolerance Level as keyplyr. Mine says 1.202.0.0/15. In fact, I think it was one of the first China ranges I ever blocked.

8.218.202.1.static.bjtelecom.net
<snip>
I was confused by the log format.

This is NOT Level3, rather 1.202.218.8


I can't find the RegExes I used last summer when my own logs temporarily did the same thing, but there are a few basic patterns that will cover most of them.

^\d+\.\d+\.\d+\.\d+\.\p{Alpha}
is tricky because it can go either way-- that is, \1.\2.\3.\4 OR \4.\3.\2.\1

Looking back, it's mostly
^\p{Alpha}+-\d+-\d+-\d+-\d+([-.]\S+)? (with trailing space)

and those are easy to pull apart. This is direct cut & paste from my text editor, which speaks a different RegEx dialect than htaccess.
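For the reversed-quad hostnames specifically, here's a Python sketch that detects a leading dotted quad and flips it. As noted above it can go either way, so which order is the real IP still has to be checked against the log's address column:

```python
import re
from typing import Optional

# Leading dotted quad at the start of a hostname, e.g.
# "8.218.202.1.static.bjtelecom.net" -> ("8", "218", "202", "1")
quad = re.compile(r"^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.")

def reversed_ip(hostname: str) -> Optional[str]:
    """Return the flipped dotted quad, or None if there is no leading quad."""
    m = quad.match(hostname)
    return ".".join(reversed(m.groups())) if m else None

assert reversed_ip("8.218.202.1.static.bjtelecom.net") == "1.202.218.8"
assert reversed_ip("133-221.sunnet.com.ua") is None
```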

dstiles

Msg#: 4416476 posted 8:07 pm on Feb 19, 2012 (gmt 0)

Wilderness, Keyplr - I block China (and a few others) on a site-by-site basis. Some of my clients want traffic from such places. :(

I still block undesirable ranges within the countries, though. If they rack up a lot of hits within an IP range, that range gets blocked.

Igal Zeifman



 
Msg#: 4416476 posted 9:46 am on Jul 23, 2012 (gmt 0)

[If it's China, it's blocked: deny from 1.202.0.0/15 ]

I'm from Incapsula and we just finished a Googlebot study (see link below).
While going over the data we noticed a verifiable Google Images bot visit from a Chinese IP. We are still investigating this, but so far it seems legit.

<snip>

[edited by: incrediBILL at 5:18 pm (utc) on Jul 23, 2012]
[edit reason] no blog URLs please [/edit]
