Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

This 50 message thread spans 2 pages (page 2 shown).
Fake user agent strings
How to recognize...
grandma genie




msg:4462693
 10:21 pm on Jun 7, 2012 (gmt 0)

Hello,

Is it important for a webmaster to be able to recognize a fake user agent? And if it is, how does one do it? User agents come in all shapes and sizes. Some, like the fake Googlebots, are easy to recognize, but what about those really long ones? What do you think of this one?

Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; FunWebProducts; GTB7.3; .NET CLR 1.1.4322; FunWebProducts)

Just the duplicate FunWebProducts was odd. But the visitor's behavior was normal. Here is the IP: 79.74.80.nn

Here's a long one: Mozilla/4.0 (compatible; MSIE 8.0; AOL 9.6; AOLBuild 4340.5002; Windows NT 6.1; WOW64; Trident/4.0; GTB7.3; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; BRI/2; MAGW; InfoPath.3; .NET4.0C)

What is this?:
SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)

What types of things should we be looking for that would stand out as a potential threat? Should all the components of a ua be in a certain order? Can they be in any order? What difference does it make?

--grandma

 

incrediBILL




msg:4465086
 8:12 pm on Jun 13, 2012 (gmt 0)

The lack of a space between "compatible;"


Yup, improper spacing is standard fare used to block a user agent on all my sites.

However, IMO headers are more important than user agents for blocking bots, because most bots fail to send a few simple things that all browsers send. So I test headers first and user agents second, and as a result I boot things off the site while executing a lot less code.
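
A minimal sketch of the spacing check, in Python for illustration. The rule used here (every semicolon in a UA is followed by a space, a closing parenthesis, or the end of the string) is an assumed heuristic, not something taken from a spec or from the sites discussed, and `has_improper_spacing` is a hypothetical name.

```python
import re

# Heuristic (an assumption, not taken from any spec): inside a UA string,
# every semicolon should be followed by a space, a closing parenthesis,
# or the end of the string.
_SEMICOLON_WITHOUT_SPACE = re.compile(r";(?![ )]|$)")


def has_improper_spacing(user_agent: str) -> bool:
    """Return True if any semicolon is jammed against the next token."""
    return _SEMICOLON_WITHOUT_SPACE.search(user_agent) is not None
```

A UA containing "(compatible;MSIE 8.0; ...)" would be flagged, while the usual "(compatible; MSIE 8.0; ...)" would pass. Some legitimate clients may also trip a rule like this, so it is best treated as one signal among several.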

grandma genie




msg:4465211
 2:53 am on Jun 14, 2012 (gmt 0)

I have to run this by you. Have you ever seen anything like this before? Note the trailing IP addresses after the UA:

209.131.39.nn - - "GET /example.jpg HTTP/1.1" 200 72343 www.example.com "-"
"Mozilla/5.0 (iPhone; CPU iPhone OS 5_1_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B206 Safari/7534.48.3" "173.227.72.nn, 66.94.233.nnn"

dstiles




msg:4465611
 8:14 pm on Jun 14, 2012 (gmt 0)

The 173.227.72.nn IP belongs to a server farm at TW Telecom.

66.94.233.nn is Yahoo (a range new to me).

209.131.39.nn is also Yahoo.

If the context were different I would say the trailing IPs were actually proxy forwarded-for IPs. I see quite a lot of servers trying to by-pass blocks using proxies; usually broadband botnet ones but also G and Y. Given the mobile connotation that is a feasible scenario but I haven't met proxy forwarded-for IPs in logs before.
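
If the trailing field really is a forwarded-for list, it can be pulled out of a log line along these lines. This is a sketch only: it assumes the layout of the entry quoted above (four quoted fields: request, referrer, UA, trailing IP list), the function names are hypothetical, and the addresses in the usage note are documentation-range placeholders.

```python
import re
from typing import List


def quoted_fields(log_line: str) -> List[str]:
    """Return the contents of every double-quoted field, in order."""
    return re.findall(r'"([^"]*)"', log_line)


def forwarded_chain(log_line: str) -> List[str]:
    """Return the IPs in the trailing quoted field, or [] if there is none.

    Assumes the fourth quoted field, when present, is a comma-separated
    IP list such as a proxy's forwarded-for value.
    """
    fields = quoted_fields(log_line)
    if len(fields) < 4 or fields[-1] == "-":
        return []
    return [part.strip() for part in fields[-1].split(",") if part.strip()]
```

For a line ending in `"198.51.100.7, 203.0.113.9"`, `forwarded_chain` returns both addresses as separate strings, so each one can be looked up on its own.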

grandma genie




msg:4465703
 3:47 am on Jun 15, 2012 (gmt 0)

Would the community find these types of entries helpful, or should we just keep them to ourselves?

5.9.2.nnn - - "GET / HTTP/1.1" 302 - "-" "Ruby"

And in these cases, should we include the whole IP address, or block out the last octet?

lucy24




msg:4465716
 4:21 am on Jun 15, 2012 (gmt 0)

Would the community find these types of entries helpful

The community might go off on a tangent and ask why you're serving up 302s instead of 301s ;)

"Ruby" alone is obviously a bogus UA-- or a very simple-minded robot-- but then you can look up the IP, and maybe you'll discover a hitherto unsuspected server farm in Belarus.

The chance that the offender comes from a block smaller than a /24, meaning that you would need the fourth octet, is too small to be worth bothering about. More likely it will turn out to be splat in the middle of a /13 from some country that you never liked anyway.
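
For anyone unsure what a /24 or a /13 covers, Python's standard `ipaddress` module can work out the enclosing block. A small sketch; `enclosing_block` is a hypothetical helper, and the final octet in the usage note is made up, since the log entry masks it.

```python
import ipaddress


def enclosing_block(ip: str, prefix: int = 24) -> str:
    """Return the CIDR block of the given prefix length that contains ip."""
    # strict=False lets us pass a host address and have the host bits zeroed.
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))
```

`enclosing_block("5.9.2.77")` gives `5.9.2.0/24`, and `enclosing_block("5.9.2.77", 13)` gives `5.8.0.0/13`, so the last octet never matters at either size.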

grandma genie




msg:4465735
 5:22 am on Jun 15, 2012 (gmt 0)

Well, now they will be getting a 403.

Only suspicious visitors end up getting that 302 server response. Most typical real visitors come on the site via a Google search and don't hit the index.php file in the root folder. Even the normal search bots like Google or Yahoo or Bing don't access that page. If I go through the logs and look for those GET / HTTP/1.1 or GET / HTTP/1.0 entries, they always come from suspicious IPs and most, if not all, get blocked.

5.9.2.nn belongs to Hetzner Online AG.

Here is another one:
199.168.138.nn - - "GET / HTTP/1.0" 302 - "-" "-"
This is a mail server from VolumeDrive. I'll block them, too.

I could change the coding on the index page to give a 301, but I've just been too lazy to do it. It doesn't happen to Google, Yahoo, or Bing, so for now, since it only happens to the bad guys, I don't think it is an issue.
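
Going through the logs for those bare root requests can be sketched like this. It assumes the log layout shown in the entries above (IP first, then the quoted request, then the status code); `root_hits` is a hypothetical name and the sample addresses in the usage note are placeholders.

```python
import re

# Matches "GET / HTTP/1.0" or "GET / HTTP/1.1" and captures IP and status.
ROOT_REQUEST = re.compile(r'^(\S+) .*?"GET / HTTP/1\.[01]" (\d{3}) ')


def root_hits(lines):
    """Return (ip, status) for every request for the bare root URL."""
    hits = []
    for line in lines:
        match = ROOT_REQUEST.match(line)
        if match:
            hits.append((match.group(1), int(match.group(2))))
    return hits
```

Fed the line `192.0.2.5 - - "GET / HTTP/1.1" 302 - "-" "Ruby"`, it returns `[("192.0.2.5", 302)]`. A request for `/example/` is ignored.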

g1smd




msg:4465747
 6:21 am on Jun 15, 2012 (gmt 0)

By returning a 302 or 301 you're telling the bot to make a new request for a different URL. If the bot returns and requests that other URL, that action more than doubles the work your server is doing.

Where do you redirect these bots to?

keyplyr




msg:4465758
 6:54 am on Jun 15, 2012 (gmt 0)


"Ruby" alone is obviously a bogus UA

Actually, it's a valid UA. Ruby is a programming language (influenced by Perl, among others), and it shows up in much the same way as "Java/1.nn" does: as a GET tool. I see it all the time.

lucy24




msg:4465813
 9:36 am on Jun 15, 2012 (gmt 0)

I see it all the time.

From humans?

The name "Ruby" is very familiar to me, because my text editor defaults to Ruby syntax for Regular Expressions, so it's staring me in the face every day ;) But it sure isn't a browser.

keyplyr




msg:4465835
 11:19 am on Jun 15, 2012 (gmt 0)


No, not from humans. As I said, it's a program to GET files.

grandma genie




msg:4466033
 9:22 pm on Jun 15, 2012 (gmt 0)

All my files are in a folder in the root directory. My index.php file is in one folder. It is pointing to a different index file in another folder. That is what is causing the 302 code. I think I need to add this piece of code:
header ('HTTP/1.1 301 Moved Permanently');
I just haven't done it yet.

As for the Ruby UA, whenever anyone comes on the site with just one hit, I always check their IP. I've never seen the Ruby UA before. I assume it's a bot from Hetzner. Blocked it. Why wait for trouble?

keyplyr




msg:4466036
 9:33 pm on Jun 15, 2012 (gmt 0)

@ GG

Better yet, remove the index.php and move all the files directly into the root, keeping only images, etc. in folders.

You can still use the 301 redirect; just edit it correctly.

Read here: [webmasterworld.com...]

g1smd




msg:4466046
 10:00 pm on Jun 15, 2012 (gmt 0)

Don't copy the code in that thread as it has a lot of errors as listed in the next post.

There are dozens of errors to fix. Most of them are mentioned in that thread.

grandma genie




msg:4467358
 9:25 pm on Jun 19, 2012 (gmt 0)

OK, I haven't seen this one before.

209.85.224.nnn - - "GET /example/ HTTP/1.1" 200 26391 "-" "Mozilla/5.0 (compatible; GoogleDocs; legacyeditor; +http://docs.google.com)"

Will be blocking 209.85.224 unless someone has something good to say about them. Don't like proxies. Project Honey Pot said it was acting like a comment spammer.

g1smd




msg:4467360
 9:34 pm on Jun 19, 2012 (gmt 0)

Truncated UA:

157.55.17.nnn - - /cat/subcat/product "-" Mozilla/4.0 (compatible
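
One cheap way to catch a UA that has been cut off like this is to check that its parentheses balance. A sketch under that assumption only; `has_unbalanced_parens` is a hypothetical name, and a UA truncated outside a parenthesized comment would not be caught.

```python
def has_unbalanced_parens(user_agent: str) -> bool:
    """Return True if parentheses in the UA are not properly paired."""
    depth = 0
    for ch in user_agent:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return True  # a ")" with no matching "(" before it
    return depth != 0  # an unclosed "(" suggests the string was cut off
```

The fragment "Mozilla/4.0 (compatible" is flagged because its opening parenthesis is never closed.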

lucy24




msg:4467427
 12:34 am on Jun 20, 2012 (gmt 0)

209.85.224.nnn - - "GET /example/ HTTP/1.1" 200 26391 "-" "Mozilla/5.0 (compatible; GoogleDocs; legacyeditor; +http://docs.google.com)"

Will be blocking 209.85.224 unless someone has something good to say about them. Don't like proxies. Project Honey Pot said it was acting like a comment spammer.

###! I've never broken it down beyond a generic "Preview and Translate". Is there a list somewhere that sorts the range into smaller pieces?

:: huge detour here as I discovered I'd got it misflagged as 209.84.0.0/15 when it should only be 209.85.128.0/17, but luckily nothing undesirable ever came from the mislabeled parts ::

I'll be ###. It's all 209.85.224 except a couple of .85.238s for Site Verification and a scattered handful of others. And the 224s in turn are all in the narrower range .224.80-.99 (not .95).

So what do they do with the rest of 85.224, let alone the rest of 85.128-255?
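
The difference between the mislabeled 209.84.0.0/15 and the correct 209.85.128.0/17 can be checked with Python's standard `ipaddress` module. A sketch only; `extra_ranges` is a hypothetical helper, and the sample address in the usage note has a made-up last octet inside the .80-.99 span mentioned above.

```python
import ipaddress

WIDE = ipaddress.ip_network("209.84.0.0/15")     # the mislabeled range
NARROW = ipaddress.ip_network("209.85.128.0/17")  # the intended range


def extra_ranges(wide, narrow):
    """Return the blocks covered by wide but not by narrow, sorted."""
    return sorted(str(net) for net in wide.address_exclude(narrow))
```

`NARROW.subnet_of(WIDE)` is true, so nothing was missed by the wider flag. `extra_ranges(WIDE, NARROW)` lists the parts that were flagged unnecessarily: `209.84.0.0/16` and `209.85.0.0/17`. An address such as 209.85.224.90 falls inside both ranges.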

dupres01




msg:4469285
 3:38 pm on Jun 25, 2012 (gmt 0)

"However, headers are more important in blocking bots than user agents"

please excuse a newbie question, but how do you use headers to block bots?

g1smd




msg:4469287
 3:42 pm on Jun 25, 2012 (gmt 0)

Detect which ones are present, or missing, and send a 403 response in return. Some bots omit or fake certain headers.

lucy24




msg:4469412
 9:06 pm on Jun 25, 2012 (gmt 0)

Inescapable newbie followup: Can you do that in htaccess or is it a PHP Script Thing?

:: fleeing in terror from all those brackets and parentheses ::

incrediBILL




msg:4469741
 5:29 pm on Jun 26, 2012 (gmt 0)

please excuse a newbie question, but how do you use headers to block bots?


Browsers send certain things in headers that most bots do not.

For instance, I can send you a 100% exact user agent string that matches Firefox 13 and you'll happily let it access your site.

However, if you also examine the headers being sent, you'll notice that one or two things Firefox 13 always sends are missing or malformed when a bot faking Firefox 13 makes the request.

Simple to check, simple to block.

Nice cup of 403 forbidden served steaming hot.
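
A rough sketch of the headers-first idea, in Python for illustration. The thread does not say which headers are checked, so the list below is purely an assumption, and `missing_headers` and `response_status` are hypothetical names. Known-good crawlers would need their own handling, which is omitted here.

```python
# Hypothetical list: the thread does not name the headers it checks.
EXPECTED_HEADERS = ("host", "accept", "accept-language", "accept-encoding")


def missing_headers(headers):
    """Return the expected header names that are absent or empty."""
    present = {name.lower() for name, value in headers.items() if value.strip()}
    return [name for name in EXPECTED_HEADERS if name not in present]


def response_status(headers):
    """Check headers first, then the user agent; return 403 or 200."""
    if missing_headers(headers):
        return 403  # cheap rejection before any UA inspection
    user_agent = next(
        (value for name, value in headers.items() if name.lower() == "user-agent"),
        "",
    )
    if user_agent.strip() in ("", "-"):
        return 403
    return 200
```

A request that carries a perfect Firefox user agent string but no Accept, Accept-Language, or Accept-Encoding header gets a 403 here without the UA ever being examined.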

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved