
Search Engine Spider and User Agent Identification Forum

This 50 message thread spans 2 pages.
Fake user agent strings
How to recognize...
grandma genie




msg:4462693
 10:21 pm on Jun 7, 2012 (gmt 0)

Hello,

Is it important for a webmaster to be able to recognize a fake user agent? And if it is, how does one do it? User agents come in all shapes and sizes. Some, like the fake Googlebots, are easy to recognize, but what about those really long ones? What do you think of this one?

Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; FunWebProducts; GTB7.3; .NET CLR 1.1.4322; FunWebProducts)

Just the duplicate FunWebProducts was odd. But the visitor's behavior was normal. Here is the IP: 79.74.80.nn

Here's a long one: Mozilla/4.0 (compatible; MSIE 8.0; AOL 9.6; AOLBuild 4340.5002; Windows NT 6.1; WOW64; Trident/4.0; GTB7.3; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; BRI/2; MAGW; InfoPath.3; .NET4.0C)

What is this?:
SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)

What types of things should we be looking for that would stand out as a potential threat? Should all the components of a ua be in a certain order? Can they be in any order? What difference does it make?

--grandma

 

wilderness




msg:4462780
 3:26 am on Jun 8, 2012 (gmt 0)

gg,
IMO the more keywords a UA contains, the better (with some length limits). More words allow crisper honing on multiple conditions (IP & UA) to lessen the innocents ;)

I'm not aware of a list published which provides insight into what should be present per an operating system, although one may exist.

The most common OS-UA restrictions are based upon the versions of MSIE and their compatibility with either "Windows NT" or another MS OS.

Firefox generally has clean, short UAs.

The latest Chrome UAs are fairly consistent, although there may be some variation by OS.

Linux, Android and a few other OS are certainly easy to focus upon.

The mobile devices (per your Samsung example) number in the thousands.

Not much help in the way of what you were looking for.

lucy24




msg:4462815
 6:29 am on Jun 8, 2012 (gmt 0)

I've recently had a flurry of something calling itself Safari/7534.48.3. I was all set to dismiss this as a ludicrous fake-- even if they've jumped on the "new release every other week" bandwagon with a vengeance, there have to be limits-- and then I noticed that one of those Version Seven Thousands was myself on my iPad.

Huh. Fancy that.

Besides, the version number is on the left. Oops. Never mind, then.

On the other hand, the Genuinely Impossible Combinations include MSIE 5.5 for Mac, because they stopped at 5.23. And anything claiming to be MSIE on an Intel Mac is fake-- not because it won't run on Intel but because they stopped making it before the Intel existed, so the UA line will always say PowerPC. (I have personally tested this.)
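The "impossible combination" check described above could be sketched roughly as follows. This is only an illustration: the function name and the sample UA strings are made up, and the cutoff (MSIE for Mac ending at 5.23, PowerPC only) is taken from the post rather than verified independently.

```python
import re

# Assumption taken from the post above: MSIE for Mac ended at 5.23 and
# was only ever built for PowerPC machines.
LAST_MAC_MSIE = 5.23


def impossible_mac_msie(ua):
    """Return True when a UA claims an MSIE/Mac pairing that never shipped."""
    match = re.search(r"MSIE (\d+\.\d+)", ua)
    if not match or "Mac" not in ua:
        return False  # not an MSIE-on-Mac claim, so this check does not apply
    # Compared as a plain decimal, the way the post does (5.5 counts as later than 5.23).
    version = float(match.group(1))
    return version > LAST_MAC_MSIE or "Intel" in ua


# Hypothetical UA strings, for illustration only.
print(impossible_mac_msie("Mozilla/4.0 (compatible; MSIE 5.5; Mac_PowerPC)"))
print(impossible_mac_msie("Mozilla/4.0 (compatible; MSIE 5.23; Mac_PowerPC)"))
```

The first call should report an impossible pairing and the second should not.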

And I recently got bitten by blocking MSIE [1-4] without a closing anchor. Oops, sorry about that, MSIE 10 cutting-edgers.
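The missing-anchor mistake is easy to reproduce. The sketch below uses Python's `re` module instead of .htaccess, and both UA strings are illustrative assumptions. Requiring the dot after the major version is one possible fix, not necessarily the one lucy24 used.

```python
import re

UNANCHORED = re.compile(r"MSIE [1-4]")    # also matches the "1" in "MSIE 10"
ANCHORED = re.compile(r"MSIE [1-4]\.")    # the major version must be a single digit


def blocked(pattern, ua):
    """Return True if the pattern would block this UA."""
    return pattern.search(ua) is not None


# Illustrative UA strings (assumed, not taken from real logs).
IE10 = "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)"
IE4 = "Mozilla/4.0 (compatible; MSIE 4.01; Windows 98)"

print(blocked(UNANCHORED, IE10), blocked(ANCHORED, IE10), blocked(ANCHORED, IE4))
```

With the unanchored pattern the MSIE 10 visitor is caught along with MSIE 1 through 4. With the trailing `\.` only the single-digit versions match.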

grandma genie




msg:4463029
 5:06 pm on Jun 8, 2012 (gmt 0)

Thank you for the helpful info. Some things I have noted are:

The ua's with the [en] on the end:
"Mozilla/4.0 (compatible; MSIE 6.0; MSIE 5.5; Windows NT 5.0) Opera 7.02 Bork-edition [en]"
Ukraine IP: 193.106.136.nn

Here is the same visitor with multiple UA's.

193.106.136.nn - - "Mozilla/4.0 (compatible; MSIE 5.0; Windows 95) Opera 6.01 [en]"
193.106.136.nn - - "Mozilla/4.79 [en] (Windows NT 5.0; U)"
193.106.136.nn - - "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; T312461)"

What about the Mozilla numbers? Most of the ones I see are Mozilla/5.0. A few are Mozilla/4.0.
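One way to spot a visitor that rotates UAs, as in the log sample above, is to count distinct UA strings per IP. This is a minimal sketch: the function names and the threshold of three are assumptions, and the second IP in the sample is a made-up documentation address.

```python
from collections import defaultdict


def uas_per_ip(entries):
    """Map each IP to the set of distinct UA strings it sent."""
    seen = defaultdict(set)
    for ip, ua in entries:
        seen[ip].add(ua)
    return seen


def rotating_ips(entries, threshold=3):
    """Return the IPs that sent at least `threshold` different UAs (assumed cutoff)."""
    return sorted(ip for ip, uas in uas_per_ip(entries).items() if len(uas) >= threshold)


# The first three entries come from the post; the last one is hypothetical.
SAMPLE = [
    ("193.106.136.nn", "Mozilla/4.0 (compatible; MSIE 5.0; Windows 95) Opera 6.01 [en]"),
    ("193.106.136.nn", "Mozilla/4.79 [en] (Windows NT 5.0; U)"),
    ("193.106.136.nn", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; T312461)"),
    ("198.51.100.7", "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"),
]

print(rotating_ips(SAMPLE))
```

A shared proxy or NAT can legitimately produce several UAs from one address, so a result like this is a hint rather than proof.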

Leosghost




msg:4463054
 5:47 pm on Jun 8, 2012 (gmt 0)

FunWebProducts means it is spyware / scumware on the visitor's machine ..won't do your site any harm ..just means some user clicked on "would you like smilies with that" ( or similar ) and now has popups all over their machine and sees ads that don't exist for others..

And their machine spends more time talking to FunWebProducts' owners ( Iwon is / was their name ) than to anyone else..search "funwebproducts" and you'll find loads of information about it ..a lot of which has come from this very same spiders forum over the years..

dstiles




msg:4463135
 9:16 pm on Jun 8, 2012 (gmt 0)

Grandma - my take is this:

Anything before MSIE 6 is fake (or otherwise rejectable) UNLESS it is one of the very few mobile UAs - best dump them all now that mobile devices are more "modern" with different UAs. MSIE 6 is obsolete and many can be dumped - although beware: obsolete Windows 2000 cannot support anything later than MSIE 6 (which does not worry me because we use firefox anyway and I personally use Linux for browsing (and most other things)).

Anything Opera before V9 is suspect (can't comment further as I rarely use it and then on linux).

The moz/4.79 looks like Netscape, which I don't think anyone uses now. Your example purports to be running under Windows 2000 (NT 5.0). In general, I would dump it.

bork is rejectable in my opinion.

Multiple MSIE versions in a single UA are usually due to bad updating but I would kill them as being too old - most recent UAs seem much cleaner unless some stupid "toolbar" has been installed.

funwebproducts should be eradicated but sadly I still see a lot of "good" visitors. I'm tempted to tell visitors to the web page how stupid they are - but then, I would like to do that for google UAs and toolbars as well. In all, I think it's their funeral (and likely to be so with funwebproducts) but not usually a bot (except where the UA rotates).

There are a couple of useful sites for determining the more popular UAs:

user-agents.org (bots and browsers)
zytrax.com (mostly browsers)
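dstiles's rules of thumb above could be turned into a rough flagging function along these lines. It is a sketch only: the function name and the reason labels are invented here, and the thresholds (MSIE before 6, Opera before 9, "bork", more than one MSIE token) are simply the opinions stated in the post.

```python
import re


def suspicious_reasons(ua):
    """Return a list of reasons a UA trips the rules of thumb from the post."""
    reasons = []
    msie_versions = [float(v) for v in re.findall(r"MSIE (\d+\.\d+)", ua)]
    if len(msie_versions) > 1:
        reasons.append("multiple MSIE versions")
    if msie_versions and min(msie_versions) < 6.0:
        reasons.append("MSIE before 6")
    opera = re.search(r"Opera[ /](\d+)", ua)
    if opera and int(opera.group(1)) < 9:
        reasons.append("Opera before 9")
    if "bork" in ua.lower():
        reasons.append("bork")
    return reasons


# Both UA strings appear elsewhere in this thread.
print(suspicious_reasons(
    "Mozilla/4.0 (compatible; MSIE 6.0; MSIE 5.5; Windows NT 5.0) Opera 7.02 Bork-edition [en]"
))
print(suspicious_reasons(
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
))
```

Returning reasons instead of a yes/no answer makes it easier to decide which rules to act on and which to merely log.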

lucy24




msg:4463147
 9:59 pm on Jun 8, 2012 (gmt 0)

If that's an actual log sample the UA question becomes academic because I'm sure everyone recognized

193.106.136.0/22

on sight as My Ukrainians (terminology may vary). So they can simply be blocked by IP.

Very-very short UAs are almost bound to be robots, although the bare "Mozilla 4.0" also performs some legitimate human-adjunct function. (I forget what. wilderness or someone like him mentioned it once.)

grandma genie




msg:4463199
 12:37 am on Jun 9, 2012 (gmt 0)

Leosghost - I know Funwebproducts is spyware, but what I thought was odd was it appearing twice in the ua string.

lucy24 - I got so desperate about ridding myself of the 193s that I blocked the whole thing: deny from 193. I really only want visits from the US and Canada, so if I block the rest of the world it won't matter - especially China, though they may own us soon.

dstiles - Thanks for the URLs. Will check them out.

Was hoping to be able to recognize suspicious activity by the ua, but lately the scrapers seem to be using perfectly legitimate uas. I had one visitor today spending 45 minutes looking at all my NEW products, 45 pages worth of stuff. I can't believe anyone would do that, unless they are spending time in a federal prison and have nothing better to do. Maybe a teenager in detention.

Here's their ua:
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)

This type of activity happens every day. Scrape, scrape, scrape.

I do have a sizable listing of user_agent blocks in htaccess. But I can't block legitimate uas. So what does one do to hinder that type of activity?

wilderness




msg:4463222
 1:36 am on Jun 9, 2012 (gmt 0)

lucy24 - I got so desperate about ridding myself of the 193s that I blocked the whole thing: deny from 193. I really only want visits from the US and Canada, so if I block the rest of the world it won't matter - especially China, though they may own us soon.


gg,
that Class A is a tradition here ;)
Try a search on "mighty mean marie".

There was a gal named Marie and she suggested that, rather than dinking around with the various 193 ranges, why not simply deny the entire Class A.

This was perhaps ten years ago and seemed rather forward at the time. At least in the SSID forum.

lucy24




msg:4463223
 1:38 am on Jun 9, 2012 (gmt 0)

I had one visitor today spending 45 minutes looking at all my NEW products, 45 pages worth of stuff. I can't believe anyone would do that, unless they are spending time in a federal prison and have nothing better to do. Maybe a teenager in detention.

Was each page load accompanied by instant loading of all adjunct files such as images and css, including a single load of the favicon? Paradoxically, some types of loads happen faster with humans: pages will come at least a few seconds apart, but images load up as fast as your server can dish 'em out. Robots otoh go through at a steady pace. If they get images/css/js at all, they will load with the same time interval as pages.

Unfortunately, you can only tell this after the fact, because either way, the page comes first.
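The timing difference described above (a browser pulls images and CSS in a burst right after each page, a robot fetches everything at a steady pace) could be checked after the fact roughly like this. The extension list, the two-second burst window, and the sample timings are all assumptions.

```python
import statistics

# Assumed list of "adjunct" file types; adjust to the site.
ASSET_EXTS = (".css", ".js", ".png", ".jpg", ".gif", ".ico")


def asset_gaps(requests):
    """Seconds between each asset request and the page request before it.

    `requests` is a time-ordered list of (timestamp_seconds, path) tuples.
    """
    gaps = []
    last_page = None
    for ts, path in requests:
        if path.lower().endswith(ASSET_EXTS):
            if last_page is not None:
                gaps.append(ts - last_page)
        else:
            last_page = ts
    return gaps


def looks_like_browser(requests, burst_seconds=2.0):
    """True when assets typically arrive within `burst_seconds` of their page."""
    gaps = asset_gaps(requests)
    if not gaps:
        return False  # no assets at all; cached assets can also cause this
    return statistics.median(gaps) <= burst_seconds


# Hypothetical visits: a page a minute with instant assets vs. a steady crawl.
HUMAN = [(0, "/new.html"), (0.2, "/s.css"), (0.4, "/a.jpg"),
         (60, "/new2.html"), (60.1, "/s.css"), (60.3, "/b.jpg")]
ROBOT = [(0, "/new.html"), (10, "/s.css"), (20, "/a.jpg"),
         (30, "/new2.html"), (40, "/b.jpg")]

print(looks_like_browser(HUMAN), looks_like_browser(ROBOT))
```

As the post says, this only works in hindsight, because the page request always arrives before the evidence does.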

If all you care about is US and Canada, you can pretty well go by IANA /8 blocks.

:: shuffling papers ::

23-24
50
63-76
96-100
104
107-108
184
199
204-209
216

minus about half of that list for internal corporate IPs, and then there's the vexing problem of Early Registrations (most of the 128-172 range) which could be anywhere. Conversely, there's your Brits and Australians and so on, who are scattered throughout RIPE and APNIC. If you personally know someone at 202 or 203 it's all over, because that's the realm of /24 country blocks.

And then you have to start all over again with the 6-piece blocks...
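The /8 list above can be expressed as a simple first-octet lookup. This sketch copies the ranges exactly as posted in 2012, and allocations may have changed since. It ignores the caveats about corporate blocks and early registrations, does not handle IPv6, and the function names are invented.

```python
# First-octet ranges copied from the post (inclusive); treat as a 2012 snapshot.
LISTED_SLASH8 = [
    (23, 24), (50, 50), (63, 76), (96, 100), (104, 104),
    (107, 108), (184, 184), (199, 199), (204, 209), (216, 216),
]


def first_octet(ip):
    """Return the first octet of a dotted IPv4 address as an int."""
    return int(ip.split(".", 1)[0])


def in_listed_block(ip):
    """True when the address falls in one of the /8 blocks listed in the post."""
    octet = first_octet(ip)
    return any(low <= octet <= high for low, high in LISTED_SLASH8)


# Made-up addresses in two /8 blocks that appear in this thread.
print(in_listed_block("209.235.199.1"), in_listed_block("193.106.136.1"))
```

A lookup like this is coarse by design, so the "minus about half" and "could be anywhere" warnings in the post still apply to whatever it lets through.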

wilderness




msg:4463225
 1:41 am on Jun 9, 2012 (gmt 0)

here ya are gg.

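# Deny requests whose first octet falls in these ranges (a starter list).
RewriteCond %{REMOTE_ADDR} ^11[0-9]\. [OR]
RewriteCond %{REMOTE_ADDR} ^12[1-6]\. [OR]
RewriteCond %{REMOTE_ADDR} ^8[0-9]\. [OR]
RewriteCond %{REMOTE_ADDR} ^9[0-5]\. [OR]
RewriteCond %{REMOTE_ADDR} ^17[5-9]\. [OR]
RewriteCond %{REMOTE_ADDR} ^18[0-35-9]\. [OR]
RewriteCond %{REMOTE_ADDR} ^19[01]\. [OR]
RewriteCond %{REMOTE_ADDR} ^20[01]\.
# Assumed completion: the post stopped at the conditions, so a rule is needed to act on them.
RewriteRule .* - [F]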
at least for beginners
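Before deploying patterns like the ones above, it can help to test which addresses they actually catch. The sketch below runs the same regular expressions through Python's `re` module, which behaves the same as Apache's regex engine for patterns this simple. The helper name and the test addresses are assumptions.

```python
import re

# The eight patterns from the post, unchanged.
DENY_PATTERNS = [
    r"^11[0-9]\.", r"^12[1-6]\.", r"^8[0-9]\.", r"^9[0-5]\.",
    r"^17[5-9]\.", r"^18[0-35-9]\.", r"^19[01]\.", r"^20[01]\.",
]
DENY = [re.compile(p) for p in DENY_PATTERNS]


def is_denied(ip):
    """True when any of the posted patterns matches the address."""
    return any(p.search(ip) for p in DENY)


# Made-up addresses chosen to probe the edges of the ranges.
for ip in ("183.1.2.3", "184.1.2.3", "193.106.136.1", "8.8.8.8"):
    print(ip, is_denied(ip))
```

Two things stand out when running it: 184 is deliberately skipped by `18[0-35-9]`, and 193 is not covered at all, so the range discussed earlier in the thread would still need its own entry.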

dstiles




msg:4463462
 9:48 pm on Jun 9, 2012 (gmt 0)

Blocking the complete 193/8 range here would be a bit counter-productive, says this UK lad. :)

wilderness




msg:4463486
 11:51 pm on Jun 9, 2012 (gmt 0)

Blocking the complete 193/8 range here would be a bit counter-productive, says this UK lad.


I thought that continent sunk decades ago ;)

grandma genie




msg:4463516
 4:42 am on Jun 10, 2012 (gmt 0)

I think mighty mean marie had the right idea. I've taken a heavy hand on IP ranges. Would it make more sense to just allow the ranges I want and block the rest?

Lucy, that visitor just clicked on the New Products link and that will bring up page after page of new products all the way to page 45, so they were acting like a person. It was about a page a minute. Is that too fast for a person?

What I would like to do is only allow USA/Canada IPs and get a script to hinder scrapers. That should at least help slow the stinkers down. Is there a script available for novices that just prevents visitors from grabbing too much too fast?

Don, thanks for the list. That will make my list in htaccess a lot shorter, which is the goal.

dstiles, I don't want to block you and I really don't like to have to block at all. I must say I am seeing less of the code injection attacks in the server logs. Most of the stuff I am seeing lately is just scraping. So far the only way I can see to identify a scraper is by the speed with which they scrape. Too fast for a casual visitor.

I guess I could make it so anyone who visits my site has to register to get in. Has anyone tried that?
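On the question above about a script that keeps visitors from grabbing too much too fast: the usual idea is a per-IP request counter over a sliding time window. The sketch below shows only the counting logic. The class name and the limits are assumptions, and wiring it into a real site (and deciding what to serve to a blocked visitor) is left out.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per IP within any `window_seconds` span."""

    def __init__(self, max_requests=30, window_seconds=60.0):
        # Both defaults are arbitrary placeholders, not recommendations.
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)

    def allow(self, ip, now):
        """Record a request at time `now` (seconds) and say whether to serve it."""
        hits = self._hits[ip]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10)
print([limiter.allow("203.0.113.9", t) for t in (0, 1, 2, 3, 10)])
```

A limiter like this only catches fast scrapers. The page-a-minute visitor described in this thread would pass any limit loose enough for real customers.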

dstiles




msg:4463681
 8:14 pm on Jun 10, 2012 (gmt 0)

Wilderness - I think even dyed-in-the-wool patriotic Englishmen would suggest that "continent" was a little grandiose for our little group of islands. :)

Grandma - I hate to say this but here in UK one of the biggest scraping countries (and spam sources) is USA. :)

lucy24




msg:4463696
 9:13 pm on Jun 10, 2012 (gmt 0)

It was about a page a minute. Is that too fast for a person?

Not physically. The key question is whether all the subsidiary stuff was loaded up more-or-less-instantly with each new page load. I don't pretend to understand the people who go hippity-hopping from one page to the next, spending 15 seconds on each; it's the non-page material that gives the information.

I think even dyed-in-the-wool patriotic Englishmen would suggest that "continent" was a little grandiose for our little group of islands

Wasn't it the Economist that won lasting fame by saying that the Chunnel would connect France to "the British mainland"?

I guess I could make it so anyone who visits my site has to register to get in. Has anyone tried that?

It will only work if the site has content that people want really, really badly. And you'll lose some users-- probably including a fair number of WebmasterWorld readers ;) --who flatly refuse to give a site any information of any kind.

Leosghost




msg:4463701
 9:23 pm on Jun 10, 2012 (gmt 0)

IIRC grandma genie, your site is ecommerce ? ..mandatory registration would kill sales on such a site..

How long someone spends on each page of your site depends on what is on each of the pages..and how fast your visitor reads/ takes in information..IMO 60 seconds is not a long time to spend on a page if the page is selling a product, or has less than 1000 words of text..

Restricting access via Geo IP is a "valid" solution if you don't intend shipping to the whole world..

grandma genie




msg:4463933
 2:43 pm on Jun 11, 2012 (gmt 0)

Personally, I think these scrapers are part of a botnet that can use any IP and appear to be from anywhere. I also think many of the bots/people who come to my site are looking for pictures. Just about every picture I've ever uploaded ends up on some blog or forum. But if they are hotlinked, most people can't see them anyway.

OK, forget the registering. Is it possible to make it so you can only access the site by typing in that sequence of letters/numbers that I see on some sites? That way you wouldn't have to input any personal info. Would that stop a bot? Would that stop a customer?

Yes, I have an ecommerce site. The index is by product category in alphabetical order. There are some visitors who come on the site and bounce around in various categories, and then there are the ones who just go through each category, one after the other, page by page. I assume those are bots, but there is no indication by IP that it is. Could just be a compromised machine. But who, what organization, would do such a thing? Could those doing this type of activity be producing a database, making some kind of comparison of what they sell to what I sell, pricing info? And who is my biggest competitor? A-m-a-z-o-n. Aren't they everyone's biggest competitor?

You know, I just typed my URL into Yahoo's search engine, and guess who came up on the first page of results. Yep, the Big A. And it wasn't an ad. It was the fifth listing.

wilderness




msg:4463951
 3:17 pm on Jun 11, 2012 (gmt 0)

Personally, I think these scrapers are part of a botnet that can use any IP and appear to be from anywhere. I also think many of the bots/people who come to my site are looking for pictures. Just about every picture I've ever uploaded ends up on some blog or forum. But if they are hotlinked, most people can't see them anyway.


gg,
I was fortunate to realize early on that using "names" for images is a bad practice (unless you're a webmaster selling photos).
I used an image in a page file that brought hordes of visitors (searching and linking) who went directly to that image and had no interest in the other content of my websites.

My solution was to number all of my images (that practice has been in place for more than a decade) and place all my images within directories that are excluded in robots.txt.
Thus, on the few occasions when something (person or bot) starts crawling those excluded directories ALONE, its activity sticks out like a sore thumb.

It's a little extra work to keep my images numbered and within a spreadsheet that references their source, however it's worth the extra time.

Don
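The numbering scheme described above, with a spreadsheet that maps each number back to its source, might look something like this in script form. Everything here is an assumption: the five-digit naming, the CSV column names, and the function names. The sketch only builds the mapping and does not rename any files.

```python
import csv
import os


def number_images(names, start=1):
    """Return (old_name, new_name) pairs, keeping each file's extension."""
    pairs = []
    for offset, name in enumerate(sorted(names)):
        ext = os.path.splitext(name)[1].lower()
        pairs.append((name, f"{start + offset:05d}{ext}"))
    return pairs


def write_mapping(pairs, csv_path):
    """Save the old-to-new mapping so an image can be traced back later."""
    with open(csv_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["original_name", "numbered_name"])
        writer.writerows(pairs)


# File names borrowed from examples elsewhere in this thread.
print(number_images(["PumpkinFood.JPG", "DixieTeddy.jpg"]))
```

Doing the renames one directory at a time, as suggested later in the thread, would just mean calling this once per directory with a different `start` value.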

keyplyr




msg:4464001
 4:29 pm on Jun 11, 2012 (gmt 0)


using "names" for images is a bad practice

Disagree. It's what the HTML standard recommends.

As far as unwanted traffic? LOL, there is none. Up to you to turn those into customers.

wilderness




msg:4464075
 6:39 pm on Jun 11, 2012 (gmt 0)

Disagree. It's what the HTML standard recommends.


keyplyr,
We'll need to agree to disagree on this issue.

I'm sure HTML standards also advise allowing caching of web pages, which I also do not.

Just today I had some Oceanic visitors (denied range) looking for a particular image.
Since they couldn't gain access, two hours later here comes the same request from a local university's IP range, which is also denied.

I've approximately 4,000 images on one website. Some pages have fifty or more thumbnails.
To allow mass access to all these images would be absurd.
Redirecting same visitors to a presentation of a fee-based possibility would also be absurd, at least for widget folks.

Seb7




msg:4464253
 3:27 am on Jun 12, 2012 (gmt 0)

I can confirm 'funwebproducts' are real users. Been around for some years now.

lucy24




msg:4464278
 4:49 am on Jun 12, 2012 (gmt 0)

using "names" for images is a bad practice

So name them in Basque or some language the ordinary robot doesn't know. Then you don't have to go pull up your database every time you need that picture of Jake swilling the magic potion. I would have suggested one of those Slavic languages with all the consonants, but our Ukrainians would probably make short work of those.

Besides, names like "DixieTeddy" or "SusannaDoor" or "PumpkinFood" or "FunnyFace" * convey much less information to the robot than it thinks.


* Actual names, or approximations thereof. I'm just pulling examples at random.

grandma genie




msg:4464283
 5:02 am on Jun 12, 2012 (gmt 0)

Can you imagine taking 4000 images and changing the names to numbers? Don, you were smart to begin that up front. I suppose I could do that one directory at a time. I think using a different language would be harder. How about pig latin?

alexwood




msg:4464287
 5:39 am on Jun 12, 2012 (gmt 0)

When a software agent operates in a network protocol, it often identifies itself, its application type, operating system, software vendor, or software revision, by submitting a characteristic identification string to its operating peer

wilderness




msg:4464291
 6:02 am on Jun 12, 2012 (gmt 0)

Then you don't have to go pull up your database every time you need that picture of Jake swilling the magic potion.


lucy,
FWIW, a 4,000 line spreadsheet is rather small.
The text descriptions in the spreadsheet are duplicated from the original photo.
The only time locating an image creates an issue is in the event that I'm adding an already existing image to an additional web page. Or at least for verifying that I'm not duplicating images on a new page.

As gg said, since I implemented this procedure very early on, it is no big deal.

wilderness




msg:4464292
 6:06 am on Jun 12, 2012 (gmt 0)

When a software agent operates in a network protocol, it often identifies itself, its application type, operating system, software vendor, or software revision, by submitting a characteristic identification string to its operating peer


Are you referring to "headers" or "packets" or something entirely different?

Where does this fit into this discussion? Not that I'm pushy, rather just curious if you're expressing something that is otherwise not apparent and on topic?

grandma genie




msg:4465033
 5:35 pm on Jun 13, 2012 (gmt 0)

Since we are talking about fake user agents, is this one?

209.235.199.nnn - - "GET /example.jpg HTTP/1.1" 404 - "-" "Mozilla/5.0 (compatible;netTrekker-Link-Checker-AAMS/1.0)"

The IP is from INETU, managed cloud hosting.

Has anyone seen this one in their logs?

g1smd




msg:4465042
 5:42 pm on Jun 13, 2012 (gmt 0)

It takes very little work to find the PDF describing what the netTrekker-Link-Checker (and goAlexandria) is. Ask any students you know (in the US) if they have heard of it.

If the IP range you quoted belongs to them, then that bot is what it says it is, i.e. it might be unwanted but it isn't fake.

The lack of a space between "compatible;" and "netTrekker" in the User-agent string would be enough to prevent it accessing some of the sites I deal with though.
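The spacing rule mentioned above (a semicolon with no space after it) could be tested along these lines. This is only a guess at such a filter, not g1smd's actual rule, and the function name is invented.

```python
import re

# A semicolon immediately followed by a non-space character.
TIGHT_SEMICOLON = re.compile(r";(?=\S)")


def has_tight_semicolon(ua):
    """True when any ';' in the UA is not followed by whitespace."""
    return TIGHT_SEMICOLON.search(ua) is not None


# Both UA strings appear earlier in this thread.
print(has_tight_semicolon("Mozilla/5.0 (compatible;netTrekker-Link-Checker-AAMS/1.0)"))
print(has_tight_semicolon("Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"))
```

Some legitimate agents may also format their strings this way, so it is probably safer to log matches for a while before blocking on them.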

grandma genie




msg:4465056
 6:52 pm on Jun 13, 2012 (gmt 0)

So, netTrekker is a type of software that any school could use, meaning it could appear using a variety of IPs. So it's not checking links, it's telling the student the site is safe to visit. This particular student grabbed 4 of my images. That IP came from the University of Iowa. So in that instance the user agent is valid.

From now on, when I upload new images, their names will not allow for future image snatchers to find them easily. Thanks, Don, for a great idea.

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved