Forum Moderators: open
I've seen workstations with "Trend Micro" AV produce those. One of my friends had the thingy installed and, while visiting one of my sites, got banned for requesting too many pages (the same pages at once) with that UA.
But then again, this is the most popular UA among scrapers after Java, Nutch and libwww-perl...
I will install the trial version and see if it is similar.
Blend27
The user agents with "SV1" and ";1813" are together requesting roughly 3K pages a day at the moment, which is outrageous IMO.
[edited by: incrediBILL at 5:39 am (utc) on May 16, 2008]
It does, however, seem to be something similar - and I assume that if one anti-virus vendor does search result pre-fetching then all others will probably follow (though perhaps not as ineptly).
One difference is that while 1813 is always (for me) part of the user-agent that started this thread, SV1 often appears in longer strings than the one you cite, and on these occasions seems to be from actual human visits.
For example, I had this (and others) from human visitors today:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; InfoPath.1)
And later on this beauty turned up:
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; .NET CLR 1.1.4322; .NET CLR 2.0.50727; SpamBlockerUtility 4.8.4)
So I Googled "SpamBlockerUtility" (on a Mac) and didn't like what I saw.
It seemed to be trying to download something automatically...
[edited by: incrediBILL at 5:42 am (utc) on May 16, 2008]
[edit reason] splicing new thread [/edit]
[added] In my logs, I see that "double UA string" associated with "SV1" but not with AVG or SpamBlocker. I posted this despite the possibility that it might be off-topic, because, well, I'm not yet sure if it is off-topic... [/added]
[added more]
OK, now "SV1" gets *really* interesting:
66.249.84.** - - [15/May/2008:20:08:44 -0700] "GET / HTTP/1.1" 200 31354 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
That looks like someone at the 'plex to me...
[/added]
Jim
[edited by: jdMorgan at 4:03 am (utc) on May 16, 2008]
[edited by: incrediBILL at 5:46 am (utc) on May 16, 2008]
[edit reason] splicing new thread [/edit]
One difference is that while 1813 is always (for me) part of the user-agent that started this thread, SV1 often appears in longer strings than the one you cite, and on these occasions seems to be from actual human visits.
I am talking specifically about
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
and nothing else. SV1 indicates some enhanced IE security and may be part of a regular (human) visitor's UA.
But this one behaves the same as 1813. Only these two have trouble with special characters in my AdWords links; no other UA does.
Finally, I’ll use Trend Micro Internet Security Pro and see if my IP shows up with that UA in the logs. Pro comes with that “extra” anti-phishing protection.
[added]
OK, now "SV1" gets *really* interesting
Regarding SV1 in general, apart from this particular UA that behaves like 1813:
From MSDN:
SV1 - Internet Explorer 6 with enhanced security features (Windows XP SP2 and Windows Server 2003 only).
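Since SV1 can appear both in the bare string and in longer strings from real visitors, it may help to split a UA into its tokens before deciding anything. The following is only a rough sketch: the function names and the category labels are made up for illustration, and the only UA strings used are ones quoted in this thread.

```python
# Hypothetical helper names; the two "bare" strings are the ones discussed in this thread.
BARE_SV1 = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
BARE_1813 = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)"


def ua_tokens(ua):
    """Return the semicolon-separated tokens between the first '(' and the last ')'."""
    start = ua.find("(")
    end = ua.rfind(")")
    if start == -1 or end <= start:
        return []
    inner = ua[start + 1:end]
    # Strip spaces and stray parentheses so a nested "...; SV1)" still yields "SV1".
    return [t.strip(" ()") for t in inner.split(";") if t.strip(" ()")]


def classify(ua):
    """Label a UA as one of the two bare strings, SV1 with extensions, or other."""
    ua = ua.strip()
    if ua == BARE_SV1:
        return "bare-sv1"
    if ua == BARE_1813:
        return "bare-1813"
    if "SV1" in ua_tokens(ua):
        return "sv1-with-extensions"
    return "other"


print(classify("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; "
               ".NET CLR 1.1.4322; InfoPath.1)"))
```

A label of "bare-sv1" says nothing about whether the visitor is human; it only separates the exact string under discussion from the longer ones.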
My nightmare is that pre-scanning search results with dummy UAs is the new norm on Windows, and that as webmasters we now have to learn about how each anti-virus vendor does it - failing which we risk being greylisted or otherwise flagged as potentially unsafe.
Perhaps this may be a worry for a NEW website!
For established websites, with visitors aware of their content and established procedures, these things present nothing more than a UA similar to harvesters.
The few of us here discussing rogue bots hardly share a "concept" of policy that is acceptable to "everybody", much less have the collective clout to bring about uniformity in UAs and bot procedures.
IMO it's 2 issues:
1. Exposing their customers to cloaked malicious sites now that we know who they are and,
2. The practice of pre-screening and pre-fetching pages is abusive and borders on a DDoS as the volume of products with this feature increases.
Both issues need to be addressed with the people writing the software causing those problems.
On the other hand yes, that UA is otherwise a significant scraper and I would love to be able to trap it without trapping legit customers. I don't think the absence of other UA extensions can be taken as an indicator.
There isn't the multiple-page access I would expect from a scraper.
Distributed scrapers only do one or a couple of pages per access. Just like the vulnerability probes, I tend to get single hits per IP, but a lot of IPs are involved.
I'm just speculating here, but since botnets send spam it would make sense for them to harvest email addresses as well. That could account for the large number of IPs involved, and it would not get stopped by your typical bot blocker.
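To see whether a UA shows that "few hits per IP, many IPs" pattern in your own logs, something like the sketch below could be used. It assumes you have already reduced your log to (ip, user_agent) pairs; the function name and the sample addresses (documentation-range IPs) are made up.

```python
from collections import Counter


def ip_spread(entries, target_ua):
    """Summarise how hits for one UA are spread across IPs.

    entries: iterable of (ip, user_agent) pairs taken from an access log.
    """
    hits = Counter(ip for ip, ua in entries if ua == target_ua)
    return {
        "hits": sum(hits.values()),
        "distinct_ips": len(hits),
        "single_hit_ips": sum(1 for n in hits.values() if n == 1),
    }


# Made-up sample data, for illustration only.
UA = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
sample = [
    ("192.0.2.1", UA),
    ("192.0.2.2", UA),
    ("192.0.2.2", UA),
    ("192.0.2.3", "Other/1.0"),
]
print(ip_spread(sample, UA))
```

A high share of single-hit IPs fits both a distributed scraper and a search-result pre-fetcher, so this number alone does not tell the two apart.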
Project Honey Pot claims this UA is used by 7.7% of all spam harvesting bots.
They also claim it is used by 30% of comment spammers (and who am I to argue?).
I would love to be able to trap it without trapping legit customers
Whatever it is, cloaking low-bandwidth content would seem to be safest.
I noticed one hit from this today that was immediately followed by an apparently human visitor from the same IP whose referrer entry was a Google search on my primary keywords - and a Google Desktop entry was duly added before they left.
It seems to walk like a duck round these parts.
"To protect users, TrendProtect tags pages that have not been tested by Trend Micro, including pages that may be safe, as Suspicious."
They are not joking - I Googled two of my sites and both were prominently marked "Suspicious".
There was, however, NO HIT from any user-agent, so I looked at Trend FAQ:
"TrendProtect obtains rating information from rating servers."
So which branch of the secret web police runs the "rating servers" I wonder?
This is a Trend that I for one find alarming.
TrendProtect tags pages that have not been tested by Trend Micro, including pages that may be safe, as Suspicious.
So, found guilty until proven innocent ?
On whose authority ?
I have had Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) banned for a while now, since it seems to be a favorite of the kind of scrapers that use a different IP and a different UA per page fetched. That looks like human traffic, but the timestamps say otherwise, and no CSS or images are fetched.
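The "no CSS nor images fetched" signal can be checked roughly as follows. This is a sketch under assumptions: the input is a list of (ip, path) pairs you have extracted yourself, the extension list is an arbitrary guess at what counts as an asset on your site, and the names are hypothetical.

```python
import posixpath
from collections import defaultdict
from urllib.parse import urlsplit

# Assumed list of asset extensions; adjust to what your pages actually reference.
ASSET_EXTENSIONS = {".css", ".js", ".gif", ".jpg", ".jpeg", ".png", ".ico"}


def is_asset(path):
    """True if the request path (query string ignored) ends in an asset extension."""
    ext = posixpath.splitext(urlsplit(path).path)[1].lower()
    return ext in ASSET_EXTENSIONS


def pages_without_assets(requests):
    """Return sorted IPs that requested at least one page but no assets at all."""
    pages = defaultdict(int)
    assets = defaultdict(int)
    for ip, path in requests:
        if is_asset(path):
            assets[ip] += 1
        else:
            pages[ip] += 1
    return sorted(ip for ip in pages if assets.get(ip, 0) == 0)


# Made-up sample requests, for illustration only.
sample = [
    ("192.0.2.1", "/"),
    ("192.0.2.1", "/style.css?v=2"),
    ("192.0.2.2", "/page.html"),
]
print(pages_without_assets(sample))
```

A returning visitor with a warm browser cache may also fetch a page without its assets, so treat the result as a hint rather than proof.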
Legit browsers use proper headers
So what do "rating servers" use?
And who can afford to block them?
As with AVG LinkScanner, the "SV1" user-agent appears to me to be pre-fetching results for searches conducted by real humans (at least in some cases), and if you block it that is naturally the last of your bandwidth they will use - because your site may well be flagged in their SERPs as "Suspicious".
As for associating it with Trend Micro, I would say the jury is still out - I have never blocked "SV1", but TrendProtect still has my sites blacklisted, and I don't know why or how to change it.
All my 403s will need to be re-examined under this "New Order".
[edited by: Samizdata at 10:49 pm (utc) on May 17, 2008]
My tests with TrendProtect eventually produced a visit from this user-agent:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
It took my index page (status 200) but no CSS, javascript or images.
It came from an IP registered to Trend Micro.
They still libel the site as "suspicious" though.
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
4,935,377 hits in May alone causing several GB of traffic.
Most of the requests are for two non-existent JavaScript files. The URL that is requested is really weird. Those two alone have been requested 2,489,626 times, over and over again from different IP addresses, causing 2 1/2 million 404 errors.
How does this AVG toolbar work anyway? I have blocked the User Agent for the time being. The IP changes every two or three hours. Do the requests come from the toolbars or from an AVG server? I checked the IPs: there were some German Telekom IPs and some from Austria.
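To get per-URL counts like the ones above for just these two UAs, a quick tally over the raw log could look like this. It is a sketch that assumes Apache combined log format with no escaped quotes inside the fields, and the path in the sample lines is a placeholder, not the URL from the post.

```python
from collections import Counter

PREFETCH_UAS = frozenset({
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)",
})


def count_404s(lines, user_agents=PREFETCH_UAS):
    """Count 404 responses per requested path for the given user agents."""
    counts = Counter()
    for line in lines:
        # Combined format splits on '"' into:
        # prefix, request, " status size ", referer, " ", user agent, ""
        parts = line.split('"')
        if len(parts) < 6:
            continue
        request_fields = parts[1].split()
        status_fields = parts[2].split()
        if len(request_fields) < 2 or not status_fields:
            continue
        if status_fields[0] == "404" and parts[5] in user_agents:
            counts[request_fields[1]] += 1
    return counts


# Made-up sample lines; "/missing.js" is a placeholder path.
sample = [
    '192.0.2.1 - - [01/May/2008:00:00:01 +0000] "GET /missing.js HTTP/1.1" 404 0 "-" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)"',
    '192.0.2.2 - - [01/May/2008:00:00:02 +0000] "GET / HTTP/1.1" 200 100 "-" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"',
]
print(count_404s(sample).most_common(5))
```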
How does this AVG toolbar work anyway?
To clarify, the UA Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813) is AVG LinkScanner (not the optional toolbar), and it checks results of any searches done on Google/Yahoo/MSN by pre-fetching the listed page and (in some cases) any external JavaScript files.
I have blocked the User Agent for the time being.
I would say that is unwise, as your listing in the SERPs will be flagged in such a way that users will be discouraged from clicking it - better to cloak minimal content as in the example given in the AVG thread at [webmasterworld.com ]
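For anyone serving pages from a Python application rather than from static files, the "cloak minimal content" idea might be sketched as a small WSGI wrapper. This is not the example from the AVG thread; the wrapper name and the stub page are made up, and only the 1813 string is matched by default because the bare SV1 string may also belong to real visitors.

```python
# UA strings that should receive the stub page. Only the AVG LinkScanner string
# is listed by default; adding the bare SV1 string is a judgment call.
PREFETCH_UAS = frozenset({
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)",
})

# Assumed stub page; any small, harmless HTML document would do.
MINIMAL_BODY = b"<html><head><title>OK</title></head><body></body></html>"


def minimal_for_prefetch(app, user_agents=PREFETCH_UAS):
    """Wrap a WSGI app so that listed UAs get a tiny 200 response instead of the page."""
    def wrapper(environ, start_response):
        if environ.get("HTTP_USER_AGENT", "") in user_agents:
            start_response("200 OK", [
                ("Content-Type", "text/html; charset=utf-8"),
                ("Content-Length", str(len(MINIMAL_BODY))),
            ])
            return [MINIMAL_BODY]
        return app(environ, start_response)
    return wrapper
```

The match is exact, so longer UA strings that merely contain "SV1" or "1813" pass through to the real application untouched.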
Do the requests come from the toolbars or from an AVG server?
The requests come via the browser of real humans who are searching on your keywords.
The UA Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) remains unidentified but as noted elsewhere acts in exactly the same way as AVG and is likely to be some other anti-virus package exceeding its capabilities.
It does not appear to be related to Trend Micro.
The requests come via the browser of real humans who are searching on your keywords.
This seems very unlikely in my case. 4 million hits from some antivirus checker, be it Trend Micro or AVG? In two weeks? Requesting the same two non-existent JavaScript files over several days? This looks more like some spider caught in a strange loop.
This seems very unlikely in my case
The behaviour is easily replicated, but you must draw your own conclusions.
Download and install AVG Free 8.0 (with or without the toolbar), restart and use Google/Yahoo/MSN to search on keywords your site ranks for - but do not click the link in the SERPs.
There should be at least one entry from "1813" in your logs from your IP address.
This is easier to test on a site with little or no traffic (which you may not have now but which you might end up with if you block these useless excuses for tools).
209.239.21.zz - - [31/May/2008:02:33:13 +0100] "GET /MyFolder/MyPage.html HTTP/1.1" 301 242 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)"
64.12.117.205 - - [31/May/2008:02:33:17 +0100] "GET /SameFolder/SamePage.html HTTP/1.1" 200 64870 "MyWebsite" "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.0; Windows NT 5.1; FunWebProducts)"
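A pair like the two lines above (a "1813" hit followed seconds later by a browser hit) can be searched for with something like the sketch below. It assumes Apache combined log format; the regex, the function names and the 30-second window are all assumptions, and the pairing is by time only because the two quoted hits come from different IPs.

```python
import re
from datetime import datetime, timedelta

# Assumed pattern for Apache combined log format.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_RE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record


def followed_by_visit(records, prefetch_ua, window=timedelta(seconds=30)):
    """Pair each prefetch hit with the first other-UA hit inside the time window."""
    ordered = sorted(records, key=lambda r: r["time"])
    pairs = []
    for i, record in enumerate(ordered):
        if record["ua"] != prefetch_ua:
            continue
        for later in ordered[i + 1:]:
            if later["time"] - record["time"] > window:
                break
            if later["ua"] != prefetch_ua:
                pairs.append((record, later))
                break
    return pairs


# The two log lines quoted above, unchanged.
lines = [
    '209.239.21.zz - - [31/May/2008:02:33:13 +0100] "GET /MyFolder/MyPage.html HTTP/1.1" '
    '301 242 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)"',
    '64.12.117.205 - - [31/May/2008:02:33:17 +0100] "GET /SameFolder/SamePage.html HTTP/1.1" '
    '200 64870 "MyWebsite" "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.0; Windows NT 5.1; '
    'FunWebProducts)"',
]
records = [r for r in map(parse_line, lines) if r is not None]
for prefetch, visit in followed_by_visit(
        records, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)"):
    print(prefetch["path"], "->", visit["path"], visit["time"] - prefetch["time"])
```

On a busy site, pairing by time alone will throw up coincidental matches, so this is only practical on a low-traffic site or after filtering to a single page.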
-- OS is XP Pro/SP3 with IE 6.0, Office 2003, all latest updates applied. --
None of the .NET Frameworks installed?
--end @smallcompany
Let's think about it for a moment here.
When an average user runs Windows Update (or automatic Windows Update is enabled), doesn't Microsoft automatically send the .NET Frameworks to be installed on the client machines along with the other updates? In my experience that is what happens. I might be totally off on this one, but for this UA, either:
1. the system cannot have the latest updates (illegal copy of XP),
2. the user has no Automatic Updates, or it is disabled,
3. the user has decided not to install the .NET Framework, or
4. the UA is "Genetically Altered" by the user or by software installed on the user's machine.
The IE 7 browser that I am accessing WW with is:
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648)
Where Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; was injected into the UA after installing Trend Micro Internet Security 2008. I uninstalled it last night but the UA remained the same :(
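A "double UA" of this kind, with one complete Mozilla string injected inside another, can be picked out of a log with a regular expression. The pattern and function name below are just one possible sketch, tested only against the strings quoted in this thread.

```python
import re

# Looks for a complete "Mozilla/x.y (...)" group nested inside an outer "(".
EMBEDDED_UA_RE = re.compile(r"\(.*?(Mozilla/\d+\.\d+ \([^()]*\))")


def embedded_ua(ua):
    """Return the UA string nested inside another UA, or None if there is none."""
    match = EMBEDDED_UA_RE.search(ua)
    return match.group(1) if match else None


double_ua = (
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; "
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; "
    ".NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648)"
)
print(embedded_ua(double_ua))
```

For an ordinary single UA the function returns None, so it can be used as a simple filter over the UA column of a log.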
I installed the latest AVG free trial on one of my computers and searched for some terms that my site ranks highly for, without clicking on any of the search results. Each time, the pages that appeared in the SERPs came up in my logs, and each time the UA was Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1). Therefore, yes, at least some of these spurious log entries we are getting are down to AVG.