
Search Engine Spider and User Agent Identification Forum

    
LinkScanner, AVG, Trend Micro, 1813 and SV1
Disambiguation of secretive anti-virus tools
Samizdata




msg:3657955
 2:59 am on May 24, 2008 (gmt 0)

This is an attempt to clear up some confusion on these recent threads:
AVG thread: [webmasterworld.com...]
Trend Micro: [webmasterworld.com...]

Most of the following impose themselves on the SERPs and signal whether or not your link is safe to click. Even when they have no information about your site they will flag it in a way that discourages visitors, so in this less-than-brave new world it is important to get their seal of approval.

--

This user-agent is used by Exploit Prevention Labs LinkScanner:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)

It pre-fetches HTML and JavaScript from searches on Google/Yahoo/MSN done by humans - the IP in your logs will be that of the user who has it installed and is searching on your keywords.

The software was recently acquired by Grisoft AVG but is still available for download on CNET, and if you block it you will be discouraging visitors and filling your logs with 403s for no good reason.

--

This user-agent is used by Grisoft AVG 8.0 LinkScanner:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)

As above, a bandwidth-wasting (and easily fooled) pre-fetcher best dealt with by cloaking minimal content (example given by jdMorgan in the AVG thread). When I had it blocked I lost a lot of traffic and on one of my tests it produced an impressive 120 (one hundred and twenty) 403s in 12 seconds - without me even visiting the site.
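
For illustration only, a minimal sketch of that cloaking idea - not jdMorgan's actual rules from the AVG thread - assuming Apache with mod_rewrite enabled and a tiny static page at /minimal.html that you create yourself:

    RewriteEngine On
    # Match the AVG 8.0 LinkScanner signature (note the missing space before 1813)
    RewriteCond %{HTTP_USER_AGENT} ;1813\)$
    # Don't rewrite the minimal page itself, or the rule would loop
    RewriteCond %{REQUEST_URI} !^/minimal\.html$
    RewriteRule .* /minimal.html [L]

The principle is simply to hand the pre-fetcher a harmless page instead of a 403, so visitors are not discouraged and your bandwidth is not wasted on full pages.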

--

This user-agent is used by Trend Micro Internet Security and TrendProtect:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)

Unlike the others it does not pre-fetch SERPs in real time, but can be triggered from the "website authentication" feature of the Internet Security Pro package on demand, and also appears to be doing some general spidering for the Trend Micro "rating server".

Sometimes it comes from the Trend Micro IP range (66.180.80.0 - 66.180.95.255) but more often it comes from the Japan Network Information Center with various IPs in the 150.70.84.xx range - so you can probably expect your site to be classed as "Suspicious" (the outrageous term they use for unknown sites) if you have APNIC blocked.
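
If you want to separate these hits from human traffic without blocking them, a minimal sketch assuming Apache with mod_setenvif and mod_log_config, placed in the server or virtual host configuration (CustomLog is not available in .htaccess); the log file name is hypothetical:

    # 66.180.80.0 - 66.180.95.255 (Trend Micro range)
    SetEnvIf Remote_Addr "^66\.180\.(8[0-9]|9[0-5])\." trend_bot
    # 150.70.84.xx
    SetEnvIf Remote_Addr "^150\.70\.84\." trend_bot
    CustomLog logs/trend_hits.log combined env=trend_bot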

--

This user-agent is used by the Dr.Web plugin for Internet Explorer and Firefox:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Maxthon)

This one pre-fetches HTML and JavaScript, but only on a specific request from the user, and in my tests it always came from 81.176.67.173 (the DrWeb server) as advertised - more reasonable than the others, but like them working for real oxygen-breathing humans.

--

I have not been able to identify where McAfee SiteAdvisor gets its information, but I do have an amusing screenshot of the related Yahoo SearchScan flagging google.com as a purveyor of "Dangerous Downloads".

On all my sites it says "We've tested millions of sites but haven't tested this one yet" - and unless McAfee scans the entire web as frequently as GoogleBot it is presumably worthless.

--

None of the other 18 anti-virus packages I tested currently interfere with search results, but it may only be a matter of time, and if you don't appease them and get flagged as "clean" you may find that your ranking is considered irrelevant.

"Paranoia strikes deep - into your SERPS it will creep"

...

 

blend27




msg:3658174
 3:04 pm on May 24, 2008 (gmt 0)

"serps are dynamic - creepy thouse are"

In Any event - many won't know - that is the scary part.

Ocean10000




msg:3658186
 3:26 pm on May 24, 2008 (gmt 0)

I have noticed in my tests so far that sending them a 403 or 200 with no content will cause these two to continue to retry the URL repeatedly, anywhere from 15 to 30 times.

I am testing sending a 403 with a short error message, and will later test sending a status 200 with the same error message to see if anything different comes up - and to see if my click-throughs from those IPs actually increase or decrease.

jdMorgan




msg:3658189
 3:48 pm on May 24, 2008 (gmt 0)

> sending them a 403 or 200 with no content will cause these two to continue to retry

Which two?

Jim

Samizdata




msg:3658190
 3:51 pm on May 24, 2008 (gmt 0)

One thing I should mention is that the phrase "This user-agent is used by" was carefully chosen, and does not mean that scrapers, comment spammers and other nasties don't use it as well (which in the case of SV1 they clearly do).

What it does mean is that the described behaviour can be replicated by downloading the software.

Umbra




msg:3658266
 6:19 pm on May 24, 2008 (gmt 0)

I have noticed in my tests so far that sending them a 403 or 200 with no content will cause these two to continue to retry the URL repeatedly, anywhere from 15 to 30 times

What happens if an error 500 is returned?

Samizdata




msg:3658274
 6:36 pm on May 24, 2008 (gmt 0)

What happens if an error 500 is returned?

I haven't tested it, but the point with all of these "tools" is that if they have no information (for whatever reason) then they will not flag your site as "clean" and users will naturally be discouraged from visiting it.

If you want to know the definitive answer, download one of them and try it.

Umbra




msg:3658285
 7:06 pm on May 24, 2008 (gmt 0)

I haven't tested it, but the point with all of these "tools" is that if they have no information (for whatever reason) then they will not flag your site as "clean" and users will naturally be discouraged from visiting it.

I do recognize that every webmaster has different priorities; some websites are always in "paranoid mode" and other websites have their gates wide open to every Nutch and libwww-perl out there. That said, when Google Web Accelerator came out, it created (rightly or wrongly) a firestorm of controversy, and various websites, blogs, etc. posted cookie-cutter solutions for blocking GWA. Correct me if I'm wrong, but despite Google's efforts to market GWA, it is now used by a tiny minority of users, and perhaps the reason is that so many webmasters fought back.

So if enough webmasters are fed up with the unwanted noise generated by all these different scanners, then maybe these tools will also go away in time. The Internet IS a dynamic market, new products are tested (and fail) all the time, and I don't see any reason why we must automatically concede to every superfluous scanner.

At least the developers at Grisoft et al. could take a moment to discuss these issues with the webmaster community - ask for our feedback, create something like a robots.txt standard, etc. So far, it seems they have been pointedly ignoring this thread.

Ocean10000




msg:3658297
 7:45 pm on May 24, 2008 (gmt 0)

So far I have sent status codes of 403 and 200 with no content, which has caused them to repeatedly retry the page.

As for scrapers I have sorted most of those out long before I check for these two cases.

I am installing AVG 8 on one pc to test it out directly. And will post my results later.

wilderness




msg:3658304
 7:58 pm on May 24, 2008 (gmt 0)

So if enough webmasters are fed up with the unwanted noise generated by all these different scanners, then maybe these tools will also go away in time. The Internet IS a dynamic market, new products are tested (and fail) all the time, and I don't see any reason why we must automatically concede to every superfluous scanner.

Totally agree!

As an aside, the continuation of this thread in Forum 11 may replace the "Close to Perfect .htaccess" thread as the longest ever.

Samizdata




msg:3658354
 10:02 pm on May 24, 2008 (gmt 0)

maybe these tools will also go away in time

I wish they would, but no matter how much opposition we put up I fear they are here to stay, and the best we can hope for is getting them to modify their behaviour - or getting someone else to do the job properly.

My own primary objection is not about bandwidth (though I appreciate that is also a serious issue) but about the hijacking of the SERPs by companies who have a vested interest in promoting fear, uncertainty and doubt. While Google has long flagged pages that are known to be dangerous, they are otherwise neutral - and they inspect a vast number of URLs daily.

The anti-virus companies take the opposite view - everything is suspect until they have proved it innocent - but those such as McAfee and Trend Micro who rely on a "rating server" seem oblivious to the fact that they would have to check every page on the web every day (at least) if their assessment is to be any use at all, while branding sites they haven't checked as "Suspicious" is as absurd as it is offensive.

Grisoft's approach, which at least checks the evidence before the verdict, seems more reliable on one level, but we know how easy it is to fool their LinkScanner, and the software is clearly deficient in other respects. Unfortunately they have introduced it as a free feature and most of their users will probably see it as a good thing, so the pressure will be on other AV vendors to do something similar to keep up.

Bandwidth, of course, is something that webmasters pay for, and statistics are something they rely on. The Grisoft approach wastes a colossal amount of bandwidth and skews statistics, and the McAfee/Trend approach will do the same if they ever get serious about crawling the web.

Then there is the issue of honesty - like many here I take a dim view of robots crawling my sites while masquerading as something else, and that is something all these services have in common. They may argue that they need to conceal their identity to do their job, but if I can identify them then so can every teenage scammer on the planet.

It seems to me that the only people in a position to accurately evaluate webpages are the search engines. Yahoo already have a tie-in with McAfee and how they exchange information is unclear, but if flagging google.com as a drive-by site is any indication then they are not doing it very well.

A move from Google in this area may well be imminent, if only because the practice of second-guessing their results will surely have a negative effect on their image if it becomes widespread - I seriously doubt that they want to become the "web police", but they may have no option.

...

Ocean10000




msg:3658415
 11:51 pm on May 24, 2008 (gmt 0)

I am installing AVG 8 on one pc to test it out directly. And will post my results later.

After doing a little testing with AVG: it will happily accept a 403 status code as long as HTML content is sent with it, and it shows the nice green check mark next to the URL. In my simple tests it only goes bonkers when it gets no content, no matter what status code is returned.
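
For what it's worth, a minimal sketch of acting on that finding, assuming Apache 2.2 with mod_setenvif and a hypothetical short HTML error page at /403.html that you create yourself:

    # Tag the AVG 8.0 LinkScanner user-agent
    SetEnvIf User-Agent ";1813\)$" avg_linkscanner
    Order Allow,Deny
    Allow from all
    Deny from env=avg_linkscanner
    # Let the error page itself through, so the 403 is served with a body
    <Files "403.html">
        Order Deny,Allow
        Allow from all
    </Files>
    ErrorDocument 403 /403.html

That hands the scanner a 403 that carries content, which - going by the behaviour reported above - should stop the repeated retries.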

wilderness




msg:3658429
 12:19 am on May 25, 2008 (gmt 0)

In my simple tests it only goes bonkers when it gets no content, no matter what status code is returned.

Any clue what kind of mark it provides when the resulting request is a redirect back to their own website ;)

willybfriendly




msg:3658451
 2:47 am on May 25, 2008 (gmt 0)

Any clue what kind of mark it provides when the resulting request is a redirect back to their own website...

Oh, you are cruel.

I like it though! Kind of a DDOS attack by karma...

superclown2




msg:3665541
 7:39 am on Jun 3, 2008 (gmt 0)

Oh dear, I seem to be getting into the habit of jumping to conclusions and having to apologise afterwards. A couple of days ago I wrote:
"I installed the latest AVG free trial on one of my computers, and searched for some terms that my site ranks highly for without clicking on any of the search results. Each time, the pages that appeared in the SERPs came up in my logs, and each time the UA was Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1). Therefore, yes, at least some of these spurious log entries we are getting are down to AVG."

I've now removed the bloatware from the computer in question and re-installed Norton. Just to check up, I googled some of my search terms and clicked on my site; in the subsequent log entry the UA was just the same: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1). The indication, then, is that AVG 8 is not leaving any signature.

The slightly mitigating fact is that when the site is preloaded it is the page only, without the accompanying graphics, CSS etc., so the bandwidth hit is a fraction of what a human visitor would cause. I am going to get round the problem by giving every page a small, unique CSS file with the same name as the page (i.e. a page called blue-widgets will refer to blue-widgets.css) and I will stop my stats programme from showing hits on .html pages. Not an ideal situation, and a pain in the proverbial to set up, but at least by checking the number of hits on the .css files I will be able to tell instantly how many human visitors I am getting, since they will be the only ones tripping them.
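
On the counting side, a minimal sketch of logging only those CSS hits to a separate file, assuming Apache with mod_setenvif and mod_log_config, placed in the server or virtual host configuration (CustomLog is not available in .htaccess); the log file name is hypothetical:

    # Per-page CSS files are only fetched by real browsers rendering the page
    SetEnvIf Request_URI "\.css$" human_hit
    CustomLog logs/human_hits.log combined env=human_hit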

superclown2




msg:3666472
 9:16 am on Jun 4, 2008 (gmt 0)

I have posted this on another thread which seemed relevant, so please forgive the duplication.
I have spoken to a person called Adam at AVG Technologies in the UK, who tells me he feels the company's product is the lesser of two evils, since the disruption to millions of webmasters' stats is justified by the extra safety the product gives to surfers. I have pointed out to him that, with earlier versions at least, it is possible to spoof the pre-fetch, but he commented that the product was still making the web a safer place to visit.

I have brought this thread to his notice so I look forward to hearing his comments here!

[edited by: incrediBILL at 10:07 am (utc) on June 4, 2008]
[edit reason] call to action removed - see tos #26 [/edit]

incrediBILL




msg:3666497
 10:09 am on Jun 4, 2008 (gmt 0)

he feels the company's product is the lesser of two evils, since the disruption to millions of webmasters' stats is justified by the extra safety the product gives to surfers

He's wrong because:

a) They created a DDoS attack on popular sites with lots of bookmarks and high rankings.

b) It's less secure because everyone and his brother now knows how to spoof it - where's the safety now?

Samizdata




msg:3666587
 11:43 am on Jun 4, 2008 (gmt 0)

I don't know what position this person holds at AVG but I assume it is in public relations.

He should know that Grisoft bought a useless product and made it substantially worse.

His customers may feel safer, but they are being deluded - every script kiddie on the planet can fool this fabulous new "security tool" and get their payload pages marked as safe by AVG.

Meanwhile ordinary webmasters are seeing their statistics rendered useless and their bandwidth charges rocketing as this useless pre-fetcher rampages through their sites.

Grisoft may know all about Windows but they appear to know nothing about the web.

Samizdata




msg:3669430
 9:52 pm on Jun 7, 2008 (gmt 0)

Addendum: this user-agent is used by Finjan Secure Browsing (from Finjan Ltd in Israel):

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; InfoPath.1; .NET CLR 1.1.4322)

This pre-fetcher works the same way as AVG's and came from the 82.166.163.xx range.

I would not recommend blocking or cloaking this one by user-agent.

Samizdata




msg:3669770
 12:54 pm on Jun 8, 2008 (gmt 0)

Apologies - upon further testing it appears that Finjan Secure Browsing cunningly uses a wide range of user-agents when pre-fetching your pages, and is obviously a highly sophisticated security tool.

Too bad they always use the same IP address...
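
Since the user-agents here are ordinary browser signatures, the advertised IP range is the only safe handle. A minimal sketch, if you decide to act on it at all, assuming Apache 2.2 mod_authz_host syntax:

    Order Allow,Deny
    Allow from all
    # 82.166.163.xx - the Finjan Secure Browsing range observed above
    Deny from 82.166.163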

superclown2




msg:3671261
 2:49 pm on Jun 10, 2008 (gmt 0)

I have had to query my AdWords account because I'm regularly being charged for more clicks than show in my logs. Today there are four unexplained clicks (it's a niche product, so numbers of clicks aren't high), and one of the visitors via AdWords had also 'pre-fetched' my site four times via the AVG link checker, for a key phrase for which my site ranks 27th in the SERPs. It is possible that the visitor had preferences set to show more than that number of results, but I have asked G to tell me just what these missing clicks are. I hope I get more help this time than the usual library answer...
