Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

This 131 message thread spans 5 pages: < < 131 ( 1 2 3 [4] 5 > >     
MSN fakes referrers
SEOPTI




msg:3875365
 6:14 pm on Mar 20, 2009 (gmt 0)

This has been discussed in 2007:
[webmasterworld.com...]

They're doing it again; I see hundreds of fake visitors from MSN IPs across all of my domains.

Is there any news on what they are trying to accomplish by doing this?

 

GaryK




msg:3959887
 6:50 pm on Jul 26, 2009 (gmt 0)

I'm considering taking it to the taxidermist to have it mounted so I can hang it over my fireplace.

On one of my sites we have a TOS that requires photographic proof of such extraordinary claims! :)

whitelist+rDNS-based access control system

The problem I have with using rDNS in real-time is some bots, like the MS ones, simply hit my sites too quickly and too often. And as a result, repeated rDNS lookups would, I think, bog things down worse than just letting them have at it.

posts:20804

You really need a life, Jim! ;) The only site I have that kind of post count on is my main money site that's been going since 1998.

enigma1




msg:3960138
 8:15 am on Jul 27, 2009 (gmt 0)

The problem I have with using rDNS in real-time is some bots, like the MS ones, simply hit my sites too quickly and too often. And as a result, repeated rDNS lookups would, I think, bog things down worse than just letting them have at it.

You don't have to rDNS all the time. You can do it once for an IP and store the info temporarily. To get around the problems of the msnbots I've set up a whitelist for the IP range.

GaryK




msg:3960389
 4:54 pm on Jul 27, 2009 (gmt 0)

That's a good idea. Thanks. What's a good TTL for that kind of info? 24 hours? Longer?

enigma1




msg:3960881
 8:54 am on Jul 28, 2009 (gmt 0)

I store the info for several days so I don't have to rdns the same ip often. You can use longer periods of time unless you suspect the client will change the dns.

wilderness




msg:3962706
 8:01 pm on Jul 30, 2009 (gmt 0)

I do realize that it's my fault this insignificant MSN UA could not read robots.txt, however consider that other MSN bots coming from two different Class C's and six different Class D's already grabbed robots.txt (all within the previous 2 hours and 40 minutes, and all on the same site).

This UA attempted to add another Class C and two more Class D's.

Utter nonsense.
That robots.txt should be violated (even with later 403s) by direct requests for images.

65.55.106.220 - - [30/Jul/2009:15:36:30 +0100] "GET /robots.txt HTTP/1.1" 200 4858 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.105.22 - - [30/Jul/2009:16:25:29 +0100] "GET /robots.txt HTTP/1.1" 200 4858 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.106.135 - - [30/Jul/2009:16:25:49 +0100] "GET /robots.txt HTTP/1.1" 200 4858 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.144 - - [30/Jul/2009:16:47:49 +0100] "GET /robots.txt HTTP/1.1" 200 4858 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.231.99 - - [30/Jul/2009:16:59:04 +0100] "GET /robots.txt HTTP/1.1" 403 1159 "-" "Mozilla/4.0"
65.55.231.99 - - [30/Jul/2009:16:59:04 +0100] "GET /jib/1136.jpg HTTP/1.1" 403 - "-" "Mozilla/4.0"
65.55.106.139 - - [30/Jul/2009:17:22:13 +0100] "GET /robots.txt HTTP/1.1" 200 4858 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.231.121 - - [30/Jul/2009:17:22:46 +0100] "GET /robots.txt HTTP/1.1" 403 1159 "-" "Mozilla/4.0"
65.55.231.121 - - [30/Jul/2009:17:22:46 +0100] "GET /Dir/SubDir/image01.jpg HTTP/1.1" 403 - "-" "Mozilla/4.0"
65.55.106.159 - - [30/Jul/2009:17:43:01 +0100] "GET /robots.txt HTTP/1.1" 200 4858 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.231.44 - - [30/Jul/2009:18:16:12 +0100] "GET /robots.txt HTTP/1.1" 403 1159 "-" "Mozilla/4.0"
65.55.231.44 - - [30/Jul/2009:18:16:12 +0100] "GET /Dir/SubDir2/Image02.jpg HTTP/1.1" 403 - "-" "Mozilla/4.0"

Don't recall the saying?
Something about "the arms not knowing what the legs are doing" or "the brain not knowing what the head is doing"?

Can't wait till they take on Yahoo ;)

enigma1




msg:3962741
 9:09 pm on Jul 30, 2009 (gmt 0)

Don, have you seen IPs for the MSN bots outside the 65.55 range?

wilderness




msg:3962769
 9:56 pm on Jul 30, 2009 (gmt 0)

You mean besides the referrals that are the topic of this thread and could come from any IP?

131.107. has been going on since 2003, although not nearly as bad as it once was.

2004:
207.46.98.60 - - [30/Aug/2004:07:28:25 -0700] "GET /MyFolder/MyPage.html HTTP/1.0" 200 10097 "-" "msnbot/0.11 (+http://search.msn.com/msnbot.htm)

2005:
207.46.127.166 - - [30/May/2009:23:53:08 +0100] "GET /robots.txt HTTP/1.1" 200 4777 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; livebot-searchsense/0.1; +http://search.msn.com/msnbot.htm)"

If there's anything recent from the 207.46., I don't keep track of it.

wilderness




msg:3963092
 12:58 pm on Jul 31, 2009 (gmt 0)

Coincidence?

207.46.92.17 - - [31/Jul/2009:11:25:04 +0100] "GET /robots.txt HTTP/1.1" 200 4858 "-" "MSRBOT (http://research.microsoft.com/research/sv/msrbot/"

GaryK




msg:3963175
 3:53 pm on Jul 31, 2009 (gmt 0)

Are you suggesting this was intentional? If so, how do they know your site(s)?

BTW, I've been seeing variations on this bot since at least 2000. The one you posted first visited me in October 2007.

MSRBOT
MSRBOT (http://research.microsoft.com/research/sv/msrbot)
MSRBOT (http://research.microsoft.com/research/sv/msrbot/
MSRBOT (http://research.microsoft.com/research/sv/msrbot/)
MSRBOT/0.1
MSRBOT/0.1 (http://research.microsoft.com/research/sv/msrbot/)

wilderness




msg:3963239
 5:29 pm on Jul 31, 2009 (gmt 0)

Are you suggesting this was intentional?

Precisely.

When we began seeing three or more of these net updates, we wondered if the end was ever in sight?

Count seven!

.NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)

GaryK




msg:3963250
 5:40 pm on Jul 31, 2009 (gmt 0)

Precisely.

OK, but intentionally targeting you, or webmasters in general? Cause I don't see how they could be targeting anyone here specifically.

Count seven!

Regarding the .net updates, I need to craft a database query to see what the maximum number of updates I've seen within one UA string is. Be back after lunch and grocery shopping.

wilderness




msg:3963254
 5:53 pm on Jul 31, 2009 (gmt 0)

OK, but intentionally targeting you, or webmasters in general? Cause I don't see how they could be targeting anyone here specifically.

Gary,
Once upon a time this forum was quite active and un-moderated.

Participants could add information and many other participants (lurkers; good and bad) were able to take immediate action.

Somewhere, I've a link to a thread saved in which I offered an example of a bot and within minutes, the bot (lurking in this forum) mass crawled (or at least attempted) one of my sites, based on the criteria I had explained.

Thus it's not so far-fetched to assume that MS has the capability to "play Shamus" the same as you or I.
To assume otherwise, IMO, is simply naive. Whether anybody at MS actually has the time for such "shamus" antics is another issue.

(This is NOT meant to "dis Bill" as he's doing a wonderful job of quickly approving the moderated submissions).

jdMorgan




msg:3963261
 6:02 pm on Jul 31, 2009 (gmt 0)

Targeting? No one's being targeted by anybody. This is just Microsoft updating the .NET runtime library and being very sloppy about the updates, in that they really should use a backwards-compatibility system that doesn't require every single update to appear in the UA, just in case a .NET server needs to check version compatibility by looking at the UA string.

This is akin to having Firefox publish its user-agent string like
"Mozilla/5.0 blah-blah-OS-blah-blah Phoenix/0.1 Phoenix/0.2, Phoenix/0.3, Firebird/0.4, Firefox/0.0.1, Firefox/0.0.2, Firefox/0.0.3," etc., etc., etc. -- At some point, you need to do a roll-up, enforce compatibility, and quit publishing an ever-lengthening UA string.

Jim

wilderness




msg:3963268
 6:12 pm on Jul 31, 2009 (gmt 0)

Targeting? No one's being targeted by anybody. This is just Microsoft updating the .NET runtime library and being very sloppy about the updates, in that they really should use a backwards-compatibility system that doesn't require every single update to appear in the UA, just in case a .NET server needs to check version compatibility by looking at the UA string.

Jim,
My apologies for the confusion.
Gary and I were discussing the possible likelihood of the 207.46. Research bot being targeted after I had mentioned the Class B previously.

I added the .NET update UAs as a footnote.

Don

GaryK




msg:3963269
 6:14 pm on Jul 31, 2009 (gmt 0)

I've been here since the "once upon a time" days of this forum before we got littleman as our first moderator sometime in 2000. I posted under a different name back then. I think it was Trail Blazer. I recall what it was like. But I have trouble believing anyone, much less MS, would bother to keep the kind of information you're talking about on file from ten years ago, and actually use it against us now. Rather, I think MS/Bing is going so over-the-top with their crawling that it's inevitable we're going to see things like this happening.

ADDED: Don, thanks for your private note. I see what you're referring to. But I didn't see any reference to your site(s). So I stand by what I've said. :)

wilderness




msg:3963306
 7:14 pm on Jul 31, 2009 (gmt 0)

So I stand by what I've said

That's certainly your prerogative.

GaryK




msg:3963375
 8:54 pm on Jul 31, 2009 (gmt 0)

I know it's my prerogative. But please don't let it go at that. Where have you posted anything on this site that would give MS any idea how to crawl one of your sites?

wilderness




msg:3963384
 9:19 pm on Jul 31, 2009 (gmt 0)

Gary,
I've been participating in this forum since 2001 (perhaps earlier under another name).

In that time, there have certainly been submissions (even if only for a short while) that included my domain names and/or page names.

Eventually these overlooked submissions would have been edited by myself or the forum moderators (I recall making a request to Brett one time for an edit that was beyond the allowed time frame by a user).

It's not unreasonable that some overlooked submissions (containing either domain names or page names) simply slipped through the cracks and still exist today.

The possibility of gathering a participant's identity (domain or otherwise) is simply NOT as impossible as you're attempting to make it appear.

GaryK




msg:3963389
 9:30 pm on Jul 31, 2009 (gmt 0)

Based on what you've stated here and elsewhere I have to agree with you now.

It's certainly possible for someone persistent enough to find your sites and at least one of mine based on publicly available information right here on WebmasterWorld.

Although I'm still not sure how likely it is.

I have to ask myself, why would anyone at MS want to send me, an insignificant webmaster, any kind of message?

Especially when we can't even get the likes of msndude to keep his promises to address the primary topic of this thread that's been ongoing since I think March 20, 2009.

wilderness




msg:3963393
 9:35 pm on Jul 31, 2009 (gmt 0)

A superb movie is Men of Honor, with Cuba Gooding and DeNiro.
DeNiro another superb role.
Near the end of the movie Gooding asks DeNiro:

"Why are you helping me"
DeNiro replies, "to piss people off" ;)

sidney1310




msg:3968927
 10:38 pm on Aug 9, 2009 (gmt 0)

There is a recent thread on this topic in the Bing Community forums.

[bing.com...]

In it the guy who posted here as ms_dude assures people "I am working with the crawler group to get this figured out. I'll let you know as soon as I get more information." He posted that less than a week ago, almost four months after saying the same thing here, and makes it seem there like it is some new problem just being brought to their attention.

I just can't figure out whether it is incompetence or some devious scheme to inflate the statistics on usage of Bing. In any case, standard practices would require that msnbot ip addresses not be used by web crawlers that do not identify themselves in the User Agent field as web crawlers, and that they not fake the referer field. It is just like Microsoft to mess with our web stats by breezily violating standard practices for their own convenience.

sidney1310




msg:3968929
 10:49 pm on Aug 9, 2009 (gmt 0)

By the way, unlike carib_guy at the time that he posted, most of my hits that are coming from the 65\.5[2-5]\. ip address range have a referer that looks like a bing.com search link, no longer search.live.com, so my .htaccess looks like

# block access from buggy MSN bot
RewriteCond %{REMOTE_ADDR} ^65\.5[2-5]\.
RewriteCond %{HTTP_REFERER} ^http://www\.bing\.com/search [NC]
RewriteRule .* - [F]

tangor




msg:3968939
 11:27 pm on Aug 9, 2009 (gmt 0)

I have a number of entries ala msn/Bing. Some annoying. But I'm not about to block them at this time. Monitor, of course, but block? That would be self-defeating these days...

sidney1310




msg:3968959
 12:25 am on Aug 10, 2009 (gmt 0)

Maybe the right thing to do would be to see if I can filter all 65.5[2-5].* ip addresses in Google Analytics to just remove those hits from my stats there instead of blocking them from my site. But I get really annoyed that Microsoft ignores standards in a way that impacts my site and then has their tech support people keep replying as if they are happy that you have brought a new issue to their attention and that they will work with their web crawler team to resolve it.

Notice that the .htaccess lines I posted only block access that looks like someone clicking on a Bing search result for my site when they are browsing from an ip address that is supposed to be the msnbot. It will not block real people who find my site in a Bing search. It will not block the msnbot when it crawls through identifying itself as the msnbot. It only blocks access that should never occur in the first place. If Microsoft chooses to blacklist sites who do this, well I would rather not have them distort my stats and increase my bandwidth costs for no good reason. I'll accept that my site will be found by the vast majority of web users who choose to use Google. If enough webmasters shared my attitude then Microsoft would have to start following accepted standard behavior or else fall even farther behind Google as Bing fails to index more and more sites that will not stand for this.

I just found yet another thread in the Bing community forum

[bing.com...]

where someone asked yet again about these one-word Bing searches from 65.55.* in their access logs and the same ms_dude Brett guy answered yet again that it must be a glitch in the new robot they are testing and please send him details so he can relay it to the crawler team. This same behavior has been going on since at least 2007. Gimme a break!

tangor




msg:3968962
 1:09 am on Aug 10, 2009 (gmt 0)

BOT traffic of all varieties as a share of bandwidth on my little site (top 3 bots, of course):

Google 6%
Bing/Live 3%
Yahoo 10%

Referrals from those bots:

Google 30%
Bing/Live 42%
Yahoo 3%

"Just the facts, ma'am, just the facts."

YMMV

sidney1310




msg:3968972
 2:17 am on Aug 10, 2009 (gmt 0)

Here is the key question that would help in understanding this whole discussion... Of the 42% referrals for Bing/Live, how many are from 65.5[2-5].* ip addresses?

Whatever percentage that traffic is, those are faked referrals from Microsoft. If it is a significant part of the 42%, that is exactly what we are complaining about.

So your site would be a good example and I'm curious as to what your numbers are. Can you easily come up with the percentage for, say, the last month, if you separate out the 65.5?.* ip addresses from the Bing/Live referrals?
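For anyone wanting to pull that number out of their own raw logs, here is a rough sketch. It assumes Apache Combined Log Format (the same layout as the log excerpts earlier in this thread) and reuses the 65.5[2-5] pattern from the .htaccess rule above; the function name `bing_referral_share` and the exact referer pattern are my own assumptions, not anything Microsoft documents.

```python
import re

# Combined Log Format: ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)
# Same address range as the .htaccess rule in this thread.
MSN_RANGE = re.compile(r'^65\.5[2-5]\.')
# Referers that look like a Bing or Live search result (assumed pattern).
BING_REFERER = re.compile(r'^http://(www\.bing\.com|search\.live\.com)/', re.IGNORECASE)


def bing_referral_share(lines):
    """Return (bing_referrals, from_msn_range, percentage) for an iterable of log lines."""
    total = 0
    from_msn = 0
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or not BING_REFERER.match(match.group("referer")):
            continue  # unparseable line, or not a Bing/Live referral
        total += 1
        if MSN_RANGE.match(match.group("ip")):
            from_msn += 1
    percentage = 100.0 * from_msn / total if total else 0.0
    return total, from_msn, percentage
```

Passing an open log file (it iterates line by line) gives the count of Bing/Live referrals, how many of those came from the 65.52 to 65.55 range, and the resulting percentage.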

wilderness




msg:3968976
 2:40 am on Aug 10, 2009 (gmt 0)

Here is the key question that would help in understanding this whole discussion... Of the 42% referrals for Bing/Live, how many are from 65.5[2-5].* ip addresses?

Neither MS nor any other SE provides such stats.
I kinda doubt MS themselves could even trace what bot IP range their end result data (SERPS) is a result of.

tangor




msg:3968982
 2:51 am on Aug 10, 2009 (gmt 0)

if you separate out the 65.5?.* ip addresses from the Bing/Live referrals?

Of the 42% referrals, 1.4% are from the 65.5* range. Small potatoes for this site, (about 4M hits/year) which is one reason why I'm not hot and bothered.

I CAN see that if hit/bandwidth was a mil a month or greater it might make a difference. After all, it is a numbers game. For me, my numbers more than comfortably fit in my host choice.

PS. Quick numbers reply is because I dump my logs for this site into Access each week and run custom reports.

The commercial sites are managed with whitelisting to keep customers happy and when bad behavior bots are found, they are NUKED dead dead dead. Keeps customer happy (save bandwidth) and I don't feel bad charging monthly service fees. :)

edit: I see about the same percentage of 65.5* on those sites as well. Haven't put the brickbat to them yet. end edit.

sidney1310




msg:3968994
 3:37 am on Aug 10, 2009 (gmt 0)

tangor, 1.4% of 4 million hits annually is around 150 per day, not much more than how many I get on my tiny site. I can see how Microsoft might not care if they hit every site about 100 times per day with these bogus search referrals. People who have big sites like yours are not going to care about so few bogus hits. People administering small sites like mine are most likely not to notice and can be ignored when we do, or rather put off with an "I'm discussing this with the crawler team now, and will get back to you".
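The back-of-envelope figure can be checked in a couple of lines. This takes the 1.4% as a share of all 4M annual hits, which is how the estimate above reads it; if the 1.4% was meant as a share of the Bing/Live referrals only, the daily number would be smaller.

```python
hits_per_year = 4_000_000  # "about 4M hits/year"
share = 0.014              # the 1.4% figure, read here as a share of all hits

per_day = hits_per_year * share / 365
print(round(per_day))  # 153
```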

Looking in detail in both Google Analytics and my hosting provider's Webalizer reports for my site, it appears that web crawler ip addresses are filtered out, so these hits don't show. That makes it even less important as an annoyance. It still bothers me that Microsoft will keep doing this and will continue to in effect lie about it in tech support forums.

Pfui




msg:3970055
 7:39 pm on Aug 11, 2009 (gmt 0)

Speaking of 65.5*, here's a resource-wasting symphony o' redundancy where MSN uses 1 IP and 3 Hosts with 4 different fake and real UAs in 5 minutes. The fake ref was in the now-typical format:

http://www.bing.com/search?q=keyword

Note how the cloaked IP+fake UA sounds out the site, then the only official msnbot Host+UA requests a specific directory (x3!), then the 2 cloaked UAs request and fake-ref the exact same directory.

-----
65.55.217.43
Mozilla/4.0

robots.txt? YES
Fake ref? NO
Hits: 1

-----
msnbot-65-55-104-163.search.msn.com
msnbot/1.1 (+http://search.msn.com/msnbot.htm)

robots.txt? NO
Fake ref? NO
Dir req: /keyword
Hits: 3

-----
msnbot-65-55-104-70.search.msn.com
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.2)

robots.txt? NO
Fake ref? YES
Dir req: /keyword
Hits: 1

-----
msnbot-65-55-104-60.search.msn.com
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SLCC1; .NET CLR 1.1.4325; .NET CLR 2.0.40607; .NET CLR 3.0.04506.648)

robots.txt? NO
Fake ref? YES
Dir req: /keyword
Hits: 1

Spare me.
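The pattern in that breakdown (MSN host or IP, non-msnbot UA, optional Bing search referer) can be expressed as a small classifier. This is only a sketch: the labels and the name `classify_hit` are made up for illustration, the host pattern is inferred from the hostnames listed above, and the IP range is the 65.5[2-5] one used elsewhere in this thread.

```python
import re

# Hostname shape seen above, e.g. msnbot-65-55-104-163.search.msn.com
MSNBOT_HOST = re.compile(r'^msnbot-\d+-\d+-\d+-\d+\.search\.msn\.com$')
# Unresolved IPs in the range discussed in this thread.
MSN_IP = re.compile(r'^65\.5[2-5]\.')
BING_REFERER = re.compile(r'^http://www\.bing\.com/search', re.IGNORECASE)


def classify_hit(host, user_agent, referer):
    """Label one request from its remote host (name or IP), UA string and referer."""
    from_msn = bool(MSNBOT_HOST.match(host) or MSN_IP.match(host))
    if not from_msn:
        return "other"
    if user_agent.startswith("msnbot/"):
        return "declared msnbot"
    if BING_REFERER.match(referer):
        return "cloaked, fake referer"
    return "cloaked"
```

Run against the four entries listed above, the first comes out as "cloaked", the second as "declared msnbot", and the last two as "cloaked, fake referer". A real visitor arriving from a Bing search on some other network falls through to "other".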

Ocean10000




msg:3974157
 1:44 pm on Aug 18, 2009 (gmt 0)

New version of the Referrer spam bot using invalid IE User-Agent Strings coming from 65.55.165.*

Mozilla/4.0+(compatible;++MSIE+6.0;++Windows+ NT+5.2;++SV1;+ +.NET+CLR+1.1.4325;++.NET+CLR+2.0.40607;++.NET+CLR+3.0.04506.648)
Mozilla/4.0+(compatible;++MSIE+6.0;++Windows+ NT+5.1;++SV1;+ +.NET+CLR+1.1.4325;++.NET+CLR+2.0.50727;++.NET+CLR+3.0.04506.648)
Mozilla/4.0+(compatible;++MSIE+6.0;++Windows+ NT+5.1;++SV1;+ +.NET+CLR+1.1.4322;++.NET+CLR+2.0.40607;++.NET+CLR+3.0.30729;++.NET+CLR+3.5.30707)

[edited by: Brett_Tabke at 12:50 pm (utc) on Jan. 12, 2010]
[edit reason] (fixed formatting) added space before NT [/edit]


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved