Just saw this guy, fell into a spider trap:
131.107.137.47 - - [11/Apr/2003:01:31:08 -0600] "GET /a/deep/link.html HTTP/1.1" 200 12589 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)"
No referer, came in on a deep link (like from a SE), and d/l pages but no images. After about 5 hits, he tried to grab a trap, and got banned. Grabbed a page every 5 secs or so...
IP resolves to Redmond.... did Bill just get himself banned?
dave
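For anyone who wants to check their own logs for the same pattern (no referer, pages fetched but no images), here is a minimal Python sketch. It assumes the Apache "combined" log format used in the line quoted above. The names `parse_line`, `looks_like_crawler` and the image extension list are made up for illustration, and the heuristic is only a rough hint, not proof of a bot.

```python
import re

# Apache "combined" log format, as in the log line quoted in this post.
COMBINED = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Assumed list of image extensions; adjust for your own site.
IMAGE_EXTS = (".gif", ".jpg", ".jpeg", ".png")


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    m = COMBINED.match(line.strip())
    return m.groupdict() if m else None


def looks_like_crawler(hits):
    """Heuristic only: every hit lacks a referer and none is an image."""
    if not hits:
        return False
    no_referer = all(h["referer"] in ("", "-") for h in hits)
    no_images = not any(h["path"].lower().endswith(IMAGE_EXTS) for h in hits)
    return no_referer and no_images
```

Feed it the parsed hits for one IP at a time; a real visitor with images switched off would trip it too, so treat the result as a reason to look closer rather than a reason to ban.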
There were a couple of posts where someone alluded to 'the next big thing' (or similar) as though they perhaps knew something we don't.
Mr. Birney apparently did not see fit to post here, even though I sent him the thread and suggested he do so. We have communicated twice to date.
There are other mentions on the boards about MS going after Google, etc.
Logic dictates a certain amount of legitimacy, especially when one considers how an employee of MS could obtain that IP Number and run crawls, with the server draw that involves, without someone at MS tracking him down in the course of events.
Then again, Mr. Birney not adding 'personal legitimacy' by posting here tends to sway me the other way.
Having said that, since my domain hasn't been 'pummeled' too badly, I'm going to wait and see using cautious optimism.
Pendanticist.
There is a possibility that he disguised his browser type and changed his IP. Like I said, they may be competing with me soon. Just my luck.
131.107.65.225 - - [19/Apr/2003:17:10:03 -0500] "GET /links.html HTTP/1.1" 200 33341 "-" "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.2;+.NET+CLR+1.1.4322)"
Notice the ‘+’s. So I have deny from 131.107.
I've had them denied since their first visit, and each time the IP range expands, I'll expand my deny range.
It is NOT logical for a legitimate company like MS to disguise and even misrepresent themselves in such a manner.
It's just NOT good business.
So the "surf nazi" suggests letting them "eat 403's"
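To check which of the logged addresses a rule like "deny from 131.107." would actually catch, here is a small Python sketch using the standard `ipaddress` module. It relies on the fact that a two-octet prefix covers the same addresses as the CIDR block 131.107.0.0/16. The function name `is_denied` is hypothetical, and this only mirrors the rule for testing; it does not replace the server config.

```python
import ipaddress

# "deny from 131.107." matches every address starting with those two octets,
# which is the same set of addresses as the CIDR block 131.107.0.0/16.
DENIED = (ipaddress.ip_network("131.107.0.0/16"),)


def is_denied(ip, networks=DENIED):
    """Return True if ip falls inside any of the denied networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)
```

Running a few logged IPs through `is_denied` before widening a range is a cheap way to see who else would be shut out.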
Just ran 131.107.163.49 thru SpamCop and it renders this: postmaster@[131.107.163.49].
131.107.137.47 ditto postmaster@[131.107.137.47].
131.107.65.225 ditto postmaster@[131.107.65.225].
131.107.163.49 - - [23/Apr/2003:11:09:17 -0700] "GET /robots.txt HTTP/1.1" 200 220 "-" "MicrosoftPrototypeCrawler (How's my crawling? mailto:newbiecrawler@hotmail.com)"
131.107.163.49 - - [23/Apr/2003:11:09:17 -0700] "GET /blahblah.html HTTP/1.1" 200 8620 "-" "MicrosoftPrototypeCrawler (How's my crawling? mailto:newbiecrawler@hotmail.com)"
131.107.163.49 - - [23/Apr/2003:11:09:17 -0700] "GET /blahblah.html HTTP/1.1" 200 13642 "-" "MicrosoftPrototypeCrawler (How's my crawling? mailto:newbiecrawler@hotmail.com)"
It looks like this is the third IP Number and the second ' message '.
Hmmmmmm. Beginning to look more bogus all the way around.
Ok, then how do we go about shutting this thing down?
Fire off an abuse@msn/hotmail.com message?
If they're spoofing IP Numbers, (and I'm ignorant here) can't that be tracked down and reported? Or, are we simply looking at the .htaccess ban?
I've searched MS's MSDN and found nothing there. Googling "MicrosoftPrototypeCrawler" [google.com] shows only one more result this week than it did last week, and that's this thread.
Too bad no one from here works for MS. <Hint! Hint!>
Pendanticist.
If they're spoofing IP Numbers, (and I'm ignorant here) can't that be tracked down and reported? Or, are we simply looking at the .htaccess ban?
Even though I've banned ' deny from 131.107.', I'm still interested in learning more about tracking down spoof'd IP Numbers.
Must be time for another thread....
Pendanticist.
Have you guys seen this mention of newbiecrawler on microdoc news?
Maybe I’m cynical, and no doubt I’m paranoid, (I grew up in the 70’s), and while it could be a new bot, that does not necessarily mean that it is a SE bot. It could be a spy bot just as well, or doing both. Spying while acting as SE bot or vice versa.
Don’t get me wrong. I sell software written in a MS language, and have since 1990, and I always have been a pro-MS person, but it looks funny and unethical to me. And let’s face it, MS has been sued in the past over several questionable business practices.
I’m not even convinced that it is a bot all the time. Somewhere in my log the original IP that was posted came via google and subscribed to my newsletter, just as many of my competitors have in the past. Now I publish all the graphics for my newsletter on the server where my competitors are banned, so all they get is the text until they get home. And they all have a REFERER of hotmail, yahoo, etc. So I know it goes on.
can't that be tracked down and reported.
It can be, just not very easily, not to mention not economically. You need a sniffer and you need to sit on it 24/7. Banning is the most economical way, I think.
Spoofing just doesn’t make sense. No reason to spoof to sign up for my newsletter. They could just do it via an ISP instead of going to all that trouble. Could it be a firewall thing adding to this confusing issue? I’ll bet it’s a new hire or something at MS, and they don’t realize that when they go out on the web with a MS IP, they are representing MS for better or for worse. i.e. a fresh-out or intern.
I think the name of the bot/crawler should be...
SwissCheese/madeinfrance...."Hack me, hack me.."
Chalupee
not from Gaudahlupee
#MSN
#tide01.microsoft.com
#131.107.3.11
#tide02.microsoft.com
#131.107.3.12
#tide03.microsoft.com
#131.107.3.13
#tide04.microsoft.com
#131.107.3.14
#tide05.microsoft.com
#131.107.3.15
#tide06.microsoft.com
#131.107.3.16
#tide07.microsoft.com
#131.107.3.17
#tide08.microsoft.com
#131.107.3.18
#tide09.microsoft.com
#131.107.3.19
#tide10.microsoft.com
#131.107.3.20
#tide11.microsoft.com
#131.107.3.21
#tide12.microsoft.com
#131.107.3.22
#tide14.microsoft.com
#131.107.3.24
#tide15.microsoft.com
#131.107.3.25
#tide16.microsoft.com
#131.107.3.26
#tide17.microsoft.com
#131.107.3.27
#tide18.microsoft.com
#131.107.3.28
#tide19.microsoft.com
#131.107.3.29
#tide20.microsoft.com
#131.107.3.30
#tide21.microsoft.com
#131.107.3.31
#tide22.microsoft.com
#131.107.3.32
#tide23.microsoft.com
#131.107.3.33
#tide24.microsoft.com
#131.107.3.34
#tide25.microsoft.com
#131.107.3.35
#tide26.microsoft.com
#131.107.3.36
#tide27.microsoft.com
#131.107.3.37
#tide28.microsoft.com
#131.107.3.38
#tide29.microsoft.com
#131.107.3.39
#tide30.microsoft.com
#131.107.3.40
#tide33.microsoft.com
#131.107.39.12
#tide34.microsoft.com
#131.107.3.44
#tide35.microsoft.com
#131.107.3.45
#tide36.microsoft.com
#131.107.3.46
#tide70.microsoft.com
#131.107.3.70
#tide71.microsoft.com
#131.107.3.71
#tide72.microsoft.com
#131.107.3.72
#tide73.microsoft.com
#131.107.3.73
#tide74.microsoft.com
#131.107.3.74
#tide75.microsoft.com
#131.107.3.75
#tide76.microsoft.com
#131.107.3.76
#tide77.microsoft.com
#131.107.3.77
#tide78.microsoft.com
#131.107.3.78
#tide79.microsoft.com
#131.107.3.79
#tide82.microsoft.com
#131.107.3.82
#tide83.microsoft.com
#131.107.3.83
#tide84.microsoft.com
#131.107.3.84
#tide85.microsoft.com
#131.107.3.85
#tide86.microsoft.com
#131.107.3.86
#tide87.microsoft.com
#131.107.3.87
#tide93.microsoft.com
#131.107.3.93
#tide94.microsoft.com
#131.107.3.94
#tide110.microsoft.com
#63.64.43.138
#tide111.microsoft.com
#63.64.43.137
#tide112.microsoft.com
#208.249.151.138
#tide113.microsoft.com
#208.249.151.139
#tide114.microsoft.com
#192.237.67.205
#tide115.microsoft.com
#192.237.67.206
#tide116.microsoft.com
#207.46.104.80
#tide117.microsoft.com
#207.46.125.16
#tide118.microsoft.com
#208.147.66.138
#tide119.microsoft.com
#208.147.66.139
#tide120.microsoft.com
#207.46.71.10
#tide121.microsoft.com
#207.46.71.11
#tide122.microsoft.com
#203.127.3.12
#tide123.microsoft.com
#203.127.3.14
#tide124.microsoft.com
#203.41.151.8
#tide125.microsoft.com
#203.41.151.9
#tide130.microsoft.com
#207.46.36.9
#tide131.microsoft.com
#207.46.36.10
#tide132.microsoft.com
#207.46.36.11
#tide133.microsoft.com
#207.46.38.9
#tide134.microsoft.com
#207.46.38.10
#tide135.microsoft.com
#207.46.11.19
#tide136.microsoft.com
#207.46.11.20
#tide137.microsoft.com
#207.46.11.21
#tide138.microsoft.com
#207.46.44.9
#tide139.microsoft.com
#207.46.44.10
#tide140.microsoft.com
#207.46.46.9
#tide141.microsoft.com
#207.46.46.10
#tide142.microsoft.com
#207.46.40.9
#tide143.microsoft.com
#207.46.40.10
#tide144.microsoft.com
#207.46.48.9
#tide145.microsoft.com
#207.46.48.10
#tide146.microsoft.com
#207.46.42.9
#tide147.microsoft.com
#207.46.42.10
http ://www.clearwaterbeachcam.com/d--skinner/spiders.html
This is ironic, when I go there I end up at…
[search.msn.com...]
(hehehehehehe)
#MSN
Does that mean the IP’s belong to msn.com and not microsoft.com? If that’s the case, I just banned all msn users. Oops. If some of that block is msn.com and not microsoft.com, that would explain a lot of this.
Does anyone know what msn.com IP’s are? Is there any way of getting the IP block for anydomain.com?
#tide119.microsoft.com
#208.147.66.139
I get - Cable & Wireless
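On checking who an address belongs to: a whois lookup is what tells you the registered block, and the sketch below does not do that. What it shows is a forward-confirmed reverse DNS check on a single IP, which is one way to see whether a hostname like tide119.microsoft.com and an address actually point at each other. It uses Python's standard `socket.gethostbyaddr` and `socket.gethostbyname_ex`; the function name `forward_confirmed` and the `reverse`/`forward` parameters are made up here so the lookups can be swapped out.

```python
import socket


def forward_confirmed(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Return the reverse-DNS hostname if it resolves back to ip, else None."""
    try:
        host = reverse(ip)[0]          # PTR lookup: IP -> hostname
        addresses = forward(host)[2]   # A lookup: hostname -> list of IPs
    except OSError:                    # covers socket.herror / socket.gaierror
        return None
    return host if ip in addresses else None
```

A `None` result only means the two lookups disagree or failed. Plenty of legitimate hosts have no matching reverse record, so this is a hint and not a verdict.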
while it could be a new bot, that does not necessarily mean that it is a SE bot.
You will see that they are still active (at least registered with MS). Whether that is Microsoft or MSN is, IMO, really irrelevant.
I've removed the denies from 131.107. with "egg on my face"
Having gone through ARIN Whois on all those ranges, I'm in the process of allowing some of those MS IP ranges back in (from denied) within my FarEast blocks.
Most everybody realizes how over-bearing I am in these matters and I believe this should resolve this issue.
Although, as Pendanticist points out, there still exists the possibility of it being a spoof'd range in our logs.
Due to the recent PERSISTENT activity (131.107) and the related ranges, I'm going to accept that chance.
Hopefully in the process I won't end up with even more egg on my face ;)
Don
<BTW Jim, that page comes right up for me as soon as I omit the blank space I purposely left in the URL so the link would be broken>
Most everybody realizes how over-bearing I am in these matters
Yea, so am I. When you do a search for my keywords you see Motorola, GE, Honeywell, and I have had so many spy bots, that I get a little trigger happy sometimes. Banned a customer once even. (GRIN)
Pendanticist was right then. Complain to abuse@microsoft and abuse@msn.com and let them figure it out?
<BTW Jim, that page comes right up for me as soon as I omit the blank space I purposely left in the URL so the link would be broken>
Yea, it was an attempt @ humor. Obviously a poor attempt! But once I hit Mr. Button, it was toooooooo late. :-))
Doesn’t msn do dynamic IP’s? If they do, then wouldn’t that make 131.107.137.47 microsoft.com because it was consistent? And I still have a problem with the ‘+’ thing that showed up.
I stopped emailing IP's and backbones some time ago. Generally your only response is automated. In the event you are lucky enough to find somebody to email with, they are not aware of any web log patterns, nor do they have the ability to compare those patterns to their User Agreements.
Their only concern is bandwidth.
I'm not all that keen on the variations in UA's either. However, just denying a visitor access because of UA without comparing that to IP is TOOOO overbearing. IMO anyway.
These logs, like the internet, are an always-changing thing, and though we are required to be perceptive, we should also remain open-minded, hopefully creating a worthwhile balance of both which benefits both our websites and our visitors.
<off the soap box> ;)
Don
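The "compare UA to IP before denying" idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the watched range 131.107.0.0/16 and the UA substring are taken from the logs in this thread, the names `normalize_ua` and `should_deny` are hypothetical, and treating '+' as a stand-in for a space is an assumption about how some log formats write the UA field.

```python
import ipaddress

# Range and UA substring taken from the logs in this thread (illustrative only).
WATCHED_NET = ipaddress.ip_network("131.107.0.0/16")
WATCHED_UA = "microsoftprototypecrawler"


def normalize_ua(ua):
    """Assumption: '+' in a logged UA stands for a space; fold it back."""
    return ua.replace("+", " ").strip().lower()


def should_deny(ip, ua):
    """Deny only when BOTH the UA and the IP range match."""
    in_range = ipaddress.ip_address(ip) in WATCHED_NET
    ua_match = WATCHED_UA in normalize_ua(ua)
    return in_range and ua_match
```

Requiring both conditions means an ordinary browser coming from the same range, or the same UA string showing up from somewhere else, is left alone until someone has looked at it.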
2003-04-24 20:08:56 131.107.163.50 - myserverip 80 GET /robots.txt - 404 MicrosoftPrototypeCrawler+(How's+my+crawling?+mailto:newbiecrawler@hotmail.com) - -
2003-04-24 20:08:57 131.107.163.50 - myserverip 80 GET /Default.asp - 200 MicrosoftPrototypeCrawler+(How's+my+crawling?+mailto:newbiecrawler@hotmail.com) - -
2003-04-24 20:09:24 131.107.163.50 - myserverip 80 GET /robots.txt - 404 MicrosoftPrototypeCrawler+(How's+my+crawling?+mailto:newbiecrawler@hotmail.com) - -
2003-04-24 20:09:24 131.107.163.50 - myserverip 80 GET /whatsnew.asp - 200 MicrosoftPrototypeCrawler+(How's+my+crawling?+mailto:newbiecrawler@hotmail.com) - -
2003-04-24 10:40:05 131.107.163.47 - GET /robots.txt 200 11744 355 MicrosoftPrototypeCrawler+(How's+my+crawling?+mailto:newbiecrawler@hotmail.com) -
2003-04-24 10:40:05 131.107.163.47 - GET /browsers/notes.asp 200 0 363 MicrosoftPrototypeCrawler+(How's+my+crawling?+mailto:newbiecrawler@hotmail.com) -
131.107.163.47 - - [18/Apr/2003:11:33:33 +0200] "GET /index.html HTTP/1.1" 404 6124 "-" "MicrosoftPrototypeCrawler (please report obnoxious behavior to newbiecrawler@hotmail.com)"
131.107.163.47 - - [18/Apr/2003:11:36:13 +0200] "GET /index.en.html HTTP/1.1" 404 6163 "-" "MicrosoftPrototypeCrawler (please report obnoxious behavior to newbiecrawler@hotmail.com)"
131.107.163.47 - - [18/Apr/2003:11:37:53 +0200] "GET /index.html HTTP/1.1" 404 6203 "-" "MicrosoftPrototypeCrawler (please report obnoxious behavior to newbiecrawler@hotmail.com)"
131.107.163.47 - - [18/Apr/2003:11:43:50 +0200] "GET /index.html HTTP/1.1" 404 6203 "-" "MicrosoftPrototypeCrawler (please report obnoxious behavior to newbiecrawler@hotmail.com)"
And then it falls into my e-mail harvester trap (mailto links written mAilto):
131.107.163.47 - - [24/Apr/2003:00:06:41 +0200] "GET /guestbook/old/m& HTTP/1.1" 404 6277 "-" "MicrosoftPrototypeCrawler (How's my crawling? mailto:newbiecrawler@hotmail.com)"
If it wasn't for its interest in my Bill Gates page, I would have just said this was one of all the e-mail harvesting bots, but now I'm not so sure...
Let's see.
It hits a site that talks about GOOGLE and MS SE.
It hits my site, which could be a competitor.
It hits a site that talks about MS UA’s.
It hits an anti-Uncle Bill site.
Now, I don’t work in R&D at Morton Thiokol, but I do do statistics. And I think I’m starting to see a trend here. However, I must admit that the sample size is much too small to have a real confidence level in any theories.
Anyone have a URL that it hasn’t hit where they could set up a page about how MS does/has done this, that, or the other? Or are there just as many people without any MS references whose sites it has visited on more than one occasion?
I may go back to the deny mode until it gets sorted out. Heck I haven’t sold anything to anyone on msn anyway.
[edited by: jim_w at 9:45 pm (utc) on April 25, 2003]
Jim_w I'm assuming that you've looked at the pages in Google's index containing the U/A string?
OK, if I understand the question, you mean IE UA’s? I was talking about pages with content.
If that wasn’t it, Huh? Remember it’s Friday and there is a higher probability for a human to make a mistake on Mondays and Fridays. At least that’s the theory I’m sticking to.
OK, if I understand the question, you mean IE UA’s? I was talking about pages with content. If that wasn’t it, Huh?
Sorry, I wasn't trying to be cryptic. I meant if you search google for newbiecrawler@hotmail.com [google.com] you can see a number of pages hit by the spider unrelated to MS queries.