I've not heard any more from him at all, bobmark.
There were a couple of posts where someone alluded to 'the next big thing' (or similar) as though they perhaps knew something we don't.
Mr. Birney apparently did not see fit to post here, even though I sent him the thread and suggested he do so. We have communicated twice to date.
There are other mentions on the boards about MS going after Google, etc.
Logic dictates a certain amount of legitimacy, especially when one considers this: how could an employee of MS obtain that IP Number and run crawls, with the server draw that involves, without someone at MS tracking him down?
Then again, Mr. Birney not adding 'personal legitimacy' by posting here tends to sway me the other way.
Having said that, since my domain hasn't been 'pummeled' too badly, I'm going to wait and see using cautious optimism.
There is a possibility that he disguised his browser type and changed his IP. Like I said, they may be competing with me soon. Just my luck.
18.104.22.168 - - [19/Apr/2003:17:10:03 -0500] "GET /links.html HTTP/1.1" 200 33341 "-" "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.2;+.NET+CLR+1.1.4322)"
Notice the '+'s. So I have deny from 131.107.
Just the "surf nazi" checking here ;)
I've had them denied since their first visit, and each time the IP range expands, I'll expand my deny range.
It is NOT logical for a legitimate company like MS to disguise and even misrepresent themselves in such a manner.
It's just NOT good business.
So the "surf nazi" suggests letting them "eat 403's"
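For anyone who wants to check which addresses in their logs a 'deny from 131.107.' line would actually catch, here is a minimal Python sketch. It assumes the partial address covers everything starting with 131.107 (that is, 131.107.0.0/16). The function name and the sample addresses are made up for illustration and are not taken from anyone's logs.

```python
import ipaddress

# Assumption: "deny from 131.107." covers every address that starts with
# 131.107, which is the 131.107.0.0/16 network.
DENIED_NETWORKS = [ipaddress.ip_network("131.107.0.0/16")]


def is_denied(ip_text):
    """Return True if the address falls inside any denied network."""
    try:
        address = ipaddress.ip_address(ip_text)
    except ValueError:
        return False  # not a parseable address, so nothing to match
    return any(address in network for network in DENIED_NETWORKS)


if __name__ == "__main__":
    # Illustrative addresses only.
    for sample in ("131.107.3.80", "131.108.0.1", "192.0.2.10"):
        print(sample, "-> 403" if is_denied(sample) else "-> allowed")
```

Adding another range later is just one more entry in DENIED_NETWORKS, which makes it easy to see how wide the ban has grown.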
Well then, he/she changed IP Numbers again?!?
Just ran 22.214.171.124 thru SpamCop and it renders this: email@example.com.
126.96.36.199 ditto firstname.lastname@example.org.
188.8.131.52 ditto email@example.com.
|184.108.40.206 - - [23/Apr/2003:11:09:17 -0700] "GET /robots.txt HTTP/1.1" 200 220 "-" "MicrosoftPrototypeCrawler (How's my crawling? mailto:firstname.lastname@example.org)" |
220.127.116.11 - - [23/Apr/2003:11:09:17 -0700] "GET /blahblah.html HTTP/1.1" 200 8620 "-" "MicrosoftPrototypeCrawler (How's my crawling? mailto:email@example.com)"
18.104.22.168 - - [23/Apr/2003:11:09:17 -0700] "GET /blahblah.html HTTP/1.1" 200 13642 "-" "MicrosoftPrototypeCrawler (How's my crawling? mailto:firstname.lastname@example.org)"
It looks like this is the third IP Number and the second ' message '.
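To keep count of how many IP Numbers and how many different 'messages' the crawler has used, something like the following Python sketch could be run over an access_log. It assumes the Apache combined log format shown above. The sample lines use documentation addresses and a placeholder mail address, not real entries, and the function name is hypothetical.

```python
import re
from collections import defaultdict

# Apache combined log format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def hosts_by_agent(lines, needle):
    """Map each user agent containing `needle` to the set of hosts that sent it."""
    seen = defaultdict(set)
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match and needle in match.group("agent"):
            seen[match.group("agent")].add(match.group("host"))
    return dict(seen)


# Made-up sample lines (documentation addresses, placeholder contact).
SAMPLE_LOG = """\
192.0.2.10 - - [23/Apr/2003:11:09:17 -0700] "GET /robots.txt HTTP/1.1" 200 220 "-" "MicrosoftPrototypeCrawler (How's my crawling? mailto:crawler@example.com)"
192.0.2.11 - - [23/Apr/2003:11:09:18 -0700] "GET /blahblah.html HTTP/1.1" 200 8620 "-" "MicrosoftPrototypeCrawler (How's my crawling? mailto:crawler@example.com)"
192.0.2.12 - - [18/Apr/2003:11:33:33 +0200] "GET /index.html HTTP/1.1" 404 6124 "-" "MicrosoftPrototypeCrawler (please report obnoxious behavior to crawler@example.com)"
198.51.100.7 - - [19/Apr/2003:17:10:03 -0500] "GET /links.html HTTP/1.1" 200 33341 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"
"""

if __name__ == "__main__":
    found = hosts_by_agent(SAMPLE_LOG.splitlines(), "MicrosoftPrototypeCrawler")
    for agent, hosts in found.items():
        print(len(hosts), "host(s):", agent)
```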
Hmmmmmm. Beginning to look more bogus all the way around.
Ok, then how do we go about shutting this thing down?
Fire off an abuse@msn/hotmail.com message?
If they're spoofing IP Numbers, (and I'm ignorant here) can't that be tracked down and reported? Or, are we simply looking at the .htaccess ban?
I've searched MS's MSDN and found nothing there. Googling "MicrosoftPrototypeCrawler" [google.com] shows only one more result this week than it did last week, and that's this thread.
Too bad no one from here works for MS. <Hint! Hint!>
My suggestion is to stop toying with the deceiver and deny 131.107.
If MS doesn't have the proper regard for protecting their subscribers, why should I?
|If they're spoofing IP Numbers, (and I'm ignorant here) can't that be tracked down and reported? Or, are we simply looking at the .htaccess ban? |
Even though I've banned ' deny from 131.107.', I'm still interested in learning more about tracking down spoof'd IP Numbers.
Must be time for another thread....
I make it a habit to ban any robot that does not put down a valid contact method in the user agent (unless I know who they are). I don't consider a hotmail account to be valid. It managed to crawl 1700 of my pages before I got it though.
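A rough Python sketch of that kind of rule, for illustration only. It treats a user agent as having a valid contact if it carries a URL or a mail address outside a list of free webmail domains. hotmail.com comes from the post above; the other domain in the list, the regular expressions, and the function name are assumptions. Deciding that a visitor is a robot in the first place is left to the caller.

```python
import re

# hotmail.com is from the post above; yahoo.com is an assumed extra entry.
FREE_MAIL_DOMAINS = {"hotmail.com", "yahoo.com"}

# Simplified patterns, good enough for log triage but not full validators.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@([\w-]+(?:\.[\w-]+)+)")
URL_PATTERN = re.compile(r"https?://[^\s)]+", re.IGNORECASE)


def has_valid_contact(user_agent):
    """Return True if the user agent carries a URL or a non-free-mail address."""
    if URL_PATTERN.search(user_agent):
        return True
    for match in EMAIL_PATTERN.finditer(user_agent):
        if match.group(1).lower() not in FREE_MAIL_DOMAINS:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical user agents, not copied from any log.
    for agent in (
        "ExampleBot/1.0 (+http://example.com/bot.html)",
        "ExampleBot (mailto:bots@example.com)",
        "ExampleBot (mailto:someone@hotmail.com)",
    ):
        print(has_valid_contact(agent), agent)
```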
|Have you guys seen this mention of newbiecrawler on microdoc news? |
Maybe I'm cynical, and no doubt I'm paranoid, (I grew up in the 70's), and while it could be a new bot, that does not necessarily mean that it is a SE bot. It could be a spy bot just as well, or doing both. Spying while acting as SE bot or vice versa.
Don't get me wrong. I sell software written in a MS language, and have since 1990, and I always have been a pro-MS person, but it looks funny and unethical to me. And let's face it, MS has been sued in the past on several questionable business practices.
I'm not even convinced that it is a bot all the time. Somewhere in my log the original IP that was posted came via google and subscribed to my newsletter, just as many of my competitors have in the past. Now I publish all the graphics for my newsletter on the server where my competitors are banned, so all they get is the text until they get home. And they all have a REFERER of hotmail, yahoo, etc. So I know it goes on.
|can't that be tracked down and reported. |
It can be, just not very easily, not to mention not economically. You need a sniffer and you need to sit on it 24/7. Banning is the most economical way I think.
Spoofing just doesn't make sense. No reason to spoof to sign up for my newsletter. They could just do it via an ISP instead of going to all that trouble. Could it be a firewall thing adding to this confusing issue? I'll bet it's a new hire or something at MS, and they don't realize that when they go out on the web with a MS IP, they are representing MS for better or for worse, i.e. a fresh-out or intern.
What, MS sued... when did that happen? I'm from the 60's and 70's, spaced out and paranoid.
From the microdot message link above...
Assuming this new platform runs on Microsoft technology, there is going to be an interesting comparison between a Microsoft search engine and a Linux Search Engine (Google). Since we know Google has about 54,000 computers in what is a mammoth supercomputer made out of PC parts, it will be interesting to see how many NT Servers it takes to make a comparable search engine, or a better one than Google. ++
I think the name of the bot/crawler should be...
SwissCheese/madeinfrance...."Hack me, hack me.."
not from Gaudahlupee
I was going through some old IP and other information which I had saved for reference concerning IP identification and stumbled across the following, which I had from http ://www.clearwaterbeachcam.com/d--skinner/spiders.html. Although the page is still there, the references below are not. My saved file is dated 03/25/02:
|http ://www.clearwaterbeachcam.com/d--skinner/spiders.html |
This is ironic, when I go there I end up at...
Does that mean the IP's belong to msn.com and not microsoft.com? If that's the case, I just banned all msn users. Oops. If some of that block is msn.com and not microsoft.com, that would explain a lot of this.
Does anyone know what msn.com IP's are? Is there any way of getting the IP block for anydomain.com?
I get - Cable & Wireless
|while it could be a new bot, that does not necessarily mean that it is a SE bot. |
I should have said 'THE SE BOT'.
If you go to " georgegg " page and use any one of those
(not sure what they're called?) tide28.microsoft.com
You will see that they are still active (at least registered with MS). Whether that is Microsoft or MSN is, IMO, really irrelevant.
I've removed the denies from 131.107. with "egg on my face"
having gone through Arin-Whois on all those ranges I'm in the process of allowing some of those MS IP ranges back into (from denied) to my FarEast blocks.
Most everybody realizes how over-bearing I am in these matters and I believe this should resolve this issue.
Although as Pendanticist points out, there still exists the possibility of it being a spoof'd range in our logs.
Due to the recent PERSISTENT activity (131.107) and the related ranges, I'm going to accept that chance.
Hopefully in the process I won't end up with even more egg on my face ;)
<BTW Jim, that page comes right up for as soon as I omit the blank space I purposely left in the URL so the link would be broken>
|Most everybody realizes how over-bearing I am in these matters |
Yea, so am I. When you do a search for my keywords you see Motorola, GE, Honeywell, and I have had so many spy bots, that I get a little trigger happy sometimes. Banned a customer once even. (GRIN)
Pendanticist was right then. Complain to abuse@microsoft and email@example.com and let them figure it out?
|<BTW Jim, that page comes right up for as soon as I omit the blank space I purposely left in the URL so the link would be broken> |
Yea, it was an attempt @ humor. Obviously a poor attempt! But once I hit Mr. Button, it was toooooooo late. :-))
Doesn't msn do dynamic IP's? If they do, then wouldn't that make 22.214.171.124 microsoft.com because it was consistent? And I still have a problem with the '+' thing that showed up.
<snip>Complain to abuse@microsoft</snip>
I stopped emailing IP's and backbones some time ago. Generally your only response is automated. In the event you are lucky enough to find somebody to email with, they are not aware of any web log patterns, nor do they have the ability to compare those patterns to their User Agreements.
Their only concern is bandwidth.
I'm not all that keen on the variations in UA's either. However, just denying a visitor access because of UA without comparing that to IP is TOOOO overbearing. IMO anyway.
These logs, like the internet, are an always-changing thing, and though we are required to be perceptive, we should also remain open-minded, hopefully creating a worthwhile balance that benefits both our websites and our visitors.
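One way to read that in code, as a hedged Python sketch: only deny when the user agent and the source address both match, rather than on the UA alone. The 131.107 range and the crawler name come from this thread. The function name, the sample addresses, and the exact rule are assumptions for illustration.

```python
import ipaddress

# Range and crawler name as discussed in this thread; the rule is an assumption.
SUSPECT_NETWORK = ipaddress.ip_network("131.107.0.0/16")
SUSPECT_AGENT = "MicrosoftPrototypeCrawler"


def should_deny(ip_text, user_agent):
    """Deny only when the user agent AND the source address both match."""
    try:
        in_range = ipaddress.ip_address(ip_text) in SUSPECT_NETWORK
    except ValueError:
        in_range = False  # unparseable address: treat as not in range
    return in_range and SUSPECT_AGENT in user_agent


if __name__ == "__main__":
    # Illustrative combinations only.
    print(should_deny("131.107.3.80", "MicrosoftPrototypeCrawler (test)"))
    print(should_deny("192.0.2.10", "MicrosoftPrototypeCrawler (test)"))
    print(should_deny("131.107.3.80", "Mozilla/4.0 (compatible; MSIE 6.0)"))
```

With this rule an ordinary browser coming from the range is left alone, and so is the same UA string arriving from some other network.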
<off the soap box> ;)
I'm on the phone with MS at this very moment and it appears as though Mr. Birney is indeed an employee of theirs.
Be right back....
Ok. I've spoken to a receptionist at tech and she is going to determine the legitimacy of this bot, once and for all.
She asked for my phone number and I gave her my ISP addy, so I do expect to hear from her.
I will let you know as soon as I hear anything at all.
(Thanks! to NeoTrace)
Interesting saga--the hotmail address is a little strange. Keep us posted, pendanticist. :)
All our sites got spidered yesterday by the same bot. Here is a sample of the IIS log file:
2003-04-24 20:08:56 126.96.36.199 - myserverip 80 GET /robots.txt - 404 MicrosoftPrototypeCrawler+(How's+my+crawling?+mailto:firstname.lastname@example.org) - -
2003-04-24 20:08:57 188.8.131.52 - myserverip 80 GET /Default.asp - 200 MicrosoftPrototypeCrawler+(How's+my+crawling?+mailto:email@example.com) - -
2003-04-24 20:09:24 184.108.40.206 - myserverip 80 GET /robots.txt - 404 MicrosoftPrototypeCrawler+(How's+my+crawling?+mailto:firstname.lastname@example.org) - -
2003-04-24 20:09:24 220.127.116.11 - myserverip 80 GET /whatsnew.asp - 200 MicrosoftPrototypeCrawler+(How's+my+crawling?+mailto:email@example.com) - -
I got hit too by this spider.
Is it a Microsoft loyalty probe?
Anyone running anything other than IIS?
Yea, I have a sun unix box.
It read my robots.txt file and then went right to a file where I have some not so nice things to say about various Microsoft-related user agents. Those are the only two files it read.
2003-04-24 10:40:05 18.104.22.168 - GET /robots.txt 200 11744 355 MicrosoftPrototypeCrawler+(How's+my+crawling?+mailto:firstname.lastname@example.org) -
2003-04-24 10:40:05 22.214.171.124 - GET /browsers/notes.asp 200 0 363 MicrosoftPrototypeCrawler+(How's+my+crawling?+mailto:email@example.com) -
The 2nd page it hit doesn't have too much good to say about a lot of user-agents, Gary ;)
I'm so curious about this bot. Haven't seen it on a site yet.
Second the Unix Box thing.... FreeBSD and Apache.
|I've seen this too. It seems to be very interested in my mailing list archives, my guest book, and in my pages where I say some not-so-nice things about Bill Gates. First it came in without User-Agent or referer, which I then blocked. Then it tried to retrieve pages on URLs that have never ever existed, and are not linked from anywhere:|
126.96.36.199 - - [18/Apr/2003:11:33:33 +0200] "GET /index.html HTTP/1.1" 404 6124 "-" "MicrosoftPrototypeCrawler (please report obnoxious behavior to firstname.lastname@example.org)"
188.8.131.52 - - [18/Apr/2003:11:36:13 +0200] "GET /index.en.html HTTP/1.1" 404 6163 "-" "MicrosoftPrototypeCrawler (please report obnoxious behavior to email@example.com)"
184.108.40.206 - - [18/Apr/2003:11:37:53 +0200] "GET /index.html HTTP/1.1" 404 6203 "-" "MicrosoftPrototypeCrawler (please report obnoxious behavior to firstname.lastname@example.org)"
220.127.116.11 - - [18/Apr/2003:11:43:50 +0200] "GET /index.html HTTP/1.1" 404 6203 "-" "MicrosoftPrototypeCrawler (please report obnoxious behavior to email@example.com)"
And then it falls into my e-mail harvester trap (mailto links written mAilto):
18.104.22.168 - - [24/Apr/2003:00:06:41 +0200] "GET /guestbook/old/m& HTTP/1.1" 404 6277 "-" "MicrosoftPrototypeCrawler (How's my crawling? mailto:firstname.lastname@example.org)"
If it wasn't for its interest in my Bill Gates page, I would have just said this was one of all the e-mail harvesting bots, but now I'm not so sure...
Hummm, are we starting to see a correlation with the sites it hits? I think I may be about ½ as paranoid as I think I am, but then again, I have been doing way too much thinking lately.
It hits a site that talks about GOOGLE and MS SE.
It hits my site, which could be a competitor.
It hits a site that talks about MS UA's.
It hits an anti-Uncle Bill site.
Now, I don't work in R&D at Morton Thiokol, but I do do statistics. And I think I'm starting to see a trend here. However, I must admit that the sample size is much too small to have a real confidence level in any theories.
Anyone have a URL that it hasn't hit where they could set up a page about how MS does/has done this, that, or the other? Or are there as many people without any MS references that it has come to on more than one occasion?
I may go back to the deny mode until it gets sorted out. Heck, I haven't sold anything to anyone on msn anyway.
[edited by: jim_w at 9:45 pm (utc) on April 25, 2003]
Jim_w I'm assuming that you've looked at the pages in Google's index containing the U/A string?
|Jim_w I'm assuming that you've looked at the pages in Google's index containing the U/A string? |
OK, if I understand the question, you mean IE UA's? I was talking about pages with content.
If that wasn't it, Huh? Remember it's Friday and there is a higher probability for a human to make a mistake on Mondays and Fridays. At least that's the theory I'm sticking to.
|Or are there as many people without any MS references that it has come to on more than one occasion? |
That'd be my site...albeit nomothetically.
Btw - still awaiting that call/e-mail. Being late afternoon on the East Coast, I don't think I'll hear anything until perhaps Monday.
|OK, if I understand the question, you mean IE UA's? I was talking about pages with content. |
If that wasn't it, Huh?
Sorry, I wasn't trying to be cryptic. I meant if you search google for email@example.com [google.com] you can see a number of pages hit by the spider unrelated to MS queries.
You know, pixel_juice? I did that same search a few days ago, yet despite what 'Phoenix' (?) posts, I'm not altogether sure that what he/she stated is based on any kind of fact.
I kinda think they just assumed the validity based solely on the bot's appearance in their access_log files.