
Bing Search Engine News Forum

This 111 message thread spans 4 pages; this is page 3.
Someone at MS just got banned!
Was Bill Gates Surfing My site?
carfac




msg:1536704
 5:21 pm on Apr 11, 2003 (gmt 0)

Hi:

Just saw this guy, fell into a spider trap:

131.107.137.47 - - [11/Apr/2003:01:31:08 -0600] "GET /a/deep/link.html HTTP/1.1" 200 12589 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)"

No referer, came in on a deep link (like from a SE), and d/l pages but no images. After about 5 hits, he tried to grab a trap, and got banned. Grabbed a page every 5 secs or so...
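The pattern described above (no referer, pages but no images, several hits in a row) can be checked against an access log mechanically. What follows is only a rough sketch in Python, not what carfac's trap actually does: it assumes the Apache "combined" log format shown in the quoted line, and the names `parse_line`, `suspicious_ips`, `IMAGE_EXTS` and the `min_hits` threshold are made up for illustration.

```python
import re
from collections import defaultdict
from datetime import datetime

# Apache "combined" log format, matching the log line quoted in the post.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical list of extensions treated as images.
IMAGE_EXTS = (".gif", ".jpg", ".jpeg", ".png")


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    hit = m.groupdict()
    hit["time"] = datetime.strptime(hit["time"], "%d/%b/%Y:%H:%M:%S %z")
    return hit


def suspicious_ips(lines, min_hits=5):
    """IPs with at least min_hits requests, no referer at all, and no image requests."""
    by_ip = defaultdict(list)
    for line in lines:
        hit = parse_line(line)
        if hit:
            by_ip[hit["ip"]].append(hit)
    flagged = []
    for ip, hits in by_ip.items():
        no_referer = all(h["referer"] in ("", "-") for h in hits)
        no_images = not any(h["path"].lower().endswith(IMAGE_EXTS) for h in hits)
        if len(hits) >= min_hits and no_referer and no_images:
            flagged.append(ip)
    return flagged
```

A heuristic like this will also flag text-only browsers and people with images switched off, so it is better suited to picking candidates for a closer look than to banning automatically.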

IP resolves to Redmond.... did Bill just get himself banned?

dave

 

pendanticist




msg:1536764
 2:08 am on Apr 26, 2003 (gmt 0)

You know, pixel_juice? I did that same search a few days ago, yet despite what 'Phoenix' (?) posts, I'm not altogether sure that what he/she stated is based on any kind of fact.

I kinda think they just assumed the validity based solely on the bot's appearance in their access_log files.

<shrug>

Pendanticist.

jim_w




msg:1536765
 7:48 am on Apr 26, 2003 (gmt 0)

pixel_juice

Sorry, I wasn't trying to be cryptic.

It’s me, I’m an old programmer and if you don’t spell stuff out to me real plain I have a tendency to not get it! (GRIN) I also apparently tend to leave out major points in conversations, at least so I’ve been told.

OK, well not that I had time to check it out, but I found some interesting items and I did get curious.

90% or more of all pages mentioned something about an MS product or an item with the same name as an MS product. Of course that could go for 99.99997% of the web so that proves nothing one way or the other. Most of the time it was something like 'Best Viewed', etc.

I knocked down the search to just newbiecrawler and got a few more hits, 31 total. I also did not dig down into the sites, I just checked the 1st pages. Here they are in the order that was on google when I checked them.

WebMasterWorld was the 1st couple.

Some site in an unknown language.

Site in Australia for the Deaf with a quarterly magazine called AAD Outlook

Some German page with something about ‘die Spammer’ and ‘Security Warning’

Some unknown language page

A page about .NET, Microsoft Visual SourceSafe (VSS), etc.

Some French page

Some Japanese page

A software house

columbia.edu bio lab

Delphi page

Ripe

The next one was interesting it said ‘be Microsoft's biggest bitch.’ I know I have felt like that sometimes.

Some unknown language page

Some Japanese page

Linux page

microdocs-news

UCSC.Associate Professor .org

internet.watch japan

Engineering Workstations University of Illinois.

A car site

An ac.id site

A .NET site

A site with a link to Microsoft SOAP Toolkit v2.0

A Unix software site

Perl Scripts site

Techy: New Microsoft Search

Same car site as above

German page, but has a Google-Verzeichnis aktualisiert link (I don’t have a clue)

So I don’t think that says much, but it could look like a geek’s favorites list. Also they just started using newbiecrawler; I never got that on my site because I banned them before they started doing that. So if I had a crawled log page, it would not show up in the search.

Beats Me! (not literally of course) And these are just the pages that have their logs online where google can get to them, i.e. with a link to them I would guess.

mipapage




msg:1536766
 9:59 am on Apr 26, 2003 (gmt 0)

This is a bit off topic, so someone could sticky me an answer, but why all of the bot blocking? Solely to save bandwidth?
I have a feeling that I am stepping into a Pandora's box with many more sleepless nights of reading and learning in my future...

FWIW, I was hit by the 'newbie' last night.

pixel_juice




msg:1536767
 11:10 am on Apr 26, 2003 (gmt 0)

One thing that strikes me as having a big effect on the newbiecrawler results in the serps is that the only ones that will show are:

Forums and blogs that are discussing the crawler (likely to be technical sites and also mention microsoft) or they are sites that are publishing their server stats for all and sundry to see. I suppose the fact that they have (mostly) set up their own log stats means that they are more likely to be technical/mention microsoft (not sure of the conclusion here ;)...)

But that said, I checked the logs of some reasonably big commercial sites with no mention of newbie, so it does seem to be grabbing particular sites, and not just those it would ordinarily hit in a wide-reaching crawl.

Glacai




msg:1536768
 11:38 am on Apr 26, 2003 (gmt 0)

It's hit three of my sites and all mention MS, one of them only once. One other site I'd expect it to hit but hasn't, doesn't mention MS, maybe just a coincidence.

bull




msg:1536769
 12:33 pm on Apr 26, 2003 (gmt 0)

131.107.163.48 - - [23/Apr/2003:20:18:07 +0200] "GET /robots.txt HTTP/1.1" 403 - www.me.net "-" "MicrosoftPrototypeCrawler (How's my crawling? mailto:newbiecrawler@hotmail.com)" "-"

Two of my subpages in dir1/ contain links to the Microsoft System Journal. However, the crawler did not want them yet and _will_ not get them.

jan

carfac




msg:1536770
 3:24 pm on Apr 26, 2003 (gmt 0)

I have a couple of sites... NONE of which is remotely technical or has ANY mention of MS (or best viewed with IE).

All of my IPs are sequential (pretty much)- I did NOT see it hopping from 123.123.123.001 to 123.123.123.002, like SOME scanners do.

Some of my sites cross-link. I did not see it follow cross links.

I did see it go through my most popular site, and a second site, but not my second most popular site.

Oh, and it keeps trying to come back to my main site (but can't!). That site has a LOT of pages (50,000+) and is all over Google, so that might be why. (But the other sites represent well in G?)

I am clueless as to what this thing is REALLY after.

Is it a log spammer? That is another theory I have heard thrown about, but I discount that theory now...

dave

kwngian




msg:1536771
 4:17 pm on Apr 26, 2003 (gmt 0)


Something to do with a Slashdot.org link? They tend to bring in weird traffic, like the recent sudden surge in MSIECrawler, which never happened before.

Out of the 6-7 sites that I have access to, only one was hit and all the sites have a link to one or more of the other sites.

NorthernStudio




msg:1536772
 6:22 pm on Apr 26, 2003 (gmt 0)

Newbie was hitting one of my sites yesterday when this spider: 194.242.43.73 began calling for two or three pages per second. The actions of the two seemed to block each other and effectively shut down the site for about 20 minutes returning "System_resource_exceeded" to all page calls. The second spider was discussed previously and attributed to Artprice.com.

I offer this to those who consider "spider watching" and blocking an obsession of those with something to hide or concerned with a few pennies of bandwidth.

Wayne

jim_w




msg:1536773
 7:17 pm on Apr 26, 2003 (gmt 0)

If I had to reckon, which is of course all I can do, I still reckon that it is a fresh-out or intern new hire by MS. Ergo, (newbiecrawler) newbie – is new employee and crawler - is my crawler. This could also be one reason for the hotmail addy, they hadn’t got the MS mail account set-up yet, or had problems with their new MS Exchange mail account. (and we all know that hardly ever happens) And/or they weren’t sure of what email the MS email police might or might not be looking at. So hotmail was safe and reliable.

Being a programmer and knowing that a program is never really finished until it is obsolete, I would reckon that it keeps evolving. For example, when it 1st came to my site, they came via google and signed up for the newsletter. About 6 hours later, the bot started. This was before the ‘newbiecrawler’ stuff was added.

They found out that this was too time intensive for some reason or they were getting into trouble for surfing the net, so a little automation was added, as well as the newbiecrawler@hotmail. They then started looking at the links on the pages crawled and following them. They then found out that that idea was producing too many results that were unrelated to the desired topic(s). This would explain some of the hits in the list that seem to be outliers. So it was changed again to search, well google for example, for a set of keywords, probably in several queries, and then follow the top results. No doubt still producing too many hits to unrelated topics, (this would explain carfac’s results), because while they may know how to program, they don’t know how to use google to narrow the search results. If this is so, expect to see another change.

Of course if they had any experience, they would hit all the big SE’s, and build a link list of the top 5 results that match on all SE’s and then a link list of sites that were not matches in the results of ALL SE’s used. This way they would be getting just the top results from more than one SE algorithm. But it would get stale after the 1st 2 or 3 crawls. I don’t want to give them any ideas because I am sure they are now reading this. Just a hunch! (GRIN)

If it is a new hire at MS, they may have padded their resume, not that anyone would do that of course, and are now trying to learn some new stuff. This would explain the PERL and UNIX hits. Or they are trying to convert some free UNIX and PERL stuff to MS .NET to look like a hero in their new job. To me anyway, this seems like the most logical scenario.

I am 99.999% sure that 131.107.137.47 is not a MS SE bot, but some Bozo at their desk.

it does seem to be grabbing particular sites, and not just those it would ordinarily hit in a wide-reaching crawl.

I agree with pixel_juice. The scope of the logs is much too small to be a legitimate bot. There would have been at least 100 pages of logs and not 31.

Is it a log spammer? That is another theory I have heard thrown about, but I discount that theory now...

I also agree with carfac on this. And they are not email harvesting. The last 3 hits on my site were robots.txt, my links page and my policy page. Some got through and some didn’t, because I keep changing the deny; hopefully this is confusing them and making them fix bugs that don’t exist. The links page would have the keywords they would be searching for, and the policy page is a new one, no doubt after reading this forum or getting a hand slap by MS admin. The only good thing is it seems to obey robots.txt, which the person probably thinks will keep them out of hot water.
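For anyone wanting to see what "obeying robots.txt" would mean for this crawler, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `allowed` helper and the example.com URL are hypothetical; only the `MicrosoftPrototypeCrawler` name comes from the logs posted in this thread. It says nothing about how the crawler itself reads the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out the prototype crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: MicrosoftPrototypeCrawler
Disallow: /

User-agent: *
Disallow:
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed("MicrosoftPrototypeCrawler", "http://www.example.com/links.html"))
print(allowed("Googlebot", "http://www.example.com/links.html"))
```

This only helps with crawlers that choose to honor the file; anything that ignores it still needs a server-side deny.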

I wonder what MS would do if every Web Master denied 131.107. This would pretty much make their internet connection useless. Or better yet, redirect all 131.107.xx.xx back to microsoft.com and literally let them crawl all over themselves. (He says evilly while moving his eyebrows up and down rapidly)
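As a sketch of what "denying 131.107." amounts to, the check below uses Python's standard `ipaddress` module. Treating the whole of 131.107.0.0/16 as one block is an assumption carried over from the post, not something verified here, and `BLOCKED` and `is_blocked` are illustrative names.

```python
import ipaddress

# Assumption from the thread: treat all of 131.107.x.x as a single block.
BLOCKED = ipaddress.ip_network("131.107.0.0/16")


def is_blocked(ip, network=BLOCKED):
    """Return True if the dotted-quad string ip falls inside network."""
    return ipaddress.ip_address(ip) in network


print(is_blocked("131.107.137.47"))  # the address from the first post
print(is_blocked("194.242.43.73"))   # the other spider mentioned earlier
```

Comparing parsed addresses against a network is less error-prone than matching on a string prefix, where a truncated prefix such as "131.10" would also catch unrelated ranges.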

jim_w




msg:1536774
 9:36 pm on Apr 26, 2003 (gmt 0)

Just for grins, I went to microsoft.com and did a search for birney. I got

Bill Birney
Microsoft® Windows® Movie Maker Handbook
Bill Birney has a background in the film and video industry

pixel_juice




msg:1536775
 9:39 pm on Apr 26, 2003 (gmt 0)

Could be a relation, but the other guy was called Keith wasn't he?

jim_w




msg:1536776
 9:48 pm on Apr 26, 2003 (gmt 0)

Yes, it was Keith.

I worked at Motorola for 11 years, and it was common to look in the local in-plant phone book and see several people with the same last name. And if the last name was something other than Smith, etc., 9 times out of 10 they were related. Typically a parent would get their child a job there, etc. A lot of marriages also.

I saw a show on PBS once and they were talking to a VP at Motorola and the interviewer asked about inbreeding because big companies work like that a lot. The thought never crossed my mind until he asked the question. Heck, I married 2 women that worked at Motorola in those 11 years, and it was common for exes of one person or another to have to be moved to another department because of potential problems. My guess is they are related, but not married.

pendanticist




msg:1536777
 10:03 pm on Apr 26, 2003 (gmt 0)

If I had to reckon, which is of course all I can do, I still reckon that it is a fresh-out or intern new hire by MS. Ergo, (newbiecrawler) newbie – is new employee and crawler - is my crawler. This could also be one reason for the hotmail addy, they hadn’t got the MS mail account set-up yet, or had problems with their new MS Exchange mail account. (and we all know that hardly ever happens) And/or they weren’t sure of what email the MS email police might or might not be looking at. So hotmail was safe and reliable.

I am 99.999% sure that 131.107.137.47 is not a MS SE bot, but some Bozo at their desk.

[webmasterworld.com...]

If you re-read the link above, you'll note I said Mr Keith Birney is a confirmed employee of MS.

In fact, the first receptionist I spoke with Friday afternoon asked if I wanted to speak to him directly - to which I said "No". Ergo, since his legitimacy within MS as an employee has already been established, the ... MS email police... is kinda moot. Employees get lots of perquisites, including e-mail accounts.

Whether he's operating within the ethical constructs (job description) of MS is the question at hand and the primary reason for the direct-to-the-horse's-mouth phone call as noted in an earlier post of mine. Dontcha just love older technology? <-Rhetorical Question.

  • Be advised (if Neotracing www.msn.com) the tech number (1.425.882.8000 {?}) is no longer working. A little digging around will get you a toll-free (1.800.642.7676).

    As for your doing that search. Well, the only thing that indicates to me is that nepotism may be alive and well even at MS.

    Pendanticist.

    jim_w




    msg:1536778
     10:28 pm on Apr 26, 2003 (gmt 0)

    Agreed. The question I had on my mind was, is the IP 131.107.137.47 a computer dedicated to running a SE bot and he was in charge of developing the SE bot, or is it the IP of a computer sitting on someone’s desk that they are playing with .NET on? I.e., is he the engineer in charge of a SE bot, or just an engineer?

    MS email police

    Not if he is violating MS company policy. There was a case several years back where a Xerox employee got canned because of email police at Xerox. So I was thinking along those lines.

    Dontcha just love older technology? <-Rhetorical Question.

    hehehehehehehe.

    nepotism may be alive and well even at MS.

    When I worked at .M. and that would happen, they never said this person is starting and they are a graduate of, xyz, or they have a degree in, xyz, it was always, such and such is starting and they are someone’s son, daughter, etc. and that was it. I think this kind of points to the new hire theory personally. But, that is just my opinion.

    pendanticist




    msg:1536779
     3:33 am on Apr 27, 2003 (gmt 0)

    Well, so far this thing's used three IP Numbers and if he was in violation of corporate policy, you'd think he'd be nabbed by now, eh?

    If he is a 'new-hire' he's an awfully bodacious one! Like, the size of grapefruit.

    :)

    ZZzzzz...

    Pendanticist.

    jim_w




    msg:1536780
     4:09 am on Apr 27, 2003 (gmt 0)

    you'd think he'd be nabbed by now

    And you would think that every 2 weeks I wouldn’t be getting a new security fix for IE either. So I think we have to consider the source. Of course that every 2 week security fix thing could also correlate to the ‘+’s in the UA as well. :-))

    The second IP number I got didn’t have the ‘newbiecrawler’ as the UA, but had +’s, so between having a different UA and different IP, I’m not convinced it was the same person. But it may have been. Who knows besides him?

    eh?

    Canadian by any chance or have you just watched ‘Strange Brew’ once too often as I have?

    'new-hire' he's an awfully bodacious

    In my experience at the big M, I’ve seen worse. But it would take a small book to explain. I’m probably wrong.

    GunnerM




    msg:1536781
     11:00 pm on Apr 27, 2003 (gmt 0)

    does anyone know of a good freeware/shareware, preferably perl product that is a reliable "spider trap"?

    thanx, gunner m :)

    wilderness




    msg:1536782
     11:15 pm on Apr 27, 2003 (gmt 0)

    does anyone know of a good freeware/shareware

    [webmasterworld.com...]

    pendanticist




    msg:1536783
     11:58 pm on Apr 27, 2003 (gmt 0)

    Well, I've not gotten any of the "+ " in my UAs and Strange Brew for sure.

    :)

    Once tomorrow comes, I should hear something back from MS. Until then, we'll all speculate our weekends away...er, ah, we may have already done that...eh?

    <chuckle>

    There is always the possibility that Microsoft wants Google traffic [webmasterworld.com] too.

    Pendanticist.

    bunltd




    msg:1536784
     3:57 am on Apr 28, 2003 (gmt 0)

    Just came across this thread and checked my logs: FWIW starting around the 17th through the 26th

    131.107.163.49 - MicrosoftPrototypeCrawler (please report obnoxious behavior to newbiecrawler@hotmail.com

    and

    131.107.65.225 - Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.2;+.NET+CLR+1.1.4322)

    Although I am at a loss as to what exactly it is doing there (it doesn't seem to follow any particular pattern), it will be interesting to see what pendanticist learns.

    LisaB

    jim_w




    msg:1536785
     6:12 am on Apr 28, 2003 (gmt 0)

    I should hear something back from MS

    Well I hope so, but being the cynic I am, it would be my perception that the only time MS called me back was after I gave them a CC# and they charged me $175.00 to prove to them they had a bug in one of their compilers. (wondering how MS got so rich)

    we'll all speculate our weekends

    Actually the biggest concern I have right now is, are any of the IP’s in 131.107. used by MSN. I already know I need to block their corporate HQ, but I don’t really want to block their SE bot or MSN users. Although, that would probably be fine with AOL and google. God knows I’ve shot myself in the foot enough for one lifetime.

    The alleged ‘MicrosoftPrototypeCrawler’ hasn’t been back to see us since 25/Apr/2003:19:25:59 -0500 and they read robots.txt. 2 days before that they tried to snag our policy page and got 403’ed, but it may have already been in a cache somewhere, so it looks like they may be leaving us alone.

    If google traffic was their #1 goal, they would probably already have it. They have so much money that they can throw at their top goals (I don’t have a clue how they got so much, though) that it isn’t even funny. Look at what happened to Netscape. And the Apple lawsuit was so costly for Apple, I’ll bet Apple wishes in hindsight they had spent that money on R&D for their OS. They even figured out how to get the upper hand with IBM and WARP.

    Someone may find this funny. Even MS uses google. (GRIN)
    131.107.3.86 - - [08/Apr/2003:11:34:35 -0500] "GET / HTTP/1.0" 200 31279 "http://www.google.com/search?q=xxx+xxxx&hl=en&lr=&ie=UTF-8&oe=UTF-8&start=10&sa=N" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Q312461; .NET CLR 1.1.4322)"

    If they call, see if you can find out what IP’s their SE bot will use. Tell them it’s so that 90% of the professional webmasters won’t block their SE bot by mistake. (GRIN) I’ve had to contact several SE’s to get the IP’s they use because I also have an AXS log and I like to filter the SE bots out of it so that I have stats on just the eyeballs that see the page, and of course the ‘evil doers’ that I can spot right away and ban. I wish all SE bots would publish what IP’s their bots used.
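The kind of filtering described above, dropping known search engine bots so the stats only count real visitors, might look roughly like this in Python. It is a sketch and has nothing to do with how AXS works internally. The network and the user-agent marker are placeholders built from addresses posted in this thread; nobody has confirmed them as an official crawler range, and `BOT_NETWORKS`, `BOT_AGENT_MARKERS`, `is_bot` and `human_hits` are invented names.

```python
import ipaddress

# Placeholder lists: fill these in from whatever each search engine confirms.
# 131.107.163.48/30 just covers the .48-.50 addresses reported in this thread.
BOT_NETWORKS = [ipaddress.ip_network("131.107.163.48/30")]
BOT_AGENT_MARKERS = ["MicrosoftPrototypeCrawler"]


def is_bot(ip, agent):
    """True if the IP is in a listed network or the user agent has a listed marker."""
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in BOT_NETWORKS):
        return True
    return any(marker.lower() in agent.lower() for marker in BOT_AGENT_MARKERS)


def human_hits(hits):
    """Keep only the (ip, agent) pairs that do not look like a known bot."""
    return [(ip, agent) for ip, agent in hits if not is_bot(ip, agent)]
```

Matching on both IP and user agent matters here because, as noted earlier in the thread, hits from the same address space showed up with a plain MSIE user agent as well.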

    And may your beer not be warmer or yellowier.

    Lisa-
    131.107.65.225 - Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.2;+.NET+CLR+1.1.4322)

    Do you have the time and date this happened? I want to correlate it with mine to see if it was about the same time frame. This was obviously a coding error by someone who no doubt fixed it right away, but I am curious.

    bunltd




    msg:1536786
     4:55 pm on Apr 28, 2003 (gmt 0)

    Do you have time and date this happened?

    Jim, yes... here's the gist:

    Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.2;+.NET+CLR+1.1.4322
    showed requests around:
    17/Apr/2003:14:09:55
    18/Apr/2003:11:58:09
    19/Apr/2003:19:58:44
    22/Apr/2003:17:24:31
    25/Apr/2003:01:03:13
    26/Apr/2003:17:41:09

    jim_w




    msg:1536787
     5:11 pm on Apr 28, 2003 (gmt 0)

    This was obviously a coding error by someone who no doubt fixed it right away

    Or maybe they didn't?

    AAnnAArchy




    msg:1536788
     8:24 pm on Apr 28, 2003 (gmt 0)

    131.107.163.50 MicrosoftPrototypeCrawler (How's my crawling? mailto:newbiecrawler@hotmail.com) 04/28/03 01:20 PM Viewing a user's profile

    So, has anyone found out what the deal is yet? My site has nothing to do with MS - it's a fansite board that it's crawling right now.

    pendanticist




    msg:1536789
     9:16 pm on Apr 28, 2003 (gmt 0)

    So, has anyone found out what the deal is yet?

    Yes and no. Being somewhat impatient, I called them just a few minutes ago (5:00 EST). The receptionist 'did' remember me and stated that whoever is in charge (I assume publicity/legal) underwent some form of surgery Thursday.

    As she said to me today, 'when we spoke Friday she had no idea this particular individual was the one who 'clears' any forthcoming information' therefore she couldn't tell me he/she was out.

    I believe the 'in charge' speaks to no one in particular, just that this individual must be in the loop with respect to any public discussions/admissions whatever. So, for the moment, let's call him/her Public Relations.

    I did re-stress our concerns as unilaterally as I could, saying that "Webmasters from around the World are more than concerned as to the authenticity of the relationship w/MS based solely on the moniker 'MicrosoftPrototypeCrawler' as we've seen in our log files."

    I wish I had more definitive information, but I do not.

    Of course, if Brett wants to make that phone call too.....

    Pendanticist.

    pendanticist




    msg:1536790
     10:03 pm on Apr 28, 2003 (gmt 0)

    Here's the scoop!

    This is indeed a Microsoft sanctioned crawler!

    ...it is something Microsoft created and soon instead of having the newbiecrawler@hotmail.com contact for question, it will have a microsoft email address to avoid confusion.

    There it is, folks.

    Take it and ruuuuunnnnnnn.

    Pendanticist.

    pixel_juice




    msg:1536791
     10:37 pm on Apr 28, 2003 (gmt 0)

    Wow! Where'd you hear that pendanticist? I was just about to post to agree with "Webmasters from around the World are more than concerned..."

    jdMorgan




    msg:1536792
     10:40 pm on Apr 28, 2003 (gmt 0)

    pendanticist,

    Thanks for chasing this down...

    You might suggest to them - if you haven't already - that putting up a web page with their crawler particulars on it, and including that URL in their UA string would be a good idea. That way, webmasters don't have to wait for an answer, and they won't have thousands of e-mails to answer every day.

    Thanks again,
    Jim

    pendanticist




    msg:1536793
     10:42 pm on Apr 28, 2003 (gmt 0)

    As I said earlier in this thread, I called Microsoft Friday and again today.

    See also It's Official!
    MicrosoftPrototypeCrawler is legitimate!
    [webmasterworld.com]

    <added>
    The response I got (after the phone call as noted above) was the snippet posted which came to me as e-mail. It is by no means comprehensive.

    Jim, I'll answer your question thru the above thread, as best I can.

    As far as I'm concerned this thread is dead.
    </added>

    Pendanticist.

    AAnnAArchy




    msg:1536794
     12:41 am on Apr 29, 2003 (gmt 0)

    Well, we banned it anyway. It was slowing down our board...and there's nothing I dislike more than when my own sites don't load quickly.
