Just saw this guy, fell into a spider trap:
131.107.137.47 - - [11/Apr/2003:01:31:08 -0600] "GET /a/deep/link.html HTTP/1.1" 200 12589 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)"
No referer, came in on a deep link (like from a SE), and d/l pages but no images. After about 5 hits, he tried to grab a trap, and got banned. Grabbed a page every 5 secs or so...
IP resolves to Redmond.... did Bill just get himself banned?
dave
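For anyone who wants to spot this pattern without eyeballing the logs, here is a rough Python sketch of the checks described above (no referer, pages but no images, several hits in a row). It assumes the standard Apache combined log format shown in the line above; the names `parse_line` and `looks_like_bot`, the image extension list, and the 5-hit threshold are made up for illustration and are not anybody's actual trap.

```python
import re
from datetime import datetime

# Apache "combined" log format, as in the sample line above (assumed format).
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical list of extensions that count as "images".
IMAGE_EXT = (".gif", ".jpg", ".jpeg", ".png")


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    hit = m.groupdict()
    hit["time"] = datetime.strptime(hit["time"], "%d/%b/%Y:%H:%M:%S %z")
    return hit


def looks_like_bot(hits, min_hits=5):
    """Heuristic only: enough hits, never a referer, never an image request."""
    if len(hits) < min_hits:
        return False
    no_referer = all(h["referer"] in ("-", "") for h in hits)
    no_images = not any(h["path"].lower().endswith(IMAGE_EXT) for h in hits)
    return no_referer and no_images
```

Group the parsed hits by `ip` first and pass each group to `looks_like_bot`. A real browser with images turned off would trip this too, so treat it as a hint and not as proof.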
I kinda think they just assumed the validity based solely on the bot's appearance in their access_log files.
<shrug>
Pendanticist.
Sorry, I wasn't trying to be cryptic.
It’s me, I’m an old programmer and if you don’t spell stuff out to me real plain I have a tendency to not get it! (GRIN) I also apparently tend to leave out major points in conversations, at least so I’ve been told.
OK, well not that I had time to check it out, but I found some interesting items and I did get curious.
90% or more of all pages mentioned something about a MS product or an item with the same name as a MS product. Of course that could go for 99.99997% of the web so that proves nothing one way or the other. Most of the time it was something like 'Best Viewed', etc.
I knocked down the search to just newbiecrawler and got a few more hits, 31 total. I also did not dig down into the sites, I just checked the 1st pages. Here they are in the order they appeared on Google when I checked them.
WebMasterWorld was the 1st couple.
Some site in an unknown language.
Site in Australia for the Deaf with a quarterly magazine called AAD Outlook
Some German page with something about ‘die Spammer’ and ‘Security Warning’
Some unknown language page
A page about .NET, Microsoft Visual SourceSafe (VSS), etc.
Some French page
Some Japanese page
A software house
columbia.edu bio lab
Delphi page
Ripe
The next one was interesting; it said ‘be Microsoft's biggest bitch.’ I know I have felt like that sometimes.
Some unknown language page
Some Japanese page
Linux page
microdocs-news
UCSC.Associate Professor .org
internet.watch japan
Engineering Workstations University of Illinois.
A car site
An ac.id site
A .NET site
A site with a link to Microsoft SOAP Toolkit v2.0
A Unix software site
Perl Scripts site
Techy: New Microsoft Search
Same car site as above
German page, but has a Google-Verzeichnis aktualisiert link (I don’t have a clue)
So I don’t think that says much, but it could look like a geek's favorites list. Also they just started using newbiecrawler; I never got that on my site because I banned them before they started doing that. So if I had a crawled log page, it would not show up in the search.
Beats Me! (not literally of course) And these are just the pages that have their logs online where google can get to them, i.e. with a link to them I would guess.
Forums and blogs that are discussing the crawler (likely to be technical sites and also mention microsoft) or they are sites that are publishing their server stats for all and sundry to see. I suppose the fact that they have (mostly) set up their own log stats means that they are more likely to be technical/mention microsoft (not sure of the conclusion here ;)...)
But that said, I checked the logs of some reasonably big commercial sites with no mention of newbie, so it does seem to be grabbing particular sites, and not just those it would ordinarily hit in a wide-reaching crawl.
131.107.163.48 - - [23/Apr/2003:20:18:07 +0200] "GET /robots.txt HTTP/1.1" 403 - www.me.net "-" "MicrosoftPrototypeCrawler (How's my crawling? mailto:newbiecrawler@hotmail.com)" "-"
Two of my subpages in dir1/ contain links to the Microsoft System Journal. However, the crawler did not want them yet and _will_ not get them.
jan
All of my IPs are sequential (pretty much). I did NOT see it hopping from 123.123.123.001 to 123.123.123.002, like SOME scanners do.
Some of my sites cross-link. I did not see it follow cross links.
I did see it go through my most popular site, and a second site, but not my second most popular site.
Oh, and it keeps trying to come back to my main site (but can't!). That site has a LOT of pages (50,000+) and is all over Google, so that might be why. (But the other sites represent well in G?)
I am clueless as to what this thing is REALLY after.
Is it a log spammer? That is another theory I have heard thrown about, but I discount that theory now...
dave
I offer this to those who consider "spider watching" and blocking an obsession of those with something to hide or concerned with a few pennies of bandwidth.
Wayne
Being a programmer and knowing that a program is never really finished until it is obsolete, I would reckon that it keeps evolving. For example, when it 1st came to my site, they came via google and signed up for the newsletter. About 6 hours later, the bot started. This was before the ‘newbiecrawler’ stuff was added.
They found out that this was too time intensive for some reason or they were getting into trouble for surfing the net, so a little automation was added, as well as the newbiecrawler@hotmail. They then started looking at the links on the pages crawled and following them. They then found out that that idea was producing too many results that were unrelated to the desired topic(s). This would explain some of the hits in the list that seem to be outliers. So it was changed again to search, well google for example, for a set of keywords, probably in several queries, and then follow the top results. No doubt still producing too many hits to unrelated topics (this would explain carfac’s results), because while they may know how to program, they don’t know how to use google to narrow the search results. If this is so, expect to see another change.
Of course if they had any experience, they would hit all the big SE’s, and build a link list of the top 5 results that match on all SE’s and then a link list of sites that were not matches in the results of ALL SE’s used. This way they would be getting just the top results from more than one SE algorithm. But it would get stale after the 1st 2 or 3 crawls. I don’t want to give them any ideas because I am sure they are now reading this. Just a hunch! (GRIN)
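To spell out what I mean by the two link lists, here is a small Python sketch. It is only my guess at how one could do it, not what their crawler does. The function name `split_by_agreement` and the engine names in the usage note are invented, and it works on result lists you already have in hand (it does not query any search engine).

```python
def split_by_agreement(results_by_engine, top_n=5):
    """Split top-N results into (in every engine's top N, everything else).

    results_by_engine maps an engine name to its ranked list of URLs.
    """
    # Keep only the top N per engine, dropping duplicates but keeping rank order.
    tops = [list(dict.fromkeys(urls[:top_n]))
            for urls in results_by_engine.values()]
    if not tops:
        return [], []

    # URLs that every engine placed in its top N.
    common = set(tops[0]).intersection(*tops[1:])
    in_all = [u for u in tops[0] if u in common]

    # URLs that at least one engine ranked but not all of them did.
    seen = set()
    not_in_all = []
    for top in tops:
        for u in top:
            if u not in common and u not in seen:
                seen.add(u)
                not_in_all.append(u)
    return in_all, not_in_all
```

For example, `split_by_agreement({"engine_a": ["x", "y", "z"], "engine_b": ["y", "x", "w"]})` gives `(["x", "y"], ["z", "w"])`. The first list is ordered by the first engine's ranking, which is an arbitrary choice.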
If it is a new hire at MS, they may have padded their resume, not that anyone would do that of course, and are now trying to learn some new stuff. This would explain the PERL and UNIX hits. Or they are trying to convert some free UNIX and PERL stuff to MS .NET to look like a hero in their new job. To me anyway, this seems like the most logical scenario.
I am 99.999% sure that 131.107.137.47 is not a MS SE bot, but some Bozo at their desk.
it does seem to be grabbing particular sites, and not just those it would ordinarily hit in a wide-reaching crawl.
I agree with pixel_juice. The scope of the logs is much too small to be a legitimate bot. There would have been at least 100 pages of logs and not 31.
Is it a log spammer? That is another theory I have heard thrown about, but I discount that theory now...
I also agree with carfac on this. And they are not email harvesting. The last 3 hits on my site were robots.txt, my links page and my policy page. Some of their hits got through and some didn’t because I keep changing the deny, so hopefully this is confusing them and making them fix bugs that don’t exist. The links page would have the keywords they would be searching for, and the policy page is a new one, no doubt after reading this forum or getting a hand slap by MS admin. The only good thing is it seems to obey robots.txt, which the person probably thinks will keep them out of hot water.
I wonder what MS would do if every Web Master denied 131.107. This would pretty much make their internet connection useless. Or better yet, redirect all 131.107.xx.xx back to microsoft.com and literally let them crawl all over themselves. (He says evilly while moving his eyebrows up and down rapidly)
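If it really does obey robots.txt, a rule naming it should keep it out. Here is a quick way to check what a rule would do, using Python's standard `urllib.robotparser`. The robots.txt content and the `/links.html` path are made-up examples, and I am assuming the crawler matches on the "MicrosoftPrototypeCrawler" name from its user agent string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that shuts out only this one crawler.
ROBOTS_TXT = """\
User-agent: MicrosoftPrototypeCrawler
Disallow: /
"""


def allowed(agent, path, robots_txt=ROBOTS_TXT):
    """Return True if a well-behaved bot with this agent name may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)
```

With that file, `allowed("MicrosoftPrototypeCrawler", "/links.html")` is False while other agent names are still let in. This only tells you what a polite bot should do, so the deny stays in place as a backstop.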
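For the curious, "deny 131.107" just means matching the whole 131.107.0.0/16 block. A tiny Python sketch with the standard `ipaddress` module is below; the names `BLOCKED` and `is_blocked` are mine, and the actual deny or redirect would of course live in your server config, not in a script like this.

```python
import ipaddress

# The 131.107.x.x range discussed in this thread, written as a /16 network.
BLOCKED = ipaddress.ip_network("131.107.0.0/16")


def is_blocked(ip):
    """Return True if ip falls inside the blocked range; False for bad input."""
    try:
        return ipaddress.ip_address(ip) in BLOCKED
    except ValueError:
        # Not a valid IP address string, so it cannot match the range.
        return False
```

Both addresses seen in the logs in this thread, 131.107.137.47 and 131.107.163.48, fall inside that range.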