Is MSNBot the most stupid, ignorant search-bot?

Going mad on my site these last weeks, so maybe I am prejudiced


AlexK

5:46 am on Nov 2, 2005 (gmt 0)

<rant>

In early October I implemented a bot-blocking routine [webmasterworld.com] that finally worked in blocking long-term scrapers (bots that take a page every second or so until they have scraped an entire site of maybe 100,000 pages). It worked very well. Too well! The Google Adsense-bot (21,109 pages in October) and Inktomi Slurp (55,726 pages) were both caught taking more than the maximum 777 pages allowed in a 24-hour period. However, both seem to have learned--or been re-programmed--and it no longer happens.
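
A minimal sketch of that kind of per-IP daily page cap (not the actual routine from the linked thread; the counter-file location and the limit are illustrative only):

    <?php
    // Sketch of a per-IP daily page cap (illustrative; not the linked routine).
    $limit   = 777;                                  // max pages per IP per 24 hours
    $ip      = $_SERVER['REMOTE_ADDR'];
    $day     = gmdate('Ymd');
    $counter = sys_get_temp_dir() . "/botcap-{$day}-" . md5($ip);

    $hits = is_file($counter) ? (int)file_get_contents($counter) : 0;
    if ($hits >= $limit) {
        header('HTTP/1.0 503 Service Unavailable');  // tell the client to back off
        header('Retry-After: 3600');
        exit('Daily page limit reached.');
    }
    file_put_contents($counter, $hits + 1);          // count this request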

A week or more ago the MSNBot started to get caught in the same snare. I upped the limit to 1,000 pages per day, but still it went on, and on, and on. This bot took 15,972 pages in October. In the first 28 hours of November it has taken 1,110 pages and tried to take another 1,767 (which were refused due to the bot-blocking). It:

  1. Does not use If-Modified-Since
  2. Does not recognise Expires
  3. (therefore keeps asking for pages which have not changed)
  4. Cannot accept compressed pages

This wretched bot takes more bandwidth than all the other bots put together. I vote it the most stupid and ignorant bot out in the wild.

</rant>

jmccormac

6:24 am on Nov 2, 2005 (gmt 0)

It is not really a bot - it is a maggot. The failure to use the If-Modified-Since header is a big drain on bandwidth resources for large sites. The MSNbot is not stupid or ignorant - it is the fools who wrote it. I have seen worse, but most of them are maggots - webscrapers and downloaders.

Microsoft should stop jerking around with their own search technology and buy a real search engine.

Regards...jmcc

AlexK

3:32 am on Nov 3, 2005 (gmt 0)

A magbot. Yes! I like it.

Lord Majestic

3:46 am on Nov 3, 2005 (gmt 0)

Does not use If-Modified-Since

Do you know any other web-scale crawlers that support If-Modified-Since? I think Googlebot might support it now, but it is certainly a very recent (2005) change. If-Modified-Since is trivial for browsers but not for bots that have to deal with billions and billions of URLs.

AFAIK MSNbot supports the Crawl-Delay parameter, so you could easily have limited the number of requests per day -- say, make it 120 seconds or higher -- or, better, exclude some or all of your URLs via robots.txt and you won't have a problem with the bot(s).
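
For example, a robots.txt along these lines would throttle or exclude MSNbot (the 120-second delay and the /archive/ path are just placeholders):

    User-agent: msnbot
    Crawl-delay: 120
    Disallow: /archive/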

Microsoft should stop jerking around with their own search technology and buy a real search engine.

Like which one? Who is much better than Microsoft at search engines? Google, Yahoo and then who? Pretty much nobody, and only a fool would license core technology from key competitors. It's not an option for Microsoft, and it's good that they invest the money, because there is nothing worse than a monopoly - be it Microsoft's or Google's.

AlexK

4:18 am on Nov 3, 2005 (gmt 0)

Lord Majestic:
Do you know any other web scale crawlers that support If-Modified-Since?

Standard Google-bot since October 2002. Also, Inktomi Slurp! and Baiduspider. However, not the Adsense bot nor the G-Mozilla-bot [webmasterworld.com].

It is trivial for the search technology to implement. They simply need to have the will to do it.

Lord Majestic

4:27 am on Nov 3, 2005 (gmt 0)

Can't find info on Googlebot, but Yahoo's bot has supported this only from this year: [ysearchblog.com...]

It is trivial for the search technology to implement.

Nothing is trivial when you deal with billions of pages - MSNbot has been in development for far less time than any other major bot, so it's not surprising it's not as advanced as the other bots.

Your main issue seems to be with the number of requests rather than with support for If-Modified-Since; this can be controlled using the Crawl-Delay parameter, which MSNbot was (I think) the first to support.

The reason some bots don't support compressed pages is most likely that very few webservers actually support it.

AlexK

5:51 am on Nov 3, 2005 (gmt 0)

Can't find info on Googlebot

Follow the link in msg#5.

Nothing is trivial when you deal with billions of pages

The bots already store the date a page was cached, and they receive all the other info in the headers. It is trivial for them to store the Last-Modified date together with everything else and include it in the request header. It is a matter of will, not technology.
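
To illustrate the handshake being described, a minimal server-side sketch (the filename is a placeholder; a real site would take the date from its own content store):

    <?php
    // Send Last-Modified, and answer a conditional request with 304 Not Modified.
    $lastModified = filemtime('article.html');               // illustrative source of the date
    header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT');

    if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE']) &&
        strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']) >= $lastModified) {
        header('HTTP/1.1 304 Not Modified');                 // nothing has changed -- send no body
        exit;
    }
    // ...otherwise generate and output the full page as normal.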

Your main issue seems to be with number of requests

My issue is with the bandwidth that the bots take, and specifically the server resources that they consume.

The reason that some bots don't support compressed pages is most likely due to very few webservers actually supporting it.

Nonsense (sorry). Compression has been available since the formulation of HTTP/1.0 and all webservers + browsers now support it without problems. I've provided a PHP class to implement it [webmasterworld.com], and that typically gives 80% compression on standard HTML - a vast saving for both the webmaster and user.
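
The simplest form of it (a sketch, not the class from the linked thread) is PHP's built-in gzip output handler, which only compresses when the client's Accept-Encoding header asks for it:

    <?php
    // Compress the output buffer when the client advertises gzip support.
    if (!ob_start('ob_gzhandler')) {
        ob_start();    // client cannot take gzip -- fall back to plain output
    }
    // ...generate the page as usual; the buffer is compressed on flush.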

PS The Teoma crawler (Ask Jeeves) also supports compression.

Lord Majestic

2:14 pm on Nov 3, 2005 (gmt 0)

Nonsense (sorry).

AlexK - I have written a bot that has crawled 2 billion pages so far; it supports gzip among other things, and from this very representative sample I concluded that the number of servers that support gzip is extremely low: I was very disappointed. The fact that it's easy to support, and that you have a PHP class for it, is not relevant to the simple fact that very few servers support gzip - that's a fact.
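
For what it's worth, the crawler side of this is also only a few lines -- a sketch using PHP's curl extension (the URL and the last-crawled timestamp are placeholders):

    <?php
    // Ask for gzip and send If-Modified-Since in one request.
    $lastCrawled = time() - 86400;                            // e.g. page last fetched a day ago
    $ch = curl_init('http://www.example.com/page.html');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_ENCODING, 'gzip');               // advertise gzip, auto-decompress reply
    curl_setopt($ch, CURLOPT_TIMECONDITION, CURL_TIMECOND_IFMODSINCE);
    curl_setopt($ch, CURLOPT_TIMEVALUE, $lastCrawled);
    $body = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);            // 304 means "unchanged, no body sent"
    curl_close($ch);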

AlexK

3:40 pm on Nov 3, 2005 (gmt 0)

OK, Lord Majestic, we are using the same word ("supports") from different points of view, hence the disagreement. You are using the word from the "it is enacted, active and is working" point of view, and I am using it from the "it is there if you just switch it on, dummy" point of view.

I have to admit that there is merit in *your* POV; I hope that you can also see the merit in mine. My former hosts went goggle-eyed on me when I specified dual-multi-Gig-Xeons for the Linux box that they were going to build for me - after all, who needs such specs when Linux can run on a 486, right? Wrong! There is a reason that humankind is built with eyes that face forward.

It's OK, I feel better now, nurse.

Lord Majestic

3:59 pm on Nov 3, 2005 (gmt 0)

"it is enacted, active and is working" point of view

True - I am merely commenting on the way things are. I wish more people like you switched on support for gzip because, believe me, bots want to use as little bandwidth as possible, and if any significant number of sites supported gzip then most bots would support it too.

It's OK, I feel better now, nurse.

That's good - invoice is in the post ;)