Great feedback all around and thanks for the warm welcome! Rosalind, that is an interesting idea about supporting a revisit-after tag.
I also want to make folks aware of a feature that MSNBot supports, but which is not yet documented. We do support what we call a crawl delay. Basically it allows you to specify via robots.txt an amount of time (in seconds) that MSNBot should wait before retrieving another page from that host. The syntax in your robots.txt file would look something like:
User-Agent: msnbot
Crawl-Delay: 20
This instructs MSNBot to wait 20 seconds before retrieving another page from that host. If you think that MSNBot is being a bit aggressive, this is a way to have it slow down on your host while still making sure that your pages are indexed.
Have a good weekend.
-msndude (msd)
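For illustration, here is a minimal sketch of how a polite crawler could honour such a directive. It is written in Python against the standard library's urllib.robotparser (whose crawl_delay() support only arrived in Python 3.6); the host and URLs are placeholders, not anything MSN has published:

import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")  # placeholder host
rp.read()

# Fall back to a 1-second delay if no Crawl-Delay is given for this agent.
delay = rp.crawl_delay("msnbot") or 1

for url in ("http://www.example.com/a.html", "http://www.example.com/b.html"):
    if rp.can_fetch("msnbot", url):
        # ... fetch and process the page here ...
        time.sleep(delay)  # wait before requesting another page from this host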
Just remember it's often their wages on the line, so they may appear slightly rude to the reps from the SEs on here, but they do appreciate your time as well.
Looking forward to tips on how every webmaster can achieve the number 1 position for every target keyword they desire.
Not much to ask in your first week, is it?
And once again, welcome to WebmasterWorld.
steve
[webmasterworld.com...]
I think the meta revisit-after tag has great potential as a means for webmasters to control the behavior of Internet bots. The tag needs to be developed further, though, with regard to its syntax and format (see messages 21 and 25 in the thread). If MSN were willing to drive the development of this tag, I'm sure its use would benefit both webmasters and search engine bots.
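For reference, the tag as commonly written looks like this; the format of the content value (days? a date? a duration string?) is exactly the unstandardised part under discussion:

<meta name="revisit-after" content="7 days">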
User-Agent: msnbot
Crawl-Delay: 20
Wow. Might I suggest an alternate way of doing it?
How about something simpler, like putting an msnbot.txt file into the root of the domain? Adding proprietary directives to the robots.txt file may encourage other robots to add their own. I would rather not have another standards war over the robots.txt file, if you know what I mean.
This would also give you the opportunity to add more directives in the future without disrupting that file. A hypothetical example follows.
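To make the proposal concrete, such a file at the domain root might hold only MSN-specific directives, something like this; it is purely hypothetical, since no msnbot.txt convention actually exists:

Crawl-Delay: 20
Disallow: /private/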
Why do you think there is A Standard for Robot Exclusion [robotstxt.org]? I guess that's like a red rag to a bull in Redmond ;)
I found one lone reference on the entire net to the crawl delay used by Slurp!
[webmasterworld.com...]
...let's stay on topic, folks.
> breaking the standards
Robots.txt was never approved by any "standards" committee on the internet.
There are a lot of 'well-behaved' spiders out there that intelligently distribute their activities based on server/IP blocks as well as individual sites. Why does M$ feel the need to start every project (web server, mail client, ...) as if there were nothing to learn from the competition?
The reason Slurp needed this directive was that its spiders can request robots.txt from a single site 30-40 times in an hour. Their response was that this was impossible to avoid, as the crawler is 'distributed' ;)
Regards...jmcc
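For what it's worth, even a distributed crawler can avoid hammering robots.txt by sharing a per-host cache between its processes. A rough Python sketch of the idea; the 24-hour lifetime is my own assumption, not anything Slurp documented:

import time

_robots_cache = {}  # host -> (fetched_at, rules)
CACHE_LIFETIME = 24 * 60 * 60  # assume one robots.txt fetch per host per day

def get_robots(host, fetch, parse):
    # fetch and parse are stand-ins for the crawler's own HTTP and parsing code
    cached = _robots_cache.get(host)
    if cached and time.time() - cached[0] < CACHE_LIFETIME:
        return cached[1]
    rules = parse(fetch("http://%s/robots.txt" % host))
    _robots_cache[host] = (time.time(), rules)
    return rules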
Any idea how to get added to MSNbot's list of sites to crawl?
Well, if it operates like other Microsoft products, the best thing to do would be to try to avoid it and it will find its way to you :) *cough*MSNMESSENGER*cough*
It would be even more useful if the main SE spider guys got together at the W3C and worked on a standard, instead of adding proprietary features to the current robots exclusion protocol.
Nice thought, but IMHO the W3C is a poorly-managed organization and they're a little slow to act. They don't seem to have what it takes to organize a standard, they're "behind the curve", and I would really hate to give them anything else to do. I do agree that a standard for all search engines to adhere to would be a good thing, but the W3C should have no part of it if we want to see it in our lifetimes.
Google probably won't implement this, as they already have their own mechanism to prevent overloading a site.
Don't know when and where GoogleGuy said it though.
Google monitors the time a site takes to respond to a request from the crawler.
If it replies very slowly, it will back off and come back later.
If it replies very quickly, the crawler `knows` that the server can handle the requests quite easily, and it keeps on crawling (with a reasonable time between requests).
So there is no need for such an option in robots.txt; the crawler itself can handle this (Google's crawler can, so the others should be able to do the same).
Letting the crawler determine whether the server is busy also helps reduce server stress when the site is experiencing an unexpected peak in visits: the crawler will back off because of the slow response and not consume the much-needed server capacity at that moment.
So why is this crawl-delay option needed in the robots.txt file? Can anyone tell me that? I don't see the point.
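What's being described here is essentially adaptive rate limiting keyed off response time. A rough sketch of that logic in Python; the thresholds are invented for illustration, since Google has never published the real ones:

import time
import urllib.request

def crawl(urls, delay=1.0, min_delay=0.5, max_delay=60.0):
    for url in urls:
        start = time.time()
        urllib.request.urlopen(url).read()
        elapsed = time.time() - start
        if elapsed > 2.0:    # slow response: assume the server is struggling
            delay = min(delay * 2, max_delay)
        elif elapsed < 0.2:  # fast response: the server can handle more
            delay = max(delay / 2, min_delay)
        time.sleep(delay)    # reasonable time between requests either way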
digitalv, I disagree. I think the W3C has done a lot to help us out. Have you coded a few sites following the W3C's web standards yet? If so, please back up your opinion. Though I would recommend a new post and leave this one to M$.
Therefore, based on customer feedback, we decided to provide the crawl-delay feature, which gives webmasters more control. In most cases we think that our adjustments based on download speed will be roughly accurate. If they are not, however, we give webmasters some control over how fast they are crawled.
-msndude (msd)
Does the msn bot/search service respect a robots.txt ban? Specifically, if we say that directoryX is banned in robots.txt, will it attempt to download files in directoryX? (eg: Google does NOT respect this ban and will load those files [but not list them in results] and use that data in its crawl and index activity/data mining)
Based purely on msnbot's actions on one of my main directory sites, msnbot downloads banned directories. It also ignores HTTP result codes and was indexing the site every five days without bothering to use 304 responses to indicate that pages had not changed. Considering that hostmasters pay for their bandwidth, letting an incompetently designed spider like msnbot loose on their site is a bad thing.
Regards...jmcc
Also, for clarification, we are not yet doing conditional GETs. As we improve our crawler, this may be something we would consider adding.
-msndude (msd)
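For anyone wondering what that would save: a conditional GET just replays the Last-Modified value from the previous fetch as an If-Modified-Since header, so an unchanged page costs a bodiless 304 instead of a full download. A minimal Python sketch; the URL and date are placeholders:

import urllib.request
import urllib.error

req = urllib.request.Request(
    "http://www.example.com/page.html",
    headers={"If-Modified-Since": "Sat, 29 May 2004 12:00:00 GMT"},
)
try:
    body = urllib.request.urlopen(req).read()  # 200: page changed, re-index it
except urllib.error.HTTPError as e:
    if e.code == 304:
        body = None  # not modified: skip it and save the bandwidth
    else:
        raise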
Webmasters have to pay for bandwidth. To date, Microsoft's spidering strategy has been utterly incompetent and completely oblivious to the cost of bandwidth to webmasters. Consequently, many have banned msnbot.
I'd like to see competition in the search engine operations market. However, Microsoft may just not be able to compete with the quality of Google or Yahoo. Search is not an area in which Microsoft has a killer product. It is one where the webmasters hold the ultimate fates of the search engines in their hands. Irritate them and they will kill your search engine by banning its spiders, making your competition's search engine indices superior.
Regards...jmcc
This has been up for 8 days now, but MSNbot is NOT paying attention to this instruction and is hitting me like a d**n machine gun.
Is anyone seeing msnbot follow the crawl delay instruction?
It's about 24 hours from going back on the ban list.