Great feedback all around and thanks for the warm welcome! Rosalind, that is an interesting idea about supporting a revisit-after tag.
I also want to make folks aware of a feature that MSNBot supports, but which is not yet documented. We do support what we call a crawl delay. Basically it allows you to specify via robots.txt an amount of time (in seconds) that MSNBot should wait before retrieving another page from that host. The syntax in your robots.txt file would look something like:
User-Agent: msnbot
Crawl-Delay: 20
This instructs MSNBot to wait 20 seconds before retrieving another page from that host. If you think that MSNBot is being a bit aggressive, this is a way to have it slow down on your host while still making sure that your pages are indexed.
Have a good weekend.
-msndude (msd)
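For illustration, here is a minimal sketch of how a polite crawler could honour such a directive. It is written in Python against the standard library's urllib.robotparser (whose crawl_delay() support only arrived in Python 3.6); the host and URLs are placeholders, not anything MSN has published:

import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")  # placeholder host
rp.read()

# Fall back to a 1-second delay if no Crawl-Delay is given for this agent.
delay = rp.crawl_delay("msnbot") or 1

for url in ("http://www.example.com/a.html", "http://www.example.com/b.html"):
    if rp.can_fetch("msnbot", url):
        # ... fetch and process the page here ...
        time.sleep(delay)  # wait before requesting another page from this host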
Just remember it's often their wages on the line, so they may appear slightly rude to the reps from the SEs on here, but they do appreciate your time as well.
Looking forward to tips on how every webmaster can achieve the number 1 position for every target keyword they desire.
Not much to ask in your first week, is it?
And once again, welcome to WebmasterWorld.
steve
[webmasterworld.com...]
I think the meta revisit-after tag has great potential as a means for webmasters to control the behavior of Internet bots. The tag needs to be developed further, though, with regard to its syntax and format (see messages 21 and 25 in the thread). If MSN were willing to drive the development of this tag, I'm sure its use would benefit both webmasters and search engine bots.
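For reference, the tag as commonly written looks like this; the format of the content value (days? a date? a duration string?) is exactly the unstandardised part under discussion:

<meta name="revisit-after" content="7 days">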
User-Agent: msnbot
Crawl-Delay: 20
Wow. Might I suggest an alternate way of doing it?
How about something simpler, like putting an msnbot.txt file into the root of the domain? Adding proprietary directives to the robots.txt file may encourage other robots to add their own. I would rather not have another standards war over the robots.txt file, if you know what I mean.
This would also give you the opportunity to add more directives in the future without disrupting that file. A hypothetical example follows.
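To make the proposal concrete, such a file at the domain root might hold only MSN-specific directives, something like this; it is purely hypothetical, since no msnbot.txt convention actually exists:

Crawl-Delay: 20
Disallow: /private/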
Why do you think there is A Standard for Robot Exclusion [robotstxt.org]? I guess that's like a red rag to a bull in Redmond ;)
I found one lone reference on the entire net to the crawl delay used by Slurp!
[webmasterworld.com...]
...let's stay on topic, folks.
> breaking the standards
Robots.txt was never approved by any "standards" committee on the internet.
There are a lot of 'well-behaved' spiders out there that intelligently distribute their activities based on server/IP blocks as well as individual sites. Why does M$ feel the need to start every project (web server, mail client, ...) as if there were nothing to learn from the competition?
The reason Slurp needed this directive was that its spiders can request robots.txt from a single site 30-40 times in an hour. Their response was that this was impossible to avoid, as the crawler is 'distributed' ;)
Regards...jmcc
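For what it's worth, even a distributed crawler can avoid hammering robots.txt by sharing a per-host cache between its processes. A rough Python sketch of the idea; the 24-hour lifetime is my own assumption, not anything Slurp documented:

import time

_robots_cache = {}  # host -> (fetched_at, rules)
CACHE_LIFETIME = 24 * 60 * 60  # assume one robots.txt fetch per host per day

def get_robots(host, fetch, parse):
    # fetch and parse are stand-ins for the crawler's own HTTP and parsing code
    cached = _robots_cache.get(host)
    if cached and time.time() - cached[0] < CACHE_LIFETIME:
        return cached[1]
    rules = parse(fetch("http://%s/robots.txt" % host))
    _robots_cache[host] = (time.time(), rules)
    return rules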
Any idea how to get added to MSNbot's list of sites to crawl?
Well, if it operates like other Microsoft products, the best thing to do would be to try to avoid it and it will find its way to you :) *cough*MSNMESSENGER*cough*
It would be even more useful if the main SE spider guys got together at the W3C and worked on a standard, instead of adding proprietary features to the current robots exclusion protocol.
Nice thought, but IMHO the W3C is a poorly-managed organization and they're a little slow to act. They don't seem to have what it takes to organize a standard, they're "behind the curve", and I would really hate to give them anything else to do. I do agree that a standard for all search engines to adhere to would be a good thing, but the W3C should have no part of it if we want to see it in our lifetimes.
Google probably won't implement this, as they already have their own mechanism to prevent overloading a site.
Don't know when and where GoogleGuy said it though.
Google monitors the time a site takes to respond to a request from the crawler.
If it replies very slowly, it will back off and come back later.
If it replies very quickly, the crawler `knows` that the server can handle the requests quite easily, and it keeps on crawling (with a reasonable time between requests).
So there is no need for such an option in robots.txt; the crawler itself can handle this (Google's crawler can, so the others should be able to do the same).
Letting the crawler determine whether the server is busy also helps reduce server stress when the site is experiencing an unexpected peak in visits: the crawler will back off because of the slow response and not consume the much-needed server capacity at that moment.
So why is this crawl-delay option needed in the robots.txt file? Can anyone tell me that? I don't see the point.
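What's being described here is essentially adaptive rate limiting keyed off response time. A rough sketch of that logic in Python; the thresholds are invented for illustration, since Google has never published the real ones:

import time
import urllib.request

def crawl(urls, delay=1.0, min_delay=0.5, max_delay=60.0):
    for url in urls:
        start = time.time()
        urllib.request.urlopen(url).read()
        elapsed = time.time() - start
        if elapsed > 2.0:    # slow response: assume the server is struggling
            delay = min(delay * 2, max_delay)
        elif elapsed < 0.2:  # fast response: the server can handle more
            delay = max(delay / 2, min_delay)
        time.sleep(delay)    # reasonable time between requests either way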
digitalv, I disagree. I think the W3C has done a lot to help us out. Have you coded a few sites following the W3C's web standards yet? If so, please back up your opinion. Though I would recommend a new post and leave this one to M$.
Therefore, based on customer feedback, we decided to provide the crawl-delay feature, which gives webmasters more control. In most cases we think that our adjustments based on download speed will be roughly accurate. If they are not, however, we give webmasters some control over how fast they are crawled.
-msndude (msd)
Does the msn bot/search service respect a robots.txt ban? Specifically, if we say that directoryX is banned in robots.txt, will it attempt to download files in directoryX? (eg: Google does NOT respect this ban and will load those files [but not list them in results] and use that data in its crawl and index activity/data mining)
Based purely on msnbot's actions on one of my main directory sites, msnbot downloads banned directories. It also ignores HTTP result codes and was indexing the site every five days without bothering to use 304 responses to indicate that pages had not changed. Considering that hostmasters pay for their bandwidth, letting an incompetently designed spider like msnbot loose on their site is a bad thing.
Regards...jmcc
Also, for clarification, we are not yet doing conditional GETs. As we improve our crawler, this may be something we would consider adding.
-msndude (msd)
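For anyone wondering what that would save: a conditional GET just replays the Last-Modified value from the previous fetch as an If-Modified-Since header, so an unchanged page costs a bodiless 304 instead of a full download. A minimal Python sketch; the URL and date are placeholders:

import urllib.request
import urllib.error

req = urllib.request.Request(
    "http://www.example.com/page.html",
    headers={"If-Modified-Since": "Sat, 29 May 2004 12:00:00 GMT"},
)
try:
    body = urllib.request.urlopen(req).read()  # 200: page changed, re-index it
except urllib.error.HTTPError as e:
    if e.code == 304:
        body = None  # not modified: skip it and save the bandwidth
    else:
        raise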
Webmasters have to pay for bandwidth. To date, Microsoft's spidering strategy has been utterly incompetent and completely oblivious to the cost of bandwidth to webmasters. Consequently, many have banned msnbot.
I'd like to see competition in the search engine operations market. However, Microsoft may just not be able to compete with the quality of Google or Yahoo. Search is not an area in which Microsoft has a killer product. It is one where the webmasters hold the ultimate fates of the search engines in their hands. Irritate them and they will kill your search engine by banning its spiders, making your competition's search engine indices superior.
Regards...jmcc
This has been up for 8 days now, but MSNbot is NOT paying attention to this instruction and is hitting me like a d**n machine gun.
Is anyone seeing msnbot follow the crawl delay instruction?
It's about 24 hours from going back on the ban list.