
Define allowable spider times

MSN is going nutz!

         

Reno

8:51 pm on Dec 29, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have a relatively small website -- under 15 MB in size. The MSN bot comes every day and hits it hundreds of times -- today alone msnbot accounted for 43 MB of bandwidth. Multiply that over the course of a month and you can see the problem.

Is there a way for me to utilize robots.txt to limit the spidering of the site by msnbot to only once or twice per week? For example, it would be allowed on any Tuesday and/or any Friday?

At this rate I will have to buy more bandwidth just to account for what this one bot is using, which seems silly. Any advice is appreciated....

..........................

Dijkgraaf

12:12 am on Dec 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



No, robots.txt does not allow you to define anything like that. The closest thing some spiders do support is Crawl-delay:, which tells a spider not to fetch more than one page every N seconds.
e.g. [help.yahoo.com...]
[search.msn.com.my...]

However since this is not currently part of the robots.txt standard, it has to be defined on a per User-Agent basis.

If your site is only 15 MB, how come MSN is fetching 43 MB a day? Is it repeatedly asking for the same pages?
Are your pages dynamically generated, e.g. PHP? If so you may want to look at some sort of cache control.
For example
[alexandre.alapetite.net...]
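As a rough illustration of that idea (an assumed sketch, not the linked tutorial's code): a dynamically generated page can honor conditional requests, so a bot's repeat fetch costs only a 304 status and headers instead of the full body. The `conditional_response` helper below is hypothetical:

```python
from email.utils import formatdate, parsedate_to_datetime

def conditional_response(if_modified_since, page_mtime):
    """Decide how to answer a (possibly conditional) request.

    if_modified_since: value of the If-Modified-Since request header, or None
    page_mtime:        Unix timestamp of the page's last real change
    Returns (status_code, response_headers).
    """
    headers = {
        "Last-Modified": formatdate(page_mtime, usegmt=True),
        "Cache-Control": "max-age=86400",  # allow reuse for a day
    }
    if if_modified_since:
        try:
            cached = parsedate_to_datetime(if_modified_since).timestamp()
        except (TypeError, ValueError):
            cached = 0  # unparseable header: fall through to a full 200
        if page_mtime <= cached:
            return 304, headers  # headers only, no body re-sent
    return 200, headers
```

A well-behaved crawler that sends If-Modified-Since would then pay a few hundred bytes per revisit of an unchanged page rather than the whole page.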

Reno

12:28 am on Dec 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks Dijkgraaf. I have no idea why MSN is hitting my site that hard -- today there were OVER 2,000 requests! I have almost no PHP and no meta cache tags.

I found the info on the MS FAQ, so have added this to my robots.txt file:


# Limit msnbot to 400 requests per day
User-agent: msnbot
Crawl-delay: 216

# Prevent certain file types
User-agent: msnbot
Disallow: /*.jpg$
Disallow: /*.gif$
Disallow: /*.php$
Disallow: /*.pl$
Disallow: /*.cgi$
Disallow: /*.shtml$
Disallow: /*.xml$


I'll watch that for a couple days, and if it is still too much bandwidth, I'll take it down to: Crawl-delay: 864 (100 requests)
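For anyone checking the arithmetic behind those values: divide the 86,400 seconds in a day by the target request count. A quick sketch (the helper name is made up for illustration):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400

def crawl_delay_for(requests_per_day):
    """Crawl-delay value (whole seconds) that caps a polite bot at
    roughly `requests_per_day` fetches in 24 hours."""
    return SECONDS_PER_DAY // requests_per_day

# 86400 / 400 -> 216 seconds between fetches
# 86400 / 100 -> 864 seconds between fetches
```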

..........................

Lord Majestic

12:40 am on Dec 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Does MSNbot support wildcards like *? They are explicitly not supported by the robots.txt standard.

Additionally, it is not really a good idea to have 2 definitions for the same bot - there is no need to separate the Crawl-delay from the Disallow statements, and it's best to avoid ambiguity.

Reno

1:22 am on Dec 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks Lord Majestic for the heads-up -- I have joined the 2 statements in question.

Re the wild card, I took the format directly from MS:

search.msn.com.my/docs/siteowner.aspx?t=SEARCH_WEBMASTER_REF_RestrictAccessToSite.htm

...................

Pfui

1:45 am on Dec 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Good luck with your robots.txt tweaks. (Although robots.txt-atypical, those plus others work fine for me.*) If for some reason they don't appear to work for you...

Do your logs show UAs other than msnbot, or non-MS IPs/hosts?

Mine show msnbot accesses as Host/UA:

msnbot.msn.com
msnbot/1.0 (+http://search.msn.com/msnbot.htm)

msnbot64041.search.msn.com
msnbot/1.0 (+http://search.msn.com/msnbot.htm)

sasch1031308.phx.gbl
msnbot/1.0 (+http://search.msn.com/msnbot.htm)
(Also: by1sch4041906.phx.gbl; sasch1031204.phx.gb; etc.)

And here's an older variation on the UA theme:

msnbot/0.9 (+http://search.msn.com/msnbot.htm)

I just wanted to add this info in case someone's spoofing the UA (as they do with Googlebot on occasion).

*FWIW:
User-agent: msnbot
Crawl-delay: 300
Disallow: /*?
Disallow: /*?$
Disallow: /*.cgi$
Disallow: /*.pl$
Disallow: /*.PDF$
Disallow: /*.exe$
Disallow: /*.txt$
Disallow: /*.hqx$
Disallow: /*.zip$
Disallow: /*.gif$
Disallow: /*.jpg$
Disallow: /*.jpeg$
Disallow: /cgi-bin
(etc.)

Key_Master

1:47 am on Dec 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There is no robots.txt standard (although there should be):
[robotstxt.org...]

Revised robots.txt draft specification:
[robotstxt.org...]

Reno

5:37 pm on Dec 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have a quick follow-up question...

In my original robots.txt file (prior to these recent changes), I had the standard:


User-agent: *
Disallow:

Based on what I learned from this thread, I then added beneath those 2 lines the following:


User-agent: msnbot
Crawl-delay: 216
Disallow: /*.jpg$
Disallow: /*.gif$
Disallow: /*.php$
Disallow: /*.pl$
Disallow: /*.cgi$
Disallow: /*.shtml$
Disallow: /*.xml$

So now it looks like the lines below, and I'm wondering if that is contradictory, or, is that acceptable?


User-agent: *
Disallow:
User-agent: msnbot
Crawl-delay: 216
Disallow: /*.jpg$
Disallow: /*.gif$
Disallow: /*.php$
Disallow: /*.pl$
Disallow: /*.cgi$
Disallow: /*.shtml$
Disallow: /*.xml$

Thanks again for everyone's advice....

.....................................

Lord Majestic

5:52 pm on Dec 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There is no robots.txt standard (although there should be): [robotstxt.org...]

It's a de facto standard.

jdMorgan

5:56 pm on Dec 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You should reverse the two sections, since the first one allows *all* robots access to *all* resources. msnbot may accept that '*' record and stop processing the robots.txt file, giving you unexpected results.

Generally, it's best to put more-specific records first, followed by the least-specific 'default' at the end.

Many robots have gotten smarter, and many now parse the entire file looking for a 'best match' instead of quitting as soon as they find any record that matches their user-agent name or '*' -- but I would not count on that behaviour.
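A minimal sketch of the stop-at-first-match behaviour described above (deliberately simplified parsing, not any particular bot's actual code):

```python
def first_matching_record(robots_txt, user_agent):
    """Return the Disallow values of the FIRST record whose User-agent
    matches `user_agent` or '*', mimicking old bots that stop reading
    as soon as any record matches."""
    records = [r for r in robots_txt.split("\n\n") if r.strip()]
    for record in records:
        agents, disallows = [], []
        for line in record.splitlines():
            field, _, value = line.partition(":")
            field, value = field.strip().lower(), value.strip()
            if field == "user-agent":
                agents.append(value.lower())
            elif field == "disallow":
                disallows.append(value)
        if "*" in agents or user_agent.lower() in agents:
            return disallows  # old bots quit here; later records are ignored
    return []

# With "User-agent: *" first, msnbot never reaches its own record:
robots = "User-agent: *\nDisallow:\n\nUser-agent: msnbot\nDisallow: /graphics/"
```

Run against that example, an old-style msnbot gets the empty (allow-everything) Disallow from the `*` record; swap the record order and it picks up `/graphics/` instead, which is exactly why the specific records should come first.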

Jim

Reno

6:31 pm on Dec 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You should reverse the two sections...

Thanks very much jd -- that is exactly the question I was wondering about. Appreciate all the help...

Key_Master

9:11 pm on Dec 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Its de facto standard.

No it's not. If it were, there would be well-defined rules for all robots to understand and follow - and yes, violate. There really is no such thing as a "valid robots.txt" file.

So msnbot understands crawl-delay and Googlebot doesn't. Both use wildcards but other bots do not. What's valid for one isn't necessarily true for the other. No wonder webmasters cloak their robots.txt files.

That's why there needs to be a standard. It's probably too late for that to happen but that's a discussion for another day.

Pfui

9:14 pm on Dec 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Where applicable, be sure to define the top-level directory as off-limits:

User-agent: *
Disallow: /

(Leaving it blank tells a robot it gets to decide on its own -- not a good idea:)

If only ALL robots and crawlers and scrapers and the like were hard-coded to respect robots.txt... Just remember, the bad ones will completely ignore it, or worse, read it and then ride rough-shod all over your stuff anyway.

Lord Majestic

9:23 pm on Dec 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



No it's not. If it were, there would be well-defined rules for all robots to understand and follow - and yes, violate. There really is no such thing as a "valid robots.txt" file.

This is totally incorrect - as I said, robots.txt is a de facto standard. It may not have the same status as HTTP or TCP/IP, enshrined in RFCs, but by virtue of support from all the major players it IS a standard - all the details are here: [robotstxt.org...]

So msnbot understands crawl-delay and Googlebot doesn't. Both use wildcards but other bots do not.

It's simple - Crawl-delay is not part of the standard, so there is no violation of the standard if it's not supported. Nothing new here - lots of standards have optional extensions (like NCQ in SATA) or even extensions proposed by companies (like favicon.ico), and that does not override the fact that robots.txt is a standard, albeit de facto.

That's the last post from me on the subject - if you are aware of something that qualifies as a robots standard better than robots.txt, then please post it.

Key_Master

10:08 pm on Dec 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's simple - Crawl-delay is not part of the standard, so there is no violation of the standard if it's not supported.

The robots.txt protocol was intended to be very strict and not nearly as flexible as you have been made to believe. For example, consider the following quote from robotstxt.org:

Note also that regular expressions are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "Disallow: /tmp/*" or "Disallow: *.gif".

[robotstxt.org...]

So if a robot were to follow the "de facto" robots.txt standard and it encountered a wildcard disallow, it would not be in violation of the protocol if it were to ignore that request. That is why a formal standard is needed. Not trying to argue with you, just trying to set the record straight.
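To make the extension being argued about concrete: the MSN/Google-style patterns can be translated to regular expressions, which is roughly what wildcard-aware bots do (a sketch under that assumption; classic-standard bots instead treat the Disallow value as a plain path prefix):

```python
import re

def wildcard_disallow_matches(pattern, path):
    """Check an MSN/Google-style Disallow pattern against a URL path.
    '*' matches any run of characters; a trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = ""
    for ch in pattern:
        regex += ".*" if ch == "*" else re.escape(ch)
    if anchored:
        regex += "$"
    # re.match anchors at the start, so an unanchored pattern acts
    # as a prefix match, just like a classic Disallow line
    return re.match(regex, path) is not None
```

So `/*.gif$` blocks `/images/logo.gif` but not `/logo.gif?x=1`, while a plain `/cgi-bin` still works as an ordinary prefix - which is the backwards-compatibility point under debate here.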

Lord Majestic

10:25 pm on Dec 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



So if a robot were to follow the "de facto" robots.txt standard and it encountered a wildcard disallow, it would not be in violation of the protocol if it were to ignore that request.

An improvement to the standard - support for regular expressions in a backwards-compatible way - is by no means a bad thing that contradicts the standard.

There is nothing better than robots.txt and while I agree that a good revision of it is necessary, it nevertheless is a de facto standard.

Key_Master

10:47 pm on Dec 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



An improvement to the standard - support for regular expressions in a backwards-compatible way - is by no means a bad thing that contradicts the standard.

True, but who defines the rules for such improvements? A standards committee would (if one were to exist). Just because a couple of search engines make up proprietary rules (wildcard exclusions, crawl-delays, etc.) doesn't mean that the other search engines need to be held hostage to those rules and reprogram their bots to obey, understand, and follow them.

So, the de facto standard doesn't exist. Sure, most good search engine spiders will grab robots.txt but how they parse it for instructions varies greatly and is subject to their own interpretation of the "de facto" rules.

Lord Majestic

12:17 am on Dec 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



True, but who defines the rules for such improvements?

You are confusing improvements with the standard - robots.txt has no Crawl-delay defined, but even though Crawl-delay was introduced by one company (Microsoft, I believe), it is rapidly becoming part of the de facto standard - I'd say any good bot should support it. MSNbot does, Slurp too, and now it's time for Googlebot to support it.

There is simply no alternative to robots.txt - it is standard simply because there is nothing else.

I'm finished on the matter.

Reno

7:38 pm on Dec 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I just checked my logs and MSN is still nailing me. Yesterday I had my site down due to exceeding bandwidth (an end-of-the-month problem largely caused by the excessive msnbot daily crawling over the last 30 days). I had to buy extra bandwidth, so am eager to clear this up.

I'm wondering if it's possible that the msnbot is confused by the last 2 lines:



User-agent: *
Disallow:



That is, first I tell it to disallow various file types, then I immediately say disallow nothing.

The only 3 bots I care about are from Google, Yahoo, and MS, so my question is this: Does it make sense to keep the top exactly as I have it, then underneath put:



User-agent: googlebot
Disallow:
User-agent: slurp
Disallow:


The final robots.txt would thus look like this:



User-agent: msnbot
Crawl-delay: 864
Disallow: /*.jpg$
Disallow: /*.gif$
Disallow: /*.php$
Disallow: /*.pl$
Disallow: /*.cgi$
Disallow: /*.shtml$
Disallow: /*.xml$
Disallow: /cgi-bin/
Disallow: /graphics/
User-agent: googlebot
Disallow:
User-agent: slurp
Disallow:


Does anyone see a reason why that format would not work, or would cause problems?

Many thanks again...

Lord Majestic

7:45 pm on Dec 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'd say your Crawl-Delay is very high - 15 minutes per request is not reasonable, I don't know exactly how MSNbot would react to it, but it is reasonable to expect that they would have a max limit on Crawl-Delay after which they will either not crawl anything from site or just use internal max limit.

Reno

8:37 pm on Dec 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for the advice -- I'll try using the number (120) that MS recommends. From their own website:

If you occasionally get high traffic from MSNBot, you can specify a crawl delay parameter in the robots.txt file to specify how often, in seconds, MSNBot can access your website. To do this, add this syntax to your robots.txt file:

User-agent: msnbot

Crawl-delay: 120

Key_Master

9:00 pm on Dec 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You need a blank line after each record (each group of User-agent and Disallow directives) or your robot exclusions may be ignored by bots. It should look like:

User-agent: msnbot
Crawl-delay: 864
Disallow: /*.jpg$
Disallow: /*.gif$
Disallow: /*.php$
Disallow: /*.pl$
Disallow: /*.cgi$
Disallow: /*.shtml$
Disallow: /*.xml$
Disallow: /cgi-bin/
Disallow: /graphics/

User-agent: googlebot
Disallow:

User-agent: slurp
Disallow:

But, the following has the same effect. You don't need the Googlebot/slurp directives. They will crawl your site unless you specifically exclude them:

User-agent: msnbot
Crawl-delay: 864
Disallow: /*.jpg$
Disallow: /*.gif$
Disallow: /*.php$
Disallow: /*.pl$
Disallow: /*.cgi$
Disallow: /*.shtml$
Disallow: /*.xml$
Disallow: /cgi-bin/
Disallow: /graphics/
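That blank-line rule can be sanity-checked mechanically: records are groups of non-blank lines delimited by blank lines, so a file missing the separators collapses into a single record. A tiny sketch (hypothetical helper name):

```python
def count_records(robots_txt):
    """Count robots.txt records: groups of non-blank lines
    separated by blank lines."""
    return len([r for r in robots_txt.strip().split("\n\n") if r.strip()])

# With the blank line, two records; without it, one merged record
with_blank = "User-agent: msnbot\nDisallow: /graphics/\n\nUser-agent: googlebot\nDisallow:"
without_blank = "User-agent: msnbot\nDisallow: /graphics/\nUser-agent: googlebot\nDisallow:"
```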

Pfui

9:17 pm on Dec 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Please see Message #13, above. You're still missing critical slashes.

Key_Master

9:23 pm on Dec 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Pfui, the forward slash is used in Disallow if you want to prohibit the bot from crawling the site. Omitting the forward slash allows the bot to crawl the site.

Reno

9:47 pm on Dec 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks very much Key_Master -- that was a silly mistake on my part. I had read this sentence at robotstxt.org, and had overreacted to the "may not have" part:
Also, you may not have blank lines in a record, as they are used to delimit multiple records.

Pfui... I had seen your advice:

User-agent: *
Disallow: /
But since I DO want the top level crawled, I would not want to do anything to discourage that. Actually, all I care about are the standard html pages, thus the reason I am disallowing the .jpg's, php, etc.

I'll monitor this for a couple days, and will report back if there is a significant change.

ps. Happy New Year to everyone at WebmasterWorld!

..............................................

Pfui

9:53 pm on Dec 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Ayep. But in more than one example, Reno shows this in robots.txt...

User-agent: *
Disallow:

...which, to have any blocking effect on non-specified, robots.txt-respecting bots, should be:

User-agent: *
Disallow: /

I know you know that -- but I wanted to make sure Reno saw it.

(And to think we haven't even covered the examples' trailing slash on to-be-omitted directories:)

Reno

10:39 pm on Dec 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I had taken the format from robotstxt.org
( www.robotstxt.org/wc/exclusion-admin.html )

To exclude all robots from the entire server
User-agent: *
Disallow: /

To allow all robots complete access
User-agent: *
Disallow:

Since I am allowing access but wish to control MSN's spidering frequency, I left off the slash....

Pfui

11:23 pm on Dec 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



1.) I just wanted to clarify that you have more flexibility than all-or-none if you want it, and if the bots respect your wishes:)

-- If you want ALL robots to crawl, you don't even need a robots.txt file.

-- If you want ALL robots to crawl, AND you want to control msnbot (re your original post; OP), then the preceding posts show you how-to. (Similar but not identical controls are also available for many robots per instructions on their sites.)

-- If you want NO robots to crawl, BUT you want msnbot (or Google, etc.), then you include...

User-agent: *
Disallow: /

...BEFORE your msnbot (or Google, etc.) instructions. E.g. --

User-agent: *
Disallow: /

User-agent: msnbot
Crawl-delay: 300
Disallow: /*?
Disallow: /*?$
(etc.)

User-agent: Googlebot
Disallow: /cgi-bin
(etc.)

2.) The majors look for their specific IDs and instructions, so you can block many of the smaller (and/or relentless) ones with a blanket Disallow: / AND still control the majors.

(Aside: The third example, the NO-BUT set-up, is a kind of happy medium for me because I prefer to exclude as many robots/crawlers as possible. I have hundreds of thousands of dynamic pages (once automatically beyond robots' reach -- no longer), plus I'm tired of shutting down scrapers.)

3.) Okay. Finally (sorry), and hearkening back to your OP, I've found msnbot to be more consistently respectful of robots.txt than any other, plus msnbot doesn't sneak in using IPs and browser UAs -- it's always who-what it says it is.

So here's hoping you've crafted the best msnbot controls for your site, Happy New year back atcha, and kudos for your patience through all the robotic minutiae:)

jdMorgan

11:42 pm on Dec 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> The majors look for their specific IDs and instructions

That's important -- If you are dealing with an old or 'minor' robot, then don't put the

User-agent: *

record first in your file. Put the robot-specific records first. Many older and less-sophisticated robots will read that 'User-agent: *' record, accept it as a match for their user-agent, and ignore the rest of the file.

In other words, these robots will accept a record matching their user-agent name, or "*" - whichever comes first.

I'll freely admit to being something of a Luddite, and avoiding some of these new extensions to the 'Standard'. But I do not want to be in the position of depending upon proprietary or 'forgiving' behaviour on the part of any third party for my sites to be crawled and indexed properly.

Terminology note:
A robots.txt record is one or more User-agent directives followed by one or more Disallow directives. Records are separated by a blank line. Many old robots do not accept multiple User-agents per record, even though that was in the original proposed Standard. But the key point is that a blank line is required to delimit records, and I know of at least one old robot that malfunctioned if there was no blank line after the last record in the file.

Jim

Reno

11:49 pm on Dec 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks very much Pfui and Jim for the very informative explanations -- am saving your posts to my "robots.txt" info folder. My only goal here is to get msnbot down from an average bandwidth of 40MB per day to about 6MB, AND, still have them come by on a regular basis as they do send a fair amount of traffic. But at 1.2 GB of bandwidth per month, they are in overkill mode, as I don't change the content that much.

Sometime next week I hope to have an update -- if I'm having this problem, others are likely experiencing the same thing, so perhaps we can come up with a simple robots.txt format to keep their spider from getting into the high octane fuel!

Only about 5 hours left to 2005 here in USA EST, so may 2006 be a great year for one and all....

................................
