Forum Moderators: goodroi
Is there a way for me to utilize robots.txt to limit the spidering of the site by msnbot to only once or twice per week? For example, it would be allowed on any Tuesday and/or any Friday?
At this rate I will have to buy more bandwidth just to account for what this one bot is using, which seems silly. Any advice is appreciated....
..........................
However, since Crawl-delay is not currently part of the robots.txt standard, it has to be defined on a per-User-agent basis.
If your site is only 15MB, how come MSN is fetching 43MB a day? Is it repeatedly requesting the same pages?
Are your pages dynamically generated, e.g. PHP? If so, you may want to look at some sort of cache control.
For example
[alexandre.alapetite.net...]
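The general idea behind that kind of cache control is to honour conditional requests, so a repeat crawl of an unchanged page costs a tiny 304 response instead of a full download. A rough sketch of the shape of it (in Python rather than PHP, with the function and timestamp names being my own illustrative assumptions; the linked article covers the PHP details):

import email.utils

def respond(if_modified_since, last_changed_ts):
    # last_changed_ts: Unix timestamp of the last real content change;
    # your application has to track this itself (an assumption here).
    headers = {
        "Last-Modified": email.utils.formatdate(last_changed_ts, usegmt=True),
        "Cache-Control": "public, max-age=3600",
    }
    if if_modified_since:
        since = email.utils.parsedate_to_datetime(if_modified_since).timestamp()
        if last_changed_ts <= since:
            return 304, headers, b""    # bot reuses its cached copy, minimal bandwidth
    return 200, headers, b"<html>...full page...</html>"    # sent only when changed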
I found the info in the MS FAQ, so I have added this to my robots.txt file:
# Limit msnbot to 400 requests per day
User-agent: msnbot
Crawl-delay: 216
# Prevent certain file types
User-agent: msnbot
Disallow: /*.jpg$
Disallow: /*.gif$
Disallow: /*.php$
Disallow: /*.pl$
Disallow: /*.cgi$
Disallow: /*.shtml$
Disallow: /*.xml$
I'll watch that for a couple of days, and if it is still using too much bandwidth, I'll take it down to Crawl-delay: 864 (100 requests per day).
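(For anyone checking the arithmetic behind those Crawl-delay values: the delay is just the number of seconds in a day divided by the request budget. A quick sketch in Python, purely illustrative:)

SECONDS_PER_DAY = 24 * 60 * 60    # 86400

def crawl_delay(requests_per_day):
    # one request every N seconds caps the bot at 86400/N fetches per day
    return SECONDS_PER_DAY // requests_per_day

print(crawl_delay(400))   # 216 -> roughly 400 requests/day
print(crawl_delay(100))   # 864 -> roughly 100 requests/day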
..........................
Do your logs show anything other than msnbot UAs, or any non-MS IPs/hosts?
Mine show msnbot accesses as Host/UA:
msnbot.msn.com
msnbot/1.0 (+http://search.msn.com/msnbot.htm)
msnbot64041.search.msn.com
msnbot/1.0 (+http://search.msn.com/msnbot.htm)
sasch1031308.phx.gbl
msnbot/1.0 (+http://search.msn.com/msnbot.htm)
(Also: by1sch4041906.phx.gbl; sasch1031204.phx.gb; etc.)
And here's an older variation on the UA theme:
msnbot/0.9 (+http://search.msn.com/msnbot.htm)
I just wanted to add this info in case someone's spoofing the UA (as they do with Googlebot on occasion).
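If you want to check for spoofing programmatically, forward-confirmed reverse DNS is the usual test. A minimal sketch in Python (the hostname suffixes below are taken from the log entries above and should be treated as an assumption, not an official list):

import socket

MSN_SUFFIXES = (".search.msn.com", ".msn.com", ".phx.gbl")   # assumed from the logs above

def looks_like_real_msnbot(ip):
    try:
        host = socket.gethostbyaddr(ip)[0]            # reverse lookup on the visiting IP
    except socket.herror:
        return False
    if not host.endswith(MSN_SUFFIXES):
        return False
    try:                                              # forward-confirm the returned name
        return ip in socket.gethostbyname_ex(host)[2]
    except socket.gaierror:
        return False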
*FWIW:
User-agent: msnbot
Crawl-delay: 300
Disallow: /*?
Disallow: /*?$
Disallow: /*.cgi$
Disallow: /*.pl$
Disallow: /*.PDF$
Disallow: /*.exe$
Disallow: /*.txt$
Disallow: /*.hqx$
Disallow: /*.zip$
Disallow: /*.gif$
Disallow: /*.jpg$
Disallow: /*.jpeg$
Disallow: /cgi-bin
(etc.)
Revised robots.txt draft specification:
[robotstxt.org...]
In my original robots.txt file (prior to these recent changes), I had the standard:
Based on what I learned from this thread, I then added beneath those 2 lines the following:
So now it looks like the lines below, and I'm wondering if that is contradictory, or is that acceptable?
Thanks again for everyone's advice....
.....................................
There is no robots.txt standard (although there should be): [robotstxt.org...]
It's a de facto standard.
Generally, it's best to put more-specific records first, followed by the least-specific 'default' at the end.
Many robots have gotten smarter, and many now parse the entire file looking for a 'best match' instead of quitting as soon as they find any record that matches their user-agent name or '*' -- but I would not count on that behaviour.
Jim
It's a de facto standard.
No it's not. If it were, there would be well-defined rules for all robots to understand and follow, and yes, violate. There really is no such thing as a "valid robots.txt" file.
So msnbot understands crawl-delay and Googlebot doesn't. Both use wildcards but other bots do not. What's valid for one isn't necessarily true for the other. No wonder webmasters cloak their robots.txt files.
That's why there needs to be a standard. It's probably too late for that to happen but that's a discussion for another day.
User-agent: *
Disallow: /
(Leaving it blank tells a robot it gets to decide on its own -- not a good idea:)
If only ALL robots and crawlers and scrapers and the like were hard-coded to respect robots.txt... Just remember, the bad ones will completely ignore it, or worse, read it and then ride rough-shod all over your stuff anyway.
No it's not. If it were, there would be well-defined rules for all robots to understand and follow, and yes, violate. There really is no such thing as a "valid robots.txt" file.
This is totally incorrect. As I said, robots.txt is a de facto standard. It may not have the same status as HTTP or TCP/IP, which are enshrined in RFCs, but by virtue of support from all the major players it IS a standard. All the details are here: [robotstxt.org...]
So msnbot understands crawl-delay and Googlebot doesn't. Both use wildcards but other bots do not.
It's simple - Crawl-delay is not part of the standard, so there is no violation of the standard if it's not supported. Nothing new here - lots of standards have optional extensions (like NCQ in SATA) or even extensions proposed by companies (like favicon.ico), and that does not override the fact that robots.txt is a standard, albeit a de facto one.
That's the last post from me on the subject - if you are aware of something that qualifies as a robots standard better than robots.txt, then please post.
It's simple - Crawl-delay is not part of the standard, so there is no violation of the standard if it's not supported.
The robots.txt protocol was intended to be very strict and not nearly as flexible as you have been made to believe. For example, consider the following quote from robotstxt.org:
Note also that regular expressions are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "Disallow: /tmp/*" or "Disallow: *.gif". [robotstxt.org...]
So if a robot were to follow the "de facto" robots.txt standard and it encountered a wildcard disallow, it would not be in violation of the protocol if it were to ignore that request. That is why a formal standard is needed. Not trying to argue with you, just trying to set the record straight.
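To make the strictness concrete: under the original spec a Disallow value is a plain path prefix, so a strict parser treats a wildcard pattern as literal text and it simply never matches anything. A tiny sketch in Python (hypothetical helper, illustrative only):

def disallowed(path, disallow_values):
    # original-spec matching: plain prefix comparison, no wildcards at all
    return any(path.startswith(v) for v in disallow_values if v)

print(disallowed("/images/photo.gif", ["/*.gif$"]))   # False - pattern read literally
print(disallowed("/cgi-bin/form.pl", ["/cgi-bin"]))   # True  - simple prefix match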
So if a robot were to follow the "de facto" robots.txt standard and it encountered a wildcard disallow, it would not be in violation of the protocol if it were to ignore that request.
An improvement to the standard, such as support for regular expressions in a backwards-compatible way, is by no means a bad thing that contradicts the standard.
There is nothing better than robots.txt, and while I agree that a good revision of it is necessary, it nevertheless is a de facto standard.
An improvement to the standard, such as support for regular expressions in a backwards-compatible way, is by no means a bad thing that contradicts the standard.
True, but who defines the rules for such improvements? A standards committee would (if one were to exist). Just because a couple of search engines make up proprietary rules (wildcard exclusions, crawl-delays, etc.) doesn't mean that the other search engines need to be held hostage to their rules and reprogram their bots to obey, understand, and follow them.
So, the de facto standard doesn't really exist. Sure, most good search engine spiders will grab robots.txt, but how they parse it for instructions varies greatly and is subject to their own interpretation of the "de facto" rules.
True, but who defines the rules for such improvements?
You are confusing improvements with the standard. robots.txt has no Crawl-delay defined; however, even though Crawl-delay was introduced by one company (Microsoft, I believe), it is rapidly becoming part of the de facto standard. I'd say any good bot should support it: MSNbot does, Slurp does too, and now it's time for Googlebot to support it.
There is simply no alternative to robots.txt - it is the standard simply because there is nothing else.
I'm finished on the matter.
I'm wondering if it's possible that the msnbot is confused by the last 2 lines:
The only 3 bots I care about are from Google, Yahoo, and MS, so my question is this: Does it make sense to keep the top exactly as I have it, then underneath put:
The final robots.txt would thus look like this:
Does anyone see a reason why that format would not work, or would cause problems?
Many thanks again...
If you occasionally get high traffic from MSNBot, you can specify a crawl delay parameter in the robots.txt file to specify how often, in seconds, MSNBot can access your website. To do this, add this syntax to your robots.txt file:
User-agent: msnbot
Crawl-delay: 120
User-agent: msnbot
Crawl-delay: 864
Disallow: /*.jpg$
Disallow: /*.gif$
Disallow: /*.php$
Disallow: /*.pl$
Disallow: /*.cgi$
Disallow: /*.shtml$
Disallow: /*.xml$
Disallow: /cgi-bin/
Disallow: /graphics/
User-agent: googlebot
Disallow:
User-agent: slurp
Disallow:
But, the following has the same effect. You don't need the Googlebot/slurp directives. They will crawl your site unless you specifically exclude them:
User-agent: msnbot
Crawl-delay: 864
Disallow: /*.jpg$
Disallow: /*.gif$
Disallow: /*.php$
Disallow: /*.pl$
Disallow: /*.cgi$
Disallow: /*.shtml$
Disallow: /*.xml$
Disallow: /cgi-bin/
Disallow: /graphics/
Also, you may not have blank lines in a record, as they are used to delimit multiple records.
Pfui... I had seen your advice:
User-agent: *
Disallow: /
But since I DO want the top level crawled, I would not want to do anything to discourage that. Actually, all I care about are the standard html pages, thus the reason I am disallowing the .jpg's, php, etc.
I'll monitor this for a couple days, and will report back if there is a significant change.
ps. Happy New Year to everyone at WebmasterWorld!
..............................................
User-agent: *
Disallow:
...whereas, for it to work against non-specified (but robots.txt-respecting) bots, it should be:
User-agent: *
Disallow: /
I know you know that -- but I wanted to make sure Reno saw it.
(And to think we haven't even covered the examples' trailing slash on to-be-omitted directories:)
To exclude all robots from the entire server:
User-agent: *
Disallow: /
To allow all robots complete access:
User-agent: *
Disallow:
Since I am allowing access but wish to control MSN's spidering frequency, I left off the slash....
-- If you want ALL robots to crawl, you don't even need a robots.txt file.
-- If you want ALL robots to crawl, AND you want to control msnbot (re your original post; OP), then the preceding posts show you how-to. (Similar but not identical controls are also available for many robots per instructions on their sites.)
-- If you want NO robots to crawl, BUT you want msnbot (or Google, etc.), then you include...
User-agent: *
Disallow: /
...BEFORE your msnbot (or Google, etc.) instructions. E.g. --
User-agent: *
Disallow: /
User-agent: msnbot
Crawl-delay: 300
Disallow: /*?
Disallow: /*?$
(etc.)
User-agent: Googlebot
Disallow: /cgi-bin
(etc.)
2.) The majors look for their specific IDs and instructions, so you can block many of the smaller (and/or relentless) ones with a blanket Disallow: / AND still control the majors.
(Aside: The third example, the NO-BUT set-up, is a kind of happy medium for me because I prefer to exclude as many robots/crawlers as possible. I have hundreds of thousands of dynamic pages (once automatically beyond robots' reach -- no longer), plus I'm tired of shutting down scrapers.)
3.) Okay. Finally (sorry), and hearkening back to your OP, I've found msnbot to be more consistently respectful of robots.txt than any other, plus msnbot doesn't sneak in using IPs and browser UAs -- it's always who-what it says it is.
So here's hoping you've crafted the best msnbot controls for your site, Happy New Year back atcha, and kudos for your patience through all the robotic minutiae:)
That's important -- If you are dealing with an old or 'minor' robot, then don't put the
User-agent: *
record first in your file. Put the robot-specific records first. Many older and less-sophisticated robots will read that 'User-agent: *' record, accept it as a match for their user-agent, and ignore the rest of the file.
In other words, these robots will accept a record matching their user-agent name, or "*" - whichever comes first.
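That "whichever comes first" behaviour is easy to picture as a simple first-match loop. A sketch (Python, illustrative only) of why the specific record must come before the catch-all for such bots:

def select_record(records, bot_ua):
    # records: (user_agent_token, rules) tuples in file order
    for token, rules in records:
        if token == "*" or token.lower() in bot_ua.lower():
            return rules                 # old-style bots stop at the first hit
    return None

records = [("*", ["Disallow: /"]), ("msnbot", ["Crawl-delay: 300"])]
print(select_record(records, "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"))
# -> ['Disallow: /']  because the catch-all record came first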
I'll freely admit to being something of a Luddite, and avoiding some of these new extensions to the 'Standard'. But I do not want to be in the position of depending upon proprietary or 'forgiving' behaviour on the part of any third party for my sites to be crawled and indexed properly.
Terminology note:
A robots.txt record is one or more User-agent directives followed by one or more Disallow directives. Records are separated by a blank line. Many old robots do not accept multiple User-agents per record, even though that was in the original proposed Standard. But the key point is that a blank line is required to delimit records, and I know of at least one old robot that malfunctioned if there was no blank line after the last record in the file.
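For example, a two-record file with the required blank-line delimiter between records (contents purely illustrative):
User-agent: msnbot
Crawl-delay: 864
Disallow: /graphics/

User-agent: *
Disallow: /cgi-bin/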
Jim
Sometime next week I hope to have an update -- if I'm having this problem, others are likely experiencing the same thing, so perhaps we can come up with a simple robots.txt format to keep their spider from getting into the high octane fuel!
Only about 5 hours left to 2005 here in USA EST, so may 2006 be a great year for one and all....
................................