MSN not obeying robots.txt

Forum Moderators: mack



DorianWeb

9:27 am on Aug 30, 2006 (gmt 0)

For some reason MSN is indexing directories that I've excluded in my robots.txt file. I have validated my robots file with the webmasterworld tool and another one just to make sure the problem is not on my end.

MSN is indexing things like: mydomain.com/cgi-bin/script.cgi?ID=1

and I have the following statement in my robots.txt excluding the cgi-bin directory:

User-agent:*
Disallow: /cgi-bin/
Disallow: /temp/
Disallow: /images

Anyone else having problems like this with MSN? Google seems to be minding my robots file with no problems.

Thanks.
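A quick way to confirm that the rules themselves say what you expect is to run them through a parser. The sketch below uses Python's standard-library `urllib.robotparser`; the helper name `is_crawl_allowed` and the `/index.html` URL are made up for illustration. It only answers whether a compliant crawler may fetch a URL, which is a separate question from whether the URL can appear in an index.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted in the post above, copied verbatim.
ROBOTS_TXT = """\
User-agent:*
Disallow: /cgi-bin/
Disallow: /temp/
Disallow: /images
"""


def is_crawl_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt lets user_agent fetch url.

    Hypothetical helper: it reflects how Python's parser reads the
    file, not necessarily how msnbot interprets it.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://mydomain.com/cgi-bin/script.cgi?ID=1",
        "http://mydomain.com/index.html",  # assumed example URL
    ):
        print(url, is_crawl_allowed(ROBOTS_TXT, "msnbot", url))
```

If the cgi-bin URL comes back as not allowed, the file is fine and the remaining question is whether the bot actually requested the page.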

msndude

5:43 pm on Aug 30, 2006 (gmt 0)

Are we crawling your site or are we just indexing it? Robots.txt only tells us not to crawl the site. If we see links from other pages that point to your site we'll index it anyway. This happens because we can't tell the difference between a page we haven't discovered yet and a page that we weren't allowed to crawl.

r3nz0

6:45 pm on Aug 30, 2006 (gmt 0)

msndude you need a holiday! :)

robots.txt tells engines not to index a URL, not merely not to crawl it like you're saying. Crawling is OK but a little bit rude, because crawling may be hidden indexing to gain information...

And the robots.txt has to be handled correctly: if a link points to a URL that's in the robots.txt, it's 'illegal' to index that page.

DorianWeb

9:58 pm on Aug 30, 2006 (gmt 0)

Just to clarify what I said in my first post. MSN is crawling AND indexing the URLs that are excluded in my robots.txt.

So when I do a site:mysite.com, I see a bunch of mydomain.com/cgi-bin/script.cgi?ID=1 type pages in the MSN SERPs (but I have the cgi-bin directory on Disallow).

Maybe I'm totally misunderstanding the proper use of robots.txt. I thought that if the directory was "Disallowed", it will not show up in your index.

Am I wrong here?

Thanks.

abates

10:02 pm on Aug 30, 2006 (gmt 0)

Do you have a specific msnbot section in your robots.txt as well, or just the * section?

jay5r

10:24 pm on Aug 30, 2006 (gmt 0)

I think what msndude is trying to say is that just because a URL is in their index doesn't mean they've crawled it - it means they've crawled another page that mentioned that URL, so they're aware of the URL. You'd need to see msnbot hitting the URL in your server logs to know if they've done anything 'illegal'.

When you're looking on MSN, if you just see the URL with no description and no cached version, then it's just indexed, and hasn't been crawled. There's almost no way that these pages will show up in any SERPs, so it's not really an issue. Even if you do see a description it may just mean they crawled it back at a time when it was 'legal' to crawl.

DorianWeb

11:17 pm on Aug 30, 2006 (gmt 0)

Do you have a specific msnbot section in your robots.txt as well, or just the * section?

Nothing specific for msnbot.

Thanks for the comment jay5r, I understand now.

[edited by: DorianWeb at 11:17 pm (utc) on Aug. 30, 2006]

jdMorgan

11:26 pm on Aug 30, 2006 (gmt 0)

Well, here's a bit of unfortunate news: I'm seeing msnbot-media violations:

Raw access log:

65.55.235.161 - - [29/Aug/2006:03:26:02 -0400] "GET /robots.txt HTTP/1.0" 200 10179 "-" "msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)"
65.55.235.161 - - [29/Aug/2006:03:26:03 -0400] "GET / HTTP/1.0" 403 661 "-" "msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)"

Relevant robots.txt snippet:

User-agent: msnbot-media
Disallow: /
User-agent: msnbot-news
Disallow: /
User-agent: msnbot-products
Disallow: /
# MSN search bot
User-agent: msnbot
Crawl-delay: 3
Disallow: /cgi-bin
Disallow: /comm/office
Disallow: /images/
Disallow: /access
Disallow: /test
# all others
User-agent: *
Disallow: /

Although msnbot-media is Disallowed, it nonetheless attempts to fetch "/" and my access control code gives it a 403-Forbidden response.

Jim
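The comparison described above (raw access log against robots.txt) can be automated. This is a minimal sketch, assuming the log is in the combined format shown in the post; `find_violations`, `SAMPLE_ROBOTS_TXT`, and `SAMPLE_LOG` are illustrative names, and the robots.txt is a shortened copy of the quoted snippet. Python's parser matches agent names by substring in file order, so the result can depend on the order of the groups.

```python
import re
from urllib.robotparser import RobotFileParser

# Combined Log Format, matching the raw access log quoted above.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Shortened copy of the robots.txt snippet from the post.
SAMPLE_ROBOTS_TXT = """\
User-agent: msnbot-media
Disallow: /
# MSN search bot
User-agent: msnbot
Crawl-delay: 3
Disallow: /cgi-bin
# all others
User-agent: *
Disallow: /
"""

SAMPLE_LOG = [
    '65.55.235.161 - - [29/Aug/2006:03:26:02 -0400] "GET /robots.txt HTTP/1.0" '
    '200 10179 "-" "msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)"',
    '65.55.235.161 - - [29/Aug/2006:03:26:03 -0400] "GET / HTTP/1.0" '
    '403 661 "-" "msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)"',
]


def find_violations(robots_txt, log_lines):
    """Return (agent, path) pairs for requests that robots_txt disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    violations = []
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        path, agent = match["path"], match["agent"]
        if path == "/robots.txt":
            continue  # fetching robots.txt itself is expected
        if not parser.can_fetch(agent, path):
            violations.append((agent, path))
    return violations


if __name__ == "__main__":
    for agent, path in find_violations(SAMPLE_ROBOTS_TXT, SAMPLE_LOG):
        print("disallowed fetch:", path, "by", agent)
```

A hit in this list means the bot requested a disallowed URL, which is the evidence of crawling (as opposed to URL-only indexing) discussed earlier in the thread.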

atlrus

1:22 pm on Sep 10, 2006 (gmt 0)

We have the same problem - I have disallowed a redirect directory on our website - Google and Yahoo are staying away from it, MSN - nope.

Even more - when I do a site: search for our website - those redirect pages show up at the top of the list - always freshly crawled/indexed (whatever you want to call it)?!?

1.) We have no external links pointing to these pages.
2.) Those pages get fresh dates every day looking at the site: command.
3.) Our home page has 15000+ external links to it and content changes almost every day.
4.) Our home page gets visited once a week (on a good month).

On a similar website of ours, the redirect directory is not disallowed - and a site: search on that website shows the redirect URLs nowhere.

It is obvious you guys have a serious issue (not to mention that the site: search shows a number of pages 3-4 times more than there actually are on the website). I guess I have to disallow the entire site to ensure good and timely visits by your bot ;)

msndude

6:53 pm on Sep 11, 2006 (gmt 0)

atlrus: Do you see a cached page for your site? That would definitely mean we're crawling (rather than just indexing) your site.

If you really think we're crawling you but shouldn't be, send e-mail to msnbot at Microsoft.com. Also, if you'll send me a sticky note with the URL and search terms, I'll try to have a look myself.

Thanks, and sorry for the inconvenience.

Kelowna

3:39 am on Sep 21, 2006 (gmt 0)

[search.msn.com...]

This page is now a dead link and it shows up as the address to use to find bot info. I am trying to get msn to stop trying to get pages that have been removed long ago.

This is what they are still using as a link in my logs like...
...HTTP/1.0" 404 204 "-" "msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)"

Does anyone know the url to use to find the info to block (all) msn bots from using up worthless bandwidth?

LunaC

11:49 am on Sep 21, 2006 (gmt 0)

That would definitely mean we're crawling (rather than just indexing) your site.

Isn't that what noindex means? Don't index it?

Seriously, is there any way to NOT have a page listed? I've tried robots.txt and meta tags as well as both separately.. still pages are getting indexed that I clearly say I don't want in MSN. Google and Yahoo are both simply not listing the page (Exactly as they'd been told).

For one site one of the pages is sometimes at the top of the results when I search by site name, the real main page is just below it. (Those 2 tend to flip spots regularly)

I do link to that page from every other page on the site (it's the contact form), so I understand it might seem important from a bot perspective, but if I say noindex.. why is it indexed?

So.. how do I get MSN to really not index a page?

jdMorgan

1:30 pm on Sep 21, 2006 (gmt 0)

LunaC,

I'm seeing exactly the same thing with regard to on-page meta-robots "noindex" being ignored.

The pages are not Disallowed by robots.txt, but contain <meta name="robots" content="noindex,nofollow">, and yet they appear for a search on the domain name, i.e. "example.com".

And further, they are also my contact pages, which I would like to obscure from spammers. Also linked from most pages...

This site also got buried in the "Windows Live" results -- Although it's among the three most authoritative and popular sites on its main (admittedly niche) subject, it's been relegated to the 5th page of results, below an awful lot of one-page wonders and scrapers. I'm hoping the massive number of "noindexed" contact links in their index doesn't have anything to do with that!

Not sure if this thread needs a title change or perhaps a whole new thread, as this has to do with meta-robots compliance, not robots.txt.

Jim

JenLN79

3:16 pm on Sep 21, 2006 (gmt 0)

In related news, as far as allowing content, I've read that you should not include a robot meta tag to allow a page:

The below Robots META Tag is not required nor is it suggested in the MSNBot guidelines which clearly state that the use of the Robots META Tag is for restricting the indexing of content. [seoconsultants.com]

and then I've read you NEED a robots.txt to even get indexed

Can anyone tell me which statement is true?

[edited by: jatar_k at 5:02 pm (utc) on Sep. 21, 2006]
[edit reason] authoritative urls only thanks [/edit]

jdMorgan

4:34 pm on Sep 21, 2006 (gmt 0)

Both may be true, at least partly...

The robots.txt file and the "index,follow" robots meta-tag are two completely-separate things.

robots.txt is a small, plain-text file that goes in the same directory as your "home page."
The robots meta-tag is a line of code in the <head> section of an HTML document.

If your site does not have a robots.txt file, then your site error log is likely to be full of errors caused by search engines trying to fetch robots.txt. So you should have a robots.txt file on your site, even if it is blank. If nothing else, this will make your site error log much more useable for finding real errors on your site, which I presume you'd like to fix as quickly as possible.

SEs will interpret a blank robots.txt as carte-blanche to index the entire site. Some SEs may ignore sites without a robots.txt file because they assume the site is of low quality due to all the 404=Page Not Found errors they receive. But I really doubt that MSN/Windows Live is one of them, as the major search engines simply treat that case the same as they would a blank robots.txt file.
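As a small illustration of the "blank file means everything is allowed" reading, here is how Python's standard-library parser treats an empty robots.txt. The helper name `allowed_with_blank_robots` is made up, and this shows only that one parser's behaviour, not a statement about what any particular search engine does.

```python
from urllib.robotparser import RobotFileParser


def allowed_with_blank_robots(user_agent, url):
    """Check url against an empty robots.txt (no groups, no rules)."""
    parser = RobotFileParser()
    parser.parse([])  # a blank file contributes no Disallow lines
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(allowed_with_blank_robots("msnbot", "/cgi-bin/script.cgi?ID=1"))
```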

There are many on-page meta-tags that *are* utterly useless, such as "revisit-after". The "robots" tag is useful, but because "index,follow" is the default behaviour in the absence of any robots tag, the only thing that <meta name="robots" content="index,follow"> accomplishes is to waste bandwidth and push your real content down a few more characters.

However, the same tag with different attributes, such as <meta name="robots" content="all,noarchive,noodp"> can be quite useful in some circumstances.

Ref: [robotstxt.org...]

Jim

atlrus

5:30 pm on Sep 21, 2006 (gmt 0)

Either way - MSN does not honor the robots.txt

As I mentioned before - I have disallowed a redirect directory in my robots.txt file - Yahoo and Google have no trace of the 20 pages that are in the directory, MSN however...Argh!
When I do a site: search it shows this as the first 10 results:

- first 4 are disallowed URLs
- my home page in #5
- next 5 are disallowed URLs

PAGE 2

- all 10 results - disallowed URLs

PAGE 3

- disallowed URL
- then the rest of my website

WHAT IN THE WORLD?!?
Mind you - NO external links point to any of those pages.

jdMorgan

12:22 am on Sep 22, 2006 (gmt 0)

atlrus,

Do you see normal listings with Title and Description for those disallowed URLs in MSN, or are they URL-only?
What does your robots.txt say that applies to those pages?
What is in the on-page robots meta-tag?

Robots.txt and the on-page robots meta tags are not entirely equivalent.

A Disallow in robots.txt instructs a spider not to fetch a page; it is a bandwidth reduction mechanism.

On-page meta robots tags can be used (among other things) to tell a spider not to index a page. This has been widely-interpreted to mean "Don't include this page in your index, and don't return it as a search result."

In order to fetch and comply with an on-page robots meta tag, the page must not be Disallowed in robots.txt; the spider must be allowed to fetch the page so that it can read the on-page meta robots tag.

Jim
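The interaction between the two mechanisms can be sketched in a few lines. This is an illustrative model only, assuming a fully compliant spider; `RobotsMetaParser` and `visible_directives` are hypothetical names, and the HTML is the meta tag quoted earlier in the thread. The point it demonstrates is that a page Disallowed in robots.txt is never fetched, so its "noindex" is never seen.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        self.directives.update(
            part.strip().lower() for part in content.split(",") if part.strip()
        )


def visible_directives(page_html, disallowed_by_robots_txt):
    """Return the meta-robots directives a compliant spider would see.

    Returns None when robots.txt blocks the fetch, because the spider
    then never downloads the page and cannot read the tag at all.
    """
    if disallowed_by_robots_txt:
        return None
    parser = RobotsMetaParser()
    parser.feed(page_html)
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
    print(visible_directives(page, disallowed_by_robots_txt=False))
    print(visible_directives(page, disallowed_by_robots_txt=True))
```

So to keep a page out of the index via the meta tag, the page has to stay fetchable; combining a robots.txt Disallow with an on-page "noindex" hides the "noindex" from the spider.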