How to stop MSN from listing noindex pages?

Forum Moderators: mack

LunaC

3:00 pm on Nov 3, 2006 (gmt 0)

Since MSNbot won't listen to meta tags noindex,nofollow, what is the best way to really get it to stop listing these pages?

Some of these pages float to the top of my listings and these are pages that I really don't want listed in the SERP in any way... contact forms etc.

Right now it lists the pages anyway, as a URL-only listing, but that still definitely counts as indexed. (The tag doesn't say nocrawl.. it says noindex!)

From what I've read in MSNdude's posts, that's their interpretation of how it should be, that this is not a bug in their eyes, and therefore won't be corrected anytime soon. (If I'm misunderstanding this MSNdude, please explain.)

So, what's the best way to stop this and keep those URLs out of the index, while not damaging the rest of my site's listings?

(These pages have always had the meta tags stating noindex,nofollow, are not blocked in robots.txt and never have been.)

Receptional

11:02 am on Nov 5, 2006 (gmt 0)

Time to start using robots.txt to block it now then. MSN certainly seems to listen to robots.txt. I'd assume this is better for your server anyway: with meta instructions a crawler has to open each page before finding out it shouldn't index it, whereas robots.txt only needs reading once (assuming the crawler isn't even more stupid than the average), so it's a more efficient way to do things.

Robert Charlton

8:46 pm on Nov 5, 2006 (gmt 0)

Time to start using robots.txt to block it now then. MSN certainly seems to listen to robots.txt.

The problem with this, I think, is that different engines follow different practices. As I understand it, the best way to keep Google from indexing a reference to a page as well as the page itself is to use...

So if you block the page with a generic robots.txt for MSN, Googlebot will never get to the page to see that tag... and if there's a link to the page on an unblocked page, Google will index that link.

I suppose you could have different instructions for each bot, but that starts to get very byzantine, and I'm not sure we know all the different ways each of the engines treats this situation.
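For anyone who wants to try the per-bot approach, here is a minimal sketch of how such a robots.txt could be checked before deploying it. It uses Python's standard `urllib.robotparser`. The `/contact/` path, the `example.com` URL and the `allowed` helper are made up for illustration, and the user-agent token `msnbot` is an assumption about what the crawler identifies itself as. It only shows what a compliant parser would conclude from the file, not what MSN actually does with a disallowed URL.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block msnbot from /contact/ but leave the path
# open to every other crawler, so those bots can still fetch the page and
# read its meta noindex tag.
ROBOTS_TXT = """\
User-agent: msnbot
Disallow: /contact/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(bot: str, url: str) -> bool:
    """Return True if a compliant parser would let `bot` fetch `url`."""
    return parser.can_fetch(bot, url)


if __name__ == "__main__":
    page = "http://www.example.com/contact/form.html"  # hypothetical URL
    for bot in ("msnbot", "Googlebot"):
        print(bot, "may fetch" if allowed(bot, page) else "is blocked from", page)
```

The point of the split is the one made above: a blanket `Disallow` would also stop Googlebot from ever seeing the tag on the page, while a bot-specific section leaves the other engines on the meta-tag route.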

Jim Morgan first brought this to my attention in this thread...

Question about simple robots.txt file
[webmasterworld.com...]

There have been a huge number of discussions on the question since, including one long and heated discussion that I can't find right now that was buried in one of the update threads.

LunaC

2:48 pm on Nov 6, 2006 (gmt 0)

The problem I've found is that no matter which I use for MSN, robots.txt or meta, it still lists the url (the title is domain.com, no description) in the SERP. (The same way Google lists when blocked by robots.txt but it finds a link leading to it.)

From what MSNdude has said, it sounds like that is how MSN intends it, that it's not "crawled" if it's disallowed, but will be "indexed".

I'm really hoping MSNdude can clarify how to get these out, and why they're there in the first place.

I'm even at this point considering cloaking the pages from MSNbot to keep them out... which really seems a bit over the top when they shouldn't be listed in the first place. I'm hoping there is a simpler way with less potential to backfire.
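To make the cloaking idea concrete, here is a rough sketch of the decision logic only, assuming the crawler can be recognised by the substring `msnbot` in its User-Agent header. The names `BLOCKED_BOT_TOKENS`, `PRIVATE_PREFIXES` and `status_for` are hypothetical, and the paths are examples. User-Agent strings can be spoofed or changed, so this is not a reliable block, and whether a 404 actually gets a URL-only listing dropped is untested here.

```python
# Substrings assumed to identify the crawler to be turned away.
BLOCKED_BOT_TOKENS = ("msnbot",)

# Hypothetical path prefixes that should never appear in the listings.
PRIVATE_PREFIXES = ("/contact/",)


def status_for(user_agent: str, path: str) -> int:
    """Pick the HTTP status to answer with: 404 for a blocked bot asking
    for a private path, 200 for every other request."""
    ua = user_agent.lower()
    is_blocked_bot = any(token in ua for token in BLOCKED_BOT_TOKENS)
    is_private = any(path.startswith(prefix) for prefix in PRIVATE_PREFIXES)
    return 404 if (is_blocked_bot and is_private) else 200
```

In a real site this check would sit in whatever handles the request (a server rewrite rule or the page script itself); the function above is only the yes/no decision.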

asusplay

4:05 pm on Nov 6, 2006 (gmt 0)

I'm finding the same. I just disallow them on the robots.txt but for some reason they get insexed and in a site: search all these urls which shouldn't be there appear above the homepage.

Robert Charlton

9:22 pm on Nov 6, 2006 (gmt 0)

I just disallow them on the robots.txt but for some reason they get insexed...

asusplay - You've got to be a lot more careful about how you handle those pages. ;)

LunaC

12:06 am on Nov 7, 2006 (gmt 0)

lol I'd read that and hadn't seen the typo, I needed a giggle right now. :)

I know a lot of people are having trouble with pages that are not supposed to be indexed getting in. I've seen a lot of them rising to the top of the listings for site: too; it's happening to me as well.

[edited by: LunaC at 12:09 am (utc) on Nov. 7, 2006]

asusplay

9:14 am on Nov 7, 2006 (gmt 0)

Lol...sorry about the typo!

LunaC

4:07 pm on Nov 20, 2006 (gmt 0)

Hmm, more and more of the disallowed pages are showing up, and are getting listed before real pages in site search.

They look like:

example.com
www.example.com/folder/filename.ext
11/20/2006 Cached page

The cache link just opens a page saying "Could not find the requested document in the cache", so at least they are not caching them. But I don't want these pages listed at all.

Is cloaking against msnbot really the only way to keep it from listing pages that are disallowed?

LunaC

4:11 pm on Nov 20, 2006 (gmt 0)

Oops, forgot to also mention these pages show up by searching domain.com as well.. it's not just using the site: command.

acb123

3:31 pm on Nov 23, 2006 (gmt 0)

I see Livebot

(((livebot-65-54-188-74.search.live.com)))

not only indexing noindex/nofollow pages, but ALSO, and more aggravatingly, following all the links in my disallow list in robots.txt!

I listed a test-page in robots.txt that no other page in my site links to, and it was indexed.

So, what to do? Does one have to bow before the M$ megalopoly and let them stick their grubby fingers wherever they want, or is there a better way to keep the pernicious livebot off a page?

Any and all help appreciated

--acb