Some of these pages float to the top of my listings, and they are pages that I really don't want listed in the SERPs in any way: contact forms, etc.
Right now MSN lists the pages anyway, as URL-only listings, but that still means they are indexed. (The tag doesn't say nocrawl; it says noindex!)
From what I've read in MSNdude's posts, that's their interpretation of how it should work: this is not a bug in their eyes, and therefore it won't be corrected anytime soon. (If I'm misunderstanding this, MSNdude, please explain.)
So, what's the best way to stop this and keep those URLs out of the index, while not damaging the rest of my site's listings?
(These pages have always had the meta tags stating noindex,nofollow, are not blocked in robots.txt and never have been.)
Time to start using robots.txt to block it now then. MSN certainly seems to listen to robots.txt.
The problem with this, I think, is that different engines follow different practices. As I understand it, the best way to keep Google from indexing a reference to a page as well as the page itself is to use...
<meta name="robots" content="noindex">
So if you block the page with a generic robots.txt for MSN, Googlebot will never get to the page to see that tag... and if there's a link to the page on an unblocked page, Google will index that link.
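To make the point above concrete, here is a minimal Python sketch of how a crawler could read the robots meta tag once it has fetched a page. It only illustrates the mechanism: the class name `RobotsMetaParser`, the helper `robots_directives`, and the sample page are hypothetical, and nothing here claims to reproduce how Googlebot or MSNbot actually parse pages. The key observation is that the tag lives inside the page, so a bot that is blocked by robots.txt never downloads the HTML and never reaches this step.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None for valueless attributes.
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for token in attr_map.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


# Hypothetical page, standing in for a contact form that should stay unlisted.
sample_page = (
    '<html><head><meta name="robots" content="noindex,nofollow"></head>'
    "<body>Contact form</body></html>"
)
print(sorted(robots_directives(sample_page)))
```

A bot can only honor `noindex` if it is allowed to fetch the page and run something like the check above, which is why a robots.txt block and a noindex meta tag work against each other.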
I suppose you could have different instructions for each bot, but that starts to get very byzantine, and I'm not sure we know all the different ways each of the engines treats this situation.
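For anyone who does want to try per-bot instructions, the sketch below shows one way to sanity-check such a robots.txt with Python's standard `urllib.robotparser` before uploading it. The robots.txt content, the `/contact.html` path, and the `build_parser` helper are assumptions made up for illustration. The result only tells you what a parser following the basic robots.txt conventions concludes; it says nothing about whether a given engine will still show a URL-only listing for a disallowed page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block msnbot from one page, leave other bots
# unblocked so they can fetch the page and see its noindex meta tag.
ROBOTS_TXT = """\
User-agent: msnbot
Disallow: /contact.html

User-agent: *
Disallow:
"""


def build_parser(robots_txt):
    """Parse robots.txt text (no network access) and return the parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_parser(ROBOTS_TXT)
url = "http://www.example.com/contact.html"
for agent in ("msnbot", "Googlebot"):
    # can_fetch returns True if the agent is allowed to crawl the URL.
    print(agent, rules.can_fetch(agent, url))
```

Under these rules msnbot is refused the page while Googlebot falls through to the wildcard entry and is allowed in. Every extra per-bot block is another place for the rules to drift apart, which is the byzantine part.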
Jim Morgan first brought this to my attention in this thread...
Question about simple robots.txt file
[webmasterworld.com...]
There have been a huge number of discussions on the question since, including one long and heated discussion that I can't find right now that was buried in one of the update threads.
From what MSNdude has said, it sounds like that is how MSN intends it: a page is not "crawled" if it's disallowed, but it will still be "indexed".
I'm really hoping MSNdude can clarify how to get these out, and why they're there in the first place.
I'm even at this point considering cloaking the pages from MSNbot to keep them out... which really seems a bit over the top when they shouldn't be listed in the first place. I'm hoping there is a simpler way with less potential to backfire.
I know a lot of people are having trouble with pages that are not supposed to be indexed getting in. I've seen a lot of them rising to the top of listings for site: searches, and it's happening to me as well.
[edited by: LunaC at 12:09 am (utc) on Nov. 7, 2006]
They look like:
example.com
www.example.com/folder/filename.ext
11/20/2006 Cached page
The cache link just opens a page saying "Could not find the requested document in the cache.", so at least they are not caching them. But I don't want these pages listed at all.
Is cloaking against msnbot really the only way to keep it from listing pages that are disallowed?
(((livebot-65-54-188-74.search.live.com))) is not only indexing noindex/nofollow pages but ALSO, and more aggravatingly, following all the links in my disallow list in robots.txt!
I listed a test-page in robots.txt that no other page in my site links to, and it was indexed.
So, what to do? Does one have to bow before the M$ megalopoly and let them stick their grubby fingers wherever they want, or is there a better way to keep the pernicious livebot off a page?
Any and all help appreciated
--acb