How do you prevent msnbot from indexing a page? This page has "<meta name="robots" content="noindex,nofollow">" in the head. ALL bots are also excluded from this page via robots.txt. And yet the page IS indexed and appears on search.live.com.
Google and Yahoo ignore this page as expected. How do I prevent msnbot from indexing the page? Do I need to specifically mention msnbot in robots.txt?
I am becoming very worried about possible duplicate content problems.
Thanks.
However, if enough sites link to you, we'll index your site anyway -- without crawling you -- just because you must be interesting to someone. You need to have a LOT of links from a lot of different non-spam people for that to happen, though. It's kind of like being a celebrity; if you're famous, you can't really complain that people are talking about you.
If I say noindex, nofollow.. I DON'T want that page listed, not as a URL, not in the SERP in any way. I don't care if people are talking about it or not.
And as for "You need to have a LOT of links from a lot of different non-spam people for that to happen, though.", again, I disagree. I have pages that are only linked from my own site, that clearly are disallowed (meta noindex, nofollow.. and always have been) that are listed.
One example is a contact form. I don't want people coming in on that page since there's no good reason to contact if they haven't been to the rest of the site. The only reason people look for contact forms is to spam, find potential vulnerabilities and a myriad of other unpleasant things... I see no good reason for this to be an entry page.
This page is not linked to from any outside source. It has always had noindex,nofollow... so why is it even listed?
This isn't the only example I've found either, I have many pages that are only of use if the visitor has been *in* the site.. coming in on them is useless.. so I disallow... still, they are listed in the search listings. (Yes, URL only, but that is clearly indexed!) I also worry about duplicate content on these.
An example is a stock photo archive. The pages that are disallowed are the full image, allowed are the pages with the thumbnails and a brief, but useful description of each image. Since the full image pages would be very thin on spiderable content (simply a large image and very minimal navigation), that would be useless as a landing page, and very much open to duplicate content. Therefore.. noindex, nofollow would be added. So why would they be listed?
Still using a stock photo archive as an example, say it has 100 pages of thumbnails, each listing 10 images. MSN would list 1000 URL-only entries plus 100 descriptive listings, for a total (ignoring fluff pages such as FAQ etc.) of 1100 listings. Now I have 1100 pages to try to sift through to find the item I was looking for instead of 100.
I know from experience that if I use site: to find pages listed, the full listings do not rise to the top, not even the pages I'm most likely looking for, i.e. the most commonly viewed. How is that useful? (Don't believe me? Try site:www.microsoft.com. How is that helpful?)
What use is it to searchers to get this information? Since there must be a real reason that MSN has decided to list disallowed pages and URL only listings, I'd like to hear why MSN has decided listing this content is to the benefit of anyone, either searchers, website owners.. or ANY good reason.
As for being a celebrity and not being able to complain about people talking.. ok.. to a degree I can agree with that. I can *talk* all I want about a celebrity... but peeking into their windows, hinting at what I see and giving their private address IS NOT acceptable.
This page has "<meta name="robots" content="noindex,nofollow">" in the head. ALL bots are also excluded from this page via robots.txt. And yet the page IS indexed and appears on search.live.com.
Pages that are excluded by robots.txt can appear as URL-only results in the SERPs.
Pages that carry the "noindex" robots meta should not appear in the SERPs as URL only or anything else.
But since robots.txt is the first thing a bot checks, it never retrieves a disallowed page, and so never sees the "noindex" robots meta on it.
You have to remove the block in the robots.txt so the bot can see the "noindex".
Note that Googlebot works the same way, not sure about Y!.
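To make the conflict concrete, here is a minimal Python sketch using the standard library's urllib.robotparser. The robots.txt contents, the example.com URL, and the helper name can_see_noindex are all made up for illustration, and real crawlers may interpret robots.txt differently from this module. It only shows the general idea: a bot that is disallowed from a URL never fetches it, so it never gets to read the meta tag.

```python
from urllib.robotparser import RobotFileParser


def can_see_noindex(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if this user agent may fetch the URL (hypothetical helper).

    Only a bot that is allowed to fetch the page can read a robots
    meta tag in its head.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that blocks every bot from one page.
BLOCKING = """User-agent: *
Disallow: /contact.html
"""

# Hypothetical robots.txt that blocks nothing.
OPEN = """User-agent: *
Disallow:
"""

URL = "http://www.example.com/contact.html"

print(can_see_noindex(BLOCKING, "msnbot", URL))  # False: page is never fetched
print(can_see_noindex(OPEN, "msnbot", URL))      # True: bot can read the meta tag
```

With the first file the page can still turn up as a URL-only listing, because the bot knows the URL exists but was never allowed to read the "noindex" on it.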
I missed that when I wrote my post, however my sites don't have robots.txt blocking anything. I'm just using meta tags (noindex, nofollow) and those pages are being indexed (url only) in MSN.
My post was to find out why MSN ignores noindex, as msndude said "if enough sites link to you, we'll index your site anyway -- without crawling you -- just because you must be interesting to someone." and if there is any way to stop it from happening. Perhaps "noindex!important, nofollow" could be implemented if we really mean, "don't index this page" ;)
[edited by: LunaC at 9:00 pm (utc) on Oct. 18, 2006]
wires crossed
Yeah, mine. I thought you were the OP replying to my reply to the OP ;-).
I'm assuming that, in context, when msndude said, "if enough sites link to you, we'll index your site anyway -- *without crawling you*," he was referring to a URL-only listing when the bot is excluded by robots.txt.
If your meta robots elements are correctly formed, and if you still see URL-only listings for those pages, it's probably time to drop a dime and send msndude some concrete examples.
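Before sending examples, it may be worth checking that the tag really is well formed. The following is a rough Python sketch using the standard library's html.parser. The class and function names (RobotsMetaParser, robots_directives) are hypothetical, and this is only an approximation of how a crawler might read the tag, not a description of what msnbot actually does.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        # Split the comma-separated content value into single directives.
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in the HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


page = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
print("noindex" in robots_directives(page))  # True

# A misspelled name attribute yields no directives at all.
typo = '<html><head><meta name="robot" content="noindex"></head></html>'
print(robots_directives(typo))  # set()
```

If a check like this finds "noindex" on a page that robots.txt does not block, and the page is still listed, that is the kind of concrete example worth passing along.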
The only way to stop them is by cloaking the page for their robot, redirecting it to some other URL that has a real listing. :(
Now that they've apparently got their robots.txt compliance sorted, perhaps they can address this problem.
However, beware of a common problem before blaming MSN/Live: In order to obey an on-page meta-robots tag, the robot must be allowed to fetch that page. So do not disallow a page's URL in robots.txt if you want SEs to comply with meta-robots tags on that page.
Jim
However, MSN/Windows Live treats it just like a robots.txt Disallow, and will index a URL-only link.
Whoa, I haven't seen this behaviour as yet. Seems like MSN is taking all control out of webmasters' hands. Might be time for certain people in the industry to get together again to discuss compliance with standards, or what are supposed to be standards.
All now looks as it should. I really did see a "noindex" page in the SERPs - not just URL-only. But all have disappeared.
MSNdude - the "dup content worry" was because I have 5 different ways of displaying the same page/content. 4 have "noindex" set - so SEs only get the one standard page (users can reformat if they like after they have arrived). But last week I noticed one of the "noindex" pages in search.msn.com - hence the panic.
And thanks for putting me straight on robots.txt/noindex meta conflict. Will let bots in to see the noindex.
Thanks all.