Forum Moderators: open
[The site exists on its own, and also provides content to a bunch of major newspapers, some of which have very high PageRank. Blocking the spiders was the only way to prevent a bunch of mirrors with dupe content from cluttering up the serps].
This worked perfectly well up until a month or two ago. We're now beginning to see the urls of these blocked subdomains appearing in the rankings for searches on phrases that appear in the anchor text of links pointing to the blocked pages.
Only the urls, not the page titles, appear in the serps. There are also no cached pages, which confirms my feeling that the robots.txt is being observed and the pages aren't being spidered — but for whatever reason the links to them are being indexed anyway.
Any thoughts? GoogleGuy... if you're not too busy with the update thread, what's supposed to be happening with Googlebot here?
Not written for your exact situation, but this thread [webmasterworld.com] might help (details in post #12).
Jim
If Google finds a robots.txt Disallow for a page, it will remove the page's title and description from its search results. It will also no longer match search terms to the words on that page, so the page essentially disappears from the Google search results pages. However, if Google finds a link to that page, it will still show that page in results when someone clicks on "More results from <this domain>".

I went around and around with this, trying to find a way to tell them "don't mention my contact-form pages at all, please," and here's what I ended up with:
For Google, don't Disallow the page in robots.txt; instead, place a <meta name="robots" content="noindex"> tag in the head section of the page itself.

You'll also need to do this for Ask Jeeves/Teoma; their handling of robots.txt is the same as Google's.
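A minimal sketch of what the head of such a page might look like. The key point is that Googlebot has to be able to fetch the page in order to see the tag, which is why the page must not also be Disallowed in robots.txt (the page title here is just a placeholder):

```html
<head>
  <title>Contact Form</title>
  <!-- Googlebot must crawl the page to see this tag,
       so do NOT also block the page in robots.txt -->
  <meta name="robots" content="noindex">
</head>
```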
All the others seem to interpret a robots.txt Disallow as "don't mention this page at all."
GoogleGuy - Since you've asked in the past for suggestions for improving Google's serps, I'd suggest that less aggressive indexing here would be helpful. I can't imagine why Google would want to return a link to a blocked page.
Per Jim's post, will placing "noindex" on our subdomain pages cause Google to drop the url link from the serps page? We really don't want people reaching those pages via search. It will be difficult, though, since we'll have to come up with some kind of SSI that places the <meta name="robots" content="noindex"> tag only on the subdomain pages.
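One way the SSI could work, as a sketch: if the subdomains are served from the same documents as the main site, Apache's mod_include can test the Host header and emit the tag only when the request came in on a subdomain. The `news.` pattern below is just a hypothetical example — substitute whatever your subdomains actually look like:

```html
<!-- In the shared head include; requires Options +Includes
     and SSI parsing enabled for these pages -->
<!--#if expr="$HTTP_HOST = /^news\./" -->
<meta name="robots" content="noindex">
<!--#endif -->
```

Pages served from the main hostname would get no tag at all, while the same file requested via the subdomain would carry the noindex.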