
Problem with Googlebot and robots.txt?

Google indexing links to blocked urls even though it's not following them

Robert Charlton

12:04 am on Apr 12, 2003 (gmt 0)

A while back I dealt with a bunch of co-branded mirror subdomains on a non-profit educational site by blocking all robots from those subdomains.

[The site exists on its own, and also provides content to a bunch of major newspapers, some of which have very high PageRank. Blocking the spiders was the only way to prevent a bunch of mirrors with dupe content from cluttering up the serps].
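A subdomain-wide block of the kind Robert describes looks something like this (the hostname is hypothetical; each mirror subdomain serves its own copy of this file at its root):

```
# robots.txt served at http://mirror.example.org/robots.txt
# Blocks every compliant crawler from the entire host
User-agent: *
Disallow: /
```

Note that this only forbids *fetching* pages on the host; it says nothing about whether a search engine may list a bare URL it discovered through links elsewhere.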

This worked perfectly well until a month or two ago. We're now beginning to see the urls of these blocked subdomains appearing in the rankings on searches for phrases that appear in the anchor text of links pointing to blocked pages.

Only the urls, not the page titles, appear in the serps. Also, there are no cached pages, confirming my feeling that the robots.txt is being observed and that the pages aren't being spidered, but that for whatever reason the links are being indexed.
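Robert's observation (url only, no title, no cached copy) is consistent with the Disallow being honored. As a side illustration, not something from the thread, here is how a blanket Disallow evaluates using Python's standard-library urllib.robotparser (hostname hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Rules equivalent to a subdomain-wide block: User-agent: * / Disallow: /
rules = [
    "User-agent: *",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(rules)

# A compliant crawler may not fetch any page on the host...
print(rp.can_fetch("Googlebot", "http://mirror.example.org/page.html"))  # False
# ...but robots.txt says nothing about listing the bare URL discovered
# via links, which matches the URL-only entries Robert sees in the serps.
```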

Any thoughts? GoogleGuy... if you're not too busy with the update thread, what's supposed to be happening with Googlebot here?

jdMorgan

12:12 am on Apr 12, 2003 (gmt 0)

Robert_Charlton,

Not written for your exact situation, but this thread [webmasterworld.com] might help (details in post #12).

Jim

GoogleGuy

12:31 am on Apr 12, 2003 (gmt 0)

If we have evidence that a page is good, we can return that reference even though we haven't crawled the page.

Robert Charlton

12:48 am on Apr 12, 2003 (gmt 0)

Jim and GoogleGuy - Thanks for your quick feedback. From msg #15 that Jim had posted in the above-referenced thread...

If Google finds a robots.txt Disallow for a page, it will remove the page's title and description from its search results. It will also no longer match search terms to the words on that page. So, the page essentially disappears from the Google search results pages. However, if Google finds a link to that page, it will still show that page in results when someone clicks on "More results from <this domain>".

I went around and around with this, trying to find a way to tell them "don't mention my contact forms pages at all, please", and here's what I ended up with:
For Google, don't Disallow the page in robots.txt, but place a <meta name="robots" content="noindex"> tag in the head section of the page itself.

You'll need to do this for Ask Jeeves/Teoma as well; their handling of robots.txt is the same as Google's.
All the others seem to interpret a robots.txt Disallow as "don't mention this page at all."

GoogleGuy - Since you've asked in the past for suggestions for improving Google's serps, I'd suggest that less aggressive indexing here would be helpful. I can't imagine why Google would want to return a link to a blocked page.

Per Jim's post, will placing "noindex" in our subdomain pages cause Google to drop the url link from the serps page? We really don't want people going there via search. It will be difficult, since we'll have to come up with some kind of SSI that places the <meta name="robots" content="noindex"> tag only on the subdomains.
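The SSI Robert mentions could be sketched with Apache's mod_include, keying on the Host header so only the mirror subdomains emit the tag. This is an assumption about the setup, not the site's actual configuration; the hostname, and the requirement that the shared head fragment be parsed for includes (Options +Includes), are both hypothetical:

```
<!-- In the shared <head> include; requires mod_include and Options +Includes -->
<!--#if expr="${HTTP_HOST} != www.example.org" -->
<meta name="robots" content="noindex">
<!--#endif -->
```

On the canonical host the condition fails and nothing is emitted; on every co-branded mirror subdomain the noindex tag is written into the head of each page.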

Web_Player

1:23 am on Apr 12, 2003 (gmt 0)

As I recall, if another site links to a blocked page, that page can get listed because the spider follows that link and does not see the robots.txt file.

Also, I have noticed that Google does seem to always heed the meta robots tag on the page.