Forum Moderators: Robert Charlton & goodroi


Does GoogleBase override robots.txt?


defanjos

5:42 pm on Apr 16, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have this annoying problem - Google stopped following my robots.txt.

For example I have a folder "/folder/" disallowed like:
User-agent: Googlebot
Disallow: /folder/

But Google keeps showing urls in the format "/folder/1234.htm" in the index.
When I do site:example.com, over a thousand of those urls show. I want them gone.
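As a sanity check on the rule itself (not on Google's behavior), Python's standard-library robot parser can confirm the disallow line parses the way it was intended. The example.com url is of course just a placeholder:

```python
from urllib.robotparser import RobotFileParser

# Parse the same two rules inline instead of fetching a live robots.txt
rp = RobotFileParser()
rp.parse([
    "User-agent: Googlebot",
    "Disallow: /folder/",
])

# Googlebot is blocked from anything under /folder/ ...
print(rp.can_fetch("Googlebot", "http://example.com/folder/1234.htm"))  # False
# ... while agents with no matching group are not
print(rp.can_fetch("OtherBot", "http://example.com/folder/1234.htm"))   # True
```

So the syntax is fine; the question is why Google is indexing the urls anyway.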

I am wondering if it has to do with Googlebase. I have uploaded all the unwanted urls using a datafeed.
Basically, I want them to show in Googlebase, but not in the Google index.
In case you are wondering why I want that: I rank a lot better when those urls are not in the index.

I even uploaded a sitemap to webmaster central with the URLs I want, but still no good.

tedster

8:39 pm on Apr 16, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Do those pages actually get spidered and their content shown in the SERPs? Or are they displayed as "url-only"?

[edited by: tedster at 9:53 pm (utc) on April 17, 2009]

defanjos

11:40 pm on Apr 16, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



When I do a site:example.com, the majority of results show title, desc, and url. Some show only the url, but they are in the minority.

None of the unwanted pages rank, as far as I can tell. I just know that when they are in the index, they screw up my rankings for the main pages.

tedster

2:26 am on Apr 17, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If a robots.txt disallowed url shows up in the SERPs, even as a url-only listing, you can request its removal through the tool inside your WebmasterTools account.

defanjos

3:22 pm on Apr 17, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I thought about that, but Google says the pages have to return a 404, and I can't do that because I need the pages live for Googlebase and the other SEs.

Thanks for the ideas thus far.

tedster

9:57 pm on Apr 17, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member




Google says they have to return a 404

You may have misunderstood something. A robots.txt rule is definitely a way to remove urls from the index. See this Google Webmaster Help page: Removing my own content from Google's index [google.com]

defanjos

10:57 pm on Apr 17, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I was referring to the WebmasterTools only - they write:

To do this, ensure that each page returns an HTTP status code of either 404 or 410, or use a robots.txt file or meta noindex tag to block crawlers from accessing your content.
If you're requesting removal of a full site or directory, you must use a robots.txt file to block crawlers from accessing this content.

So I basically did what they ask - I used robots.txt to block the directory - but it is not working. I could enter over 1000 urls manually into the webmaster tools, but that is a pain.
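For what it's worth, the meta noindex route that the quoted help text mentions would look like this in the <head> of each page. The catch is that a crawler has to be able to fetch the page to see the tag, so it only works on pages that are not also disallowed in robots.txt:

```html
<!-- Keeps the page live (and visible to other crawlers) but asks
     Google's crawler not to index it -->
<meta name="googlebot" content="noindex">
```

Using name="googlebot" instead of name="robots" targets only Google, which fits the goal of staying indexed in the other SEs.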

Anyway, I will wait a month and see if Google removes them again.

The weird part is, they used to be gone, but suddenly started being indexed again, thus the reason for my original question.

tedster

11:19 pm on Apr 17, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I could enter over 1000 urls manually into the webmaster tools, but that is a pain.

Agreed. This is a frustration with the current implementation of url removal. I'm pretty sure that back when the only removal tool was the public one, you could just select "remove any urls blocked in robots.txt" -- or something along those lines -- and that took care of the wildcard effect in robots.txt. It should still work that way, IMO.