Forum Moderators: Robert Charlton & goodroi
For example, I have a folder "/folder/" disallowed like this:
User-agent: Googlebot
Disallow: /folder/
But Google keeps showing URLs in the format "/folder/1234.htm" in the index.
When I do a site:example.com search, over a thousand of those URLs show up. I want them gone.
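As a sanity check on the rule itself, Python's standard-library robots.txt parser can confirm that the Disallow line above really does block Googlebot from that whole directory (example.com here is just a stand-in for the real domain):

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the robots.txt quoted above.
rules = """\
User-agent: Googlebot
Disallow: /folder/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# URLs under /folder/ are blocked for Googlebot; everything else is allowed.
print(parser.can_fetch("Googlebot", "http://example.com/folder/1234.htm"))  # False
print(parser.can_fetch("Googlebot", "http://example.com/other.htm"))        # True
```

If this prints False for the /folder/ URLs, the rule is syntactically fine, and the problem is how Google treats blocked-but-known URLs, not the robots.txt file.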
I am wondering if it has to do with Google Base; I uploaded all of the unwanted URLs there via a data feed.
Basically, I want them to show in Google Base, but not in the Google index.
In case you are wondering why I want that, it is because I rank a lot better when those URLs are not in the index.
I even uploaded a sitemap to Webmaster Central containing the URLs I do want indexed, but still no good.
Google says they have to return a 404
You may have misunderstood something. A robots.txt rule is definitely a way to remove URLs from the index. See this Google Webmaster Help page: Removing my own content from Google's index [google.com]
To do this, ensure that each page returns an HTTP status code of either 404 or 410, or use a robots.txt file or meta noindex tag to block crawlers from accessing your content.
If you're requesting removal of a full site or directory, you must use a robots.txt file to block crawlers from accessing this content.
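For the 404/410 route Google mentions, it's worth verifying what your server actually returns for those pages. A minimal sketch using only the standard library (the URL in the comment is hypothetical):

```python
import urllib.error
import urllib.request

def status_of(url):
    """Return the HTTP status code the server sends for url (HEAD request)."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the code is what we want to inspect.
        return err.code

# Pages you want dropped via this route should report 404 or 410, e.g.:
#   status_of("http://example.com/folder/1234.htm") in (404, 410)
```

If the pages still return 200, Google has no status-code signal to remove them, and you're relying on the robots.txt block alone.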
So I basically did what they want: I used robots.txt to block the directory, but it is not working. I could enter over 1000 URLs manually into Webmaster Tools, but that is a pain.
Anyway, I will wait a month and see if Google removes them again.
The weird part is that they used to be gone, but they suddenly started being indexed again, which is why I asked my original question.
> I could enter over 1000 urls manually into the webmaster tools, but that is a pain.
Agreed. This is a frustration in the current implementation of URL removal. I'm pretty sure that back when the old public removal tool was the only option, you could just "remove any URLs blocked in robots.txt" -- or something along those lines -- and that took care of the wildcard effect in robots.txt. It should still work that way, IMO.