My robots.txt file reads:
User-agent: *
Disallow: /inest/
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /css/
Disallow: /js/
Disallow: /hg/
As a back up, I have on each page...
<meta name="robots" content="noindex,nofollow" />
But Google has nonetheless begun indexing pages in this folder...
<snip>
Any suggestions or explanations would be greatly appreciated. Thanks!
[edited by: Woz at 11:33 pm (utc) on June 28, 2005]
[edit reason] No URLs please, see Tos#13 [/edit]
The URLs that are listed there are for a folder called /texas/, which isn't in the robots.txt sample you listed.
Also note that it is only listing the URLs and not any page titles or cached contents. That means it found the URLs somewhere, but when it fetched the pages it found your meta tags, and hence hasn't indexed the contents or attempted to follow any links from those pages.
If you want those URLs removed, add Disallow: /texas/ to your robots.txt, and visit [google.co.nz ] for instructions on how to get the URLs removed.
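To illustrate the point about a bot fetching a page and finding the meta tags, here is a minimal Python sketch of how a crawler might read the robots meta tag. The class and function names (RobotsMetaParser, robots_directives) are made up for this example, and real crawlers almost certainly do more than this.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are also routed here.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def robots_directives(html):
    """Return the set of robots directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# The tag quoted in the original post.
page = '<html><head><meta name="robots" content="noindex,nofollow" /></head></html>'
directives = robots_directives(page)
print(sorted(directives))
```

The important part is that the bot has to be allowed to request the page before it can see any of this.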
User-agent: *
Disallow: /inest/
The above tells compliant spiders not to crawl anything in the directory /inest/,
so if /texas/ is located within /inest/ then it is disallowed for all bots.
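You can check this yourself with Python's standard urllib.robotparser, as a rough sketch. The rules are the ones posted above; the host www.example.com and the page name page.html are placeholders I made up, since the real URLs were snipped.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules as posted at the top of the thread.
rules = """\
User-agent: *
Disallow: /inest/
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /css/
Disallow: /js/
Disallow: /hg/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def allowed(path):
    """True if a compliant bot may fetch this path (placeholder host)."""
    return parser.can_fetch("Googlebot", "http://www.example.com" + path)


print(allowed("/inest/texas/page.html"))  # inside a disallowed directory
print(allowed("/texas/page.html"))        # top-level /texas/ is not listed
```

If /texas/ sits at the top level rather than inside /inest/, none of the posted rules match it, so a compliant bot is free to fetch it.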
>> why do you think google found these pages
Well, they can find them. Are they linked from anywhere? I assume they are, or they wouldn't be found at all. Spiders read content they aren't allowed to all the time; that doesn't mean they cache it, rank it, or even give it a description. Those URLs won't show up in any search results except a specific search for pages from your site.
Yes, but with rel="nofollow" included.
>>Was it before those pages were uploaded or after?
Robots.txt file uploaded before.
I did just use the Google remove URLs tool via "Remove pages, subdirectories or images using a robots.txt file."
I'd like to keep google away from these pages as they include meta refresh tags (pages used for tracking only) and I know G is not fond of these.
There seems to be a gap between what robots.txt does and what the meta tags do.
The robots.txt tells the bots that they aren't allowed to request certain pages; however, this doesn't stop them from listing the URLs.
The noindex META tag tells search engines you don't want that page in their index; however, if you've disallowed that page in robots.txt, the bot will never request that page and read those META tags, and hence might still list the URL.
So currently the only way to make sure that URLs don't get listed is to have the links to those URLs appear only on pages that are themselves noindex,nofollow or disallowed. This would be rather hard to ensure, as someone else might link to the page and then the search engines will find the link.
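The gap described above can be written out as a small decision function. This is only a sketch of the reasoning in this post, not a description of how any particular search engine actually behaves; the function name and the outcome strings are invented for illustration.

```python
def listing_outcome(disallowed_in_robots, has_noindex_meta, link_is_discoverable):
    """Model the robots.txt / noindex interaction described in this post.

    Hypothetical simplification: all three inputs are booleans.
    """
    if not link_is_discoverable:
        # No followable link anywhere, so the bot never learns the URL.
        return "URL never discovered"
    if disallowed_in_robots:
        # The bot may not request the page, so it never sees any META tag.
        return "page not fetched; URL may still be listed"
    if has_noindex_meta:
        # The bot fetches the page, reads the tag, and can honor it.
        return "page fetched; noindex can be honored"
    return "page fetched; eligible for indexing"


# A page with both protections in place: the noindex tag is never read.
print(listing_outcome(True, True, True))
```

Note that the second argument makes no difference once the page is disallowed, which is exactly the gap.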