Google ignoring my robots.txt?

homesby

8:47 pm on Jun 28, 2005 (gmt 0)

I'm trying to keep Google from indexing any pages within a specific directory named "inest".

My robots.txt file reads:

User-agent: *
Disallow: /inest/
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /css/
Disallow: /js/
Disallow: /hg/

As a backup, I have on each page...

But Google has nonetheless begun indexing pages in this folder...

<snip>

Any suggestions or explanations would be greatly appreciated. Thanks!

[edited by: Woz at 11:33 pm (utc) on June 28, 2005]
[edit reason] No URLs please, see Tos#13 [/edit]

Dijkgraaf

11:03 pm on Jun 28, 2005 (gmt 0)

Hi homesby

The URLs that are listed there are for a folder called /texas/ which isn't in the robots.txt sample you listed.
Also note that it is only listing the URLs and not any page titles or cached contents. That means it found the URLs somewhere, but when it fetched the pages it found your meta tags, and hence hasn't indexed the contents or attempted to follow any links from those pages.
If you want those URLs removed, add Disallow: /texas/ to your robots.txt, and visit [google.co.nz ] for instructions on how to get the URLs removed.

homesby

11:22 pm on Jun 28, 2005 (gmt 0)

Thank you Dijkgraaf. I assumed (mistakenly) that Disallow: /inest/ would include any folders within the inest folder.

Question: would it be correct to use...

Disallow: /inest/texas/

...since I have other folders named texas within other directories that I do want indexed?

jatar_k

11:28 pm on Jun 28, 2005 (gmt 0)

Welcome to WebmasterWorld homesby,

>> would include any folders within the inest folder

that's correct it should include anything with /inest/ in the path

homesby

11:33 pm on Jun 28, 2005 (gmt 0)

Thanks for the welcome, Jatar_k.

>>that's correct it should include anything with /inest/ in the path

Do you mean to say that I was not mistaken in my assumption and that what Dijkgraaf posted is not accurate?

If so, then why do you think google found these pages?

jatar_k

11:40 pm on Jun 28, 2005 (gmt 0)

I realize now that I didn't word that very well, sorry.

User-agent: *
Disallow: /inest/

the above tells compliant spiders to not spider anything in the directory /inest/

so if /texas/ is located within /inest/ then it is disallowed to all bots
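To check the prefix behaviour described above, here is a minimal sketch using Python's standard urllib.robotparser. The file names under /inest/texas/ and /texas/ are made up for illustration, and this only shows how one compliant parser reads the rules, not how Googlebot itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Part of the robots.txt quoted in the thread
ROBOTS_TXT = """\
User-agent: *
Disallow: /inest/
Disallow: /cgi-bin/
"""


def is_allowed(path, user_agent="Googlebot"):
    """Return True if a compliant crawler may request `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


# Hypothetical page names, used only to exercise the rules
print(is_allowed("/inest/texas/page.html"))  # False: path starts with /inest/
print(is_allowed("/texas/page.html"))        # True: no rule matches this prefix
```

So Disallow: /inest/ already covers /inest/texas/, while a top-level /texas/ folder stays crawlable.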

>> why do you think google found these pages

Well, they can find them, but are they linked from anywhere? I assume they are, or they wouldn't be found at all. Spiders read content they aren't allowed to all the time; that doesn't mean they cache it, rank it, or even give it a description. Those won't show up in any search results except a specific search for pages from your site.

Dijkgraaf

11:40 pm on Jun 28, 2005 (gmt 0)

Sorry googleboy, didn't spot that /inest/ was in the first part of the URL, my mistake.
When did you add that into robots.txt?
Was it before those pages were uploaded or after?

homesby

12:08 am on Jun 29, 2005 (gmt 0)

>>are they linked from anywhere?

Yes, but with rel="nofollow" included.

>>Was it before those pages were uploaded or after?

Robots.txt file uploaded before.

I did just use the Google remove URLs tool via "Remove pages, subdirectories or images using a robots.txt file."

I'd like to keep google away from these pages as they include meta refresh tags (pages used for tracking only) and I know G is not fond of these.

Dijkgraaf

12:17 am on Jun 29, 2005 (gmt 0)

Possibly Google just listed the URLs and never actually requested the pages. You could confirm this by looking at your web logs if you have access to those.
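A rough sketch of that log check in Python, assuming the server writes Apache "combined" format logs. The sample lines, IP addresses and page name below are invented for illustration; on a real site you would read the lines from your own access log instead.

```python
import re

# Hypothetical sample lines in Apache "combined" log format
SAMPLE_LOG = [
    '192.0.2.10 - - [28/Jun/2005:20:01:02 +0000] "GET /robots.txt HTTP/1.0" '
    '200 120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.10 - - [28/Jun/2005:20:01:09 +0000] "GET /inest/texas/page.html HTTP/1.0" '
    '200 2048 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '198.51.100.7 - - [28/Jun/2005:20:05:30 +0000] "GET /inest/texas/page.html HTTP/1.1" '
    '200 2048 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
]

# Captures the requested path, the status code and the user-agent string
LOG_PATTERN = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) .*"(?P<agent>[^"]*)"$'
)


def googlebot_hits(lines, prefix="/inest/"):
    """Return the paths under `prefix` requested by a Googlebot user agent."""
    hits = []
    for line in lines:
        match = LOG_PATTERN.search(line)
        if not match:
            continue
        if "Googlebot" in match.group("agent") and match.group("path").startswith(prefix):
            hits.append(match.group("path"))
    return hits


print(googlebot_hits(SAMPLE_LOG))
```

An empty result would suggest the URLs were listed without the pages ever being requested. Note that a user-agent string can be faked, so this is only a first check.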

There seems to be a gap between what robots.txt does and what the meta tags do.
The robots.txt tells the bots that they aren't allowed to request certain pages; however, this doesn't stop them from listing the URLs.
The noindex META tag tells search engines you don't want that page in their index; however, if you've disallowed that page in robots.txt, a compliant bot will never request the page or read those META tags, and hence might still list the URL.
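That gap can be written out as a toy model in Python. This is only a sketch of the reasoning in this post, not of how any search engine actually works; the function name, the outcome labels and the page name are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def listing_outcome(robots_txt, path, page_has_noindex, user_agent="Googlebot"):
    """Toy model: what a compliant engine could do with a URL it found via a link.

    Returns one of three made-up labels:
      "url_only"    - page is disallowed, never fetched, so any noindex tag is never seen
      "not_indexed" - page was fetched and its noindex tag was honoured
      "indexed"     - page was fetched and carries no noindex tag
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(user_agent, path):
        return "url_only"
    if page_has_noindex:
        return "not_indexed"
    return "indexed"


rules = "User-agent: *\nDisallow: /inest/\n"

# Disallowed AND noindex: the tag cannot help, because the page is never read
print(listing_outcome(rules, "/inest/texas/page.html", page_has_noindex=True))

# Not disallowed, noindex present: the tag is read and honoured
print(listing_outcome("", "/inest/texas/page.html", page_has_noindex=True))
```

In this model, combining a robots.txt Disallow with a noindex tag on the same page is exactly the case where the tag has no effect.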

So currently the only way to make sure that URLs don't get listed is to have the links to those URLs appear only on pages that carry noindex,nofollow or are themselves disallowed. This would be rather hard to ensure, as someone else might link to the page and then the search engines will find that link.