Need Clarification On Directory Exclusions

jonrichd

11:54 am on Jun 13, 2005 (gmt 0)

I'm working with a site that uses the following structure for links:

/maintopic/
/maintopic/subtopic/
/maintopic/subtopic2/
/maintopic/subtopic2/morestuff/

In other words, every link is to a directory, with subdirectories containing additional content about maintopic.

I have the following line in robots.txt:
Disallow: /maintopic/subtopic2/

However, G has indexed /maintopic/subtopic2/morestuff/

My understanding, from this thread [webmasterworld.com] and other things I have read, is that disallowing /maintopic/ should disallow any URL starting with /maintopic/, including subdirectories.

What am I missing?
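
The prefix behaviour described above can be checked locally. This is only a minimal sketch using Python's standard urllib.robotparser; the "User-agent: *" line is an assumption, since the post quotes only the Disallow line, and the helper name is made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rule quoted in the post. The User-agent line is assumed;
# the post does not show which group the rule sits under.
ROBOTS_TXT = """\
User-agent: *
Disallow: /maintopic/subtopic2/
"""

def is_crawlable(robots_txt: str, agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

# Disallow matches by path prefix, so the subdirectory is covered too.
for path in (
    "/maintopic/",
    "/maintopic/subtopic/",
    "/maintopic/subtopic2/",
    "/maintopic/subtopic2/morestuff/",
):
    print(path, is_crawlable(ROBOTS_TXT, "Googlebot", path))
```

If this reports the morestuff path as not crawlable, the rule itself is fine and the cause lies elsewhere.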

Dijkgraaf

8:40 pm on Jun 13, 2005 (gmt 0)

Have you validated your robots.txt file?

Is that line under a
User-agent: *
or under a
User-agent: googlebot
?

Did you create and upload /maintopic/subtopic2/morestuff/ before you put the exclusion in your robots.txt?
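
Why the User-agent question matters: a robot that finds a group naming it will normally use that group and ignore the * group. A rough sketch with Python's urllib.robotparser follows; the file contents, the /private/ path, and the helper name are all hypothetical, not taken from the site in question.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: the Disallow sits under "*", but a separate
# googlebot group exists and does not repeat it.
ROBOTS_TXT = """\
User-agent: googlebot
Disallow: /private/

User-agent: *
Disallow: /maintopic/subtopic2/
"""

def allowed(robots_txt: str, agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

path = "/maintopic/subtopic2/morestuff/"
# The googlebot group applies, and this path is not listed in it.
print("googlebot:", allowed(ROBOTS_TXT, "googlebot", path))
# A robot with no group of its own falls back to the * group.
print("SomeOtherBot:", allowed(ROBOTS_TXT, "SomeOtherBot", path))
```

In a file like this, the rule would have to be repeated under the googlebot group to apply to it.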

Reid

6:02 am on Jun 14, 2005 (gmt 0)

Googlebot found /maintopic/subtopic2/morestuff/
from an inbound link.
IBLs can cause Googlebot to bypass a disallow in robots.txt.
Perhaps the new Google Sitemaps is the cure for this problem.

Dijkgraaf

10:51 am on Jun 14, 2005 (gmt 0)

If inbound links could cause it to bypass robots.txt, then robots.txt would be useless.
Somehow I would think there must be another factor at play.

Lord Majestic

1:39 pm on Jun 14, 2005 (gmt 0)

IBLs can cause Googlebot to bypass a disallow in robots.txt.

It's not a bypass: robots.txt controls whether a robot crawls the site or not. It is not a blanket prohibition on linking to the site from other sites, and the search engine itself is one of those sites.

Reid

5:02 pm on Jun 14, 2005 (gmt 0)

If googlebot follows an inbound link to your site it will list that URL (as a URL that the other site links to) without checking your robots.txt

Lord Majestic

5:16 pm on Jun 14, 2005 (gmt 0)

If googlebot follows an inbound link to your site it will list that URL (as a URL that the other site links to) without checking your robots.txt

In the case of URL-only results, the bot did not actually follow anything. In fact it has nothing to do with the bot or crawling: it's the indexer that finds a URL pointing to your site and decides to add it to the index, thus giving it a chance to appear in the search results.

This behavior does not contradict robots.txt in any shape or form; in fact, it does not even fall under the "jurisdiction" of robots.txt.

Dijkgraaf

8:40 pm on Jun 14, 2005 (gmt 0)

So jonrichd, did Google just list the link, or does it have a cached copy of your page in its index?

Reid

1:22 am on Jun 15, 2005 (gmt 0)

Thanks for the clarification, LM.

What happens is Googlebot spiders a page that links to you, so it adds your URL to its list (without even visiting your site) and 'knows' about the URL.
Then it will try to spider that URL but will be stopped by robots.txt.
You end up with a URL-only listing until Google decides to remove it, if ever.
That's why I'm wondering if the new Google Sitemaps feature would be the cure for this problem.
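
The sequence described above can be pictured with a toy model. This is purely illustrative and not how Google's pipeline is actually built; the function name, the "toybot" agent, and the status strings are invented, and the robots.txt group line is assumed.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /maintopic/subtopic2/
"""

def build_index(discovered_paths, robots_txt, agent="toybot"):
    """Toy model: every discovered URL is listed; only crawlable ones get content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    index = {}
    for path in discovered_paths:
        if parser.can_fetch(agent, path):
            index[path] = "full listing (page fetched)"
        else:
            index[path] = "URL-only listing (fetch blocked by robots.txt)"
    return index

# Paths learned from links on other sites; no request to our site
# is needed to learn that they exist.
found = ["/maintopic/subtopic/", "/maintopic/subtopic2/morestuff/"]
for path, status in build_index(found, ROBOTS_TXT).items():
    print(path, "->", status)
```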

jdMorgan

5:00 am on Jun 18, 2005 (gmt 0)

The cure for the problem is to allow the robot to fetch the page, but use the on-page <meta name="robots" content="noindex"> tag to tell the robot not to list the URL or the page information in search results.

As LordMajestic stated above, robots.txt tells robots not to fetch a page. It says nothing about listing that page's URL in search results. It was intended for bandwidth control, not search results presentation control, and certainly not security (since a malicious robot can simply ignore robots.txt completely).

Jim
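
One consequence of the advice above is that the noindex tag only helps if the robot is allowed to fetch the page and read it. The following is a rough sketch of that check using Python's standard library; the helper names, the sample page, and both robots.txt variants (including the /cgi-bin/ path) are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

class RobotsMetaFinder(HTMLParser):
    """Collects the directives of any <meta name="robots" content="..."> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )

def noindex_is_effective(robots_txt, agent, path, html):
    """True only if the page carries noindex AND the robot may fetch it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives and parser.can_fetch(agent, path)

PAGE = '<html><head><meta name="robots" content="noindex"></head><body></body></html>'
PATH = "/maintopic/subtopic2/morestuff/"
BLOCKING = "User-agent: *\nDisallow: /maintopic/subtopic2/\n"
OPEN = "User-agent: *\nDisallow: /cgi-bin/\n"

# With the Disallow in place the tag is never read; without it, it is.
print("blocked by robots.txt:", noindex_is_effective(BLOCKING, "Googlebot", PATH, PAGE))
print("fetch allowed:", noindex_is_effective(OPEN, "Googlebot", PATH, PAGE))
```

So the Disallow line for that directory would need to come out of robots.txt for the meta tag to do its job.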