Forum Moderators: goodroi
/maintopic/
/maintopic/subtopic/
/maintopic/subtopic2/
/maintopic/subtopic2/morestuff/
In other words, every link is to a directory, with subdirectories containing additional content about maintopic.
I have the following line in robots.txt:
Disallow: /maintopic/subtopic2/
However, G has indexed /maintopic/subtopic2/morestuff/
My understanding, from this thread [webmasterworld.com] and other things I have read, is that disallowing /maintopic/ should disallow any URL starting with /maintopic/, including subdirectories.
What am I missing?
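Your understanding of the matching rule is correct: a Disallow line is a prefix match, so it covers everything below that path. A minimal sketch with Python's standard-library robots.txt parser (the paths here are from your example; the robots.txt content is assumed from your post):

```python
import urllib.robotparser

# The robots.txt described in the question, supplied inline for the demo.
robots_txt = """\
User-agent: *
Disallow: /maintopic/subtopic2/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Disallow is a prefix match: everything under /maintopic/subtopic2/ is blocked.
print(rp.can_fetch("*", "/maintopic/subtopic2/morestuff/"))  # False (blocked)
print(rp.can_fetch("*", "/maintopic/subtopic/"))             # True  (allowed)
```

So a compliant crawler would not fetch /maintopic/subtopic2/morestuff/; as the replies below explain, the issue is that a URL can be listed without ever being fetched.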
If Googlebot follows an inbound link to your site, it will list that URL (as a URL that the other site links to) without checking your robots.txt.
In the case of URL-only results, the bot did not actually follow anything; in fact, it has nothing to do with the bot or crawling. It's the indexer that finds a URL pointing to your site and decides to add it to the index, giving it a chance to appear in the search results.
This behavior does not contradict robots.txt in any shape or form; in fact, it does not even fall within the "jurisdiction" of robots.txt.
What happens is that Googlebot spiders a page that links to you, so it adds your URL to its list (without even visiting your site) and thus 'knows' about the URL.
Then it will try to spider that URL but will be stopped by robots.txt.
You end up with a URL-only listing until Google decides to remove it - if ever.
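The discover-then-block flow described above can be sketched as follows. This is a hypothetical simplification of the crawl pipeline, not Google's actual code; the function name and return strings are invented for illustration:

```python
import urllib.robotparser

def process_discovered_url(path: str, robots_lines: list[str]) -> str:
    """Simulate what happens to a URL discovered via an external link."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_lines)
    if rp.can_fetch("*", path):
        # Crawler may fetch the page, so the index gets title, snippet, etc.
        return "fetched: indexed with content"
    # Fetch is forbidden, but the URL itself was learned from the linking
    # page, so it can still appear as a bare, URL-only listing.
    return "blocked: URL-only listing"

robots = ["User-agent: *", "Disallow: /maintopic/subtopic2/"]
print(process_discovered_url("/maintopic/subtopic2/morestuff/", robots))
# -> blocked: URL-only listing
```

The key point the sketch makes: the URL enters the system before robots.txt is ever consulted, so robots.txt can only prevent the fetch, not the listing.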
That's why I'm wondering if the new Google Sitemaps feature would be the cure for this problem.
As LordMajestic stated above, robots.txt tells robots not to fetch a page. It says nothing about listing that page's URL in search results. It was intended for bandwidth control, not search results presentation control, and certainly not security (since a malicious robot can simply ignore robots.txt completely).
Jim