Forum Moderators: Robert Charlton & goodroi
On another domain I asked Google not to index XML feeds and printer-friendly pages. It took months before some of the pages were gone.
Be patient my friend.
So that's why...
[insert topic I didn't post because I knew I was doing it the right way]
Not sure about this, but my experience is that in the meantime Google keeps comparing the content of these pages to the ones I want to promote. Meaning dupe content. Sometimes the pages that make it into the index are the ones excluded in theory, and the ones dropping out are those I'd need links to.
...
If you have links pointing to the pages, they will still be indexed. Try adding <meta name="robots" content="noindex,nofollow"> to the files, within the <head> section of course.
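For instance, a blocked printer-friendly page might carry the tag like this (the title and body here are just placeholders):

```html
<!DOCTYPE html>
<html>
<head>
  <title>Printer-friendly version (placeholder)</title>
  <!-- tells compliant crawlers: don't index this page, don't follow its links -->
  <meta name="robots" content="noindex,nofollow">
</head>
<body>
  ...
</body>
</html>
```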
If you have links pointing to the pages
I don't.
Perhaps I did at one point.
... the page is the same page as the other URL, only mod-rewritten to look nicer. I don't have the time for this...
One way to check is by using Google's own diagnostic tool
I did, it should work.
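Besides Google's diagnostic tool, you can sanity-check a robots.txt rule locally with Python's standard-library urllib.robotparser. The rules and URLs below are made-up examples, not anything from the site in question:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, mirroring a "block feeds and print pages" setup
robots_txt = """\
User-agent: *
Disallow: /feeds/
Disallow: /print/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Blocked path: compliant crawlers (Googlebot included) should not fetch it
print(rp.can_fetch("Googlebot", "http://www.example.com/print/article.html"))  # False

# Unblocked path: crawling is allowed
print(rp.can_fetch("Googlebot", "http://www.example.com/articles/one.html"))  # True
```

Remember this only tells you whether *crawling* is disallowed, which, as discussed below, is not the same thing as whether the URL gets indexed.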
Besides, robots.txt isn't like the other directives.
In theory, it should forbid the crawling itself.
Not indexing.
But if the page has been indexed already, it won't drop out. And if Google guesses a URL, or people link to it out of good will ( even though they can't access it any way but directly ) that would be a problem.
So, best practice is:
- remove line from robots.txt
- add NOINDEX, NOARCHIVE to page ( if it was separate )
- program the damn site to add the META if the request is for xyz URL ( yeah right )
- wait until the now unmarked pages drop from the supplemental index
- put the directive back in robots.txt
OR
You can use Google's URL removal tool to force it to drop disallowed URLs from the index.
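Step three of the list above ( programming the site to emit the META per-URL instead of maintaining separate files ) could be sketched like this. The path prefixes and function name are hypothetical, not anything Google prescribes:

```python
# Hypothetical URL prefixes we want kept out of the index
BLOCKED_PREFIXES = ("/feeds/", "/print/")

def robots_meta(path):
    """Return the robots META tag (if any) to emit for a request path.

    NOINDEX keeps the page out of results, NOARCHIVE suppresses the cache link.
    """
    if path.startswith(BLOCKED_PREFIXES):
        return '<meta name="robots" content="noindex,noarchive">'
    return ""

print(robots_meta("/print/article.html"))
print(robots_meta("/articles/one.html"))  # empty: normal pages carry no directive
```

Your template would call something like this while rendering the <head>, so the same script can serve both the indexed and the excluded versions of a page.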
Eh, quite frankly... I stopped worrying about things I'd need to work extra on. If it works, it works, if it should but it doesn't... I don't care anymore.
...
[edited by: Miamacs at 12:56 am (utc) on Aug. 2, 2007]
Contractor, is it so? I have a doubt.
Did you look at these posts [google.com]? Google will indeed throw them in the index (many times as URL-only). It's very easy to see this problem: put allinurl:cgi-bin/ in the Google search box, go to the URL-only results, and see if those sites have a robots.txt file blocking their cgi-bin. I know for a fact openbsd.org does. Do a test on your own site with allinurl:yourdomain.com/blocked_folder/. A good one to check is allinurl:www.library.upenn.edu/cgi-bin/, as they have over 34K pages that in my opinion should be blocked via robots.txt regardless of whether links are pointing to them or not.
[edited by: The_Contractor at 11:42 am (utc) on Aug. 2, 2007]
robots.txt is designed to prevent crawling of pages; it is not designed to prevent inclusion of URLs in the index if they are referenced elsewhere via links.
That's the way it gets used, though. Many a site has been hacked because it's very easy to find sites running a given script via allinurl. The same goes for sensitive data, which many companies assume will be kept private since it's blocked.
[edited by: The_Contractor at 12:07 pm (utc) on Aug. 2, 2007]
The solution here is entirely in the webmasters' domain: if you don't want a publicly posted page to be accessed by the public, then don't post it publicly; and if you do, then don't complain when the page is found.
robots.txt is designed to prevent crawling of pages; it is not designed to prevent inclusion of URLs in the index if they are referenced elsewhere via links.
robots.txt is there only to control crawling, nothing else, and even at that it is a totally voluntary thing, albeit it is good manners and a wise choice to obey it.
Say you have a website that disallows all crawling activity. If your site is very popular, it might still be added to a directory of sites out there, even though crawling of the page is not allowed. This does not and should not prevent the inclusion of said link (though not the full indexed text and cached copy) in any search engine or directory that can learn about the site via legitimate means other than crawling the web.
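Such a site would ship a robots.txt like this; compliant crawlers skip every page, but the bare URL can still show up in an index if someone links to it:

```
User-agent: *
Disallow: /
```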