Forum Moderators: Robert Charlton & goodroi


google not following robots.txt

restricted folder & pages


experienced

9:06 am on Aug 1, 2007 (gmt 0)

10+ Year Member



I am not sure why this is happening. Google is not following my robots.txt file and is indexing the pages and the folder that I have already blocked. It is crawling the blocked files and they are being indexed as if they were normal pages. What can I do about this? Should I contact Google, or is this something I can fix on my end? I can't just wait for their reply, since I doubt I'd ever get one. Need help.
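For context, a minimal robots.txt that blocks a folder and an individual page looks like this (the folder and file names below are placeholders, not the poster's actual paths):

```
User-agent: *
Disallow: /restricted-folder/
Disallow: /private-page.html
```

The file must sit at the root of the domain (e.g. /robots.txt) for crawlers to find it.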

skweb

3:05 pm on Aug 1, 2007 (gmt 0)

10+ Year Member



Google can take months to change its index after you change the robots.txt file. For example, on one domain I realized that the images were not being indexed because of a restriction in the robots file (that I did not know about). It took over four months before Google started to crawl the images.

On another domain I asked Google not to index XML feeds and printer-friendly pages. It took months before some of those pages were gone.

Be patient my friend.

Miamacs

3:25 pm on Aug 1, 2007 (gmt 0)

10+ Year Member



Ah.

So that's why...

[insert topic I didn't post because I knew I'm doing it the right way]

Not sure about this, but my experience is that in the meantime Google keeps comparing the content of these pages to the ones I want to promote, meaning duplicate content. Sometimes the pages that make it into the index are the ones that are excluded in theory, and the ones dropping out are the ones I'd need links to.

...

The Contractor

4:33 pm on Aug 1, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Try looking at these threads [google.com].

If you have links pointing to the pages, they will still be indexed. Try adding <meta name="robots" content="noindex,nofollow"> to the files, inside the <head> tags of course.
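For clarity, the tag goes in the head of each blocked page; a sketch (the title and surrounding markup are illustrative):

```html
<head>
  <title>Blocked page</title>
  <!-- tells compliant crawlers not to index this page or follow its links -->
  <meta name="robots" content="noindex,nofollow">
</head>
```

Note that crawlers have to be allowed to fetch the page in order to see this tag, which is why it can conflict with a robots.txt Disallow on the same URL.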

whoisgregg

4:37 pm on Aug 1, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You should also ensure your robots.txt file is properly constructed. One way to check is by using Google's own diagnostic tool [google.com].
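Besides Google's tool, you can sanity-check a robots.txt file yourself with Python's standard urllib.robotparser; a minimal sketch, with placeholder rules and paths:

```python
from urllib.robotparser import RobotFileParser

# Parse the rules from a string so the example is self-contained;
# normally you would call set_url("https://example.com/robots.txt")
# followed by read() to fetch the live file.
rules = """
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Ask whether a generic crawler ("*") may fetch specific URLs.
print(parser.can_fetch("*", "/cgi-bin/script.pl"))  # disallowed path
print(parser.can_fetch("*", "/public/page.html"))   # not blocked
```

If a URL you meant to block comes back as fetchable here, the problem is the file's syntax, not Google.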

tedster

4:38 pm on Aug 1, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You can use Google's URL removal tools to force them to drop disallowed URLs from the index. That often works in just a few days.

Miamacs

12:54 am on Aug 2, 2007 (gmt 0)

10+ Year Member



If you have links pointing to the pages

I don't.
Perhaps I did at one point.

... the page is the same page as the other URL, only mod-rewritten to look nicer. I don't have the time for this...

One way to check is by using Google's own diagnostic tool

I did, it should work.
Besides, robots.txt isn't like the other directives.
In theory, it should forbid the crawling itself.
Not the indexing.

But if the page has been indexed already, it won't drop out. Or if Google guesses a URL, or people link to it out of good will ( even though they can't access it any way but directly ), it would be a problem.
So, best practice is:

- remove line from robots.txt
- add NOINDEX, NOARCHIVE to page ( if it was separate )
- program the damn site to add the META if the request is for xyz URL ( yeah right )
- wait until they fall into the now unmarked supplemental index
- put the directive back in robots.txt

OR

You can use Google's url removal tools to force them to drop disallowed urls from the index

Eh, quite frankly... I stopped worrying about things I'd need to work extra on. If it works, it works, if it should but it doesn't... I don't care anymore.

...
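The "add the META if the request is for xyz URL" step above can be sketched server-side in a few lines. This is purely illustrative Python: the helper name, the prefix list, and the idea of injecting the returned tag into the template are all assumptions, not anyone's actual setup:

```python
# Hypothetical helper: decide per request whether the page template
# should emit a restrictive robots META tag. The prefixes stand in
# for whatever paths the robots.txt used to disallow.
BLOCKED_PREFIXES = ("/print/", "/feeds/")

def robots_meta(path):
    """Return the META tag to inject into <head> for this request path."""
    if path.startswith(BLOCKED_PREFIXES):
        return '<meta name="robots" content="noindex,noarchive">'
    return ""  # normal pages get no restriction

print(robots_meta("/print/article-42"))  # restricted variant
print(robots_meta("/article-42"))        # canonical page, no tag
```

Once the blocked pages have dropped out of the index, the Disallow lines can go back into robots.txt as the post describes.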

[edited by: Miamacs at 12:56 am (utc) on Aug. 2, 2007]

new_seo

4:56 am on Aug 2, 2007 (gmt 0)

10+ Year Member



Maybe those pages were indexed by Google before the robots.txt was implemented. De-index those pages.
If you have links pointing to the pages they will still be indexed.

Contractor, is that so? I have a doubt.

The Contractor

11:33 am on Aug 2, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Contractor, is that so? I have a doubt.

Did you look at these posts [google.com]? Google will indeed throw them into the index (many times as URL-only results). It's very easy to see this problem: put allinurl:cgi-bin/ in the Google search box, go to the URL-only results, and see if those sites have a robots.txt file blocking their cgi-bin. I know for a fact openbsd.org does. Do a test on your own site with allinurl:yourdomain.com/blocked_folder/. A good one to check is allinurl:www.library.upenn.edu/cgi-bin/, as they have over 34K pages that in my opinion should be blocked via robots.txt regardless of whether links are pointing to them or not.

[edited by: The_Contractor at 11:42 am (utc) on Aug. 2, 2007]

Lord Majestic

11:35 am on Aug 2, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



robots.txt is designed to prevent crawling of pages, however it is not designed to prevent inclusion of urls into index if they were referenced elsewhere via links.

The Contractor

11:41 am on Aug 2, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



robots.txt is designed to prevent crawling of pages, however it is not designed to prevent inclusion of urls into index if they were referenced elsewhere via links.

That's the way it is being used. Many a site has been hacked because it's very easy to find sites running a given script via allinurl. The same goes for sensitive data, which many companies assume will be kept private since it's blocked.

Lord Majestic

11:50 am on Aug 2, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Perhaps we will soon see major search engines dropping allinurl: and similar commands.

The Contractor

11:59 am on Aug 2, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I believe Google is the only one that handles robots.txt this way (I could be wrong). So it would be easier if they simply would not index a URL when its folder is blocked via robots.txt. If you do a check of inurl:www.library.upenn.edu in Yahoo you will see over 1 million results, but a check of inurl:www.library.upenn.edu/cgi-bin/ shows nothing.

[edited by: The_Contractor at 12:07 pm (utc) on Aug. 2, 2007]

Lord Majestic

12:10 pm on Aug 2, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If some link is referenced by other sites, that is an indication that it is possibly a valuable link. robots.txt might prevent full-text indexing of it (for whatever reason), but the link text makes it possible to provide possibly more relevant results to the end user despite the robots.txt block. It's a tough call for a search engine, especially given that implementing the feedback mechanism you describe means considerable extra work.

The solution here is entirely in the webmasters' domain: if you don't want a publicly posted page to be accessed by the public, then don't post it publicly; and if you do, then don't complain when the page is found.

new_seo

12:22 pm on Aug 2, 2007 (gmt 0)

10+ Year Member



robots.txt is designed to prevent crawling of pages, however it is not designed to prevent inclusion of urls into index

If Google is not able to crawl a page for a long time, I feel it will automatically be removed from the index.
if they were referenced elsewhere via links.

If Google finds a link to a page, it will go to the root of that site to crawl it. There it will find robots.txt as its guide, and the page is blocked there, so how is it possible for the page to be crawled from another link? Then what is the use of robots.txt?

Lord Majestic

12:33 pm on Aug 2, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



robots.txt was designed to prevent crawling of URLs. It is not meant to prevent a search engine from including a bare URL (without full text) on the basis of incoming links found on other sites whose robots.txt allowed such crawling.

robots.txt is there only to control crawling, nothing else, and even at that it is a totally voluntary thing, albeit obeying it is good manners and a wise choice.

Say you have a website that disallows all crawling activity. If your site is very popular, it might still be added to a directory of sites out there, even though crawling of the pages is not allowed. This does not and should not affect inclusion of the link itself (but not the full indexed text and cached copy) in any search engine or directory that can learn about the site via legitimate means other than crawling the web.

The Contractor

1:02 pm on Aug 2, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's a tough call for a search engine, especially given that implementing back feedback mechanism that you describe means considerable extra work.

It seems the other search engines have mastered that....

Lord Majestic

2:41 pm on Aug 2, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



By "the other" you mean MSN and Yahoo? Neither of them is as agile with link analysis as Google is. There are good relevancy reasons that justify including bare links in the database even if robots.txt disallows those URLs. If robots.txt is obeyed and those URLs were never crawled from the site, then insofar as the robots.txt rules are concerned all is fine in my view. This view is from the "search engine" / "user of a search engine" window, of course; I am sure webmasters have their own view of this problem, but then again things have to be balanced here.