
Disallowing via robots.txt

Files Disallowed in robots.txt are getting indexed

   
12:05 pm on Oct 18, 2005 (gmt 0)

5+ Year Member



Hi,
I am new to WebmasterWorld.
I have been observing some strange behavior in Google. I have disallowed a few of my files in robots.txt so that they do not get crawled by the spider, but the stunning part is that the files are getting indexed anyway: when I search for "site:mydomain.com" in Google, all of those files are displayed in the SERPs.
The specific format I am using in the robots.txt file is as follows:

User-agent: GoogleBot
Disallow: /folder/
Disallow: /private/
Disallow: /rd

I presume the format is correct.
Any feedback in this regard would be very helpful.
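For what it's worth, rules like these can be sanity-checked locally with Python's standard-library robots.txt parser (a quick sketch; the rules mirror the file above, and the test paths are made up):

```python
from urllib.robotparser import RobotFileParser

# The same rules as the robots.txt above, parsed locally
rules = [
    "User-agent: GoogleBot",
    "Disallow: /folder/",
    "Disallow: /private/",
    "Disallow: /rd",
]

rp = RobotFileParser()
rp.parse(rules)

# Disallow lines are prefix matches, so anything under /folder/ or
# starting with /rd is blocked for Googlebot
print(rp.can_fetch("Googlebot", "/folder/page.html"))  # False
print(rp.can_fetch("Googlebot", "/rd/123"))            # False
print(rp.can_fetch("Googlebot", "/public/page.html"))  # True
```

If the parser agrees the paths are disallowed, the file itself is not the problem.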

2:19 pm on Oct 25, 2005 (gmt 0)

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



They will not get crawled - but the URLs may still get indexed.

Google has interpreted robots.txt to apply only to spidering and not to indexing, so a disallowed URL can still be listed if Google finds links to it elsewhere. You will need to use an .htaccess ban if you want to stop the bot from requesting that directory at all.

3:00 pm on Oct 25, 2005 (gmt 0)

10+ Year Member



What would be the effects of an .htaccess ban?
3:02 pm on Oct 25, 2005 (gmt 0)

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Well - if you do it right - you ban only requests that have GOOGLEBOT in the user-agent string, and only in those directories. This is difficult to do correctly.
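For illustration, one way to do it on Apache (a sketch assuming mod_setenvif is available; the environment-variable name is arbitrary, so test carefully before relying on it):

```apache
# .htaccess in the directory to protect (Apache with mod_setenvif)
# Flag any request whose User-Agent contains "Googlebot", then deny it
SetEnvIfNoCase User-Agent "Googlebot" block_bot
Order Allow,Deny
Allow from all
Deny from env=block_bot
```

Unlike robots.txt, this refuses the request outright at the server, so the bot never sees the pages at all.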
8:00 am on Oct 27, 2005 (gmt 0)

5+ Year Member



Hi Brett

Thanks a lot for replying.

They will not get crawled - but the URLs may still get indexed.
Google has interpreted robots.txt to apply only to spidering and not to indexing.

Actually I could not quite follow this. My interpretation is that if I disallow a particular file or folder using robots.txt, it does not get crawled, but the URL itself can still end up in the search engine's database.

That matches what I am observing: I have disallowed a few folders in my robots.txt file, but in spite of that the respective URLs are being displayed in the SERPs for a particular keyword, without any titles or descriptions.
Do I have to use an .htaccess file in order to stop this?
I would be highly obliged if you could help me out in this regard.

8:48 am on Oct 27, 2005 (gmt 0)

10+ Year Member



So they are URL-only listings.

What I have seen is that if there are external links to the pages you have disallowed, they still appear as URL-only listings. They will either disappear in time or stay URL-only. If you really want to get them removed from the index, submit your robots.txt file to Google's removal tool and they will remove any pages that are indexed. But be warned: make sure your robots.txt is correct syntax-wise (use a robots.txt validator), and these pages will return after 180 days; I think it mentions that on the page somewhere.

Vimes.

11:36 am on Oct 27, 2005 (gmt 0)

10+ Year Member



Hi Seochristine,

I have seen googlebot perfectly respecting the /robots.txt - that is: NO crawling of Disallowed: stuff and therefore NO indexing.

If the files are old and you have put up the robots.txt only recently, perhaps you have to give it some more time to settle.

The format of your file looks good.
Make sure you have the robots.txt in the domain's document root and that it is accessible (file permissions). Check in the logs that it got accessed by googlebot without error.

Regards,
R.

11:50 am on Oct 27, 2005 (gmt 0)

5+ Year Member



Hi Vimes
Thanks for the advice, but do I submit the robots.txt file as a normal page/URL submission? Or are there other means to submit a .txt file on the server?
12:03 pm on Oct 27, 2005 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Use the <meta name="robots" content="noindex"> tag on the page to completely remove all mention of the page from the index (and do not mention the page in robots.txt, otherwise Google will not get in to actually see the tag).

If you only use robots.txt then Google will always show the page as a URL-only listing, and will show it (probably) for ever more.


If you disallow something using robots.txt, something that is already indexed, then Google will not remove it on its own. You can submit the URL of the robots.txt file to the Removal Tool on the Google URL Console and that will remove it for 180 days (sometimes only 90) but then it will be relisted, even if it is still disallowed in the robots.txt file. Use the meta tag for full and permanent removal.
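For reference, the tag goes in each page's <head>; a minimal example (the title and body here are placeholders):

```html
<html>
<head>
<title>Example page</title>
<!-- Crawlers may fetch this page, but compliant engines will drop it
     from the index entirely, with no URL-only entry left behind -->
<meta name="robots" content="noindex">
</head>
<body>
...
</body>
</html>
```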

3:03 pm on Oct 27, 2005 (gmt 0)

10+ Year Member



in response to SeoChristine:

"I have seen googlebot perfectly respecting the /robots.txt - that is: NO crawling of Disallowed: stuff and therefore NO indexing.

If the files are old and you have put up the robots.txt only recently, perhaps you have to give it some more time to settle."

I'm having the same problem. My ecommerce software generates a horrible site map: it is too big, has far too many links, etc.

So I set out my robots.txt like this:

==========================
User-agent: *
Disallow: sitemap.htm
==========================

Is this OK?

Thank you.
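One thing to double-check in the snippet above: Disallow paths conventionally start with a "/". A strict parser (here Python's standard urllib.robotparser; Google's own parser may be more lenient, so treat this as a sketch) matches rules by prefix against the URL path, and without the leading slash the rule matches nothing:

```python
from urllib.robotparser import RobotFileParser

def blocks_sitemap(disallow_value):
    """True if a 'Disallow: <value>' rule blocks /sitemap.htm for all agents."""
    rp = RobotFileParser()
    rp.parse(["User-agent: *", "Disallow: " + disallow_value])
    return not rp.can_fetch("*", "/sitemap.htm")

print(blocks_sitemap("sitemap.htm"))   # False - no leading slash, matches nothing
print(blocks_sitemap("/sitemap.htm"))  # True - blocked as intended
```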

3:37 pm on Oct 27, 2005 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



If Google ever sees a link to a page (even if that page doesn't actually exist, and has never existed, and never will exist) then it will still show the URL that it has found as a URL-only entry in the SERPs. Listing that URL in your robots.txt will stop the URL from being crawled and indexed. It will NOT stop the URL from being listed as a URL-only entry for ever more though. Only a noindex meta tag on the page itself can stop the URL from appearing at all.