They will get crawled - but they will not be indexed.
Google has interpreted robots.txt to only apply to indexing and not spidering. You will need to use an .htaccess ban to stop the bot in that directory.
What would be the effects of an .htaccess ban?
Well - if you do it right - you only ban requests with GOOGLEBOT as the user agent in those directories. This is difficult to do.
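A minimal sketch of such a ban, assuming an Apache server with mod_setenvif available (the agent match and environment variable name are just placeholders):

```apache
# Hypothetical .htaccess placed in the directory you want to keep Googlebot
# out of. Matches the User-Agent case-insensitively; all other visitors are
# still allowed through.
SetEnvIfNoCase User-Agent "Googlebot" block_bot
Order Allow,Deny
Allow from all
Deny from env=block_bot
```

Test this carefully before relying on it - an over-broad pattern could lock out legitimate visitors too.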
Thanks a lot for replying.
|They will get crawled - but they will not be indexed.|
|Google has interpreted robots.txt to only apply to indexing and not spidering.|
Actually, I could not quite follow this. What I understand is that if I disallow a particular file or folder using robots.txt, it gets crawled but the content is not indexed in the SE's database.
But what I am observing is this: I have disallowed a few folders in my robots.txt file, but in spite of that the respective URLs are being displayed in the SERPs for a particular keyword, without any title tags or descriptions.
Do I have to use an .htaccess file in order to stop this?
I shall be highly obliged if you help me out in this regard.
So they are URL-only listings.
What I've seen is that if there are external links to the pages you have disallowed, they still appear as URL-only listings. They will either disappear in time or stay URL-only. If you really want them removed from the index, submit your robots.txt file to Google and they will remove any pages that are indexed. But be warned: make sure your robots.txt file is correct syntax-wise (use the robots.txt validator), and these pages will return after 180 days - I think it mentions that on the page somewhere.
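For reference, a minimal robots.txt that disallows a couple of folders for all bots looks like this (the folder names here are just placeholders, not the poster's actual file):

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
```

Each Disallow line takes one path prefix; the validator will catch things like missing colons or blank User-agent lines.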
I have seen googlebot perfectly respecting the /robots.txt - that is: NO crawling of Disallowed: stuff and therefore NO indexing.
If the files are old and you have put up the robots.txt only recently, perhaps you have to give it some more time to settle.
The format of your file looks good.
Make sure you have the robots.txt in the domain's document root and that it is accessible (file permissions). Check in the logs that it got accessed by googlebot without error.
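One way to do that log check is to filter the access log for Googlebot hits on robots.txt and look at the HTTP status codes. The log path and format below are assumptions - a couple of sample lines stand in for a real log so the pipeline can be demonstrated:

```shell
# Two hypothetical combined-format log lines standing in for a real access log.
cat > /tmp/sample_access.log <<'EOF'
66.249.66.1 - - [01/Jan/2006:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "Googlebot/2.1"
10.0.0.5 - - [01/Jan/2006:10:01:00 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
EOF
# Keep only Googlebot requests for robots.txt, then print the status code
# (field 9 in combined log format). A 200 means the file was fetched fine;
# a 404 or 403 means the bot never actually read it.
grep -i 'googlebot' /tmp/sample_access.log | grep 'GET /robots.txt' | awk '{print $9}'
```

On a real server you would point the grep at your live access log instead of the sample file.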
Thanks for the advice, but do I submit the robots.txt file as a normal page/URL submission? Or are there other means to submit a .txt file on the server?
Use the <meta name="robots" content="noindex"> tag on the page to completely remove all mention of the page from the index (and do not mention the page in robots.txt otherwise Google will not get in to actually see the tag).
If you only use robots.txt then Google will always show the page as a URL-only listing, and will show it (probably) for ever more.
If you disallow something using robots.txt, something that is already indexed, then Google will not remove it on its own. You can submit the URL of the robots.txt file to the Removal Tool on the Google URL Console and that will remove it for 180 days (sometimes only 90) but then it will be relisted, even if it is still disallowed in the robots.txt file. Use the meta tag for full and permanent removal.
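Putting the two rules together, a page meant for full removal would carry the tag in its head, and would deliberately NOT appear in robots.txt so Googlebot can reach it and read the tag (the page content here is just an illustration):

```html
<html>
<head>
  <title>Private page</title>
  <!-- Tells Google to drop this URL from the index entirely.
       Must NOT be disallowed in robots.txt, or the bot never sees this. -->
  <meta name="robots" content="noindex">
</head>
<body>...</body>
</html>
```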
in response to SeoChristine:
"I have seen googlebot perfectly respecting the /robots.txt - that is: NO crawling of Disallowed: stuff and therefore NO indexing.
If the files are old and you have put up the robots.txt only recently, perhaps you have to give it some more time to settle."
I'm having the same problem: my ecommerce software generates a horrible site map. It's too big, has far too many links, etc.
So I set out my robots.txt like this:
Is this OK?
If Google ever sees a link to a page (even if that page doesn't actually exist, and has never existed, and never will exist) then it will still show the URL that it has found as a URL-only entry in the SERPs. Listing that URL in your robots.txt will stop the URL from being crawled and indexed. It will NOT stop the URL from being listed as a URL-only entry for ever more though. Only a noindex meta tag on the page itself can stop the URL from appearing at all.