phranque - 12:01 pm on Jul 19, 2013 (gmt 0)
How would you normally block all the https: requests through Robots.txt? Is there a specific syntax for it?
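if you want to exclude all https: urls from being crawled, you should put the appropriate disallow directive in https://www.example.com/robots.txt
a robots.txt file only applies to the protocol and hostname it is served from, so the http: and https: versions of the site each have their own.
Can you please explain to me how I start it over?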
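as a rough illustration of the above, here is a minimal Python sketch that uses the standard library's urllib.robotparser to check two separate robots.txt files, one per protocol. the file contents, the page.html path and the can_crawl helper are all assumptions made up for the example, and the standard-library parser is only an approximation of how googlebot interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical contents of https://www.example.com/robots.txt:
# exclude every URL on the https: side from crawling.
HTTPS_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

# Hypothetical contents of http://www.example.com/robots.txt:
# an empty Disallow value excludes nothing.
HTTP_ROBOTS_TXT = """\
User-agent: *
Disallow:
"""


def can_crawl(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `url`.

    Illustrative helper (not part of any library). The caller is
    responsible for pairing each URL with the robots.txt served from
    the same protocol and hostname.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


https_allowed = can_crawl(HTTPS_ROBOTS_TXT, "https://www.example.com/page.html")
http_allowed = can_crawl(HTTP_ROBOTS_TXT, "http://www.example.com/page.html")
```

with these two files the https: url is excluded while the http: url stays crawlable, which is the behaviour the question asks for.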
i don't understand the question.
start what over?
So, it's no longer just the snippet where Google used to show the URL-only version of blocked content? Is it showing the complete page now?
if google discovers a url that you have excluded googlebot from crawling and google decides to index that url, the following description will appear in the search result:
A description for this result is not available because of this site's robots.txt - learn more [support.google.com].
I also got to know from the same forum that if we include that page in the sitemap, it will get crawled and indexed whether we block it in robots.txt or not. Is that also true?
i have never seen a case where googlebot crawled an excluded url.
just because a url is indexed doesn't mean the content was crawled or indexed.
it simply means the url was discovered.
if you don't want the url to appear in the index you should allow it to be crawled and provide a meta robots noindex element in the html document head or send an X-Robots-Tag: noindex header with the HTTP Response. googlebot can only see either signal if it is allowed to fetch the url.
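a minimal sketch of both noindex signals, written as a plain Python WSGI application. the page markup and the names NOINDEX_HTML and app are assumptions for illustration only. in practice one of the two signals is enough, and the header is usually set in the web server configuration.

```python
# Hypothetical page body carrying the meta robots noindex element in <head>.
NOINDEX_HTML = b"""<!DOCTYPE html>
<html>
<head>
<meta name="robots" content="noindex">
<title>Example page</title>
</head>
<body>This page may be crawled but should not be indexed.</body>
</html>
"""


def app(environ, start_response):
    """WSGI app that serves the page and also sends X-Robots-Tag: noindex."""
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(NOINDEX_HTML))),
        # Header form of the same signal; also works for non-HTML
        # resources such as PDFs, which cannot carry a meta element.
        ("X-Robots-Tag", "noindex"),
    ]
    start_response("200 OK", headers)
    return [NOINDEX_HTML]
```

to try it locally, the app can be passed to wsgiref.simple_server.make_server from the standard library. the url must not be excluded in robots.txt, otherwise the crawler never requests it and never sees either signal.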