aakk9999 - 9:44 pm on Jun 30, 2013 (gmt 0)
So, if I have understood correctly:
- You have confirmed via WMT that URLs with pattern /merchant/ are blocked via robots.txt
- However, you have positively identified in your logs (via IP address and user agent) that Googlebot has requested a URL with the pattern /merchant/, i.e. in your logs there was a line something like:
GET /merchant/ with 200 OK, IP address from Googlebot and user agent Googlebot
Are you absolutely sure that this URL was requested by Googlebot and not by some other bot from Google? For example, AdsBot-Google treats robots.txt differently (see Note 2 below).
If so, how odd...
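One way to double-check the log evidence is to script it. The following is only a rough sketch, assuming your server writes the common "combined" log format; the sample lines, the 192.0.2.x addresses, and the names classify and verify_google_ip are made up for illustration. It reports which Google crawler token appears in the user agent for /merchant/ requests, and includes the reverse-then-forward DNS check that Google describes for confirming that an IP address really belongs to its crawlers.

```python
import re
import socket

# Assumes the "combined" access log format; adjust the pattern to your server.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"$'
)

# Checked in order, so the more specific tokens win over plain "Googlebot".
GOOGLE_TOKENS = ("AdsBot-Google", "Mediapartners-Google", "Googlebot")


def classify(line, prefix="/merchant/"):
    """Return (token, ip, status) for a Google crawler hit under prefix, else None."""
    m = LOG_RE.match(line.strip())
    if not m or not m.group("path").startswith(prefix):
        return None
    for token in GOOGLE_TOKENS:
        if token in m.group("agent"):
            return token, m.group("ip"), m.group("status")
    return None


def verify_google_ip(ip):
    """Reverse DNS, then forward DNS (IPv4 only). Needs network access."""
    try:
        host = socket.gethostbyaddr(ip)[0]
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        return ip in socket.gethostbyname_ex(host)[2]
    except (socket.herror, socket.gaierror):
        return False


# Hypothetical sample lines, not taken from any real log.
SAMPLE = [
    '192.0.2.10 - - [30/Jun/2013:21:44:00 +0000] "GET /merchant/ HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.11 - - [30/Jun/2013:21:45:00 +0000] "GET /merchant/x HTTP/1.1" 200 512 '
    '"-" "AdsBot-Google (+http://www.google.com/adsbot.html)"',
    '192.0.2.12 - - [30/Jun/2013:21:46:00 +0000] "GET /about HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

results = [classify(line) for line in SAMPLE]
for r in results:
    print(r)
```

The user agent string on its own can be spoofed, so a hit only counts as Googlebot if verify_google_ip also returns True for that IP address.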
With regard to the results you are seeing in the SERPs for URLs with /merchant/, which appear as a URL with "A description for this result is not available because of this site's robots.txt – learn more": to me this would indicate that Googlebot knew the page was blocked via robots.txt and has not crawled it.
There is an important distinction between crawling and indexing. Robots.txt controls crawling, but not indexing [developers.google.com]:
Note: Pages may be indexed despite never having been crawled: the two processes are independent of each other. If enough information is available about a page, and the page is deemed relevant to users, search engine algorithms may decide to include it in the search results despite never having had access to the content directly. That said, there are simple mechanisms such as robots meta tags to make sure that pages are not indexed.
(*)Note 2: AdsBot-Google ignores the User-agent: * section of robots.txt. To block it, robots.txt must declare a dedicated User-agent: AdsBot-Google section.
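To make Note 2 concrete, here is a minimal sketch of a robots.txt that blocks /merchant/ both for ordinary crawlers and, through a dedicated section, for AdsBot-Google. The file content and the /merchant/example path are illustrative assumptions, not your actual file. It is parsed with Python's standard urllib.robotparser purely as a syntax check. Be aware that this parser applies the * section to every agent, so it does not model AdsBot-Google's special behaviour; it only confirms that the dedicated section is picked up.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: a generic group plus a dedicated AdsBot-Google group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /merchant/

User-agent: AdsBot-Google
Disallow: /merchant/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Both agents should be reported as not allowed to fetch /merchant/ URLs.
for agent in ("Googlebot", "AdsBot-Google"):
    print(agent, parser.can_fetch(agent, "/merchant/example"))
```

Without the second group, the real AdsBot-Google would still request /merchant/ URLs, even though this parser would report it as blocked.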