Forum Moderators: Robert Charlton & goodroi
I wonder if we're getting our wires crossed.
Assuming that the bot reads and obeys robots.txt before it visits each page, including when it's following inbound links.
Which the bots do not do...
Has anyone seen the googlebot-- the real thing, not a spoofer-- snuffling around where it doesn't belong?
to which bots are you referring?
We have specific pages that we do not want indexed. Googlebot shows up in the logs all the time as having accessed those pages, even though they are properly blocked in robots.txt and have noindex in their headers.
These pages show up in the SERPs from time to time with the "description blocked by robots.txt" statement.
[edited by: phranque at 12:56 am (utc) on Jun 30, 2013]
The only way the Googlebot is accessing these URLs is from internal pages as we have yet to add rel=nofollow to them...
Disallow: /merchant/
<meta name='robots' content='noindex, nofollow' />
Googlebot: 531,659 + 147
Blocked by line 12: Disallow: /merchant/
Detected as a directory; specific files may have different restrictions
Then in a day or two those pages will be in the SERPs, with the aforementioned "description is blocked by robots.txt". Then there will be some sort of data update/refresh and the pages are gone.
Am I missing something here?
147 times while crawling 531K pages.
you won't see the pages - only the urls.
and when you say "the pages are gone" are you sure they aren't filtered out?
try adding &filter=0 to the google search url and see if those urls reappear.
i don't think you want googlebot requesting robots.txt first for every resource requested.
iirc googlebot caches robots.txt for up to 24 hours.
what was the elapsed time for those 147 requests of robots.txt?
the part i see missing is where you have verified that googlebot has actually requested a url in the /merchant/ directory and if so that you checked the IP of the visitor to verify that it is in fact googlebot and not a spoofed user agent.
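That IP check can be automated. Below is a minimal sketch of Google's published verification method (reverse DNS lookup, then a forward lookup to confirm the name maps back to the same IP); the function name is my own:

```python
import socket

def is_real_googlebot(ip):
    """Google's documented check: reverse DNS, then forward-confirm the name."""
    try:
        host = socket.gethostbyaddr(ip)[0]   # e.g. crawl-66-249-75-200.googlebot.com
    except (socket.herror, socket.gaierror):
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False                          # spoofed user agent: wrong rDNS domain
    try:
        # forward lookup of the claimed hostname must include the original IP
        return ip in socket.gethostbyname_ex(host)[2]
    except socket.gaierror:
        return False
```

A user agent string alone proves nothing; only the round-trip DNS check separates real Googlebot traffic from spoofers.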
it has been mentioned numerous times in this thread that the noindex directive is irrelevant when you have excluded googlebot from crawling that url.
You can't set a per-page quota. Well-behaved small robots pick up robots.txt at the start of each separate visit. Large robots-- and you can hardly get bigger than the googlebot-- read robots.txt, spread it around to their fellow googlebots, and hold it for up to 24 hours.
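That fetch-once-and-hold behavior can be sketched with Python's stdlib `robotparser`. `RobotsCache` and `fetch` are hypothetical names, and the 24-hour TTL is just the figure reported above for Googlebot:

```python
import time
from urllib import robotparser

ROBOTS_TTL = 24 * 3600  # comparable to Googlebot's reported 24-hour cache

class RobotsCache:
    """Re-fetch and re-parse robots.txt only after the TTL expires."""
    def __init__(self, fetch, ttl=ROBOTS_TTL):
        self.fetch = fetch        # hypothetical callable returning robots.txt text
        self.ttl = ttl
        self.fetched_at = 0.0
        self.rp = robotparser.RobotFileParser()

    def can_fetch(self, agent, url):
        if time.time() - self.fetched_at > self.ttl:
            self.rp.parse(self.fetch().splitlines())  # refresh the cached rules
            self.fetched_at = time.time()
        return self.rp.can_fetch(agent, url)
```

Between refreshes every URL is checked against the cached copy, which is why a rule change can take up to a day to be noticed.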
As I understand it, "nofollow" doesn't mean "pretend you haven't seen this link". It just means "I make no claims about the quality of the material I'm linking to".
How does Google handle nofollowed links?
In general, we don't follow them. This means that Google does not transfer PageRank or anchor text across these links. Essentially, using nofollow causes us to drop the target links from our overall graph of the web. However, the target pages may still appear in our index if other sites link to them without using nofollow, or if the URLs are submitted to Google in a Sitemap. Also, it's important to note that other search engines may handle nofollow in slightly different ways.
https://support.google.com/webmasters/answer/96569?hl=en
Note: Pages may be indexed despite never having been crawled: the two processes are independent of each other. If enough information is available about a page, and the page is deemed relevant to users, search engine algorithms may decide to include it in the search results despite never having had access to the content directly. That said, there are simple mechanisms such as robots meta tags to make sure that pages are not indexed.
- You have confirmed via WMT that URLs with pattern /merchant/ are blocked via robots.txt
- However, you have positively identified in your logs (via IP address and user agent) that Googlebot has requested a URL with the pattern /merchant/, i.e. in your logs there was a line something like:
GET /merchant/ with 200 OK, IP address from Googlebot and user agent Googlebot
Are you absolutely sure that this URL was requested by Googlebot and not some other bot from Google? (e.g. AdsBot-Google treats robots.txt differently; see note below)
If so, how odd...
There is an important distinction between crawling and indexing. Robots.txt controls crawling, but not indexing.
However, I saw, plain as day, the 200 header response in the logs.
have you verified that all preceding requests for robots.txt also got a 200 OK response?
and there's no chance that robots.txt was modified such that the /merchant/ directory was not excluded at some point?
66.249.75.200 /[REMOVED_BY_ME]/review/[REMOVED_BY_ME].html 200 GET HTTP/1.1 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html
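Lines like that can be pulled out of a log mechanically. A small sketch, assuming the field order shown in the entry above (IP, path, status, method, protocol, user agent); the example path and the helper name are hypothetical:

```python
import re

# field order assumed from the log line above: IP, path, status, method, protocol, UA
LOG_RE = re.compile(r"^(?P<ip>\S+) (?P<path>\S+) (?P<status>\d{3}) (?P<method>\S+) \S+ (?P<ua>.*)$")

def googlebot_hits(lines):
    """Yield (ip, method, path) for entries whose user agent claims to be Googlebot."""
    for line in lines:
        m = LOG_RE.match(line)
        if m and "Googlebot" in m.group("ua"):
            yield m.group("ip"), m.group("method"), m.group("path")
```

Filtering on method as well would answer the GET-vs-HEAD question raised below; remember the user agent string still needs the reverse-DNS check before you trust it.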
If you search for a unique sentence from one of these pages, do they show in SERPs?
Were they full GET fetches or just HEADs? When {well-known hotlinking site} checks for pages, it just does a HEAD to ensure the page is still there. I think the W3C link checker does the same.
If you search for a unique sentence from one of these pages, do they show in SERPs?
It could be natural for them to make sure the URL still exists before they choose to show it in the SERPs, even without a description.
Just speculation of course... but how else would they remove robot-blocked, indexed URLs that are deleted or never existed at all? Otherwise they would just sit there in the results forever.
Why does this appear to be so unbelievable?
/[REMOVED_BY_ME]/review/[REMOVED_BY_ME].html
I can remember searching on the Google for some sort of "how to" related to website coding - came across a W3Schools result in the SERPs that said "A description for this result is not available because of this site's robots.txt – learn more". Upon clicking on the link, it was exactly what I was looking for.
Personally, I believe EVERYTHING gets crawled...
User-agent: *
Disallow: /brand/
Disallow: /merchant/
Disallow: /review/
Disallow: /images/
Disallow: /feeds/
Blocked by line 13: Disallow: /review/
Detected as a directory; specific files may have different restrictions
<meta name='robots' content='noindex, nofollow' />
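Those exact rules can be sanity-checked with Python's stdlib `robotparser`, which applies the same directory-prefix matching the WMT tester describes; the example URLs are hypothetical:

```python
from urllib import robotparser

# the robots.txt rules quoted above
rules = """\
User-agent: *
Disallow: /brand/
Disallow: /merchant/
Disallow: /review/
Disallow: /images/
Disallow: /feeds/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

rp.can_fetch("Googlebot", "https://example.com/review/example.html")  # blocked -> False
rp.can_fetch("Googlebot", "https://example.com/about.html")           # allowed -> True
```

If the parser agrees the URL is blocked, then any verified Googlebot GET of it in the logs really would contradict the rules, which is the crux of this thread.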