Robots tip: crawlers cache your robots.txt, so update it at least a day before adding content that it disallows.
[twitter.com...]
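To see what that caching means in practice, here is a minimal sketch of a polite crawler that fetches robots.txt once and reuses the cached directives, using Python's standard urllib.robotparser. The example.com URL and the one-day TTL are illustrative assumptions, not anyone's actual setup:

```python
# A minimal sketch of polite robots.txt caching (assumed URL and TTL).
import time
from urllib.robotparser import RobotFileParser

ROBOTS_TTL = 24 * 60 * 60   # Google documents caching robots.txt for up to ~24h

rp = RobotFileParser("https://www.example.com/robots.txt")
rp.read()                   # one HTTP fetch; directives now cached in memory
fetched_at = time.time()

def can_fetch(url, agent="Googlebot"):
    """Check the cached rules, re-fetching only when the copy is stale."""
    global fetched_at
    if time.time() - fetched_at > ROBOTS_TTL:
        rp.read()
        fetched_at = time.time()
    return rp.can_fetch(agent, url)

print(can_fetch("https://www.example.com/private/report.pdf"))
```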
Do you really want Googlebot crawling all of those documents and displaying URI-only listings?
I have a Disallow prefix in my robots.txt.
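As an aside for readers following along: a Disallow entry is a plain URL-prefix match. A minimal sketch with Python's stdlib parser; the /docs/ rule and paths are made up, not the poster's actual file:

```python
# Sketch of Disallow prefix matching (the rule and paths are illustrative).
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /docs/",
])

# Disallow is a simple URL-prefix match: everything under /docs/ is blocked.
print(rp.can_fetch("Googlebot", "/docs/report-2010.pdf"))    # False
print(rp.can_fetch("Googlebot", "/docs-public/index.html"))  # True: prefix differs
```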
I don't see Googlebot crawling documents that are disallowed in robots.txt.
If they don't crawl them, why are there so many URI-only listings when performing site: searches?
No. The URLs disallowed in robots.txt are not crawled.
Definition: crawl = request the file from the server. Only server logs can tell you what files were crawled.
URI-only listings are not evidence that the document was crawled, only that the existence of the URL is known to Google.
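That definition is testable against your own access log. A rough sketch, assuming the common combined log format; the log filename and the /docs/ prefix are placeholders:

```python
# Sketch: scan a combined-format access log for Googlebot requests to a
# disallowed prefix. Only a hit here proves the URL was actually crawled.
import re

DISALLOWED_PREFIX = "/docs/"
REQUEST = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+" .* "([^"]*)"$')

with open("access.log") as log:
    for line in log:
        m = REQUEST.search(line.rstrip())
        if not m:
            continue
        path, user_agent = m.groups()
        if "Googlebot" in user_agent and path.startswith(DISALLOWED_PREFIX):
            print("disallowed URL was actually crawled:", path)
```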
If they are not crawled, what is the proper terminology for what Googlebot does when it requests the robots.txt file and acts on its directives?
I need more literal definitions of crawling, indexing, and parsing.
When a bot crawls a robots.txt file, particularly Googlebot, what is it doing with the Disallow entries?
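For a well-behaved crawler, the Disallow entries become a filter that sits in front of the fetch queue: disallowed URLs are never requested at all. A minimal sketch, with made-up rules and URLs:

```python
# Sketch: the Disallow entries compiled into a pre-fetch filter. URLs that
# fail the check are never requested, though they can still surface as
# URI-only listings if Google finds links to them elsewhere.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private/"])

frontier = ["/index.html", "/private/salary.pdf", "/about.html"]
for url in frontier:
    if rp.can_fetch("Googlebot", url):
        print("fetch:", url)   # crawled: actually requested from the server
    else:
        print("skip: ", url)   # never requested; the URL is merely known
```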
URI-only listings...
If we have evidence that a page is good, we can return that reference even though we haven't crawled the page.
GoogleGuy: Since you've asked in the past for suggestions for improving Google's SERPs, I'd suggest that less aggressive indexing here would be helpful. I can't imagine why Google would want to return a link to a blocked page.
Do you ever use "noindex, follow"?
I agree. A robots.txt file is like leaving the curtains open: anyone can read it and see exactly which URLs you are trying to hide.
John Mueller: It’s always a good idea for your XML Sitemap file to include all pages which you want to have indexed. If you have pages such as tag or archive pages which you prefer not to have indexed, it’s recommended to add a “noindex” robots meta tag to the pages (and of course, not to include them in the Sitemap file).
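A rough sketch of that advice in code, building a Sitemap only from pages meant for the index; the page list and noindex flags are invented for illustration:

```python
# Sketch of Mueller's advice: the Sitemap lists indexable pages only, and
# tag/archive pages flagged noindex are left out entirely.
from xml.sax.saxutils import escape

pages = [
    ("https://www.example.com/", False),
    ("https://www.example.com/article-1", False),
    ("https://www.example.com/tag/widgets", True),      # noindex: omit
    ("https://www.example.com/archive/2010-05", True),  # noindex: omit
]

entries = "\n".join(
    f"  <url><loc>{escape(loc)}</loc></url>"
    for loc, noindex in pages if not noindex
)
print('<?xml version="1.0" encoding="UTF-8"?>\n'
      '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
      f"{entries}\n"
      "</urlset>")
```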
What exactly happens during a crawl of this website?
I think those URI-only entries are black holes for crawl equity. I don't want the bot wasting its resources referencing 60,000 URIs, I really don't. I don't even want the bots to know those URIs exist. No, I want to grab that bot by the balls and send it on a pre-planned crawling adventure.
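One hedged sketch of regaining that control: let the bot fetch each unwanted URL once so it can see a noindex, served here as an X-Robots-Tag header from a plain stdlib WSGI app (the /docs/ prefix is an assumption). Note this only works if the URLs are not also disallowed in robots.txt, because a blocked URL is never fetched and the noindex is never seen, which is exactly why blocked pages linger as URI-only listings:

```python
# Sketch: serve "noindex" via the X-Robots-Tag header so the URLs drop out
# of the index after one fetch. Plain stdlib WSGI; /docs/ prefix assumed.
from wsgiref.simple_server import make_server

def app(environ, start_response):
    headers = [("Content-Type", "text/html; charset=utf-8")]
    if environ["PATH_INFO"].startswith("/docs/"):
        # Crucially, these URLs must NOT be disallowed in robots.txt:
        # a blocked URL is never requested, so the noindex is never seen.
        headers.append(("X-Robots-Tag", "noindex"))
    start_response("200 OK", headers)
    return [b"<html><body>document body</body></html>"]

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()
```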