This is one of the areas where precise terminology is key. However, the vast majority of casual conversations (and indeed some official resources) tend to be quite lax.
Would not a person of ordinary intelligence interpret this to mean that a file in a roboted-out directory will stay out of the index, once removed?
If we have evidence that a page is good, we can return that reference even though we haven't crawled the page.
Could you please tell me WHEN and WHY I need a robots.txt, then? Could anyone please explain it to me precisely?
And why do I need it to keep them off my pages/files when they can simply ignore the robots.txt and still show those pages/files in their SERPs, having discovered them through external or internal links?
You use robots.txt to keep crawlers off your pages. It stops them from fetching and reading what's on the page; it does not stop the URL itself from being indexed. That's it.
Real-world reasons for employing it include, but are not limited to, the following (see the sample robots.txt after this list):
- Preserving crawl budget (CSS files might not need crawling)
- Blocking file directories (/images/)
- Creating bad spider lists (block a directory, link to it in a hidden link, ban anything that finds its way there)
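A minimal sketch of a robots.txt covering those cases (the directory names are hypothetical examples, not anything from this thread):

    User-agent: *
    # Keep crawlers out of a bulk file directory
    Disallow: /images/
    # Spare crawl budget on stylesheets that don't need crawling
    Disallow: /css/
    # Spider trap: only a hidden link points here, so any
    # visitor to it is a bot that ignores robots.txt
    Disallow: /trap/

As the rest of this thread makes clear, Disallow only stops compliant crawlers from fetching those URLs; it does not keep the URLs themselves out of the index.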
That's one reason I like robots.txt for a quick control on query string "sort" parameters and the like. Sorted product URLs are very easily inserted into social media links by well-intentioned fans.

I was wondering why you did not use a canonical tag instead for all the pages with the query-string sort?
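For context, the canonical approach being asked about would put a link element on every sorted variant pointing at the unsorted URL. A minimal sketch, with a made-up example.com product URL:

    <!-- served on https://www.example.com/widgets?sort=price -->
    <link rel="canonical" href="https://www.example.com/widgets">

The trade-off, picked up below, is that Google has to crawl each sorted URL before it can see that hint, whereas a robots.txt block stops the crawl before it starts.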
I once described someone as the world's leading authority on such-and-such obscure subject. I didn't and still don't know if he really is, but I haven't seen any serious competition. Months later it occurred to me that it should be possible to look it up.
Oh, now this sounds promising: an article by the person I named. I've read the article; it's damn good. Maybe I glossed over an introduction by some equally knowledgeable person, describing him as the world's leading et cetera.
No luck. Maybe in some older, cached version. This comes with the g### boilerplate, informing me that my search terms only appear in pages that link to this page. Let's stop right there.
The key point is that I got a search-engine hit based purely on the text that linked to a page. It happened to be my own link, and the page happened to be fully indexed in its own right-- but both of those are tangential. The significant part is that it was a search that could have occurred in real life.
That's one reason I like robots.txt for a quick control on query string "sort" parameters and the like. Sorted product URLs are very easily inserted into social media links by well-intentioned fans. The robots.txt file is a down-and-dirty way to stop crawling from generating a mess of duplicate content, and to keep it from degrading the quality of Googlebot's crawl of your site altogether.
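A quick control of that kind could lean on the wildcard matching Googlebot supports in robots.txt (a Google extension, not part of the original standard); the parameter name here is just an illustrative assumption:

    User-agent: *
    # Block crawling of any URL carrying a sort parameter
    Disallow: /*?sort=
    Disallow: /*&sort=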
...I'd suggest that less aggressive indexing here would be helpful. I can't imagine why Google would want to return a link to a blocked page.
I have very mixed thoughts about Google's aggressive indexing, btw. As a web professional who knows what I want indexed and what I don't, Google's aggressiveness in indexing has been a PITA. As a searcher looking for important information where webmasters have been too inept to make it visible, I can understand what Google's doing, and occasionally I've been glad they've done it.
Bottom line, if you don't want something indexed, use noindex and/or password protection.
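For completeness, the two standard noindex forms are a meta tag in the page's head or an HTTP response header, and either one only works if the URL is NOT blocked in robots.txt, because a blocked page is never fetched and the directive is never seen:

    <!-- in the <head> of the page -->
    <meta name="robots" content="noindex">

    # or as an HTTP response header (useful for PDFs and other non-HTML files)
    X-Robots-Tag: noindex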
A description for this result is not available because of this site's robots.txt – learn more
Should I be in touch with a lawyer, since they are crawling where they are explicitly banned?
How can I remove these phantom pages from the SERPs, since Google is ignoring my directives?
This page has been blocked by robots.txt, but it is still indexed?
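One general-purpose answer, offered as a sketch rather than anything confirmed in this thread: lift the robots.txt block so Googlebot can fetch the URLs, serve noindex on them until they drop out of the index, then restore the block if you still want crawlers kept off. On Apache with mod_headers, a per-directory .htaccess could handle the noindex part:

    # .htaccess placed in the formerly blocked directory
    # (assumes mod_headers is enabled)
    Header set X-Robots-Tag "noindex"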
#82407. [project "Other Search Features"] For pages that we do not crawl because of robots.txt, we are usually unable to generate a snippet for users to preview what's on the page. This change added a replacement snippet that explains that there's no description available because of robots.txt.