From what I can see, Google shows 182,000 robots.txt files in their index - [inurl:robots.txt filetype:txt] is the query I was studying. By posting that specific query, I am making an exception to our usual policy of "no specific searches", but I feel it is generic enough - there are no keywords involved - to allow an exception in this case.
My guess is that somewhere, someone has linked to these specific robots.txt files - including yours, most likely. Certainly the top results for that query are robots.txt files that I know have been linked to - including ours here at WebmasterWorld.
Google should simply drop all robots.txt files from their index, IMO - but we're the tail trying to wag the dog on this kind of thing. How is this causing you a problem, Mike? Is your robots.txt actually ranking on a keyword query?
At any rate, this does bring up the crazy question, how can you remove a robots.txt file from Google's index? If you use robots.txt to block it, that would mean that googlebot should not even request robots.txt - an insane loop. And of course, you don't use meta tags in a robots.txt file.
Unless this is some kind of a problem on keyword queries, I'd suggest you just chalk it up to strange and move on.
|At any rate, this does bring up the crazy question, how can you remove a robots.txt file from Google's index? |
Wonder whether it could be removed within the webmaster tools by using
From my experience, historically, you had to link to robots.txt to get it indexed. These days, search engines seem to want to index absolutely any valid URL. This type of spidering is stretching the limits of robots exclusion.
If you don't want spiders to index certain URLs, your only recourse is to block them, preferably at a point in the network chain as close to the request as possible.
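For example - purely as a sketch, assuming an Apache server with mod_rewrite enabled, and with "BadBot" as a placeholder user-agent string - a block applied at the web-server level, before any application code runs, might look like this:

```apache
# .htaccess sketch: refuse requests from a hypothetical "BadBot" crawler
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
RewriteRule .* - [F]
```

Blocking there is cheaper than doing it in page code, and the crawler gets a plain 403 with nothing to index.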
You could probably block it with a PHP header:
header('X-Robots-Tag: noindex, follow', TRUE);
You'd have to serve the .txt file as PHP, though.
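To make the idea concrete, here is a self-contained sketch of the same technique in Python's standard library (standing in for the PHP version above, not taken from the thread): serve robots.txt with an "X-Robots-Tag: noindex" response header, so crawlers can still read the rules but indexers are told not to list the file itself.

```python
# Sketch: serve robots.txt with an X-Robots-Tag header via a throwaway
# local HTTP server, then fetch it back to show the header is present.
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

ROBOTS_TXT = "User-agent: *\nDisallow: /private/\n"

class RobotsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/robots.txt":
            body = ROBOTS_TXT.encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("X-Robots-Tag", "noindex")  # keep it out of the index
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # keep the demo quiet

def fetch_robots():
    """Start a local server on an ephemeral port and fetch /robots.txt."""
    server = HTTPServer(("127.0.0.1", 0), RobotsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/robots.txt" % server.server_port
    with urllib.request.urlopen(url) as resp:
        header = resp.headers.get("X-Robots-Tag")
        text = resp.read().decode()
    server.shutdown()
    return header, text

print(fetch_robots())
```

The same header, however it is produced (PHP, Apache, or anything else), is what the indexer acts on.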
|Wonder whether it could be removed within the webmaster tools by using |
In order to use that tool, you have three options:
1. block with robots.txt
2. use a noindex in a robots meta tag
3. return a 404
None of these options are viable, unless just maybe you can afford to have no robots.txt file at all. Still, that is not sane.
The indexed files include Google's own as well as those of many of the big names around.
|How is this causing you a problem, Mike? Is your robots.txt actually ranking on a keyword query? |
You never know if it is causing a problem with Google. I wondered if it was a side effect of something bad. We were the victims of a server hack (someone uploaded p*rn images into every page) two weeks ago, and then last week someone linked to 1000+ pages that did not exist on the site (yes, Google indexed them too, as duplicate content, even though they had noindex/nofollow tags), so I wondered if this was the latest ranking attack by our competitors.
You can try adding this in your .htaccess file - scoped to robots.txt only, or the header would be sent with every response on the site:
&lt;Files "robots.txt"&gt;
Header set X-Robots-Tag "noindex, nofollow"
&lt;/Files&gt;
Then try again to remove with the Google Webmaster Tools "Remove URL".
Good approach, Webnauts! I haven't had cause to use an X-Robots-Tag since it was introduced in July 2007, and it completely escaped my mind. This is exactly the type of situation that the protocol was created for.
|...META tags give you useful control over how each webpage on your site is indexed. But it only works for HTML pages. How can you control access to other types of documents, such as Adobe PDF files, video and audio files and other types? Well, now the same flexibility for specifying per-URL tags is available for all other file types. |
Official Google Blog [googleblog.blogspot.com]
|If you use robots.txt to block it, that would mean that googlebot should not even request robots.txt - an insane loop. |
You broke the internet! ;)
Anyone remember Brett's robots.txt experiment, which resulted in the WebmasterWorld robots.txt having a PR5 or 6?
(Back to lurk mode)
How about if your robots.txt only got indexed when you linked to it from your own domain? That way others couldn't get your robots.txt indexed (unless you have user-generated content that is too big for you to control).
I can see no point in wasting any effort on indexing a robots.txt file. What possible service could this provide to an internet searcher?
I needed a robots.txt file to be gone from the index.
I added this to robots.txt:
Google picked up the new robots.txt file after a few days (for use in checking what they can index).
They looked at it several times each week, but the OLD robots.txt content remained in the SERPs and in their cache and snippet for many weeks.
About a month later, the robots.txt file dropped from the SERPs.
You can use robots.txt to block the indexer from indexing its content.
That will not restrict access for the bot that retrieves the file for its real intended purpose.
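To see that distinction concretely, Python's standard-library robots.txt parser can be pointed at rules that disallow robots.txt itself (a contrived example, not from the thread): the parser dutifully reports the file as off-limits to crawling, yet it could only learn that by reading the file first - the rules govern what happens afterwards, not the retrieval of robots.txt itself.

```python
# Sketch: rules that disallow robots.txt still have to be fetched
# before any crawler can know they exist.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /robots.txt
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("ExampleBot", "/robots.txt"))   # False
print(parser.can_fetch("ExampleBot", "/index.html"))   # True
```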
|If you use robots.txt to block it, that would mean that googlebot should not even request robots.txt - an insane loop. And of course, you don't use meta tags in a robots.txt file. |
We already have a very common situation where the contents of robots.txt don't prevent robots.txt itself from being fetched:
Clearly robots.txt is included under the root directory, but still must be requested and requested repeatedly to see if it has changed.
So just as
would prevent all .txt files from being indexed, then so should
prevent that particular .txt file from being indexed without problems.
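The directives quoted in that post didn't survive, so purely as a hypothetical illustration, here is how Google-style robots.txt wildcards behave - '*' matches any run of characters and '$' anchors the end of the URL path - in a small Python matcher (the patterns below are examples, not the ones the poster used):

```python
# Hypothetical illustration of Google-style robots.txt wildcard matching.
import re

def rule_matches(pattern, path):
    """Translate a Disallow pattern into a regex and test a URL path."""
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"   # '$' anchors the end of the path
    return re.match(regex, path) is not None

print(rule_matches("/*.txt$", "/robots.txt"))      # True: matches every .txt URL
print(rule_matches("/robots.txt", "/robots.txt"))  # True: plain prefix match
print(rule_matches("/*.txt$", "/page.html"))       # False
```

So a broad wildcard rule and a rule naming robots.txt specifically would both match the file, exactly as the post argues.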
I find this discussion amusing.
No one around here seems to be asking the question - "Why?"
Why do you care that your robots.txt is in google index?
What is the possible damage?
What could go wrong if someone reads it?
I have many pages (robots, sitemaps, disclaimers, privacy policies, login forms) on my sites that google had indexed and are completely useless to anyone. I really couldn't care less.
if someone wants to read my robots.txt - be my guest!
Does anyone have a good reason for this discussion?
This would be just STOOPID.
Sometimes, you lock access to sensitive folders using it and there's no point in any noob out there finding those paths and then banging at their doors.
It's not just a privacy threat; it's a security one, too.
A suggestion from Matt Cutts which he posted in a comment [mattcutts.com] yesterday on his blog:
...., why not use a noindex directive in the robots.txt file?
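For reference, that directive - unofficial, Google-specific, and never part of the robots.txt standard, which is exactly what makes it experimental - would look something like this:

```
User-agent: Googlebot
Noindex: /robots.txt
```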
I have recently been given a nudge about BotSeer at psu.edu. That is a site that indexes robots.txt files, republishes them, and allows that copied content to be indexed. The site also links back to your robots.txt files allowing them to appear in the SERPs too.
They have collected more than two million robots.txt files, and it is interesting to see which bots other people are blocking.
I use Disallow: /robo in my robots.txt so will not have the problem of my own data being so open to inspection. :-)
How did you come out on this? Were you able to get your robots.txt removed from the index?
I have got them removed in the past using Disallow: /robo
Haven't had to do that any time recently.
Even Google's robots.txt file is listed, among other amusing sites ;-)