
My robots.txt file is in the SERPs

     
9:25 am on May 14, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Feb 25, 2003
posts:2527
votes: 0


Title says it all really. Any ideas why? It only seems to be affecting one of my sites. Can't see anything I might have done different with this one over the others.

Thanks
Mike

8:26 pm on May 14, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


From what I can see, Google shows 182,000 robots.txt files in their index - [inurl:robots.txt filetype:txt] is the query I was studying. By posting that specific query, I am making an exception to our usual policy of "no specific searches", but I feel it is generic enough - there are no keywords involved - to allow an exception in this case.

My guess is that somewhere, someone has linked to these robots.txt specific files - including yours, most likely. Certainly the top results for that query are robots.txt files that I know have been linked to - including ours here at WebmasterWorld.

Google should simply drop all robots.txt files from their index, IMO - but we're the tail trying to wag the dog on this kind of thing. How is this causing you a problem, Mike? Is your robots.txt actually ranking on a keyword query?

At any rate, this does bring up the crazy question, how can you remove a robots.txt file from Google's index? If you use robots.txt to block it, that would mean that googlebot should not even request robots.txt - an insane loop. And of course, you don't use meta tags in a robots.txt file.

Unless this is some kind of a problem on keyword queries, I'd suggest you just chalk it up to strange and move on.

9:03 pm on May 14, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Feb 6, 2005
posts:1678
votes: 71


tedster


At any rate, this does bring up the crazy question, how can you remove a robots.txt file from Google's index?

Wonder whether it could be removed within the webmaster tools by using

Remove URLs

9:14 pm on May 14, 2008 (gmt 0)

Senior Member

joined:Jan 27, 2003
posts:2534
votes: 0


From my experience, historically, you had to link to robots.txt to get it indexed. These days, search engines seem to want to index absolutely any valid URL. This type of spidering is stretching the limits of robots exclusion.

If you don't want spiders to index certain URLs, your only recourse is to block them, preferably at a point in the network chain as close to the request as possible.

9:14 pm on May 14, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Feb 12, 2006
posts:2493
votes: 22


You could probably block it with a PHP header:

header('X-Robots-Tag: noindex, follow', TRUE);

You'd have to serve the .txt file as PHP, though.
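As an illustrative sketch only (nothing here is from the thread: the handler class, the port choice, and the rule body are all made up), the same effect can be simulated with Python's standard-library HTTP server - serve /robots.txt normally so bots can still fetch the rules, but attach an X-Robots-Tag response header telling indexers not to list the file:

```python
# Sketch: serve /robots.txt with an X-Robots-Tag header.
# Hypothetical content and addresses; illustrative, not production config.
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading

ROBOTS_BODY = b"User-agent: *\nDisallow: /private/\n"

class RobotsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/robots.txt":
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            # Tell indexers not to list this file, while still serving
            # its contents to any bot that requests it.
            self.send_header("X-Robots-Tag", "noindex, follow")
            self.end_headers()
            self.wfile.write(ROBOTS_BODY)
        else:
            self.send_error(404)

    def log_message(self, *args):
        # Silence per-request logging for the sketch.
        pass

def start_server():
    # Port 0 lets the OS pick any free port.
    server = HTTPServer(("127.0.0.1", 0), RobotsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The equivalent in practice is usually done at the web-server level (as with the .htaccess approach suggested later in this thread) rather than in application code.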

9:55 pm on May 14, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


Wonder whether it could be removed within the webmaster tools by using

Remove URLs

In order to use that tool, you have three options:

1. block with robots.txt
2. use a noindex in a robots meta tag
3. return a 404

None of these options is viable, unless perhaps you can afford to have no robots.txt file at all. Still, that is not a sane option.

10:06 pm on May 14, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 24, 2002
posts:894
votes: 0


...including ours here

and including Google's own, as well as many of the big names around.

9:32 am on May 15, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Feb 25, 2003
posts:2527
votes: 0


How is this causing you a problem, Mike? Is your robots.txt actually ranking on a keyword query?

You never know if it is causing a problem with Google. I wondered if it was a side effect of something bad. We were the victim of a server hack (someone uploaded p*rn images into every page) two weeks ago, and last week someone linked to 1000+ pages that did not exist on the site (yes, Google indexed them too as duplicate content, even though they had noindex/nofollow tags), so I wondered if this was the latest ranking attack by our competitors.

2:57 am on May 16, 2008 (gmt 0)

New User

5+ Year Member

joined:June 14, 2006
posts: 31
votes: 0


You can try adding this in your .htaccess file:

<FilesMatch "robots\.txt">
Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>

Then try again to remove with the Google Webmaster Tools "Remove URL".

3:34 am on May 16, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


Good approach, Webnauts! I haven't had cause to use an X-Robots-Tag since it was introduced in July 2007, and it completely escaped my mind. This is exactly the type of situation that the protocol was created for.

...META tags give you useful control over how each webpage on your site is indexed. But they only work for HTML pages. How can you control access to other types of documents, such as Adobe PDF files, video and audio files? Well, now the same flexibility for specifying per-URL tags is available for all other file types.

Official Google Blog [googleblog.blogspot.com]

6:51 am on May 16, 2008 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 16, 2004
posts:91
votes: 0


If you use robots.txt to block it, that would mean that googlebot should not even request robots.txt - an insane loop.

You broke the internet! ;)

8:13 am on May 16, 2008 (gmt 0)

Senior Member from HK 

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 14, 2002
posts:2283
votes: 10


Anyone remember Brett's robots.txt experiment, which resulted in the WebmasterWorld robots.txt reaching a PR5 or 6?

(Back to lurk mode)

2:44 pm on May 16, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 8, 2002
posts:2335
votes: 0


What if your robots.txt only got indexed when you link to it from your own domain? That way, others couldn't get your robots.txt indexed (unless you have user-generated content that is too big for you to control).

4:07 pm on May 16, 2008 (gmt 0)

New User

5+ Year Member

joined:Feb 5, 2008
posts:2
votes: 0


I can see no point in wasting any effort on indexing a robots.txt file. What possible service could this provide to an internet searcher?

6:44 pm on May 16, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


I needed a robots.txt file to be gone from the index.

I added this to robots.txt:

Disallow: /robo

Google picked up the new robots.txt file after a few days (for use in checking what they can index).

They looked at it several times each week, but the OLD robots.txt content remained in the SERPs and in their cache and snippet for many weeks.

About a month later, the robots.txt file dropped from the SERPs.

You can use robots.txt to block the indexer from indexing its content.

That will not restrict access for the bot that retrieves the file for its real intended purpose.
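The Disallow: /robo trick above relies on plain prefix matching. As a sketch (using Python's standard-library parser purely to illustrate the matching rule - it is not what Googlebot runs), you can confirm that a compliant parser treats /robots.txt as blocked by that prefix while other URLs stay fetchable:

```python
# Illustrative only: Python's stdlib robots.txt parser, not Google's indexer.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /robo",  # prefix rule: matches /robots.txt, /robo-anything, ...
])

print(parser.can_fetch("Googlebot", "/robots.txt"))  # False
print(parser.can_fetch("Googlebot", "/index.html"))  # True
```

Note the prefix is broad: any URL beginning with /robo (say, a hypothetical /robots-policy.html) would be caught too, so check your URL space before using it.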

12:42 am on May 18, 2008 (gmt 0)

New User

5+ Year Member

joined:July 16, 2007
posts: 28
votes: 0


If you use robots.txt to block it, that would mean that googlebot should not even request robots.txt - an insane loop. And of course, you don't use meta tags in a robots.txt file.

We already have a very common situation where the contents of robots.txt don't prevent robots.txt itself from being fetched:


User-agent: *
Disallow: /

Clearly robots.txt is included under the root directory, but still must be requested and requested repeatedly to see if it has changed.

So just as


Disallow: /*.txt$

would prevent all .txt files from being indexed, so too should

Disallow: /robots.txt

prevent that particular .txt file from being indexed without problems.
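The fetch-versus-index distinction is easy to see with any compliant parser. As a sketch (Python's standard-library parser; note it does not understand the * and $ wildcard extensions used above, so only the literal-path forms are shown), Disallow: / marks every path off-limits, /robots.txt included, and yet the file must still be fetched in order to learn that rule:

```python
# Illustrative only: stdlib parser, literal paths (no wildcard support).
from urllib.robotparser import RobotFileParser

blanket = RobotFileParser()
blanket.parse(["User-agent: *", "Disallow: /"])
# Every path is "disallowed", /robots.txt itself included...
print(blanket.can_fetch("*", "/robots.txt"))  # False
# ...yet these rules could only have been learned by fetching robots.txt first.

# A literal path blocks just that one file:
only_robots = RobotFileParser()
only_robots.parse(["User-agent: *", "Disallow: /robots.txt"])
print(only_robots.can_fetch("*", "/robots.txt"))  # False
print(only_robots.can_fetch("*", "/page.html"))   # True
```

This matches the argument above: the rules govern crawling for indexing, while the fetch of robots.txt itself sits outside them.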

9:10 am on May 21, 2008 (gmt 0)

Full Member

5+ Year Member

joined:Feb 15, 2006
posts:201
votes: 0


I find this discussion amusing. No one here has asked the question: "Why?"

Why do you care that your robots.txt is in Google's index? What is the possible damage? What could go wrong if someone reads it?

I have many pages (robots, sitemaps, disclaimers, privacy policies, login forms) on my sites that Google has indexed and that are completely useless to anyone. I really couldn't care less. If someone wants to read my robots.txt - be my guest!

Does anyone have a good reason for this discussion?

4:54 am on May 22, 2008 (gmt 0)

Junior Member

5+ Year Member

joined:Dec 3, 2007
posts:58
votes: 0


That would be just plain stupid. Sometimes you lock access to sensitive folders using it, and there's no point in any noob out there finding those paths and then banging at their doors.

It's not just a privacy threat, it's a security one too.

2:04 pm on May 22, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Feb 6, 2005
posts:1678
votes: 71


A suggestion from Matt Cutts which he posted in a comment [mattcutts.com] yesterday on his blog:

...., why not use a noindex directive in the robots.txt file?

3:31 pm on July 8, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


I have recently been given a nudge about BotSeer at psu.edu. That is a site that indexes robots.txt files, republishes them, and allows that copied content to be indexed. The site also links back to your robots.txt files allowing them to appear in the SERPs too.

They have collected more than two million robots.txt files, and it is interesting to see which bots other people are blocking.

I use Disallow: /robo in my robots.txt so will not have the problem of my own data being so open to inspection. :-)

11:48 am on Aug 4, 2008 (gmt 0)

Administrator from US 

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 21, 1999
posts:38048
votes: 12


How did you come out on this? Were you able to get your robots.txt removed from the index?

5:14 pm on Aug 4, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


I have got them removed in the past using Disallow: /robo

Haven't had to do that any time recently.

6:17 pm on Aug 4, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 3, 2002
posts:894
votes: 0


Even Google's robots.txt file is listed, among other amusing sites ;-)
 
