
Home / Forums Index / Google / Google SEO News and Discussion
Forum Library, Charter, Moderators: Robert Charlton & aakk9999 & brotherhood of lan & goodroi

Google SEO News and Discussion Forum

    
My robots.txt file is in the SERPS
internetheaven




msg:3649548
 9:25 am on May 14, 2008 (gmt 0)

Title says it all really. Any ideas why? It only seems to be affecting one of my sites, and I can't see anything I might have done differently with this one compared to the others.

Thanks
Mike

 

tedster




msg:3650104
 8:26 pm on May 14, 2008 (gmt 0)

From what I can see, Google shows 182,000 robots.txt files in their index - [inurl:robots.txt filetype:txt] is the query I was studying. By posting that specific query, I am making an exception to our usual policy of "no specific searches", but I feel it is generic enough - there are no keywords involved - to allow an exception in this case.

My guess is that somewhere, someone has linked to these robots.txt specific files - including yours, most likely. Certainly the top results for that query are robots.txt files that I know have been linked to - including ours here at WebmasterWorld.

Google should simply drop all robots.txt files from their index, IMO - but we're the tail trying to wag the dog on this kind of thing. How is this causing you a problem, Mike? Is your robots.txt actually ranking on a keyword query?

At any rate, this does bring up the crazy question, how can you remove a robots.txt file from Google's index? If you use robots.txt to block it, that would mean that googlebot should not even request robots.txt - an insane loop. And of course, you don't use meta tags in a robots.txt file.

Unless this is some kind of a problem on keyword queries, I'd suggest you just chalk it up to strange and move on.

reseller




msg:3650148
 9:03 pm on May 14, 2008 (gmt 0)

tedster


At any rate, this does bring up the crazy question, how can you remove a robots.txt file from Google's index?

Wonder whether it could be removed within the webmaster tools by using

Remove URLs

Receptional Andy




msg:3650155
 9:14 pm on May 14, 2008 (gmt 0)

From my experience, historically, you had to link to robots.txt to get it indexed. These days, search engines seem to want to index absolutely any valid URL. This type of spidering is stretching the limits of robots exclusion.

If you don't want spiders to index certain URLs, your only recourse is to block them, preferably at a point in the network chain as close to the request as possible.

londrum




msg:3650156
 9:14 pm on May 14, 2008 (gmt 0)

you could probably block it with a php header.

header('X-Robots-Tag: noindex, follow', TRUE);

you'd have to serve the .txt file as php though
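The same idea can be sketched end to end in Python rather than PHP. This is a hypothetical stand-alone demo (the robots.txt content and loopback server are made up for illustration): serve robots.txt dynamically and attach an X-Robots-Tag header, so crawlers can still read the file but indexers are asked not to list it.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical robots.txt body served by the script.
ROBOTS_BODY = b"User-agent: *\nDisallow: /private/\n"

class RobotsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/robots.txt":
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            # The header that asks indexers to keep this URL out of the SERPs.
            self.send_header("X-Robots-Tag", "noindex, follow")
            self.end_headers()
            self.wfile.write(ROBOTS_BODY)
        else:
            self.send_error(404)

    def log_message(self, *args):  # keep the demo quiet
        pass

# Spin up a throwaway server on a free port and fetch the file once.
server = HTTPServer(("127.0.0.1", 0), RobotsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]
with urllib.request.urlopen(f"http://127.0.0.1:{port}/robots.txt") as resp:
    print(resp.headers["X-Robots-Tag"])  # noindex, follow
server.shutdown()
```

In production you would set the header in the web server configuration rather than run a separate process; this only illustrates the header as it appears on the wire.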

tedster




msg:3650213
 9:55 pm on May 14, 2008 (gmt 0)

Wonder whether it could be removed within the webmaster tools by using

Remove URLs

In order to use that tool, you have three options:

1. block with robots.txt
2. use a noindex in a robots meta tag
3. return a 404

None of these options is viable, unless perhaps you can afford to have no robots.txt file at all. Still, that is not sane.

Staffa




msg:3650234
 10:06 pm on May 14, 2008 (gmt 0)

...including ours here

and including their own as well as many of the big names around.

internetheaven




msg:3650666
 9:32 am on May 15, 2008 (gmt 0)

How is this causing you a problem, Mike? Is your robots.txt actually ranking on a keyword query?

You never know if it is causing a problem with Google. I wondered if it was a side effect of something bad. We were the victim of a server hack (someone uploaded p*rn images into every page) two weeks ago; then last week someone linked to 1,000+ pages that did not exist on the site (yes, Google indexed them as duplicate content even though they had noindex/nofollow tags), so I wondered if this was the latest ranking attack by our competitors.

Webnauts




msg:3651490
 2:57 am on May 16, 2008 (gmt 0)

You can try adding this in your .htaccess file:

<FilesMatch "^robots\.txt$">
Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>

Then try again to remove with the Google Webmaster Tools "Remove URL".

tedster




msg:3651518
 3:34 am on May 16, 2008 (gmt 0)

Good approach, Webnauts! I haven't had cause to use an X-Robots-Tag since it was introduced in July 2007, and it completely escaped my mind. This is exactly the type of situation that the protocol was created for.

...META tags give you useful control over how each webpage on your site is indexed, but they only work for HTML pages. How can you control access to other types of documents, such as Adobe PDF files, video and audio files, and other types? Well, now the same flexibility for specifying per-URL tags is available for all other file types.

Official Google Blog [googleblog.blogspot.com]


eltercerhombre




msg:3651583
 6:51 am on May 16, 2008 (gmt 0)

If you use robots.txt to block it, that would mean that googlebot should not even request robots.txt - an insane loop.

You broke the internet! ;)

shri




msg:3651635
 8:13 am on May 16, 2008 (gmt 0)

Anyone remember Brett's robots.txt experiment, which resulted in the WebmasterWorld robots.txt having a PR5 or 6?

(Back to lurk mode)

Clark




msg:3651870
 2:44 pm on May 16, 2008 (gmt 0)

How about only indexing a robots.txt file if it is linked from its own domain? That way others couldn't get your robots.txt indexed (unless you have user-generated content that is too big for you to control).

cssteve




msg:3651924
 4:07 pm on May 16, 2008 (gmt 0)

I can see no point in wasting any effort on indexing a robots.txt file. What possible service could this provide to an internet searcher?

g1smd




msg:3652059
 6:44 pm on May 16, 2008 (gmt 0)

I needed a robots.txt file to be gone from the index.

I added this to robots.txt:

Disallow: /robo

Google picked up the new robots.txt file after a few days (for use in checking what they can index).

They looked at it several times each week, but the OLD robots.txt content remained in the SERPs and in their cache and snippet for many weeks.

About a month later, the robots.txt file dropped from the SERPs.

You can use robots.txt to block the indexer from indexing its content.

That will not restrict access for the bot that retrieves the file for its real intended purpose.
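g1smd's prefix trick is easy to sanity-check with Python's standard-library robots.txt parser (example.com is a placeholder; this only models how a rule-following crawler matches paths, not Googlebot's indexing pipeline):

```python
from urllib import robotparser

# Hypothetical robots.txt using the prefix trick: "Disallow: /robo"
# matches any path beginning with /robo, including /robots.txt itself.
rules = [
    "User-agent: *",
    "Disallow: /robo",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://example.com/robots.txt"))   # False
print(rp.can_fetch("*", "https://example.com/robot-arms/"))  # False (prefix match)
print(rp.can_fetch("*", "https://example.com/index.html"))   # True
```

`Disallow` is a simple prefix match, which is why `/robo` covers `/robots.txt` - along with anything else that begins with `/robo`.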

Jasp




msg:3652844
 12:42 am on May 18, 2008 (gmt 0)

If you use robots.txt to block it, that would mean that googlebot should not even request robots.txt - an insane loop. And of course, you don't use meta tags in a robots.txt file.

We already have a very common situation where the contents of robots.txt doesn't prevent robots.txt actually being fetched:


User-agent: *
Disallow: /

Clearly robots.txt is included under the root directory, but still must be requested and requested repeatedly to see if it has changed.

So just as

Disallow: /*.txt$

would prevent all .txt files from being indexed,

Disallow: /robots.txt

should prevent that particular .txt file from being indexed without problems.

webdudek




msg:3655113
 9:10 am on May 21, 2008 (gmt 0)

I find this discussion amusing.
No one here has asked the question - "Why?"
Why do you care that your robots.txt is in Google's index?
What is the possible damage?
What could go wrong if someone reads it?
I have many pages (robots, sitemaps, disclaimers, privacy policies, login forms) on my sites that Google has indexed and that are completely useless to anyone. I really couldn't care less.
If someone wants to read my robots.txt - be my guest!
Does anyone have a good reason for this discussion?

5ubliminal




msg:3656032
 4:54 am on May 22, 2008 (gmt 0)

This would be just STOOPID.
Sometimes you use robots.txt to lock access to sensitive folders, and there's no point in some noob out there finding those paths and then banging on their doors.

It's not just a privacy threat, it's a security one as well.

reseller




msg:3656357
 2:04 pm on May 22, 2008 (gmt 0)

A suggestion from Matt Cutts which he posted in a comment [mattcutts.com] yesterday on his blog:

...., why not use a noindex directive in the robots.txt file?
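For reference, such a directive would look something like the fragment below. Note that this `Noindex:` line was never part of the robots exclusion standard - it was an unofficial directive that Google was reported to recognize experimentally at the time - so treat it as a sketch, not a guarantee:

```
User-agent: Googlebot
Noindex: /robots.txt
```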


g1smd




msg:3693306
 3:31 pm on Jul 8, 2008 (gmt 0)

I have recently been given a nudge about BotSeer at psu.edu. That is a site that indexes robots.txt files, republishes them, and allows that copied content to be indexed. The site also links back to your robots.txt files allowing them to appear in the SERPs too.

They have collected more than two million robots.txt files, and it is interesting to see which bots other people are blocking.

I use Disallow: /robo in my robots.txt, so I will not have the problem of my own data being so open to inspection. :-)

Brett_Tabke




msg:3714697
 11:48 am on Aug 4, 2008 (gmt 0)

How did you come out on this? Were you able to get your robots.txt removed from the index?

g1smd




msg:3714912
 5:14 pm on Aug 4, 2008 (gmt 0)

I have got them removed in the past using Disallow: /robo

Haven't had to do that any time recently.

webdude




msg:3714956
 6:17 pm on Aug 4, 2008 (gmt 0)

Even Google's robots.txt file is listed, among other amusing sites ;-)


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
Home ¦ Free Tools ¦ Terms of Service ¦ Privacy Policy ¦ Report Problem ¦ About ¦ Library ¦ Newsletter
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved