Googlebot appears to ignore robots.txt

Key_Master

4:43 am on Jul 4, 2002 (gmt 0)

Google is crawling the cgi-bin on one of my sites even though it is (and has been) clearly prohibited from doing so in my (validated) robots.txt file. My security software kicked in automatically and prevented Googlebot from retrieving any information but this still has me greatly concerned. Others on this forum have reported similar invasions into their cgi-bin.

Privacy and security concerns aside, what effect does this have on a site's theme and overall PR?

<rant>GoogleGuy, what part of Disallow: /cgi-bin/ does your robot not understand? How would you like it if I sent my spider into your protected areas? :(</rant>

No respect!
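
For illustration, here is a minimal sketch of checking a rule like Disallow: /cgi-bin/ locally with Python's standard urllib.robotparser module. The robots.txt body, the host name and the helper name is_allowed are made-up examples, not the poster's actual file, and the result only shows how this one parser reads the rule, not how Googlebot behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/cgi-bin/formmail.pl",
        "http://www.example.com/index.html",
    ):
        print(url, is_allowed(ROBOTS_TXT, "Googlebot", url))
```

A check like this only confirms that the file says what its author intends; it cannot show what a given crawler actually requested, which is what the server logs are for.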

nancyb

3:54 pm on Jul 10, 2002 (gmt 0)

Thanks GoogleGuy for the explanation, but my robots.txt hasn't changed (except for additions) in over a year, it validates and returns a 200. When I have time I'll check my old logs and mail search-quality with the intrusions.

GoogleGuy

7:45 pm on Jul 10, 2002 (gmt 0)

Sounds good, nancyb.

jdMorgan

8:54 pm on Jul 10, 2002 (gmt 0)

Googleguy,

First of all, please read this as if I'm your next door neighbor as well as
a webmaster, and we're having a beer down the street after work, and I'm
frustrated with a minor problem - not disgruntled! I have been told that I
often "write grumpy", and that's not the case. Many others here have far
more at stake on the web than I do - My main site is a hobby site, and the
others are informational, not e-commerce sites.

You wrote:

Here are a few things that you should check for first though:
1. If you try to fetch www.yourdomain.com/robots.txt, do you get a valid page,
preferably with a 200 status code? You should not return a 403
(forbidden error).
2. If you see result pages returned that are supposed to be forbidden, are you
sure we crawled the page? Google can return a link as a result even if we
didn't crawl the page. One key thing to look for is that if the page
doesn't have a snippet, then we probably didn't crawl the page.

On review, this appears to be happening to me - uncrawled pages listed with no
snippet. I would like to understand why Google indexes these pages, when they
are all marked "Disallowed" in robots.txt and contain the "robots
noindex,nofollow" html tags.

These pages are either "somewhat sensitive", containing stuff I would prefer to
keep out of easy view (e-mail spammer treasures), or they are printable forms or
other "junk" that would be of no plausible use to anyone outside of our
organization.

I understand that the answer is likely to be that "the code isn't written that
way", but I would - respectfully - like to know, "Which part of 'Disallow,
noindex, nofollow' does Googlebot not understand?" It is my opinion that any
resource Disallowed by robots.txt or by the noindex metatag should be dropped
cold by Googlebot; It says "noindex", not "nofetch" or "nocrawl" or "nosnippet".

I am sure that there is a way to work this in with the spidering process such
that it is not necessary to re-fetch robots.txt every time Googlebot finds a
link to a given domain, and also such that it is not necessary for Googlebot (or
your db crunching routines) to keep copies of every robots.txt on the web.

Personally, I would not care if it meant an increase in robots.txt-effectiveness
latency in the process - IF I was assured that these links would be dropped as
a final step before the new index was published.

To save time, let me say that I'm using shorthand here for the syntax of the
robots.txt directives and the robots metatag - My resources all validate to the
appropriate standards. They also will return http server code 200 and are not
cloaked unless you use a well-known e-mail harvester or site-grabber UA.
The problem is just as you described - uncrawled but Disallowed links that I'd
like to see dropped from the index (in a systemic way, not case-by-case).

Thank you GoogleGuy, in both your official and unofficial capacity, for
"engaging" with the webmaster community here at WebmasterWorld. I am
pleasantly surprised (OK, astounded) that any modern company has the wisdom to
realize the value of the goodwill that this generates.

Jim

Key_Master

10:03 pm on Jul 10, 2002 (gmt 0)

Thank you for bringing this up jdMorgan. I didn't want to seem like I was picking on Google or GoogleGuy.

There is a benefit for Google in what jdMorgan is suggesting: fewer 404s in Google's database. No telling how many there are already. I've also noticed that Google has had to modify search results for various file exploits, e.g. formmail.pl. However (without posting specific URLs), I was still able to find hundreds of password lists in Google's database. Many of these nondescript URLs were in robots.txt-prohibited directories. This is what I meant about Google's huge effect on Web security.

jdMorgan

10:23 pm on Jul 10, 2002 (gmt 0)

Key_Master,

Yeah, I don't think Google should be picked on. Fact is, Google is one of
only two SEs who are responsive to wmw members' concerns in
any meaningful way. And Google is far and away more engaged than the other
player, who is still a new kid on the block.

After watching several interactions with GoogleGuy on wmw, I
have no doubt that if the problem is fixable, it will be fixed - and soon.
Google's participation here is good business, and it is an outstanding example
of Public Relations (not to be confused with that other "PR"). :)

If we "pick on" GG, mob him, or stickymail him to death, Google might stop
participating here, and that would be bad... Thus, my introductory
"write grumpy" disclaimer. ;)

(A few days ago, K-M and I discussed aspects of this problem on another thread,
and I had to go check my logs before I posted here to make sure those pages had
not been crawled. They had not been crawled, but were included in the results
of certain advanced searches as snippet-less links apparently because there were
links pointing to them.)

Jim

mbauser2

11:09 pm on Jul 10, 2002 (gmt 0)

These pages are either "somewhat sensitive", containing stuff I would prefer to
keep out of easy view (e-mail spammer treasures), or they are printable forms or
other "junk" that would be of no plausible use to anyone outside of our
organization

A seriously misguided approach. Google lists those pages because they're linked from other indexed pages. Removing the link from Google won't hide you from spambots, because they can find you from the other links to the page. Expecting Google to protect you from spammers is bad design, and a bit delusional.

"Somewhat sensitive" information is a non-functional categorization in a network (the Web) designed to make information more accessible. Information has to be protected, or public. Creating "in-between" categories is counterproductive, and in the end, useless, because you depend on others' good behavior.

While we're at it, your original assumption about robots.txt is wrong.

It is my opinion that any
resource Disallowed by robots.txt or by the noindex metatag should be dropped
cold by Googlebot; It says "noindex", not "nofetch" or "nocrawl" or "nosnippet".

This opinion is inconsistent with the stated purposes of the 1994 robots.txt specification [robotstxt.org]. Some key quotes:

Consider it a common facility the majority of robot authors offer the WWW community to protect WWW server against unwanted accesses by their robots.

These incidents indicated the need for established mechanisms for WWW servers to indicate to robots which parts of their server should not be accessed.

The method used to exclude robots from a server is to create a file on the server which specifies an access policy for robots.

The mission of robots.txt is intentionally narrow, and meant to protect a server from being accessed, not to protect an address from being known. Any information garnered about a resource without touching the server is fair game.

This is a non-productive line of thought. You guys are taking standards created by the search engine industry and misrepresenting them in order to use them against that industry. robots.txt and the robots META tag were created to protect servers and content, respectively. An address isn't content.

jdMorgan

12:20 am on Jul 11, 2002 (gmt 0)

mbauser2,

OK, I'm misguided... I want to keep our "junk" files off the web, and would
prefer if visitors used standard entry points to our site - if for no other
reason than to avoid confusion... on my part and theirs.

But nevertheless, I have stated in two different ways - using robots.txt and the
robots html metatag (The only two methods, BTW, available to most of the
"tripod.com/~home page" creators) - that I do not wish the pages to be included
in SE's indexes. Without invoking formal semantic analysis of the precise
meaning of the robots exclusion standard, why would Google want to waste space
on their disk farms to store these link-only no-snippet page results when the
pages are tagged for exclusion?

I will see if anything constructive comes of this discussion, and if not, take
further steps to clean up the mess by using image or scripted links, rearranging
and deleting pages, using .htaccess to block outside-of-domain access to some
resources, etc. There's nothing so important that I need to move it to a secure
server or anything, but do web searchers really want a deep link to the third
page of our sports club's membership application form - which is essentially
meaningless outside of the application process context?

However, I hope that Google will consider the balance between "indexing
everything" and having their robot behave in a manner consistent with the
expectations of webmasters who use robots.txt and the robots metatag -
even if not limited by the formal requirements of the robots exclusion standard.
Yes, I said "expectations" - even if misguided, pollyannic, naive, uninformed,
or just plain stupid... I acknowledge that Google is not doing anything wrong
"by the book." That's why I consider this a discussion, not a demand.

Simply put, I put a "noindex" directive on some of my pages. They are indexed.
This was a surprise to me. I know what "the book" says, but yet, I don't think
the issue should depend on what the legal definition of "is" is...

Cheers,
Jim

Key_Master

12:34 am on Jul 11, 2002 (gmt 0)

mbauser2,

A link isn't content...debatable. Search engines do use text in links to determine relevance, albeit minimally.

Anyhow you are confusing the issue. The robots.txt specs have nothing to do with links in a textual sense. In other words, it is fair for a search engine to index an allowed page that may contain links, even if some of those links may point to robots.txt prohibited directories or files.

One argument here is whether a search engine can take a bold step and say that a link to a robots.txt prohibited file is in fact a true link. Think about it, Google doesn't even know if it really is a link. It cannot know- Googlebot has never followed it. Why would Google post a link that it has been asked not to index or follow? Where (or what) is the profit in that? Why doesn't Googlebot simply crawl robots.txt files and list all of those exclusions in their search results?

Besides, you really think your files are protected? You don't think someone could put a page up listing links to files on your site you thought were secret? And if this were to happen, you're saying it's fair game for the search engines to show? Come on...

Ringing the bell, giving out wings...;)

[google.com...]
[google.com...]

rjohara

12:59 am on Jul 11, 2002 (gmt 0)

Jim, I'm curious about your problem because I also use noindex meta elements to keep a few local pages from being indexed, and they have always worked for me. Could you perhaps paste the meta elements here so we can look at them? It is possible for a page to validate but for there to be something wrong with the arrangement of the elements.

Just trying to think of what might be wrong.

RJO

Everyman

2:36 am on Jul 11, 2002 (gmt 0)

Whether the webmaster knows how to configure his website or not, the problem is that Google is too aggressive, with or without a robots.txt.

The toolbar being used to find new sites, and not mentioning this in their privacy policy, is one example.

Another example is that the robots.txt "disallow" still gets a link listed, even though it may prevent crawling. If the filename or directory name is relevant, this link comes out in the SERPs, available for everyone to click on.

Another example is when we all found out about the image search in June, 2001 by noticing that Google was sucking up our images. Those of us who didn't like this had to move all our images into a disallowed directory, and wait for the next image crawl to kick in months later.

Here's another example -- put this in the Google search box:

"index of /" +etc +passwd -- You get 14,200 hits.

Google has an attitude problem. They feel that your site is really their site.

jdMorgan

4:56 am on Jul 11, 2002 (gmt 0)

rjohara,

Since others have reported the same problem, I don't think there's a need to
post my robots.txt - I've been coding since the Intel 8008 (8008, not 8080)
processor came out, and the elements of a robots.txt file - and their order-
dependent meaning - are clear to me. I'll sticky you if you really want to
see it, though.

The way I get these snippet-less link-only results to show up is to simply
click on a "[More results from www.quux-foo.org]" link after doing a regular
simple search.

Here's what a couple of them from the "more results" SERP look like (just
imagine they're blue and grey and underlined):

www.quux-foo.org/join_us.html
Similar pages

www.quux-foo.org/pri_pol.html
Similar pages

I just don't see the point in these listings. For the most part, I don't
deeply care that they're listed, I'm just surprised that they're listed.

And having been in the computer business a fairly long time, it wouldn't
surprise me at all if this is just the result of imperfect translation of
imperfect requirements specifications based on imperfectly-worded standards
into imperfect code. Add to that the complications of massive parallelism,
and you just might get a bug. :)

Key-Master said what I meant (in a lot fewer words, too) in the first four
sentences of the third paragraph in his post above. I'm planning to go back
into lurk mode, and see if we get any word from the 'plex... Sometimes
"looking into it" takes awhile. Right now, I've got to go fix another
problem - Lycos just came in and tried to grab robots.txt with NO UA and NO
referer. It uses its UA in the next line fetching "/" though... (sigh).

Jim

EliteWeb

6:49 am on Jul 11, 2002 (gmt 0)

Everyman in regards to the query
"index of /" +passwd.txt

on google.com: that is the fault of the webmaster. It sucks, sure, but with anything good there is bad for those who want bad :)

GoogleGuy

7:13 am on Jul 11, 2002 (gmt 0)

This is a great thread. I would write a long post now, but right now I'm on an ancient version of Netscape where the text box doesn't wrap, so it's really hard to post. I'll try to write up some more thoughts/details about this by this weekend.

jdMorgan, thanks for the kind words. I've read much grumpier posts, and I appreciate the honest feedback.

Well, it's coming out anyway, so I'll do a quick pass. :) First, I think there's a definite value to returning a link to a page even if we can't crawl that page. Quick example: the New York Times used to disable all bots from crawling them. That's fine, and we respected their robots.txt. But if a user comes to Google and types "ny times" into our search box, the best result to give them is nytimes.com. By returning the link to nytimes.com--even though we never actually crawled that page--we were giving a better search result to users. Luckily, most sites have realized that being visible in search engines is a good thing.

So that's a quick example that shows the value of returning links even if you can't crawl the page. If a certain url is "newsworthy" enough to be returned as a search result even if we didn't crawl the page, it's probably a good result. jdMorgan, that's what's happening with you. Whatever your site is, it's reputable enough that we recorded a few of your links even though we weren't allowed to crawl them, according to your robots.txt.

Everyman brings up a really good point about the balance between aggressively seeking information vs. holding back, and it's true that Google's philosophy is along the lines of "if someone can type it into a browser address bar, we ought to index it." That definitely leads to a few problems where webmasters don't protect their data, and Google finds it. On the other hand, 9 out of the top 10 results for "index of /" +etc +passwd are actually useful information pages or discussions about the /etc/passwd file. For now, I'll just agree with Everyman that this is a tough balance to strike.

Anyway, I hope you don't stay totally in lurk mode, jdMorgan. Seems like you've got a lot to contribute around here! :)

GoogleGuy

7:17 am on Jul 11, 2002 (gmt 0)

I feel obligated to insert a quick post here to bring me to an even 300 posts. Woohoo! :) I'm glad that we've got a line of communication open with webmasters.

vitaplease

8:32 am on Jul 11, 2002 (gmt 0)

By returning the link to nytimes.com--even though we never actually crawled that page--we were giving a better search result to users.

In a way Google is right. If hundreds and thousands of sites and directories want to help their visitors by also showing links to the nytimes.com why should Google not be just as helpful. They are just showing a link.

These pages are either "somewhat sensitive", containing stuff I would prefer to keep out of easy view (e-mail spammer treasures), or they are printable forms or other "junk" that would be of no plausible use to anyone outside of our organization.

jdMorgan also has every reason to be worried if links are showing to his protected sub-pages. The "junk" matter is less of an issue. That's Google's problem.

I would say an easy way out is that Google would use its own Pagerank mechanism to solve this.

Google should only show links towards index-pages (such as the nytimes.com example) with the right robots protection code that would have had a high Pagerank, should they have been allowed to be indexed. (With an inclusion that it needs many links, not just one link from a PR8 page).

Google should not show links towards internal robot-protected pages of a site. Those would, in 90% of the cases be sensitive material according to its owners.

Again: nothing wrong with showing the link towards the index-page of a well known site, but let's be less inquisitive about internal stuff.

If Google is really trying to serve the surfer by showing these "obscure" internal, index protected pages by way of a link, why not save that hard disk space for previously unindexed, complete sites with no links to them, but allowing full indexing in all their robot texts?

(BTW I am a webmaster since Pentium 600 ;))

ciml

10:27 am on Jul 11, 2002 (gmt 0)

jdMorgan, your META robots tag cannot be used by a /robots.txt respecting robot like Google, as the page is never fetched.

I would be interested to know if Google lists URLs that have NOINDEX in the robots META tag, at least one link from a page in Google, and are not /robots.txt forbidden.

From the robots.txt specs [robotstxt.org], "The INDEX directive specifies if an indexing robot should index the page." so I don't see why an engine couldn't list the URL without indexing the page. However, in this case it would make sense not to keep the link as it's clear the webmaster doesn't want it to be found (unlike /robots.txt exclusion which can be for server load reasons).
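
The interaction described above (a /robots.txt Disallow keeps the page from being fetched, so a META robots tag inside it is never read) can be sketched as a small decision function. This is a toy model under stated assumptions: plan_listing, the "ExampleBot" user agent, the outcome strings and the inbound_links parameter are all hypothetical and say nothing about Google's actual code.

```python
from urllib.robotparser import RobotFileParser


def plan_listing(robots_txt: str, url: str, inbound_links: int,
                 user_agent: str = "ExampleBot") -> str:
    """Describe what a robots.txt-respecting engine could do with url.

    Hypothetical model for discussion only; the outcome strings and the
    inbound_links input are assumptions, not any real engine's logic.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if parser.can_fetch(user_agent, url):
        # The page may be fetched, so a META robots tag inside it can be read.
        return "fetch page, then honor its META robots tag"
    if inbound_links > 0:
        # The page is never fetched, so its META robots tag stays invisible,
        # but the address is still known from links on other pages.
        return "list URL only (no snippet); META robots tag never seen"
    # Disallowed and not linked from anywhere the engine has seen.
    return "unknown URL; nothing to list"


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /private/\n"
    print(plan_listing(robots, "http://www.example.org/private/form.html", 3))
    print(plan_listing(robots, "http://www.example.org/index.html", 3))
```

In this model the only way for a noindex tag to take effect is for the page to be fetchable in the first place.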

vitaplease, I disagree that Google should only list un-fetched URLs if they are at document root. Often, the pages in a product catalogue don't get fetched due to lack of PageRank, but can be listed as URLs from their link text.

Robots Exclusion Protocol should never be used to keep "sensitive material" safe; it just doesn't do that job and it cannot do that job. If it's sensitive, it shouldn't be returned to unauthenticated users who connect to port 80 and ask for it.

GoogleGuy:
> I'm glad that we've got a line of communication open with webmasters.

Me too (even if you haven't told us how to avoid those darned PR0 penalties yet:)).

vitaplease

1:54 pm on Jul 11, 2002 (gmt 0)

vitaplease, I disagree that Google should only list un-fetched URLs if they are at document root. Often, the pages in a product catalogue don't get fetched due to lack of PageRank, but can be listed as URLs from their link text.

If Google does not fetch pages because of lack of Pagerank, wouldn't it be that Google then has a spidering capacity/efficiency problem and that that would have little to do with privacy/security/presentation issues?

But - anno-now - in an imperfect Google-world, your option is probably the most realistic.

If Google wants to keep everyone happy, i.e. the searcher looking for certain sites and the webmaster wanting to keep certain low-level information private, I believe they should try to revert to a similar index-page Pagerank restriction model as mentioned above, as soon as their spidering capacity/efficiency is no longer an issue.

mbauser2

6:18 pm on Jul 11, 2002 (gmt 0)

There's nothing so important that I need to move it to a secure
server or anything, but do web searchers really want a deep link to the third
page of our sports club's membership application form - which is essentially
meaningless outside of the application process context?

If they're searching something that leads them there, then yes, that might be what they're looking for. That's the Web for you.

You're making the ultimate mistake of a Web Author Who Doesn't Get It: You're assuming the author always knows better than the reader. The Web is a hypermedia system; the entire point of hypermedia is to give users a choice of paths through an information space. Search engines are one way of creating paths. Browser bookmarks are another. Static links are a third. The path is never completely under author control.

The failure is not in the medium, it's in your failure to understand the organizing principle of the medium. You may as well start complaining that all of Gutenberg's bibles look the same.

Simply put, I put a "noindex" directive on some of my pages. They are indexed.
This was a surprise to me. I know what "the book" says, but yet, I don't think
the issue should depend on what the legal definition of "is" is...

OK, now you've gone past "misguided" and straight into "offensive". I'm not the one who tried to redefine meaning, you are. The standard says what it says, and the robots do what it says. You're the one diving for cover behind semantics, demanding that others bend to your "expectation". You have expectations, I have expectations, the world has expectations: Expectations don't work as a way to settle debates. Standards do.

Responsible human beings who have issues with rules try to reform the rules, not just insist everyone ignore them. You're fighting the wrong fight, and that's what makes your fight meaningless.

mbauser2

8:01 pm on Jul 11, 2002 (gmt 0)

A link isn't content...debatable.

I didn't say "link", I said "address". An address is a string of characters that identifies the location of a content-bearing resource in an information space. Location isn't content, it's metadata. Even Yahoo knows that [dir.yahoo.com], and their editors are usually the last ones to know anything important.

One argument here is whether a search engine can take a bold step and say that a link to a robots.txt prohibited file is in fact a true link. Think about it, Google doesn't even know if it really is a link. It cannot know- Googlebot has never followed it.

Ow. You're confusing "link", "address", and "resource" again. A link defines a relationship, and relationships can be examined without seeing both ends of the link. If one end (resource) of the relationship is hidden, lost, or imaginary, the relationship still has some meaning. Google's big "thing" is that they understood the relevance of relationships before other engines.

Why would Google post a link that it has been asked not to index or follow? Where (or what) is the profit in that?

Apparently, the profit is in helping people find information. They've been including URL-only entries since Day One, and people have apparently been using them. Google is living proof that relationship analysis can be almost as useful for resource discovery as content analysis is.

Why doesn't Googlebot simply crawl robots.txt files and list all of those exclusions in their search results?

robots.txt doesn't tell it anything about the content of those files. There's no meaningful semantic information in robots.txt.

Besides, you really think your files are protected? You don't think someone could put a page up listing links to files on your site you thought were secret?

Are you trying to put somebody else's words in my mouth?

And if this were to happen, you're saying it's fair game for the search engines to show?

That's exactly what I am saying. "Security through obscurity" is naive. Trying to control what others say about you is naive. Allowing others to control what third parties say about them is dangerous for the free flow of information.

Ringing the bell, giving out wings...

I'd tell you where to put that bell, but the moderators would just edit it out.

mbauser2

8:03 pm on Jul 11, 2002 (gmt 0)

Google should only show links towards index-pages (such as the nytimes.com example)

Assumes "domain == site". Not true.

jdMorgan

8:56 pm on Jul 11, 2002 (gmt 0)

mbauser2,

I am not complaining. I expressed surprise that the combined system of
robots.txt and meta robots tags leaves a gap with unexpected results. A major
contributor to this gap is the fact that some search engine robots observe
robots.txt, some observe the robots meta tag, some do both, and I suppose there
may be some low-profile ones that do neither. If all SEs observed both, I could
"Allow" these pages in robots.txt, and then "meta noindex" them in the pages
themselves. According to what Googleguy posted early-on in this thread, that
would likely solve the problem.
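
As a rough sketch of the "allow the fetch, then meta noindex the page" idea, the snippet below shows how a robot that has been allowed to fetch a page could read the robots META tag from it, using Python's standard html.parser. The class name, the helper name and the sample page are hypothetical, and real robots may parse pages differently.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None when an attribute has no value.
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html: str) -> set:
    """Return the set of robots META directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


if __name__ == "__main__":
    # Hypothetical page, for illustration only.
    page = ('<html><head><meta name="robots" content="noindex,nofollow">'
            '</head><body>Membership form</body></html>')
    print(sorted(robots_directives(page)))
```

The tag can only be honored by a robot that actually receives the page, which is why the page must not also be Disallowed in robots.txt for this approach to work.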

I understand the medium well; I am no more trying to bend the rules to my will
than a search engine company which decides not to index all those tilde-domain
home pages (whether as a policy or by algorithm design) because they are - in
the grand scheme of things - not that important. I think I'm just being
practical. YMMV.

The e-mail-address-hiding business is something I've just finished doing,
and it was a lot of work. Maybe my frustration was showing.

Perhaps the next update will sort things out if there was a problem with
Googlebot accessing my robots.txt or some similar temporary problem. If not,
I'll just take those pages down, then rewrite and re-post a smaller number of
them to reduce the cr@p-factor. Or perhaps some "negative optimization"
will help.

You have contributed to clarifying this discussion, and for that - Thanks. As
to your personal remarks about me, I will not be drawn by ad-hominem attacks.

Jim

Key_Master

10:45 pm on Jul 11, 2002 (gmt 0)

A link isn't an address and an address isn't a link? Talk about splitting hairs. If a link (errrrr address if you prefer) points to a robots.txt prohibited page on my site I would prefer that it not be listed, if for nothing else than common courtesy. Anyhow, what is the "resource" in a robots.txt prohibited, .htpasswd protected directory? Nobody but the owner can access it (lawfully), why does the link need to show up at all in Google's database? Does Googlebot go, "Hmmmm, preventing access to a shopping cart script. No problem, I know a few hackers who search for this sort of thing..."?

Trying to control what others say about you is naive. Allowing others to control what third parties say about them is dangerous for the free flow of information.

Putting words into my mouth? Who said anything about controlling what others say about you? But since you brought it up I'll let you in on a little "secret"- this is done all the time. Yes, believe it or not, we can control what other parties "say" about us. There is no such thing as "free flow of information". That's what you are so naive about.

This argument will eventually subside because in the future I'd wager that most sensitive hyperlinks will be time coded and encrypted. Google and other search engines won't bother indexing them because the links will have expired 30 minutes after they were spidered.
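
One way such a time-coded link could work is sketched below. Note that it uses an HMAC signature with an expiry timestamp rather than encryption, which is a substitution made here for simplicity. The secret, the path, the query parameter names and both function names are assumptions for illustration, and the 30-minute lifetime is taken from the wager above.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlsplit

SECRET = b"replace-with-a-real-secret"  # hypothetical server-side key
TTL_SECONDS = 30 * 60                   # links expire after 30 minutes


def _signature(path: str, expires: int) -> str:
    """HMAC-SHA256 over the path and expiry time, as a hex string."""
    message = f"{path}|{expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def sign_link(path: str, now=None) -> str:
    """Return path with an expiry timestamp and signature appended."""
    issued = time.time() if now is None else now
    expires = int(issued) + TTL_SECONDS
    return f"{path}?expires={expires}&sig={_signature(path, expires)}"


def verify_link(url: str, now=None) -> bool:
    """Accept the link only if the signature matches and it has not expired."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    try:
        expires = int(query["expires"][0])
        sig = query["sig"][0]
    except (KeyError, ValueError):
        return False
    current = time.time() if now is None else now
    expected = _signature(parts.path, expires)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected.encode(), sig.encode()) and current <= expires


if __name__ == "__main__":
    link = sign_link("/members/form3.html")
    print(link, verify_link(link))
```

A link recorded by a spider would fail verification once its expiry time has passed, so a stale URL in a search index would lead nowhere.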

If one end (resource) of the relationship is hidden, lost, or imaginary, the relationship still has some meaning.

Who are you quoting from, Sigmund Freud?

I feel obligated to insert a quick post here to bring me to an even 300 posts. Woohoo! I'm glad that we've got a line of communication open with webmasters.

GoogleGuy, noticed you didn't say, Yahoo! Just kidding...Congratulations on 300 posts. Although I don't always agree with your company's stance on some issues, I really do appreciate your input.

I use Google every day just like the rest of the world. ;)

ciml

10:51 am on Jul 12, 2002 (gmt 0)

Key_Master:
If a link (errrrr address if you prefer) points to a robots.txt prohibited page on my site I would prefer that it not be listed.

I know of no accepted protocol for making that known. Maybe it would be useful if the Robots Exclusion protocol had a "please don't mention this URL" facility in /robots.txt, but it doesn't. It has a "please don't fetch this URL" facility.

Google added the noarchive META tag, but it's hard to see how they could come up with a scheme for 'secret URLs'. Maybe it's better that they don't. The false sense of security would encourage people to think that they don't need to protect their sensitive information.

mbauser2:
> If one end (resource) of the relationship is hidden, lost, or imaginary, the relationship still has some meaning.

That's what my imaginary friends tell me.:)
