Google ignoring robots.txt

I'm seeing lots of my 'Disallowed' pages in their index

Amygdala

11:35 am on Apr 22, 2004 (gmt 0)

I run a large site (approx 47,500 results for a site:www.domain.co.uk search).

My robots.txt is set to:

User-agent: *
Disallow: /example.php

And yet, I am seeing LOTS of www.domain.co.uk/example.php?eg=123 type links showing up for the site:www.domain.co.uk search in Google.

Why is this? And more importantly, how do I stop Google from ignoring my robots.txt file? I do not want it spidering those pages, let alone listing them.

Any help greatly appreciated.

Sanenet

12:19 pm on Apr 22, 2004 (gmt 0)

I believe the actual syntax would be:

User-agent: Googlebot
Disallow: /example.php$

(I may be wrong on this; check it against the robotstxt page.)

However, if you put
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
in each page you don't want listed, Google will not index that page, although it will still follow the links on it.

dougmcc1

4:21 am on Apr 24, 2004 (gmt 0)

Why is this?

Just as example.php and example.php?eg=123 are entirely different pages on your site, Google considers them different pages as well.

For example, if you only wanted to disallow example.php?eg=123 you would need to do the following:

User-agent: *
Disallow: /example.php?eg=123

I'm guessing you want to disallow all versions of the example.php page, in which case Sanenet's method would be more appropriate, but hopefully this clears things up for you.
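To compare the rules suggested so far, here is a rough sketch of a matcher. It assumes the Google-style extension to robots.txt in which a rule is a prefix match, `*` matches any run of characters, and a trailing `$` anchors the end of the URL. The function name `rule_matches` is made up for illustration, and other robots may not honour `*` or `$` at all.

```python
import re


def rule_matches(pattern: str, url_path: str) -> bool:
    """Return True if a Disallow pattern applies to url_path.

    Assumption: Google-style matching (prefix match, '*' wildcard,
    trailing '$' meaning "the URL must end here").
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with a wildcard.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, which gives the prefix behaviour.
    return re.match(regex, url_path) is not None


if __name__ == "__main__":
    url = "/example.php?eg=123"
    for rule in ("/example.php", "/example.php$", "/example.php?eg=123"):
        print(rule, "->", rule_matches(rule, url))
```

Under these assumptions, the anchored rule `/example.php$` covers only the bare `/example.php` URL and not the query-string variants, while the plain `/example.php` rule covers all of them.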

jdMorgan

4:54 am on Apr 24, 2004 (gmt 0)

Amygdala,

The code you posted is fine, and should keep spiders from requesting any page whose name starts with "/example.php". The Standard for Robots Exclusion calls for robots to use prefix-matching on the pathnames you specify.
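The prefix-matching behaviour can be checked locally with Python's standard-library robots.txt parser. This is only a sketch: the helper name `is_allowed` is invented here, and the domain is the placeholder from the original post. It shows how a standards-following parser reads the rules, not what any particular search engine actually does.

```python
from urllib.robotparser import RobotFileParser

# The rules exactly as posted at the top of the thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /example.php
"""


def is_allowed(url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if user_agent may fetch url under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/example.php", "/example.php?eg=123", "/index.html"):
        url = "http://www.domain.co.uk" + path
        print(path, "allowed" if is_allowed(url) else "disallowed")
```

Both `/example.php` and `/example.php?eg=123` come back as disallowed, because each starts with the disallowed prefix.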

However, your problem may be a different one, and it's not clear from what you posted. So I'll describe it, and then you can tell if it applies.

Google and Ask Jeeves will list your page with no title and no description -- a so-called "URL-only" listing -- if they find a link to that page anywhere on the Web. Note that by doing this, they are still technically in compliance with the letter of the Standard for Robots Exclusion: they *did not* fetch your page, they just found a link to it and listed the link in their search results.

Some people (including me) wish they would not do this, but there are two sides to the argument. They say that listing pages that they find links to expands the information available to their users. I say that there are pages on my site that I don't want listed in search engines because, for example, those pages make lousy "landing pages" for first-time visitors. However, they are not actually fetching my disallowed pages, they're just using information they found elsewhere, so I can't say they are violating the Standard. It still makes me mad, but that's tough, I guess.

There is a fix, but unfortunately, it costs some bandwidth. For Google and Ask Jeeves, change robots.txt so that they are ALLOWED to fetch the page, and then put the <meta name="robots" content="noindex,nofollow"> tag in the <head> section of the page. If you do this, then they won't include anything about that page in their search results. But it costs you bandwidth because they will periodically re-fetch the page, see the robots tag, and then drop it. I've used this method for a couple of years. It usually takes them a while to drop a listing, but after that, it is effective.
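If you go this route, it is worth confirming that each page really serves the tag, since a robot only sees it if the page is not disallowed in robots.txt. The following is a minimal sketch using Python's standard `html.parser`. The names `RobotsMetaParser` and `robots_directives` are made up for illustration, and it checks a string of HTML you supply rather than fetching anything.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for item in attr.get("content", "").split(","):
            item = item.strip().lower()
            if item:
                self.directives.add(item)


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives present in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = (
        "<html><head>"
        '<meta name="robots" content="noindex,nofollow">'
        "</head><body>Hello</body></html>"
    )
    print("noindex" in robots_directives(page))
```

A page is a candidate for being dropped from the index only if `"noindex"` appears in the returned set.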

Other robots may use this policy, or may adopt it in the future -- we just have to keep an eye on them and adapt as necessary. Currently, Google and Ask Jeeves are the only (major U.S.) search engines I know of that "list by links."

Jim