I have ~300 "bad" documents in the Google index

how can I get rid of them?

         

Martin Dunst

8:27 am on Feb 10, 2004 (gmt 0)

10+ Year Member



Hello,

There's one resource on my site that I don't want Google to index, as it is a user registration form. The resource is usually requested with one URL parameter; the URLs look like this:

www.example.org/thatscript.php?key=value

Now Google has indexed 400 different URLs of that kind, resulting from 400 different key=value parameters.

My robots.txt (untouched for months):
User-agent: *
Disallow: /thatscript.php

Is there any problem with my robots.txt or is it Googlebot's fault?
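(A quick sanity check of the rule itself: the sketch below uses Python's standard-library robots.txt parser, which, like Googlebot, treats Disallow as a prefix match, so the query-string variants are covered too. Googlebot's own matching may differ in edge cases.)

```python
import urllib.robotparser

# The robots.txt from the post above, verbatim.
robots_txt = """\
User-agent: *
Disallow: /thatscript.php
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Disallow is a prefix match on the path, so every ?key=value variant is blocked.
print(rp.can_fetch("Googlebot", "http://www.example.org/thatscript.php?key=value"))  # False
print(rp.can_fetch("Googlebot", "http://www.example.org/index.html"))                # True
```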

regards
Martin

domokun

9:42 am on Feb 10, 2004 (gmt 0)

10+ Year Member



I think I'm right in saying that Googlebot is working just fine. I predict that these occurrences in the SERPs will show just the link, i.e. no text, no title, no description.
If you want Googlebot not to display the links, you have to tell it not to on each individual page, using the robots meta tag and stating noindex,nofollow.
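The meta tag in question goes in each page's head; for the registration script above it would look something like this (the surrounding markup is illustrative):

```html
<head>
  <title>User registration</title>
  <!-- Tell robots not to index this page or follow its links -->
  <meta name="robots" content="noindex,nofollow">
</head>
```

Note that a crawler has to be able to fetch the page to see this tag, which is where the robots.txt Disallow gets in the way.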

johannamck

2:00 pm on Feb 10, 2004 (gmt 0)

10+ Year Member



noindex,nofollow will keep Google from adding these pages in the future.

It can be a bit slow with housekeeping though. If no links point to those pages, they might not get visited anytime soon.

I changed the tag to noindex,nofollow for a bunch of pages last year, and they stayed in the index for months until I linked to them (in an inconspicuous way) from one of my regular pages.

rogerd

2:09 pm on Feb 10, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



This is kind of paradoxical: to eliminate the listed (but not spidered) URLs you have to undo the prohibition in robots.txt, and then use the robots "noindex" meta tag in the page itself. Otherwise, Googlebot makes a note of the link when it's found on other pages, but doesn't actually spider the page. One can end up with a bunch of zombie URLs cluttering up Google's index. I don't know whether this is detrimental to the site's "real" pages, but it certainly has a messy feel to it.
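The sequence described above might look like this (a sketch; how long Googlebot takes to re-spider the page and drop it is unpredictable):

```
# Step 1: in robots.txt, remove (or comment out) the block so the page can be fetched
User-agent: *
# Disallow: /thatscript.php

# Step 2: in thatscript.php's HTML <head>, tell robots not to index it
#   <meta name="robots" content="noindex,nofollow">

# Step 3: once the URLs have dropped out of the index, the Disallow
# could be restored if desired
```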

johannamck

2:57 pm on Feb 10, 2004 (gmt 0)

10+ Year Member



Rogerd, I agree. It does have a messy feel.

It reminds me a bit of this other problem:
[webmasterworld.com...]
where someone had to resurrect an old server to get Google to spider the new server location. (Even though that was a different technical issue.)

Maybe Googlebot is sentimental, and can't let go of the past easily? ;)

dirkz

3:01 pm on Feb 10, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> This is kind of paradoxical - to eliminate the listed (but not spidered) URLs you have to undo the prohibition in robots.txt

I have the same "problem": lots of robots.txt-blocked pages in the index. What is the drawback of it? Does it somehow hurt my ranking?

adfree

3:24 pm on Feb 10, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>Does it somehow hurt my ranking?

Can't think of a good reason why it would hurt you. It's pretty much a matter of ugliness and confidentiality on your side and served quality on the SE side.

Cleaning up would help both, cheers, Jens

GoogleGuy

5:07 pm on Feb 10, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It wouldn't make a difference for your rankings either way. You might want to check out our automatic URL removal tool:
[google.com...]

Not positive if it will work for your case (typically people use it for 1-2 urls), but you could check it out.

(I don't recall off-hand if that robots.txt will match your urls. We do support wildcards in the Disallow field though. I would check out our webmaster section for some examples.)
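For reference, a wildcard version of the rule might look like this (wildcards in Disallow are a Google-specific extension, not part of the original robots.txt standard, so other crawlers may ignore them; the plain prefix rule quoted earlier in the thread should already match the query-string variants):

```
User-agent: Googlebot
Disallow: /thatscript.php?*
```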