One of the posts on my blog was indexed with this url:
mysite.com/wordpress/index.php?p=56
I don't think this should happen because of the robots.txt rule quoted below. Indeed, if I use Google's robots.txt testing tool and enter the above url, the response I get is:
Blocked by line 44: Disallow: /*?*
So why is it indexed in Google the way it is? I would have expected this post to have been indexed like this:
site.com/category/post-title
Thanks!
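(A note for later readers of this thread: the wildcard handling that Google's testing tool is applying above can be sketched in Python. The helper name and the pattern-to-regex translation here are illustrative assumptions, not Google's actual code.)

```python
import re

def is_blocked(disallow_pattern: str, url_path: str) -> bool:
    """Google-style robots.txt matching: '*' matches any run of
    characters, '$' anchors the end of the url, and a rule matches
    any url that *begins* with the pattern."""
    regex = re.escape(disallow_pattern).replace(r'\*', '.*').replace(r'\$', '$')
    return re.match(regex, url_path) is not None

print(is_blocked('/*?*', '/wordpress/index.php?p=56'))  # True: has a query string
print(is_blocked('/*?*', '/category/post-title'))       # False: no query string
```

This is consistent with what the testing tool reports: the query-string url matches `Disallow: /*?*`, while the permalink-style url does not.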
I have some similar rules and see similar problems, but in my case those pages were indexed before I tried to prohibit any page with a query string, and I haven't had time lately to check whether that stuff is getting de-indexed.
The post in question was put up on June 14th. The robots.txt file was put up on June 11th.
So the robots file was definitely there when this post was indexed. The other thing is that the post has not yet been indexed with its actual post url, which is the way I want posts indexed. When I check the post url against the robots.txt file, it is allowed.
I started using a robots.txt file because on a couple of different web sites, things were being indexed 3, 4, or even 5 times: once when first uploaded to the site as a draft, again as a finished post, again in a category archive, and in other ways besides. The robots.txt file was suggested to me as a way to eliminate all that duplicate content.
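For illustration, a robots.txt aimed at that kind of duplication might look something like this. The specific Disallow paths below are my assumptions about a typical WordPress setup, not the poster's actual file:

```
User-agent: *
# Block any url carrying a query string (draft/preview duplicates)
Disallow: /*?*
# Block the raw index.php urls, which duplicate the permalinks
Disallow: /wordpress/index.php
```

Note that `Disallow: /*?*` relies on wildcard support, which Google honors but which is not part of the original robots.txt standard, so other crawlers may ignore it.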
Also, I believe you can exclude directories or files, but not parameter strings.
I was wondering whether the Google bot looks at the robots.txt file every time it accesses the site, or whether it uses the copy it downloads (the one displayed in Webmaster Tools). If the latter, that might explain how the new robots.txt file was missed, since Google shows it as having been downloaded on 6/18. If the bot looks every time it accesses the site, this makes no sense.
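For what it's worth, crawlers typically cache robots.txt for some period rather than re-fetching it before every page request, which would explain the window described above. A minimal sketch of that kind of caching; the class, TTL value, and `fetch` callable are illustrative assumptions, not Google's actual behavior:

```python
import time

class CachedRobots:
    """Re-fetch robots.txt only after a time-to-live expires,
    serving the cached copy in between."""

    def __init__(self, fetch, ttl=24 * 3600):
        self.fetch = fetch        # callable returning the robots.txt text
        self.ttl = ttl            # seconds to keep the cached copy
        self._body = None
        self._fetched_at = 0.0

    def get(self):
        # Only hit the network when there is no copy yet or it has expired.
        if self._body is None or time.time() - self._fetched_at > self.ttl:
            self._body = self.fetch()
            self._fetched_at = time.time()
        return self._body
```

With a cache like this, a robots.txt uploaded between fetches simply isn't seen until the cached copy expires.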
I'll watch the next few posts and see if the same thing happens.
Even if you had been recognized by G, using their link: feature wouldn't help completely, as they only show a fraction of the sites that link to you.
You may wish to enter the name (not url) of your site, in the search query.
url: example.com
site name: "the example place"
Thanks, Vince. I did not understand that.
I could see how, if a page had previously been indexed and you then disallow it, Google might still index it. But if a page has never been indexed and is disallowed by robots.txt, then the Google bot should ignore it.
Maybe what is happening in my case is this: before my robots.txt file was uploaded, Google was allowed to index pages in that directory, so Google knows about the directory. Now, even though I disallow the directory with robots.txt, because Google already knows about it, the bot ignores the direction to exclude it and indexes any new pages put into the directory?
"and it doesn't stop an engine from making entries for pages which it knows about by other means. In the past, Google has listed pages and domains which are entirely disallowed by robots.txt and used the title and snippet from a reputable directory."
Once again Vince has it right!
For anyone searching and finding this thread, here is what I discovered:
I am using a WordPress plugin that broadcasts the url to one specific website as a post or page goes from draft to published form. At that point, there is no permanent url for the post yet; the permalink is generated upon publishing. So the url the plugin passes along ends up on that website incorrect, because the permalink is created only after the plugin does its thing.
I allow Google to index only the permanent url, but Google finds the incorrect url on that other website and indexes it that way as well. One page, two different urls.
I found a way to fix this behavior, so hopefully it will cease.
Thanks for the help!
[edited by: WilliamT at 1:17 am (utc) on June 26, 2007]