One of the posts on my blog was indexed with this url:
mysite.com/wordpress/index.php?p=56
I don't think this should happen because of the robots.txt rule quoted below. Indeed, if I use Google's robots.txt testing tool and enter the above url, the response I get is:
Blocked by line 44: Disallow: /*?*
So why is it indexed in Google the way it is? I would have expected this post to have been indexed like this:
site.com/category/post-title
Thanks!
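(A note for later readers of this thread: the wildcard handling that Google's testing tool is applying above can be sketched in Python. The helper name and the pattern-to-regex translation here are illustrative assumptions, not Google's actual code.)

```python
import re

def is_blocked(disallow_pattern: str, url_path: str) -> bool:
    """Google-style robots.txt matching: '*' matches any run of
    characters, '$' anchors the end of the url, and a rule matches
    any url that *begins* with the pattern."""
    regex = re.escape(disallow_pattern).replace(r'\*', '.*').replace(r'\$', '$')
    return re.match(regex, url_path) is not None

print(is_blocked('/*?*', '/wordpress/index.php?p=56'))  # True: has a query string
print(is_blocked('/*?*', '/category/post-title'))       # False: no query string
```

This is consistent with what the testing tool reports: the query-string url matches `Disallow: /*?*`, while the permalink-style url does not.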
I have some similar rules and see similar problems, but in my case those pages were indexed before I tried to prohibit any page with a query string, and I haven't had time lately to check whether that stuff is getting de-indexed.
The post in question was put up on June 14th. The robots.txt file was put up on June 11th.
So the robots file was definitely there when this post was indexed. The other thing is that the post has not yet been indexed with its actual post url, which is the way I want posts indexed. When I check the post url against the robots.txt file, it is allowed.
I started using a robots.txt file because on a couple of different web sites, things were being indexed 3, 4, or even 5 times: once when first uploaded to the site as a draft, again as a finished post, again in a category archive, and in other ways besides. The robots.txt file was suggested to me as a way to eliminate all that duplicate content.
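For illustration, a robots.txt aimed at that kind of duplication might look something like this. The specific Disallow paths below are my assumptions about a typical WordPress setup, not the poster's actual file:

```
User-agent: *
# Block any url carrying a query string (draft/preview duplicates)
Disallow: /*?*
# Block the raw index.php urls, which duplicate the permalinks
Disallow: /wordpress/index.php
```

Note that `Disallow: /*?*` relies on wildcard support, which Google honors but which is not part of the original robots.txt standard, so other crawlers may ignore it.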
Also, I believe you can exclude directories or files, but not parameter strings.
I was wondering whether the Google bot looks at the robots.txt file every time it accesses the site, or whether it uses the copy it downloads (the one displayed in Webmaster Tools). If the latter, that might explain how the new robots.txt file was missed, since Google shows it as having been downloaded on 6/18. If the bot looks every time it accesses the site, this makes no sense.
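For what it's worth, crawlers typically cache robots.txt for some period rather than re-fetching it before every page request, which would explain the window described above. A minimal sketch of that kind of caching; the class, TTL value, and `fetch` callable are illustrative assumptions, not Google's actual behavior:

```python
import time

class CachedRobots:
    """Re-fetch robots.txt only after a time-to-live expires,
    serving the cached copy in between."""

    def __init__(self, fetch, ttl=24 * 3600):
        self.fetch = fetch        # callable returning the robots.txt text
        self.ttl = ttl            # seconds to keep the cached copy
        self._body = None
        self._fetched_at = 0.0

    def get(self):
        # Only hit the network when there is no copy yet or it has expired.
        if self._body is None or time.time() - self._fetched_at > self.ttl:
            self._body = self.fetch()
            self._fetched_at = time.time()
        return self._body
```

With a cache like this, a robots.txt uploaded between fetches simply isn't seen until the cached copy expires.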
I'll watch the next few posts and see if the same thing happens.
Even if you had been recognized by G, using their link: feature wouldn't help completely, as they only show a fraction of the sites that link to you.
You may wish to enter the name (not url) of your site, in the search query.
url: example.com
site name: "the example place"
Thanks, Vince. I did not understand that.
I could see how, if a page had previously been indexed and you then disallow it, Google might still index it. But if a page has never been indexed and is disallowed by robots.txt, then the Google bot should ignore it.
Maybe what is happening in my case is this: before my robots.txt file was uploaded, Google was allowed to index pages in that directory, so Google knows about the directory. Now, even though I disallow the directory with robots.txt, because Google already knows about it, the bot ignores the direction to exclude it and indexes any new pages put into the directory?
"and it doesn't stop an engine from making entries for pages which it knows about by other means. In the past, Google has listed pages and domains which are entirely disallowed by robots.txt and used the title and snippet from a reputable directory."
Once again Vince has it right!
For anyone searching and finding this thread, here is what I discovered:
I am using a WordPress plugin that broadcasts the url to one specific website as a post or page goes from draft to published form. At that point, there is no permanent url for the post yet; the permalink is generated upon publishing. So the url the plugin passes along ends up on that website incorrect, because the permalink is created only after the plugin does its thing.
I allow Google to index only the permanent url, but Google finds the incorrect url on that other website and indexes it that way as well. One page, two different urls.
I found a way to fix this behavior, so hopefully it will cease.
Thanks for the help!
[edited by: WilliamT at 1:17 am (utc) on June 26, 2007]