I used mod_rewrite to set up 301 redirects from all of those URLs to the non-unique versions of the same pages, stripping out the unique elements. Sure enough, I see lots of 301 redirects being served to Googlebot in the logs. But I don't see it then reloading the pages to which it has been redirected -- any ideas on why not?
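For anyone wondering what "stripping out the unique elements" amounts to, here is a rough Python sketch of the same idea. It assumes the unique element is a query-string parameter called `sessionid`. That name, and both function names, are made up for illustration; the setup described above used mod_rewrite, and its actual rules are not shown here.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameter name; the real "unique elements" are not specified.
UNIQUE_PARAMS = {"sessionid"}

def strip_unique(url):
    """Return the URL with the per-visitor query parameters removed."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in UNIQUE_PARAMS]
    # Note: re-encoding the query may normalise unusual escapes.
    return urlunsplit(parts._replace(query=urlencode(kept)))

def redirect_for(url):
    """Return (301, target) if the URL needs cleaning, else None."""
    target = strip_unique(url)
    return (301, target) if target != url else None
```

Calling `redirect_for` on a URL that carries the parameter gives the status code and the cleaned URL to put in the `Location:` header; a URL that is already clean gets no redirect.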
and it's still indexing the wrong URLs.
I.e., Googlebot requests the URL www.otherdomain.tld and gets 301-redirected to www.correctdomain.tld. It indexes the page but then lists it as www.otherdomain.tld, which is not right. It should index it as www.correctdomain.tld and, if that domain already exists in the index, update the existing entry and add the PR from the old domain to the new one in some way, shape or form.
As a result of this incorrect operation, I have about 10 domain names or variants of domain names which are indexed with identical content, instead of one central domain name.
301s don't "send" any user-agent anywhere. The 301 response contains the new URL of the requested resource. It is up to the user-agent (in this case Googlebot) to request that new URL, but it is under no obligation to do so. In the specific case of Googlebot, it may be that since Googlebot receives the new URL in the 301 response, it doesn't necessarily want to crawl the page immediately. Google can and will list a page even if it doesn't crawl the page, but just finds a link to it. The downside is that the page is listed with only its URL: no title, no description. Some of you may want to keep an eye out for this.
Another possibility is that Googlebot is "trusting" you temporarily - assuming that the page at the new URL is in fact the same as the page at the URL it requested. The 'bot can always put the URL on a list and come back to check it later.
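To make the "put it on a list and come back later" idea concrete, here is a minimal Python sketch of a client that receives a 301, reads the new URL from the `Location` header, and defers the fetch instead of following it right away. This is only an illustration of what a user-agent is allowed to do. The function name, the return values and the queue are all invented, the headers are assumed to be a plain dict keyed exactly as `Location`, and none of it is a claim about how Googlebot is actually built.

```python
from collections import deque
from urllib.parse import urljoin

def handle_response(requested_url, status, headers, revisit_later):
    """Decide what a (hypothetical) crawler does with one response."""
    if status == 301:
        location = headers.get("Location")
        if location:
            # Location may be relative, so resolve it against the request URL.
            new_url = urljoin(requested_url, location)
            # Nothing obliges the client to fetch now; just remember the URL.
            revisit_later.append(new_url)
            return "deferred"
        return "ignored"  # a 301 without a Location gives nothing to follow
    return "fetched"

# Example: the redirect is noted, but no second request is made yet.
queue = deque()
outcome = handle_response(
    "http://www.otherdomain.tld/page.html",
    301,
    {"Location": "http://www.correctdomain.tld/page.html"},
    queue,
)
```

In the server logs this would look exactly like what was reported: a 301 served to the 'bot, with no follow-up request for the target until the queue is processed.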
If Googlebot - in its possibly-new incarnation - does what we used to call a deep-crawl, I suspect you will see it fetch the new URLs, not the old ones. Or it will fetch both, and then merge the listings to use the new URLs.
I am still watching the signs (and WebmasterWorld posts) that deepbot and freshbot have merged, with no personal conclusions yet.
I sure hope Google-things settle down soon...
Best,
Jim
Anyway, I'm just hoping it will take care of itself after some time. I doubt they'd just start ignoring a valid HTTP response.
From my experience with DeepFreshBot I tend to disagree.
DeepFreshBot currently does not fetch the URL given in the 'Location:' header of a 301 redirect. It simply deletes the previously listed page. It even deletes URLs without trying to fetch them, once it has recognized that one server redirects to another. It seems that's checked by fetching both versions of robots.txt and the index page (www and non-www) a few times simultaneously.
If a complete site, where most or all pages were indexed (by DeepBot) under domain.com, gets redirected from domain.com to www.domain.com, all domain.com URLs will disappear and DeepFreshBot does not fetch the pages from www.domain.com. Since the old bot disappeared, I've seen this strange behaviour on a couple of servers. Maybe it puts the target address on a list, but unfortunately that list is not the crawl schedule :(
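For reference, the kind of site-wide redirect I mean looks roughly like this, written as a small Python sketch rather than as server config. The function name is made up, and it assumes plain http and that www.domain.com is the one canonical host.

```python
# Assumed canonical host for the example; every other host name redirects to it.
CANONICAL_HOST = "www.domain.com"

def canonical_redirect(host, path):
    """Return (301, location) if the request host is not canonical, else None."""
    host = host.lower().split(":")[0]  # ignore case and any port suffix
    if host == CANONICAL_HOST:
        return None  # already on the canonical host, serve the page
    # Keep the path so each old URL maps to its www counterpart.
    return 301, "http://" + CANONICAL_HOST + path
```

So a request for domain.com/page.html is answered with a 301 whose `Location:` is http://www.domain.com/page.html, while requests to www.domain.com are served normally.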
Pages on the redirect target only get crawled and indexed when good links point to them, but Googlebot does not follow internal links on these pages. Submissions don't help: the www.domain.com URLs of dumped interior pages (previously ranked PR3 - PR4), submitted 6 weeks ago, haven't been fetched yet.
I'm not sure whether that's a bug or just a delay. Probably it's a feature that isn't ready yet.