|Googlebot Not Following 301s|
Googlebot is not loading the URLs it is redirected to
I inherited a situation where every URL served by the site was unique, based on the timestamp and on which server in the pool handled the request -- so Google's index contained many duplicate copies of the same pages.
I used mod_rewrite to set up 301 redirects from all of those URLs to the non-unique versions of the same pages, stripping out the unique elements. Sure enough, I see lots of 301 redirects being served to Googlebot in the logs. But I don't see it then reloading the pages it has been redirected to -- any ideas why not?
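For reference, the kind of rule involved might look like this -- a sketch only, assuming the unique element is a query-string parameter (the name `sid` is made up, not from the original setup):

```apache
# Hypothetical .htaccess sketch: if the query string carries a
# per-request "sid" element, 301-redirect to the same path with the
# query string stripped. The parameter name "sid" is an assumption.
RewriteEngine On
RewriteCond %{QUERY_STRING} (^|&)sid=
RewriteRule ^(.*)$ /$1? [R=301,L]
```

The trailing `?` in the substitution is what discards the query string; `R=301` makes the redirect permanent rather than the default 302.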
I noticed the same thing over a period of months. Somehow, though, Google managed to reindex every page that I had redirected.
I can't work it out. From close watching of my logs, I saw that the bot never followed the 301s for sub-pages, but as I say, the new URLs are now in the index.
I see the same on several sites. DeepFreshBot just deletes the old URL but never fetches the new pages. Do a site search -- there are lots of posts on this topic.
I am experiencing something similar. I have watched Googlebot in my log files, and it often hits a 301 and then doesn't go fetch the page the 301 is sending it to --
and it's still indexing the wrong URLs.
i.e. Googlebot requests www.otherdomain.tld, gets 301 redirected to www.correctdomain.tld, indexes the page, and then lists it as www.otherdomain.tld. That's not right: it should index it as www.correctdomain.tld, and if that URL already exists in the index, update the existing entry and pass the PR from the old domain to the new one in some way, shape or form.
As a result of this incorrect operation, I have about 10 domain names or variants of domain names which are indexed with identical content, instead of one central domain name.
Could this be an explanation for the loss of backlinks? I was under the impression that, over time, Google would give credit for backlinks through a proper 301 redirect.
Just a note and some comments...
301's don't "send" any user-agent anywhere. The 301 response contains the new URL of the requested resource. It is up to the user-agent (in this case Googlebot) to request that new URL, but it is under no obligation to do so. In the specific case of Googlebot, it may be that since Googlebot receives the new URL in the 301 response, it doesn't necessarily want to crawl the page immediately. Google can and will list a page even if it doesn't crawl the page, but just finds a link to it. The downside is that the page is listed with only its URL, no title, no description. Some of you may want to keep an eye out for this.
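To see what that means on the wire, here is a minimal, self-contained sketch (local server, made-up paths `/old` and `/new`): the 301 response only advertises the new URL in its Location header, and nothing in HTTP obliges the client to go fetch it.

```python
# Local demo of 301 semantics: the server only *offers* the new
# location; the client decides whether to request it.
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class Redirector(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/old":
            self.send_response(301)
            self.send_header("Location", "/new")
            self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"canonical page")

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Redirector)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# A raw client sees only the 301 status and its Location header --
# nothing forces it to fetch /new, which is exactly Googlebot's position.
conn = http.client.HTTPConnection("127.0.0.1", port)
conn.request("GET", "/old")
resp = conn.getresponse()
print(resp.status, resp.getheader("Location"))  # 301 /new
server.shutdown()
```

A browser (or `urllib.request`, which follows redirects by default) would turn right around and request `/new`; a crawler is free to queue it for later instead.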
Another possibility is that Googlebot is "trusting" you temporarily - assuming that the page at the new URL is in fact the same as the page at the URL it requested. The 'bot can always put the URL on a list and come back to check it later.
If Googlebot - in its possibly-new incarnation - does what we used to call a deep-crawl, I suspect you will see it fetch the new URLs, not the old ones. Or it will fetch both, and then merge the listings to use the new URLs.
I am still watching the signs (and WebmasterWorld posts) that deepbot and freshbot have merged, with no personal conclusions, yet.
I sure hope Google-things settle down soon...
I tried to clean up some URL confusion with 301s and we just disappeared. We had most links going to www, then when one link appeared without www, we started disappearing in places, so I decided to try to fix it. I fixed it alright.
Anyway, I'm just hoping it will take care of itself after some time. I doubt they'd just start ignoring a valid HTTP response.
>The 'bot can always put the URL on a list and come back to check it later.<
From my experience with DeepFreshBot I tend to disagree.
DeepFreshBot currently does not fetch the URL given in the 'Location:' header of a 301 redirect. It simply deletes the previously listed page. It even deletes URLs without trying to fetch them, once it has recognized that one server is redirected to another. That seems to be checked by fetching both versions (www and non-www) of robots.txt and the index page a few times simultaneously.
If a complete site, where most or all pages were indexed (by DeepBot) under domain.com, gets redirected from domain.com to www.domain.com, all the domain.com URLs disappear and DeepFreshBot does not fetch the pages from www.domain.com. Since the old bot disappeared, I've seen this strange behaviour on a couple of servers. Maybe it puts the target address on a list, but unfortunately that list isn't the crawl schedule :(
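That www/non-www consistency check can be simulated locally -- a sketch of the assumed behaviour only, with example.com standing in for a real domain: the server canonicalizes on the Host header, 301-redirecting bare-domain requests to the www hostname.

```python
# Local sketch of a www/non-www canonicalization check. A crawler
# probing both hostnames would see one answer 301 and the other 200.
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class CanonicalHost(BaseHTTPRequestHandler):
    def do_GET(self):
        host = self.headers.get("Host", "").split(":")[0]
        if not host.startswith("www."):
            # Bare domain: redirect to the www version of the same path.
            self.send_response(301)
            self.send_header("Location", f"http://www.{host}{self.path}")
            self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), CanonicalHost)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

def probe(host):
    # Fetch /robots.txt while presenting the given Host header, the way
    # a crawler might compare the www and non-www versions of a site.
    conn = http.client.HTTPConnection("127.0.0.1", port)
    conn.request("GET", "/robots.txt", headers={"Host": host})
    resp = conn.getresponse()
    return resp.status, resp.getheader("Location")

r1 = probe("example.com")      # (301, 'http://www.example.com/robots.txt')
r2 = probe("www.example.com")  # (200, None)
print(r1)
print(r2)
server.shutdown()
```

Seeing both hostnames resolve to the same content, with one consistently redirecting to the other, is enough for a crawler to conclude the two are one site -- which matches the URL-dropping behaviour described above.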
Pages on the redirect target only get crawled and indexed when good links point to them, and even then Googlebot does not follow the internal links on those pages. Submissions don't help either: the www.domain.com URLs of dumped interior pages (previously ranked PR3 - PR4), submitted 6 weeks ago, still haven't been fetched.
I'm not sure whether that's a bug or just a delay. Probably it's a feature that isn't ready yet.