enigma1,
>Assuming the bot is aware of the 301 you should get nothing about these pages.<
That was my understanding.
>Otherwise the fact the page remains indexed for a long time implies the bot may have found a loophole in the way you did the redirect.<
My suspicion as well, but no one seems to be able to find a reason that SOME are there and others aren't. Or perhaps they returned after removal due to recent algo changes. ALL the most recently added ones seem to work fine.
The exact code I use in the .htaccess is as follows:
RewriteCond %{HTTP_HOST} !www.example.com
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
Redirect 301 /index.html http://www.example.net/
Redirect 301 /index.htm http://www.example.net/
Redirect 301 /abc.htm http://www.example.net/
(followed by a bunch of other 301s to various places)
If I were to "Redirect 301 / http://www.example.net/", that redirects EVERYTHING, no? So is there something else I need to be 301-ing?
> Do all these combinations redirect with a 301 header to the exact page you want? Do they redirect to:
www.example.net/xyz.htm<
In all cases yes, the browser redirects properly. In the case of the original example.com examples, they are all redirected to www.example.net/. In the last test indicated it redirects to www.example.net/?p=1 which is still the same page.
As for the additional plain URLs discovered later and mentioned in the last post, they are just simple redirects to copies of the original pages on the same or a different domain.
>You should use something to check the response headers; don't rely only on what the browser does.<
I've used Fetch as Googlebot and they all return appropriate 301s. The example.com gets 301'd to www.example.com (via the canonical rewrite rule), www.example.com 301-> www.example.net/
www.example.com/abc.htm 301-> www.example.net/
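For anyone who wants to double-check the chain outside of Fetch as Googlebot, here is a rough Python sketch of the kind of header check suggested above. It sends HEAD requests without following redirects automatically and records the status code of each hop. The function names (head, trace_redirects), the User-Agent string and the hop limit are my own illustrative choices, not part of any tool mentioned in this thread.

```python
import http.client
from urllib.parse import urlsplit, urljoin

REDIRECT_CODES = (301, 302, 303, 307, 308)


def head(url):
    """Send one HEAD request, without following redirects.

    Returns (status_code, Location header or None).
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        # The User-Agent value is arbitrary (an assumption for this sketch).
        conn.request("HEAD", path, headers={"User-Agent": "redirect-check/0.1"})
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def trace_redirects(url, fetch=head, max_hops=10):
    """Follow Location headers hop by hop; return a list of (url, status)."""
    hops = []
    for _ in range(max_hops):
        status, location = fetch(url)
        hops.append((url, status))
        if status not in REDIRECT_CODES or not location:
            break
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    return hops
```

Calling trace_redirects() on an old URL (with the real hostname substituted) and printing the result gives one (url, status) pair per hop, so a 302 or a 200 hiding anywhere in the chain is easy to spot. Some servers answer HEAD differently from GET, so treat the output as a first check rather than proof.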
>And I have not seen inconsistencies with the google index once the bot fetches the 301 redirects.<
I've got a list of examples along with my log entries showing recent bot visits to them that do appear inconsistent.
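A minimal sketch of how such log entries can be pulled out, assuming the server writes the standard Apache combined log format (an assumption; the actual log format isn't shown in this thread). The regex, the function name googlebot_hits and the plain "Googlebot" substring match on the user agent are illustrative only, and a substring match does not verify that the visitor really is Google.

```python
import re

# Assumed layout: Apache "combined" log format, i.e.
# host ident user [time] "METHOD path PROTO" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"'
)


def googlebot_hits(lines, status=None):
    """Return (timestamp, path, status) for requests whose user agent
    mentions Googlebot. If status is given, keep only that status code."""
    hits = []
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that don't fit the assumed format
        _ip, when, _method, path, code, agent = m.groups()
        if "Googlebot" not in agent:
            continue
        if status is not None and int(code) != status:
            continue
        hits.append((when, path, int(code)))
    return hits
```

Feeding it the lines of an access log, e.g. googlebot_hits(open("access.log"), status=301) where access.log is a placeholder filename, lists each bot request for a redirected URL with its timestamp, which can then be compared against what still shows up in the index.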
> When you search google for links with the old site:
site:www.example.com how many pages do you see?<
About 100 remaining; not all have been moved yet. However, neither "example.com", "www.example.com" nor "www.example.com/abc.htm" shows up in a "site:example.com" search, not even under the "additional listings similar to these" list. They only show up when you search on the urls by themselves (e.g. type "example.com/abc.htm" in the Google search box without the quotes). Then if you click on the old url listed by G, it takes you to the new page (as it should, since it is redirected).
> Also have you setup the old site on gwt and checked for errors? Are there any? <
Absolutely, see above. No errors.
> And you don't block the bot on the old site via robots.txt, right? <
Nope! In fact, until last week there WAS no robots.txt on the old site. I checked the logs: requests for it were returning a 404 and pages were getting crawled regularly. As of this week it does have one, with a simple allow-all rule.
g1smd,
> Redirecting a bunch of pages to another site is an entirely different matter to redirecting to the root of the same site. <
It's only a few (major ones) we're having the primary problem with. Other urls in the same .htaccess (even one other redirected to the root of the new domain) have 301'd just fine and are completely gone from the index. But there are also other pages on both domains simply redirected to a renamed file on the same domain which are also having the problem. I can list several examples of each.
>Google does sometimes list URLs... [clip] ...Those URLs eventually drop out of the index.<
I would think a year of crawling the same pages, at the very least every few days, with consistent 301 returns should be long enough for even G to get the idea. I've even fetched as Googlebot and resubmitted them. Granted there are still a lot of links out there around the world to the old urls.
> They are not treated as duplicate content because they do not directly return any content. <
I think in this case they DO (both get counted as dups AND return content, at least as far as G is currently convinced), because when they are returned as results, they are always displayed with excerpts from the page they are redirected to.
[edited by: Robert_Charlton at 11:53 pm (utc) on Dec 26, 2011]