301 Redirect is NOT removing pages from index

Forum Moderators: Robert Charlton & goodroi

zerillos

12:47 pm on Oct 18, 2011 (gmt 0)

I just 301 redirected a bunch of pages I wanted out of G's index to the home page. G spidered them, but did not remove the URLs from the index. Instead, now when I check the instant preview of the redirected URLs, they display the content of the home page.

I don't think this is the intended behavior of 301 redirects. Am I wrong? I thought G was supposed to remove the URLs from the index.

g1smd

8:13 pm on Dec 16, 2011 (gmt 0)

Redirecting a bunch of pages to another site is an entirely different matter to redirecting to the root of the same site.

Google does sometimes list URLs that now redirect. In days of old these would have been labelled as a supplemental result in green writing directly after the URL. Those URLs eventually drop out of the index. They are not treated as duplicate content because they do not directly return any content.

enigma1

9:54 am on Dec 17, 2011 (gmt 0)

Redirecting a bunch of pages to another site is an entirely different matter to redirecting to the root of the same site.

There is one definition for each redirect type; if you don't understand them, check the W3 spec. Whether you redirect internally or externally, it is exactly the same.

tedster

5:22 pm on Dec 17, 2011 (gmt 0)

@enigma1, internal and external redirects may use the same technologies, but the SEO effects can be quite different. Let's not muddy the waters of this discussion.

MikeNoLastName

10:43 pm on Dec 17, 2011 (gmt 0)

enigma1,
>Assuming the bot is aware of the 301 you should get nothing about these pages.<
That was my understanding.

>Otherwise the fact the page remains indexed for a long time implies the bot may have found a loophole with the way you did the redirect.<
My suspicion as well, but no one seems to be able to find a reason that SOME are there and others aren't. Or they returned after removal due to recent algo changes. ALL the most recently added ones seem to work fine.

The exact code I use in the .htaccess is as follows:
RewriteCond %{HTTP_HOST} !www.example.com
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

Redirect 301 /index.html http://www.example.net/
Redirect 301 /index.htm http://www.example.net/
Redirect 301 /abc.htm http://www.example.net/

(followed by a bunch of other 301s to various places)

If I were to "Redirect 301 / http://www.example.net/" that redirects EVERYTHING, no? So is there something else I need to be 301-ing?

> are all these combinations redirect with a 301 header to the exact page you want? Do they redirect to:
www.example.net/xyz.htm<

In all cases yes, the browser redirects properly. In the case of the original example.com examples, they are all redirected to www.example.net/. In the last test indicated it redirects to www.example.net/?p=1 which is still the same page.
In the case of the additional list of plain urls discovered later and mentioned in the last post, they are just simple redirects to copies of the original pages on the same or a different domain.

>You should use something to check the response headers; don't rely only on what the browser does.<

I've used fetch as googlebot and they all return appropriate 301's. The example.com gets 301'd to www.example.com (via canonical rewrite command), www.example.com 301-> www.example.net/
www.example.com/abc.htm 301-> www.example.net/

>And I have not seen inconsistencies with the google index once the bot fetches the 301 redirects.<

I've got a list of examples along with my log entries showing recent bot visits to them that do appear inconsistent.

> When you search google for links with the old site:
site:www.example.com how many pages do you see?<

About 100 remaining, not all have been moved yet. However, neither "example.com", "www.example.com" nor "www.example.com/abc.htm" show up when doing a "site:example.com" not even under the "additional listings similar to these" list. They only show up when you do a search on the urls by themselves (e.g. type "example.com/abc.htm" in the Google search box without the quotes). Then if you click on the old url listed by G, it takes you to the new page (as it should since it is redirected).

> Also have you setup the old site on gwt and checked for errors? Are there any? <
Absolutely, see above. No errors

> And you don't block the bot on the old site via robots.txt right? <
Nope! In fact until last week there WAS no robots.txt on the old site. I checked the logs: it was returning a 404 and pages were getting crawled regularly. This week it does have one, with a simple allow-all rule.

g1smd,
> Redirecting a bunch of pages to another site is an entirely different matter to redirecting to the root of the same site. <

It's only a few (major ones) we're having the primary problem with. Other urls in the same .htaccess (even one other redirected to the root of the new domain) have 301'd just fine and are completely gone from the index. But there are also other pages on both domains simply redirected to a renamed file on the same domain which are also having the problem. I can list several examples of each.

>Google does sometimes list URLs... [clip] ...Those URLs eventually drop out of the index.<

I would think a year of crawling the same pages, at the very least every few days, with consistent 301 returns should be long enough for even G to get the idea. I've even fetched as Googlebot and resubmitted them. Granted there are still a lot of links out there around the world to the old urls.

> They are not treated as duplicate content because they do not directly return any content. <

I think in this case they DO (both get counted as dups AND return content, at least as far as G is currently convinced) because when they are returned as results, they are always displayed with excerpts from the page they are redirected to.

[edited by: Robert_Charlton at 11:53 pm (utc) on Dec 26, 2011]

g1smd

11:33 pm on Dec 17, 2011 (gmt 0)

One problem there is that you are using directives from both mod_rewrite and mod_alias within the same configuration file. This can lead to unwanted and unintended operation.

Use only mod_rewrite for all of your redirects and order them from most specific (affects only specific pages) to most general (affects the most pages - with the www/non-www stuff last).

MikeNoLastName

2:33 am on Dec 18, 2011 (gmt 0)

OK, fine, I can see you're going to be adamant about the mod-rewrite and mod-alias combination thing being the cause (since G can do no wrong :-), so just for the benefit of the doubt, I'm willing to try anything at this point.

So you're suggesting to change the simple code above to precisely the following, nothing more, nothing less: (I derived much of this from various websites - Like I said, I speak a number of computer languages, but not apache rewrite... yet)

RewriteEngine On
# Redirect index requests
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.(php|html?)
RewriteRule ^(([^/]+/)*)index\.(php|html?) http://www.example.com/$1 [R=301,L]

RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

RewriteCond %{QUERY_STRING} ^$
RewriteRule ^index\.html$ http://www.example.net/? [R=301,NE,NC,L]

RewriteCond %{QUERY_STRING} ^$
RewriteRule ^index\.htm$ http://www.example.net/? [R=301,NE,NC,L]

RewriteCond %{QUERY_STRING} ^$
RewriteRule ^abc\.htm$ http://www.example.net/? [R=301,NE,NC,L]

RewriteCond %{QUERY_STRING} ^$
RewriteRule ^xyz\.htm$ http://www.example.com/wxyz.htm? [R=301,NE,NC,L]

(etc.) (the final xyz.htm one being an example for redirecting to a new page on the same domain)

I don't have to add any other tricky code or commands?
Note I changed my second rewrite rule. If you look back to the first time I sent it and Lucy mentioned it (which I had missed previously) my original version from many years ago had no ^'s, \'s or $'s. Could this have been the main issue? Note I also included your previously recommended index request redirect even though it really scares me to do so. But I figure things can't really get much worse at this point (well, yeah, I guess we could also lose our nice Yahoo and Bing rankings if things go awry...)

BTW, which is more correct
RewriteRule ^(([^/]+/)*)index\.(php|html?) http://www.example.com/$1 [R=301,L]
or
RewriteRule (([^/]+/)*)index\.(html?|php)$ http://www.example.com/$1 [R=301,L]
which g1smd wrote in a different forum a few months back? Note the ^ vs $ .

Thanks to ALL!

lucy24

3:41 am on Dec 18, 2011 (gmt 0)

^ is "must begin with"
$ is "must end with"

In this specific case, neither of the anchors is really necessary. By default, Regular Expressions start as soon as they can, and go on as long as they can.

The beginning of the pattern says "pick up any old stuff that might happen to come before the word 'index' followed by a dot." The opening anchor would not add any constraints.

The end of the pattern says "the 'index.' has to be followed by php, htm or html". The closing anchor says that there can't be anything after the extension-- except query strings, which don't count. So unless your users are making up extensions (phpxyx, htmlnm), the closing anchor also makes no difference. And why punish a user just because their cat happened to walk across the keyboard just as they were hitting Return?

An escaped dot \. means "one literal dot, nothing else". An unescaped dot . means "any single character of any kind whatsoever". Again, in this specific context it is not likely to make a difference, because how often do you get requests for "indexbphp" or "index,php"? But it can be crucial if your pattern involves something like an IP address: 1.3 and 1\.3 are very different things.
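To make the anchor and dot behaviour concrete, here is a small sketch showing three alternative versions of one rule; only one of them would be used at a time. It assumes an .htaccess file in the document root of the old site. The target is the example.net home page used elsewhere in this thread, and the sample request paths in the comments are made up for illustration.

```apache
RewriteEngine On

# Alternative A: no anchors, unescaped dot.
# Matches index.php, but also indexXphp, index.php4 and dir/index.html.bak.
RewriteRule index.(php|html?) http://www.example.net/ [R=301,L]

# Alternative B: escaped dot, closing anchor only.
# Matches index.php, index.htm, index.html and dir/index.html,
# but no longer indexXphp or index.php4.
RewriteRule index\.(php|html?)$ http://www.example.net/ [R=301,L]

# Alternative C: escaped dot, both anchors.
# Matches only index.php, index.htm and index.html in the directory
# that holds this .htaccess file.
RewriteRule ^index\.(php|html?)$ http://www.example.net/ [R=301,L]
```

Which version is appropriate depends on how strict the match needs to be; for ordinary index-file requests all three behave the same.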

The rule

RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

should go at the very end of your rewrites, where it will only pick up those requests that have not already been redirected via other rules.

I don't understand these.

RewriteCond %{QUERY_STRING} ^$
RewriteRule ^index\.html$ http://www.example.net/? [R=301,NE,NC,L]

RewriteCond %{QUERY_STRING} ^$
RewriteRule ^index\.htm$ http://www.example.net/? [R=301,NE,NC,L]

They both seem to be saying "if there is no query string, redirect requests for index.html? to the bare directory / without query string". But at this point, the only index-file requests would come from an internal rewrite, since you've already redirected the external requests. So when do the users finally get to go to the index file?

RewriteCond %{QUERY_STRING} ^$
RewriteRule ^abc\.htm$ http://www.example.net/? [R=301,NE,NC,L]

What's the difference between this and the preceding? If everyone is being redirected to the index file, why are they in separate rules?

MikeNoLastName

7:49 am on Dec 18, 2011 (gmt 0)

Hi Lucy24,
Beats me :). That's why I keep saying this all seems so much more complicated than it needs to be. I'm just trying to comply with what everyone on this thread has been saying. All this time, all I'm trying to do is 301 redirect any request for the home page of example.com, including any typical combinations of index.htm, index.html to the home page of example.net. (note old domain-> newdomain) Plus also redirect one other page example.com/abc.htm to the home page of example.net as well, while leaving all other pages on example.com responding from example.com as usual. I thought I had it working perfectly well with the set of 3 Redirect 301's shown a few posts back, since it worked fine with a browser and other search engines, but other persons in the thread seem to indicate that doing it that way is the reason causing G to phantom index www.example.com and example.com as observed by me.

Based on all the searches I've done on the net and the responses I've seen, no one can seem to give me a direct answer, with a distinct line(s) of code to do it.

g1smd said to rewrite the Redirects with Rewrites instead, but left no example. So I searched and got the last set of code from what I thought was a nifty website (search for ".HtAccess 301 Redirect Generator Tool") but it sounds like you're saying they're wrong too?

Frustrated :-|

Thanks!

g1smd

8:48 am on Dec 18, 2011 (gmt 0)

The rule order is incorrect. List from most specific to most general, with the non-www/www redirect last.

The only difference between rules with $ and not with $ is that the rule without $ would also redirect requests for index.php4 whereas the other rule would not.

Apologies for leaving no example. I didn't notice which forum this thread was in. The Apache forum contains several thousand threads with examples as this is a question that gets asked almost every day.
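For readers who want the ordering advice in concrete form, the following is a minimal sketch only, not a tested drop-in. It assumes the .htaccess file lives in the document root of www.example.com, that mod_rewrite is available, and that all mod_alias Redirect lines have been removed. The file names and targets are the ones used as examples earlier in this thread.

```apache
RewriteEngine On

# 1. Most specific first: individual pages that have moved.
RewriteRule ^abc\.htm$ http://www.example.net/ [R=301,L]
RewriteRule ^xyz\.htm$ http://www.example.com/wxyz.htm [R=301,L]

# 2. The home page, requested as "/", "/index.htm" or "/index.html".
RewriteRule ^(index\.html?)?$ http://www.example.net/ [R=301,L]

# 3. Most general last: force the canonical host name for every
#    request that none of the rules above has already redirected.
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
```

With this order, a request for example.com/abc.htm reaches www.example.net/ in a single hop rather than first bouncing through the www host. No trailing ? is appended to the targets here, so any query string on the request is carried over to the new URL.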

tedster

5:34 pm on Dec 18, 2011 (gmt 0)

Since not all websites use an Apache server, these technical details are best discussed in our Apache Forum - or really, researched there, since, as g1smd mentioned, the answers are all through that forum.

Apache Web Server [webmasterworld.com]
Apache Forum Library [webmasterworld.com]

MikeNoLastName

7:28 am on Dec 20, 2011 (gmt 0)

Thanks for the links. Still doesn't give credible evidence that THIS is the reason G is only randomly removing 301 redirected pages. :)

enigma1

2:03 pm on Dec 21, 2011 (gmt 0)

@enigma1, internal and external redirects may use the same technologies, but the SEO effects can be quite different.

@Tedster, the effects can be disastrous for SEO in both cases if the setup is wrong. In the case of the OP, he redirects the root page of the domain to a new one along with some pages, but not all. So the end result is that you now have a domain without a root page, but with different groups of pages interlinked.

That's what's happening and that's what the various .htaccess rules posted above really do. They create a mess. This is not a redirect of one domain to another.

MikeNoLastName

7:20 pm on Dec 22, 2011 (gmt 0)

Thank you Enigma1 , I definitely plan on making significant changes (which were already in the works actually) to benefit the overall structure. I now realize the apparent cause of some of the PR flow issues (i.e. G does not like traversing multiple domains for a single product site).

HOWEVER, and the point of this thread is, I do not feel comfortable moving any existing pages over to a new domain until it is proven that Google is going to handle the redirects correctly, thus avoiding duplication penalties and properly passing PR. Regardless of the site structure, G should properly handle a simple 301 redirect especially within the same domain without causing duplicate entries in the index. I have produced multiple concrete examples of this in the Google index, to multiple persons to confirm. Some between domains, some on the same domain, some on domains unrelated to the above situation. I have even apparently found other persons with identical examples (the original poster on this thread for one). So it is evidently not restricted to just me. It is not restricted to home pages, it is not restricted to intra-domain redirects. However, everyone with an opinion seems to assume it is some .htaccess issue or site structure issue and not instead a problem with the only common factor: 301's and G.

Plain and simply stated: FACT: SOME 301 redirected pages are removed from the G "site:example.com" search (so G apparently IS indeed well aware of the 301), but remain in the index with a duplicate of the new page info, and they can be found when you search on the url: "example.com/abc.htm" of the old page. The Normal handling for MOST 301 redirected pages is that the old url completely vanishes from the index and cannot be found even if you search on the old url. Whether they cause duplication penalties or loss of PR is anecdotal and still wide open to conjecture and/or proof.

[edited by: Robert_Charlton at 11:55 pm (utc) on Dec 26, 2011]

Robert Charlton

6:25 am on Dec 27, 2011 (gmt 0)

My emphasis added...

FACT: SOME 301 redirected pages are removed from the G "site:example.com" search (so G apparently IS indeed well aware of the 301), but remain in the index with a duplicate of the new page info, and they can be found when you search on the url: "example.com/abc.htm" of the old page.

MikeNoLastName - After I pondered this for a while, it hit me that you are most likely describing what was observed and discussed in this thread....

Domain name replaced in SERPS with alias domain name
http://www.webmasterworld.com/google/4327200.htm [webmasterworld.com]

Note tedster's response to the OP in the second post...

Why would Google replace the original URL in the SERPS?

For the basic reason that it actually was the query term. My assumption is that this is an algorithmic change to help improve click-through.

We know Google rewrites titles and description snippets based on the query - this is the first I've heard a report about rewriting a domain name.

Google's behavior changed over the duration of posts in the thread, with many fewer 301 redirected urls displayed later than initially, but we couldn't find any pattern or consistency in what was going on. In my last comment on the thread, I noted that...

In some cases, Google is still displaying redirected old domains.

You're going to need to read the whole discussion to form your own conclusions... but I don't believe that Google is looking at the old url as dupe content.

If you can confirm the 301 via a server header checker, or with the FF Live HTTP Headers plug-in, or via a tool like URI Valet (described in our current Favorite SEO Tools [webmasterworld.com] thread), then I don't think you have any dupe content worries.

And again, to answer your question phrased a different way...

Are these being conceived by G as duplication errors just because once upon a time I did a 301 redirect? They also seem to appear to NOT be passing PR on to the destination page.

No. In situations where this had happened to redirects on sites I'd optimized, I'm certain the PR had been passed.

All that said, I would fix the rule order as g1smd suggests, and untangle what's apparently a ball of spaghetti with some of your structural issues.

Robert Charlton

7:17 pm on Dec 27, 2011 (gmt 0)

Also, have you checked the code on your redirected pages to make sure that you're not continuing to link to the old domain in your html?

MikeNoLastName

11:36 pm on Dec 28, 2011 (gmt 0)

Hi Robert,
As far as the past thread you mentioned goes, I DO think the two are very closely related, if not the same phenomenon described in different terms/from a different perspective. From their perspective they described it as a new listing with an old URL, whereas from my perspective, since the old url was in fact there first, I described it as an OLD link with a new snippet and title - same exact thing I think. The consensus, at least at that time, and judging from your latest remarks in this thread, seemed to be that this was NOT causing duplication penalties. HOWEVER, that thread was last updated in June. At that time duplicated pages pretty much only affected themselves. Since then we have had another major iteration in Panda, which corresponded with precisely when we started to notice signs of traditional duplication penalties on the new domain being 301 redirected to. G has stated clearly that Panda has to do with duplication. It COULD be possible that with the last major algo change G started doing more with these internally saved links, either intentionally or unintentionally, in which case everyone may want to reassess their specific situations on sites with these issues to see if they STILL do not correlate with duplication issues.

If this were so, and it IS NOW causing duplication penalties ON THE POINTED TO DOMAIN PAGES, it WOULD bring to mind, as touched upon in the other thread, that a possibly notorious individual would simply need to 301 redirect pages from his own site to a competitor's main pages and get them into this phantom mode, causing widespread duplication penalties under the new Panda, and the target would never even have a clue where to look. Just a lot of conjecture of course. I would gladly attribute our ranking issues coincident with the algo change strictly to being pandalized based solely on content (at least that is something I could work on changing) except for the obvious traditional duplication penalty evidence I see when doing a "site:" search on the new domain and the fact that literally overnight key pages (which were redirected to) went from #5 to #250 in the index just like when one WAS duplicated in the index in the past (more often than not, in those cases, due to canonical issues or other strange but fixable indexing by G).

Yes, of course I AM still continuing to link to the old domain, as long as there are still applicable pages on the old domain. So what? I agree it may be bad style and not the greatest for PR flow, and major moves take time, but it SHOULDN'T affect in the least how a search engine INDEXES a standardized 301 redirect, should it? So, now there is an indexing penalty for simply linking out to another site? If I want to move a few pages it means I now have to shut down the whole old domain permanently? I think other credible witnesses mentioned on the old thread that they saw them too on their own sites, so mine are not unique. I would move them all over ASAP, if I was confident G was not going to create more phantom links and cause more potential dupe issues.

In fact there will always be SOME pages on the old domain, as it is our official company domain which created the content on the new domain and others, the older pages of which are now being split off to the new domain, from the old domain. I can't imagine we are the first site to split out a large section of a prior site into its own domain while keeping the old one intact. Linking to the old one is unavoidable (both internally and externally): copyright and contact-us links, just for example, probably shouldn't be duplicated on another domain. The sites have been intertwined for 15 years, and working great in G and other SEs all of that time. The site hasn't changed much (until AFTER noticing these phantoms), Bing and Y rankings haven't changed noticeably over the last year, but apparently G has and it does not seem to be acting on par with B & Y or even the average browser.

It SHOULD NOT MATTER that the old pages are still being linked. They are redirected! I THINK that is why redirection was invented. Deal with it like a user's browser does. They are gone. Follow them, but stop indexing them; there technically is nothing there. Get them OUT of the index!

I recently (12/29) did an experiment with one non-critical page, by removing the 301 redir and creating a dummy page with just a line of explanation and a link to the new equivalent page to take the place of the redir. I also placed a NOINDEX in the header to try and permanently remove it from the index. But G does not even seem to realize that the phantomed page has been replaced with a new file, un-redirected, and NOINDEXed in its own header, even though it has crawled it and is currently PREVIEWING the new page. However it still shows a cache of the redirected page supposedly from 12/23, which can't possibly be if it were working properly. So it now even seems impossible to replace or remove these phantoms. Latest update: this page has now lost its previous (CORRECT) preview and now claims there is NO preview available. It almost acts like half of G knows it's there and the other half doesn't know or care. But Google SHOULD care IF it is a potential attack issue.

It is also unavoidable that other external backlinks will be linking to the old domain long into the future. If there WEREN'T backlinks still linking to them, 301s wouldn't really be necessary, now would they? We'd just 404 them all since no user would find them anymore. Should G be penalizing sites because they are old and collected backlinks, then had/decided to move? I can say MANY, but not all, of our phantomed pages are backlinked from other websites.

Finally, in addition, I now have examples of NEWLY redirected pages (14 days ago) to another page on the SAME domain, with guaranteed absolutely NO internal links to them anywhere, which have just recently been phantomed. They were only ever linked from one page on our entire site to begin with and that link is now gone (G has recrawled that one page also, since then). The files are physically gone as well, so they are not showing up in the sitemap xml either. There were about 60 redirected together; 50+ of them were de-indexed just fine and about 6 were phantomed. So I conclusively do not think the phenomenon has much if anything to do with gradual relocation from an old domain and/or still internally linking the old domain/pages.

I shall be continuing to experiment with these and offer any additional observations that I can.

MikeNoLastName

9:49 am on Jan 27, 2012 (gmt 0)

Just a quick follow-up. We have seen vast improvements in the last week in SERPs... as if G suddenly fixed at least part of this issue.

Primarily, the aforementioned example.com URL listing which had been 301'd long ago and re-indexed, suddenly disappeared from the index and the very next day the redirected-to page shot up 50% in the SERPS. Duplication penalty? You tell me. I know one when I see one.

Also the "site:" search for the NEW "redirected-to" domain started showing far more correct results with the home page finally at the top of the first page. Not perfect yet, but I would expect it to take a few re-iterations of crawling and re-indexing to get the correct hierarchy back in place now that the duplication penalty has been lifted from the home page. SERPs for prior keywords continue to improve daily.

We'll never know if it was a bug; G is a black box that answers to no one, not even governments.

MikeNoLastName

11:30 pm on Apr 20, 2012 (gmt 0)

FINALLY! Somewhere between 4/16/12 and 4/20/12 Google realized it was an issue and FIXED this bug! Yippee! Every single remaining phantom index (for us at least) is finally gone, gone, gone all at once! Searching for the redirected url now shows the redirected-to page as the first result (as it should) and in many cases other pages linking to the old redirected page thereafter, as reasonable, and then similar pages on the old domain following that.
Studying ramifications now. Thanks G!
