We have a client that just had a complete site redesign. Unfortunately, the developers who built the site don't really understand the problems that simple mistakes can cause when indexing a new site with a new URL structure. One of the issues is that the whole staging site was left open to crawlers. Not sure how, but it resulted in over 500 duplicate pages on the staging. subdomain being indexed. (For clarity, those are NOT on the same domain as the live site.) Last month we requested that a NOINDEX tag be placed on those pages, which they have done. However, this morning it seems there are over 600 staging. pages indexed.
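For anyone wanting to double-check their own setup, here is a minimal sketch of how you might verify the NOINDEX is actually being served, either as a meta tag or an X-Robots-Tag header. The URLs are placeholders, not the client's real staging pages, and the meta-tag check is deliberately crude:

# Rough sanity check that staging pages serve a noindex directive,
# either via the X-Robots-Tag response header or a meta robots tag.
# The URLs below are placeholders for real staging pages.
import requests

urls = [
    "https://staging.stagingdomain.com/",           # placeholder
    "https://staging.stagingdomain.com/some-page",  # placeholder
]

for url in urls:
    resp = requests.get(url, timeout=10)
    header = resp.headers.get("X-Robots-Tag", "")
    has_header_noindex = "noindex" in header.lower()
    # Crude substring check; a real audit would parse the HTML properly.
    body = resp.text.lower()
    has_meta_noindex = 'name="robots"' in body and "noindex" in body
    print(url, "| header noindex:", has_header_noindex, "| meta noindex:", has_meta_noindex)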
Now, I understand the count can go up a little before it starts to decline, but the strange thing is that the cached pages of the staging. domain actually show the correct/real URL, many with very recent cache dates well after we added the NOINDEX tag. I found a few that were recrawled yesterday but still show up in the index for a site:staging.stagingdomain.com query (although the cached version shows the correct URL).
Is this just Google being Google, or is there anything else I may have missed?
P.S. I checked robots.txt on the staging domain and it allows robots through (which it needs to, otherwise Google could never recrawl the pages and see the NOINDEX tag).
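For what it's worth, a quick way to confirm robots.txt really does let Googlebot through is Python's standard-library robot parser; again, the host and path below are placeholders:

# Confirm robots.txt on the staging host permits Googlebot, since a
# page blocked by robots.txt can never be recrawled and dropped.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://staging.stagingdomain.com/robots.txt")  # placeholder host
rp.read()
print(rp.can_fetch("Googlebot", "https://staging.stagingdomain.com/some-page"))  # placeholder path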
Any suggestions?