Msg#: 4515325 posted 7:37 am on Nov 3, 2012 (gmt 0)
The hand-rolled CMS I have for entering news on my site has a re-edit page that previously pointed to an old format for the URLs (which I just corrected).
However, I just noticed today that I have about 98 pages in the index that are using the old news URL.
Here's the issue:
1) The old URL gets redirected to the new one by a 301. Live HTTP Headers shows the redirection.
2) The edit area is in a part of the site that bots should not be crawling due to a robots.txt directive.
How is Google storing the old URL style? Why would it be ignoring the redirect and storing the old version?
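For anyone wanting to confirm point 1 without a browser add-on, here is a minimal sketch in Python. It sends a HEAD request and reports the status code and Location header without following the redirect. The function name, host and path are placeholders I made up, not anything from the site in question.

```python
import http.client


def fetch_redirect(host: str, path: str, port: int = 80):
    """Return (status, Location header) for one request, without following redirects.

    http.client never follows redirects on its own, so a 301 shows up as a 301.
    """
    conn = http.client.HTTPConnection(host, port, timeout=10)
    try:
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


if __name__ == "__main__":
    # Hypothetical host and old-style path; substitute real values.
    print(fetch_redirect("www.example.com", "/old-news-url"))
```

A 301 with the new URL in Location is what you would hope to see for every old-style address.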
Msg#: 4515325 posted 8:16 am on Nov 3, 2012 (gmt 0)
Is it bringing up both? Two versions of each page? Or did you mean that the new correct URL is in a roboted-out area?
If Google can't get to the new version, it will keep serving up cached copies of the old version forever. No-crawl does not mean no-index.
If I've understood the situation correctly, you need to do two things. First, remove all those old URLs manually in GWT. (Fortunately you can do whole directories at once.) Then get rid of the robots.txt block and replace it with a meta noindex on the individual pages.
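Once the meta tag is in place, it is worth checking that it really ends up in the served HTML. A rough sketch using only the Python standard library follows; the class and function names are my own, and the sample markup is made up.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from any <meta name="robots" content="..."> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma separated, e.g. "noindex, nofollow".
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def has_noindex(html: str) -> bool:
    """True if the page carries a robots meta tag that includes noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    sample = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
    print(has_noindex(sample))
```

Keep in mind the tag only works if the crawler is allowed to fetch the page, which is why the robots.txt block has to go.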
Msg#: 4515325 posted 9:27 am on Nov 3, 2012 (gmt 0)
Google will request every URL it has ever seen, forever. Once it sees the redirect, the old URL will be delisted from the SERPs. Google will still occasionally request the old URL to check the current status, forever. Make sure that all links on the site point to the new URLs.
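A quick way to audit that last point is to scan the generated HTML for links that still use the old format. The sketch below is only illustrative: the thread never shows the actual URL formats, so the `/news.php?id=N` pattern is purely an assumption and needs to be swapped for the real one.

```python
import re
from html.parser import HTMLParser

# ASSUMPTION: hypothetical old-style URL pattern; replace with the site's real one.
OLD_STYLE = re.compile(r"^/news\.php\?id=\d+$")


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def find_old_links(html: str) -> list[str]:
    """Return every link in the page that still matches the old URL pattern."""
    collector = LinkCollector()
    collector.feed(html)
    return [href for href in collector.hrefs if OLD_STYLE.match(href)]


if __name__ == "__main__":
    page = '<a href="/news.php?id=7">more info</a> <a href="/news/7/">more info</a>'
    print(find_old_links(page))
```

Running this over every template, admin pages included, should come back empty once all links are fixed.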
Msg#: 4515325 posted 9:50 am on Nov 3, 2012 (gmt 0)
Sorry I wasn't clear in my post. Late night and a lack of coffee.
The edit page was the only place where the old-style URL was showing up (it was a "more info" link for each news story that would bring up the full article).
That edit page is in a subdir blocked by robots.txt and protected by HTTP auth. That's the only place those links showed up.
I haven't used those old-style URLs in the public areas in years. But Google was somehow indexing stories from even a few days ago via the incorrect "more info" link in the robots.txt-blocked area. The public-facing pages have a redirect that catches any of the old-style URLs with a 301. That's been in place for at least 10 years.
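To double-check that the robots.txt rule really covers the edit area, the standard library's robots.txt parser can be fed the file's text directly. This is just a sketch: the `/admin/` directory and the example URLs are assumptions, since the actual paths aren't given here.

```python
from urllib.robotparser import RobotFileParser

# ASSUMPTION: hypothetical robots.txt content; the real directory name is not given.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://www.example.com/admin/edit.php",
                "https://www.example.com/news/7/"):
        print(url, is_crawl_allowed(ROBOTS_TXT, "Googlebot", url))
```

This only shows what a well-behaved crawler is permitted to fetch. It says nothing about whether a blocked URL can still end up listed in the index.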
I'm thinking the only way Google could have been aware of those links is via the Google Analytics code that is in the main template and therefore even gets included on the admin pages.