
How to remove old URLs from the index?

duplicate pages after mod_rewrite switch


Dolemite

7:25 am on Apr 11, 2003 (gmt 0)

10+ Year Member



I started using mod_rewrite last month on a site that went live in early March, and now checking allinurl:site at www-ex shows that both the rewritten URLs and the original URLs are in the Google index. The rewritten URLs all rank above the originals, and I'm not seeing any of the originals in the SERPs...just rewritten URLs. There are no links anywhere to the original URLs, by the way.

My concern is that I want the rewritten URLs to stay in the index, and I definitely don't want the duplicate content filter to kick in and leave only the original pages (since they're older). I don't really want any visitors to get 404s, but based on the SERPs that doesn't appear to be a danger. My plan is to just remove the files for the original URLs and see if I get any 404s, probably with a custom 404 page that links to the homepage just in case. Is there any danger in doing this? I'm a little concerned that the rewritten pages might be from the freshbot and could disappear at some point...or are all current www-ex results from the last deepcrawl? I suppose I should check my logs to see if the deepcrawl ever hit my rewritten URLs. When the deepcrawl comes looking for the original pages again, are there any problems with them not being there, or will it just not find them since there aren't any links to them?
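The custom 404 part is a one-liner in .htaccess, assuming an Apache server (the filename here is just a placeholder, not an actual path from this site):

```apache
# Serve a custom error page for missing URLs.
# /notfound.html is a hypothetical path; it should link back to the
# homepage so visitors who hit a dead URL aren't stranded.
ErrorDocument 404 /notfound.html
```

Note that the path must be local (starting with "/"); if a full http:// URL is given instead, Apache redirects to it and the visitor sees a 302 rather than a true 404 status.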

I'm almost embarrassed to ask this since it seems a bit unethical, but would there be any advantage to keeping the original pages live, say to direct PR back to my homepage? I could easily modify the pages so they don't duplicate content from any other page on my site. I really wouldn't do this, since it's just too spammy and I don't care to risk a penalty, but I'm still curious whether it's a vulnerability in the algorithm. The thing is, the original pages themselves don't have any links pointing to them, so after the next deepcrawl they might get thrown out, or at least they'd be PR0. So I guess I answered my own question there.

Dolemite

8:49 am on Apr 11, 2003 (gmt 0)

10+ Year Member



OK, this is not cool. www-ex just dropped all my new, rewritten pages and still has the originals. Checking my logs, I can see that the deepcrawler has only ever touched my homepage, though freshbot has gone deeper and crawled the rewritten pages. The old pages that are now listed apparently got in only as links harvested from my homepage; they were never crawled themselves. This explains why the allinurl:site results show no "page excerpt" for these pages.

Dolemite

6:25 am on Apr 12, 2003 (gmt 0)

10+ Year Member



So it seems my homepage and my old pages are in the permanent index, the rest were just freshbotted and are now out of any index.

jdMorgan

6:46 am on Apr 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Dolemite,

That's probably why you were seeing both. The old URLs were in the permanent index, and the new ones were freshbotted.

If the rewrite is a 301 or a transparent redirect, there is no need to keep the old files. As far as HTTP (web) access is concerned, they no longer exist. They exist only in the file system of your server, and have essentially been "disconnected" from any access by URL.
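A minimal sketch of such a 301 in .htaccess, with entirely made-up URL patterns (not the actual rules from this site), in case it helps:

```apache
RewriteEngine On
# Hypothetical example: send old dynamic URLs (page.php?id=123)
# to the new static-looking ones (/articles/123.html) with a 301.
RewriteCond %{QUERY_STRING} ^id=([0-9]+)$
RewriteRule ^page\.php$ /articles/%1.html? [R=301,L]
```

The %1 picks up the id captured in the RewriteCond, and the trailing "?" on the target strips the old query string from the redirect.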

If the above leaves you still worried, you might want to post which kind of redirect you used - it would help keep discussion on-track.

HTH,
Jim

Dolemite

6:40 am on Apr 13, 2003 (gmt 0)

10+ Year Member



I actually just deleted the files the old URLs referred to. The mod_rewrite setup refers to an entirely new set of files, so attempts to spider or click through to those URLs will produce 404s. The fact is, I don't think those URLs will come up for any searches except an allinurl search, since the deepcrawl never directly picked them up; it just knew the URLs existed because it pulled them from my homepage. I guess I could do a 301 redirect to my new URLs, but I don't think it would make much difference. If I did that, I assume Google would replace the old URLs in the index with the new ones it finds via the redirect, right?

jdMorgan

3:53 am on Apr 14, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Dolemite,

Yes, the old URLs would be replaced if you use a 301.

Just as a matter of best practice, whenever possible, return a 301 or a 410 for pages you intentionally remove. That way, a 404 in your logs is much more likely to mean you have an on-site problem. It makes the error log more useful by reducing the number of junk entries in it, and it also gives a better user experience: the user can either be redirected to a relevant page, or told unambiguously that the page has been removed and that he/she should delete or correct any bookmarks, notify the owner of the site with the bad link, etc.
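For example, with mod_alias both cases are one-liners in .htaccess (the paths shown are placeholders, not anyone's real URLs):

```apache
# Hypothetical paths; adjust to your own URLs.
# Moved page: permanent (301) redirect to its replacement.
Redirect permanent /old-page.html /new-page.html
# Deleted page with no replacement: return 410 Gone.
Redirect gone /removed-page.html
```

The mod_rewrite equivalents would use the [R=301] and [G] flags, respectively, if you need pattern matching rather than fixed paths.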

Jim