Way back in April we did a URL rewrite for our entire site. What we did not do was set up 301 redirects from our original URLs to our new URLs.
For that reason, Google has dropped both sets of URLs (old and rewritten) into the supplemental results. The way our site is built, we cannot do a 301 redirect. What we are trying to do now is put our old URLs into robots.txt.
Because our rewritten URLs are in the supplemental results, Google doesn't cache our pages; the bots hardly ever come back to supplemental pages.
How do we get Google to start caching our supplemental pages and get us out of this big mess?
Is there any way I can e-mail them to notify them of what's going on?
URLs that return "200 OK", and are not duplicates, should migrate back to the regular index, yes. Make sure that they DO have a unique title tag and a unique meta description though, otherwise they will not.
Not all make it across, as you see from the example that I PM'd you - after many months there is just ONE still left.
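For anyone who wants to check the unique title and meta description advice across a batch of pages, here is a rough Python sketch. It assumes you already have the HTML of each page in memory; the names HeadExtractor and find_duplicates are made up for this example and the URLs are placeholders.

```python
from collections import defaultdict
from html.parser import HTMLParser


class HeadExtractor(HTMLParser):
    """Collects the <title> text and the meta description of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            if (attr.get("name") or "").lower() == "description":
                self.description = (attr.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def find_duplicates(pages):
    """pages maps URL -> HTML source.

    Returns two dicts (titles, descriptions), each mapping a value that is
    shared by more than one page to the list of URLs that use it.
    """
    by_title = defaultdict(list)
    by_description = defaultdict(list)
    for url, html in pages.items():
        parser = HeadExtractor()
        parser.feed(html)
        parser.close()
        by_title[parser.title.strip()].append(url)
        by_description[parser.description].append(url)

    def shared(groups):
        return {value: urls for value, urls in groups.items() if len(urls) > 1}

    return shared(by_title), shared(by_description)
```

Any URL that shows up in either result shares its title or description with another page and is a candidate for fixing.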
The URL that the redirect points to will be the one that is indexed (as long as it returns "200 OK").
If content has gone from the web forever, then the 404 is the right thing to do.
I'm not real concerned that this is going to trip a duplicate filter, but I don't want to be wrong.
Any thoughts for excluding the CFM pages from the index? Is there a universal robots.txt exclusion for CFM?
Since we are republishing it would be very difficult to use 301.
I don't know how you could not.
"Any thoughts for excluding the CFM pages from the index"
It can be done in .htaccess on an Apache server, either by sending a 301 or a 410 Gone. Either way, be prepared to bite the bullet for a long time.
The process can be a little more complicated, but this Apache Library [webmasterworld.com] post should be helpful.
The alternative is to 301 redirect all .cfm to .htm, which then makes it look like you have a completely brand-new site. You also need to block the old pages with:
You really should retain the same URLs. The bots do not care HOW the pages were produced. They just want to see the HTML code and the on-page content that is output from that script.
You could also do an internal rewrite so that when you call for the .cfm pages, the .htm version is served. In that case you would continue to link to the .cfm names, and that is all the user would see. That rewrite would NOT issue a redirect response; the rewrite would be done internally in the server, and would not expose the real .htm filenames to the outside world. That would work, and is easy to set up if you use Apache.
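To make the difference between the two approaches concrete, here is a small Python model of what the outside world sees in each case. It is only an illustration of the idea, not Apache configuration; the Response type and both function names are invented for this sketch.

```python
from typing import NamedTuple, Optional


class Response(NamedTuple):
    status: int                  # HTTP status the client receives
    location: Optional[str]      # Location header, set only for redirects
    served_file: Optional[str]   # file the server reads internally, if any


def external_redirect(path: str) -> Response:
    """301 redirect: the client is told to request the .htm URL instead."""
    if path.endswith(".cfm"):
        return Response(301, path[:-4] + ".htm", None)
    return Response(200, None, path)


def internal_rewrite(path: str) -> Response:
    """Internal rewrite: the .cfm URL answers 200 and the .htm file is
    read behind the scenes, so the .htm name is never exposed."""
    if path.endswith(".cfm"):
        return Response(200, None, path[:-4] + ".htm")
    return Response(200, None, path)
```

With the redirect, a request for /widgets/blue.cfm gets a 301 and a new URL to index. With the rewrite, the same request gets a plain 200 and the .cfm URL stays the only one the bot ever sees.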
I am aware of a well-ranked site that has all filenames ending in .asp even though the site has been run using PHP for at least 3 or 4 years now. The easiest way to do this is to just retain the same filenames as before, even though the technology has changed.
However, this all brings up some new concerns. The site owner wants (and now has) vanity folder extensions.
In other words:
It sounds like the best course of action is to scrap this naming scheme and go back to the original CFM URLs. Any thoughts?
Also, the re-publishing system runs 10 to 12 times a day (on any change) and re-publishes the whole site (28,000 pages).
Could there also be an issue with all the pages showing up as new (being re-published to the original CFM URLs) but only a couple of hundred will have content changes each time a bot comes by?
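One possible way to keep unchanged pages from looking new on every run, assuming the publisher writes static files and you can hook into the write step, is to skip the write when the content is identical. This is just a sketch of that idea; write_if_changed is a made-up helper, and nothing in this thread confirms how much the file date matters to the bots.

```python
from pathlib import Path


def write_if_changed(path: Path, new_html: str) -> bool:
    """Write new_html to path only when it differs from the existing file.

    Returns True if the file was written, False if it was left alone.
    An untouched file keeps its old modification time.
    """
    data = new_html.encode("utf-8")
    if path.exists() and path.read_bytes() == data:
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return True
```

With something like this in the publishing loop, a run that regenerates all 28,000 pages would only touch the couple of hundred files whose content actually changed.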
Thank you! How would you advise the site's owner in this case?
1. The CFM pages are indexed and doing fairly well. All 28,000 of them.
2. I fear all the keyword rich folder names for the HTML pages will trigger some type of spam filter, though the content in them matches the names.
3. It sounds like they could spend a long time waiting for the new pages to be found and do as well as the old pages.
Your thoughts? If you want to see the CFM and HTML sites side by side, I would be happy to sticky you the URLs.
Again, thank you.
Do not use wildcard patterns in the User-agent: * section. Most other agents do not understand them.
If you use User-agent: Googlebot, then Google will read ONLY that section, so you MUST copy all your User-agent: * rules into the Googlebot [webmasterworld.com] section as well.
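The "reads ONLY that section" behaviour is easy to see with Python's standard-library robots.txt parser, which also picks the matching named section over the * section. This is an illustration using that parser, not Google's own code, and the paths /old/ and /private/ are invented for the example.

```python
from urllib.robotparser import RobotFileParser


def parse_robots(text: str) -> RobotFileParser:
    """Parse robots.txt content held in a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# The Googlebot section forgets to repeat the /old/ rule.
INCOMPLETE = """\
User-agent: *
Disallow: /old/

User-agent: Googlebot
Disallow: /private/
"""

# The Googlebot section repeats everything from the * section.
COMPLETE = """\
User-agent: *
Disallow: /old/

User-agent: Googlebot
Disallow: /old/
Disallow: /private/
"""

incomplete = parse_robots(INCOMPLETE)
complete = parse_robots(COMPLETE)

# Other bots fall back to the * section and are kept out of /old/.
other_blocked = not incomplete.can_fetch("OtherBot", "/old/page.cfm")

# Googlebot only looks at its own section, so /old/ is open to it
# unless the rule is copied there too.
googlebot_allowed_when_incomplete = incomplete.can_fetch("Googlebot", "/old/page.cfm")
googlebot_allowed_when_complete = complete.can_fetch("Googlebot", "/old/page.cfm")
```

With the incomplete file, the check for Googlebot comes back allowed even though every other agent is blocked from /old/, which is exactly the trap described above.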