Getting our rewritten URLs out of Supplemental Results

F_Rose

3:04 pm on Sep 27, 2006 (gmt 0)

Most of our URLs are in the supplemental results.

Way back in April we did a URL rewrite for our entire site. What we did not do was set up 301 redirects from our original URLs to our new URLs.

For that reason Google has dropped both sets of URLs (old and rewritten) into the supplemental results. The way our site is built, we cannot do a 301 redirect. What we are trying to do at this time is put our old URLs into robots.txt.

Because our rewritten URLs are in the supplemental results, Google doesn't cache our pages: the bots hardly ever come back to supplemental pages.

How do we get Google to start caching our supplemental pages and get us out of this big mess?

Is there any way I can e-mail them to notify them of what's going on?

g1smd

5:55 pm on Sep 28, 2006 (gmt 0)

That depends on a lot of factors, and on whether you really have fixed the problem. YMMV.

URLs that return "200 OK", and are not duplicates, should migrate back to the regular index, yes. Make sure that they DO have a unique title tag and a unique meta description though, otherwise they will not.

F_Rose

6:03 pm on Sep 28, 2006 (gmt 0)

Thank you..

If so, should that take a couple of weeks, or up to a year, for our regular URLs to start reappearing in Google's regular database?

g1smd

6:12 pm on Sep 28, 2006 (gmt 0)

I have seen it happen in weeks for some sites. Some have taken longer. Not all pages jump at once. A few move each day, or in groups every few days.

Not all make it across, as you see from the example that I PM'd you - after many months there is just ONE still left.

F_Rose

6:32 pm on Sep 28, 2006 (gmt 0)

Would dropping all of our true URLs so that they return 404 errors, instead of 301 redirects, be an option?

Is it a problem with Google if a site has too many 404 error pages?

g1smd

6:35 pm on Sep 28, 2006 (gmt 0)

If content has moved to another URL, then the 301 redirect is what is required.

The URL that the redirect points to will be the one that is indexed (as long as it returns "200 OK").

If content has gone from the web forever, then the 404 is the right thing to do.
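The rule above (301 when content has moved, 404 when it is gone for good, "200 OK" for the page that should be indexed) can be sketched as a small lookup. This is only an illustration in Python, not how any particular server does it: the MOVED and GONE tables, their example paths, and the choose_response name are all made up for the example.

```python
from typing import Optional, Tuple

# Hypothetical example data; none of these paths come from the thread.
MOVED = {"/old-page.php?id=7": "/widgets/blue-widget.html"}
GONE = {"/discontinued-product.html"}


def choose_response(path: str) -> Tuple[int, Optional[str]]:
    """Return (status_code, redirect_target) for a requested path."""
    if path in MOVED:
        # The content lives on at a new URL: permanent redirect to it.
        return 301, MOVED[path]
    if path in GONE:
        # The content has been removed for good: plain "not found".
        return 404, None
    # Assumption for this sketch: anything else is a live page.
    return 200, None
```

The redirect target itself should come back as 200 from the same function, which is the condition given above for it to be the URL that gets indexed.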

F_Rose

7:46 pm on Sep 28, 2006 (gmt 0)

Thank you for sharing your experiences with me.

I will definitely go ahead and implement the changes and will keep you posted with updates. Hopefully, we should see good results in the near future.

stcrim

1:33 am on Oct 10, 2006 (gmt 0)

We have a site that has about 28,000 CFM pages we are republishing to HTML. We can't seem to plug all the robot holes with robots.txt. Except for the links being different, we have a lot of pages that exist twice: one in CFM and a copy of it in HTML.

I'm not real concerned that this is going to trip a duplicate filter problem - but I don't want to be wrong.

Any thoughts for excluding the CFM pages from the index? Is there a universal robots.txt exclusion for CFM?

Since we are republishing it would be very difficult to use 301.

Any thoughts?

-s-

fjpapaleo

2:25 am on Oct 10, 2006 (gmt 0)

"I'm not real concerned that this is going to trip a duplicate filter problem"

I don't know how you could not.

"Any thoughts for excluding the CFM pages from the index"

It can be done in .htaccess on an Apache server, either by sending a 301 or a 410 Gone. Either way, be prepared to bite the bullet for a long time.

jd01

3:08 am on Oct 10, 2006 (gmt 0)

301s are the best way to go to avoid duplicates and to get credit for any links that point at the content pages.

The process can be a little more complicated, but this Apache Library [webmasterworld.com] post should be helpful.

Justin

Pirates

3:16 am on Oct 10, 2006 (gmt 0)

You should maintain your original urls.

g1smd

1:07 pm on Oct 10, 2006 (gmt 0)

Changing your scripting technology does not mean you need to change the names of any of the pages. The names can still end in .cfm even if it is PHP or ASP that creates them.

The alternative is to 301 redirect all .cfm to .htm, which then makes it look like you have a completely brand-new site. You also need to block the old pages with:

User-agent: Googlebot
Disallow: /*.cfm

You really should retain the same URLs. The bots do not care HOW the pages were produced. They just want to see the HTML code and the on-page content that is output from that script.

You could also do an internal rewrite so that when you call for the .cfm pages, the .htm version is served. In that case you would continue to link to the .cfm names, and that is all the user would see. That rewrite would NOT issue a redirect response; the rewrite would be done internally in the server, and would not expose the real .htm filenames to the outside world. That would work, and is easy to set up if you use Apache.

I am aware of a well-ranked site that has all filenames ending in .asp even though the site has been run using PHP for at least 3 or 4 years now. The easiest way to do this is to just retain the same filenames as before, even though the technology has changed.
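To make the difference between an internal rewrite and a redirect concrete, here is a rough Python sketch of the idea. On Apache this would normally be configured in the server with mod_rewrite rather than written as application code; the function names and example paths below are hypothetical, and the sketch only handles URLs whose last path segment ends in .cfm.

```python
from pathlib import PurePosixPath
from typing import Tuple


def internal_target(requested_path: str) -> str:
    """Map a public .cfm URL path to the .htm file that holds the content."""
    p = PurePosixPath(requested_path)
    if p.suffix.lower() == ".cfm":
        # Same directory and base name, different extension on disk.
        return str(p.with_suffix(".htm"))
    return requested_path


def respond(requested_path: str) -> Tuple[int, str]:
    """Return (status, file_to_serve); no 301 is ever issued here."""
    # The visitor and the bot keep seeing the .cfm URL with a 200 status.
    # Only the server knows the content came from the .htm file.
    return 200, internal_target(requested_path)
```

Because the status stays 200 and no Location header is sent, the .htm names never appear outside the server, which is the point of the approach described above.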

stcrim

2:03 pm on Oct 10, 2006 (gmt 0)

Thank you for the help. G1smd, thank you for the additional instructions.

However, this all brings up some new concerns. The site owner wants (and now has) vanity folder extensions.

In other words:
FacilityDetail.CFM/FSID/1362/Show/Photo/RegionID/52/Sort/Name

is now

Northern_California/ReceptionSites/SouthBay_SouthCoast_MontereyPeninsula/Northern_California/SantaClara/NameofHotel.html

It sounds like the best course of action is to scrap this naming scheme and go back to the original CFM urls. Any thoughts?

Also, the re-publishing system runs 10 to 12 times a day (on any change) and re-publishes the whole site (28,000 pages).

Could there also be an issue with all the pages showing up as new (being re-published to the original CFM URLs) but only a couple of hundred will have content changes each time a bot comes by?

Thank you
-s-

g1smd

3:51 pm on Oct 10, 2006 (gmt 0)

Only from the point of view that Google will eat up a lot more of your bandwidth, pulling each whole page every time instead of getting a short "304 Not Modified" reply to its If-Modified-Since request.
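A minimal sketch of that If-Modified-Since check, in Python. It assumes the publishing system can tell the server when a page's content last really changed (the last_changed value here is hypothetical; a republish that touches every file would otherwise make every page look new). The function names are made up for the example.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Optional


def last_modified_header(last_changed: datetime) -> str:
    """Format a UTC timestamp as an HTTP Last-Modified value."""
    return format_datetime(last_changed.astimezone(timezone.utc), usegmt=True)


def conditional_status(if_modified_since: Optional[str],
                       last_changed: datetime) -> int:
    """Return 304 if the client's copy is still current, else 200."""
    if not if_modified_since:
        return 200  # no conditional header: send the full page
    try:
        client_time = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable header: send the full page
    if client_time.tzinfo is None:
        client_time = client_time.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds.
    unchanged = last_changed.replace(microsecond=0) <= client_time
    return 304 if unchanged else 200
```

With something like this in place, only the couple of hundred pages whose content really changed would be sent in full on each visit; the rest would get the short 304 answer.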

stcrim

4:06 pm on Oct 10, 2006 (gmt 0)

g1smd,

Thank you! How would you advise the site's owner in this case?

1. The CFM pages are indexed and doing fairly well. All 28,000 of them.

2. I fear all the keyword-rich folder names for the HTML pages will trigger some type of spam filter, though the content in them matches the names.

3. It sounds like they could spend a long time waiting for the new pages to be found and do as well as the old pages.

Your thoughts? If you want to see the CFM and HTML sites side by side, I would be happy to sticky you the URLs.

again, thank you
-s-

stcrim

4:57 pm on Oct 10, 2006 (gmt 0)

g1smd,

One more question:

In the robots.txt there is a wildcard in the "Disallow" line. It's been a long time since I used robots.txt files, but in the old days wildcards were not allowed. Are they now acceptable?

User-agent: Googlebot
Disallow: /*.cfm

-s-

g1smd

5:12 pm on Oct 10, 2006 (gmt 0)

Disallow with wildcard is OK for the Googlebot user-agent.

Do not use the wildcard for the User-agent: * section. Most other agents do not understand it.
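To show why the same line means different things to different agents, here is a small Python sketch of the two matching styles. It is an approximation for illustration, not Google's actual code: google_style_match treats "*" as "any run of characters" and a trailing "$" as an end anchor, while plain_prefix_match is the original literal-prefix behaviour. Both function names and the example paths in the note below are made up.

```python
import re


def google_style_match(rule: str, path: str) -> bool:
    """Match a path against a Disallow rule with '*' wildcards."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and let each '*' match anything.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    if anchored:
        pattern += "$"
    # re.match anchors at the start, so the rule acts as a prefix.
    return re.match(pattern, path) is not None


def plain_prefix_match(rule: str, path: str) -> bool:
    """Original robots.txt behaviour: the rule is a literal prefix."""
    return path.startswith(rule)
```

Under the wildcard style, "/*.cfm" matches a path such as "/hotels/detail.cfm". Under the literal-prefix style it matches nothing useful, because no real path starts with the characters "/*". Note also that the comparison is case-sensitive, so a URL written with ".CFM" would not be caught by a ".cfm" rule.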

If you use User-agent: Googlebot then Google will read ONLY that section; so you MUST copy all your User-agent: * stuff into the Googlebot [webmasterworld.com] section again.
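One way to avoid forgetting that copy step is to generate the file from a single list of shared rules. A rough Python sketch, where build_robots_txt and the example paths in the note below are hypothetical:

```python
from typing import List


def build_robots_txt(shared_rules: List[str],
                     googlebot_only_rules: List[str]) -> str:
    """Build a robots.txt whose Googlebot section repeats the '*' rules."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: {rule}" for rule in shared_rules]
    lines.append("")  # a blank line separates the two records
    lines.append("User-agent: Googlebot")
    # Googlebot reads only its own section, so repeat the shared rules.
    lines += [f"Disallow: {rule}" for rule in shared_rules]
    # Wildcard rules go only where they are understood.
    lines += [f"Disallow: {rule}" for rule in googlebot_only_rules]
    return "\n".join(lines) + "\n"
```

Calling build_robots_txt(["/cgi-bin/", "/admin/"], ["/*.cfm"]) would put both plain paths in each section and the wildcard line only in the Googlebot section.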
