Forum Moderators: Robert Charlton & goodroi


Supplemental Index has older version of a regular index page

Google duplicate content


confuscius

10:36 am on Sep 22, 2006 (gmt 0)

10+ Year Member



Over the past 18 months, I have observed many strange and curious Google anomalies, but the latest one takes the biscuit! On various data centres I have been able to retrieve two versions of my homepage using the sitemap *** wheeze. The most recent version shows as non-supplemental and the older version shows as supplemental - so I have two copies of the same page in the same index. The upshot is that Google holds 100% perfect duplicate content for me that I have no way of resolving whatsoever. Then, to cap it all, Google displays the supplemental result rather than the current version, but when you click the cached page it is the current version!

What chance does any webmaster have when Google appears to be so messed up? Perhaps Adam Lasnik or GoogleGuy could enlighten me as to how this situation can arise.

If this has been covered elsewhere, please point me in the right direction.

tedster

4:01 pm on Sep 22, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



g1smd:
Another type of Supplemental Result is where the page is simply the previous version of the page. The current version is shown as a normal result, but if you search for keywords that were on the page some 8 to 30 months ago (and which are no longer on the current version of the page) then you see the same page as a Supplemental Result.

The snippet will usually also show that same old content, but the cache will always be the one from recent days or weeks (except for a brief time last week when the old cache would show against the old results in several datacentres).

[webmasterworld.com...]

In short, this is intentional. The main purpose of the Supplemental index is for Google to be able to respond to a wider variety of obscure and complex queries.

[edited by: tedster at 6:20 pm (utc) on Sep. 22, 2006]

g1smd

4:32 pm on Sep 22, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



>> The most recent version shows as non-supplemental, the older version shows as supplemental - so I have two copies of the same page in the same index - the upshot being that Google has within it 100% perfect duplicate content for me that I have no way of resolving whatsoever. Then to cap it all Google displays the supplemental result not the current version but when you click the cached page it is the current version! <<

Yes. That is how Google works. Current content found at a URL is in the normal index, and any content on older versions of the page (and not on the current version of the page) is in the Supplemental Index. However this is NOT duplicate content. Only one of the results is served at any one time. Which one is served depends on the query.

I will modify the statement of mine that Tedster quoted above though. That stuff was my best guess a year ago. Supplemental Results used to mainly span 8 to 30 month old data, but for the last 6 months or so (Google has updated Supplemental Results several times) I would say this is now mainly 3 to 15 months, or so. I don't see anything (gfe-gv) older than (dated before) 2005-July right now. If I look at gfe-eh then I don't see anything older than (dated before) 2005-December, I think.

Supplemental Results that represent URLs that redirect, or are 404, or are for expired domains, can be safely ignored. Those will be dropped after one year. In the meantime your 301 redirect or your custom 404 page delivers the visitor through to the correct page anyway.

Supplemental Results that are for URLs that still return "200 OK" need to be investigated. Many times it will be a Duplicate Content problem (www/non-www, multiple domains, multiple parameter order, http/https, URL capitalisation issues {on IIS only} etc, maybe even too-similar titles/descriptions) and those are always a big problem for any site. However, sometimes it is just the new data / old data situation for the same URL, and that is not a problem. Google likes to hold on to the old version of a page so that someone who looked at it the day before you changed it, and now wants to look at it again, many weeks later, can still find it today.

My more recent thoughts are in: [webmasterworld.com...]

g1smd

5:26 pm on Sep 22, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Oh, and before I forget: index pages are a special case that needs extra thought. Google sometimes treats the index filename and / as separate pages.

Make sure that every page of your site links back to the root index page, but always ensure that you link only to www.domain.com/ or to www.domain.com/folder/ and always omit the index file filename itself from the URL.
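On an Apache server, one common way to enforce this server-side as well (a sketch only - it assumes mod_rewrite is enabled and an index.html filename; adjust the filename and rules to your own setup) is a 301 redirect in .htaccess:

```apache
RewriteEngine On
# If the visitor explicitly requested .../index.html, send a 301
# back to the bare folder URL. Testing THE_REQUEST stops the rule
# from firing on Apache's own internal DirectoryIndex subrequest.
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^?\ ]*/)?index\.html[?\ ]
RewriteRule ^(.*/)?index\.html$ /$1 [R=301,L]
```

With this in place, even a stray external link to /index.html resolves to the one canonical folder URL.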

Finally, in this long exposé on Duplicate Content issues: make sure that every page of your site has a unique title tag and a unique meta description - one that describes exactly what can be found on that particular page.

These searches are useful in finding out what is going on:

site:domain.com
site:domain.com inurl:www
site:domain.com -inurl:www
site:www.domain.com
site:www.domain.com inurl:www
site:www.domain.com -inurl:www

johnlim9988

2:00 am on Sep 24, 2006 (gmt 0)

10+ Year Member



Hi, G1smd,

Could you kindly explain how to get useful information from the commands you listed below?

site:domain.com
site:domain.com inurl:www
site:domain.com -inurl:www
site:www.domain.com
site:www.domain.com inurl:www
site:www.domain.com -inurl:www

Thanks

smokeybarnable

3:01 am on Sep 24, 2006 (gmt 0)

10+ Year Member



Yeah, what exactly is inurl:?

lmo4103

3:12 am on Sep 24, 2006 (gmt 0)

dibbern2

4:32 am on Sep 24, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's time to say thanks, g1smd.

walkman

5:03 am on Sep 24, 2006 (gmt 0)



All I can say is that it's hard to be a mom & pop store and go online these days.

F_Rose

6:38 pm on Sep 25, 2006 (gmt 0)

10+ Year Member



Thank you, g1smd, for all this great information.

Unfortunately, our site is one of those hit very badly with supplemental results for our existing pages.
(I am referring to second-level pages; our home page is being indexed regularly.)

The reason our site went supplemental is a URL rewrite which cannot be 301 redirected. The site was written in ColdFusion, and 301 redirecting creates a redirect loop.

We have no links to our old URLs, only to our rewritten URLs, and we thought that would surely take care of the problem.

However, Google has not crawled our pages since May because of the supplemental issue.

What can we possibly do to get Google to understand our situation and get our supplemental pages crawled once and for all?

Please help us..

P.S. Should you need any clarification or have any questions, please ask - we are desperate to get this taken care of..

g1smd

6:43 pm on Sep 25, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The alternative to the rewrite is to modify the script so that all alternative URLs for any piece of content are served with a <meta name="robots" content="noindex"> tag attached. Make sure that only one URL per page of content can be indexed.

Last, and definitely least, you could try to use robots.txt to keep the bots out of the other "versions" of a page.

Whatever you do, the alternatives will turn Supplemental and hang around for a year before Google deletes them from view.
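The noindex approach can be sketched in a few lines. This is only an illustration of the logic, not ColdFusion, and every name in it (the content ids, the URL map, the helper function) is hypothetical:

```python
# Hypothetical map: content id -> the single URL allowed to be indexed.
# Any other URL serving that same content gets a noindex tag.
CANONICAL = {
    "blue-shirts": "/shirts/blue/",
}

def robots_meta(content_id, requested_path):
    """Return the robots meta tag for this request: empty for the
    canonical URL, noindex for any alternative URL of the page."""
    if requested_path == CANONICAL.get(content_id):
        return ""  # canonical URL: indexable
    return '<meta name="robots" content="noindex">'

print(robots_meta("blue-shirts", "/shirts/blue/"))          # indexable
print(robots_meta("blue-shirts", "/shirts.php?colour=blue"))  # noindex
```

The key point is simply that the template decides per-request: one URL per page of content is indexable, everything else carries noindex.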

F_Rose

7:28 pm on Sep 25, 2006 (gmt 0)

10+ Year Member



g1smd,
Thanks for your response.

The problem is that both URLs pull the exact same page. I have to treat both URLs the same, since they serve the exact same content. As far as linking goes, we only link to the rewritten URLs; nothing links to our old URLs.

How can I make the rewritten URLs stronger and more noticeable, so that Google starts indexing them and gets them out of the Supplemental index?

[edited by: F_Rose at 7:29 pm (utc) on Sep. 25, 2006]

g1smd

7:50 pm on Sep 25, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you can't alter the script to test which URL was requested, then you'll have to go with robots.txt exclusions instead.

F_Rose

7:58 pm on Sep 25, 2006 (gmt 0)

10+ Year Member



Got it..

What can we do, though, to get Google to start indexing the right URLs? They have thrown these pages into the Supplemental index, and the pages still show a cache date of May 2, 2006.

g1smd

8:01 pm on Sep 25, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Once access to all but one URL is discouraged or blocked Google soon picks up the one that really should be indexed - especially if it is the one that internal and external links point to.

F_Rose

8:02 pm on Sep 25, 2006 (gmt 0)

10+ Year Member



Also,

If both URLs pull the same page and I put one URL (the old URL) in robots.txt, will that have an effect on the rewritten URL?

g1smd

8:05 pm on Sep 25, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Nope.

.

Imagine a shirt shop that messed up their URLs. They had a whole bunch of URLs like this:

/shirts.php?colour=blue&size=16
/shirts.php?colour=red&size=17
/shirts.php?colour=green&size=14
/shirts.php?colour=red&size=15
/shirts.php?colour=white&size=17

and then they found that another part of the site accessed the same five pages using:

/shirts.php?size=16&colour=blue
/shirts.php?size=17&colour=red
/shirts.php?size=14&colour=green
/shirts.php?size=15&colour=red
/shirts.php?size=17&colour=white

giving "exact" duplicate content.

.

You could put:

Disallow: /shirts.php?colour=

in the "robots.txt" file and the problem is solved.

Well, not in the best way, as you are throwing away some PR on the "other" version. But at least there is no more duplicate content exposed to indexing.
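The better long-term fix is to normalise parameter order in the script itself, so only one form of each URL is ever emitted (and anything else can be 301'd to it). A minimal sketch of the idea in Python, using the standard library:

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

def canonical_url(url):
    """Rebuild a URL with its query parameters sorted alphabetically,
    so that both parameter orders map to one canonical form."""
    parts = urlparse(url)
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunparse(parts._replace(query=query))

print(canonical_url("/shirts.php?size=16&colour=blue"))
# -> /shirts.php?colour=blue&size=16
```

Both /shirts.php?size=16&colour=blue and /shirts.php?colour=blue&size=16 then collapse to the same string, so the duplicate never exists and no PR is thrown away on a blocked "other" version.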

F_Rose

8:21 pm on Sep 25, 2006 (gmt 0)

10+ Year Member



Thank you for the information; I will definitely go ahead and run a test.

Approximately how long before I see any results?

Are there any other options for getting our internal pages that are in the Supplemental index recrawled and cached ASAP?