
Google SEO News and Discussion Forum

What's the Skinny on the Supplemental Index?
Adam from Google clarifies some issues
contentwithcontent




msg:3152334
 10:45 am on Nov 10, 2006 (gmt 0)

From Adam's post at the Google Webmaster discussion group...

"I thought I'd clear the air a bit:

1) Penalty?
When your site has pages in our supplemental index, it does *not*
indicate that your site has been penalized. In particular, we do not
move a site's pages from our main to our supplemental index in response to any violations of our Webmaster Guidelines.

2) Freshness?
You can expect to see a fresher supplemental index in the coming
quarters. By the definition of "supplemental," however, I don't foresee
it becoming as comprehensive or frequently updated as our main index.

3) Cure?
Get more quality backlinks. This is a key way that our algorithms will
view your pages as more valuable to retain in our main index."

more at... Google Groups discussion [groups.google.com]


 

texasville




msg:3154432
 6:38 pm on Nov 12, 2006 (gmt 0)

>>>>>So, to get my site out of the supplemental index, I may have to get a bunch of links pointing directly at my supplemental pages. That's easier said than done.<<<<<

Bingo! Internal link structure doesn't mean diddly when it comes to this supplemental type. Google isn't looking at the site as a whole anymore. You may feel like each URL within your site is an integral part of your site as a whole, but Google now looks at each internal URL as a separate entity unto itself.

tedster




msg:3154437
 6:48 pm on Nov 12, 2006 (gmt 0)

Are many people seeing this "new kind of supplemental"? I've just gone through several ecommerce sites that I work with and I see no sign of it. One of them is under 1,000 URLs total, so I can see every one easily -- and NONE have a Supplemental tag. I know for sure that each product page on this domain does not have an IBL. Not anywhere near that. And most of them are three clicks from the domain root.

So I'm not sure we've got a complete understanding right now.

RonnieG




msg:3154483
 7:40 pm on Nov 12, 2006 (gmt 0)

A fundamental disconnect here seems to be steveb's definition of "getting crawled":

The issue with not having supplementals is getting crawled. You get crawled because you either have lots of links, even if all bad ones, or a small number of good ones, and you don't have any dupe issues.

To me, "getting crawled" simply means that the site's home page is known and indexed in G, hopefully in the main index, and that it has a well designed internal link structure that allows G and other robots to crawl each and every page on the site, and that Gbot has in fact been to each and every page of the site at least once. Sufficient evidence of "getting crawled" successfully, to me, is that G has every page of the site either in its main index or supplemental index.

So, "getting crawled", in steveb's terms, really means having IBLs from other sites, not "getting crawled" in the traditional SE spider sense. Mat C's comments that getting quality IBLs is one way to get a URL out of supplementals, but this is still a circular argument. If you, as webmaster/developer for many sites, have the ability to give your own managed site's IBLs from other sites you own and/or manage, that's dandy for you. You have found a way to successfully "game the system", because these links, while perhaps not "purchased", are still not the "natural" links that G is supposedly trying to encourage and reward. And small single-site site webmasters like myself, who do not have the ability to add these unnatural links from many sites, do not have the same advantage. That means we have to buy them or go to extreme efforts to try to convince other webmasters to link to our PR0 internal pages, and why should they? That is the unfairness of this new supplemental scheme.

Finally, G's recent focus on canonicalization should mean that simply adding new 301 redirects should not help a target URL, since G will now index that URL only once, in its canonical form, and never under a URL that issues a proper 301 redirect. There may be some "leftovers" from previous times, but if you see the same URL in the main index twice, in any form, then G is not in fact doing what Matt C said they were supposed to be doing with Big Daddy and other recent algo changes in 2006.

If anyone wants to see evidence of this "new supplemental effect", PM me and I will provide a couple of specific examples.

texasville




msg:3154493
 7:58 pm on Nov 12, 2006 (gmt 0)

Tedster,
I have not seen it hit the larger sites yet. It is hitting smaller sites. Think under 50 pages. I believe it will expand exponentially.

RonnieG




msg:3154521
 8:46 pm on Nov 12, 2006 (gmt 0)

texasville is right.

There seems to be a page count threshold as well as other factors at play. I can certainly add lots more pages to my 61-page widget site, but they would probably not add much to the user experience. I could create a site that has 32,000 pages, one for each individual widget for sale in my market, instead of giving users access to all 32,000 widgets via the user-searchable database portal I provide now. Would that really enhance the user experience? Probably not, but it certainly would let G think I have more unique content than I really do. It would also add 32,000 more pages for Gbot to crawl every couple of weeks. I would then look just like the huge eCommerce sites that do basically the same thing, and that G seems to love so well. I guess the way out of "supplemental hell" is to play the game, and let Google become the inventory database search function of every site on the planet, instead of building sites that are portals with their own database engines to serve up their inventories.


steveb




msg:3154572
 9:47 pm on Nov 12, 2006 (gmt 0)

"Most pages that were tagged supplemental had the same meta description tags two months ago..."

Which puts those pages in a different category than the healthy pages being talked about. The pages had a problem.

"And there's nothing I can do in that scenario except wait patiently for Supplemental Googlebot to roll around."

No, that has nothing to do with it really. People need to remember that the main index and the supplemental one are parallel to each other. Supplemental bot has nothing to do with the main index. Once you have a supplemental, you can have it for about a year. But you can also have the same URL in the main index.

You don't "go supplemental". Your page is dropped from the main index, leaving only the previously hidden supplemental to rank.

"So, to get my site out of the supplemental index, I may have to get a bunch of links pointing directly at my supplemental pages. That's easier said than done."

Totally irrelevant. Google doesn't care where links are, and doing something like linking from domainA to internal pages of domainB will be no more help to get a page regularly indexed than internal links.

"So, "getting crawled", in steveb's terms, really means having IBLs from other sites,"

Huh? Getting crawled is having Googlebot visit a page. I don't understand what you are trying to say here. "Crawled" has nothing to do with good or bad linking structure. It means Googlebot actually devouring your page.

It may be just a problem a couple of you are having in using standard terms in an exotic way.

steveb




msg:3154574
 9:55 pm on Nov 12, 2006 (gmt 0)

In short, the problem is the new Googlebot is so weak that while it may crawl a main page at a site, it often does not drill deeper on sites with low PR and/or few links. It's like Slurp used to be a lot of the time, banging its head against the index page of a domain (and robots.txt) and then skipping off like a stone across a lake.

IF a domain is healthy (and the vast majority of supplementals come from unhealthy domains), then more links to your homepage, more link paths to internal pages, more PR, more more more... this is what you need to drive Googlebot deeper into your domain to grab current versions of your pages to put in the main index. The supplementals will still be there in the background, waiting to appear if your indexed pages are dropped from the main index again.

g1smd




msg:3154589
 10:08 pm on Nov 12, 2006 (gmt 0)

Yes, a "healthy" URL which shows as a normal result will also have an older cached version "hidden" in the Supplemental index. You can see it if you search for words that were on the old version of the page and are not in the new version of the page.

You can also see it in some of the results of a site:domain.com -inurl:www search.

texasville




msg:3154607
 10:42 pm on Nov 12, 2006 (gmt 0)

I am talking about sites that have never had canonical problems, content on the supplemental pages has had little or no change. These are sites that have low pr.
Perhaps to clarify we should once more analyze the statement by Matt Cutts made to a website owner complaining of this exact thing:

>>>>".....having supplemental results these days is not such a bad thing. In your case, I think it just reflects a lack of PageRank/links. We've got your home page in the main index, but if you look at your site ... you'll see not a ton of links ... So I think your site is fine ... it's just a matter of we have to select a smaller number of documents for the web index. If more people were linking to your site, for example, I'd expect more of your pages to be in the main web index."<<<<<<<

I truly like that ONE part of the statement and find it the most interesting: >>>>it's just a matter of we have to select a smaller number of documents for the web index.<<<<
Really?
And: >>>>having supplemental results these days is not such a bad thing.<<<<
Yeah? But if this particular type of supplemental hits these pages, you cannot find them in any search. No way, no how!

g1smd says: >>>>Yes, a "healthy" URL which shows as a normal result will also have an older cached version "hidden" in the Supplemental index. You can see it if you search for words that were on the old version of the page and are not in the new version of the page.<<<<

But for these types of supplemental, you can't find them in any shape or form except in a site: search. They do NOT exist in a "search for words that were on the old version of the page", and in many cases there is no new version or old version; the page hasn't changed. Nor should it. The information on it is static.
The whole gist of this is that these sites do not have the resources for buying or developing large volumes of content just to fill things out for Google's benefit. They also don't have the resources for hiring someone to do link-building campaigns. Google has started treating them as red-headed stepchildren.

steveb




msg:3154620
 10:59 pm on Nov 12, 2006 (gmt 0)

Having more pages wouldn't help. If you can't get the bot to 50 pages, it won't go to 500.

Previously with the more robust googlebot it was common for a site with 600 pages to have all 600 indexed. Now it is common for the same domain to have 560 or so pages indexed. If those 40 pages that were dropped from the main index had parallel supplementals, those would show in the results. But if the domain is healthy, you just see the 560 pages for a site search.

Having pages dropped that have no supplementals is no big deal in terms of health, since again all you need is more links or one good link to get the pages crawled again, but it is now Google's policy to limit its listings of these weakly linked pages.

A PR3 page with two quality links will often fall out of the main index because Google's priority now is to crawl pages with a volume of links. A PR2 page that has 20,000 PR0 blog comment links aiming at it will get crawled basically every day. The new bot may choose not to crawl to the page via 19,700 of those blog links (or even 19,999), but it still will crawl the garbage page every day. PageRank of an index page matters some, but Google has gone extremely blog-happy in its priorities. PR is a very distant secondary consideration. Volume of crawl paths is what matters, regardless of how poor quality those links are.

Halfdeck




msg:3154673
 12:29 am on Nov 13, 2006 (gmt 0)

Totally irrelevant. Google doesn't care where links are, and doing something like linking from domainA to internal pages of domainB will be no more help to get a page regularly indexed than internal links.

Of course Google doesn't care. But if I add new links on a supplemental page, Google is going to ignore those links till the page's cache is refreshed. Whereas if I put the same links on a page that gets crawled every day, they're going to have an immediate effect.


g1smd




msg:3154677
 12:43 am on Nov 13, 2006 (gmt 0)

>> If I add new links on a supplemental page, Google is going to ignore those links till the page's cache is refreshed. Whereas if I put the same links on a page that gets crawled every day, they're going to have an immediate effect. <<

Eh? They are always going to be ignored until the page is crawled again. The factor is how often that page is crawled.

RonnieG




msg:3154705
 1:34 am on Nov 13, 2006 (gmt 0)

The bottom line is that Google has recently implemented a new threshold on what it decides to put in the main index, and what it thinks is insignificant, so it goes into the supplemental index instead, or in reality, the supplemental trash bin, since it is truly not indexed at all, and cannot be found with even exact long tail quotes from the page contents.

And it really has absolutely nothing to do with getting a page crawled, or whether or not the page or URL or site has a problem. I completely understand why truly problem URLs and cached duplicates from past metatag and other edits get into the supplementals. I have a few of those myself, and I am not concerned about them. They are explainable. What is a concern is why every page of a site, except the home page, goes to supplemental when they were all previously in the main index, there have been no changes to the site, and there are no issues with 99% of the pages.

The answer really is obvious and simple. Google has decided that the solution to their infrastructure and database size issues, resulting from the explosion of web sites and pages worldwide, was to index fewer sites and pages, and to apply different threshold parameters to the process. It is no longer technically feasible to meet the old goal of indexing all the world's internet content. Something had to go, and apparently it was the interior pages of newer, smaller and lower PR sites. This is such an obvious business and technology decision and result, I am only surprised it hasn't happened earlier.

So, now the challenge is how to deal with the new reality that simply having a well constructed site with at least some useful and unique content is no longer sufficient to get the site into Google's searchable main index.

Matt C. gave us part of the answer: more quality IBLs to interior pages. OK. I'll buy that (or those). The other part of the answer, besides time, which was always a factor, seems to be adding lots more pages. I can do that, too, just like the big eCommerce sites. I'll just build a more dynamic site that creates a separate page for each variation of every product I sell instead of letting users search my online inventory database. After all, that's what created the great explosion of pages that have clogged up G's servers and indexes and have created the need for thousands of Gbots constantly and more frequently crawling the web. So, I guess I can get into that game, too.

But pity (if you can) the poor mom & pop shops who only have a limited number of unique products and services, and can (or used to) adequately serve the public with a nice looking and clean 20-30 page site. They simply will cease to exist in Google's main index, and therefore will never be found. I fail to see how this is "do no evil".

Halfdeck




msg:3154708
 1:46 am on Nov 13, 2006 (gmt 0)

Eh? They are always going to be ignored until the page is crawled again. The factor is how often that page is crawled.

Then let me try to clarify with an example:

Page A: Supplemental result, cache date: Apr 11, 2006.
Page B: Supplemental result, cache date: March 2, 2006, TBPR 0.
Page C: in the main index, TBPR 9, cache date: Nov 8, 2006 (daily cache refresh), 9 links on the page

Option 1: I add a link on Page C to Page A. Google picks up the link immediately and Page A pops back up in the main index two days later.

Option 2: I add a link on page B to Page A. Page A remains supplemental because 1) PageRank of Page B is low and 2) the link on Page B won't be picked up for months (last cache fetch was March 2, 2006).

Follow me?

It's a long way of stating the obvious: if you're going to link to a page to try to get it out of the supplemental index, link from a page that's already in Google's main index.

But if your site is 99% supplemental save the home page, then it makes no sense to add more links to your supplemental pages from within your site unless you put them all on the home page.

steveb




msg:3154720
 2:13 am on Nov 13, 2006 (gmt 0)

I've checked RonnieG's site and there is absolutely nothing new here. Near-duplicate descriptions (the same basic sentence on multiple pages but with something like a different color substituted in the sentence text), multiple URLs showing the same content, no redirect from the old index.htm page to a new default file, and caches from six or more months ago (these are NOT supplementals from October like some sites have, but rather May and March).

If you are talking about a domain that has not been healthy, in Google's eyes, all other bets are off.

Google may be more ruthless in disliking pages now, but this is the same old thing we've gone over hundreds of times here. Webmasters have to make their sites healthy, and then wait up to a year before they start getting treated with full respect again.

====
"then it makes no sense to add more links to your supplemental pages from within your site unless you put them all on the home page."

Well, that's what he said. If your domain has one healthy page, and the domain is under 100 pages, you would do well to link to every page on your site from the home page. This may look stupid, but Google's weak crawling requires you to do what you can.

It's telling that no one here is saying the pages are getting crawled a few times a week and still not going into the main index. (Even if someone said that, if the domain had problems the pages could just be discarded.)

Getting more link paths so you get crawled more often is extremely important, as is cleaning up your domain in the ways mentioned hundreds of times here.

texasville




msg:3154823
 5:33 am on Nov 13, 2006 (gmt 0)

>>>>Getting more link paths so you get crawled more often is extremely important,<<<<<

And if you have enough resources you can make any site #1. Steveb, the point is how far some of these small business people can go to promote their websites. The whole point of this was to point out that they don't have these resources. Most do not even understand these factors. Most would have to hire someone to do it for them, both to build links and to build content. I am not talking about the webmasters you see here constantly who build 1000-page ecommerce or MFA sites. I am talking about the mom-and-pop brick-and-mortar businesses that have to try to compete with the Walmarts of the web. The small site goes supplemental and the deep-pocket sites control the returns.
Sorry, but what you and I are talking about are two different things.

steveb




msg:3154871
 7:15 am on Nov 13, 2006 (gmt 0)

If your point is big businesses have advantages over small businesses, well, obviously.

That should hardly be the point though. Google favoring spammers over solid, legit businesses (small or large) should be. The fact that a small number of quality links is looked on a LOT less favorably than a large number of rotten links is the issue here.

Google needs to alter its crawl priorities so it crawls these legitimate small sites, and allows them to be beaten by legitimate large sites, rather than its current priority of favoring blog comment spam links at the expense of both small and large legitimate sites.

RonnieG




msg:3154955
 10:14 am on Nov 13, 2006 (gmt 0)

steveb said:
... Near duplicate descriptions (the same basic sentence on multiple pages but with something like a different color substituted in the sentence text

Duh! Very similar is not the same as an exact duplicate. Mine is a real estate site; mytown is distinctly not the same as yourtown. This is not the same as a simple color change or shoe size difference. I have examined several other similar sites, where this kind of minor difference in wording is common, and it is not penalized.

... multiple URLs showing the same content, no redirect from the old index.htm page to a new default file,

False. What old index.htm? Where did that come from, unless from a bad IBL outside of my control, which is not my problem and should result in a 404? The case specifically referenced was www.mysite/Default.aspx vs. a lower-case version www.mysite/default.aspx, of which only the Default.aspx version was found in the site: results, with no cached page. And I had to request the additional omitted results to even see that. The mysite.com/ home page is indexed, and apparently is appropriately redirected to the sole Default.aspx URL, for both www and non-www. So it seems that steveb randomly guessed that this might be an issue and picked up on what might be a common issue with some sites. However, this site is hosted on IIS, which, unlike *nix servers, is not case-sensitive, and the target URL was the same exact canonical URL in any case, not a separate page. I tested several other variations of the same URL with random capitalization of various other letters in the URL, and they all went to the same proper and unique canonical URL. I did the same random-capitalization test with several internal URLs, including various letters in the folder names, all with the same clean results. Of course they all show the same content, since the landing page URL is the same exact file! All this shows is that the site is IIS-hosted. Nothing more.
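For anyone who wants to repeat that capitalization test, here is a minimal Python sketch of it. The host and path are hypothetical placeholders, and the script only reports what each randomly re-capitalized variant answers (status code plus any Location header) without following redirects; whether a 200 on a non-canonical variant actually matters is exactly what is being argued in this thread.

import random
import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    # Return None so 3xx responses surface as HTTPError instead of being followed.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

opener = urllib.request.build_opener(NoRedirect())

def probe(url):
    # Return (status, Location header or None) for a URL, without following redirects.
    try:
        return opener.open(url, timeout=10).getcode(), None
    except urllib.error.HTTPError as e:
        return e.code, e.headers.get("Location")

def case_variants(path, count=5, attempts=50):
    # Generate up to `count` randomly re-capitalized versions of a URL path.
    variants = set()
    for _ in range(attempts):
        variants.add("".join(random.choice((c.lower(), c.upper())) for c in path))
        if len(variants) >= count:
            break
    return variants

base = "http://www.example.com"        # hypothetical host, for illustration only
canonical = "/folder/default.aspx"     # hypothetical canonical path

for path in sorted(case_variants(canonical)):
    status, location = probe(base + path)
    print(path, status, location or "")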

... and caches from six or more months ago (these are NOT supplementals from October like some sites have, but rather May and March).

So what does that prove? The cached results were mostly March-April-May 2006, the same time lots of sites were first being hit by the same supplemental issues we have been discussing here, as evidenced by hundreds of posts on a now-locked WebmasterWorld thread that had to be continued 7 times to handle all the posts. All this shows is that the supplemental issues discussed in those extensive threads are still affecting some sites, and that the pages in the supplementals have not yet recovered from those issues. Since few or none of the pages have the usual problems that would cause them to be penalized / made supplemental / left out of the main index purely for their individual page issues, this also seems to support my point that there is still some kind of site-wide size and/or PR threshold being applied to which interior pages are allowed in the main index. My site's home page is PR3, and has been there for over a year. This may not be wonderful, but it is not a PR0-PR2. G's webmaster tools show my site's crawl history chart and page hit numbers, which indicate a monthly full crawl, with daily hits to at least my home page and an average of 9-16 pages per day, which is about what I would expect given the dynamic content of a few of the pages and periodic content updates. My web logs show similar crawl rates for googlebot/2.1. So the issue is NOT that the pages are not getting crawled. They, and/or the site, are just not "good enough" for G's main index for some reason.

It is just possible that, after 8-9 months in supplemental hell for everything except the home page, the next time Gbot is in the neighborhood checking my home page and XML sitemap, and hopefully also crawling and counting all of the pages of my site, the site will finally cross that mysterious threshold of time/page count/IBLs/etc. that means my interior pages can be indexed again, perhaps because I have added another 30-40 pages to the site and have acquired more IBLs to my home page and to a few of the interior pages as well.

In the meantime, following the lead of some of the suggestions in the old threads, I have deleted and re-submitted my xml site map, and submitted a re-inclusion request through webmaster tools.

Google needs to alter its crawl priorities so it crawls these legitimate small sites, and allows them to be beaten by legitimate large sites, rather than its current priority of favoring blog comment spam links at the expense of both small and large legitimate sites.

No. G needs to be able to better recognize and index true quality content of legitimate sites, on a page by page basis, regardless of site size, as well as recognize and discount spam links. My site has a number of spam links from scraper sites and others that I never solicited or authorized. Those absolutely should be discounted, and it appears that they are, but I should not be penalized for them. On the other hand, I also have legitimate and relevant links from other small businesses in my industry, but G does not seem to be crediting those at all.


Marcia




msg:3154969
 10:49 am on Nov 13, 2006 (gmt 0)

It isn't only a matter of being "good" enough, which can be a perceptual thing, or relevant enough. It's also a matter of being "important" enough, which is a metric that relies on both PR and the number of IBLs, adding up to a given link strength for a site.

Keeping it at a basic, simple level, there just has to be enough link "strength" within a site to trickle down through the pages of the site, and that can be relative to its size - and is somewhat controllable, to a degree, through IA (Information Architecture) and other internal linking factors.

Personally, I've measured the inbound link strength and progression of some small sites and tracked it over time, and what those sites need is exactly what Adam suggested - more quality inbound links. Meantime, if I know that only a certain number of pages will either be indexed altogether or be indexed in the main and not the Supplemental index based on the current PR and/or number of IBLs, then it's up to me to figure out how to route whatever link strength there is to the more valuable sections and/or pages of the site.

There are ways to selectively influence that to a degree - but as they say, the devil is in the details.

texasville




msg:3155302
 6:06 pm on Nov 13, 2006 (gmt 0)

>>>>>>Google needs to alter its crawl priorities so it crawls these legitmate small sites, and allow them to be beaten by legitimate large sites,<<<<<

So much for "do no evil".
Why turn them supplemental and effectively ban them from the SERPs? It creates a catch-22: never find them, and they will never get natural, organic links and never grow. That pushes quality aside in favor of moneyed sites. It cheats the surfers: if they can get a better deal or find better information on a small site, then showing it to them is what Google was originally supposed to do. It also distorts the results. These small sites will never grow under Google's thumb. I am not saying that Google needs to favor the small ones, but just don't effectively kill them. Don't use them as an excuse to cover the fact that your index power is limited.
I find it funny that MSN and Yahoo have not had to resort to this.
And not to change the subject, but I have also noticed this weekend that results that used to return (for example) 3 million+ are now returning 1.5 million and so on. Almost exactly half in every sector I searched.

g1smd




msg:3155521
 9:01 pm on Nov 13, 2006 (gmt 0)

>> Of course they all show the same content, since the landing page url is the same exact file! <<

Yes, that IS the duplicate content. Simply, Default.asp is a duplicate of dEfault.asp is a duplicate of deFault.asp is a duplicate of defAult.asp is a duplicate of defaUlt.asp is a duplicate of defauLt.asp is a duplicate of defaulT.asp is a duplicate of default.Asp is a duplicate of default.aSp is a duplicate of default.asP is a duplicate of DEfault.asp is a duplicate of DeFault.asp is a duplicate of DefAult.asp is a duplicate of DefaUlt.asp is a duplicate of DefauLt.asp is a duplicate of DefaulT.asp is a duplicate of Default.Asp is a duplicate of Default.aSp is a duplicate of Default.asP is a duplicate of DEFault.asp is a duplicate of DEfAult.asp is a duplicate of DEfaUlt.asp is a duplicate of DEfauLt.asp is a duplicate of DEfaulT.asp is a duplicate of DEfault.Asp and so on and on and on and on...

tedster




msg:3155558
 9:33 pm on Nov 13, 2006 (gmt 0)

an excuse to cover the fact your index power is limited.

Why do you make that assumption? I think going down that road in your thinking will likely lead you to false conclusions. Whether a URL is regular or supplemental, it is still indexed.

RonnieG




msg:3155620
 10:32 pm on Nov 13, 2006 (gmt 0)

>> Of course they all show the same content, since the landing page url is the same exact file! <<

Yes, that IS the duplicate content. Simply, Default.asp is a duplicate of dEfault.asp is a duplicate of deFault.asp is a duplicate of defAult.asp is a duplicate of defaUlt.asp is a duplicate of defauLt.asp is a duplicate of defaulT.asp is a duplicate of default.Asp is a duplicate of default.aSp is a duplicate of default.asP is a duplicate of DEfault.asp is a duplicate of DeFault.asp is a duplicate of DefAult.asp is a duplicate of DefaUlt.asp is a duplicate of DefauLt.asp is a duplicate of DefaulT.asp is a duplicate of Default.Asp is a duplicate of Default.aSp is a duplicate of Default.asP is a duplicate of DEFault.asp is a duplicate of DEfAult.asp is a duplicate of DEfaUlt.asp is a duplicate of DEfauLt.asp is a duplicate of DEfaulT.asp is a duplicate of DEfault.Asp and so on and on and on and on...


So just because a site is hosted on an IIS server, and as a result a user can enter various combinations of a URL that take that user (or robot) to the exact same canonical URL, that makes it duplicate content? I don't think so. If that were the case, every IIS server in the world would have dup content issues, and no site hosted on an IIS server would ever have good SERPs.

That was the whole point of G's efforts to index the final canonical URL in the first place. It no longer matters if the user (or robot) gets there via a 301 redirect, a 302 redirect, or mis-capitalizes a letter or two in the URL. With BD and other G algo changes in the last year, the final landing-page canonical URL is all that G sees or would index. Just look at the final URL as displayed in the address bar once you get there. As long as it is properly formed, and the landing page is the same URL, it doesn't matter what was entered in the first place.

Somebody needs to re-read Matt Cutts' blog posts and other explanations of G's work on the handling of canonical URLs.


steveb




msg:3155647
 10:52 pm on Nov 13, 2006 (gmt 0)

RonnieG, just insisting that Google act as you say doesn't mean they will, or that they should.

You need to read what Matt wrote about canonical URLs, and Google Guy here, and the literally thousands of posts on the topic. Then you need to clean up your domain. Your problems are plainly obvious, and not at all new.

To sum it up, to protect yourself you should have content accessible on only ONE URL, whatever that may be.

"Duh! Very similar is not the same as exact duplicate."

Double duh. Near duplicates get you in trouble. This of course is obvious when talking about a page, where changing one word won't magically make a page be considered a non-duplicate, but there have been several threads here discussing problems with pages whose descriptions were too similar, and how the pages immediately recovered from the "omitted results" purgatory after the near duplicates were changed.

"What old index.htm?"

The one archive.org shows was on your site. That one.

As stated before, clean up the problems with your domain, and then wait, even though the wait may take a year or more. Maybe this is all stupid of Google, but there is nothing new here. These same issues have been affecting sites for two and a half years or so.

tedster




msg:3155650
 10:54 pm on Nov 13, 2006 (gmt 0)

If that was the case, every IIS server in the world would have dup content issues

I work with many IIS sites, and it is true that they have this increased liability for duplicate URLs. If the developers are disciplined about capitalization throughout their mark-up, and there are few to no capitalization oddities in their IBLs from other domains, then the dup content issues may be minimal.

But the issues are often there -- 80% of the time, I'd say. Microsoft's server tends to encourage a certain lack of discipline in this area. I almost always find at least some of this trouble when companies bring me an already-developed site served on IIS.

The best practice I have found is to use all lower case for HTML file names -- and forget the cutesy CamelCase thing. A rewrite utility such as ISAPIrewrite can go miles toward fixing any damage.

.NET sites on IIS have three or four liabilities that are as common as termites in old Georgia homes. The so-called "custom 404" that actually returns a 302 status is another common source of Supplemental URLs for .NET/IIS websites.
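A rough way to check a site for that last liability is to request a URL that should not exist and look at the raw status code without following redirects. This is only a sketch against a hypothetical domain and path: a well-behaved server answers 404, while the misconfigured "custom 404" described above answers 301/302 (often bouncing to an error page that then returns 200).

import urllib.error
import urllib.request

def raw_status(url):
    # Fetch a URL without following redirects and return its HTTP status code.
    class NoRedirect(urllib.request.HTTPRedirectHandler):
        def redirect_request(self, req, fp, code, msg, headers, newurl):
            return None  # surface 3xx as an HTTPError instead of following it
    opener = urllib.request.build_opener(NoRedirect())
    try:
        return opener.open(url, timeout=10).getcode()
    except urllib.error.HTTPError as e:
        return e.code

# Hypothetical test: this path should not exist on the site being checked.
status = raw_status("http://www.example.com/this-page-should-not-exist-12345.aspx")
if status == 404:
    print("Good: missing pages return a real 404.")
elif status in (301, 302):
    print("Soft 404: the server redirects instead of returning 404 -- a supplemental risk.")
else:
    print("Unexpected status", status, "- check the error-page configuration.")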

g1smd




msg:3155665
 11:09 pm on Nov 13, 2006 (gmt 0)

>> So just because a site is hosted on an IIS server, and as a result, a user can enter various combinations of a URL that takes that user (or robot) to the exact same canonical url, that makes it duplicate content? I don't think so. <<

Duplicate content is "the SAME content available at a DIFFERENT URL". That is it.

The URL may be different by:
having a different domain name (bestwidgets.com vs. cheapwidgets.com),
or may be www vs. non-www (widgets.com vs. www.widgets.com),
or may be a dynamic URL with different parameters (/page.php?item=111 vs. /page.php?item=111&printfriendly=true),
or the same parameters in a different order (/shirts.php?colour=blue&size=16 vs. /shirts.php?size=16&colour=blue),
or the same parameters with slightly different values (/item.php?item=widgets&perpage=25 vs. /item.php?item=widgets&perpage=50 where only 20 items are in the category for example),
or the same URL with different capitalisation (as above).

Your understanding of canonical URL is flawed.

For the root of a site, www.domain.com/ is the canonical URL. All other URLs that can reach the same content (whether or not you promote them) are DUPLICATES.

For internal URLs, the one URL format that you choose to promote by internal linking is, by definition, the canonical URL, and therefore, by definition, all other forms, whether promoted or not, are duplicates.

If your site contains links to the same content but with several different formats, then the duplicate content problem starts breeding from within your own site.
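To make that list concrete, here is a rough Python sketch of a URL normalizer that collapses the variations above onto one assumed canonical form, so duplicates stand out in a crawl list or log file. The specific rules (forcing the www host, lower-casing the path, sorting query parameters) are illustrative choices for the example, not anything Google has published; every site has to pick its own canonical conventions.

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize(url):
    # Map a URL onto an assumed canonical form so duplicate variants collapse together.
    scheme, netloc, path, query, _fragment = urlsplit(url)
    netloc = netloc.lower()
    if not netloc.startswith("www."):
        netloc = "www." + netloc            # assumed preference: www form is canonical
    path = path.lower() or "/"              # fold IIS-style case-insensitive duplicates
    params = sorted(parse_qsl(query))       # same parameters in a different order
    return urlunsplit((scheme.lower(), netloc, path, urlencode(params), ""))

variants = [
    "http://widgets.com/Page.php?item=111",
    "http://www.widgets.com/page.php?item=111",
    "http://www.widgets.com/shirts.php?size=16&colour=blue",
    "http://www.widgets.com/shirts.php?colour=blue&size=16",
]

for v in variants:
    print(v, "->", normalize(v))
# The first pair and the second pair each collapse to a single canonical URL.

Anything in a crawl list that maps to the same normalized key under more than one raw URL is a duplicate-content candidate.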

texasville




msg:3155678
 11:30 pm on Nov 13, 2006 (gmt 0)

>>>>>Why do you make that assumption? I think going down that road in your thinking will likely lead you to false conclusions. Whether a url is regular or supplemental, the URL still is indexed. <<<<

If it is indeed in the index, then why isn't it served? This particular type of supplemental is never served, no matter what... EXCEPT in a site:mysite search. There must be a reason, and it must be conservation of resources.

texasville




msg:3155698
 11:50 pm on Nov 13, 2006 (gmt 0)

I can understand:
>>>>or the same parameters in a different order (/shirts.php?colour=blue&size=16 vs. /shirts.php?size=16&colour=blue), <<<<<<

but if you have a page that can be reached as mysite.com/widgets.htm and also as mysite.com/WiDGets.htm, and both resolve to the original page lying in the root folder created as widgets.htm, then that is a duplicate?

g1smd




msg:3155705
 11:57 pm on Nov 13, 2006 (gmt 0)

If the URL is different in any way what-so-ever, then you have TWO URLs that serve the same content.

/WIDgets.html is a DIFFERENT URL to /widGETS.html.

Any more than ONE URL serving the same content is a DUPLICATE URL.

Why is this so freakin' difficult to understand?

You might only have one "PAGE", but you have multiple URLs for that one page.

Google indexes URLs, not pages. A page may have more than one URL that can reach that content, but you want to avoid that happening.

You have to design your site so that only ONE URL serves that content with a "200 OK" status, and that all other alternative URLs for that content always issue a 301 redirect pointing to the one canonical URL that you want that content to be indexed under.

That includes other capitalisation of those URLs, and any other change to the URL in any way what-so-ever.

.

... only 12 more posts to go before I repeat this for the thousandth time [google.com].
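As a concrete illustration of that rule, here is a minimal sketch of "one canonical URL, 301 for everything else" written as a small Python WSGI wrapper. The lower-case-path policy, host name, and sample app are assumptions for the example; on a real IIS/.NET site you would enforce the same thing with a rewrite utility such as the ISAPIrewrite tool mentioned earlier, but the logic is identical: every non-canonical form of a URL answers 301 with a Location header pointing at the single canonical form, and only that canonical form answers 200 OK.

from wsgiref.simple_server import make_server

def app(environ, start_response):
    # The actual site content, served only at the canonical URL.
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<html><body>Widgets page</body></html>"]

def canonicalize(app, canonical_host="www.example.com"):
    # Wrap a WSGI app so every non-canonical URL form 301s to its canonical form.
    def wrapper(environ, start_response):
        host = environ.get("HTTP_HOST", canonical_host)
        path = environ.get("PATH_INFO", "/")
        canonical_path = path.lower() or "/"   # assumed policy: lower-case paths only
        if host != canonical_host or path != canonical_path:
            location = "http://" + canonical_host + canonical_path
            start_response("301 Moved Permanently", [("Location", location)])
            return [b""]
        return app(environ, start_response)
    return wrapper

if __name__ == "__main__":
    # /Default.asp, /DEFAULT.ASP, etc. all 301 to /default.asp; only that URL returns 200.
    make_server("", 8000, canonicalize(app)).serve_forever()

A production version would also preserve and normalize the query string, along the lines of the normalizer sketch shown earlier in the thread.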

texasville




msg:3155721
 12:13 am on Nov 14, 2006 (gmt 0)

>>>>Why is this so freakin' difficult to understand? <<<<
Well, excuse my thickness.

Maybe because I don't understand how capitalization errors make it a different URL. I can understand session IDs and the www versus non-www. I just can't understand the capitalization; it seems like a whole different animal. I am not talking dynamic... I am talking straight .htm. It also seems to be a flaw in Apache and a plus in IIS. It also seems that human error with caps is something an algo could easily be adapted to handle. After all, you cannot go out and buy EATMYSHORTS.com if eatmyshorts.com has already been taken.

texasville




msg:3155723
 12:15 am on Nov 14, 2006 (gmt 0)

BTW... I clicked on "thousandth" and google.com reports only 9,640. They lost a few.
