
Google SEO News and Discussion Forum

What's the Skinny on the Supplemental Index
Adam from Google clarifies some issues
contentwithcontent




msg:3152334
 10:45 am on Nov 10, 2006 (gmt 0)

From Adam's post at the Google webmaster discussion group...

"I thought I'd clear the air a bit:

1) Penalty?
When your site has pages in our supplemental index, it does *not*
indicate that your site has been penalized. In particular, we do not
move a site's pages from our main to our supplemental index in response to any violations of our Webmaster Guidelines.

2) Freshness?
You can expect to see a fresher supplemental index in the coming
quarters. By the definition of "supplemental," however, I don't foresee
it becoming as comprehensive or frequently updated as our main index.

3) Cure?
Get more quality backlinks. This is a key way that our algorithms will
view your pages as more valuable to retain in our main index."

more at... Google Groups discussion [groups.google.com]

[edited by: tedster at 11:12 pm (utc) on Nov. 11, 2006]
[edit reason] fix side-scroll [/edit]

 

texasville




msg:3155723
 12:15 am on Nov 14, 2006 (gmt 0)

btw...I clicked on thousandth and google.com reports only 9,640. They lost a few.

g1smd




msg:3155726
 12:21 am on Nov 14, 2006 (gmt 0)

Domain names are always assumed to be all in lower case.

Folder and file names can have any capitalisation that you want.

On Apache, the file /HELLO.HTM is a separate file to /hello.htm so you could upload one of each to the server, and they could have totally separate and different content.

On Windows/IIS, folder and file names are not treated as case-sensitive, so a two-letter filename has 4 different URLs that can access it (aa, AA, aA, and Aa). A three-letter filename has eight, and so on (2^n URLs for an n-letter name).

That is the problem.

Google indexes URLs.

The flaw is with IIS allowing a single file to appear at multiple URLs.

With Apache, if there is one file called file.html and you request FILE.html then you correctly get a 404 error for that incorrect URL.

.

The words "URL" and "file" and "page" have explicit meanings here and are not to be confused.

RonnieG




msg:3155890
 4:35 am on Nov 14, 2006 (gmt 0)

Any more than ONE URL serving the same content is a DUPLICATE URL.

Aha! And therein lies the crucial difference!

A DUPLICATE URL REFERENCE SERVING THE SAME CONTENT IS NOT DUPLICATE CONTENT! MULTIPLE URLS TO THE SAME CANONICAL CONTENT ARE SIMPLY DIFFERENT PATHS TO THE SAME RESULT.

Duplicate content is multiple canonical urls with the exact same content, and is NOT two different paths to the same canonical url! This is a simple, but critical fact of what we are discussing. Does a url with a 301 redirect to a 2nd url mean that the content is duplicated? No!

Re: [en.wikipedia.org...]

Pay attention to this: "all refer to the same article"

And I have to believe that the G development staff and Matt Cutts clearly understood this when they addressed how canonical content is now handled by G in its indexing process.

Re: [mattcutts.com...]

The key here is that you should be consistent with internal links within your own site, which makes perfect sense. However, you have no control over external links or over how a user may type a URL to reach a certain URL/page on your site. Therefore, internal consistency is what matters, not how some external link or user-typed URL happens to reach the canonical target URL for a particular piece of content. G does not see or care what the user may enter; as long as an external link can be resolved, by the spider or by the web server as it delivers the targeted content, G does not care. What it looks at and indexes is the final resolved URL that actually delivers the content.

I realize that this was not always true. This was a major point of contention, and the reason 302 hijacks worked, and why many dup content penalties used to occur, even if the webmaster was totally clean, and not at fault. One of the major goals of G's BD and other canonicalization development efforts was to eliminate the external causes of this kind of penalty, and to rely on the final canonical url 99.5% of the time, per Matt's blog.

The flaw is NOT in how a particular web server delivers content. The flaw was how G used to handle it. G, according to Matt's posts on canonicalization, and according to the results I see on my own site and many others, has fixed this. However, many webmasters seem to want to hang on to the old problems of the past, and not accept that things change. What used to be true, in this case, no longer applies, at least as far as uncontrollable external factors are concerned. How we implement links internal to our own sites probably does still matter, but how a user or external link refers to a url no longer does.

[edited by: RonnieG at 4:47 am (utc) on Nov. 14, 2006]

RonnieG




msg:3155921
 6:28 am on Nov 14, 2006 (gmt 0)

g1smd said:
Google indexes URLs.

Not precisely true, at least in today's world. In today's world, Google indexes canonical urls.

In the pre-BD world, Google, improperly as they finally realized and acknowledged and eventually corrected, did index as many urls as it found in any and all online sources.

[edited by: RonnieG at 6:32 am (utc) on Nov. 14, 2006]

ramachandra




msg:3155922
 6:43 am on Nov 14, 2006 (gmt 0)

Hello,

Here is the situation I was facing earlier and am now facing again with the canonical issue.

I resolved the non-www vs. www issue long ago by implementing a 301 redirect from the non-www domain to the www domain. After the BD rollout everything was fine with the site.

Last week my ranking suddenly dropped to 700+. On analysis I noticed that G has again indexed my index page twice, under 2 different URLs: one shows www.mysite.com/ and the other www.mysite.com/index.asp.

I am trying to fix the issue by adding a 301 from index.asp to the root, but it goes into an endless loop.

I have not been able to find a solution for a long time, so I would appreciate suggestions on how to get G to treat both as the same URL.
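For reference, a minimal sketch of a loop-free version of that redirect, written in Apache mod_rewrite syntax purely for illustration (this is an IIS/ASP site, so the same originally-requested-URL test would need to go into an ISAPI rewrite filter or into the ASP script itself); www.mysite.com stays as the placeholder domain. The trick is to key off THE_REQUEST, which still holds the URL the visitor actually asked for even when "/" is being served internally by index.asp:

  RewriteEngine On
  # Redirect only when the client literally asked for /index.asp;
  # the internal mapping of "/" to index.asp never changes THE_REQUEST,
  # so the rule cannot fire a second time and loop.
  RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.asp[?\ ] [NC]
  RewriteRule ^index\.asp$ http://www.mysite.com/ [R=301,L]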

Marcia




msg:3155930
 7:13 am on Nov 14, 2006 (gmt 0)

g1smd said:
Google indexes URLs.

And each URL is assigned a docID, correct?

RonnieG said:
Not precisely true, at least in today's world. In today's world, Google indexes canonical urls.

In the pre-BD world, Google, improperly as they finally realized and acknowledged and eventually corrected, did index as many urls as it found in any and all online sources.


So Ronnie, tell us which parts have changed:

The Anatomy of a Large-Scale Hypertextual Web Search Engine [infolab.stanford.edu]

And then tell us how the document should currently read in place of the parts that you're indicating are now erroneous in it.

g1smd




msg:3156010
 9:51 am on Nov 14, 2006 (gmt 0)

>> A DUPLICATE URL REFERENCE SERVING THE SAME CONTENT IS NOT DUPLICATE CONTENT! <<

YES IT IS. The same content served at multiple URLs is what duplicate content IS.
It is irrelevant how those other URLs arise.

.

>> MULTIPLE URLS TO THE SAME CANONICAL CONTENT ARE SIMPLY DIFFERENT PATHS TO THE SAME RESULT. <<

There is no such thing as "canonical content". I have no idea what you are talking about. Different paths to the same content are duplicates when all those URLs return the content as "200 OK".

.

>> Duplicate content is multiple canonical urls with the exact same content, and is NOT two different paths to the same canonical url! <<

You are clearly confused or delusional. You cannot have multiple canonical URLs. The word canonical, by definition, refers to ONE single URL, the "real" or "true" URL. Multiple URLs are duplicates. Duplicates are duplicate content.

.

>> This is a simple, but critical fact of what we are discussing. Does a url with a 301 redirect to a 2nd url mean that the content is duplicated? <<

This is very simple. This is another case entirely. When talking about duplicate content we are talking about multiple URLs that all return the same content, and all return it with "200 OK" status. I thought I had made this abundantly clear multiple times already.

If a URL serves "301 Moved" then that URL is a redirect, not something that serves content. Search engines don't index 301 redirects. They index the target URL of the 301 redirect, that is, they index the final URL that returns the content and returns it with "200 OK" status.

When a URL returns a 301 status, then that is a redirect. As the content is not directly served at that URL, then it cannot be duplicate content.

Marcia




msg:3156024
 10:15 am on Nov 14, 2006 (gmt 0)

When talking about duplicate content we are talking about multiple URLs that all return the same content, and all return it with "200 OK" status. I thought I had made this abundantly clear multiple times already.

It is very clear and that's exactly what it is, right on the button.

page1.aspx returns 200 OK
page2.aspx returns 200 OK with same content

2 pages with different URLs same content = duplicate content.

<side topic>
g1smd, your stickymail is full, you need to clear it out.
</side topic>

g1smd




msg:3156063
 11:24 am on Nov 14, 2006 (gmt 0)

>> In today's world, Google indexes canonical urls. <<

Google tells us that they are trying to, but in many cases they still fail to do so.

When the difference is the addition or lack of a www in the URL, then they are getting things more right than they used to. Likewise if you're talking about /index.html¦htm¦php versus "/" then it is often OK (but could always do with a helping hand).

When the difference is .com versus .co.uk then there are multiple types of "funky result" going on. When the difference is parameter number variations, parameter ordering, or capitalisation issues, then nothing is in place for Google to work out which URL you really intended to be "the one". They take a guess, and the chosen one changes on a regular basis as PR and links slosh about.

If you can set all your internal links to all reference one particular URL format, and you can get all other URL formats on your site to return a 301 redirect pointing to the one format you want to be indexed then you are already miles ahead of the game.
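For the Apache case that is only a few lines of mod_rewrite in .htaccess; a minimal sketch, with example.com standing in for the real domain and www chosen as the preferred format:

  RewriteEngine On
  # Send every request for the bare domain to the www form with one 301,
  # preserving the rest of the requested URL.
  RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
  RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]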

>> The key here is that you should be consistent with internal links within your own site, which makes perfect sense. <<

Yes. That is a key point.

Dynamic sites (like forums and carts) often get this wrong. In fact, poor indexing of those types of sites is nearly always down to duplicate content issues arising from parameter differences and parameter ordering, not simply the fact that the URLs contain dynamic parameters at all.
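One concrete, and purely hypothetical, example of the parameter side of that: if the forum script serves the same thread as both /showthread.php?t=123 and /showthread.php?s=abc123&t=123, the per-session parameter can be stripped with a 301 so only the clean form ever returns "200 OK" (the parameter names here are assumptions for illustration, not any particular package's):

  RewriteEngine On
  # Drop a leading per-session "s" parameter and keep whatever follows it,
  # so the session-less URL is the only one left to index.
  RewriteCond %{QUERY_STRING} ^s=[^&]+&(.+)$
  RewriteRule ^showthread\.php$ /showthread.php?%1 [R=301,L]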

photopassjapan




msg:3156120
 12:43 pm on Nov 14, 2006 (gmt 0)

Interesting how this conversation turned away from "what makes supplementals supplemental"... I was interested in that :)

Although I don't agree... or would rather not agree, that internal navs play no role in this <:)

We'll see.
Also, on a sidenote... in my opinion meta has nothing to do with going supplemental, nor with the "health" of a URL, at least not directly. If all else is unique, meta only fine-tunes relevance; I'm seeing sites with sitewide metas, and with no metas at all, that still rank well. Same metas could be a trigger for further examination... I don't know, but what I DO see is that they group otherwise normal results in an undesirable way, displaying the relevant highest PR/directory-level page instead of the MOST relevant one, and stuffing pages as "very similar" into the omitted search results list. ( But similar is not the same. )

And there are supplementals on our site that have been crawled within the past month, and nothing changed for them. Unique metas or no unique metas, crawled or not, PR0 pages are now supplemental while PR1, 2, 3, 4, 5 pages are not. That's it, really; the two have nothing to do with each other directly. That's what my post was about: our experience is that meta is NOT linked to supplementals.

Being crawled can be an indication all right, but only because crawl frequency is tied to the PR of the pages linking to the ones in question. ( Or not, but that's what I believe :P ) To me it still seems that links on a PR0 page are less likely to be followed on the spot by Gbot. Links from a PR2 page are crawled at least twice a month, and indexed about as often, as opposed to links on PR0 pages, which are crawled about once a month, sometimes once every month and a half. That left us wondering: okay, they're not crawled that often, but when they are indexed again, the supplementals they point to will get out, right? These are links, after all. The answer was NO; they only got out if the referring page was at least PR2.
End of the story.

...

An interesting observation though:

- Supplementals don't come up for any search, not even unique strings found on the page. ( you said it, and it is so )

BUT

- If you do a search for some string that is matched by at least ONE normal page from the same domain, and a supplemental as well, the supplementals are carried onto the result pages on the back of the normally indexed page.

So in such cases...
If your site comes up for a search, it either shows two results ( query is matched in normal and supplemental page as well ), or none ( query is matched in the supplemental page only ).

g1smd




msg:3156125
 12:49 pm on Nov 14, 2006 (gmt 0)

Yes, a duplicate meta description across multiple pages is not a reason for a URL to become Supplemental, but those results are instead merely hidden behind the "click for omitted results [threadwatch.org]" link because they are just treated as being "similar".

Gissit




msg:3156163
 1:42 pm on Nov 14, 2006 (gmt 0)

Hi
Sorry, but I have to disagree. I have a site that has 4000 pages indexed and only three of these are not supplemental. If I run a search on category-model-product plus a few keywords of the description (no phrases or anything too specific) I list at #1 of 35 with a supplemental page; none of the pages below it are supplemental. I actually pick up a fair amount of traffic from these long-tail searches.

The reason most of the site is supplemental?
Zen has a few issues with /index.php etc. that needed a 301 to sort out, but the real problems came from when I set up the Zen Cart shop on the site a couple of years back and, in my ignorance, added a load of links on the front page pointing deeper into various parts. Some of these links went through the site search facility, and these actually generate a different URL than the standard navigation to the same item would. Where I had added direct links to content and then made later changes to categories, the links still worked but the standard navigation URL changed. The effect of this is that there are at least four URLs that point to each product category and each individual item. Duplicate content gone mad...

This was not a problem initially, and I had around 20,000 indexed pages for only 700 products; no wonder G has had to try to rationalise what it keeps where...

I sorted out the errant linking yesterday (been busy with a new site and not too worried about this one until now) and now need to 301 all the bad links back to where they belong. A quick lesson in regular expressions should sort that out, I hope, but I am not expecting instant results. With all that said, the site still ranks top 20 for competitive keywords (1.3 million results in the SERPs) despite the bulk of it being supplemental.

g1smd




msg:3156243
 3:04 pm on Nov 14, 2006 (gmt 0)

A vBulletin forum with 40 000 threads and 40 000 members will easily generate over a million URLs, over 90% of which do not need to be, ummm, should never be, indexed.

Most other CMS, forum, and cart software has the same sort of problems with what they present to search engines.

Your fixes will show a small number of instant (like a few weeks) results, followed by a growing number in the following months. Finger in the air, I would say that you will need at least two Pagerank updates out of the way for most of the problems to be fixed. Anything that turns Supplemental will hang around for a year, but your redirect means that the visitor still gets to see the correct content anyway.

Your measure of success is in making sure that the maximum number of "correct" URLs are properly indexed. Don't be distracted by counting supplemental results for URLs that are now redirects and which fail to go away. They will eventually disappear, and you cannot control how quickly that happens.

dethfire




msg:3156400
 5:24 pm on Nov 14, 2006 (gmt 0)

A few of my sites are still indexed with cache dates that are over a year old. Any ideas?

g1smd




msg:3156405
 5:30 pm on Nov 14, 2006 (gmt 0)

It all depends what the specific URL represents.

If it is a URL that now redirects, is 404, or a domain that has expired, then Google will clean it up eventually.

If it is a URL that returns "200 OK" then it is likely that it is duplicate content, and the other URL that returns the same content is probably shown as a normal URL somewhere. The fix is to redirect all of the duplicates.
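For a handful of known leftovers that can be as simple as mod_alias one-liners; a sketch with made-up paths and a placeholder domain:

  # Each stale or duplicate URL gets a permanent redirect to the one true URL.
  Redirect 301 /old-widgets.html http://www.example.com/widgets.html
  Redirect 301 /widgets.htm http://www.example.com/widgets.html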

It might just be a "page with low PR" and poor linking from other sites, which some people report seeing in just the last few months.

fishfinger




msg:3156424
 5:52 pm on Nov 14, 2006 (gmt 0)

... the point is how far can some of these small business people go to promote their websites. The whole point of this was to point out that they don't have these resources...

... all big bucks and big time players, and realistically, no way to honestly move up the ladder for the rest of us, except to spend all our spare time searching and recruiting for backlinks instead of running our small businesses...

I assume that anyone affected by this who cares does so because they expect to MAKE MONEY ON THE INTERNET? Where does it say that you *should* be able to compete with a company with a six-figure ad budget and an in-house SEO? Where does it say that you *should* expect search engines to work the way they always have?

Any business plan that expects success without marketing know-how (whether personal or paid for) and a marketing budget is NOT a viable business plan. Anyone affected should be grateful that they've had it so easy up to now.

Offline, if you don't

(a) have a shop on the high street
(b) advertise in (at least) the local and niche media
(c) network and source leads/customers

do you HONESTLY believe you deserve to compete with those that do?

Just because the internet has been different, it doesn't give anyone the right to expect that it will stay that way. The writing has been on the wall for a long time now, yet people still act surprised every time Google pulls a little bit more of the carpet out from under their feet.

potentialgeek




msg:3156457
 6:19 pm on Nov 14, 2006 (gmt 0)

Google has gone extremely blog-happy in its priorities. PR is a very distant secondary consideration. Volume of crawl paths is what matters, regardless of how very poor quality those links are.

It may have been around for some time, but I've just recently seen a crop of new blogs started as the latest form of link farming. For Google to take blogs seriously is funny, silly, and annoying. The dregs of humanity get a seat at the king's head table. Lovely.

One person who doesn't think blogs are so hot... Joe Lieberman. :/

p/g

texasville




msg:3156513
 7:23 pm on Nov 14, 2006 (gmt 0)

>>>>I assume that anyone affected by this who cares does so because they expect to MAKE MONEY ON THE INTERNET? Where does it say that you *should* be able to compete with a company with a six-figure ad budget and an in-house SEO? Where does it say that you *should* expect search engines to work the way they always have? <<<<<

Because they are search engines and not directories. Because they give searchers the impression that they can provide the relevant information they are searching for: not the most optimized websites with huge numbers of paid links, but the sites most pertinent to their quest.

Staffa




msg:3156536
 7:56 pm on Nov 14, 2006 (gmt 0)

Texasville, I second that.

"Because they are search engines and not directories." ....which we allow to fill their indexes for free while crawling our hard work (our work, our bandwidth=money)

Gissit




msg:3156616
 9:46 pm on Nov 14, 2006 (gmt 0)

Sorry back to topic, well almost.

For anyone who wants to sort out dupe content on a PHP/SQL site that has gone supplemental, I seriously suggest a bottle of something strong before you attempt to sort out the .htaccess 301s to clear up the mess.

I'm no webmaster or software engineer, but I do have a lot more experience than many. I think it is going to be very hard for some people to sort out this sort of thing without some professional help.

After a couple of hours on the steep learning curve of regular expression syntax and mod_rewrite, I am getting a bit closer now and am redirecting many of the offending URLs to a single address.
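For anyone in the same boat, the rules end up looking something like this sketch, which collapses one hypothetical extra-parameter form of a Zen Cart product URL onto the plain form (the parameter names follow typical Zen Cart style URLs but are assumptions here, so check them against your own links):

  RewriteEngine On
  # Requests that arrived via a category path (cPath) get one 301 to the
  # bare product URL, so each product keeps a single indexable address.
  RewriteCond %{QUERY_STRING} ^main_page=product_info&cPath=[^&]+&products_id=([0-9]+)$
  RewriteRule ^index\.php$ /index.php?main_page=product_info&products_id=%1 [R=301,L]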

If you are suffering from dupe content and are not really technical, bite the bullet and find the budget to get some help. I doubt this will just go away.

g1smd




msg:3156632
 9:57 pm on Nov 14, 2006 (gmt 0)

Heh, I once spent a whole week trying to get a redirect to work, then realised the reason that it failed was simply because there was an error in the test data.

texasville




msg:3156642
 10:07 pm on Nov 14, 2006 (gmt 0)

Well... one site I have that is suffering from this problem is on an IIS server, and I contacted the host earlier today and guess what... they have no idea what ISAPI rewrite is. And they insist this must be a Google problem and not an IIS problem. I agree with them, but I still insisted that it makes no difference. Google is a lot bigger than I am and they probably ain't gonna change. They do have an Apache server, and I told them to either get it fixed, move me to the Apache server, or I was going to find another host next week.
Of course, it's a pain. I really am getting tired of running down these little possum trails Google creates.

RonnieG




msg:3156646
 10:17 pm on Nov 14, 2006 (gmt 0)

Marcia said:
And then tell us how the document should currently read in place of the parts that you're indicating are now erroneous in it.

Excellent question. Actually, nothing is wrong with the paper, or the design it describes. Read it carefully. What I am saying is totally consistent with that document and the design it describes.

In 4.1 of that document:

.. Every web page has an associated ID number called a docID which is assigned whenever a new URL is parsed out of a web page.

So, according to this paper, a docID is actually assigned to a web page, the document, and not to a url. Otherwise, why wouldn't it have been called a urlID in the first place? As aptly pointed out previously, a url is not a page. They are two different things.

and a little further down, the real key bits:

The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs.

The description of the function of the URLresolver is very significant. It clearly specifies that there is a difference between a relative url and an absolute url, a crucial difference that I have been trying to explain in previous posts. A relative url may be no more than an internal page reference, relative to the current page. However, it may also be a 3rd level domain reference, such as www, or may contain a totally different domain name and other addressing attributes that will subsequently convert (by DNS resolution, redirection and/or by the web server hosting the url) to an absolute url when it gets to the final target page. In any case, per the paper, Gbot does not create a docID for the page until it gets to the absolute url.

When the web server delivers a page, it also delivers the fully resolved absolute url of the page, including any corrections to improper capitalization that may have existed in the link but has been resolved by the redirect or by the internal workings of an IIS server, for example. Unless there is some kind of masking going on, that fully resolved absolute url as delivered by the web server is what the browser puts in the address bar when the page is displayed, and should be what Googlebot sees and captures as the absolute url for that page. It is that target page, and its associated absolute url that gets (or is supposed to get) a docID, not the original relative url that may be in a link. Obviously, if a webmaster does something stupid to dynamically mask or change the absolute url returned to the browser, Googlebot will also see that masked or changed absolute url, and it will base its indexing on that, instead of the actual file name on the server, so that should never be done.

I can not understand how some in the webmaster community have misconstrued all of this to the common interpretation that each and every url, including relative urls, gets a docID. Perhaps it is the lack of clarity about the difference between a relative url and an absolute url. More likely, it has been G's improper implementation of the original design that has caused this misunderstanding, and has caused the original issues with 302 hijacks and other dup content penalties as a result of that improper implementation.

The other significant point of this landmark document is that a link in a web page, with its relative url, is the "trigger" that tells Google that there may be another page (remember: docID = page) to index, once the relative url is converted to an absolute url. It takes submitting a page or url to Google for the first time, or Google "discovering" a link from one previously discovered page to another, for a docID to be assigned to the final target page and therefore to its absolute url. Plus, as made clear later in the document, the absolute url is actually an attribute of the page, and the docID is also an assigned attribute of the page, and is the common key used in the absolute url index to allow accessing the page by its absolute url.

Therefore: docIDs are assigned to pages, and the URL associated with a docID is (supposed to be) only the absolute URL of that page, not any of the many other relative URLs that could eventually get a user or robot to the same page. The page contents gleaned from the document, compared to the contents of other unique docIDs, are what determine whether duplicate content exists or not. It therefore logically follows that only unique docIDs (which also have unique absolute URLs) carrying the same content as other unique docIDs (associated with other absolute URLs) can actually be duplicate content, and that no combination or number of relative URLs that may point to the same absolute URL can cause duplicate content.

Therefore, true duplicate content can only be created by a webmaster storing the same content on different absolute urls, or physical web pages, and cannot be artificially "created" by logical associations of multiple relative urls or paths to the same absolute url. There may be a perception of duplicate content created by multiple relative urls that eventually resolve to a common absolute url and its page, but perception is not always reality.

So, if in fact Google is NOT following its own basic design specifications, as described in this paper, and is improperly assigning docIDs to pages based on multiple relative urls instead of or in addition to the absolute urls, due to faulty programming logic, then it is Google that is delusional, and Larry Page needs to direct the G programmers back to the original design paper so they get it straight.

Sorry to be taking this thread in a slightly different direction, but since dup content is one reason a page (or absolute url if you will) might go to supplemental, the technical definition of dup content is pretty essential to understanding how to avoid it, so I don't think this sidebar discussion is really that far off topic.

g1smd




msg:3156651
 10:37 pm on Nov 14, 2006 (gmt 0)

Relative and absolute URLs had nothing to do with what we had been discussing up until your last post.

Of course, Google resolves ../../somefile.html as domain.com/somefile.html if that link appeared on domain.com/folder/folder/otherfile.html - that is without question.

However, for relative URLs, where no 301 redirect from non-www to www is present, the relative URL means that from within the site you can resolve both domain.com/somefile.html (from within some other non-www page) and www.domain.com/somefile.html (from within some other www page) and that both return "200 OK".

Google has done some work to try to combine results, for that scenario, but for most other duplicate content scenarios they have not yet managed to do this.

Matt Cutts still recommends solving canonical issues by adding the 301 redirect from non-www to www, as well as specifying which one you prefer to use in Google Webmaster Tools.

.

>> Every web page has an associated ID number called a docID which is assigned whenever a new URL is parsed out of a web page. <<

To me that says: every time Google parses a page of code and content and finds an (absolute, resolved-with-domain) URL that it has not seen before, it assigns a DocID to it. So each full URL (with domain included) has a DocID assigned.

I cannot see how you got to your interpretation of what that says. It seems clear to me. Each new URL is assigned a DocID.

That system does NOT prevent duplicate content, as I already explained.

.

>> The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. <<

What that says to me is that from different parts of a site they might find links to ../thatpage.html and to ../folder/thatpage.html and to thatpage.html and to /some/folder/thatpage.html, and the system will resolve all of them to one canonical form like domain.com/some/folder/thatpage.html. However, as I said above, they will still get to see duplicate content: if URLs both with and without the www can be synthesised from the internal links, and both are returned as "200 OK", both will be assigned separate DocIDs, and this process does NOT, and cannot, cure that particular problem.

.

>> Gbot does not create a docID for the page until it gets to the absolute url. <<

True. But the www URL version of the page will have a different DocID to the non-www URL version of that same page.

A URL with the same parameters in a different order to some other URL using the same parameters, but rearranged in some other order, will also have a different DocID for each parameter order that Google finds. Here is your duplicate content.

.

>> When the web server delivers a page, it also delivers the fully resolved absolute URL of the page, including any corrections to improper capitalization that may have existed in the link but has been resolved by the redirect or by the internal workings of an IIS server. <<

It only changes the URL in the browser if there is a redirect installed on the server. Otherwise, if you ask for WiDgEt then that is what is returned, and if you ask for WIDget then you get that instead. Both will be returned as "200 OK" and both will be indexed as separate URLs and separate Documents - duplicate content. Each will have their own DocID. There is no process to say "these are the same thing and both should really be known as 'widget'", other than the webmaster setting up a 301 redirect to say it is so. The only thing that can solve this type of canonicalisation issue is a 301 redirect.

.

>> lack of clarity about the difference between a relative url and an absolute url <<

I do not know why you are talking about absolute and relative URLs. They have not featured in this discussion in any way until now. We know that Google resolves relative URLs into a full absolute URL with the domain included. That is not an issue.

.

>> Therefore, true duplicate content can only be created by a webmaster storing the same content on different absolute urls, or physical web pages, and cannot be artificially "created" by logical associations of multiple relative urls or paths to the same absolute url. <<

Yes, multiple absolute resolved URLs is what we are talking about, and there are many ways to create those for a single physical file. Those extra URLs, if they return "200 OK" are the duplicate content.

Again, those issues are:
- non-www vs. www,
- different domains (either .com vs. .co.uk and/or best-widgets vs. cheap-widgets),
- different capitalisation of the URL (on IIS),
- same parameters but in a different order,
- slightly different parameters,
- and so on.

Relative and absolute URL links within a site do not figure here at all. We know how those are handled. They are not the problem in and of themselves.

[edited by: g1smd at 11:12 pm (utc) on Nov. 14, 2006]

Gissit




msg:3156652
 10:39 pm on Nov 14, 2006 (gmt 0)

Well, that's a whole load of words. The simple fact is I have multiple URLs that serve the same content, and they result in supplemental pages in the index. g1smd has clearly had a great deal of experience of this if you take a look back through old posts on the subject.

There is absolutely no way for G to know whether these URLs point to the same content (page) or to a separate copy of the same content (page). It cannot possibly somehow join the two and call it one page.

RonnieG




msg:3156761
 12:50 am on Nov 15, 2006 (gmt 0)

There is absolutely no way for G to know whether these URLs point to the same content (page) or to a separate copy of the same content (page).

True, not during the crawl, it doesn't. When a cloned copy of the same content on a different absolute url/page is crawled, a separate docID will be assigned. For multiple relative urls pointing to the same page/absolute url, as long as the redirects are done properly and the absolute url is not masked, the result should be the same absolute url and docID for both, and G will simply replace the previous crawl results with the new results under the same docID. However, not everything G does is done during the crawl. Recognizing that a page is cloned/duplicate content is a separate off-line batch process. When it gets around to its off-line batch processing of the crawl results, G will finally see and recognize duplicate content under more than one docID, and only then will it decide what to do with one page or the other.

g1smd




msg:3156766
 12:54 am on Nov 15, 2006 (gmt 0)

... and what it usually decides to do is to keep one URL in the main index, delete a few of the others, and chuck the rest of the duplicates into the Supplemental Index.

RonnieG




msg:3156786
 1:15 am on Nov 15, 2006 (gmt 0)

Again, those issues are:

- non-www vs. www,
* Agreed, if there is not a proper redirect in place from one to the other, such that only one absolute url can ever be delivered for the same target page.

- different domains (either .com vs. .co.uk and/or best-widgets vs. cheap-widgets),
* Agreed, if the different domains are not properly redirected to the domain/page that holds the actual content, such as if domain name masking or absolute url masking is employed within the target page(s). If such masking is employed, separate docIDs would be created using whatever url the page actually delivers to the spider.

- different capitalisation of the URL (on IIS),
* Agreed for same content delivered by non-IIS servers using the same url with different capitalization, but disagree for IIS servers, since IIS delivers only the one absolute url to the spider, the actual page name, regardless of capitalization variations in the link.

- same parameters but in a different order,
* Agreed, since this creates a different absolute url, and therefore a different docID.

- slightly different parameters,
* Agreed, for the same reason as above.

- and so on.
*?

Third level domains: Same issue as www vs. non www, since www is actually nothing more than a 3rd level domain. Again, no masking of any kind can be employed.

Gee, we really are basically in agreement except on the point of IIS servers!

texasville




msg:3156907
 4:15 am on Nov 15, 2006 (gmt 0)

>>>>>... and what it usually decides to do is to keep one URL in the main index, delete a few of the others, and chuck the rest of the duplicates into the Supplemental Index. <<<<

Or, if this is truly what is happening to the one site I manage on an IIS server... then Google is just chucking everything except one copy, and IT goes only into the supplemental index... since only the index page is in the main index. And believe it or not... the index page is ranking extremely well for all the terms it was designed for. But it still puts all the pages people really want to see into never-never land... never to be seen in a search.

fishfinger




msg:3157060
 9:20 am on Nov 15, 2006 (gmt 0)

Because they are search engines and not directories. Because they give searchers the impression that they can provide the relevant information they are searching for

Anyone (apart from PPC arbitragers) who spends money to be found for a service they don't provide is not going to be around for very long. Are you trying to say bigger companies are less relevant to the searches they target, i.e. that they are less able to provide the services people are looking for? I don't think so. In fact, I think precisely the opposite. They are successful because they are good at what they do.

which we allow to fill their indexes for free while crawling our hard work (our work, our bandwidth=money

Well block them then. Or do you actually WANT the traffic? Well then do what you have to to get it!

Staffa




msg:3157096
 10:30 am on Nov 15, 2006 (gmt 0)

Well block them then. Or do you actually WANT the traffic? Well then do what you have to to get it!

With listings in supplemental there is NO traffic. G is already blocked from a couple of my sites, and if the trend continues it will be blocked from all of them.

Time is better spent finding other venues for traffic than bending over backwards to try to suit G.
