Link Popularity vs Duplicate Content

Am I understanding this correctly?


spina45

8:28 pm on Dec 5, 2006 (gmt 0)

10+ Year Member



Because my rankings recently got hammered, I've begun work on reducing duplicate content. I have an osCommerce store and have many duplicate pages as a result of "osCsid" and "cPath" in the indexed URLs.

I modified my robots.txt to "Disallow" crawling of both the "osCsid" and "cPath" URL variations. I thought this was a good thing.

I just noticed that my PR dropped.

I then did a Link Popularity check and saw that there are a few thousand fewer links pointing to my site, roughly the same number of URLs that robots.txt is now disallowing.

It seems that I gave myself more incoming links and higher PR by having duplicate content. Is this correct? And which is worse: reduced link popularity, or multiple URLs serving up the same content?

tedster

8:03 pm on Dec 10, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Did you disallow thousands of URLs with that change to robots.txt?

photopassjapan

8:49 pm on Dec 10, 2006 (gmt 0)

10+ Year Member



Uh... what?

:)

You mean you had your PR updated for existing URLs just recently?
Are you sure?

Isn't it that you chose the wrong set of URLs to disallow? I mean, you had duplicates of pages... so there was more than one choice on what to drop. What did you base your decision on when choosing which ones to keep?

It could be that the ones you disallowed actually had a higher PR, or incoming links or... whatever.

When i read your post i kinda thought that you had actually made a 3rd kind of URL for each page, disallowing the previous two. In that case i wouldn't be surprised that the PR is gone. ( Why not make a sitewide redirect for the URLs if the pattern is this simple? )
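
( For what it's worth, here is a minimal sketch of the logic such a sitewide redirect would implement, in Python rather than whatever osCommerce itself would use: strip the session parameter and 301 to the clean URL. The osCsid name is from this thread; the canonical_url helper and the example URL are made up for illustration. )

# Sketch: compute the canonical target a sitewide 301 would point at
# by stripping session parameters from the query string. "osCsid" comes
# from the thread; everything else here is hypothetical.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

SESSION_PARAMS = {"osCsid"}

def canonical_url(url):
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in SESSION_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(query)))

print(canonical_url("http://example.com/product_info.php?products_id=42&osCsid=abc"))
# -> http://example.com/product_info.php?products_id=42
# If the requested URL differs from its canonical form, answer with a
# 301 and a Location header pointing at the canonical URL.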

I think a domain voting for itself was the second thing G fixed in 2001. Unless of course your main page doesn't have any incoming links and only subpages do, and they pass it up to the root level, but... i thought that this was impossible. Okay, at least i never saw any site that worked like this.

I'm just guessing though :P

tedster

10:25 pm on Dec 10, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> I think a domain voting for itself was the second thing G fixed in 2001.

Internal links do pass PR within a domain. That's one way Google sorts out your hierarchy of URLs.

photopassjapan

2:16 am on Dec 11, 2006 (gmt 0)

10+ Year Member



Yeah, that's not what i meant. Those internal URLs will have no PR to pass on to the home page though, unless the home page had passed it on to THEM first, OR they had an incoming link directly pointing at the URL from another domain.

So what i meant was: if there were NO incoming links to these URLs, then the idea that removing them ( and their links, their votes for the home page ) could cause the home page to have a lower PR than before just sounds silly. For in that case, reversing this logic, having umpteen million pages with no PR pointing to their own domain root could "raise the importance" of the home page, which was - i believe - one of the first things G filtered when they saw people actually doing this back in the early days...
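
( To put a toy number on that: here's a quick power-iteration sketch using the classic PR formula, PR(p) = (1 - d) + d * sum of PR(q)/outlinks(q) over pages q linking to p. The three-page "site" and the numbers are invented; this says nothing about Google's actual graph. )

# Toy PageRank on a made-up three-page site: two dupe pages with no
# external links still push rank at the home page, so disallowing
# them removes those "votes" and the home page's score drops.
links = {
    "home":  ["dupe1", "dupe2"],
    "dupe1": ["home"],
    "dupe2": ["home"],
}
d = 0.85                                # standard damping factor
pr = {page: 1.0 for page in links}      # arbitrary starting scores

for _ in range(50):                     # iterate until (roughly) converged
    pr = {
        page: (1 - d) + d * sum(pr[q] / len(links[q])
                                for q in links if page in links[q])
        for page in links
    }

print(pr)   # home settles near 1.46, the dupes near 0.77; with the
            # dupes gone, home would fall back toward the 0.15 base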

And basically that's why i asked whether any of these dumped URLs could have any incoming links... and whether the right version of the URLs have been disallowed.

I should learn to make my point better i know :P
Or to make a point at all.

spina45

5:08 pm on Dec 11, 2006 (gmt 0)

10+ Year Member



> Did you disallow thousands of URLS with that change to robots.txt?

Yes. I used Google Webmaster Tools to test BEFORE I made my robots.txt change. Here's what I did...

Disallow: /*?osCsid
Disallow: /*?cPath=2*&
Disallow: /*&sort=

By doing this I removed 1000s of duplicate content URL variations. (Yes, I removed "bad urls" and tested that "good urls" could still be crawled.) It seemed like a good thing to do re: Dupe Content. In some cases the osCsid variable was producing 10+ different URLs for the same page.
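
( In case it helps anyone sanity-check rules like these outside of Webmaster Tools: a small Python sketch of Googlebot-style wildcard matching, where "*" matches any run of characters and the rule is anchored at the start of the URL. The rules are the ones above; the sample URLs are invented. )

# Checker for Googlebot-style wildcard Disallow rules. This illustrates
# the matching semantics only; it is not any official parser, and the
# sample URLs are made up.
import re

rules = ["/*?osCsid", "/*?cPath=2*&", "/*&sort="]

def blocking_rule(url):
    for rule in rules:
        pattern = re.escape(rule).replace(r"\*", ".*")   # '*' -> '.*'
        if re.match(pattern, url):                       # anchored at start
            return rule
    return None

for url in [
    "/product_info.php?osCsid=abc123",    # blocked by /*?osCsid
    "/index.php?cPath=22&sort=2a",        # blocked by /*?cPath=2*&
    "/index.php?cPath=22",                # "good url": not blocked
    "/product_info.php?products_id=42",   # "good url": not blocked
]:
    print(url, "->", blocking_rule(url))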

In some cases people had posted a link on their site back to my product page -- and the link included one of the variables I listed above (i.e. "bad url".)

Anyway, my rankings are so hammered lately... I'm doing KW searches that previously listed my site in the top 5 results. My search KWs are in my Title, Desc, and Page Content. Now I can't find myself unless I include my domain name in the search. New results that are appearing on page 1 don't even have these KWs in the Title, Desc, etc., and only some of the KWs are in the page text.

Also, there is a site appearing on page 1 results for my previous stellar KWs that is loaded with hidden text and URLs (Ctrl-A produces a TON of junior-grade black hat methods).

Sorry to go off-topic above, but it's frustrating because it seems Google is broken or, perhaps, there is a random rotation of suppressing ecommerce sites to increase their AdWords revenue. (?)

g1smd

11:53 am on Dec 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



That is similar to the sort of work that I recently did to a vBulletin forum to get hundreds of thousands of duff URLs out of the index, leaving only 45 000 threads (with just one URL for each) and the thread index pages listed.

spina45

2:44 pm on Dec 13, 2006 (gmt 0)

10+ Year Member



> to get hundreds of thousands of duff URLs out of the index

I've noticed that MSN and Yahoo don't seem to pay attention to the "Disallow" directive in robots.txt. Is there a way to "disallow" across the SE spectrum?

g1smd

4:39 pm on Dec 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



>> don't seem to pay attention <<

I think that it is more a case of "... take for freakin'-ever to update the status ..."
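
There's one more thing worth checking here. The original robots.txt standard has no wildcard support at all: a Disallow value is a literal path prefix. Googlebot honours "*" as an extension, but a crawler doing plain prefix matching would read /*?osCsid as a path that literally starts with "/*?osCsid" and block nothing, which would look exactly like the rule being ignored. A toy comparison of the two behaviours (the URL is made up):

# Why a wildcard rule can look "ignored": under the original standard a
# Disallow value is a literal path prefix, so '*' is just a character;
# Googlebot's extension expands '*' to match any run of characters.
import re

rule = "/*?osCsid"
url = "/product_info.php?osCsid=abc123"

prefix_blocked = url.startswith(rule)        # False: '*' taken literally
wildcard_blocked = bool(
    re.match(re.escape(rule).replace(r"\*", ".*"), url)
)                                            # True: '*' expanded to '.*'

print(prefix_blocked, wildcard_blocked)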

spina45

7:37 pm on Dec 13, 2006 (gmt 0)

10+ Year Member



> "... take for freakin'-ever to update the status ..."

I notice that MSN's caches of my pages are dated AFTER I made the "Disallow" change to my robots.txt file.

On the positive side, when searching MSN for my KWs, I'm not deluged with eBay pages containing expired auction results and spammy portal pages. I.e., MSN is delivering more useful results for my KWs than Google is. I wish the NY Times or Wall Street Journal would test the SEs and write a story about Google's diminished SERP quality. They are probably shareholders, so that won't happen!