
Google SEO News and Discussion Forum

Duplicate Content - part 2
Setting out guidelines for a site clean of duplicate content
optimist




msg:3140477
 1:07 am on Oct 31, 2006 (gmt 0)

< continued from [webmasterworld.com...] >

I'm gonna go back to an earlier post here:

Elixir

Duplicate content has become a major headache for us. I am referring to stolen content. I spend hours each week sending out cease-and-desists. When there is a major issue, such as another SEO company stealing entire sections of the site, our rankings plummet. We resolve the issue and our rankings come back. I find the SEOs stealing the content and reproducing it on their site while talking about their ethical approach highly offensive. Has anybody ever sued over duplicate content? I wonder if there is a legal case to sue somebody and make it a high-profile case to try and deter companies stealing content as a shortcut. I am not talking about scrapers either, although that happens too; I am talking about entire sections of content stolen with the name of the company deliberately changed. The greatest frustration of all is that when these unscrupulous thieves steal our content, our rankings plummet.

The DMCA procedure takes too long. Has anybody had any experience of reporting the site to the plagiarist's ISP?

At least you're coming back in. This filter is really bad on infringement. Maybe sometime soon someone at G will wake up to issues where there is a registered copyright, use a realistic dating system for the content, and/or loosen the filter on sites that complain and file DMCA complaints.

Google does not do a proper job of recognizing which copy is the original, nor of protecting sites from "infringement ranking attacks".

I am afraid you will only become disheartened with any and all approaches to infringed content, especially if the infringements keep happening.

I also believe this may be a collaborative effort in some cases to purposely take out sites, since Google makes it a bit easier now.

I have found contacting ISPs to be the biggest waste of time. The only thing that works is filing the DMCA and getting it removed. But you lose time doing these, as they are time-consuming and sometimes incomplete.

[edited by: tedster at 11:16 pm (utc) on Dec. 1, 2006]

 

ideavirus




msg:3140572
 2:51 am on Oct 31, 2006 (gmt 0)

Thank you g1smd.
Can I post the complete code in my .htaccess file, so that you can view it and tell me how to fix it?

Otherwise, what would be the code for a 301 redirect from

/index.php?
example.com/index.php?

Actually, the redirect from index.php [and all combinations of this] to www.example.com started working only after I added your code.

Thanks for any help.

jd01




msg:3148370
 6:48 am on Nov 7, 2006 (gmt 0)

RewriteCond %{HTTP_HOST} ^domain\.com [NC]
RewriteRule ^(.*)$ http://www.domain.com/$1 [R=301,L]

The above should redirect anything on domain.com to www.domain.com, including index.php... (The only page not affected would be index.html, which is being redirected by the previous rule, which also applies the www.)

The other .htaccess solution for non-www to www is a negative match:

RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^www\.domain\.com [NC]
RewriteRule (.*) http://www.domain.com/$1 [R=301,L]

Keep in mind the second set will affect sub-domains, e.g. anything-here.domain.com in most settings will be redirected to www.domain.com.

Justin
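
For the /index.php part of the question above, here is a minimal sketch, assuming index.php is the DirectoryIndex and using example.com as a stand-in for the real domain. The THE_REQUEST condition makes the rule fire only on what the visitor actually requested, so the internal index lookup for "/" does not loop:

RewriteEngine On
# Only act when the client literally asked for index.php,
# otherwise the DirectoryIndex subrequest for "/" would loop.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/index\.php[?\s] [NC]
# The trailing "?" on the target drops any query string from the redirect.
RewriteRule ^index\.php$ http://www.example.com/? [R=301,L]

Whether to keep or drop the query string depends on the site; remove the trailing "?" if the parameters should be carried across.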

Whitey




msg:3153427
 11:01 am on Nov 11, 2006 (gmt 0)

Links pointing to your site through "faulty" jump or tracking scripts, mistakenly using a 302 redirect from the referring web page to your site, could be creating duplicate content.

By its more popular name, it's called a "302 hijack" of your content and your PR, where Google can be made to think that the referring page holds the original content.

Potentially this can invoke a duplicate content filter on your site, but also lower your positioning in the SERPs behind the referring site.

[webmasterworld.com...]

Suggestions for getting this removed would be to tell the webmaster that they do not have your permission to link in this way, or, if the site is acting with malicious intent (spammers or unscrupulous SEOs), to report it to Google and seek to have it banned for spam.

[edited by: Whitey at 11:02 am (utc) on Nov. 11, 2006]

photopassjapan




msg:3159768
 12:44 pm on Nov 17, 2006 (gmt 0)

We finally got our first faulty link ;)

Although it's not a 302 redirect, just someone copy-pasting a URL with an anchor in it as the link to our site. The fun thing is that they used the link to the very paragraph on the page that tells them how to link to us PROPERLY :D ... with four versions and even the CODE shown in text fields to make extra sure. Aaaagh... okay.

You know... example.com/page.html#whatever

Since the site has internal links pointing to this page both ways, page.html and page.html#whatever, and only page.html has a cache and backlinks (and both show the same TBPR), I concluded that G can tell there's just no difference: it does not even index the anchored URL, let alone compare the two and decide it's duplicate content. Which makes sense, as it would otherwise have to penalize half of the net, so...

So this can't hurt us.

...

...

...

...( waiting for people to raise their voices )...

...?

Or am I gravely mistaken with this <:D

g1smd




msg:3162442
 2:12 pm on Nov 20, 2006 (gmt 0)

Everything after the # is simply ignored.

Check your log files. Do bots even request the # part?

I suspect not. Too busy to look myself.

Whitey




msg:3173090
 5:22 am on Nov 30, 2006 (gmt 0)

Do pages showing old cached copies in Google's cache, with old/wrong URLs on them [e.g. /index.htm instead of "/"], indicate that Google has not yet recognized that we have fixed our duplicate content problem?

When I checked with the site: tool [i.e. site:oursite.com], it showed no cache for a particular page.

When I go to the web page and check the cache via the Google Toolbar, it shows an old cache from around 15 May 06, when we had incorrect links.

[edited by: Whitey at 5:22 am (utc) on Nov. 30, 2006]

g1smd




msg:3173526
 2:26 pm on Nov 30, 2006 (gmt 0)

If the page is shown as a Supplemental Result then it is likely that it sticks in the index for about a year, then gets silently dropped.

Whitey




msg:3174189
 10:49 pm on Nov 30, 2006 (gmt 0)

Sorry - I'm not quite clear.

Firstly, I don't understand what's in the index: the site: tool says the page isn't indexed [not even supplemental], yet the Google Toolbar shows an old cache.

If the old cache sticks around in the index for a year, as you say, does that mean this page/URL will be considered a duplicate for that long?

[One thing we did do was redirect all those "bad" links to "/" in any event.]

Or, if the new page shows in either the site: tool or the Google Toolbar, will this new page take precedence in the results/index over the old cached page [which I presume will become a supplemental result]?

[edited by: Whitey at 10:50 pm (utc) on Nov. 30, 2006]

tedster




msg:3174231
 11:23 pm on Nov 30, 2006 (gmt 0)

Google just keeps historical records, Whitey. If you've fixed the root cause of the problem (with redirects, 404, robots.txt, noindex meta tags, whatever) then those historical records should have no further negative effect.

Once the new content is indexed, then that is what gets top consideration for search results. A supplemental result is a URL PLUS a cached date.

the current version of the page should be in the normal index, and the previous version of the page is held in the supplemental index...

Supplemental Results - what exactly are they? [webmasterworld.com]


g1smd




msg:3174232
 11:23 pm on Nov 30, 2006 (gmt 0)

How do you get to the cached page, if there is nothing in the index, nothing to click in the SERPs?

... oh, through the toolbar.

Hmm, that would appear to be a bug of some sort.

I would think that the page isn't harming things; the cached data just hasn't been cleaned up yet, even though all the references to it have been.

photopassjapan




msg:3174300
 12:32 am on Dec 1, 2006 (gmt 0)

Oh er...
I'm seeing this too actually.

Pages that have a cache, have been crawled recently, and don't show up in the index, not even as supplemental, not even for unique searches (I was trying to use the site search feature on our own pages... *sobs*). I have one or two guesses as to what this could be, though. One is the recent data refresh not being so fresh on all of the servers being queried.

The other one is scary and i won't say it :P

Pirates




msg:3174320
 12:53 am on Dec 1, 2006 (gmt 0)

I am sorry, I haven't read the whole thread, but has anyone noticed the shopping sites having problems with duplicate content? I searched one of their results that I actually knew the history of: they changed the title and meta description for each duplicate page and have managed to serve up four or five duplicate pages per keyword. When you click "similar pages" on Google you see them with the main result. The site was owned by e-(fill in the blank).

Whitey




msg:3174377
 2:14 am on Dec 1, 2006 (gmt 0)

OK - I'm clear

It bothers me that Google's various readings [site: tool and toolbar] conflict with each other and with what may really be happening.

It's been going on a long time, I think.

ichthyous




msg:3174920
 3:48 pm on Dec 1, 2006 (gmt 0)

I haven't been following the supps issue for a while, but I do think Google is constantly playing with the supps algo. My site went live two months ago and didn't have a single supp for 3K pages. All of a sudden this week I saw a handful of them, and now a few more every day. For some it's clear why they went into supps, as I had mistakenly used the same page title on them, but others are fairly unique content and there doesn't seem to be any rhyme or reason as to why they are in supps. We certainly haven't seen the end of the algo tweaks, and I think any of us who don't have a lot of links coming in or authority status are in for a bumpy ride over the long haul. It does seem like my SERPs change from week to week now.

mcskoufis




msg:3175361
 10:02 pm on Dec 1, 2006 (gmt 0)

I just managed to get out of supplemental hell on my sites. G1smd, Tedster and TheBear, thanks a million for the incredible advice and insight on this issue. This was a life saver.

Basically, all those 404s I mentioned in my previous message on this thread are still in the index, but they seem to have absolutely no effect at all.

For the supplementals that I could NOT delete/rename/change meta tags on/etc., simply a Disallow: line in my robots.txt did the job perfectly fine. They don't seem to affect my rankings at all.

Brilliant!

What is more promising is that my site now ranks for a very generic keyword in the top 40... I wasn't even daring to think about competing against those sites, and now it seems that I am well in there... Not sure if this will stay, but it seems to be improving every 2-3 days...

I really don't know how to thank you guys!
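
A minimal robots.txt sketch of the kind of Disallow line being described; the paths here are placeholders rather than the actual URLs from the site above:

User-agent: *
# Placeholder patterns: substitute whatever duplicate URLs your own system generates.
Disallow: /index.php?
Disallow: /print/

A Disallow line only stops further crawling of those URLs; anything already indexed can take a while to drop out.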

g1smd




msg:3175369
 10:07 pm on Dec 1, 2006 (gmt 0)

Thanks! Just spread the word as to what worked for you, just so I don't have to repeat myself quite so often. LOL.

JoeHouse




msg:3175665
 4:58 am on Dec 2, 2006 (gmt 0)

Here's one I have not heard yet regarding possible dup content.

What if you send your feed up to, let's say, Amazon or some other shopping engine? In that feed, of course, is the content from your site's product pages.

Does Google consider this dup content, because the descriptions of the products you sell on the shopping engines are the same as the ones on your actual website?

Do you get penalized for this? Will it cause your site to go supplemental?

mcskoufis




msg:3175911
 3:02 pm on Dec 2, 2006 (gmt 0)

I would say it comes down to getting to know what URLs are generated by your system. I have developed all of mine and my customers' websites using Drupal, which I highly recommend. It is just that sometimes even if the answer is right in front of your eyes you just can't figure it out.

However, this problem applies ONLY to new or low-PageRank websites. If you take Drupal's homepage as an example (http://drupal.org/), down at the bottom of the page you can see the links to older homepage content. All 60 of those pages have the same titles and no meta tags, and I don't think anyone would argue that this site is being penalised...

Google has page 59 in its cache...

So, I guess the best way to find out is to download yourself a copy of Xenu LinkSleuth, run it on your site, and carefully examine all of your site's URLs for ones which have identical titles, or which should never have been indexed by search engines.

Now, my other site, which is a PR 5 and has been around for about 6-7 years, was never affected by the switch to Drupal... Even a massive URL restructuring had absolutely no effect on its SERP ranking for a generic and highly competitive term. The site had over 15,000 duplicated URLs in Google's index and only has 3,000 pages.

JoeHouse, those feeds do not seem to be causing any problems for the ranking of the pages they appear on. This is not exactly duplicate content; otherwise all directories would be duplicates. Also, using an affiliate datafeed allows you to use the SQL LIKE operator to make custom pages, which can be unique to your site, provided you pay attention to your titles, descriptions and opening paragraphs.

G1smd, the number of times you have repeated yourself on this and other "sandbox" style threads is countless... However it did take me a while to figure out what you were saying, even though I was the developer of the site in question.

Many thanks again!

g1smd




msg:3175912
 3:03 pm on Dec 2, 2006 (gmt 0)

It depends on the rest of the code and content on their version of your page. If their page is an "exact duplicate" of yours, then you would be in big trouble as they are likely to have more PR and more "authority" than you. Yours would likely disappear.

However, in reality, their page is likely to be coded quite differently, have different navigation, and maybe some extra stuff wrapped around your content, and so your pages are likely to survive in the index -- but their version of the page is still likely to rank higher than you for that same content.

If their version of the page contains a direct link to the exact page on your site where the same content resides, Google might be able to figure out that you are the originator.

mcskoufis




msg:3175928
 3:26 pm on Dec 2, 2006 (gmt 0)

You got me wrong here... I am not saying that my site has the same content as Drupal's... It is just the architecture that is the same.

Check the following URLs:

[drupal.org...]
[drupal.org...]
[drupal.org...]

It uses exactly the same architecture as my site. News items are posted on the homepage and on the page of each news section, with that pager thing at the bottom which has links to older content, like the examples above.

They have exactly the same title, no meta tags, and a list of articles...

This is apparently part of what was hurting my site in Google. Being a Drupal developer, I have seen one of those pages rank high for something I was searching for on numerous occasions. So I am confident in saying that they have no problem ranking.

JoeHouse




msg:3176187
 9:07 pm on Dec 2, 2006 (gmt 0)

I just called and had my account moved to inactive on Amazon.

The reason for this is that I compared the content of all the sites in my industry listed on Amazon that had the same content on both Amazon and their own website, and they are all supplementally indexed by Google.

This is a clear indication that something triggered it. It appears Google is giving Amazon credit for my content, treating them as the original owner of it (when it is mine), while placing me in the supplemental index.

Wow what a mess!

JoeHouse




msg:3178003
 10:08 pm on Dec 4, 2006 (gmt 0)

g1smd

You Said:

"It depends on the rest of the code and content on their version of your page. If their page is an "exact duplicate" of yours, then you would be in big trouble as they are likely to have more PR and more "authority" than you. Yours would likely disappear.

However, in reality, their page is likely to be coded quite differently, have different navigation, and maybe some extra stuff wrapped around your content, and so your pages are likely to survive in the index -- but their version of the page is still likely to rank higher than you for that same content.

If their version of the page contains a direct link to the exact page on your site where the same content resides, Google might be able to figure out that you are the originator."

Would that also apply to products listed on Froogle? I would not think so, seeing that Froogle is part of Google.

So then, why would different rules apply on different shopping sites such as Amazon vs Froogle?

Spine




msg:3178076
 12:04 am on Dec 5, 2006 (gmt 0)

I wonder if the outbound link pattern on some of my translated pages got me in trouble on a site.

I added a directory with a different language version of my pages. The translated pages would have had the exact same outbound links (different on each page) as the English version did. They would also have shared the same photos that were linked to external sites.

Since shortly after adding the translated version, I've had some issues with supplemental listings, and my pages being filtered under the site:command as similar.

Having another site totally copy my content (including architecture, navigation, the whole bit) didn't help much this summer either I'm sure.

Tami




msg:3178999
 7:06 pm on Dec 5, 2006 (gmt 0)

RewriteCond %{HTTP_HOST} ^domain\.com [NC]
RewriteRule ^(.*)$ http://www.domain.com/$1 [R=301,L]

The above should redirect anything on domain.com to www.domain.com, including index.php... (The only page not affected would be index.html, which is being redirected by the previous rule, which also applies the www.)

The other .htaccess solution for non-www to www is a negative match:

RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^www\.domain\.com [NC]
RewriteRule (.*) http://www.domain.com/$1 [R=301,L]

Keep in mind the second set will affect sub-domains, e.g. anything-here.domain.com in most settings will be redirected to www.domain.com.

Justin,

I have been looking for the code to redirect all non-www to www and, at the same time, not disturb my sub-domains, but I want the index page included. What rewrite code should I use?

Thanks
Tami
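
A minimal sketch of the kind of ruleset being asked for here, with example.com standing in for the real domain and the index filenames assumed rather than known. The positive host matches keep sub-domains untouched:

RewriteEngine On

# Collapse direct requests for the index file to "/" first, so it takes one hop instead of two.
# THE_REQUEST is tested so the internal DirectoryIndex lookup does not loop.
RewriteCond %{HTTP_HOST} ^(www\.)?example\.com$ [NC]
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/index\.(php|html?)[?\s] [NC]
RewriteRule ^index\.(php|html?)$ http://www.example.com/ [R=301,L]

# Redirect only the bare domain; anything-here.example.com is left alone.
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

Only one way of doing it, of course; worth testing against a copy of the site before going live.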

jobonet




msg:3179045
 7:51 pm on Dec 5, 2006 (gmt 0)

I also posted this on Google Groups > Google Webmaster Help > Suggestions and feature requests

Use sitemaps to gain efficiency and relevance

I use Sitemaps to map over 15,000 URLs. I think that Google should use Sitemaps not only to include "hard to find" URLs, but to exclude URLs that a webmaster does not want in SERPs. Some dynamic sites using PHP or ASP have problems with duplicate content because of URLs that contain queries, case issues, http vs. https, and www vs. non-www, all of which result in the same data (page) being displayed for different URLs.

In my case, I have changed my URLs from php extensions to html (example: search.php?make_select=Chrysler&model_select=300 to Chrysler-300.html). Months after this change, and the subsequent three-ring circus of mod_rewrite through .htaccess, I still see SERPs containing the php URLs, and Googlebot still parses them.

Not every programmer/webmaster is versed in mod_rewrite, robots.txt, meta tags and the like, and Sitemaps would be an easy way for webmasters to determine which of their URLs should be parsed by your bot and end up in SERPs, and which should not. This would also remove the possibility of dupe content issues that arise from outside links from AdWords, or from blog and forum entries where a typo or an unscrupulous competitor might point to a URL with an appended query, use https or the IP address, leave off the www, or change case from upper to lower, resulting in duplicate SERPs, bot parsing, penalties, etc.

I think that this would effect an extraordinary increase in efficiency for bot crawling and storage of URLs, and would also enhance accuracy by allowing more relevant URLs to rise to the top that would otherwise be buried in supplemental results. This would also give webmasters and programmers back millions of hours to devote to content and organization instead of the Google dance.
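
For what it's worth, a rough sketch of the kind of .htaccess mapping described above, using the parameter names from the quoted example URL (make_select / model_select) and example.com as a stand-in. The second rule assumes the new .html names are still served by the old script, which may not match the actual setup:

# 301 the old dynamic URL, but only when the visitor actually requested search.php,
# otherwise the internal rewrite below would loop back into this redirect.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/search\.php\?
RewriteCond %{QUERY_STRING} ^make_select=([^&]+)&model_select=([^&]+)$
RewriteRule ^search\.php$ http://www.example.com/%1-%2.html? [R=301,L]

# Internally map the new static-looking name back onto the script (no redirect).
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^([^/-]+)-([^/]+)\.html$ search.php?make_select=$1&model_select=$2 [L]

The trailing "?" on the redirect target drops the old query string, so the URL Googlebot is sent to is the clean one.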

Web_Savvy




msg:3184511
 9:29 pm on Dec 10, 2006 (gmt 0)

I just managed to get out of supplemental hell on my sites.

mcskoufis, please tell us how long it took you to do this.

JoeHouse




msg:3185074
 1:11 pm on Dec 11, 2006 (gmt 0)

I am starting to think this whole duplicate content issue is not true at all.

My current website, which is about 18 months old, has 100% unique content. All but the homepage is listed in the supplemental index.

I also have another website, about 4 years old, which has 100% DUPLICATE CONTENT taken from manufacturers' product descriptions.

The website with all the duplicate content has ALL ITS PAGES listed in the main index and ranking very well.

So I ask you: why? The only difference is the age of the websites. The newer site is all supplemental, while the much older site is in the main index.

I am starting to think that it's a Google issue and they cannot handle all these pages and websites.

Think about it for a moment: a large majority of older websites do not seem to be in the supplemental index, while many newer sites are. That tells me clearly that Google is struggling to handle the workload and is reaching for public-relations excuses as damage control.

photopassjapan




msg:3185092
 1:39 pm on Dec 11, 2006 (gmt 0)

I guess there just isn't enough CPU to recompute the entire index against this filter, so instead G checks the pages it indexes and reindexes, and matches them up to others, based on a couple of easy-to-spot parameters. I mean, there have to be some flags that trigger the algo to even bother checking a page against another one, and that let it at least guess which page that "other one" should be.

Title and meta may be a trigger, for example. If either is a match in the index, they check the contents; if not, they don't. It's just not possible to check every page against every other page on the net with some search-for-this-string type of script.

...

It's so not possible.

photopassjapan




msg:3185147
 3:01 pm on Dec 11, 2006 (gmt 0)

Oh some other things.
( trying to clean up our initial mistakes... but... )

-inurl:www doesn't seem to work for us anymore.

At least it won't exclude the URLs that have www. as their subdomain :P

It mixes up both www and non-www, and then, unless there are fewer, it says there are about 1000 results, which, when looked up directory by directory, adds up to more like 1600+... which is funny, because it's pretty accurate when the number is below 1000.

While trying to figure out how this could be related to the "set domain preference" in GWT, which will of course only DISPLAY all non-www's as www's (or so they say -.-)... I noticed two things. The first was whether these were "historic supplementals", as g1smd used to call them. Nope, these are www versions; some URLs never even existed in a non-www fashion, and there's a .htaccess in place, working just fine and redirecting to www, ever since. The second was the "these terms only appear in links to this page" thing.

Er..
Which is just wrong.

For example, I was looking up a pic on a page and entered its description (the text on the page) to see whether it's the www version that's in the index or not... the description was... let's say...

"bla ahem cough yup nope whatever"
So G highlighted these words one by one with those ugly primary colors on the page, then... kindly informed me that "ahem" was ONLY found in links to this page. Although it had highlighted it like 3 times in the text.

...

I was worried for a second when we saw that sooner or later we'd have more non-www pages indexed than www ones (even though we set up the redirect and the preferred domain over three months ago, and there are probably few or no links without www)... but now I'm soooo relieved.

It's not that our pages haven't been indexed since August, nor that they keep being indexed on example.com instead of www.example.com.

Rather it's just that we won't be able to CHECK this anytime soon.

Are we doing something wrong?
Or shouldn't I even bother...

...

Does anyone know the issue with this -inurl:www / GWT preferred domain stuff?
