Duplicate Content - Get it right or perish
Setting out guidelines for a site clean of duplicate content
Whitey




msg:3060900
 12:00 am on Aug 26, 2006 (gmt 0)

Probably one of the most critical areas of building and managing a website is dealing with duplicate content. But it's a complex issue with many elements making up the overall equation: what's in and what's out, what's on-site and what's off-site, what takes precedence and what doesn't, how one regional domain can or cannot co-exist with another's content, what percentage counts as the same, and so on, and how the consequences are treated by Google in the SERPs.

Recently, in one of Matt's videos, he also commented that the matter is complex.

When I looked into these forums [ unless I missed something ], I could see nothing that described the elements in a high-level format that could be broken down and translated into a framework for easy management.

Does anyone believe they have mastered the comprehensive management of duplicate content on Google, in a format that can be shared on these forums?

 

Nimzovich




msg:3128251
 10:33 am on Oct 20, 2006 (gmt 0)

I assume that they were indexed simply because the server sent a "200 OK" response code in the HTTP header, and not a 301 response.

I see. Thanks for the explanation :-)
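
( A minimal sketch, not from the thread: one common way to make a duplicate URL variation answer with a 301 instead of a 200 is a mod_rewrite rule in .htaccess. The example.com name below is only a placeholder. )

RewriteEngine On
# send the non-www hostname to the www version with a 301,
# so only one version of each URL ever returns 200 OK
RewriteCond %{HTTP_HOST} ^example\.com [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]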

Whitey




msg:3130560
 9:51 am on Oct 22, 2006 (gmt 0)

I believe there is another angle on duplicate content which has not been emphasised or discussed yet.

We have really only talked about meta descriptions and meta titles in terms of page content ( meta ).

Duplicate or non-original "body" content may be causing many folks a lot of headaches without them realising it, by dumping their results well down the lists. Sure, it doesn't always show as supplemental, but I'm confident the algorithm is picking this up.

However, I question whether a site with a mix of duplicate and original content is assessed page by page, site-wide, or a combination of both.

I also wonder what percentage of a page's content needs to be original and unduplicated for it to be safe.

toothake




msg:3130608
 12:14 pm on Oct 22, 2006 (gmt 0)

This is very important. Can anyone from the Googlemasters, or you guys, confirm that if I fix my duplicate content I won't get another 1 year+ penalty just because I changed 200 pages (i.e. changing titles, meta tags, adding unique content, etc.)?
By the way, I discovered today how my biggest competitor got away from a dup penalty, a very clever trick... His site has about 200 pages of content but over 1000 pages that each hold just one picture (some of them PR5... yes sir... and until now all those pages had the same title, like "untitled" or "page 1.htm"). You know what he did?
That is the best trick, I guess, to avoid penalties.
He does not have a meta title, meta description or keywords, so all those pages have just a unique URL. Clever, clever...

[edited by: tedster at 2:03 pm (utc) on Oct. 22, 2006]
[edit reason] fix typo [/edit]

Reilly




msg:3130610
 12:24 pm on Oct 22, 2006 (gmt 0)

guys,

My site contains about 8000 pages, all indexed - and they are all really similar, with only 3 values changing from page to page.

The title is different, the meta keywords are the same, the meta description is the same, and the h1 tag is the same.

Two months ago Google took 1000 pages into the index and then kicked them out; for 2-4 weeks Google took nothing more - but I still left them crawlable - then after a month they were all back in the index, and they are all ranking well.

I think you need to wait 1, 2, 3 months or more. If you think the information is useful for users, leave the pages crawlable and don't change anything - the users will be happy and Google will check it also, but it will take time - more than ever.

toothake




msg:3130615
 12:35 pm on Oct 22, 2006 (gmt 0)

"duplicate content"
Do you want me to show you a million pages that rank #1 with the same description, title and keywords, with only one word changed - just the place name (i.e. "california widgets")?
Well, it happens that those sites have never been penalised, and as for what the Google masters are saying about unique content... just take a look at Wikipedia, which ranks at #2 and #3 for most big cities with the same content and just a different URL. My God, it makes you think: is there any justice on this planet, or is it just a planet of the strong ones?
The recent ~Iraq~ events have proven that the big fish can be defeated.

photopassjapan




msg:3130616
 12:36 pm on Oct 22, 2006 (gmt 0)

Identical meta tags and titles will not make your pages count as "duplicate content". They might be the trigger that matches a page up against other pages, but if the code and content are different, they're NOT duplicate content. But hey... if the code and content are the same as on another page, that site owner shouldn't be whining about the filters, right? :P

So no, identical meta and title do not count as duplicate content.
I've tested it over the last five weeks on our site and on another I'm involved with, so this isn't guessing, it's a fact.

Instead, if these tags are not unique, the pages become "very similar" ( this is a Google SERP term ), which will not make you supplemental, nor make you fall for the key term; rather it makes all pages except the first - and most relevant - one invisible, until the "repeat the search with the omitted results included" link is clicked.

This can be easily mended by making the meta tags and titles unique; during subsequent crawls the pages will start to appear individually in sitewide and unfiltered searches, where more than one or two URLs can be present from a single domain.

Also, the pages that were "very similar" may then outrank or accompany another page on your own domain, in cases where they would have been more relevant but just didn't have their other parameters ( PR, for example ) high enough to be displayed, because of a higher ( directory ) level URL with which they shared the same title/meta combo.

This issue does not bring "penalties", and it is very fast to resolve ( the next crawl of the given page ), so it is in fact different from the effects of the duplicate content filters, which are there to comb out stolen, recycled, scraped content.

...

There is probably no "partial duplicate content" percentage either.
Imagine doing searches on partial strings the length of half an article. Even if it were scaled, and comparisons were only made within an already decided relevancy level ( i.e. sites with similar or identical keywords get compared to each other ), the CPU load would be enormous.

As for proof of this latter point, I have none, but a couple of weeks ago, when this was discussed here on WebmasterWorld, I did a search on an... ahem... "article" of mine :) that is more or less personal experiences in an area, and got some... four pages that had stolen entire paragraphs of the text. Very little additional content was added, and the code was obviously different because of the design, but... still. It was stolen.

And seeing that such a random little text can end up being used as the mission statement of a company... talk about plagiarism. I mean, what does it tell others when someone's mission statement is copy/pasted from another page on the net =D

But all in all, none of those pages were excluded from the G index ( that's how I found them ); they had TBPR, IBLs and all. Even though they were pretty much... THE article from a VERY often crawled other page.
If the code ( i.e. design ) is different, the title and meta tags are different, and there is additional content added, no matter how little, G still has a way to go in identifying duplicates.

But that kind of media-watch is easy to do by hand; then report it to whoever you want ( the ISP, registrar, G, the "author", your attorney... ) and see if it makes you any happier that you busted a crappy site that probably wasn't competition for you at all.

It's the automated content "mixing" that I'm curious how Google filters out. I mean, it obviously has no means of catching scraper sites disguised as blogs, for they steal a little from here and put up a little RSS from there...

...some are so cunning in trying to become original enough for Google that they become original enough for visitors too :D

They're like portals from the old days :)

Okay kidding aside, Google, bust them, will you? :P

spina45




msg:3130968
 8:25 pm on Oct 22, 2006 (gmt 0)

I've read this and other threads (until my brain got squishy!) and yet could not find the answer I need. I'm hoping someone here can help...

Brief Background:
I'm not a programmer, nor super technical. I launched an HTML site six years ago using a desktop application for building websites. I was selling widgets and accepted PayPal. After performing SEO activities I ranked very well for broad-based keywords. Then I hired a web developer to build me an OsCommerce shop. I was told that because the OsC URLs contain question marks (see example URL http://www.example.com/shop/product_info.php?products_id=123), the pages would not be indexed by major search engines.

My web developer installed a module that converted "?products_id=123" to "/product-123.html" and created a sitemap for spiders to follow. Very soon I began to appear in search results based on individual product name searches. This was all very good and revenue soared. I did notice, however, that what the search engines were displaying was the "?products_id=123" version and NOT the "/product-123.html" version.

Back then nobody was talking about duplicate content.

Now, within the past month, I have almost vanished from Google (still okay in other SEs). I'm beginning a rigorous project of cleaning up any duplicate content and applying "noindex" to URL variations that produce the exact same page.

So, my question is this: should I standardize on the "?products_id=123" URL format (which seems to get indexed just fine and produced excellent results until just recently) or the "/product-123.html" format (which I rarely saw appearing in my previous high-ranking results)?

Does anyone have personal experience with this?
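
( A minimal sketch, not spina45's actual module: this is roughly the kind of internal rewrite such SEF-URL modules set up, written as if it lived in the shop directory's .htaccess. Because it is a silent rewrite rather than a redirect, both the "static" and the dynamic URL end up serving the same page, which is where the duplicate content risk comes from. )

RewriteEngine On
# hypothetical: map the "static" product URL back onto the real script internally,
# with no redirect, so both URL forms return the same content with a 200
RewriteRule ^product-([0-9]+)\.html$ product_info.php?products_id=$1 [L]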

g1smd




msg:3130974
 8:33 pm on Oct 22, 2006 (gmt 0)

>> http://www.example.com/shop/product_info.php?products_id=123, the pages would not be indexed by major search engines. <<

Search engines index URLs with a ? and dynamic parameters just fine. The problem comes when more than one URL leads to the same content: using either slightly different parameters, or the same parameters in a different order.

Standardise on one format, whatever it is, and then stick with it. If possible, avoid underscores in URLs. If you are already well indexed using underscores then stick with it. Always avoid using spaces in URLs.
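
( A minimal sketch, not from the thread, assuming the "/product-123.html" format is the one chosen and the module's internal rewrite for it is already in place; the paths and www.example.com below are placeholders. Matching on THE_REQUEST, as in the code quoted later in this thread, avoids a redirect loop because it only looks at what the visitor originally asked for, not at internally rewritten URLs. Written as if it lived in the shop directory's .htaccess: )

RewriteEngine On
# hypothetical: 301 the old dynamic URL to the chosen "static" format;
# the trailing ? on the target discards the original query string
RewriteCond %{THE_REQUEST} \?products_id=([0-9]+)
RewriteRule ^product_info\.php$ http://www.example.com/shop/product-%1.html? [R=301,L]

( Standardising on the dynamic format instead would simply swap the pattern and the target; either way, only one of the two URL forms should ever answer with a 200. )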

excell




msg:3130976
 8:38 pm on Oct 22, 2006 (gmt 0)

So no, identical meta and title do not count as duplicate content.
I can certainly testify that identical titles & metas do not help!

I had a couple of sites that I launched quite some time ago that were not quite "finished", i.e. their section titles & metas were just copied and pasted, ready to finish later. I got sidetracked and they sat there - all supplemental, all going nowhere anytime soon in the ranking results.

As I get around to doing this final polish, the sites have come on in leaps and bounds... so even if it is not dubbed "duplicate content" (because the page content was different), it sure did have a huge negative effect on the intentions of the pages, judging by the subsequent success of those pages after the final work was done.

Whitey




msg:3131051
 10:20 pm on Oct 22, 2006 (gmt 0)

This issue does not bring "penalties", and is very fast to resolve ( The next crawl of the given page ) thus is in fact different from the effects of duplicate content filters which are there to comb out stolen, recycled, scraped content

I think we're talking about the same thing. Just to clarify, IMO pages/sites that have reappeared may still not have addressed the body content issues which the algo is scoring "penalties" for.

A site that has fixed its meta titles and descriptions, but has identical body copy, is in trouble.

A site that does not have original copy will be filtered lower than competitors with stronger "TRUST" in the priority of its showing [ potentially filtered out ]. Artificially, IMO, the page may be ranked above better sites on the strength of IBLs, but here's an interesting underlying thought:

If the page is sufficiently unique and unduplicated, IBLs would not be necessary [ provided the page was being indexed, of course ].

What does that tell you?

There are still some questions above which may have a critical bearing on this.

spina45




msg:3131060
 10:28 pm on Oct 22, 2006 (gmt 0)

g1smd

Thank you very much for your crisp and clear answer. Most appreciated. Now the work begins!

ideavirus




msg:3139688
 1:27 pm on Oct 30, 2006 (gmt 0)

Hi,

I am using this...


# strip index.html (or index.htm) from the requested URL with a 301
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*index\.html? [NC]
RewriteRule ^(([^/]*/)*)index\.html?$ http://www.domain.com/$1 [R=301,L]

# redirect the non-www hostname to the www hostname with a 301
RewriteCond %{HTTP_HOST} ^domain\.com [NC]
RewriteRule ^(.*)$ http://www.domain.com/$1 [R=301,L]

to do the 301 redirects. However,

domain.com/index.php? and
www.domain.com/index.php?

redirect to www.domain.com/?

Do I have to make any changes to the above code in .htaccess to redirect these to www.domain.com instead?

Thanks for any help.

Cheers

g1smd




msg:3140391
 11:54 pm on Oct 30, 2006 (gmt 0)

That example code doesn't do anything at all for index.php.

You must have placed that other redirect somewhere else in your code.
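
( A minimal sketch, not g1smd's code: an index.php rule modelled on the index.html rule quoted above; www.domain.com is just a placeholder. )

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*index\.php [NC]
# the trailing ? on the target discards any query string from the redirect
RewriteRule ^(([^/]*/)*)index\.php$ http://www.domain.com/$1? [R=301,L]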

< continued here: [webmasterworld.com...] >

[edited by: tedster at 11:17 pm (utc) on Dec. 1, 2006]
