Google SEO News and Discussion Forum
Adam Lasnik on Duplicate Content
tedster - msg:3192969 - 6:06 am on Dec 19, 2006 (gmt 0)

Google's Adam Lasnik has made a clarifying post about duplicate content on the official Google Webmaster blog [googlewebmastercentral.blogspot.com].

He zeroes in on a few specific areas that may be very helpful for those who suspect they have muddied the waters a bit for Google. Two of them caught my eye as being more clearly expressed than I'd ever seen in a Google communication before: boilerplate repetition, and stubs.

Minimize boilerplate repetition:
For instance, instead of including lengthy copyright text on the bottom of every page, include a very brief summary and then link to a page with more details.

If you think about this a bit, you may find that it applies to other areas of your site well beyond copyright notices. How about legal disclaimers, taglines, standard size/color/etc information about many products, and so on. I can see how "boilerplate repetition" might easily soften the kind of sharp, distinct relevance signals that you might prefer to show about different URLs.

Avoid publishing stubs:
Users don't like seeing "empty" pages, so avoid placeholders where possible. This means not publishing (or at least blocking) pages with zero reviews, no real estate listings, etc., so users (and bots) aren't subjected to a zillion instances of "Below you'll find a superb list of all the great rental opportunities in [insert cityname]..." with no actual listings.

This is the bane of the large dynamic site, especially one that has frequent updates. I know that as a user, I hate it when I click through to find one of these stub pages. Some cases might take a bit more work than others to fix, but a fix usually can be scripted. The extra work will not only help you show good things to Google, it will also make the web a better place altogether.
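For illustration only -- this is a minimal sketch of how such a fix might be scripted, with hypothetical function names (robots_meta, render_city_page, get_listings) and nothing prescribed by Google -- a dynamic template can check whether a page actually has listings and emit a noindex robots meta tag when it doesn't:

    # Minimal sketch: keep placeholder ("stub") pages out of the index by
    # emitting a robots noindex tag when a city has no listings yet.
    # All names here (robots_meta, render_city_page, get_listings) are hypothetical.

    def robots_meta(listings):
        # Tell crawlers not to index the page when there is nothing on it yet.
        if len(listings) == 0:
            return '<meta name="robots" content="noindex,follow">'
        return ""

    def render_city_page(city, get_listings):
        listings = get_listings(city)
        head = robots_meta(listings)
        if listings:
            body = "<ul>" + "".join(f"<li>{item}</li>" for item in listings) + "</ul>"
        else:
            body = "<p>No listings yet.</p>"
        return f"<html><head>{head}</head><body><h1>Rentals in {city}</h1>{body}</body></html>"

    # Example: a city with zero listings gets the noindex tag.
    print(render_city_page("Springfield", lambda city: []))

The same idea works for zero-review product pages: generate the page for users if you must, but keep the empty version out of the index until it has real content.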

[edited by: tedster at 9:12 am (utc) on Dec. 19, 2006]

 

rohitj - msg:3194796 - 5:32 pm on Dec 20, 2006 (gmt 0)

Many of you seem to assume that Google can't tell the difference between HTML, JavaScript, and meaningful words. If the same HTML/JavaScript appears on every page of a domain, then chances are that's the site's template -- a very large percentage of sites use templates. Google is not going to penalize for that, and if it did, it would be penalizing a very large portion of its index. That blatantly defeats the very purpose of a penalty.

I'm willing to bet that they implement learning algorithms that can figure out the consistent menus, templates, and structural aspects of a site and ignore those aspects when determining the SERPs. It's not a hard thing to do, and they have the computing power necessary to crawl each site in that kind of depth.

whitenight - msg:3194805 - 5:42 pm on Dec 20, 2006 (gmt 0)

LOL -- then please explain to me what "boilerplate" repetition is?!

If G's algo is smart enough to recognize TEMPLATE content vs. UNIQUE page content...

Why are we even having this discussion?

pageoneresults - msg:3194812 - 5:44 pm on Dec 20, 2006 (gmt 0)

Why are we even having this discussion.

Because the statement from Google was vague and some of us may be misinterpreting what is between the lines in that statement. ;)

calicochris - msg:3194832 - 6:01 pm on Dec 20, 2006 (gmt 0)

I think we're having this discussion because the fear is that we're going to have to do all kinds of redesign if Google's definition of duplicate content is not made a little clearer. They surely (hopefully) can distinguish between template content and content, but I did not see that distinction in the statement. The first thing I need to understand, in a way that I can explain it to customers, is just this basic thing -- what does G consider duplicate content inside of one site or one domain? Surely we could have a clear statement? I don't have too much true duplicate content on the sites that I work with, so I'm not too worried, except if my stuff gets ripped off and published somewhere else.

Duplicate content, in a more pure view of content, i.e., the words and images used to describe our stuff on our sites, is another story altogether I think. The people with the red widget/yellow widget/blue widget have content problems methinks!

CainIV - msg:3194847 - 6:12 pm on Dec 20, 2006 (gmt 0)

Any discussion and input from Google -- a company not traditionally oriented toward communicating with webmasters -- is a good thing.

Marcia - msg:3194849 - 6:14 pm on Dec 20, 2006 (gmt 0)

What is dupe content?

a) Strip duplicate headers, menus, footers (e.g., the template).

This is quite easy to do mathematically. You just look for string patterns that match on more than a few pages.

b) Content is what is left after the template is removed.

Comparing content is done the same way, with pattern matching. The core is the same type of routine that makes up compression algorithms like Lempel-Ziv (LZ).

This type of pattern matching is sometimes referred to as a sliding dictionary lookup. You build an index of a page (dictionary) based on (most probably) words. You then start with the lowest denominator and try to match it against other words in other pages.
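To make the sliding-dictionary idea concrete, here is a toy sketch (my own illustration, not Google's actual routine): split each page's text into overlapping five-word windows ("shingles") and measure how much of one page's shingle set overlaps another's.

    # Minimal sketch of word-window ("shingle") comparison between two pages.
    # Not Google's algorithm -- just the general pattern-matching idea.

    def shingles(text, size=5):
        words = text.lower().split()
        return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

    def similarity(text_a, text_b, size=5):
        a, b = shingles(text_a, size), shingles(text_b, size)
        if not a or not b:
            return 0.0
        return len(a & b) / len(a | b)   # Jaccard overlap of the two shingle sets

    page1 = "below you'll find a superb list of all the great rental opportunities in Springfield"
    page2 = "below you'll find a superb list of all the great rental opportunities in Shelbyville"
    print(round(similarity(page1, page2), 2))  # high overlap -> near-duplicate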


[webmasterworld.com...]

Disclaimer: Conjecture

It seems that nowadays dupes within the same site are just being ignored rather than the site being "penalized" -- if I'm interpreting right. But nevertheless, even assuming that global navigation elements may be stripped away to determine dupes from the remaining page content, there can still be "fingerprints" that are identical within certain site sections or sub-groups of pages, and sometimes what that looks like, by my definition, is what I call "page stuffing." There's a fine line between legitimate white-hat content spamming and crossing over into page stuffing.

Despite much weeping and gnashing of teeth, the supplemental index seems to take care of that very effectively.

rekitty - msg:3194861 - 6:28 pm on Dec 20, 2006 (gmt 0)

Please don't go changing your navigation structure for Google to avoid duplicate content! Of course it's OK to duplicate your nav. The core of the content on the page is what matters. Is the core content duplicate? That's the question you should be worrying about, as many have said.

How Google handles duplicate content has been well established. Go to google.com/patents and search for "google duplicate content" -- the light reading there will answer most of the detailed questions posed here.

Question: do those in the know think Google will be changing their well established approach to duplicate content based on their recent comments?

My guess: I don't really see anything new in their comments, but I could be missing something.

Marcia - msg:3194866 - 6:30 pm on Dec 20, 2006 (gmt 0)

BTW, I'm not saying that near-dups are necessarily what will shove pages into the Supplemental index. But many "page stuffed" or large, highly repetitive sites simply don't have enough Pagerank to support enough PR distribution to a kazillion pages, so the redundancies go south.

Duplicate content, in a more pure view of content, i.e., the words and images used to describe our stuff on our sites, is another story altogether I think. The people with the red widget/yellow widget/blue widget have content problems methinks!

That's an example of page-stuffed sites, and unless there's enough PR it doesn't matter whether it's a database/dynamic issue or deliberate hand-rolled redundancies; the end effect is the same.

That's why, IMHO, a certain very well known site where all kinds of people sell all kinds of stuff that people can bid on can get away with spamming the hell out of Google with multiple near-dups on spammed-out multiple subdomains.

They're over the magic dividing line with high enough PR and authority status. The most I've seen is that useless swill (much of which isn't even available any more) taking up 5 out of the first 10 results on long tail searches. They can get away with it, and when people say things are OK because such and such a site does it (like linking patterns for example) it's an unfair comparison to make for the average webmaster who hasn't crossed over the magic line into immunity - including being immune from getting whomped for duplicate or near-dup content.

[edited by: Marcia at 6:44 pm (utc) on Dec. 20, 2006]

photopassjapan - msg:3194887 - 6:49 pm on Dec 20, 2006 (gmt 0)

...Ooh-kay, I'll ask again then.
Can I quote Adam? I'm not sure everyone read the post ;)
If not, please edit it out.

This filtering means, for instance, that if your site has articles in "regular" and "printer" versions and neither set is blocked in robots.txt or via a noindex meta tag, we'll choose one version to list. In the rare cases in which we perceive that duplicate content may be shown with intent to manipulate our rankings and deceive our users, we'll also make appropriate adjustments in the indexing and ranking of the sites involved. However, we prefer to focus on filtering rather than ranking adjustments ... so in the vast majority of cases, the worst thing that'll befall webmasters is to see the "less desired" version of a page shown in our index.

Doesn't this mean the following:

- If there's a duplicate page on your site (and nowhere else on the net), no problem: they'll list one of the URLs and make the rest either go supplemental or drop out.

- If the site gets a manual review because of its pattern of dupes, and is seen as using this practice to climb up the SERPs with no real content, they apply a manual penalty.

- There are no automatic sitewide penalties for dupe content within the same domain.

...
Isn't this news?

Oliver Henniges - msg:3195088 - 9:04 pm on Dec 20, 2006 (gmt 0)

I have the impression that some of you massively underestimate the level of abstraction Google has meanwhile reached in structural and statistical analysis. Just look at how far Google has come in translating pages between many, many languages!

For instance, there's no need to perform any byte-by-byte comparison in order to discover duplicate (i.e. similar) content; Marcia already pointed this out. Transforms such as the discrete cosine transformation easily reveal interesting results, have long been a core technique in lexical analysis, and are quite fundamental to any search engine. Exchanging some word order here and there doesn't suffice.
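As a small illustration of order-insensitive similarity detection (a sketch of the general idea only -- cosine similarity over word counts, which is simpler than the transforms mentioned above and certainly not Google's actual method):

    # Toy illustration, not any search engine's actual method: cosine similarity
    # between word-count vectors. Reordering words barely changes the score,
    # so a shuffled copy still looks like a duplicate.
    from collections import Counter
    from math import sqrt

    def cosine(text_a, text_b):
        a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
        dot = sum(a[w] * b[w] for w in a)
        norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    original = "great rental opportunities in the city centre with superb views"
    shuffled = "superb views in the city centre with great rental opportunities"
    print(round(cosine(original, shuffled), 2))  # 1.0 -- word order alone doesn't hide duplication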

Similarly, Google wouldn't be where it is if it weren't able to identify your navigation scheme as such and distinguish it from what Adam called boilerplate keyword repetition. I suspect the latter has reached a much more sophisticated level on the successful spammer's side: my own linguistic studies were quite a while ago now, but I assume some very good lexical databases with good semantic entries and cross-connections are available in many places. So all you need is a set of transformational-grammar rules and you can arbitrarily stuff your pages with hundreds of natural-looking lines of "content." I assume many top spammers use such material, continuously tinkering with the on-page optimization coefficients built into those structures.

The problem for Google is that linguistic analysis (i.e., "understanding" natural languages) is always a bit more complicated than letting a machine artificially write nonsense, even considering the latest scientific research. I'd rather interpret Adam's lines as a warning that some major modifications will target that area in the near future, maybe due to recent new patents or breakthroughs in research. Between the lines: don't rest on what your scripts spit out for you if you did not really care for your visitor while coding. Take back what you don't fully control and what you only added to fill your pages with stuff you thought might help your ranking.

whitenight - msg:3195115 - 9:28 pm on Dec 20, 2006 (gmt 0)

Oliver, if what you say is true, then I go back to my original comments in this and other current threads.

Is this a scare tactic?
If so, who are they trying to scare?
If they are trying to scare "spammers", why are they announcing this to them?
(Spammers don't really care about what "might" happen. They work with the algo as it is now and naturally adapt faster because they aren't confined by "scalability" constraints.)

If they are trying to "inform," rather than scare, then darn it, INFORM!

How is this mishmash of half-statements and proclamations of impending doom -- about future penalties and changes that may or may not happen, and may or may not affect your site -- "informing" anyone, if half the people on this thread have no clue what it actually means?

Back to "religious" interpretations of what the Gods on high have meant by sending lightning and thunder, are we?

That's not communication! It's actually the opposite of communication, because it starts misinformation threads like this one (this is a "featured discussion," isn't it?) where we get 50 different opinions about what the augury means?

"I saw one bird fly to the east"
"Nope, I saw a bird fly north and then head south"

[edited by: tedster at 9:59 pm (utc) on Dec. 20, 2006]

steveb - msg:3195122 - 9:37 pm on Dec 20, 2006 (gmt 0)

"I would read this as including things like menus and other navigation features."

That's neither what he said nor what he implied.

Repeating anything over and over obviously has no value for any particular page. The point is that each page has to have a significant "something" beyond the repeated stuff. I see about a third of photo pages deindexed, plus pages whose entire content is a sentence or two of unique text sometimes disappearing. This is bad search engineering on Google's part, because the pages on the sites I'm looking at are all unique; but instead of indexing them, Google is influencing the web by signalling that superfluous drivel should be added to each page before it gets indexed.

Getting rid of URLs that are endless copies of the same product page is a great thing. Getting rid of the *one* URL bearing unique content just because it has only a small amount of text not found anywhere else is brainless.

Whitey - msg:3195184 - 10:32 pm on Dec 20, 2006 (gmt 0)

Every major search engine, including struggling little old MSN, is working with dividing pages into blocks and then separating the common template elements out for separate analysis. Google's been doing it for years. They can recognize a templated menu appearing across the site for exactly what it is.

Tedster -- could you please clarify your understanding of this?

I previously took the SEs to be looking at block elements around the edges of the page, not the central area.

For example, what if the "blocks" are occurring in the central body content area? Suppose a navigation menu [or an article] that produces a large proportion of the text is substantially repeated, once or several times, in the same "central area" of many of the pages. Will those search engines see this as "template" or as "duplicate content"?

Oliver Henniges - msg:3195188 - 10:39 pm on Dec 20, 2006 (gmt 0)

> Back to "religious" interpretations of what the Gods on high have meant by sending lightning and thunder, are we?

In a way, yes. Don't do evil.

> That's not communication!

I recall some "communication" with my wife which indeed wasn't far from throwing lightning and thunder, or knives and plates and TV sets.

Seriously: don't expect Google to communicate any details of the search algos. View it as a partnership. Google tries to find good-quality content on the internet and present it in response to a relevant search query; you try to bring that quality onto your website. Both are for the benefit of your visitor, who may have arrived via Google. It is not your task to tinker with your website based on what you think are important ranking factors, and I'm really wondering a bit that pageoneresults asked for a precise percentage figure. He knows better ;)

I hardly spend five percent of my time following WW threads, and whatever I find here works in the back of my mind rather than immediately influencing my website. I don't know what you are aiming at, but my major task is to run my B&M store. Its website has become a very important factor now, but I think primarily because I try to make my site usable for my visitors and crawlable for a search engine.

I found it quite interesting that SEs are "blind", that text is better than Flash animations, that images require alt text, that pages should if possible be W3C-conformant, and that a DMOZ entry is quite helpful. That was in 2001. I found the importance of meta tags (particularly the title tag) quite interesting, and quite recently the WW people stressed how important it is to have unique titles all over the website (I had them anyway). This year I wrote a little macro to generate my sitemap.xml file and told Google to use my www subdomain as my preferred domain. That's all. I doubt that's more than any beginner's SEO course would cover, and I don't do more SEO than this, except perhaps exchanging some backlinks here and there. For my major niche keyword I've been on the first page for years now, climbed very stably to #1 this year, and am continuously conquering one long-tail sub-niche after another.

It's a partnership in which both partners try to help each other, though the communication is naturally asymmetric. Considering the complexity of the internet, I think Google is really doing a great job, but of course it isn't perfect. Ceterum censeo: 'brutto', 'netto', 'stueck' and 'euro' (gross, net, unit, euro) are stopwords on European commercial B2B websites and hardly ever have anything to do with the website's content. They are not boilerplate; they are required by law under the Preisangabenverordnung. I know these are the words that occur most frequently on my website. Your Webmaster Central console statistics could do better than that.

Marcia - msg:3195197 - 10:49 pm on Dec 20, 2006 (gmt 0)

Every major search engine, including struggling little old MSN, is working with dividing pages into blocks and then separating the common template elements out for separate analysis. Google's been doing it for years. They can recognize a templated menu appearing across the site for exactly what it is.

Theoretical paper (from Microsoft) from 2003:

Vision Based Page Segmentation [research.microsoft.com]

And here's the nitty-gritty of its application (also from Microsoft) for link weighting:

Block Level Link Analysis

ftp://ftp.research.microsoft.com/pub/tr/TR-2004-50.pdf

And a simplified summary [research.microsoft.com].
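For a rough feel of the idea behind those papers (a toy sketch, not VIPS itself -- the real systems segment on visual and DOM cues, and the block texts and threshold here are made up): treat any block that recurs across most of a site's pages as template, and keep the rest as that page's own content.

    # Minimal sketch of template-block detection across a site's pages.
    # Blocks are plain text chunks here; VIPS uses visual/DOM cues instead.
    from collections import Counter

    def template_blocks(pages, threshold=0.6):
        # A block is "template" if it appears on at least `threshold` of the pages.
        counts = Counter(block for blocks in pages.values() for block in set(blocks))
        return {b for b, n in counts.items() if n / len(pages) >= threshold}

    def unique_content(pages):
        boilerplate = template_blocks(pages)
        return {url: [b for b in blocks if b not in boilerplate]
                for url, blocks in pages.items()}

    # Hypothetical site: same nav/footer blocks everywhere, one unique block per page.
    site = {
        "/red-widget":  ["NAV MENU", "Red widget, 3 inches, steel.", "FOOTER LEGAL"],
        "/blue-widget": ["NAV MENU", "Blue widget, 3 inches, steel.", "FOOTER LEGAL"],
        "/about":       ["NAV MENU", "We have sold widgets since 1952.", "FOOTER LEGAL"],
    }
    print(unique_content(site))  # nav and footer stripped; per-page sentences remain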

[edited by: Marcia at 10:52 pm (utc) on Dec. 20, 2006]

Oliver Henniges - msg:3195207 - 10:59 pm on Dec 20, 2006 (gmt 0)

I see at least about a third of photo pages deindexed, plus pages whose entire content is a sentence or two of unique text sometimes disappearing. This is bad search engineering on Google's part

steveb, as I said, lesson one for me five years ago was that search engines cannot recognize images. Of course this is a pity for photo pages, but you are simply expecting too much here.

It seems Google is working very hard on this subject at the moment, e.g.:

[google.com ]
[images.google.com ]

But this whole project will take a while.

tedster - msg:3195210 - 11:03 pm on Dec 20, 2006 (gmt 0)

what if the "blocks" are occurring in the central body content area.

That does sound like a potential duplicate issue to me, Whitey. In fact, I recently had a client make changes in a scenario very much like that, and we saw a boost in Google traffic to the pages involved. Any practice that creates major chunks of identical text within the content area of several pages can be dicey and hurt most of those pages' ability to gather search traffic.

Then again, many sites do not need direct search traffic to absolutely every page. In such cases, perhaps the user would be better served by giving them the duplicate blocks of text. It really is a judgement call sometimes.

I do think the operative word here is "substantial". Certainly a small box with, say, 3 links to the same features on every page is not a problem -- unless the pages otherwise only contain a few distinct phrases.

steveb - msg:3195256 - 11:53 pm on Dec 20, 2006 (gmt 0)

"but you are simply expecting too much here"

Well, obviously not. In fact it's silly to suggest that they suddenly can't do what they have done for years and years.

Again, it obviously is not expecting too much for them to index unique pages, with unique titles, unique descriptions, and unique content.

Let's try and stay real here. They can clearly do this. There is no difficulty at all, and it is user-friendly to do so. The point is they are CHOOSING to de-index pages that merit indexing simply because they are inept at handling real duplicates. They are using a sledgehammer because they are incapable of doing the job right.

(Image labeler isn't involved in indexing web pages so I don't know why that was mentioned.)

Marcia - msg:3195267 - 12:15 am on Dec 21, 2006 (gmt 0)

I don't think it's that they don't recognize images; it may be that a page consists of an image and not much else besides.

It's possible that they're looking for a minimum amount of (unique) text to consider a page having value. There happens to be a paper out there that even mentions a specific number of words, very distinctly, so it's not outside the realm of possibility.

It may come down to looking for "signals of quality."

steveb - msg:3195278 - 12:26 am on Dec 21, 2006 (gmt 0)

The point, though, is that they are now choosing to ignore signals of quality. They regularly ignore unique titles, descriptions, and alt text. They ignore the value of the domain. They obviously ignore the value of the image to users.

The point, again, is that they ignore quality and user considerations because they are inept at reading signals of quality. They instead resort to the sledgehammer tactic of deindexing unique content because they do a poor job with duplicate content.

steveb - msg:3195282 - 12:31 am on Dec 21, 2006 (gmt 0)

And now the karma Gods have another laugh.

I quoted this pile of nonsense by Adam yesterday: "Don't fret too much about sites that scrape (misappropriate and republish) your content."

And now today I see a PR6 page of mine drop 450 spots in the results because some thief put its content in hidden text on a PR0 page -- the only page on the Internet, according to Google, that has stolen it. Now the thief's page ranks higher for a string of random text from the page, and mine is omitted/penalized.

Are they really and truly this blatantly out of touch down there at the plex? It appears so, but while being clueless is not a sin, giving insane advice to webmasters is evil.

SEOPTI - msg:3195294 - 12:44 am on Dec 21, 2006 (gmt 0)

Google will need years to develop technology that makes their search results look better and avoids collateral damage. I really have no idea why there is still such big hype about their stock price; I think it's a big bubble.

[edited by: SEOPTI at 12:45 am (utc) on Dec. 21, 2006]

Whitey - msg:3195304 - 12:49 am on Dec 21, 2006 (gmt 0)

what if the "blocks" are occurring in the central body content area.

We'll run an experiment, splitting up the various versions and removing the drop-down text boxes positioned in the central area, by different amounts on each site. If it makes a difference I'll pass it on -- it should take a couple of weeks.

Again, our simple sites without drop-down menus are working well. The others are heavily filtered.

eddytom - msg:3195314 - 1:09 am on Dec 21, 2006 (gmt 0)

This has been the most revealing thread I've seen here in a while; I wish Adam would answer some of these questions. This last post has me alarmed about blocks of content and filters. Please, someone, read my issue below and respond -- it would be greatly appreciated.

I have two menus: one horizontal with drop-downs, then one vertical in the left column with drop-downs. The menus, plus two footer links, give my site template about 60 total links. That, plus the logo, a slogan, and a couple of banners, is duplicated across more than 2000 pages of my site, which runs original news content in my niche. Is this all duplicate content that is hurting me?

We are in Google News and the site gets crawled by Google incessantly, yet we have 1900 pages in the supps. Is it because of the menus? If so that's just crazy, and it would cause us to redesign for Google instead of for users, again. I am getting sick of having to do things to accommodate Google, such as taking out the related-news module under each article. Thanks for any help.

Whitey - msg:3195343 - 1:46 am on Dec 21, 2006 (gmt 0)

what if the "blocks" are occurring in the central body content area

eddytom - Have you cleared all other aspects of duplicate content?

If so is this potentially the last consideration?

I'm very interested in the on-page body duplicate content element which Adam has opened the lid on, and which hasn't been discussed much in the thread over here: [webmasterworld.com...]

Great emphasis was put on site architecture, meta descriptions/titles, interlinking, etc., but nothing was discussed [I think] in relation to the above.

I think this can occur within a page, between linked pages [and of course we know it occurs between sites]. Do it too often and I think the site or pages will be filtered according to the severity. [IMO]

Body content and linking are the two key remaining elements in the overall structural presentation of pages and sites for the search engines on this subject, as far as I can see. It would be highly useful to have this clarified. [IMO]

- just a hunch.

[edited by: tedster at 4:48 am (utc) on Dec. 21, 2006]
[edit reason] fix link [/edit]

Adam_Lasnik - msg:3195351 - 2:04 am on Dec 21, 2006 (gmt 0)

I wish Adam would answer some of these questions.

I will indeed! I also need to catch up on our own Webmaster Help Google Group; my colleagues and I have been reading tons of posts in the meantime, though.

Thanks for your patience, and keep the thoughtful questions coming...

Whitey - msg:3195352 - 2:06 am on Dec 21, 2006 (gmt 0)

Adam - you're a champ - what with all that multi-tasking and Christmas shopping to do as well!

[edited by: Whitey at 2:07 am (utc) on Dec. 21, 2006]

pageoneresults - msg:3195353 - 2:06 am on Dec 21, 2006 (gmt 0)

We are in google news and the site gets crawled by google incessantly, however we have 1900 pages in the supps. Is it because of the menus?

Doubtful. It is probably due to PageRank™ and internal linking structure (the flow of PageRank™), not the replication of natural navigation elements.
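For what it's worth, the "flow of PageRank" through internal links is just the standard PageRank iteration applied to a site's own link graph. A minimal sketch on a hypothetical four-page site (damping factor 0.85, numbers purely illustrative):

    # Minimal sketch of PageRank flowing through an internal link graph.
    # The graph below is hypothetical; deep pages that receive few internal links
    # end up with little PageRank, one common reason results go supplemental.

    def pagerank(links, damping=0.85, iterations=50):
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new = {}
            for p in pages:
                incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
                new[p] = (1 - damping) / len(pages) + damping * incoming
            rank = new
        return rank

    site = {
        "home":      ["section", "article-1"],
        "section":   ["article-1", "article-2"],
        "article-1": ["home"],
        "article-2": ["home"],
    }
    for page, score in sorted(pagerank(site).items(), key=lambda kv: -kv[1]):
        print(f"{page}: {score:.3f}")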

If so that's just crazy and would cause us to redesign for google instead of the users, again.

Just because your site may not be doing as well as you would like it to doesn't mean you should be making knee-jerk reactions to an ever-changing algo. ;)

I am getting sick of having to do things to accomadate to google, such as taking out our related news module under each article. Thanks for any help.

Why would you take out the related news module? Was it good for your visitors? Then why take it out? Are you absolutely positively sure it was the cause of something?

Marcia - msg:3195363 - 2:11 am on Dec 21, 2006 (gmt 0)

Duplication in body text has come up and been talked about many times. For a few years, in fact.

Whitey - msg:3195369 - 2:20 am on Dec 21, 2006 (gmt 0)

Duplication in body text has come up and been talked about many times. For a few years, in fact.

It passed me by :). Any takes on the best thread and key points in the context of the "block elements" issue Tedster mentioned earlier?

Anything discussed re text-based menus etc. would be greatly appreciated as food for thought.

I'd really like to pin down the specifics, so that Adam might choose to comment on it.

Jane_Doe - msg:3195396 - 2:58 am on Dec 21, 2006 (gmt 0)

Adam - Thank you for taking the time to post here. The negative comments regarding your suggestions are not shared by everyone here. Personally, I appreciate any tidbits you and the other Googlers care to pass on to us web publishers.

And now today I see a PR6 page of mine drop 450 spots in the results because some thief put its content in hidden text on a PR0 page.

Steveb - Association does not prove cause and effect. Perhaps your site has a penalty and the other site outranking it is an effect, not a cause, of the penalty on your site.

[edited by: Jane_Doe at 3:03 am (utc) on Dec. 21, 2006]
