Forum Moderators: Robert Charlton & goodroi
He zeroes in on a few specific areas that may be very helpful for those who suspect they have muddied the waters a bit for Google. Two of them caught my eye as being more clearly expressed than I'd ever seen in a Google communication before: boilerplate repetition, and stubs.
Minimize boilerplate repetition:
For instance, instead of including lengthy copyright text on the bottom of every page, include a very brief summary and then link to a page with more details.
If you think about this a bit, you may find that it applies to other areas of your site well beyond copyright notices. How about legal disclaimers, taglines, standard size/color/etc. information about many products, and so on? I can see how "boilerplate repetition" might easily soften the kind of sharp, distinct relevance signals that you might prefer to show about different URLs.
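As a rough illustration only (nothing from Adam's post, and the file name is my own invention), here's the kind of template helper that keeps the on-page legal text to one line and points to a single details page:

def render_footer(year=2006, owner="Example Widgets Inc."):
    # Hypothetical sketch: one short sentence on every page,
    # with the full copyright/legal text living on a single page.
    full_legal_url = "/legal.html"  # assumed location of the complete notice
    return ('<p class="footer">&copy; %d %s. '
            '<a href="%s">Copyright and legal details</a></p>'
            % (year, owner, full_legal_url))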
Avoid publishing stubs:
Users don't like seeing "empty" pages, so avoid placeholders where possible. This means not publishing (or at least blocking) pages with zero reviews, no real estate listings, etc., so users (and bots) aren't subjected to a zillion instances of "Below you'll find a superb list of all the great rental opportunities in [insert cityname]..." with no actual listings.
This is the bane of the large dynamic site, especially one that has frequent updates. I know that as a user, I hate it when I click through to find one of these stub pages. Some cases might take a bit more work than others to fix, but a fix usually can be scripted. The extra work will not only help you show good things to Google, it will also make the web a better place altogether.
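For what it's worth, here is one hypothetical shape such a script could take (the field names and the noindex fallback are my own assumptions, not anything Adam recommended): only publish an indexable city page when there are real listings behind it.

def build_city_page(city, listings):
    # Hypothetical sketch of "scripting the fix" for stubs:
    # pages with zero listings either aren't generated at all,
    # or go out with a noindex so bots never see an empty template.
    if not listings:
        return render_page(city, ["No listings yet - check back soon."],
                           robots="noindex,follow")
    return render_page(city, listings, robots="index,follow")

def render_page(city, items, robots):
    lis = "".join("<li>%s</li>" % item for item in items)
    return ('<meta name="robots" content="%s">'
            '<h1>Rental opportunities in %s</h1><ul>%s</ul>'
            % (robots, city, lis))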
[edited by: tedster at 9:12 am (utc) on Dec. 19, 2006]
I disagree.
I was talking about a Wiki, not a review site. I think most of it is taken care of, at least in MediaWiki, since those links lead to a script. Though some users might type the URL directly, and a sitemap cron job would pick these up. I have never seen, for example, Wikipedia history pages or index.php?bla files in Google, so I hope the topic is moot.
But again we see that life is more complex and opinions differ.
I'm confused! One of my sites is simple .html and .css for style, with an extensive menu structure. Blocks and blocks of duplicate content in this menu structure, repeated on every page of the 150 or so pages. Is this menu, repeating on every page, duplicate content? All I'm doing is giving the site visitors a simple, repeating structure to navigate the site.
That's a very good point Calicochris.
I am no authority on this, but I think duplicate CONTENT relates only to the stuff outside <> tags.
In other words, structural templates are probably not penalized/filtered, even if they represent the majority of the total "content".
However I would like Adam to confirm this.
[edited by: activeco at 4:03 pm (utc) on Dec. 19, 2006]
I was talking about a Wiki not a review site.
And I was talking about "stubs" in the broader context of this discussion (as the term was used by Adam Lasnik), so maybe we're discussing apples and oranges. Still, in an era when Wikis have been acquired by corporations and turned into profit-making enterprises, the temptation to create Wiki stubs for "long tail" search referrals may be too great to resist--which wouldn't be good news for users or for search engines.
And I was talking about "stubs" in the broader context of this discussion (as the term was used by Adam Lasnik), so maybe we're discussing apples and oranges.
When you quote me directly and say I disagree, then I assume you disagree. ;)
Wikis, forums and so on are widely used software. Surely it would all be easier with basic HTML, but technology has moved on beyond the abysmal FrontPage kind of site. There are a gazillion kinds of technologies out there that make publishing easier and more accessible. The potential for abuse does not mean abuse. A situation where only the search engines are allowed to use technology and the rest of the world has to carve their letters into stone is undesirable.
It's the old argument: shoot everyone to prevent crime, or live with reality. Treating everyone as a suspect until proven innocent is a prehistoric strategy.
Adam, I have also placed a noindex on over half of my pages; many were relatively thin, but I did the same to many others just in case. Those pages with noindex, although they still reside on my server, do not count as far as Google is concerned, correct?
Thanks,
I'm debating whether to send a reinclusion request or wait for Googlebot to sort it out....
Adam,
Since Google has made this a public FYI now, why not build something like a "Duplicate Content" threshold meter into your webmaster toolkit? That would at least automate things like identifying what is boilerplate and what is not, and eliminate the "what if" questions and all the millions of potential scenarios that webmasters are now scratching their heads about.
This needs to be read again by "the powers that be" that lurk these forums.
Google, if you are TRULY interested in "communicating" with webmasters...not the public relations scare tactics you currently call communications,
APPLYING THE ABOVE RECOMMENDATION IS HOW YOU COMMUNICATE IN A USEFUL WAY THAT MAKES YOUR JOBS AND OUR JOBS EASIER.
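Until or unless Google ever builds such a meter, a webmaster can approximate one. Purely as a hypothetical sketch (the shingle size and the "appears on most other pages" rule are arbitrary choices of mine, not anything Google has published), this estimates what share of a page's text is shared with the rest of the site:

import re

def shingles(text, n=8):
    # Break text into overlapping n-word chunks for comparison.
    words = re.findall(r"\w+", text.lower())
    return set(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def boilerplate_ratio(page_text, other_pages, n=8):
    # Fraction of this page's shingles that also appear on most other pages.
    page = shingles(page_text, n)
    if not page:
        return 0.0
    threshold = len(other_pages) // 2 + 1
    counts = dict((s, 0) for s in page)
    for other in other_pages:
        for s in page & shingles(other, n):
            counts[s] += 1
    shared = sum(1 for c in counts.values() if c >= threshold)
    return float(shared) / len(page)

What ratio is "safe" is anyone's guess; treat the output as a rough warning light, not a Google metric.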
Avoid publishing stubs:
Users don't like seeing "empty" pages, so avoid placeholders where possible. This means not publishing (or at least blocking) pages with zero reviews, no real estate listings, etc., so users (and bots) aren't subjected to a zillion instances of "Below you'll find a superb list of all the great rental opportunities in [insert cityname]..." with no actual listings.
[edited by: RonnieG at 7:06 pm (utc) on Dec. 19, 2006]
Thus, without writing the same description in lots of different ways for each product (which isn't practical or even sensible), you are left with "boilerplate" content. At least, that's how it sounds like Google will see it, although the user is likely to benefit from having the products organised clearly into categories where relevant.
The idea of duplicate content on the same website causing a penalty to rankings doesn't sound fair to implement until the algo can automatically work out how the data on the site is categorised. Not easy I'm sure, but...
Cheers
Simsi
[edited by: Simsi at 7:30 pm (utc) on Dec. 19, 2006]
I always see someone posting about the magic cure "noindex" tag.
The problem is that it kills lots of long-tail searches. For example, if you have a blue widget page that is similar to a red widget page and you use the noindex tag on the red widget page, you will not be found in the SERPs for "red widget", even though people do search with those long-tail keywords.
So using noindex on pages that deserve to be indexed, just to make Google happy, looks like a bad practice to me, unless you are willing to accept just a small portion of the traffic you deserve.
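One compromise some people try (just an illustration; the 50-word cut-off is a number I made up, not any Google rule) is to apply noindex only to genuinely thin pages, so a red widget page with its own description or reviews stays indexable for those long-tail searches:

MIN_UNIQUE_WORDS = 50  # arbitrary cut-off chosen for this example

def robots_meta(unique_text, review_count=0):
    # noindex only when the page has essentially nothing of its own.
    has_substance = (len(unique_text.split()) >= MIN_UNIQUE_WORDS
                     or review_count > 0)
    content = "index,follow" if has_substance else "noindex,follow"
    return '<meta name="robots" content="%s">' % content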
I have yet to run into issues such as these, and I'm going to chalk it up to using as many variables as I can from the database to make the page unique. Changing a word here and there isn't going to work. The meaning of that page needs to change. That means a top-to-bottom use of variables that break the Boilerplate mold.
I truly believe that the structure of the page is a determining factor in the Boilerplate discussion.
We ran pages with 3-4 navigation alternatives per page because we thought the user would like the various options presented this way, in drop-down menus and on-page link navigation.
The result is 3-4 times the quantity of similar content. This has gotta be a red hot issue.
The question is: is it? [ 99% likely IMO ] And if so, how do we structure it appropriately? Maybe it's time for REM scripts to take the repeated info off the page.
When you add in stubs, then not only is there potential for a page filter, there is also the high risk of a site-wide filter tipping out one's entire site, or allowing sporadic results to appear on fewer pages.
We are seeing this. On the site: tool, Google in effect says "we see all your pages, but only these are worth listing", and even then they are filtered out of the way. If you have a high PR you might be less affected, but ours are PR 5 and 6 and still having issues. Best to get it right in the first place.
Good points, pageoneresults - but I sense we know that 80% "boilerplate" or something similar on stubs will cause Google to throw its hands up and say "too similar". We wouldn't repeat visible content like this, would we?
Try 15% or less IMO
A search analysis of the top 5 results on key terms revealed that not one of our competitors had boilerplate pages. They restricted themselves to one drop-down [ or none ], which varied on every page. All of their menu-driven pages [ elsewhere ] were geared for SEM, not SEO.
How on earth could we miss something so obvious!
[edited by: Whitey at 8:48 pm (utc) on Dec. 19, 2006]
On the other hand: you gave an example with text snippets of - I guess - less than 300 characters per product. Do you really think it necessary to design a separate page for each product with just these 300 characters as unique content? An alternative would be (as I said) to group these products together and present a "view large image" link. The long-tail argument, as you use it, doesn't work: if your customers are searching for those unique phrases, the phrases will also occur on pages with several products, and Google will index them there too. Even better: the key phrases will automatically be repeated, which a search engine would expect if they also occur in the meta tags.
I admit: if I intended to buy some jewelry online, I'd probably expect the site owner to put some care into his presentation. Maybe twelve watches or rings, each worth several k, might look a bit strange grouped on one page. But then I thought: if you stick to the "one product, one page" concept, why not add some poetry to each page? Unique content, the product deserves it, flattered customers, and there's probably quite a number of poor poets out there looking for a job.
As Saint-Exupéry's "little prince" once put it: "It's the amount of time you spent with your rose which makes this rose so important." I guess the same holds true for HTML pages in the eyes of a search engine.
Lawman might chime in, but if I recall correctly there have been cases where a disclaimer in a single location was not sufficient, no matter how the links were constructed.
So, the question is - do I reduce risk, but damage opportunity, or vice versa?
It would be great if we had something like this...
<noindex></noindex> Those who are really good at this type of stuff are serving one page to the visitor and another to the bot so this issue of boilerplate is a moot point for them. ;)
Why? Google won't tell you, anyway.
> I truly believe that the structure of the page is a determining factor in the Boilerplate discussion.
I believe that structural diversification of the SITE is the best antidote against tanking. And probably Google is looking at this from a site perspective: if you have the same - let's say - link footer on every page, this may be viewed as boilerplate. But if your whole site comprises three, four or even more completely different structures, each of which deserves a different footer, the percentage of duplication is automatically diminished sitewide.
Maybe Google has the means to find out how many scripts you probably wrote to generate your x thousand pages. Each script generates a different structure. In Google's eyes, the importance of your site, the care you took for your visitors, is only partially defined by the number of pages (i.e. the complexity of your database), but also - if not mainly - by the number and complexity of the scripts you wrote. And the amount of time you spent on (writing, not tinkering with) these scripts.
On the other hand, I would expect Google to be able to identify this "boilerplate" on a site basis and just discount it from the page while still seeing the unique content. Technically this would not be too hard (and I expect that they are doing it this way).
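Nobody outside Google knows whether or how they actually do this, but as a back-of-the-envelope illustration of the idea, detecting site-level boilerplate can be as simple as dropping any block of text that shows up on more than half of the pages before judging what remains:

def strip_boilerplate(pages):
    # pages: list of pages, each a list of text blocks (menus, footers, paragraphs).
    # Any block appearing on more than half the pages is treated as boilerplate.
    counts = {}
    for page in pages:
        for block in set(page):
            counts[block] = counts.get(block, 0) + 1
    cutoff = len(pages) / 2.0
    boilerplate = set(b for b, c in counts.items() if c > cutoff)
    return [[b for b in page if b not in boilerplate] for page in pages]

site = [
    ["Shared footer text", "Unique article about blue widgets"],
    ["Shared footer text", "Unique article about red widgets"],
    ["Shared footer text", "Unique article about green widgets"],
]
print(strip_boilerplate(site))  # footers dropped, unique paragraphs kept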
I have some concerns about this and am very curious about the SEO and duplicate content filter impact on this design decision.
Should I be worried?!?!?
Every page on my site offers a static menu at the top and generic site links in the footer. That's nearly 1500 pages of duplicate content? Nope, not at all. I use separate pages to elaborate on some basic information - my About page, contact info, site policies. To my knowledge I have not triggered any dup content filters, except where I mentioned earlier about some pages going supplemental. My site certainly is not penalized in any way, as I'm still at the top of the SERPs. It would be foolish for Google to consider such things as dup content, UNLESS that's all I have on page after page (can you say stub?). Take a look at this site (WebmasterWorld) - each page is built around a template with identical info in the same place on each page. Just like mine, or vice versa.
A few other things may help to reduce the chance of pages looking alike to G-bot. Good page design includes proper use of meta tags. Keywords, descriptions and page titles should reflect the page content. This thread offers a few other really good gems about how to reduce the duplication while still displaying the content on the page. (I'm almost ready to hire myself a poet and put a couple of hundred pages back in the SERPs.)
It takes a little imagination to describe different products that are essentially the same. How many ways can you say a bottle is about 10 inches tall and made from plastic? I personally think a lot of duplicate content on the web is intentional, created by lazy people with no imagination or ambition other than to create more useless content. I know that my own boilerplate pages suffer from a dearth of content, and so they probably look very similar to an algo, even though they are uniquely different to a set of human eyes.
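Picking up the earlier point about pulling as many variables as possible from the database, here's a loose illustration (the field names are invented for the example) of assembling each description from several attributes so the pages differ in substance, not just a word here and there:

def describe(product):
    # Build the visible description from several database fields
    # instead of one canned sentence repeated across the catalog.
    parts = [
        "%s in %s" % (product["name"], product["color"]),
        "about %s inches tall" % product["height_in"],
        "made from %s" % product["material"],
    ]
    if product.get("use"):
        parts.append("designed for %s" % product["use"])
    return ", ".join(parts) + "."

print(describe({"name": "Squeeze bottle", "color": "blue", "height_in": 10,
                "material": "plastic", "use": "condiments"}))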
"What is duplicate content?
Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Most of the time when we see this, it's unintentional or at least not malicious in origin: forums that generate both regular and stripped-down mobile-targeted pages, store items shown (and -- worse yet -- linked) via multiple distinct URLs, and so on. In some cases, content is duplicated across domains in an attempt to manipulate search engine rankings or garner more traffic via popular or long-tail queries."
We syndicate a version of a forum we own for mobile users. It is identical content on two separate domain names (one is a .mobi), but the purpose is clearly not to generate better search placement (i.e. not malicious); it is to serve users interested in accessing our content through their phones. The way I read the statement above, it doesn't seem that this would be a problem, but it is definitely not totally clear. Anyone have any thoughts? Any help would be appreciated.
Some of us here know how to cloak. Some of us here know how to make appropriate iframes, meta tags and such.
The majority of website owners do not know how to do these things. That majority installs a default package, makes minor look-and-feel changes, and starts adding content. The menus, disclaimers, headers and footers are the same across all these sites.
At what percentage does Google begin to punish such duplicate content? (Rhetorical question - I know they will not answer.)
I think this hit me massively once, when I had a huge related-articles include which displayed the same text and links on a hundred pages of my site. I cut it short, to just ten links on ten pages only, and the rankings came back.
Now was that a duplicate content penalty too? I think so, but would like to know your opinion.
Another site boasts of being the best travel forum for this place: it has zero posts, so quite how this makes it the best, I dunno.
Another offers travel deals there, yet there are no hotels, etc., for miles.
Bah! Humbug!
At least, as far as I could be bothered checking the search results, I'm not seeing a page for a flower shop there (even though there are no such shops for miles). I have seen them for some other small places.
I've emailed Google about pages like these; as this thread shows, they remain commonplace - and Google is even encouraging their creation.
Not so much a personal gripe - I have a page at the top of the results (it's a really, really small place!) - but there are pages on this place, with info and photos, yet they are jumbled in with the stubs, so Google's results are not a boon to users.
We see Google advising webmasters about making sites that work for users.
Google could likewise better help users.
First Vanessa gave Rand a video interview on the topic, and now, soon after that, Adam has given us a more detailed blog post. He's even sharing some useful vocabulary to help further our discussion and comprehension.
You can tell where at least part of the search quality emphasis is right now at Google. So this current focus might also be a bit of a storm warning for the wise. It's happened before. The way I see it, public statements don't just emerge from a vacuum.