Forum Moderators: Robert Charlton & goodroi
This is the general site boilerplate: the logo at the top; a navigation strip under it for the main sections (e.g. Home, Search, Forums, Polls, About Us, etc.); a navigation strip down the side for links to sub-sections; and a copyright notice at the bottom with links to a site map and privacy policy.
The main viewable body of each page differs depending on where the user is. However, the boilerplate is about 50% of each page's raw HTML while being only a very small portion of the visible text (about 50 words); it just sets up the layout of the pages, ensuring a consistent feel to the site. Most of it is front-loaded, as tables are used to handle the layout.
Also, for each page most of the HTML in the body is identical, ensuring a consistent layout. The reason the boilerplate is c.50%+ of each page is that most pages are kept small so that they fit onto a single screen, with a link to a second page, related articles, etc. as necessary. Because of this, most of the links are also identical, apart from the ones relevant to the particular article you are reading.
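That boilerplate ratio can be roughly measured. Here is a minimal sketch (my own illustration, not anything Google is known to use) that estimates how much of a page's raw HTML is shared boilerplate by comparing two pages from the same site line by line; the two pages are made-up examples:

```python
# Rough heuristic: what fraction of page_a's non-blank HTML lines
# also appear verbatim in page_b? Shared lines approximate the
# template/boilerplate; differing lines approximate unique content.

def boilerplate_fraction(page_a: str, page_b: str) -> float:
    """Fraction of page_a's lines that also appear verbatim in page_b."""
    lines_a = [ln.strip() for ln in page_a.splitlines() if ln.strip()]
    shared = {ln.strip() for ln in page_b.splitlines() if ln.strip()}
    if not lines_a:
        return 0.0
    common = sum(1 for ln in lines_a if ln in shared)
    return common / len(lines_a)

page_a = """<html><body>
<div id="nav">Home | Search | Forums | Polls | About</div>
<p>Article one: unique text about widgets.</p>
<div id="footer">Copyright 2005 | Site map | Privacy</div>
</body></html>"""

page_b = """<html><body>
<div id="nav">Home | Search | Forums | Polls | About</div>
<p>Article two: completely different text about gadgets.</p>
<div id="footer">Copyright 2005 | Site map | Privacy</div>
</body></html>"""

print(round(boilerplate_fraction(page_a, page_b), 2))  # 0.8
```

On real pages with a heavy table-based template, a comparison like this will typically come out well above the c.50% figure described above, because only the article paragraph differs.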
However the text for all the pages is different as they are all about different things.
So my question is: as most of the HTML and links are identical on every page, even though the text is always different, will these pages be marked as duplicate content? i.e. does Google use the page text or the entire page (HTML and all) to determine duplicate content?
After Allegra one of my sites was lost in the SERPs wilderness; if I added &filter=0 to the query I was back at #1, without it my site was listed in the supplemental results... Every page had a different (original) 200-300 word article; the only thing similar was the HTML layout, tables, link menu, etc.
In the last few days the site has made a dramatic comeback, so I'm wondering if Google has relaxed the duplicate content filter.
Do a site:yourdomain
Pay attention to the domain names
Do you see www and non www forms of your domain, do you have a lot of supplemental pages, are there more pages counted by the Google site: search than pages you have on the site?
Do you use relative urls?
If so, then you have a problem, and it comes in several forms.
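For the www/non-www form of the problem, one common fix (sketched here assuming Apache with mod_rewrite; example.com is a placeholder for your domain) is a site-wide 301 redirect to the canonical hostname:

```apache
RewriteEngine On
# Redirect any request for the bare domain to the www form with a
# permanent (301) redirect, so only one hostname gets indexed.
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

Using absolute URLs in your internal links also sidesteps the relative-URL variant of the problem, since every link then names the canonical host explicitly.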
However the text for all the pages is different as they are all about different things.
That's your answer right there. No, they won't be penalised, because they won't appear in the same set of Google results.
You only really have to worry about duplicate content when you are duplicating another page in your subject area.
HTML tags are stripped.
You can test this by searching in Google for the word html. It would have to occur at least once in every one of the 8 billion pages; it does not.
Likewise search on font.
Google therefore must strip the HTML.
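The kind of stripping being argued about here can be sketched with Python's stdlib HTMLParser. This is a toy illustration of tag removal in general, not a claim about Google's actual pipeline:

```python
# Extract only the visible text from markup; tag names and attributes
# (like "font" in <font color='red'>) never make it into the output.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called only for text between tags, never for the tags themselves.
        if data.strip():
            self.chunks.append(data.strip())

def strip_html(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    return " ".join(parser.chunks)

print(strip_html("<font color='red'>hello</font> <b>world</b>"))
# prints "hello world" -- the words "font" and "b" do not survive
```

Which is exactly the point of the font test above: the tag occurs on a huge share of pages, but a search on the word returns only pages where it appears as visible text.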
The similarity logic and the ranking logic don't have to use the same source data (your test was on the ranking logic and is therefore invalid).
I'm confident that Google does take into account the HTML when determining similarity. On pages I've developed with little text and large templates, the duplicate content filter kicks in even though the main text is very different.
Thanks for your input Bear. No, I 301 to [www,...] so I don't see www and non-www versions of my site in the SERPs.
Google does go for content.
I have never seen Google nail templates; however, it is possible that after stripping out common words the template word count outweighs the actual content when determining what is a duplicate.
For all any of us know Google might decide all pages with fewer than x unique words are all the same, if you get my drift.
My concern stems from Yahoo marking a .com site of mine and a .net site as duplicates (Google hasn't indexed me yet), which under normal circumstances would be expected. However, looking at the cached versions, a major site update took place between the two indexings, meaning the front-page content was radically different between the two sites... therefore they were not duplicates at that time. However, the bulk of the HTML template remained the same.
So my concern is that Google, since it definitely does store the HTML, will mark every page as a duplicate. Unless I'm wrong and there happen to be over 2 billion pages talking about HTML, which seems unlikely...
Regardless, even if you strip all of the HTML tags out of your page, if you have a massively heavy template repeating the same navigation links, phrases and words as a huge percentage of the text et al. on the page and little unique content on the page, it could easily be seen as duplicate content.
Simple solution - make sure your pages contain enough content worth creating a page for.
[edited by: chrisnrae at 3:30 pm (utc) on Mar. 26, 2005]
I don't have access to G's databases; however, it makes zero sense that G would waste time continually processing pages to determine if there were duplicates within returned search results.
Now what G does only G knows and some folks aren't even certain that G even knows.
rescendent,
.net and .com are not, and do not indicate, the same domain. Be very careful: you will get a duplicate content problem if you assume they are and both get spidered.
In regards to Google's results for any term:
If the term is in the link text pointing to the page, it is counted. Look at G's cache of a page that is returned when you do a search, and you won't see that word highlighted in the page.
I have never seen Google nail templates; however, it is possible that after stripping out common words the template word count outweighs the actual content when determining what is a duplicate.
I have. I've had it done to me. Several times.
I don't have access to G's databases; however, it makes zero sense that G would waste time continually processing pages to determine if there were duplicates within returned search results.
So what if they did it at search time? You know, part of the lookup function that returns results from the database and presents them to the user?
Personally, I think it's a backend process where they're linking doc IDs, but it could be a search-time process.
It would have to occur at least once in every one of the 8 billion pages; it does not. Likewise search on font.
Google therefore must strip the HTML.
Or they treat it as a limited stop word. You know, because there are so many darn results with "html" in it?
[google.com...]
Google's not stripping html to store it in its database.
Google could be treating things on the fly all the time, and Google could have tons of crazy lists, all of which just points out that no one here knows for certain what gets done with what.
Now is it the template (pattern) or the content of the template that nailed you? You can decide.
I use templates all over the place however the content that goes into the template isn't the same.
Solution for me (take note: your server variables and DNS could be different on your box, so it may not help you, but...)
I switched the files to filename.inc and in the following update no more dupe and all pages indexed properly.
Again, this is my result, yours may vary.
Solution for me (take note: your server variables and DNS could be different on your box, so it may not help you, but...)
I switched the files to filename.inc and in the following update no more dupe and all pages indexed properly.
Eh? What have server variables and DNS got to do with renaming files?
What exactly did you do? You changed the name of your server side includes from filename.inc to filename.asp? Is that right?
Templates are not dangerous. Using templates when you have little to no content on the pages using those templates can be.
I'm looking at one of my own sites right now with 7 pages, out of a total of 66 hand-rolled pages, hit with URL-only, and one already out altogether (though it still shows PR, which has nothing to do with it).
It's *very* easy to look at all those pages with site: right now and see exactly why. With those, it's the percentage, or ratio, of what's unique main body content in relation to the weight of the content of the global template.
Not only that, but even with a decent paragraph or so of unique body text, there had better be a heavier balance of that text relative to the number of affiliate <a hrefs> on the page. Not that affiliate links are necessarily being hit, but by their nature they aren't unique.
There are two pages within that group, widgets.html and widgets-2.html that actually have enough unique text, BUT it appears there's *possibly* something else in operation. Just conjecture, but the extreme similarity in the filepath may be contributing to the problem with those, coupled with similarities on the page in spite of the unique text.
I say *possibly* because I couldn't state it for sure without it being verified by a second opinion or by more evidence - but the same thing happened with stuff.html and stuff-2.html - so it gives me a slight suspicion that groupings like that, closely linked with each other on the same site, could *possibly* need special attention to avoid problems, especially if there are structural similarities in the layout of the product display section.
Also, a couple of those pages that got hit in one particular section - that have pitiably little content on them aside from the global elements, are very poorly linked to from the rest of the site; they're only linked to from a page or two. No idea if that has any bearing, but it's another thing I'll be fixing.
Aside from this site of my own, I've just begun working with a site that has gotten most of the site in the supplemental index. Very LOW amount of main body content in relation to the global template, and excessive repetition in the filepaths besides. By the nature of the site, the remedy is to create several very content heavy pages with plenty of unique text and rely on those for ranking.
It's more than just duplication of text on pages, and Google is *very* good at picking up near duplicates, or maybe a kinder way to put it is to call them non-unique pages.
Added:
I had only one page go URL only on another site - and it's got just a short introductory paragraph with links to other pages in the section. So even though they are not affiliate links, but links to other pages on the site, the ratio of characters in text vs. characters in links vs. the amount of characters in the global template elements isn't good enough to give the page "value" - speaking strictly from a user perspective.
I can't know for sure, of course, if that's really why that particular page got hit, but there is no other reason except for what can be seen with the naked eye.
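The text-vs-links ratio being described can be approximated with a small heuristic. This is my own rough sketch of such a metric, not a known Google measure:

```python
# Share of a page's visible characters that sit inside <a href> links.
# A high value suggests a link-heavy page with little unique body text.
from html.parser import HTMLParser

class LinkRatio(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_link = 0      # nesting depth of open <a> tags
        self.link_chars = 0   # visible characters inside links
        self.total_chars = 0  # all visible characters

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.in_link:
            self.link_chars += n

def link_density(markup: str) -> float:
    parser = LinkRatio()
    parser.feed(markup)
    return parser.link_chars / parser.total_chars if parser.total_chars else 0.0

page = '<p>Short intro.</p><a href="/a">Section A</a><a href="/b">Section B</a>'
print(round(link_density(page), 2))  # 0.6 -- most visible text is link text
```

A page like the one described above, with one short paragraph and a column of section links, scores high on a measure like this even before the global template is counted in.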
What exactly did you do? You changed the name of your server side includes from filename.inc to filename.asp? Is that right?
No - Previously the way the browser saw the page was with the includes calling out to file.html. This created a dupe filter trip on said site.
Changed them to filename.inc ..... no more filter trip. This is not to say it will work for this site or that site as there are many variables that play with respect to hosting setup and what not.
What do you think?
Re: my recent experience, I've cut my boilerplate mark-up from 5K a page to 1.2K and it works nicely; but why can't all browsers play fair?