Forum Moderators: Robert Charlton & goodroi
This is the general site boilerplate: the logo at the top; a navigation strip under it for the main sections (e.g. Home, Search, Forums, Polls, About Us, etc.); a navigation strip down the side for links to sub-sections; and a copyright notice at the bottom with links to a site map and privacy policy.
The main viewable body of each page differs depending on where the user is. However, the boilerplate is about 50% of each page's raw HTML while being only a very small portion of the visible text (about 50 words); it just sets up the layout of the pages, ensuring a consistent feel to the site. Most of it is front-loaded, as tables are used to handle the layout.
Also, for each page most of the HTML in the body is identical, ensuring a consistent layout. The reason the boilerplate is c.50%+ of each page is that most pages are kept small so that they fit onto a single screen, with a link to a second page, related articles, etc. as necessary. Because of this, most of the links are also identical, apart from the ones relevant to the particular article you are reading.
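That boilerplate ratio can be roughly measured. Here is a minimal sketch (my own illustration, not anything Google is known to use) that estimates how much of a page's raw HTML is shared boilerplate by comparing two pages from the same site line by line; the two pages are made-up examples:

```python
# Rough heuristic: what fraction of page_a's non-blank HTML lines
# also appear verbatim in page_b? Shared lines approximate the
# template/boilerplate; differing lines approximate unique content.

def boilerplate_fraction(page_a: str, page_b: str) -> float:
    """Fraction of page_a's lines that also appear verbatim in page_b."""
    lines_a = [ln.strip() for ln in page_a.splitlines() if ln.strip()]
    shared = {ln.strip() for ln in page_b.splitlines() if ln.strip()}
    if not lines_a:
        return 0.0
    common = sum(1 for ln in lines_a if ln in shared)
    return common / len(lines_a)

page_a = """<html><body>
<div id="nav">Home | Search | Forums | Polls | About</div>
<p>Article one: unique text about widgets.</p>
<div id="footer">Copyright 2005 | Site map | Privacy</div>
</body></html>"""

page_b = """<html><body>
<div id="nav">Home | Search | Forums | Polls | About</div>
<p>Article two: completely different text about gadgets.</p>
<div id="footer">Copyright 2005 | Site map | Privacy</div>
</body></html>"""

print(round(boilerplate_fraction(page_a, page_b), 2))  # 0.8
```

On real pages with a heavy table-based template, a comparison like this will typically come out well above the c.50% figure described above, because only the article paragraph differs.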
However the text for all the pages is different as they are all about different things.
So my question is: as most of the HTML and links are identical on every page, even though the text is always different, will these pages be marked as duplicate content? i.e. does Google use the page text or the entire page (HTML and all) to determine duplicate content?
After Allegra one of my sites was lost in the SERPs wilderness; if I added &filter=0 to the query I was back at #1, without it my site was listed in the supplemental results... Every page had a different (original) 200-300 word article; the only thing similar was the HTML layout, tables, link menu, etc.
In the last few days the site has made a dramatic comeback, so I'm wondering if Google has relaxed the duplicate content filter.
Do a site:yourdomain
Pay attention to the domain names
Do you see www and non www forms of your domain, do you have a lot of supplemental pages, are there more pages counted by the Google site: search than pages you have on the site?
Do you use relative urls?
If so, then you have a problem, and it comes in several forms.
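For the www/non-www form of the problem, one common fix (sketched here assuming Apache with mod_rewrite; example.com is a placeholder for your domain) is a site-wide 301 redirect to the canonical hostname:

```apache
RewriteEngine On
# Redirect any request for the bare domain to the www form with a
# permanent (301) redirect, so only one hostname gets indexed.
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

Using absolute URLs in your internal links also sidesteps the relative-URL variant of the problem, since every link then names the canonical host explicitly.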
However the text for all the pages is different as they are all about different things.
That's your answer right there. No, they won't be penalised, because they won't appear in the same set of Google results.
You only really have to worry about duplicate content when you are duplicating another page in your subject area.
HTML tags are stripped.
You can test this by searching in Google for the word html. It would have to occur at least once in every one of the 8 billion pages; it does not.
Likewise search on font.
Google therefore must strip the HTML.
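The kind of stripping being argued about here can be sketched with Python's stdlib HTMLParser. This is a toy illustration of tag removal in general, not a claim about Google's actual pipeline:

```python
# Extract only the visible text from markup; tag names and attributes
# (like "font" in <font color='red'>) never make it into the output.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called only for text between tags, never for the tags themselves.
        if data.strip():
            self.chunks.append(data.strip())

def strip_html(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    return " ".join(parser.chunks)

print(strip_html("<font color='red'>hello</font> <b>world</b>"))
# prints "hello world" -- the words "font" and "b" do not survive
```

Which is exactly the point of the font test above: the tag occurs on a huge share of pages, but a search on the word returns only pages where it appears as visible text.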
The similarity logic and the ranking logic don't have to use the same source data (your test was on the ranking logic and is therefore invalid).
I'm confident that Google does take into account the HTML when determining similarity. On pages I've developed with little text and large templates, the duplicate content filter kicks in even though the main text is very different.
Thanks for your input Bear. No, I 301 to [www,...] so I don't see www and non-www versions of my site in the SERPs.
Google does go for content.
I have never seen Google nail templates; however, it is possible that after stripping out common words the template word count outweighs the actual content when determining what is a duplicate.
For all any of us know Google might decide all pages with fewer than x unique words are all the same, if you get my drift.
My concern stems from Yahoo marking a .com site of mine and a .net site as duplicates (Google hasn't indexed me yet), which under normal circumstances would be expected. However, looking at the cached versions, a major site update took place between the two indexings, meaning the front-page content was radically different between the two sites... therefore they were not duplicates at that time. However, the bulk of the HTML template remained the same.
So my concern is that Google, since it definitely does store the HTML, will mark every page as a duplicate. Unless I'm wrong and there happen to be over 2 billion pages talking about HTML, which seems unlikely...
Regardless, even if you strip all of the HTML tags out of your page, if you have a massively heavy template repeating the same navigation links, phrases and words as a huge percentage of the text et al. on the page and little unique content on the page, it could easily be seen as duplicate content.
Simple solution - make sure your pages contain enough content worth creating a page for.
[edited by: chrisnrae at 3:30 pm (utc) on Mar. 26, 2005]
I don't have access to G's databases; however, it makes zero sense that G would waste time continually processing pages to determine if there were duplicates within returned search results.
Now what G does only G knows and some folks aren't even certain that G even knows.
rescendent,
.net and .com are not, and do not indicate, the same domain. Be very careful: you will get a duplicate content problem if you assume they are and both get spidered.
In regards to Google's results for any term:
If the term is in the link text pointing to the page, it is counted. Look at G's cache of a page that is returned when you do a search, and you won't see that word highlighted in the page.
I have never seen Google nail templates; however, it is possible that after stripping out common words the template word count outweighs the actual content when determining what is a duplicate.
I have. I've had it done to me. Several times.
I don't have access to G's databases; however, it makes zero sense that G would waste time continually processing pages to determine if there were duplicates within returned search results.
So what if they did it at search time? You know, part of the lookup function that returns results from the database and presents them to the user?
Personally, I think it's a backend process where they're linking doc IDs, but it could be a search-time process.
It would have to occur at least once in every one of the 8 billion pages; it does not. Likewise search on font.
Google therefore must strip the HTML.
Or they treat it as a limited stop word. You know, because there are so many darn results with "html" in it?
[google.com...]
Google's not stripping html to store it in its database.
Google could be treating things on the fly all the time, and Google could have tons of crazy lists, all of which just points out that no one here knows for certain what gets done with what.
Now is it the template (pattern) or the content of the template that nailed you? You can decide.
I use templates all over the place however the content that goes into the template isn't the same.
Solution for me (take note: your server variables and DNS could be different on your box, so it may not help you, but...)
I switched the files to filename.inc and in the following update no more dupe and all pages indexed properly.
Again, this is my result, yours may vary.
Solution for me (take note: your server variables and DNS could be different on your box, so it may not help you, but...)
I switched the files to filename.inc and in the following update no more dupe and all pages indexed properly.
Eh? What have server variables and DNS got to do with renaming files?
What exactly did you do? You changed the name of your server side includes from filename.inc to filename.asp? Is that right?
Templates are not dangerous. Using templates when you have little to no content on the pages using those templates can be.
I'm looking at one of my own sites right now with 7 pages, out of a total of 66 hand-rolled pages, hit with URL-only, and one already out altogether (though it still shows PR, which has nothing to do with it).
It's *very* easy to look at all those pages with site: right now and see exactly why. With those, it's the percentage, or ratio, of what's unique main body content in relation to the weight of the content of the global template.
Not only that, but even with a decent paragraph or so of unique body text, there had better be a heavier balance of that text relative to the number of affiliate <a hrefs> on the page. Not that affiliate links are necessarily being hit, but by their nature they aren't unique.
There are two pages within that group, widgets.html and widgets-2.html that actually have enough unique text, BUT it appears there's *possibly* something else in operation. Just conjecture, but the extreme similarity in the filepath may be contributing to the problem with those, coupled with similarities on the page in spite of the unique text.
I say *possibly* because I couldn't state it for sure without it being verified by a second opinion or by more evidence - but the same thing happened with stuff.html and stuff-2.html - so it gives me a slight suspicion that groupings like that, closely linked with each other on the same site, could *possibly* need special attention to avoid problems, especially if there are structural similarities in the layout of the product display section.
Also, a couple of those pages that got hit in one particular section - that have pitiably little content on them aside from the global elements, are very poorly linked to from the rest of the site; they're only linked to from a page or two. No idea if that has any bearing, but it's another thing I'll be fixing.
Aside from this site of my own, I've just begun working with a site that has gotten most of the site in the supplemental index. Very LOW amount of main body content in relation to the global template, and excessive repetition in the filepaths besides. By the nature of the site, the remedy is to create several very content heavy pages with plenty of unique text and rely on those for ranking.
It's more than just duplication of text on pages, and Google is *very* good at picking up near duplicates, or maybe a kinder way to put it is to call them non-unique pages.
Added:
I had only one page go URL only on another site - and it's got just a short introductory paragraph with links to other pages in the section. So even though they are not affiliate links, but links to other pages on the site, the ratio of characters in text vs. characters in links vs. the amount of characters in the global template elements isn't good enough to give the page "value" - speaking strictly from a user perspective.
I can't know for sure, of course, if that's really why that particular page got hit, but there is no other reason except for what can be seen with the naked eye.
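The text-vs-links ratio being described can be approximated with a small heuristic. This is my own rough sketch of such a metric, not a known Google measure:

```python
# Share of a page's visible characters that sit inside <a href> links.
# A high value suggests a link-heavy page with little unique body text.
from html.parser import HTMLParser

class LinkRatio(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_link = 0      # nesting depth of open <a> tags
        self.link_chars = 0   # visible characters inside links
        self.total_chars = 0  # all visible characters

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.in_link:
            self.link_chars += n

def link_density(markup: str) -> float:
    parser = LinkRatio()
    parser.feed(markup)
    return parser.link_chars / parser.total_chars if parser.total_chars else 0.0

page = '<p>Short intro.</p><a href="/a">Section A</a><a href="/b">Section B</a>'
print(round(link_density(page), 2))  # 0.6 -- most visible text is link text
```

A page like the one described above, with one short paragraph and a column of section links, scores high on a measure like this even before the global template is counted in.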
What exactly did you do? You changed the name of your server side includes from filename.inc to filename.asp? Is that right?
No - Previously the way the browser saw the page was with the includes calling out to file.html. This created a dupe filter trip on said site.
Changed them to filename.inc ..... no more filter trip. This is not to say it will work for this site or that site as there are many variables that play with respect to hosting setup and what not.
What do you think?
Re: my recent experience, I've cut my boilerplate mark-up from 5K a page to 1.2K and it works nicely; but why can't all browsers play fair?