I too would love more info on how Google identifies websites that have pages that are very similar in content. I think I may have been flagged, but how did they identify me?
I guess the reason I ask is that there are many websites, competitors of mine, that are much worse in terms of duplicate/similar content - so how was I identified and not other larger websites?
There are several reasons that a url may be classed as supplemental. One of the most important situations to watch out for is when more than one url can directly retrieve the same content from your site. That duplication can happen in many ways. Here are just a few:

# with and without the "www"
# both https: and http:
# the use of a session id or other kinds of click tracking in the url's query string
# a "custom error page" that actually returns a 200 header for the bad url
# a subdomain that can also be accessed directly as a subdirectory
# forum software packages that create different urls for accessing the same content

There are many ways for multiple urls to serve exactly the same content, and none of them is a happy situation for a site. Plug those technical holes first, if you have them.
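One way to hunt for these holes is to enumerate the variant urls that should all collapse into one canonical form, then check each one by hand (or with a fetch tool). A minimal sketch in Python -- the example.com url is just a placeholder, and the function name is my own. Every variant this prints should either 301-redirect to the canonical url or fail to resolve; any variant that serves its own 200 page is a duplicate-content hole:

```python
from urllib.parse import urlsplit, urlunsplit

def url_variants(url):
    """Return the common duplicate-content variants of a canonical URL:
    http vs https, crossed with www vs non-www hostnames."""
    scheme, host, path, query, frag = urlsplit(url)
    bare = host[4:] if host.startswith("www.") else host
    variants = set()
    for s in ("http", "https"):
        for h in (bare, "www." + bare):
            variants.add(urlunsplit((s, h, path, query, frag)))
    variants.discard(url)  # drop the canonical URL itself
    return sorted(variants)

print(url_variants("http://www.example.com/page.html"))
```

Session ids and tracking parameters multiply this list further; the sketch only covers the scheme and hostname cases.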
I'm interested in learning more about how Google identifies websites with duplicate or similar content. Again, how do they catch some websites, but not others?
There are many types of Supplemental Result.
Some are associated with duplicate content as Tedster explains; and there are many types of duplicate content as you can see.
There are other types of Supplemental Result though. One is where a Supplemental Result represents a page that is now 404, or even where the whole domain has expired. Google shows the result long after the real content is no longer available.
Another type of Supplemental Result is where the page is simply the previous version of the page. The current version is shown as a normal result, but if you search for keywords that were on the page some 8 to 30 months ago (and which are no longer on the current version of the page) then you see the same page as a Supplemental Result. The snippet will usually also show that same old content, but the cache will always be the one from recent days or weeks (except for a brief time last week when the old cache would show against the old results in several datacentres).
There are other types, too, and I am still working on what triggers their appearance, and what factors are required for that data to remain indexed.
In the BigDaddy datacentres there are Supplemental Results with cache dates going back to 2004 January. In the "experimental" datacentre some types of Supplemental Result have greatly increased in number and the "exact match quoted search" no longer works. In the "cleanup" datacentre most Supplemental Results with cache dates prior to 2005 June have been chucked away (at long last), but a large number of newer (dated 2005 July to 2006 March) Supplemental Results (mostly for other pages) have now appeared instead.
Once you have pages in Google's supplemental results, how do you get them removed?
We have over 30k pages in their supplemental index, and I think this all happened recently.
That depends on what "type" of supplemental result it is.
If the supplemental is the "old version" of the page, then Google will hang on to it for years. Don't worry about those.
If it is a result of "duplicate" content (e.g. www vs. non-www) then get the redirects in place, and hope it all works out.
If the supplemental is for a page that is 404, then make sure the URL really does return a 404 status, and wait for Google to update their index.
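One way to check what status a url really returns -- without following redirects, so a "custom error page" served with a 200 shows up -- is a plain HEAD request. A rough sketch using only the Python standard library (the function name is my own):

```python
import http.client
from urllib.parse import urlsplit

def status_of(url, timeout=10):
    """Return the raw HTTP status code for a URL, without following
    redirects, so a soft-404 (error page served as 200) is visible."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        conn.request("HEAD", parts.path or "/")
        return conn.getresponse().status
    finally:
        conn.close()
```

If this reports 200 for a url whose page says "not found", the server is sending the wrong status and that needs fixing before Google will drop the entry.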
How can I tell if the supplemental is the "old version" or "duplicate" content (ie. www vs non-www)?
To check for www vs non-www duplicates, do a site: search on each hostname; that shows which of your www and non-www pages are indexed.
To check for the "old version" type, search for some content on your page that exists right now, and then search for some words that were on the old page but are NOT on the current version of the page. See if any of those searches show the selected page as a normal result or as a supplemental result. Look at the cache date in both cases too.
I was quite lucky to spot this, as my example was a telephone number that had changed. It was very easy to see where Google showed pages that supposedly had that old number on them (there it was in the snippet) but when you clicked on those results you got through to a new version of the page with the old number nowhere to be seen, or else you got a 404 error, or found that the whole domain had expired.
You might have to think a lot harder to find a useful search phrase that will apply to your site.
# with and without the "www"
# both https: and http:
# the use of a session id or other kinds of click tracking in the url's query string
# a "custom error page" that actually returns a 200 header for the bad url
# a subdomain that can also be accessed directly as a subdirectory
# forum software packages that create different urls for accessing the same content
None of these apply to my site, yet on some datacenters I have a ton of supplemental pages. I have some supps that are just old pages from 2005 but others are current and have been classified as supps for some reason. The only thing I can think of is that they are just too similar and are being penalized as duplicate content. I have begun changing the pages and making each one a bit more unique...we'll see if it helps.
How do you plug https and http hole?
|How do you plug https and http hole? |
Google recommends separate robots.txt files. See [webmasterworld.com...] for more.
Wow...if anyone wants to explain how you have 2 files called robots.txt I am all ears. As far as I know it is not possible to have 2 files by the same name in the same place.
That thread is completely inconclusive, by the way. Nobody lists an actual tested and workable solution to the problem of https and http -- which, by the way, is a NEW Google problem to add to the growing list of other stuff Google can't do.
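For what it's worth, the trick isn't two files with the same name -- it's serving a *different* file under the name robots.txt depending on the protocol. A sketch of one way people do it, assuming Apache with mod_rewrite in .htaccess (the robots_ssl.txt filename is just an example, not anything Google specifies):

```apache
# Serve robots_ssl.txt as /robots.txt on https requests, so crawlers
# are blocked from indexing the https:// duplicates of the site.
RewriteEngine On
RewriteCond %{HTTPS} on
RewriteRule ^robots\.txt$ /robots_ssl.txt [L]
```

where robots_ssl.txt simply blocks everything:

```
User-agent: *
Disallow: /
```

The cleaner long-term fix is still a 301 from one protocol to the other wherever the content is genuinely the same.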
I dealt with this in another thread at [webmasterworld.com...]
I set out the best course of action in the last post. It worked well for us and Google were actually very helpful.
All the best.
Google really just spawns way too much work. The exception is if you're spamming them, have been deemed authoritative, or are in a low-competition area. Those are the only ways you'll have the stability to earn cash flow in Google.
I mean, really, how many people have had to spend countless hours of their time trying to figure out why Google crushed them, instead of improving their own sites? I calculate my time spent on correcting MSN mistakes at near zero.
outland, that really just about sums it up!
If you understand the few simple technical issues surrounding all of this then most of the fixes that you can implement are very simple. They are basic SEO and design.
Make sure that every page of the site has a unique title and meta description.
Make sure that every page of the site links back to "/" and to the main section indexes.
Make sure that all domain.com accesses are redirected to the same page in the www.domain.com version of the site.
If you have multiple domains, then use the 301 redirect on those such that only one domain is indexed.
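That redirect advice can be sketched as a single Apache rewrite, assuming mod_rewrite is available in .htaccess -- www.example.com here is a placeholder for your own canonical hostname:

```apache
# 301 every non-canonical hostname (non-www, parked domains, etc.)
# to the one canonical host, so only one version gets indexed.
RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

The R=301 flag is the important part: a permanent redirect tells Google to transfer the listing to the canonical url rather than indexing both.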
If you have pages that say to bots "Error. You Are Not Logged In", for example "newthread", "newreply", "editProfile" and "sendPM" links in a forum, then make sure the link has rel="nofollow" on it, and the target page has <meta name="robots" content="noindex"> on it too.
If you have a CMS, forum, or cart that has pages that could have multiple URLs, then get the script modified to put a <meta name="robots" content="noindex"> tag on all but one "version" of the page.
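For the noindex and nofollow fixes above, the markup itself is tiny -- the page and parameter names here are hypothetical forum examples:

```html
<!-- in the <head> of every duplicate "version" of a page,
     and on "Error. You Are Not Logged In" pages: -->
<meta name="robots" content="noindex">

<!-- on action links that bots shouldn't crawl: -->
<a href="newreply.php?t=123" rel="nofollow">Post Reply</a>
```

The hard part is usually getting the CMS or forum script modified to emit the meta tag only on the non-canonical versions.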
Use the site: search to see what you have indexed, and work to correct these issues. The presence of Supplemental Results, URL-only entries, or hitting the "repeat this search with omitted results included" message very quickly are all indications that you have stuff that needs fixing.
It is a sad fact that systems like vBulletin, PHPbb, osCommerce, and a whole range of popular scripted sites, have a large number of SEO-related design errors built into them. The designers are clever programmers, but have no clue about SEO or how their site will interact with search engines; and the situation isn't getting any better.
Run Xenu LinkSleuth over your site, and run a few pages through [validator.w3.org] too - just in case.
If you have done all of that, then you'll just have to wait for Google to fix whatever they have broken at their end.
|a "custom error page" that actually returns a 200 header for the bad url |
This is a big problem. I'm running into various site structures where content is short term (e.g. real estate listings, job postings, etc.). Once the listing is removed, the page is still there with all the template stuff and a message stating that the listing is no longer available. The server is returning a 200 response instead of a 404/410. I've come across sites with thousands upon thousands of these pages and it's a mess.
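The fix belongs in the application code: when the listing is gone, the handler should send a 410 (or 404) status instead of a 200 template page. A minimal Python sketch of the idea -- the listing store and function name are made up for illustration:

```python
# Hypothetical listing store; on a real site this would be a database.
ACTIVE_LISTINGS = {"123": "3-bed house, $250k"}

def listing_response(listing_id):
    """Return (status, body) for a listing page.

    An expired listing gets 410 Gone instead of a 200 "listing no
    longer available" template, so search engines drop the URL rather
    than indexing thousands of near-duplicate template pages."""
    body = ACTIVE_LISTINGS.get(listing_id)
    if body is None:
        return 410, "Listing no longer available"
    return 200, body

print(listing_response("123"))   # -> (200, '3-bed house, $250k')
print(listing_response("999"))   # -> (410, 'Listing no longer available')
```

You can still render the friendly "no longer available" template for human visitors; the status line is all that has to change.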
I also see many that do a 302 redirect to their error page.
Are they MAD or something?
|Are they MAD or something? |
I've actually had discussions with a few who do this. They state that they are maintaining the PageRank with the template page and will populate it with other content when it becomes available. Yup, they're MAD. Especially when there are thousands upon thousands of pages like that. ;)
OK, so they have heard about "PageRank" but completely missed all of the stuff about "duplicate content"?