Interesting points, Woz. Aside from any other considerations, they've displayed enough ethics and concern for the integrity of their search that it's doubtful they'd want to be doing that.
If Google archived the entire cache indefinitely it would take an astronomical amount of storage, but it isn't inconceivable that it's archived for at least a limited time. We've seen cached pages switched back and forth more than once. They couldn't get involved in disputes, but it's possible some documentation is available to them, at least for a while.
Slightly OT, but I really don't see the sense of handing out penalties under any circumstances, as the web by its very nature is a mixture of companies and people, from professionals to mom-and-pops, from highly ethical netizens to the same low-lifes we have in real life. But who is to say which is which?
It seems to me that a modicum of restraint, and perhaps negation of any perceived value or gain from what could be questionable activities, would be the better path to follow.
Or, to put it more simply,
"Is that Deliberate Spam"?.
Thats a "Definite Possible Maybe".
Onya
Woz
For quite a while a client's webmaster would post a page (or pages) "quickly".
Then he would decide to move the page(s) to a more appropriate page name/directory, not realizing visitors had already linked to the page.
On recognizing this error he put them back (and also left the page(s) in the more appropriate place).
Hundreds of these duplicate/mirror pages were strewn throughout their site, and Google indexed many of them.
Although a redirect, rather than relying on unindexing, may have been a more appropriate solution... they did not get penalized, ever.
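For what it's worth, here's a rough Python sketch of the kind of check I mean: confirm the old addresses answer with a permanent redirect to the new location instead of serving a second copy. The URLs are made up and this is only to illustrate the idea, not anyone's actual site.

import urllib.request
import urllib.error

class NoRedirect(urllib.request.HTTPRedirectHandler):
    # Return None so urllib raises HTTPError instead of silently following the redirect.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

def check_redirect(old_url):
    opener = urllib.request.build_opener(NoRedirect)
    try:
        resp = opener.open(old_url)
        print(old_url, "still serves content directly (status", resp.status, ") - duplicate risk")
    except urllib.error.HTTPError as err:
        if err.code in (301, 308):
            print(old_url, "permanently redirects to", err.headers.get("Location"))
        else:
            print(old_url, "returned status", err.code)

# Hypothetical example: the hastily posted page vs. its final home.
check_redirect("http://www.example.com/quickpage.html")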
Two different sites have a lot less in common?
Our measures of success and failure here are based largely on observation and myth, and rarely on facts.
So a mirror site, on a separate domain, not linked at all with the primary site, is fully acceptable to Google?
IMO yes - simply because in many cases Googlebot can't see all the similarities.
As an SEO it is important to take as much control as possible over how your content is presented to the SEs; I don't like to leave it up to whatever dupe detection/removal algo they decide on this month.
Though I would think this is an exception to the topic.
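Just to picture what such a dupe check might look like (Google's actual algo isn't public, so this is only a guess at the general idea), here's a simple word-shingle overlap in Python:

# One common near-duplicate check: word shingles + Jaccard overlap.
def shingles(text, size=5):
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(max(len(words) - size + 1, 1))}

def similarity(text_a, text_b):
    a, b = shingles(text_a), shingles(text_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

page_a = "Widgets for sale, best widgets on the web, free shipping on all widgets."
page_b = "Widgets for sale, best widgets on the web, free delivery on all widgets."
print(f"overlap: {similarity(page_a, page_b):.2f}")  # close to 1.0 = likely flagged as dupes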
[vlib.org...]
[cui.unige.ch...]
- keeping the highest PageRank is no option. This has nothing to do with "who was first", even if Google were to take into consideration things such as the "age of links"; see the Google programming contest:
[google.com...]
- archiving all cache versions would be better; however, what if you copy and "web-publish" parts of an article that has never been on-line, and the original article comes on-line a year later?
Example:
A trainee in my company used some sentences from a reference book (previously not available on-line) on a webpage. We got a remark from a surfer that it contained sentences copied, without a source, from a specific book. We corrected the page and mentioned the source; however, we also finally found out that the specific book the surfer had mentioned had copied those sentences from the book our trainee had used, or vice-versa...
Even off-line it is an impossible exercise to know who really was first.
Exactly. Googlebot isn't in a position to make value judgements about which address deserves to be in its index for a particular piece of content. Also, Google don't want to list ten copies of the same content for each search (like a popular engine of old often used to).
It's up to site owners to resolve problems with duplication, and it's up to Google to return relevant content for each search.
There have been serious problems with duplicate content in the past (including large-scale result hijacking), but the 'keep the version with the highest PageRank' approach seems to work pretty well.
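Roughly, I picture it working something like this little Python sketch - the URLs and PR values are invented, and the real process is obviously far more involved - group pages with identical content and keep only the copy with the highest PageRank:

from collections import defaultdict
from hashlib import sha1

# (url, page text, toolbar-style PageRank) - all made up for the example
pages = [
    ("http://www.example.com/article.html",    "same article text", 6),
    ("http://mirror.example.org/article.html", "same article text", 4),
    ("http://www.example.com/other.html",      "different text",    5),
]

groups = defaultdict(list)
for url, body, pagerank in pages:
    groups[sha1(body.encode()).hexdigest()].append((pagerank, url))

for candidates in groups.values():
    best_pr, best_url = max(candidates)
    dropped = [u for pr, u in candidates if u != best_url]
    print("keep:", best_url, " drop:", dropped)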
And not only are they exact copies, but they are done in FrontPage with common borders, themes, Java buttons - all the things we say hurt rankings.
Both sites are PR7!
Checking the backlinks, they both have 474 links from exactly the same pages, which tends to suggest that they are mapping the two domains to the same files.
I am not sure if that proves or disproves any theories, but I thought I would throw it in, as I found it quite surprising that PR7 was attained against all the odds.
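If anyone wants to check whether two domains really are serving the same files, comparing a few paths byte for byte is usually enough. A small Python sketch - the domain names here are placeholders, not the sites above:

import hashlib
import urllib.request

def fingerprint(url):
    with urllib.request.urlopen(url) as resp:
        return hashlib.sha1(resp.read()).hexdigest()

# Compare a handful of paths on both hosts.
for path in ("/", "/products.html"):
    a = fingerprint("http://www.example-one.com" + path)
    b = fingerprint("http://www.example-two.com" + path)
    print(path, "identical" if a == b else "different")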
Onya
Woz
Some people see this as a penalty, but I'd much rather have one URL in Google credited with both sets of inbound links than two URLs credited with half each. This is absolutely the best thing to happen if you have duplicate or near duplicate content.
(I'm assuming that Woz is seeing the 'duplicate' effect, as it fits his description nicely.)