How does Google determine dup. cont. anyway? - (deprecated) Google News Archive forum at WebmasterWorld - WebmasterWorld

Forum Moderators: open

Message Too Old, No Replies

How does Google determine dup. cont. anyway?

I see three or four copies of one page from my site . . .

berli

1:07 am on Jun 17, 2003 (gmt 0)

10+ Year Member

With everyone talking about duplicate content, and having their pages vanish because of spammers ripping their pages off, I find myself perplexed by what goes on in my category.

First of all, I've noticed that sometimes articles get "reprinted" online more than once, and they may show up in search results more than once, at various rankings.

In my particular situation, I allowed two different people to archive a page of mine. The second person actually archived it TWICE, on two different domains. I'm kind of tee'ed at him because he refuses to link to my website, but does "helpfully" include a mailto: link to my email address. (I half-succeeded in getting him to remove that, at least). The page is also on my site. The same text also appears in groups.yahoo.com's archive because it was posted as an article in a relevant Yahoo group. Google spiders these archives (I can confirm this because pages from that group's archives regularly turn up in SERPS, often near the top because it has fairly high PR).

The html pages have almost the same text and I think almost the same html. Should be identical in at least one case.

I have not seen any version get filtered out. My own page is not there, but only because I moved my pages (with 301 redirects) in March and April deepbot data has vanished. Since Google has only spidered a few of my new pages since then, much of my site is not there. Anyway, the point is, it's not a filter at work, just freshie's light nibbling at work.

So how does the Google algo know it's hit duplicate content? It obviously missed this case (which is just as well, imo -- hey, I wrote it, my name is on it, and I'm vain).

vitaplease

7:10 am on Jun 17, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

From what I remember at the Pubcon Boston, Google is much less focussed than for example Teoma on leaving out duplicate content. (or it was an excuse for Teoma to have a smaller index ;))

I would think their main focus would be on exact copies of domains/pages being hosted under similar url base names.

A Comparison of Techniques to Find Mirrored Hosts on the WWW [citeseer.nj.nec.com] (1999)

Msgraph had a thread on Duplicates and the challenges search engines face [webmasterworld.com]

The &filter=0 tip Googleguy gave was something new to me though.

Jenstar

7:20 am on Jun 17, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

From what I understand, when an article is reprinted on another website, there is enough content on that page itself that would keep it from being an exact duplicate.

For example, on your site you would have the article content itself, as well as your site's header, navigation side bar, and footer. And on the reprint site, that site would also have its own header, nav bar and footer. So for most average length articles online (300-600 word count), there would be enough content unique to its own site to keep it from being a duplicate. I might be concerned if they just print the article itself, without their own unique content added to the page.

Now, if they also reprinted your header, nav bar and footer, rather than using their own, or if only the article itself appears on each page, without the unique content on either, chances are that would trip a duplicate content filter, because it would be identical. And that I would be concerned about.

berli

9:18 pm on Jun 17, 2003 (gmt 0)

10+ Year Member

Now, if they also reprinted your header, nav bar and footer, rather than using their own, or if only the article itself appears on each page, without the unique content on either, chances are that would trip a duplicate content filter, because it would be identical. And that I would be concerned about.

Well, this is my exact situation . . . We're not exactly talking web geniuses here. The one guy I let host my article is probably using the exact same html file on both pages, although it might have some Quatsch code added by the free hosting services he's using. The other person hosting that article is also probably using the same html or very, very close. On my own site ('cause I'm lazy) it's probably the same html with only very minor changes. I'm converting my articles over to a new css format, and at that point I'll be adding some navbar stuff (I'd been neglecting that up to this point because I've been serving all that stuff as text files) and a lot of the code will change.

So it actually has surprised me quite a bit that it hasn't triggered the duplicate content filter.

I do know that the directory structure and internal linking among the various sites is completely different, so as long as Google is also checking for that, I guess we're all covered.

markymarky

7:51 am on Jun 18, 2003 (gmt 0)

10+ Year Member

Hi, many thanks for the links to the documents and the ideas put forward so far. Unfortunately I am still not sure whether I have sites that will be treated as mirrors/duplicates. I (probably wrongly) understand from the algorithm dosumentation that I am on the borderline (The maths is well beyond me).

Basically, what I have is 4 country specific websites, on a shared hosting server, with identical formats and layouts, with ASP generated pages, based upon 2 key elements - geography and categories. The pages run into large numbers as I am covering Country through to locality, with differences largely accounted for by changes in the titles of the pages, descriptions, headings etc.

My page URLs will therefore look something like this:

mydomain.com?C=32&S=25& etc. etc. or
mydomain.co.uk?C=21&S=15& etc. etc.

Variants can be creatd by users but, as a new site these are not too many at the moment. No page actually, in the real world, has the same content as another, but the content to an SE will look similar.

Is this going to be treated as duplicates of a mirror. Thanks in advance for your answers as this is very important to the success of the sites.