

Differentiating Duplicate vs. Serialized Content

How to avoid Google penalizing weekly columns and index archives?


MikeNoLastName

12:27 am on Aug 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have been studying the way Google handles and indexes periodical content. Our website is about 2/3 magazine format and 1/3 static content. We have about a half dozen professionally written columns, each by the same author week after week, under the same column name (e.g. The Widget Times presents - Joe Smith's Weekly News Column), which are replaced every week with current news. The template for each page (header, links, title; see below) is identical: because of the mass of content we need to put out weekly, we've tried to automate as much as possible so that non-HTML-programmers can ready the pages for publishing. No frames are ever used!
The previous week's column page (which begins life as www.domain.com/directory/index.htm) is then renamed with a weekly designator such as js080203.htm and saved in the directory with a prev link, a next link, a link to the domain.com home page, etc. The new week's page takes its place as index.htm and points to the renamed one as its prev. This way no archived page ever needs to be changed, the pages always form a continuous chain, and when someone offsite links to one for reference, it never goes away. There is also a mutually linked archive page that briefly outlines each week's content and links to each page, covering the last year's worth of material. Many of our regular readers access past content and have requested it, so we have made the archives available for searching and indexing. We even recently paid to add a sitewide search engine to help users retrieve this content, since Google won't index it all.
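To give a concrete picture of the rotation, here is a stripped-down sketch of the kind of script we run each week (the paths, the thisweek.htm body file, and the %%PREV_LINK%% placeholder token are all made up for this post; the real one does more):

    # Weekly column rotation: archive the current issue under a dated
    # name, then publish the new issue as index.htm with a prev link
    # pointing at the page it replaced. Names are illustrative only.
    import shutil
    from datetime import date
    from pathlib import Path

    def rotate_column(column_dir, prefix, new_body_file):
        column_dir = Path(column_dir)
        index = column_dir / "index.htm"

        # e.g. js080203.htm (MMDDYY, matching our naming scheme)
        archived = "%s%s.htm" % (prefix, date.today().strftime("%m%d%y"))
        shutil.move(str(index), str(column_dir / archived))

        # Publish the new issue as index.htm, pointing its prev link
        # at the page just archived. %%PREV_LINK%% is a hypothetical
        # placeholder in the body template, invented for this sketch.
        body = Path(new_body_file).read_text()
        index.write_text(body.replace("%%PREV_LINK%%", archived))

    rotate_column("widgets/jsmith", "js", "thisweek.htm")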
Over the last few months I've noticed that while alltheweb and many other SEs find and index every single archive page, Googlebot visits a few times a week but tends to index only the most recent issue of each column (title) and maybe a handful of past articles. At times it may even pick up every single column going back 1-2 months, but these later vanish and get replaced with the newer ones. Each column's main page (the current week's page) is at least PR4.
I'm assuming some of this may be based on Google deciding that, since the title, 95% of the links, and about 10% of the text (mostly near the top) of each page are the same, and the file name contains a number, the pages are intentional duplicate content.
Over the last few months we've tried changing the title, description, and keyword metas of each page to reflect and describe each week's unique content, instead of reusing the column name and author and describing only the column's general topic. It is significantly more work (when multiplied by the number of columns we publish each week), and it has brought only very minor success at getting more pages indexed.
I've noticed that many of our competitors, daily newspapers, national magazines, etc. manage to have every single past news column indexed and searchable (even ones that incorporate a heavily copied syndicated newsfeed), even when all that changes in the title is a date. Some end up with tens of thousands of backlinks and PR8, with many apparently duplicate columns.
I've also noticed a lot of intentional spam sites that simply list pages full of dictionary terms to reroute visitors, and even they routinely get indexed.

1. Should G be indexing archival content, and should it be doing a better job of distinguishing important repeating content from intentional spam?
2. Does anyone know of a better way to designate similar-but-different content (e.g. historical sports scores, stock market reports, recipes, similar products with a separate page for each, etc.) so that each page gets indexed, without recreating every one from scratch? Is there maybe a Google registry for periodicals?
3. Does anyone have experience with the best interlinking structure in this scenario for the highest PR for both the most current column page and the domain home page, without G thinking one is simply spamming? Is it excessive to link 20-30 static navigation pages and other related news column home pages from every weekly article's margin?
4. Some of our columns are syndicated in print and online, although we are the original source. When someone else copies 90% of a page of our content every week, whether by permission or without, how does G decide who is the original source and who is copying?

Mike

bilalak

9:08 am on Aug 12, 2003 (gmt 0)

10+ Year Member



I suggest you use a different directory name for each issue.

Google does not know which is the original copy of the material because it indexes pages at different times, and even if it tried to work it out, the bot might still have trouble deciding.

I have programmed a script for a newspaper in Arabic, and almost all of its pages are indexed or reindexed from time to time. The best interlinking structure is to follow logical human indexing: Year/Month/Week/Day is most suitable.
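A minimal sketch of what I mean by that layout (the root, column name, and file names are only examples):

    # Map an article's date to a Year/Month/Day archive path, so the
    # interlinking mirrors how a human would browse back issues.
    from datetime import date
    from pathlib import Path

    def archive_path(root, column, d, filename):
        # e.g. archive/jsmith/2003/08/12/index.htm
        return (Path(root) / column / ("%04d" % d.year)
                / ("%02d" % d.month) / ("%02d" % d.day) / filename)

    print(archive_path("archive", "jsmith", date(2003, 8, 12), "index.htm"))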

Try to make use of META tags for the articles; they are very important for recurring articles.
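For example, something along these lines, filled in per issue rather than repeated verbatim (the field names and sample values are just a sketch of the idea):

    # Render per-issue title/description/keywords tags from article
    # data, so no two issues share the same head section. All field
    # names here are illustrative.
    HEAD_TEMPLATE = """<title>%(column)s - %(headline)s (%(date)s)</title>
    <meta name="description" content="%(summary)s">
    <meta name="keywords" content="%(keywords)s">"""

    issue = {
        "column": "Joe Smith's Weekly News Column",
        "headline": "Widget prices fall for third straight week",
        "date": "Aug 12, 2003",
        "summary": "Joe Smith on the widget price slide and what it means.",
        "keywords": "widgets, prices, weekly column",
    }
    print(HEAD_TEMPLATE % issue)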

Luck!

eztrip

1:38 pm on Aug 12, 2003 (gmt 0)

10+ Year Member



Does this affect pages that have things such as a Tip of the Day, where some small piece of sidebar text changes daily, or even per page request?

Anyone know? I have a sidebar with tips that are pulled from a database and randomly selected per request. This isn't there to get Google to reindex; it's there for the user's information.
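For reference, the selection is roughly like the sketch below, except that mine picks per request rather than per day. I've been wondering whether keying off the date instead would at least keep the page stable for a crawler within a given day (just an idea, untested):

    # Pick a "tip of the day" deterministically from the date, so the
    # page looks the same to every visitor (and crawler) on a given
    # day, instead of changing on every request.
    from datetime import date

    TIPS = [
        "Back up your templates before editing.",
        "Give every page a unique title.",
        "Check your archive links after each rotation.",
    ]

    def tip_of_the_day(tips, today=None):
        today = today or date.today()
        return tips[today.toordinal() % len(tips)]

    print(tip_of_the_day(TIPS))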

Thanks

tribal

2:14 pm on Aug 12, 2003 (gmt 0)

10+ Year Member



If you are allowed to use scripting on your server (i.e. asp/php), you could dynamically create an index page of news articles, with links to all the articles. Then the crawler should be able to find the articles more easily. Doing it manually is an option too, of course, but it would require a lot of work.
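A rough sketch of the idea, written here in Python for brevity (the same thing is only a few lines of asp/php; the directory and file names are invented):

    # Generate a static hub page linking every archived article, so a
    # crawler can reach each one from a single index.
    from pathlib import Path

    def build_index(archive_dir, out_file="archive-index.htm"):
        archive = Path(archive_dir)
        links = []
        for page in sorted(archive.glob("*.htm")):
            if page.name == out_file:
                continue  # don't link the index to itself
            links.append('<li><a href="%s">%s</a></li>' % (page.name, page.stem))
        html = "<html><body><ul>\n%s\n</ul></body></html>" % "\n".join(links)
        (archive / out_file).write_text(html)

    build_index("widgets/jsmith")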

MikeNoLastName

7:12 pm on Aug 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for all the suggestions so far. We've seen some improving results from changing the metatags and titles from week to week as suggested, which we started a couple of months ago, but I suppose it will take time for each page to be revisited.
My primary concern is what G uses to determine whether a page is a duplicate or near duplicate of another, and the penalties associated with that.
For instance, when the title is the same across all of an author's weekly columns (for programming simplicity and consistency), PLUS nearly all the links are the same, AND a portion of the text near the top is the same each week, is that enough to get penalized? When Google enters a directory full of numbered (dated) files like this, does it assume it is a directory of spam? We're talking over a thousand 12K-20K weekly files, built from 6-10 different templates, of which maybe 1K-2K is duplicate header text and 36 out of 38 links are the same. Is this typically sufficient, in anyone's experience, to trip G's dup trigger?
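For what it's worth, Google has never published how its duplicate filter works, but the standard technique in the literature is shingling: split each page's text into overlapping runs of words and measure the overlap. A toy version, purely to illustrate how shared boilerplate inflates a similarity score (not a claim about G's actual algorithm):

    # Toy near-duplicate check via w-shingling: split each page's text
    # into overlapping runs of w words and compute Jaccard overlap.
    # Illustrates the general technique only; Google's real duplicate
    # detection is not public.

    def shingles(text, w=4):
        words = text.lower().split()
        return set(tuple(words[i:i + w]) for i in range(len(words) - w + 1))

    def similarity(a, b, w=4):
        sa, sb = shingles(a, w), shingles(b, w)
        if not (sa or sb):
            return 0.0
        return len(sa & sb) / float(len(sa | sb))

    week1 = "Joe Smith's Weekly News Column. This week widget prices fell sharply."
    week2 = "Joe Smith's Weekly News Column. This week widget output rose again."
    print("%.2f" % similarity(week1, week2))  # sizeable score from the shared header alone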
In our case G obviously finds them all when they are new and gives them a PR4-5, but then seems to forget about them once they are a little older and resets most of them to PR0. An additional note, which prompted me to start this thread: when I run an allinurl: query on the domain where all this lives, I get only TWO listings, and a blue bar which says:

"Results 1 - 2 of about 738"
and the typical
"In order to show you the most relevant results, we have omitted some entries very similar to the 2 already displayed. If you like, you can repeat the search with the omitted results included."

Doesn't this indicate that Google has decided the rest of the columns are so similar that they don't warrant listing, and thus that a penalty may be in effect?

Mike