
Google SEO News and Discussion Forum

    
Google's Historical Data Patent Gets an Update
tedster




msg:3747166
 3:06 pm on Sep 18, 2008 (gmt 0)

Arguably the biggest stir ever generated by a Google patent came from the 2005 Historical Data patent application [webmasterworld.com]. The real eye-opener for many was that Google was considering and tracking so many potential factors in the algo. Some thought it might even be a smoke screen - surely Google can't be this ambitious, right?

In the three years since that patent, we've seen that Google IS measuring these historical factors, and they keep getting folded into the SERPs in new and sometimes frustrating ways.

In March of this year, Google was officially granted its 2008 Historical Data patent [patft.uspto.gov] by the USPTO. The language is more readable, and the patent goes beyond the earlier version by spelling out some of the practical implementations more explicitly. (please see my note below - the language in the new patent is NOT revised in the way or degree that I originally thought. -tedster)

...if the content of a document changes such that it differs significantly from the anchor text associated with its back links, then the domain associated with the document may have changed significantly (completely) from a previous incarnation. This may occur when a domain expires and a different party purchases the domain... All links and/or anchor text prior to that date may then be ignored or discounted.

....or

The dates that links appear can also be used to detect "spam," where owners of documents or their colleagues create links to their own document for the purpose of boosting the score assigned by a search engine. A typical, "legitimate" document attracts back links slowly.

A large spike in the quantity of back links may signal a topical phenomenon (e.g., the CDC web site may develop many links quickly after an outbreak, such as SARS), or signal attempts to spam a search engine (to obtain a higher ranking and, thus, better placement in search results) by exchanging links, purchasing links, or gaining links from documents without editorial discretion on making links.

...or even this tidbit, that hints at the Supplemental index, or whatever it has morphed into these days:

In some situations, data storage resources may be insufficient to store the documents when monitoring the documents for content changes. In this case, search engine 125 may store representations of the documents and monitor these representations for changes. For example, search engine 125 may store "signatures" of documents instead of the (entire) documents themselves to detect changes to document content.
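To picture what storing "signatures" instead of full documents could look like, here is a minimal sketch - purely my own illustration with invented thresholds, not anything spelled out in the patent: keep a small fingerprint of each page and compare fingerprints between crawls to flag significant change.

import hashlib
import re

def signature(text, buckets=64):
    """Toy document fingerprint: hash each word into one of `buckets`
    slots and keep a normalized frequency profile. Real systems use far
    more robust schemes, but the storage idea is the same: keep the
    small profile, not the whole document."""
    counts = [0] * buckets
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        slot = int(hashlib.md5(word.encode()).hexdigest(), 16) % buckets
        counts[slot] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]

def changed_significantly(old_sig, new_sig, threshold=0.5):
    """Flag a page as 'significantly changed' when the two profiles
    diverge. The threshold is arbitrary, purely for illustration."""
    distance = sum(abs(a - b) for a, b in zip(old_sig, new_sig))
    return distance > threshold

# Usage: store signature(page_text) at crawl time; on the next crawl,
# compare the stored signature against the fresh one.
old = signature("blue widgets and widget repair guides")
new = signature("cheap pharmacy pills discount offers")
print(changed_significantly(old, new))  # True - looks like a different page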

This version of the patent shows us even more thoroughly that the algo's logic is getting "fuzzy". The same factor is sometimes a plus and sometimes a minus, depending on other factors.

There's language in here that shows Google making a clear distinction between "authority" and "trust". There are also sections that address the ways that freshness might sometimes be good and sometimes be bad. Google is simply getting better at seeing some kinds of manipulation, and the main take-away for me is "get it right the first time." Changes made only to improve ranking are under inspection!

[edited by: tedster at 10:51 am (utc) on Sep. 27, 2008]

 

tedster




msg:3747204
 3:32 pm on Sep 18, 2008 (gmt 0)

Another new factor in this patent [* again, please see the above note] is the way that traffic is mentioned prominently as one of the factors whose changes are measured:

It may be discovered that there are periods when a document is more or less popular (i.e., has more or less traffic), such as during the summer months, on weekends, or during some other seasonal time period. By identifying repeating traffic patterns or changes in traffic patterns, search engine 125 may appropriately adjust its scoring of the document during and outside of these periods.

...and

search engine 125 may monitor one or a combination of the following factors:
  1. the extent to and rate at which advertisements are presented or updated by a given document over time...

  2. the quality of the advertisers...

  3. the extent to which the advertisements generate user traffic to the documents to which they relate (e.g., their click-through rate).

Search engine 125 may use these time-varying characteristics relating to advertising traffic to score the document.

There's even another mention of "impressions" in the SERPs - not clicks, just impressions.

Yet another query-based factor may relate to the extent to which a document appears in results for different queries. In other words, the entropy of queries for one or more documents may be monitored and used as a basis for scoring.
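The "entropy of queries" idea is more concrete than it sounds. Here is a rough sketch of what such a measurement could look like - my own illustration, not the patent's math: count how evenly a document's impressions are spread across the distinct queries it appears for.

import math
from collections import Counter

def query_entropy(impressions_by_query):
    """Shannon entropy of the queries a document appears for.
    A page that only ever surfaces for one query has entropy 0;
    a page appearing evenly across many queries scores higher."""
    total = sum(impressions_by_query.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in impressions_by_query.values():
        p = count / total
        if p > 0:
            entropy -= p * math.log2(p)
    return entropy

# One page surfacing for a broad mix of queries vs. one query only.
broad = Counter({"blue widgets": 40, "widget repair": 35, "buy widgets": 25})
narrow = Counter({"blue widgets": 100})
print(query_entropy(broad), query_entropy(narrow))  # ~1.56 vs 0.0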

[edited by: tedster at 5:01 pm (utc) on Sep. 19, 2008]

waynne




msg:3747216
 3:45 pm on Sep 18, 2008 (gmt 0)

I've seen a few sites run link building campaigns, get a short-term boost in ranking, and then drop off the SERPs again. Google needs to see a nice steady stream of new links over time.

I believe they are looking at the RATE of link acquisition as a factor, which this patent touches on. The Webmaster Tools link report gives an idea of the data Google uses, and we need to remember that they can archive this and compare month-on-month link rates.

The build and coast days are over (and have been for a while!)
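To make the "rate of link acquisition" idea concrete, here is a rough sketch of how that kind of monitoring could work - my own guess at the mechanics with invented thresholds, not anything Google has published: bucket newly discovered backlinks by month and flag months that spike far above the trailing average.

from collections import Counter
from datetime import date

def monthly_link_counts(link_discovery_dates):
    """Bucket newly discovered backlinks by (year, month)."""
    return Counter((d.year, d.month) for d in link_discovery_dates)

def spike_months(counts, factor=4, min_history=3):
    """Flag months whose new-link count is `factor` times the average
    of the preceding months. Thresholds are invented for illustration."""
    flagged = []
    months = sorted(counts)
    for i, month in enumerate(months):
        if i < min_history:
            continue
        history = [counts[m] for m in months[:i]]
        baseline = sum(history) / len(history)
        if counts[month] > factor * baseline:
            flagged.append(month)
    return flagged

# A site picking up a handful of links a month, then 300 in one month.
links = [date(2008, m, 1) for m in (1, 1, 2, 3, 3, 4, 5)] + [date(2008, 6, 1)] * 300
print(spike_months(monthly_link_counts(links)))  # [(2008, 6)]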

Kazim_Shah




msg:3747791
 11:12 am on Sep 19, 2008 (gmt 0)

Hello - this is a very knowledgeable read.

tedster




msg:3748060
 4:55 pm on Sep 19, 2008 (gmt 0)

I've got a big apology to make. The language of the sections I'm looking at - and the quotes I included - are not new. The 2005 patent has the same exact information. I was relying too much on memory and I really didn't remember the first patent as having a lot of these points. I was wrong and I should have verified my impressions before posting.

It is remarkable how many areas people considered "smoke and mirrors" back then have now shown up in the real world. I'm thinking this patent is worth a very close study - especially with an eye to factors that we may not be noticing at present. Odds are, they either are already being folded in or they will be soon.

dstiles




msg:3748257
 8:57 pm on Sep 19, 2008 (gmt 0)

A worrying thing about google is their apparent insistence on natural back-links. Many sites never attract backlinks. Their clientele isn't the kind to post links on their web sites even in the rare cases where they have such a thing. Often the visitor may come into the site once, complete a necessary transaction of some kind and never return. The response of some webmasters/owners is to buy or cajole links, which falsifies google's view of the site and which they now appear to penalise - if they can prove it.

Expiring domains isn't the only reason for significant change of page content, and the content may still be relevant whilst being very different from an earlier incarnation.

And the seasonal time period one is guaranteed to send webmasters and site owners into a depressed spin! Two weeks before the start of an expected summer or Christmas rush and still no sign of the site! Actually, if Google is playing with this one, it could explain the current distress of at least one or two of my customers.

g1smd




msg:3749720
 6:30 pm on Sep 22, 2008 (gmt 0)

No apology needed, Tedster. I am sure there are many people who are not all that aware of the 2005 discussions, or weren't in a position back then to really understand what was being discussed. So, thanks for breathing some life into this topic.

suzukik




msg:3750015
 12:34 am on Sep 23, 2008 (gmt 0)

According to a further implementation, the analysis may depend on the date that links disappear. The disappearance of many links can mean that the document to which these links point is stale (e.g., no longer being updated or has been superseded by another document). For example, search engine 125 may monitor the date at which one or more links to a document disappear, the number of links that disappear in a given window of time, or some other time-varying decrease in the number of links (or links/updates to the documents containing such links) to a document to identify documents that may be considered stale. Once a document has been determined to be stale, the links contained in that document may be discounted or ignored by search engine 125 when determining scores for documents pointed to by the links.

According to another implementation, the analysis may depend, not only on the age of the links to a document, but also on the dynamic-ness of the links. As such, search engine 125 may weight documents that have a different featured link each day, despite having a very fresh link, differently (e.g., lower) than documents that are consistently updated and consistently link to a given target document. In one exemplary implementation, search engine 125 may generate a score for a document based on the scores of the documents with links to the document for all versions of the documents within a window of time. Another version of this may factor a discount/decay into the integration based on the major update times of the document.

I can hardly understand the second paragraph.
(English is not my native tongue.)

Would anybody explain what "dynamic-ness" means?
Preferably with an example.

tedster




msg:3750024
 12:48 am on Sep 23, 2008 (gmt 0)

It means the links on the page change frequently - usually in an automated or dynamic manner. Think of the home page for many blogs, where the links "slide off" the home page, or websites that feature "news of the day" on the home page.
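If you want to put a number on that "dynamic-ness", one rough way - purely my own illustration, not anything from the patent text - is to compare the set of links seen on consecutive crawls of a page and measure how much of it turns over.

def link_churn(previous_links, current_links):
    """Fraction of the page's outbound link set that changed between two
    crawls (1 - Jaccard similarity). A blog home page whose featured
    links roll off daily scores near 1; a stable resource page scores
    near 0."""
    previous, current = set(previous_links), set(current_links)
    union = previous | current
    if not union:
        return 0.0
    return 1 - len(previous & current) / len(union)

# A "link of the day" style page vs. a stable links page.
print(link_churn({"/post-41", "/post-42"}, {"/post-42", "/post-43"}))  # ~0.67
print(link_churn({"/guide", "/faq"}, {"/guide", "/faq"}))              # 0.0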

tedster




msg:3750034
 1:04 am on Sep 23, 2008 (gmt 0)

It was just pointed out to me that the original application date for this patent was 2003, but it wasn't published until 2005. The 2003 application date appears on this final version, just down a few lines.

The 2005 publication date is shown on this version of the patent application [appft1.uspto.gov].

This has been a funny patent to watch over the years. The original disappeared from the USPTO website for a while last year.

At any rate, this "version" represents the fact that the patent is now granted, as described in the March 18, 2008 document linked from the opening post. The other documents represent applications, and the changes are apparently minimal and mostly of interest to the lawyers, not webmasters.

Marcia




msg:3750083
 2:20 am on Sep 23, 2008 (gmt 0)

The application date was December 31, 2003, and it was published in 2005, but the infamous Florida update was just a month before that original application, in November 2003.

There were also some other major changes earlier during 2003. I'm wondering if it isn't likely that some of the precepts from the patent started to show subtle signs of appearing throughout 2003, prior to when the actual application was filed.

suzukik




msg:3750182
 6:10 am on Sep 23, 2008 (gmt 0)

Thank you for your quick answer, ted.

So, links that have "dynamic-ness" may not be valued as highly as ordinary links even if they are fresh.
Am I right?

thegypsy




msg:3757377
 1:05 pm on Oct 2, 2008 (gmt 0)

Marcia - I have to believe that most of the patent filings have been implemented, or at least tested on select DCs, by the time they file them. I doubt they wait until it's awarded. Depending on the time elapsed in between, they may have tested and departed from the method, or implemented it in some way.

Anyway, I wrote about this patent re-release a while back over 4 posts... still interesting stuff. There are also Microsoft patents on using historical factors for spam detection...

This patent does also have some hints at 'Query Deserves Freshness' that Google has discussed.

tez899




msg:3757843
 12:51 am on Oct 3, 2008 (gmt 0)

What happens when you scrape websites from before Google was born, or even from before the 'Historical Data Patent' was filed? Even after that, who dares to check whether content is really 'duplicate content' when it's from the olden days? How much data can the dupe content filter hold? The 'duplicate content' filter is a myth. I insist on unique content for my new projects, but at the end of the day, read this;

/ Google is not case sensitive when it comes to our SEO work... but as Andy mentioned with acronyms, that's a different story. Ranking for 'Cheapest X' and 'Cheap X' is similar; to the human eye it's practically the same, and likewise to the Google eye. Google is a 'human', which can therefore recognise capital letters and phrase similarities.

/

Written by me - Terry Bytheway

Why does Google treat the first website indexed with the 'unique' content as the one that originally published it? A human cannot see that it's a copy or duplicate, nor can Google's bot. After all, Google is a bot. A human needs instinct, and let's be frank, a bot cannot tell right from wrong, nor can it decide in a split second which company has sole rights to pictures or stories.

To expand on that: the 'Google Bot' you know isn't just a server running 2 trillion processes a second. A human stands behind the bot's decisions. A human with a basic instinct.

Love is hate; after all, Google does love and hate, no? Since when can a machine understand emotion? -- Terry Bytheway

Marcia




msg:3798888
 7:44 am on Dec 3, 2008 (gmt 0)

How much data can the dupe content filter hold?

There's no "filter" that works like that. The processes are clearly described in their patents (and patent applications) on detecting duplicate or near-duplicate content. Patents and papers about duplicate detection date way back to when AltaVista was alive and kicking as the hot search engine of the day. Alta had a patent which Overture, and then Yahoo!, acquired, and Yahoo! has since had other papers published, including an informative one by Andrei Broder about detecting duplications using "shingles."
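For anyone who hasn't read the Broder paper, the core shingling idea fits in a few lines. This is a toy version for illustration, not the production algorithm (which adds min-hashing and sampling): break each document into overlapping word n-grams and compare the two sets.

import re

def shingles(text, n=4):
    """Set of overlapping n-word 'shingles' from a document."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def resemblance(doc_a, doc_b, n=4):
    """Jaccard similarity of the two shingle sets - Broder's 'resemblance'.
    Near-duplicates score close to 1 even when a few words differ."""
    a, b = shingles(doc_a, n), shingles(doc_b, n)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

original = "ten tips for choosing a blue widget that lasts for years"
scraped  = "ten tips for choosing a blue widget that lasts for ages"
print(resemblance(original, scraped))  # ~0.78 - flagged as a near-duplicate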

A human cannot see that it's a copy or duplicate, nor can Googles bot

I can, and so can other people, and so can Google. And I can definitively state beyond a shadow of a doubt (and have seen for myself) that they can - and they do - detect duplicate or near-duplicate pages. The original will be included, as can pages with some content scraped from it, but a page on another site that's enough of a near-duplicate will be totally filtered out - and that's for an exact match quote (in quotes) of 10 to 12-14 words copied and pasted directly from the near-duplicate page.

[edited by: Marcia at 7:50 am (utc) on Dec. 3, 2008]

Lorel




msg:3799151
 3:15 pm on Dec 3, 2008 (gmt 0)

Re: Google knowing what's duplicate content and who published it first: I often find scrapers who have copied my articles word for word, but I never find them ranking anywhere in Google, so I usually ignore them.

jpservicez1




msg:3799382
 7:50 pm on Dec 3, 2008 (gmt 0)

I have to ask...

....
the extent to and rate at which advertisements are presented or updated by a given document over time...

the quality of the advertisers...

the extent to which the advertisements generate user traffic to the documents to which they relate (e.g., their click-through rate).
...

How can Google tell the quality of advertisements if the ads are in JavaScript? I thought Google couldn't index or follow JavaScript.

potentialgeek




msg:3799451
 8:50 pm on Dec 3, 2008 (gmt 0)

The main take-away for me is "get it right the first time." Changes made only to improve ranking are under inspection!

+1. I agree 100%.

This is the first time I've seen the Patent, so I have a few comments.

...if the content of a document changes such that it differs significantly from the anchor text associated with its back links, then the domain associated with the document may have changed significantly (completely) from a previous incarnation. This may occur when a domain expires and a different party purchases the domain... All links and/or anchor text prior to that date may then be ignored or discounted.

What are the mechanics of this? Either Google has to keep on file an original copy of the entire document, or it simply compares the old IBL text with the current document. Given the amount of space it would take to keep every single page Google ever visited the first time, presumably it's the latter. It's a pretty simple process to compare the old link text to the current page. Most of the time people put the page title (exactly or with minor changes) as the link text.
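A crude way to picture that comparison - purely my own sketch of the mechanics, not something taken from the patent: keep the anchor text previously seen pointing at the page, then measure how much of it still appears in the page's current content.

import re

def terms(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def anchor_overlap(stored_anchor_texts, current_page_text):
    """Share of historical anchor-text terms still present on the page.
    A low score suggests the page (or the whole domain) no longer
    matches what the old backlinks described."""
    anchor_terms = set().union(*(terms(a) for a in stored_anchor_texts))
    if not anchor_terms:
        return 1.0
    return len(anchor_terms & terms(current_page_text)) / len(anchor_terms)

old_anchors = ["blue widget repair guide", "widget maintenance tips"]
print(anchor_overlap(old_anchors, "Blue Widget Repair Guide - maintenance tips"))  # 1.0
print(anchor_overlap(old_anchors, "Discount pharmacy - cheap pills online"))       # 0.0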

Incidentally, this is probably one reason why it's risky to change page titles... if the new titles no longer look similar to the old link text, their value could be diminished. We always have to remember history and the dynamics of site changes. I'm not suggesting you'd lose all link value, but it could erode the value if Google grafts the patent principle onto other parts of its algo.

search engine 125 may monitor one or a combination of the following factors:
the extent to and rate at which advertisements are presented or updated by a given document over time... the quality of the advertisers... the extent to which the advertisements generate user traffic to the documents to which they relate (e.g., their click-through rate). Search engine 125 may use these time-varying characteristics relating to advertising traffic to score the document.

Explanation? What is the advertising quality valuation model? Are AdWords advertisers inherently the best? ;/ Seriously, how does Google propose to figure out how many people click on ads unless they have that data? They don't have access to my raw logs or the logs of the target URLs. They don't know who clicks. The only advertising data they have is AdWords/AdSense clicks. So does this mean I get penalized for having a poor CTR on my site, because the advertiser writes lame ads, or I didn't put the ads high enough on the page to get lots of clicks?

Did Google ever elaborate on the idea in another patent or elsewhere?

p/g

youfoundjake




msg:3799673
 2:59 am on Dec 4, 2008 (gmt 0)

What I really find interesting is that Google appears to discount links accumulated by a website, if that website changes owner and the content is changed, to the point that all the backlinks with custom-crafted anchor text go out the window.

youfoundjake




msg:3799683
 3:17 am on Dec 4, 2008 (gmt 0)

What are the mechanics of this? Either Google has to keep on file an original copy of the entire document, or, it simply compares the old IBL text with the current document. Given the amount of space it would take to keep every single page Google ever visited the first time, presumably it's the latter.

This ability to compare pages over a long period of time has to be related to Google's recent feature of having the SERPs go back 10 years... I sure am curious about the mechanics of this. What does Google use to back up the internet? Heh.
[webmasterworld.com...]

misterjinx




msg:3805006
 9:19 am on Dec 11, 2008 (gmt 0)

Add the &as_qdr parameter to your URL with one of the options (d for day, m for month, and y1 through y9 for years).
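For example, a search restricted to roughly the past five years would look something like this (Google has never formally documented the parameter, so treat the exact behaviour as approximate):

http://www.google.com/search?q=historical+data+patent&as_qdr=y5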
