|Thoughts about how a natural selection algorithm might work |
| 6:25 am on Apr 16, 2011 (gmt 0)|
First off, 'algorithm' is not the appropriate word. Basically, the ordering of results is based on a set of data (some 200+ factors). The bots go out and collect the raw data, and this undergoes some basic calculations to give the 200 factors.
The ordering is then dependent on how each factor is weighted.
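To make the idea of weighted factors concrete, here is a minimal sketch. It assumes the ordering is a plain weighted sum of factor values, which is my simplification and not something Google has confirmed; the factor names and numbers are made up.

```python
from typing import Dict, List

def score(factors: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of factor values; a factor with no weight counts as zero."""
    return sum(weights.get(name, 0.0) * value for name, value in factors.items())

def rank(pages: Dict[str, Dict[str, float]], weights: Dict[str, float]) -> List[str]:
    """Return page URLs ordered from highest to lowest score."""
    return sorted(pages, key=lambda url: score(pages[url], weights), reverse=True)

# Hypothetical pages and factors, for illustration only.
pages = {
    "a.example": {"links": 0.9, "freshness": 0.1},
    "b.example": {"links": 0.4, "freshness": 0.8},
}

print(rank(pages, {"links": 1.0, "freshness": 0.2}))  # → ['a.example', 'b.example']
print(rank(pages, {"links": 0.2, "freshness": 1.0}))  # → ['b.example', 'a.example']
```

The same factor data gives a different order as soon as the weights change, which is the whole point of what follows.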
So what of 'machine learning'? The common conception is that the system simply notes which results are selected, which ones cause a quick return to the results page with another selection made, and a few other similar signals. But what to do with this feedback? It only tells you about individual sites, though it could be used by storing this additional 'in the field' data as an extra factor (or factors).
But perhaps there is more. One speculation is that the order could be randomly jiggled a bit to 'try out' lower ranked sites at a higher position. This would again be feedback on individual sites (more accurately, pages).
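A rough sketch of what 'jiggling' could look like, purely as speculation: neighbouring results are swapped with a small probability so a lower ranked page occasionally gets tried one place higher. The function name and the `swap_probability` parameter are my own inventions.

```python
import random
from typing import List, Optional

def jiggle(results: List[str], swap_probability: float = 0.1,
           rng: Optional[random.Random] = None) -> List[str]:
    """Return a copy of results with some neighbouring pairs swapped at random."""
    rng = rng or random.Random()
    jiggled = list(results)
    i = 0
    while i < len(jiggled) - 1:
        if rng.random() < swap_probability:
            jiggled[i], jiggled[i + 1] = jiggled[i + 1], jiggled[i]
            i += 2  # skip past the swapped pair so nothing moves more than one place
        else:
            i += 1
    return jiggled

print(jiggle(["a", "b", "c", "d", "e"], swap_probability=0.3))
```

Because a swapped pair is skipped, no result ever moves more than one position, which keeps the change subtle.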
What I would do is this. Test out hundreds of different 'algos' to see which was best. 'Best' could be determined by noting how often the user selected a higher result and how 'successful' that result was, for example the user not returning quickly to select another.
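One way 'best' could be turned into a number, as a sketch only. The `Session` record and the scoring rule (credit of 1/position for a click that does not bounce back) are assumptions of mine, picked because they reward both higher clicks and 'successful' clicks.

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass
class Session:
    clicked_position: int  # 1 = top result
    quick_return: bool     # True if the user came straight back to pick another result

def fitness(sessions: Iterable[Session]) -> float:
    """Average credit per session: 1/position for a click that stuck, 0 for a bounce."""
    sessions = list(sessions)
    if not sessions:
        return 0.0
    total = 0.0
    for s in sessions:
        if not s.quick_return:
            total += 1.0 / s.clicked_position
    return total / len(sessions)

log = [Session(1, False), Session(2, False), Session(1, True), Session(4, False)]
print(fitness(log))  # → 0.4375
```

Each algo being tested would get its own log, and the algo with the highest figure wins that round.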
So how to run hundreds of different algos?
Well, because the order depends only on the factor weightings, these can be modified in real time: different weighting sets are used, data is collected, and new weighting sets are created, somewhat randomised but favoring the direction of change (lower or higher for factor x or y or z etc) that worked better in the previous testing period. This is natural selection; over time it would discover the best weighting for each factor (for the current factor set; because the factors are interdependent in many cases, the relative weightings would change when additional factors are added).
Create, say, a thousand new weighting sets with random 'mutations' (subtle, say +/- 20% of a factor's weighting, either a single mutation or several co-mutations). Take the best performers once there is sufficient (statistically significant) data, and create new mutations of these. Rinse and repeat, keeping a league table of the best ever performers.
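The mutate, select, repeat loop described above can be sketched as follows. This is a toy under stated assumptions: the factor names are made up, and `simulated_fitness` with its hidden `IDEAL` weights is only a stand-in for real user feedback, which would be noisy and would need enough data to be statistically significant.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

Weights = Dict[str, float]

def mutate(weights: Weights, rng: random.Random,
           max_change: float = 0.2, max_co_mutations: int = 3) -> Weights:
    """Copy a weighting set and nudge one or several weights by up to +/- 20%."""
    child = dict(weights)
    count = rng.randint(1, min(max_co_mutations, len(child)))
    for name in rng.sample(sorted(child), count):
        child[name] *= 1.0 + rng.uniform(-max_change, max_change)
    return child

def evolve(seed: Weights, fitness: Callable[[Weights], float],
           generations: int = 50, population: int = 100, survivors: int = 10,
           rng: Optional[random.Random] = None) -> List[Tuple[float, Weights]]:
    """Return the league table of best ever (fitness, weights) pairs, best first."""
    rng = rng or random.Random()
    league = [(fitness(seed), seed)]
    for _ in range(generations):
        parents = [w for _, w in league]
        children = [mutate(rng.choice(parents), rng) for _ in range(population)]
        league.extend((fitness(c), c) for c in children)
        league.sort(key=lambda entry: entry[0], reverse=True)
        del league[survivors:]  # only the best ever performers stay in the table
    return league

# Stand-in for user feedback: closeness to a hidden, made-up 'ideal' weighting.
IDEAL = {"links": 2.0, "freshness": 0.5, "length": 1.0}

def simulated_fitness(weights: Weights) -> float:
    return -sum((weights[k] - IDEAL[k]) ** 2 for k in IDEAL)

start = {"links": 1.0, "freshness": 1.0, "length": 1.0}
best_score, best_weights = evolve(start, simulated_fitness, rng=random.Random(0))[0]
print(best_score, best_weights)
```

This sketch picks mutations purely at random and lets selection do the work; it does not bias new mutations towards the direction that helped last time, which would be a further refinement.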
This is not trying out different pages in different positions - it is something much better - trying out different algos. Visitors to Google will also, imperceptibly, be scoring algos (weighting sets): a 'scalable solution' :)
Since I heard mention of a 'breakthrough' at Google I have been trying to think what it could be. If G does not currently do this, I recommend that it does. The tweaking is done by every user and would naturally tend towards the actual ideal.
| 1:46 pm on Apr 16, 2011 (gmt 0)|
I think you've got part of the Google picture. They even told us that they tested 12000 algo changes last year at a low level, and then 500 of them were launched.
| 2:14 pm on Apr 16, 2011 (gmt 0)|
@Tedster - sounds like this was second-guessing how to weight various factors. 12,000 is a good size, but why not implement full natural selection? If the goal is to present the best results as judged by users, that's the way to go (actually put the tweaking power in users' hands).
As an aside, an interesting alternative would be to offer the user a set of sliders on the search page, such as: concise content - detailed content; popular - academic; textual - multimedia rich.
| 6:44 pm on Apr 16, 2011 (gmt 0)|
|So how to run hundreds of different algos |
You look for clustering and focus in on a final result. Outliers would get excluded. This is what happens in any environment that relies on multiple numerical models (algorithms), each using different equations, physics, parameterizations, etc. Over time, the machine (statistical algorithms) can learn to identify an expected result based on a statistical dataset (probability distribution) of previous events/cases/results or from tests using random variables.
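A small sketch of 'look for clustering, exclude outliers' applied to several models' estimates of the same quantity. The specific rule (drop anything more than a few median absolute deviations from the median, then average the rest) is one common choice that I am assuming for illustration; it is not a claim about what any search engine uses.

```python
from statistics import mean, median
from typing import List

def consensus(estimates: List[float], tolerance: float = 3.0) -> float:
    """Average the estimates that cluster around the median, ignoring outliers."""
    centre = median(estimates)
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = median(abs(x - centre) for x in estimates)
    if mad == 0:
        return centre  # more than half the models agree exactly
    kept = [x for x in estimates if abs(x - centre) <= tolerance * mad]
    return mean(kept)

# Six hypothetical models score the same page; one is far off and gets excluded.
print(consensus([10.0, 10.5, 9.5, 10.2, 9.8, 50.0]))
```

With the 50.0 dropped, the result is the average of the five estimates that cluster around 10.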
| 7:28 pm on Apr 16, 2011 (gmt 0)|
Exactly - Matt Cutts calls them "edge cases". And with Panda, that's what Google asked the webmasters for within a week, with this thread on their Webmaster Help Forum [google.com].
It's also possible that eHow was an outlier in the other direction with Panda 1 - and the tweaks they made for Panda 2 are the reason eHow then took a hit.
| 7:49 pm on Apr 16, 2011 (gmt 0)|
Well, I think you miss the point. What G does now is just a faint approximation of what could be done. The present system is a half-witted approach; but it keeps them in work fiddling with statistical analysis and their fingers on the controls.
I really shouldn't have bothered with the post.
| 7:53 pm on Apr 16, 2011 (gmt 0)|
What else do you think Google could be doing?
| 8:03 pm on Apr 16, 2011 (gmt 0)|
|I really shouldn't have bothered with the post. |
This is a good post! Don't get frustrated. This discussion gets us thinking about the numerical/statistical aspects versus conspiracy and single-factor theories. Google has over a decade's worth of data, patents, equations, filters, billions of documents on the web, enormous language/semantic datasets, etc. With so much data and experience, it's evolving into a complex exercise in statistical modeling at this point.
| 9:20 pm on Apr 16, 2011 (gmt 0)|
Only ehow.co.uk took a hit. ehow.com is still stronger than ever.