Oliver_Henniges - 6:37 pm on Feb 17, 2007 (gmt 0)
If I understood the patent correctly, this is all done "on the fly", while crawling, evaluating and indexing a batch of a couple of million pages on the web. At least the spam detection patent is NOT applied to the whole index in one big loop. How is this subset of a few million pages preselected? By chance and by link structure in the normal crawl? It is impossible to store the co-occurrence matrix as an intermediate result unless you concentrate on a core of a few thousand of the most spammy keywords and phrases. Again: if we want to move towards a closer understanding (and perhaps a simulation) of the mechanisms at work, it is essential to narrow the problem down to a level that is computable on a normal PC. If I'm completely wrong about this, please enlighten me about the passages I overlooked.
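To make the "core of a few thousand phrases" idea concrete, here is a minimal sketch of how one might count co-occurrences for a small, fixed phrase list on a normal PC. Everything in it is my own assumption for illustration: the function name `cooccurrence_counts`, the sample documents, the core list, and the choice of plain substring matching within one document as the definition of "co-occurring". It is not the procedure described in the patent.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents, core_phrases):
    """Count in how many documents each pair of core phrases appears together.

    Only pairs that actually occur are stored, so memory grows with the
    number of observed pairs and not with the square of the phrase universe.
    Matching is naive substring matching (an assumption of this sketch).
    """
    core = [p.lower() for p in core_phrases]
    counts = Counter()
    for text in documents:
        lowered = text.lower()
        # Phrases from the core list that are present in this document.
        present = sorted({p for p in core if p in lowered})
        # Every unordered pair of present phrases co-occurs once here.
        for a, b in combinations(present, 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical toy data, only to show the shape of the result.
docs = [
    "Cheap pills and online casino bonus here",
    "Online casino reviews with cheap pills offers",
    "A recipe for apple pie",
]
core = ["cheap pills", "online casino", "apple pie"]
counts = cooccurrence_counts(docs, core)
```

With a core of a few thousand phrases, the number of possible pairs is in the low millions, which fits in memory easily. The same approach applied to every phrase on the web would not.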
The universe of "all possible phrases" is gigantic, even for three-word phrases and even for a single language. To me the key issue seems to be the mechanisms by which Google narrows down this mass.
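A rough sketch of the scale problem and of one obvious way to cut it down. The vocabulary size of 100,000 words is an assumed round figure, and keeping only word sequences seen at least `min_count` times is my own stand-in for whatever narrowing step Google actually uses; the helper names are hypothetical too.

```python
from collections import Counter


def phrase_universe_size(vocab_size, max_len):
    """Number of possible word sequences of length 1..max_len."""
    return sum(vocab_size ** n for n in range(1, max_len + 1))


def frequent_ngrams(documents, n, min_count):
    """Keep only the n-word sequences that occur at least min_count times.

    Tokenisation is plain whitespace splitting (an assumption of this sketch).
    """
    counts = Counter()
    for text in documents:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Assumed vocabulary of 100,000 words, phrases of up to three words.
universe = phrase_universe_size(100_000, 3)

# Hypothetical toy corpus: only one trigram is seen twice and survives.
kept = frequent_ngrams(["a b c d", "a b c e", "x y z"], n=3, min_count=2)
```

Under these assumptions the theoretical universe is about 10^15 phrases, while the set that survives the frequency cut is bounded by what actually appears in the crawled pages, which is the only part a simulation would ever have to store.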