PR algorithm you cannot cheat

re5earcher

1:10 pm on Jun 13, 2003 (gmt 0)

Guys, we've got to help Google develop the ultimate PR formula.

How about this algorithm to start with:

a. The higher the _number_of_different_ queries that hit your _site_, divided by the _number_of_pages_ on the site, the _lower_ your site PR - calculated over time.

This means that you cannot cheat on it either by placing tons of popular keywords on one page or by creating tons of small pages with keywords.

b. Title counts toward a slightly higher PR - it's limited in length anyway.

c. Backlinks and links do not count at all - not every home page is required to get listed in top directories.

d. Layout does not count at all - a page can be <h1><b> and still look nice with CSS.

e. If multiple pages have the same PR, they've _got_to_be_ displayed in _random_ order - this is the natural way our memory works.
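
Rule e can be sketched in a few lines of Python. This is only an illustration, not anything a search engine is known to run: the function name `rank_with_random_ties` and the `scores` mapping (page -> final score) are made up for the example, and "the same PR" is read here as exactly equal scores.

```python
import random
from itertools import groupby


def rank_with_random_ties(scores, rng=None):
    """Order pages by descending score; pages with equal scores are shuffled."""
    rng = rng or random.Random()
    # Sort best-first so that equal scores end up next to each other.
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    result = []
    for _, group in groupby(ordered, key=lambda item: item[1]):
        tied = [page for page, _ in group]
        rng.shuffle(tied)  # random order inside each group of ties
        result.extend(tied)
    return result


# Hypothetical scores: "a" and "b" tie, so their relative order varies per call.
print(rank_with_random_ties({"a": 5, "b": 5, "c": 7, "d": 3}))
```

Whether ties should mean exactly equal scores or merely similar ones is left open here; a looser rule would need the scores rounded or bucketed before grouping.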

Kackle

11:05 pm on Jun 13, 2003 (gmt 0)

So for a given site, you get a per-page average of the number of different search terms that hit on the site. This number is essentially an indication of how wide a net ("net" as in fishing for traffic) the site attempts to cast, as opposed to how focused the site is on a particular mission. You take this number and subtract it from the site's rank. Equal sites get displayed in random order.

I like it. Two questions:

1) This implies a per-site rank instead of a per-page PageRank the way Google does it now. We should call it SiteRank. Is this your intention? I think this is worth a try; I'd like to see what sort of results come back.

2) Have you calculated the overhead for keeping track of sites this way? It's probably not that bad compared to the overhead for PageRank as it stands now.

Hurry up and file a patent for SiteRank.

re5earcher

11:26 pm on Jun 13, 2003 (gmt 0)

Yep, a SiteRank.

The first rule basically will force:

- webmasters - to _limit_ their usage of popular keywords, most often not related to the content of the site - and moreover to create pure-content, small-size pages.

- SE users - to make their queries more precise - over time, general queries will give only general results, since no sites will be using general keywords.

So there will be an interesting situation - sites will be rushing down in size and improving their content quality while rushing up in ranking.

The overhead for search engine:

The overhead is _much_ smaller, because the tracking of unique queries is performed for _domains_ and not for _pages_.

So let's hope Google picks up this technique.

Kackle

11:42 pm on Jun 13, 2003 (gmt 0)

The overhead is _much_ smaller, because the tracking of unique queries is performed for _domains_ and not for _pages_.

Not only that, but the overhead can be scaled down. In other words, you can sample the hit rate for sites if you have to, instead of counting all hits.

I really like your idea. It's brilliant. I can't stop thinking about it.

It puts PageRank to shame. But that's exactly the reason Google would never do it. I think they're locked into PageRank, and cannot or will not think outside of their box.

Chris_R

12:00 am on Jun 14, 2003 (gmt 0)

And how is it that you can't cheat this?

doc_z

2:00 pm on Jun 14, 2003 (gmt 0)

re5earcher,

if I understand you correctly (otherwise please correct me), you are mixing two different things: the PR algorithm and the ranking algorithm (see this thread [webmasterworld.com]).

It seems that you're suggesting a new ranking algorithm (maybe someone should change the title of this thread?).

a. The higher the _number_of_different_ queries that hit your _site_, divided by the _number_of_pages_ on the site, the _lower_ your site PR - calculated over time.

I'm not sure if I understand your idea correctly. So far it seems to me that you would strongly favour 5 small product pages over having the same content on just one page (this effect would be multiplied by point b.)

b. Title counts toward a slightly higher PR - it's limited in length anyway.

I think the title is already an important factor.

c. Backlinks and links do not count at all - not every home page is required to get listed in top directories.

Are you suggesting removing any contribution from anchor text? Anchor text is the backbone of all major search engines. (Apart from PR, this was the main improvement in recent years.)

d. Layout does not count at all - a page can be <h1><b> and still look nice with CSS.

I mainly agree. Layout should count less.

e. If multiple pages have the same PR, they've _got_to_be_ displayed in _random_ order - this is the natural way our memory works.

Do you mean that pages with the 'same on-page scoring' and similar but not exactly the same PR (e.g. 5.1 and 5.2) have to be displayed in random order? Or did you mean that pages with a similar final score (summing up the contribution of all factors) should be ranked in random order?

By the way, I have found some hints that Google perhaps has introduced a random component for PR with the start of the dominic update. However, it's too early for a final decision. (But this would go into the direction you suggested.)

-> I think introducing such an algorithm would lead to creating a lot of very small content pages (maybe just one phrase) with an appropriate title. I wouldn't see this as an improvement for the user. (Instead of placing tons of popular keywords on one page, spammers would place tons of popular keywords on tons of pages.)

Kackle

3:28 pm on Jun 14, 2003 (gmt 0)

I think introducing such an algorithm would lead to creating a lot of very small content pages (maybe just one phrase) with an appropriate title. I wouldn't see this as an improvement for the user. (Instead of placing tons of popular keywords on one page, spammers would place tons of popular keywords on tons of pages.)

But only a small page dedicated to one keyword would keep the number of different queries that hit on a per-page basis to a low level on the site overall, thereby maximizing the SiteRank.

Okay, so you are trying to spam SiteRank. Your first problem is this: You want lots of hits on your list of keywords, but you want to maximize your SiteRank, so you spread them out to one keyword per page.

If you have to do that, in order to keep the customer interested when they view the page, it had better be conspicuously relevant to that single keyword. That means you have to make each page relevant to its own keyword. That means you are no longer a spammer, but a webmaster trying to create relevant content.

You're thinking that the site would resort to many spammy pages instead of a few spammy pages. I don't think it would happen that way. The reason is that any page that is focused on more than one keyword will tend to have a hit rate for a greater number of different queries. Look at the word "different" in that sentence, and think about it. The greater the hit rate for different queries on a per-page basis, the lower the site's overall SiteRank.

Basically, the more keywords per page, the lower your SiteRank. That means your pages, whether many or few, must be more focused in order to keep your SiteRank higher.

The clever thing about this algo, in my opinion, is that it figures out how many keywords are on a page by looking at the hit rate for _different_ queries. This is like picking the spammer's brain, and turning it against him. There's a natural balance point that will be achieved because as the SiteRank declines due to more keywords per page, the per-page hit rate will decrease, which will bring the SiteRank back up slightly, which will increase the hit rate, which brings it back down again. Each site seeks its own level.

Really cool. I like it.

SlowMove

3:31 pm on Jun 14, 2003 (gmt 0)

>>Guys, we've got to help Google develop the ultimate PR formula.

Google is a business. They'll always be able to manually edit the index.

doc_z

4:08 pm on Jun 14, 2003 (gmt 0)

If I understand the rating system correctly, I would do the following to spam it (if I were selling a large number of products named product1, ...): Apart from my regular site, I would create an additional one with a (dynamic, auto-generated) page for each product. The title would be 'product1' and there would be just one short (relevant) phrase on it (e.g. 'buy cheap product1.'). The phrase has a link to my main page.

Of course, this is just a simple example and one could modify the algorithm in such a way that this wouldn't work. However, I would change my tactic in the same way.

I see the problem which re5earcher is referring to ('tons of popular keywords on one page'). However, if you find an algorithm which will solve this problem, there is always a different (new) way to spam the system. And people are very clever at finding such ways. I think Google's PR algorithm is a good example. It was good when it was introduced, because people linked to other pages because of quality. Now people know the algorithm and they are manipulating the system.

(However, this doesn't mean that you cannot improve the current system.)

GoogleGuy

4:21 pm on Jun 14, 2003 (gmt 0)

So you're proposing a site penalty ~= k * (number of queries hitting the site / number of pages on the site).
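
As a rough sketch only, that penalty could be written out like this. The function name `site_penalty`, the default `k=1.0`, and the example numbers are assumptions for illustration, and "queries" is taken to mean distinct queries, as in the original proposal.

```python
def site_penalty(distinct_queries, indexed_pages, k=1.0):
    """Penalty ~= k * (distinct queries hitting the site / pages on the site)."""
    if indexed_pages <= 0:
        raise ValueError("a site needs at least one indexed page")
    return k * distinct_queries / indexed_pages


# Hypothetical numbers: the same 100 distinct queries, content on 5 pages vs. 100 pages.
print(site_penalty(100, 5))    # 20.0
print(site_penalty(100, 100))  # 1.0
```

With the query count held fixed, the only lever left to the site owner is the page count, which is why splitting the same content across more pages lowers the penalty.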

Let's see. There's a couple real-world problems. I don't think that adding this in scoring would impact how users search--this would be far enough removed that they'd have problems realizing when their search terms were too general. There's also the problem of defining the scope of a site (is a site mit.edu, lcs.mit.edu, or lcs.mit.edu/~somestudent, etc.). I can imagine people that share the same ISP/host shouting at each other: "Your pages are too popular! All those hits on just 1-2 pages! You're killing my Re5earcherRank!" :)

Leave that aside though. To minimize the penalty, you'd want to either minimize the number of queries that hit the site, or maximize the number of pages that are indexed. I do think that would lead toward a trend of spreading out content to a series of really focused pages. In the limit, it could take you to more doorway pages. Maybe not necessarily, I'm just sketching out the extremes. It would definitely discourage single pages with lots of content.

Hmm. Probably the biggest benefit would be for sites like namebase.org, where each page is an entry for a different person. Re5earcherRank would work so well for namebase.org because each "keyword phrase" (in this case a person's name) is orthogonal to most of the other keywords that someone could use to find other content on the site. But it wouldn't work as well for longer/more complex/less orthogonal queries that need multiple words to match on one page.

I'd also worry a little that people who didn't know anything about Re5earcherRank would be penalized by it. The natural inclination of a good site owner (produce lots of good, original content that surfers find useful) would work counter to them if they put the content on a small number of pages, but would work better if they put the same content on a larger number of pages. So Re5earcherRank would be a force pushing site architectures toward more pages per site, but not everyone would be aware of it.

It could also encourage people to promote throwaway domains. As soon as the site penalty started to creep up for one site, you'd discard it and begin promoting a different site. But you don't need me to play toolman, he can do that himself. This is fun stuff to think about though. I should drop by this forum more often; it's fun to play spammer for a change. :)

Kackle

5:16 pm on Jun 14, 2003 (gmt 0)

Very interesting discussion. The "number of queries hitting the site" should read "number of different queries that produce a hit for the site." I think that's what you meant anyway.

I suspect you'd need some link pop in the algo, but not nearly as much as PageRank. A very crude, minimalist link pop. None of this "extra juice on top of the social power law" stuff.

I don't see where something like lcs.mit.edu/~somestudent would necessarily stop this idea in its tracks. The TLD portion of the URL is easy to spot, since there are only a limited number of official TLDs, the tilde after that is easy to spot, etc. A set of published (not secret!) rules as to what constitutes a distinct site might be needed.
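
To make the idea of published rules concrete, here is one possible rule set sketched in Python. Everything in it is an assumption for illustration: the function name `site_key`, and the rule itself (a site is the host name, plus the first path segment when that segment starts with a tilde). It does not try to decide whether mit.edu and lcs.mit.edu are one site, and it treats every other path as part of the host's site.

```python
from urllib.parse import urlsplit


def site_key(url):
    """Hypothetical rule: host name, plus a leading ~user path segment if present."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    segments = [segment for segment in parts.path.split("/") if segment]
    if segments and segments[0].startswith("~"):
        # A tilde directory is treated as a separate personal site on a shared host.
        return f"{host}/{segments[0]}"
    return host


print(site_key("http://lcs.mit.edu/~somestudent/notes.html"))      # lcs.mit.edu/~somestudent
print(site_key("http://lcs.mit.edu/index.html"))                   # lcs.mit.edu
print(site_key("http://randomdomain.com/path1/path2/file.html"))   # randomdomain.com
```

The last call shows the weak spot: a rule this simple cannot tell whether /path1/ belongs to a different owner than the rest of the host.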

There'd be a ceiling on SiteRank, above which you cannot climb. Once you hit the ceiling, all your spammy domains that you put up to beat SiteRank would appear in random order with all the others that hit the ceiling for those keywords. Eternal attention to better spam would reach a point of diminishing returns. You can try to saturate the top ten SERPs by virtue of overwhelming the domain space with your spam, but so can everyone else for your keywords. A fool's game. They'll all go down together, taking their Viagra with them.

But it wouldn't work as well for longer/more complex/less orthogonal queries that need multiple words to match on one page.

Good point. Some special scoring for multi-word searches might be needed. My first inclination is to disregard them all for purposes of penalty scoring unless they are all proper nouns (i.e., not found in a "plain" dictionary). There are enough folks who do single-word searches that you'd get a decent sample from the unwashed masses, based on single-word searches alone.

SlowMove

7:22 pm on Jun 14, 2003 (gmt 0)

I'd also worry a little that people who didn't know anything about Re5earcherRank would be penalized by it. The natural inclination of a good site owner (produce lots of good, original content that surfers find useful) would work counter to them if they put the content on a small number of pages, but would work better if they put the same content on a larger number of pages. So Re5earcherRank would be a force pushing site architectures toward more pages per site, but not everyone would be aware of it.

People that don't know much about the current system could also be penalized without knowing what's happening. There are a lot of sites out there that have some good content, but also have a links page where they link to "bad neighborhoods". I don't think that there can ever be a perfect system.

GoogleGuy

6:22 am on Jun 15, 2003 (gmt 0)

Yup, I'm not saying that identifying sites is a killer, but I deliberately gave a simple example. For randomdomain.com/path1/path2/file.html, it could be tricky to draw the line on where a distinct site starts or stops.

re5earcher

10:43 am on Jun 15, 2003 (gmt 0)

Many thanks GoogleGuy for spending time and pointing out real-world examples, and everyone for comments.

So here is another new formula derived from those examples, as the previous formula proved to be wrong. What do you think?

The ideal spam-fighting ranking would:

- give top rank to pages which give unique hits for many users with different IPs
- give lower rank where many users are hitting the page with popular keywords.
- not be affected by one user running the queries intentionally
- not be affected by the number of hits on the page

Its real-world analogue calculated daily/weekly/monthly would look something like this:

One page rank = sum, over queries with distinct IP network [x.x.x.0], of
(K ^ (total number of distinct IP networks this query came from) / (total number of pages hit by the query) / K)

where K = 2

So the same query hitting the page from different IP networks reflects how focused the page is and gives the ranking exponential growth - someone could simulate this to raise the ranking, but the ability to do so is limited. The K constant needs to be tuned to weight this exponential component.

The weight of the query is also multiplied by its inverse popularity - to discourage the use of popular keywords on a page.

Let's see how this copes with extreme scenarios:

a. Suppose someone wants to push his own site up in rating - he would want to run loads of queries unique for his page:

1 user
1 page hit (different queries with unique word or combination of words on page)
ran any number of times a day

The page will stay at rank 1, which can be set as the bottom rank, so there will be no point in doing that.

b. A large page: 10,000 users daily find something popular on it, with each query getting 100,000 total hits.

10,000 users
all run different queries
100,000 average hits for each query

The page would have a rating of 0.1

c. 10 users daily find something unique on the page

10 users
same query
1 total hit

The daily rank would be 512

d. It's not possible for someone to lower someone else's rating.

e. If no one hits the page, the rank is 0.
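
The scenarios can be checked with a short Python sketch. It reads the formula as a sum, over each distinct query, of K ^ (distinct x.x.x.0 networks) / (pages hit by the query) / K. The names `network_of` and `page_rank`, the shape of the query log, and the sample IP addresses are all assumptions made for the example. Scenario a is modeled as a single query repeated from one network; as written, the sum would grow if one user ran several different queries.

```python
import ipaddress
from collections import defaultdict


def network_of(ip):
    """Collapse an IPv4 address to its x.x.x.0 network address."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False).network_address)


def page_rank(query_log, pages_hit, k=2):
    """query_log: (query, ip) pairs that hit the page.
    pages_hit: query -> total number of pages that query hits."""
    networks = defaultdict(set)
    for query, ip in query_log:
        networks[query].add(network_of(ip))
    return sum(k ** len(nets) / pages_hit[query] / k
               for query, nets in networks.items())


# a. one user repeating one unique query fifty times
print(page_rank([("unique phrase", "10.0.0.5")] * 50, {"unique phrase": 1}))  # 1.0

# c. ten users on ten different networks, same query, one page hit in total
log_c = [("rare term", f"10.0.{i}.1") for i in range(10)]
print(page_rank(log_c, {"rare term": 1}))  # 512.0

# b. 10,000 users, each with a different query that hits 100,000 pages
log_b = [(f"q{i}", f"10.{i // 256}.{i % 256}.1") for i in range(10000)]
print(round(page_rank(log_b, {f"q{i}": 100000 for i in range(10000)}), 6))  # 0.1

# e. nobody hits the page
print(page_rank([], {}))  # 0
```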

[edited by: re5earcher at 1:17 pm (utc) on June 15, 2003]

bird

11:58 am on Jun 15, 2003 (gmt 0)

Just to turn this into an example that my brain can handle:
There's a page A on one domain that ranks well for one specific keyphrase.
There's a page B on another domain that does the same, but it *also* ranks well for another keyphrase, which is probably somewhat related to the first one.

Your theory appears to be that page A (narrowly focused) is more valuable to any searcher, so that page B (with a slightly broader focus) should automatically get penalized in comparison.

I'm not sure if I can follow your logic there. What would the benefit of such a mechanism be?

abcdef

3:08 pm on Jun 15, 2003 (gmt 0)

I like the concept contemplated in the post, if not the method proposed by the poster.

The poster implies that there is room for creativity and improvement in the way Google determines rank. Indeed that is what current changes in Google have been all about. And, ranking technology is evolutionary and more competition should spur more innovation in this area.

For instance: As time goes by, the Google Toolbar will become more and more prominent in the ranking of websites. A lot depends on the population of users that have the Toolbar installed and use its advanced features, which authorize Google to collect information on their surfing activity.

In conjunction, there might indeed come a day when Google asks webmasters to do likewise and load invisible code on their homepage that would allow Google to measure web site popularity more accurately.

Inbound links are a crude way of measuring web site popularity. They mimic actual web site popularity, and are subject to manipulation, as Google knows all too well now. However, given that this is an evolutionary process that has a ways to go, you can't fault them for doing the best they can in the meantime while their rocket scientists in the back room work on the future...

vitaplease

11:04 am on Jun 16, 2003 (gmt 0)

re5earcher,

nice start to becoming a member here!

- give lower rank where many users are hitting the page with popular keywords.

Let's say your latest posting is brilliant.
Webmasters of every type link to your post using many different keywords describing all your gems.
You are mentioning a lot of popular keywords in your latest posting.

Does your posting deserve a lower ranking?

The second part of your latest posting is infringing on this Google patent. ;) So it must have good gems in it.

"Methods and apparatus for employing usage statistics in document retrieval" [appft1.uspto.gov]

[0036] The frequency of visit score equals log2(1+log(VF)/log(MAXVF)). VF is the number of times that the document was visited (or accessed) in one month, and MAXVF is set to 2000. A small value is used when VF is unknown. If the unique user is less than 10, it equals 0.5*UU/10; otherwise, it equals 0.5*(1+UU/MAXUU). UU is the number of unique hosts/IPs that access the document in one month, and MAXUU is set to 400. A small value is used when UU is unknown. The path length score equals log(K-PL)/log(K). PL is the number of `/` characters in the document's path, and K is set to 20.
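
For anyone who wants to play with the numbers in that paragraph, here is a direct transcription into Python. It is a sketch under assumptions, not the patent's implementation: the function names are made up, the frequency-of-visit score is taken to be log2(1 + log(VF)/log(MAXVF)), VF is assumed to be at least 1, PL is assumed to be below K, and the "small value when unknown" cases are left out because the quoted text does not give the value.

```python
import math

MAXVF = 2000  # visit-count normaliser from the quoted paragraph
MAXUU = 400   # unique-user normaliser
K = 20        # path-length constant


def frequency_of_visit_score(vf):
    """log2(1 + log(VF) / log(MAXVF)); assumes vf >= 1."""
    return math.log2(1 + math.log(vf) / math.log(MAXVF))


def unique_user_score(uu):
    """0.5*UU/10 below ten unique users, otherwise 0.5*(1 + UU/MAXUU)."""
    if uu < 10:
        return 0.5 * uu / 10
    return 0.5 * (1 + uu / MAXUU)


def path_length_score(path):
    """log(K - PL) / log(K), where PL counts '/' characters; assumes PL < K."""
    pl = path.count("/")
    return math.log(K - pl) / math.log(K)


print(frequency_of_visit_score(2000))  # 1.0
print(unique_user_score(5))            # 0.25
print(unique_user_score(400))          # 1.0
print(path_length_score("/a/b/page.html"))
```

Note that every input here is a visit or access count, while the formula discussed earlier in the thread counts appearances in query results.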

re5earcher

2:26 pm on Jun 16, 2003 (gmt 0)

vitaplease,
in fact the only popular keyword is "page rank", which only gives 47,000 hits on Google, so it can still raise the rank. But it will get the highest points if someone ever searches for "inverse popularity", because that phrase is almost unique.

It's not infringing the Google patent in any way - what is counted is not the _number of times the page was visited_ but the _number of times the page has been counted in query results_ - hit by the query.

You know the obvious effect and suffer from it while searching: if you measure popularity by number of visits and then push the popular sites to the top of the SE results, the natural balance is broken - most searchers don't get to a third page, so the rest of the sites will form a sediment on the SERP's floor, and searchers will hopelessly click >next endlessly. Something _automatic_ needs to be brought forward to maintain the natural balance and fight spam - a kind of _open to public_ algorithm which cannot be misused, in addition to the currently used ones.

vitaplease

3:43 pm on Jun 16, 2003 (gmt 0)

re5earcher,

An interesting discussion: "The PR algorithm you cannot cheat"

I think the basic Google algorithm: Weighted and more or less motivated votes that are balanced with on-page content (and Inktomi and FAST are more and more using something similar IMO) - is great but needs other refinement than:

- give lower rank where many users are hitting the page with popular keywords

A webpage could be a five page pdf document with excellent content on several sub-subjects. If others feel that page deserves many different motivated links (e.g. anchortext) using popular keywords - so be it - that's how they motivated it.
An algo should ideally not interfere with the natural way people publish content.

>>>most searchers don't get to a third page, so the rest of the sites will form a sediment on the SERP's floor

Search engines are discriminatory by definition. What about the fourth and the tenth page?

But one part of the solution is the searcher itself. They should take responsibility and use more keywords/advanced/localised features.

Another part of the solution could be the Search Engine offering "search query expansions", as Vivisimo/Altavista Prisma/Teoma do.

Search for "web hosting": and three-word (less popular) query examples are suggested.

In a way search query expansion is educational for the basic searcher.

I remember Matt Cutts from Google saying at Pubcon Boston that Google trialed this, but to no avail. It was not used enough. I still think they should offer it as an opt-in and offer search training to primary/secondary schools.

In general I doubt less than 1% of search results would be considered spammy by the general public with Google at the moment. Maybe - adversely - we need more spam for searchers to refine their searches more ;)

An algo you cannot cheat is impossible IMO.
Any form of ranking is dependent on social behaviour (voting/linking/quoting/editing/visiting)
With the economic impact Google has, social behaviour will be bought.

coolasafanman

2:57 pm on Jun 23, 2003 (gmt 0)

Every algorithm has a set of rules that defines it. SEOs reverse-engineer algorithms to find 'cheats.' Therefore, an SEO'd site would rank higher than a non-optimized one.

With this in mind, it seems that those with a lot of money or resources would rise to the top for any search term that has e-commerce potential.

To solve this problem, it seems logical to me to make the rules known universally. Regardless of the algorithm, if everyone knows how to abide by its set of rules, then no one has an unfair advantage. Granted, anyone that is serious about their online business can spend a few hundred hours surfing the web, hanging out in forums and learning all this SEO stuff.

Making <H1> headers and the like weigh more has a very logical purpose - it's a system for organizing a page. Making that more widely known to the public would shift the advantage out of an SEO's hand and allow people to let the world know what their sites are about.

I'd like to see <META> brought back into the mix. Cut it off at 5 phrases per page, and weight page title higher as well.

The question then becomes, what do we do with all these sites that now have identical themes? Should a company that sells 10000 widgets rank higher than one that sells 100? Should the company that sells 100 blue widgets rank higher for blue widgets than the one selling 10,000 blue, red and green?

How about some randomization in the results?