Forum Moderators: Robert Charlton & goodroi
"TrustRank" was filed with the USPTO about a month ago. Interestingly, members of the Stanford Database Group have written a paper about the use of "TrustRank" to combat web spam that we blogged about in early March. Makes you wonder if the implementation of TrustRank™ will be something coming soon from the GooglePlex. Stay tuned.
[blog.searchenginewatch.com...]
[webmasterworld.com...]
I believe that the significant thing is that TrustRank recognises that manual intervention is necessary. In their case this is to determine the seed sites, but it has been apparent for the last three or four years that manual intervention will eventually be required to clean up the results. It's now a question of when, not if.
One thing I am not sure about is ...
While human experts can identify spam, it is too expensive to manually evaluate a large number of pages.
I have always believed that a small team from each of the main SEs working full time on this could make a huge impact. If an instant lifetime ban was imposed on sites that were in blatant contravention of their guidelines the spammers would soon realise that this was too risky to be a good business model.
If an instant lifetime ban was imposed on sites that were in blatant contravention of their guidelines the spammers would soon realise that this was too risky to be a good business model
Unfortunately there's no risk in getting banned because they don't lose anything. What is needed is a way of stopping them making money in the first place - that would make the business model more risky. The sandbox effect has gone some way to do this, but it's not like spammers are gonna just turn round and say "gosh darnit" and walk away! ;) They learn and adapt too! :)
I notice one of the authors of the paper is a Yahoo employee, and doesn't this mesh nicely with Yahoo's patent document referring to "concept networks"?
Scott
Unfortunately there's no risk in getting banned because they don't lose anything.
I think you may be missing my point. What the spammers "lose" is not the issue here. Their aim in life is to appear at or near the top of the SERPs. This makes them VERY easy to find and deal with (for humans.)
The more clever they are the easier they are to find.
The Google job openings could also have been related to AdWords/AdSense....
Google was advertising for search and AdSense quality evaluators a while back on Google.com. (Separate jobs, separate postings.)
If they want to combat spam, I hope they will not repeat an old error and integrate a TR bar in future generation toolbars.
Domains are cheap and hosting is cheap. I expect spam and spammers to disappear about the same time viruses and virus writers disappear.
That's why new sites may rank worse than old sites. That's the reason for the sandbox. They prefer to lose some new sites with good content rather than have more throw-away domains in the results.
I expect that TrustRank will also depend on the age of a site, and new sites will have to wait several months before getting it.
It's not nice. I have some sites I launched recently and they have to wait before their time comes. But I launched them some months early to give them more time, and they will get their final design when they are about six months old; before that, I don't need them to rank well. That's the price of fighting throw-away domains and other forms of spam.
Is there an alternative? Perhaps a law requiring registration of all domain owners and allowing them to be sued for spamming? But do we want such a law?
In days of Yore when AltaVista was the ne plus ultra and Google was but an excuse for a couple of postgrads to drink coffee rather than finish an overdue PhD write-up, I used to expect to have to wait at least 6 months to reliably see any new page I put up appear in a search engine and used to plan ahead accordingly.
Thus it seems as if a sandbox scheme simply rewards the patient and cools the heels of the heels and the feckless.
"Time wounds all heels" - Jane Sherwood Ace (1905-1974)
If so, good.
Rgds
Damon
In days of Yore when AltaVista was the ne plus ultra and Google was but an excuse for a couple of postgrads to drink coffee rather than finish an overdue PhD write-up, I used to expect to have to wait at least 6 months
Exactly! I was doing my first SEO work in those times, and it always seemed natural to me that a major SE doesn't rank anything too fast. I remember when AltaVista Basic Submit was delayed (the crawl happened a few months after a submit) to make people pay for Express Inclusion, and I usually submitted sites in advance to make up for this delay.
But I didn't expect sites to rank at the top of a major SE in the first months. That was possible only in minor, local SEs (which haven't changed since those times, and I always have #1 in them because their algo is archaic, but almost no one uses them ;).
Today, you can add a site to Google almost instantly. I put a link on one of my high-PR pages and get a new site indexed in days. But still, it seems logical to me that a site that is only a few months old is unlikely to be a serious source of information. What percentage of sites survive after a year? Google gives these sites a chance, because it indexes them after all. But if such a site builds a great number of inbounds in 3 months, it's logical for G to apply a sandbox.
I wonder if a kind of TrustRank is already in use. If a new site acquires many inbounds, including one link from DMOZ and several links from top authority sites in a certain subject, will it be sandboxed the same way as a site that acquired many inbounds from off-topic non-authority sites?
I wouldn't be surprised if Google already assigned a kind of TrustRank to DMOZ, and I hope they won't assign too much of it to A... let's say, to a certain established commercial site ;))
I'd be interested to know if this could be used as prior art to bust the patent.
Advogato's ranking measures how highly esteemed one is as a free software programmer, but it is meant to serve as a testbed for algorithms that someday might be used to fight spam.
Standard site navigation exists on every page I have ever seen, basically. Every page links to SOMETHING. A link back home, a link to the other sections or whatever. (Fine -- I haven't exhaustively studied this, but my point is clear, I hope?)
However, just before discussing the unreferenced and non-referencing pages, the paper says "we also remove self hyperlinks." Does that mean links to other pages within the same domain? They state clearly that they are not analyzing sites, but individual pages. So the assumption I had made was that "self hyperlinks" are the internal anchor tags that move you around a single page.
So... Does this paper assume the ability to detect and discount 'standard' site navigation? Is it by domain, or is there some other process at work (like MSN's discussion of page zones)?
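To make the two readings of "self hyperlinks" concrete, here is a minimal sketch, not taken from the paper. It assumes a hypothetical helper, classify_link, that labels a link as "self" (same page, e.g. an in-page anchor), "same-host" (ordinary site navigation), or "external". Which of these the authors actually drop is exactly the open question above.

```python
from urllib.parse import urldefrag, urljoin, urlsplit


def classify_link(page_url, href):
    """Label a link found on page_url (hypothetical helper for illustration)."""
    # Resolve relative links such as "#top" or "/index.html" against the page.
    target = urljoin(page_url, href)

    # Reading 1: a "self hyperlink" points back at the very same page,
    # ignoring any #fragment.
    if urldefrag(target).url == urldefrag(page_url).url:
        return "self"

    # Reading 2: a link to another page on the same host, which covers
    # standard site navigation.
    if urlsplit(target).hostname == urlsplit(page_url).hostname:
        return "same-host"

    return "external"


page = "http://example.com/a.html"
print(classify_link(page, "#top"))
print(classify_link(page, "/index.html"))
print(classify_link(page, "http://other.example.org/"))
```

Dropping only the "self" links leaves site navigation in the graph; dropping "same-host" links as well would discount it by host, which is cruder than anything based on page zones.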
Anybody else thinking about this?
I guess the TrustRank, as it spreads out across sites touching the "seed" sites will have the effect of assigning partial 'spam' values to pages.
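Here is a rough sketch of that spreading, assuming trust is propagated like a PageRank that restarts only at the hand-picked seed pages. The function name, the damping value of 0.85, the iteration count, and the toy graph are all assumptions for illustration, not details confirmed in this thread.

```python
def propagate_trust(links, good_seeds, damping=0.85, iterations=20):
    """links maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # Trust mass starts out spread evenly over the manually chosen good seeds.
    seed = {p: (1.0 / len(good_seeds) if p in good_seeds else 0.0) for p in pages}
    trust = dict(seed)

    for _ in range(iterations):
        # Each round, a (1 - damping) share of trust is re-injected at the seeds.
        nxt = {p: (1.0 - damping) * seed[p] for p in pages}
        for src, targets in links.items():
            if not targets:
                continue
            # The rest of a page's trust is split evenly among its outlinks.
            share = damping * trust[src] / len(targets)
            for dst in targets:
                nxt[dst] += share
        trust = nxt
    return trust


# Toy graph: a seed links to a, a links to b; two spam pages link only to each other.
graph = {"seed": ["a"], "a": ["b"], "b": [], "spam": ["spam2"], "spam2": ["spam"]}
scores = propagate_trust(graph, good_seeds={"seed"})
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 4))
```

In this toy graph the score shrinks with each hop away from the seed, and pages with no path from any seed stay at zero. So the output is graded rather than black-or-white, even though the seed judgment itself is binary.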
Still, in the initial seed set, I think a black-or-white call is crude. This is a subjective thing, unless the rules are clear enough that we ought to be able to program a computer to make the determination itself anyway. Barring that level of specificity in the determinations the human "experts" make, a binary decision is not as accurate as a finer instrument would be. (A 1-10 or 1-100 or something.)
Even still, we should be asking, "What are the experts looking for?" Are they reading the content? Are they analyzing subdomains, or looking at the use of alt tags? Surely they're not just tuning into astral vibrations and making these decisions.
If news organization A posts a story about fraudulent, spammy news organization B, including a link, site B gets a boost in their TrustRank.
As more people start talking about how bad site B is, site B keeps getting more trustworthy.
This has always been a potential problem with PageRank -- a link is not necessarily a vote FOR the target site -- but consolidating influence among fewer sites increases the potential for mistakes like this.
Stupid suggestion to alleviate this: apply the "nofollow" attribute in cases like this as well, and teach everyone to use it when they don't want to help the sites they're linking to.
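A minimal sketch of what honouring that would look like on the crawler side, assuming a hypothetical followed_links helper that feeds the link graph. Whether any search engine treats nofollow this way for trust scoring is an assumption here, not something stated in this thread.

```python
from html.parser import HTMLParser


class FollowedLinks(HTMLParser):
    """Collect hrefs from <a> tags, separating out rel="nofollow" links."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.skipped = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # rel is a space-separated token list, e.g. rel="external nofollow".
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.skipped.append(href)   # no trust should flow through this link
        else:
            self.followed.append(href)


def followed_links(html):
    """Return only the links that should count as votes (hypothetical helper)."""
    parser = FollowedLinks()
    parser.feed(html)
    parser.close()
    return parser.followed


story = (
    '<p>We investigated <a href="http://site-b.example/" rel="nofollow">site B</a>. '
    'See the <a href="http://regulator.example/report">official report</a>.</p>'
)
print(followed_links(story))
```

With this filter, news organization A can still link to site B for its readers without the link counting as an endorsement.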
This sounds good on paper, but execution is another thing. There are so many rules that need to be created that the original idea becomes blurred. It is very easy to game once the rules are known.
I've gotten blasted for this before, but probably the best way to get quality results in a search engine is creating a directory. If I were Google, this is what I would do...
Start with the DMOZ directory, since a degree of quality assurance has already been applied to it. This is your baseline of TrustRank. Just by being in the directory you get points, let's say TrustRank 3. Certain categories qualify for up to a TrustRank 10 rating if they gain points via a qualitative test described below. These are the "standard" websites such as Yahoo, MSN, WSJ and the like... They do not have to pay for this evaluation.
Websites in lesser categories can, for a fee, have their sites manually evaluated against publicly known quality standards. This allows webmasters to conform to the quality rules if they choose (no black box). This includes things like:
- Content
- Site organization
- Size
- Load times..
Whatever is attractive to people when searching. Each quarter a nominal fee is paid (one that just covers the cost to Google of performing the evaluation) in order to qualify for additional points, up to TrustRank 10.
My suggested fee is $25 per quarter for the evaluation for added TrustRank points. The evaluation is performed quarterly on an unknown date so that gaming is less possible.
As is the case with PageRank today, this measure of TrustRank is just one consideration when returning results.
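The arithmetic of this proposal can be sketched in a few lines. The baseline of 3 and the cap of 10 come from the post. The function name, the evaluation_points argument, and the score of 0 for unlisted sites are assumptions added for illustration.

```python
BASELINE = 3    # "let's say TrustRank 3" for being listed in the directory
MAX_SCORE = 10  # the cap named in the proposal


def directory_trust(in_directory, evaluation_points=0):
    """Score a site under the directory proposal (hypothetical helper)."""
    if evaluation_points < 0:
        raise ValueError("evaluation_points must not be negative")
    if not in_directory:
        # Assumption: the post does not say what unlisted sites receive.
        return 0
    # Points from the quarterly manual evaluation are added to the baseline,
    # and the total is capped.
    return min(MAX_SCORE, BASELINE + evaluation_points)


print(directory_trust(True))       # listed, never evaluated
print(directory_trust(True, 4))    # listed, evaluation awarded 4 points
print(directory_trust(True, 20))   # capped at the maximum
```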
My 2 cents
Each quarter a nominal fee is paid (one that just covers the cost to Google of performing the evaluation) in order to qualify for additional points, up to TrustRank 10.
PFI may be practical for commercial sites, but what about the vast numbers of .edu, .org, .gov, open-source, and labor-of-love sites that wouldn't shell out for (and probably wouldn't be aware of) a fee-based QC program?