| This 68 message thread spans 3 pages |
|Signals of Quality - What are they?|
Here's my list
Many of us - perhaps most of us - have been hit by the recent spate of updates from Google. I have a site of my own that was completely de-indexed and thankfully was re-included after only a week. During that week I did some real soul-searching.
My 3 employees and I are totally dependent on income from my web sites. Life has been pretty good as a web site owner, and I want to build on what we've already established. We want to be in this business long-term.
During this week of exile I began to look at my web sites differently. If my banned site was given that golden 30 second review by a real human at the 'plex, what would they see? Would they shrug their shoulders and say the Internet would be better off without my web site? Would they think that my web site looked exactly like 100 others that they had reviewed previously that day? Or would they see an active, vibrant web site that at least appeared to have a real purpose to serve in cyber space?
Then I started looking at other sites differently. Would CNN.com ever get banned accidentally? How about Adobe.com or WellsFargo.com? Why wouldn't they? They are web sites just like my web sites, except they have a LOT more visitors, and some sort of a brick-and-mortar presence, but the G-bot doesn't know that. I want my sites to be as stable in the SERP's as those sites!
As I started looking for the earmarks of these obviously high-quality sites, I started to notice that the "Signals of Quality" that have been alluded to in many of the discussions here at WW kept coming up. How long are their domain names registered for? How many outbound links do they have? How often are their pages updated? How fast do the pages load? Do the pages validate? These are the kind of things that separate Adobe.com from my sites. As unfair as we may think these signals of quality are, I don't think there are any webmasters at CNN worrying about whether their web site is going to be banned overnight. I don't want to have to worry about it either.
I immediately started to compile a list of the "signals of quality" that we've read about in the Google patents, the ones that we've heard about through our discussions with the Google engineers and the ones that we know as a matter of common sense. I would like to share my list in the hopes that others will share theirs too. I realize that some of these signals are controversial, but I don't want to get off on any tangents. Some of these can't be proven, but through anecdotal evidence at least it seems that they might have some bearing. Please share your lists.
Domain name registered for more than 1 year, preferably 10.
Fast loading pages.
Dedicated IP address.
Hosted by a "trusted host".
Low link "churn" (links not changing too fast on a page).
Correctly formatted and validated web pages.
Web site regularly growing in size (not by large spurts).
Backlinks regularly growing in size (not by large spurts).
Real activity visible from the home page.
Home page not overtaken by advertising, AdSense or otherwise, particularly above the fold.
Session ID's in URL not required for viewing web site.
Valid use of Robots.txt file.
Low or no duplicate content.
Low number of affiliate links.
No site-wide external linking.
CNN.com, Adobe.com, and WellsFargo.com would show these signals of quality. What else should be on this list?
Biggest problem I have with your list is it unfairly gives commercial sites an advantage over non-commercial sites. Such as domain name registered for 10 years, static IP, "trusted host", etc. And why would a website about the Peloponnesian war need real activity visible from the home page?
Rethink your list to weed out criteria that would cause excessive collateral damage.
Excellent list, thanks.
What is 'site-wide external linking' and why is it bad?
I don't think that activity visible from the homepage is a quality sign. The quality of sites is determined by the quality of the inner pages, not the homepage.
Therefore, for me a high percentage of links from other sites to internal content pages would be a quality sign. Such a link is a direct vote for a specific content page rather than links to the homepage or a links page, which are often acquired via link exchange requests.
rfgdxm1, Google can easily tell if a site is commercial or not, so they can apply the list appropriately. The 10 year registration and the trusted host criteria aren't just a guess, they are part of actual Google patents.
|What is 'site-wide external linking' and why is it bad? |
Links to site A from every page of site B. These "run of site" links are typically a form of advertising (some would say: of trying to manipulate the SE), and not a "natural" way of linking.
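For anyone who wants to check their own site for this, here is a rough sketch of how "run of site" links could be spotted from a crawl. It assumes you already have a mapping of page URL to the outbound links found on that page; the function name, the `threshold` parameter, and the example domains are made up for illustration, and nothing here is claimed to be how any search engine does it.

```python
from urllib.parse import urlparse

def sitewide_external_domains(pages, own_domain, threshold=1.0):
    """Return external domains linked from at least `threshold` of the pages.

    `pages` maps each page URL to the list of outbound link URLs found on it.
    """
    if not pages:
        return set()
    counts = {}
    for links in pages.values():
        # Count each external domain at most once per page.
        domains = {urlparse(link).netloc.lower() for link in links}
        domains.discard("")                   # relative links have no host part
        domains.discard(own_domain.lower())   # internal links do not count
        for domain in domains:
            counts[domain] = counts.get(domain, 0) + 1
    needed = threshold * len(pages)
    return {domain for domain, n in counts.items() if n >= needed}
```

With `threshold=1.0` only domains linked from every page are reported; lowering it (say to 0.9) would also catch links that appear on nearly every page.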
As for the 10-year registration, I have some doubts... Did anyone register domains for such a long period before the information in that Google patent became available?
I'd say the "normal" way of doing things would be to register a domain for one or two years, which is what most registrars offer as the default.
Here are a few elements to think about for commercial sites that almost all reputable sites have as well.
1. Easily accessible phone number.
2. An address to locate the entity.
4. A copyright date.
5. Customer support.
6. A trademarked logo.
7. Feedback channel.
8. Detailed Help Section.
9. Incs. dominate LLCs and Sole Proprietorships
10. Complex ad delivery systems that have sold ads, remnant space, and contextual ads
11. Partnerships with other major sites/brands (ie CNN and yahoo).
12. Posted Terms and Conditions
The other side note about major sites is that they seem to be constantly trying to improve. Take MSN for example, their homepage has undergone major transformations in the last 12 months. It's much faster, cleaner, easier to use and has more functionality. The big guys have resources to figure this stuff out and continually churn until they get it right. Several big brands have made horrible v1 sites, but have plugged along and continued to invest to the point where they are becoming successful.
The small webmaster has a big challenge, in that their pockets aren't necessarily that deep, so they live hand to mouth so to speak, which prevents them from or hinders them from actually doing all the hard work to make a site great. Most small sites do a quick implementation, cut costs and features until it is inexpensive enough to build and simple enough to maintain and start slapping up content. Some add new products, rss feeds, new info, trade links, and produce original content but very few create killer features or churn out several major releases per year. This is a major disadvantage.
As a small webmaster, I get nervous about all the community sites that cover niches where most of us make our bread and butter. With all the large sites investing in community type websites and getting better at SEO architecture, I think they pose a strong risk to the small webmaster...sort of like blogs, but worse since many are feature rich, and content is easily publishable.....Have to cut this short to go to Costco....Here is to getting big and rich, so when you have the resources, you can do it the right way:)
|...not a "natural" way of linking. |
Site-wide external linking is done by many free CMS software including but not limited to Mambo and Wordpress. IMHO those sites that leave these site-wide links there give more sign of quality than those that try to hide that they are running on a free software package.
Without the site-wide links back to the software producer a bot can still fairly easily determine which software was used to construct a site by comments in the header, names of classes etc. Deleting this type of site-wide link could be seen as trying to prevent PR leakage, a sort of SE manipulation.
|weed out criteria that would cause excessive collateral damage |
I think most of these issues are things that could add a PLUS to the site in an algorithm, not issues where their absence would cause a penalty. Most of them really are signs of quality - something that a serious business would do that a "disposable domain" often wouldn't.
Along these lines, having an MX record associated with a domain is a big deal, in my opinion. A quality domain usually has an associated email address.
I like the MX record idea. And G can easily do what some SPAM-filtering software does (including mine; my machine gets up to 40,000 SPAMs per day so I rolled my own filter in front of my server) which is to check that the MX data is real, ie corresponds to real machines listening on port 25 with an apparently compliant SMTP engine, etc, etc.
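A minimal sketch of the second half of that check, for the curious: given a host that an MX record points at, connect to port 25 and see whether the first line looks like an SMTP greeting (reply code 220). Looking up the MX records themselves needs a DNS library or an external tool, so that part is left out; `fetch_banner` and `looks_like_smtp_greeting` are hypothetical names, and this is only an illustration, not the filter described above.

```python
import re
import socket

# An SMTP greeting starts with reply code 220, then a space (or a hyphen for
# multi-line greetings), then the server's host name.
GREETING = re.compile(r"^220[ -]\S+")

def looks_like_smtp_greeting(banner):
    """Return True if the first line of `banner` looks like an SMTP 220 greeting."""
    first_line = banner.splitlines()[0] if banner else ""
    return bool(GREETING.match(first_line))

def fetch_banner(host, port=25, timeout=5.0):
    """Connect to `host` and return whatever it sends first (needs network access)."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        return conn.recv(512).decode("ascii", errors="replace")
```

In practice you would call `looks_like_smtp_greeting(fetch_banner(mx_host))` for each MX host and treat timeouts or refused connections as "no mail service".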
>>I think most of these issues are things that could add a PLUS to the site in an algorithm, not issues where their absence would cause a penalty. Most of them really are signs of quality - something that a serious business would do that a "disposable domain" often wouldn't.<<
Tedster..I agree there...and things like 10 year payment up front shouldn't be considered. We pay once a year along with our yearly hosting fees.
>>Along these lines, having an MX record associated with a domain is a big deal, in my opinion. A quality domain usually has an associated email address. <<
We have an email addy but hide it in a jpeg. Cut way down on spam but can't be read by bots...could be causing me harm?
>rfgdxm1, Google can easily tell if a site is commercial or not, so they can apply the list appropriately. The 10 year registration and the trusted host criteria aren't just a guess, they are part of actual Google patents.
They can, reliably? I'm not so sure this is so when it comes to amateur sites with click-through ads. (Unless you want to argue any site with ads is purely commercial.)
|We have an email addy but hide it in a jpeg. Cut way down on spam but can't be read by bots...could be causing me harm? |
Not on this issue, no. What's on the web pages doesn't matter, and I'm not talking about verifying any particular email address - the domain server can be pinged for the MX records in any case, just to learn if email service is configured.
I know if I were looking for "signs of quality" this would be one small item on my checklist.
The signs-of-quality mindset is a good one to cultivate, as opposed to "what are the penalties and how can I avoid them." Looking for signs of quality (as well as relevance) is where a search engine starts, IMO, and the penalties and what not are just one means to that end.
>I know if I were looking for "signs of quality" this would be one small item on my checklist.
I'd agree. Even amateur home page type sites commonly have a working e-mail.
|Looking for signs of quality (as well as relevance) is where a search engine starts |
This is a very good thread; the concept of "signs of quality" is such a powerful way to take an objective look at your web site; correct whois information? Is that a sign of quality? Sure would be for me. Linking out to good sites as opposed to, “If I get a visitor to my site, I don’t want to provide any means for him to leave” makes real sense here. Proper use of text links? Oh yeah, there’s a big difference between informative text links that take you to a better page on the site as opposed to 200 text links stuffed into the bottom of your home page.
The really interesting thing about these facets is they are very critical to Google, but we haven’t seen much evidence that “signals of quality” are being heard by other search engines.
|No site-wide external linking. |
Darn, there goes WW thanks to the BestBBS link ;)
Hardly any of those things in the first post are signals of quality, with the exception of trusted host and duplicate content. Some are red flags for non-quality, while others like robots.txt seem neutral but if anything crap peddlers are more likely to have a valid robots.txt file than quality sites.
Most signals of quality of course come from off-domain. Vouching for yourself seldom is a compelling argument about quality.
>Some are red flags for non-quality, while others like robots.txt seem neutral but if anything crap peddlers are more likely to have a valid robots.txt file than quality sites.
That robots.txt on the list is totally silly. Why have a robots.txt file if you don't want to keep SE bots out of any part of your site?
I guess I should give some examples I'd view as signals of quality:
links from authoritative niche resources
link from authoritative generic resource
links to inner pages from authoritative niche resources
website sections/linking that shows LSI-content... a domain about apples has a red delicious and pippin sections, ideally with linking from other red delicious and pippin pages that are well-regarded
linking to authoritative, on-topic resources
sound structure where multiple pages are emphasized (that is, this is not a only-the-main-page-counts website where all the rest are basically irrelevant)
subpages achieving ranking for multiple LSI-type searches (rather than the main page being the only page to rank)
titles and H1/H2 relating to link text
part of a parent domain (not a subdomain)
lack of third party generic advertising
domain has been around awhile, on the same topic, with significant link and text churn in that time
[edited by: steveb at 1:39 am (utc) on Aug. 17, 2005]
As an SE, one signal I'd be looking for is a LACK of "I'll link to yours if you link to mine" pages. :)
|Hardly any of those things in the first post are signals of quality, with the exception of trusted host and duplicate content. |
|That robots.txt on the list is totally silly. |
When representatives of Google have mentioned "signals of quality" (GG and one of the Google Engineers in New Orleans) they didn't indicate that this had anything to do with ranking. Traditionally SEO's have only thought about penalties and how to get higher in the SERP's. I believe "Signals of Quality" is broader than that, and if you're in it for the long run, it needs to be paid attention to.
The robots.txt thing seems silly to me too, but Google lists it as something that every web site should have, so I'll play along.
Again, back to CNN, Adobe, and Wells Fargo, if we're going to be in business as long as they are planning to be, we need to have sites that mimic the same qualities as theirs. I'm sure they haven't thought twice about registering their domain name for 10 years and I doubt that they are worried about backlinks.
To add to the list, Google's most recent patent refers to people getting to a web site through bookmarks. How do you think they determine that?
Just because Google recommends a robots.txt doesn't make it a signal of quality. Signals of quality are signs that a page or site is superior in nature. This means that things which are just solid webmastering, things a junk peddler with ten thousand near copies of the same thing would also do, do not necessarily signal quality content.
Another way to put it is that signals of quality are not plainly easy to have, or spoof. Anybody can upload a generic robots.txt file anytime they upload a site. It means nothing. No difficulty is involved. A link from an authoritative resource has some difficulty in getting, and even has some difficulty in spoofing.
On the subject of robots.txt, I know I read somewhere sometime back that a google employee was asked why googlebot kept spidering certain of his pages even though his robots.txt forbade this. And the google employee said that if googlebot comes to your site via a link from another site googlebot wouldn't necessarily ask for the robots.txt - so why would robots.txt be a concern of google concerning quality?
On another subject-didn't google guy say that you should have at least two outgoing links on each page? Or was that Bret?..lol
steveb, that's an excellent list. And very concise. And it works, by the way. If anyone has posted a meaningful update to the 26 steps by Brett, this would be it.
That list is almost item for item what I strive for on quality sites. Unfortunately, I've found it distressingly difficult to get clients to grasp just why they should implement the items on that list. It takes a lot of work, year in and year out, which is what makes that a good marker for quality, you can't achieve it without effort.
|titles and H1/H2 relating to link text |
Less spoken about, but very potent virtue in post PR days.
|Low number of affiliate links |
Not all by itself. It won't matter if there are loads of unique, useful content IMHO.
Few points from my side here -
- Low percentage of keyword anchor text in IBLs.
- Low percentage of IBLs from non-links pages.
- Links from pages that are part of Google News sites.
>>>titles and H1/H2 relating to link text
- Does anyone but an SEO do that?
- Isn't that on the old checklists for ranking well on Search Engines?
- And if mostly SEOs are doing that, doesn't that make it a fingerprint of someone who is trying to manipulate the search results?
- If it's a fingerprint, wouldn't that make it an anti-signal of quality?
I agree with the rest of your list, though. Good list.
Good spelling and grammar is often to be noticed on the sites you mentioned.
"Does anyone but an SEO do that? "
Yes, perhaps more often than SEOs. "Normal" people make pages about something, and link to the page with text that the page is about. That is what is normal.
In contrast many (non-quality) seo pages will have links like "buy cheap whatever", but of course the pages themselves won't have that groveling type of cheap text as their H1 or title.
It is a signal of non-quality to have most of your anchor text different than your title and H1. It's a mixed message, like you have no clue what your own page is about. (This would be particularly true of links on the same domain as the page in question... if most of the links say one thing, and the title and H1 are somewhat far afield of that, that is a signal of schizophrenia or lack of confidence.)
This is a minor thing perhaps in terms of signal of quality, but I think it's a major thing when looking at it in reverse... the "buy cheap whatever" point is a signal of extreme non-quality.
Ok, you've been given a while to digest my originally suggested list. I would like to give my theory on how Google's "Signals of Quality" come into play. I have never seen this actually discussed anywhere but it is becoming increasingly obvious that there are fewer black-and-white criteria that Google uses to either ban a web site or rocket it to the top of the search results. And this theory explains this.
First, it must be pointed out that there is a distinct difference between quality and relevance when it comes to search results. A search engine must balance the two because what searchers want is primarily relevance. Quality has become more important lately only because of how many low-quality sites are being added to the Net today. Steveb's list is a great list, but it mostly has to do with keyword relevance, not the "Signals of Quality" that have been talked about in recent Google patents.
Back to my theory: One of the most effective SPAM filters ever devised is called a "Bayesian Filter". There are many implementations of Bayesian filtering used for many different applications ranging from open source to highly commercial. It is assumed that Google uses Bayesian filtering to categorize their search results and target AdSense ads. They didn't invent it, but they are very, very good at it.
Here is a simplified explanation of how Bayesian filtering works: You start with a sample of known good documents and a sample of known bad documents and you count the occurrence of words in each. If the documents are in HTML, you can also count the occurrences of HTML tags. You then divide the count of each word in the good documents by its count in the bad documents, and the other way around, to get two lists. One list is a list of words that are likely to be found in a good document, and the other is a list of words that are likely to be found in a bad document. You will also have the percentage of likeliness of each.
Next you take a test document and count the words in it, adding up the word count of words that you have from your good and bad word lists. Dividing these counts gives you a likelihood that your test document is either good or bad. This all sounds kind of complicated, but it can be done programmatically very quickly.
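To make the description above concrete, here is a small sketch of a word-count classifier of this kind. It uses the textbook naive Bayes form (add-one smoothing and log odds) rather than the exact divide-the-counts recipe described above, and it says nothing about what Google actually runs; the function names and the 50/50 prior are assumptions for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    """Very crude tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def train(good_docs, bad_docs):
    """Count word occurrences in the known-good and known-bad samples."""
    good = Counter(w for doc in good_docs for w in tokenize(doc))
    bad = Counter(w for doc in bad_docs for w in tokenize(doc))
    return good, bad

def prob_bad(text, good, bad, prior_bad=0.5):
    """Estimate the probability that `text` belongs to the bad sample."""
    vocab = len(set(good) | set(bad))
    good_total = sum(good.values())
    bad_total = sum(bad.values())
    log_odds = math.log(prior_bad / (1 - prior_bad))
    for word in tokenize(text):
        if word not in good and word not in bad:
            continue  # words never seen in training carry no evidence
        # Add-one smoothing keeps a zero count from producing a zero probability.
        p_word_bad = (bad[word] + 1) / (bad_total + vocab)
        p_word_good = (good[word] + 1) / (good_total + vocab)
        log_odds += math.log(p_word_bad / p_word_good)
    return 1 / (1 + math.exp(-log_odds))
```

Usage would be along the lines of `good, bad = train(good_pages, bad_pages)` followed by `prob_bad(new_page_text, good, bad)`. HTML tags can be fed in as extra tokens if you want them counted too.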
When a Bayesian filter is used, something happens that almost appears to be magic. Patterns in words and phrases that most humans would never notice become obvious. One example is the HTML tag that makes text red. At one time, the presence of this red tag in email meant a 94% chance that the email was SPAM. Also, things that you would think meant a document was good may not have much bearing on determining if a document is good. Have you noticed how a lot of email SPAM is stuffed with words that wouldn't normally be associated with SPAM? The spammers are trying to get around Bayesian filtering.
To implement this, Google would need human surfers to keep their filter calibrated, marking some web pages as definitely good or bad. This would explain the role of eval.google.
I've explained all this to say that it appears that what Google is now doing is a sort of Bayesian filtering that takes into account external factors such as domain registration and the reputation of the web host. Assuming this, look again at my original list as well as the additions to the list suggested by others.
A web site with a domain that is registered for only a year certainly does not mean that the web site is low quality, BUT, given the fact that a known sampling of high quality web sites (CNN.com, Adobe.com, WellsFargo.com) have their domain names registered for many years in the future, and nearly none of the known bad (scraper) sites have their domains registered for more than a year, somewhere in the 'plex there is a statistic that says that if a site has a domain name that is registered for 10 years into the future, this web site has a XX% chance of being a quality web site. When this statistic is added together with the percentages from all the other "signals of quality", they have a good idea of the quality of a web site.
Let me also point out that while the robots.txt issue seems silly to me, with a quick sampling of sites, 100% of the good sites that I checked use a robots.txt file, and 0% of the scraper sites that I checked use a robots.txt file. Though I would imagine that this would not be a great indicator, I think that somewhere in the 'plex there is also a statistic that says that if a site uses a robots.txt file, this site has a XX% chance of being a quality web site. Again, when added to the other statistics, Google has a good idea of the quality of a web site.
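Here is a rough sketch of how per-signal percentages like these could be estimated from a sample and then combined into one figure. It assumes equal numbers of good and bad sites to start with and treats the signals as independent, which is the usual naive Bayes simplification; the function names are made up, and whether Google does anything like this is pure speculation.

```python
from math import prod

def signal_probability(good_with, good_total, bad_with, bad_total):
    """Chance that a site is quality, given that it shows the signal.

    Add-one smoothing keeps a tiny sample (for example 100% versus 0%)
    from producing a claim of absolute certainty.
    """
    rate_good = (good_with + 1) / (good_total + 2)
    rate_bad = (bad_with + 1) / (bad_total + 2)
    return rate_good / (rate_good + rate_bad)

def combine(probabilities):
    """Combine independent per-signal probabilities into one overall figure."""
    yes = prod(probabilities)
    no = prod(1 - p for p in probabilities)
    return yes / (yes + no)
```

For a hypothetical sample of 10 good and 10 scraper sites where all the good ones and none of the scrapers have a robots.txt file, `signal_probability(10, 10, 0, 10)` comes out at about 92% rather than 100%. Strong signals reinforce each other when combined, and a signal at 50% changes nothing.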
What this means is that as professional webmasters we need to take into account pretty much everything that gives the end-user a good experience, plus a few other technical details.
How the signals of quality are used by Google is fodder for another thread since they don't seem to directly affect search results rankings, but I have seen many sites banned in the past 3 weeks that don't show many of these signals of quality.
I think that the days are over where we can make blanket statements like "a good web site WILL ALWAYS do XYZ..." or "a good webmaster would NEVER PDQ..." For example, it used to be that the use of doorway pages would get you banned by Google. Now Google admits that there are legitimate uses for doorway pages. Somewhere there's a statistic...
It's all in percentages now, and I want my sites to have percentages that place them firmly in the realm of a "quality" web site.