

Page or site related factors

Everyone says that Google only looks at pages. Why?


BigDave

6:27 am on Jan 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I regularly see comments along the lines of "Google only looks at pages, not at sites"

Now PageRank is calculated *by the page*, but I can think of no reason why Google would not be able to include sitewide factors when deciding the rank of a page.

Has anyone ever heard this from Google? Has it been mentioned at a conference or in an article? Or is this just an opinion that started because of how PR works, and has been expanded to cover all aspects of the algorithm?

By site, I am not necessarily referring to a domain. There are ways to help define sites within domains.

jdMorgan

6:57 am on Jan 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



BigDave,

As a recent poster of such a comment, I'll say it's at least a somewhat-informed opinion, but I don't recall seeing it in print anywhere. And yes, my opinion is based on the paper about how Google calculates PageRank - and more to the point, how it crawls links and determines PR by counting "votes" for pages.

Also, the "only looks at pages" bit is an intentional simplification, meant to explain at a certain level. We get a lot of posts here asking questions like, "What PR can I get for my site?" and so the response is meant to point out that a site does not get PR; pages get PR. This information is offered in the same way that we explain to students that objects are solid, liquid, or gas, and leave for later the fact that even "solid" matter is composed of mostly empty space, unless the matter in question is a neutron star or other exotic matter.

As to whether Google considers site-wide factors, I don't know. They state that their algorithm has many parts, and it is left to us to discern the inner workings of their "black box" using insufficient-quality data, sampled only sporadically; not much more than guesses, in many cases...

It's an interesting question, though, whether anyone has seen evidence of site-wide factors with the exception of actual site or domain bans.

Jim

BigDave

7:56 am on Jan 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It was your post that reminded me that I was interested in this question, but I have seen it mentioned in relation to several other issues, such as outbound links and theming. I have even seen it used as an argument that Google *can't* do something because it only looks at pages, not sites.

I can actually think of several sorts of groupings of pages that could be useful for Google to recognize, and they have a building full of very smart people who think about these things all day.

I do see the value of stating that PageRank is by the page and not by the site. But when the students become more advanced, you must start to look beyond the three basic phases of matter; or in this case, the limitations put on one part of the algorithm do not necessarily apply to the other parts.

vitaplease

8:24 am on Jan 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I would say there are quite a few threads around on Google and site related matters, but you are right, the focus of the algo seems page related. Mostly because that is what one can most easily extract from the publications and from ranking "proof".

Some site related matters Google does take into account:

1. In general, the higher the PageRank of the "site" (its highest-PageRank page), the deeper its content (internal pages) gets indexed and the better it gets crawled.

2. The same goes for the voting weight of internal link texts (depending on the link flow throughout the site).

3. Having a few high-PageRank pages within the site allows one to Fresh-link to internal pages more easily, thereby allowing those internal pages to be freshly crawled/indexed.

4. Wrong site-outbound links (to too many PR0 link farms, or too heavily cross-linked) could get the whole site a penalty.

5. And what I always find an interesting concept: if my PageRank 8 page links to another domain's PageRank 4 page, I am not only voting for that page. As described in 1 and 2, I am voting, in a very strong way, for the whole site; in fact I am voting for the sites that other domain is voting for, and so on.

I am sure there are other points more related to sites than pages..
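Point 5 above falls straight out of the PageRank recursion. Here is a toy power-iteration sketch (damping factor 0.85, all page names and links hypothetical, not anything Google has published about its implementation) showing how one strong external vote into a home page lifts the whole site through its internal links:

```python
# Toy PageRank power iteration. One external "vote" into "home"
# raises not just "home" but the pages it links to internally.

def pagerank(links, iterations=50, d=0.85):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - d) / len(pages) for p in pages}
        for p, outs in links.items():
            share = pr[p] / len(outs) if outs else 0
            for q in outs:
                new[q] += d * share  # each link passes on a share of PR
        pr = new
    return pr

# A small hypothetical site: home links to two internal pages,
# which link back; one outside page ("bigsite") votes for home.
site = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home"],
    "bigsite": ["home"],
}
pr = pagerank(site)
# "about" and "products" benefit from bigsite's vote even though
# nothing external links to them: the gain flows through "home".
```

This is only the published random-surfer model, of course; whatever Google actually runs is a black box.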

martinibuster

8:37 am on Jan 12, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Sitewide relevance... meaning what the overall theme of a website is about.

My personal opinion is that the topical relevance of the sites linking to you has an effect on determining what you are relevant for. A post warning against guestbook spamming raised this question not too long ago.

If you haven't read the post, to recap it for you: This person had conducted a wide ranging Guestbook Link Campaign.

The webmaster Googled their own web site, and out of curiosity clicked the "find similar sites" button. The webmaster was horrified to find that Google equated "similar" to guestbooks, as all the results were guestbooks. Which led the webmaster to extrapolate that a certain amount of theming, based on the topic of the site linking to you, was going on.

Just a thought. I'm not saying that this is what is going on. I am only geeking out on the algorithm.

troels nybo nielsen

8:42 am on Jan 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Dave, one interesting question here might be: what _is_ actually a site?

On two of my domains there are entities that most non-robot visitors will easily recognise as separate sites:

myfirstdomain.dk/siteone/
myfirstdomain.dk/sitetwo/
mythirddomain.dk/sitefour/
mythirddomain.dk/sitefive/

On the other hand, myseconddomain.dk is really _one_ site and actually has a couple of pages on domains that are not mine.

Could one expect Google's algo to recognise these entities as sites? Perhaps. But there are experts who claim that it can't.

BigDave

9:49 am on Jan 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Will Google be able to recognise every actual different type of site? Probably not. But I bet they could get most of them.

But there are some known types that it can recognise.

mysite.com/~ricky
mysite.com/~lucy

Would be quite easy to split up. That is the whole point behind ~name.

mysite.com/ricky/
mysite.com/lucy/

would be tougher. But there would likely be identifiable linking patterns within the site.

In fact, it just occurred to me that they could have an algorithmic site or group definition that would not necessarily match what the human designers would consider a separate site.

If you had a main site that covers widgets and dongles, with totally different navigation trees for each and the home page as the only connection, then from an algorithmic point of view it might make sense for it to be considered three separate sites: widgets, dongles, and the home page.
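That grouping idea can be sketched as a simple graph problem: drop the shared home page from the internal link graph and take the connected components as the algorithmic "sites". A minimal, purely hypothetical sketch (page names invented; no claim this is how Google does it):

```python
# Hypothetical "algorithmic site" detection: remove a hub page from
# the internal link graph, then find connected components.

from collections import defaultdict

def page_groups(edges, hub):
    """edges: iterable of (page, page) internal links; hub: page to exclude."""
    graph = defaultdict(set)
    for a, b in edges:
        if hub not in (a, b):       # ignore links through the home page
            graph[a].add(b)
            graph[b].add(a)         # treat links as undirected here
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:                # depth-first flood fill
            p = stack.pop()
            if p in comp:
                continue
            comp.add(p)
            stack.extend(graph[p] - comp)
        seen |= comp
        groups.append(comp)
    return groups

links = [
    ("home", "widgets"), ("widgets", "widget-specs"),
    ("home", "dongles"), ("dongles", "dongle-faq"),
]
groups = page_groups(links, hub="home")
# two components: the widget pages and the dongle pages
```

With the home page removed, the widget tree and the dongle tree fall apart into exactly the two groups a human designer would name.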

Remember that if someone says it can't be done, it really only means that they do not know how to do it. Who would have thought Google News or Froogle could be done?

Marcia

9:55 am on Jan 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Could one expect Google's algo to recognise these entities as sites?

troels, if those sites were all on Geocities, could they all be mistaken for one site? If someone did one of those sites about /doing_Geocities_homepages/ with all of the pages on the same topic, and the site was linked to from their main site, it might be. But if there's one about /my-pets/ and another about /my-recipes/ they'll carry different sets of recognizable keyword groupings across all the pages.

Maybe we can think not only in terms of different sites, but of different groups of pages, where commonalities in keyword families across them would identify the groupings. That gives a little different perspective, and going a step further, it carries over to looking at the external sites linked from and to. Within and between sites there are inter-connections. That's why "neighborhoods" are recognizable - and they're also nicely defined by how ODP is set up. An ODP listing helps a site, and those listings aren't for pages; they help define what a whole site is about.

In the guestbook thread, checking similar pages showed what the site's reputation was. It's clear to me that there are inter-relationships being looked at. It's not hard evidence, but I've seen effects from having a correlation between link text and page titles.

With one site, the addition of one keyword phrase to a page title made a drastic difference. Carry that across a whole site and a pattern starts to emerge if related keywords are used throughout. The interior page that had its title changed is now sitting as the indented result. The phrase isn't in the title of the homepage, which isn't the one that's indented. There are instances of link text from other sites pointing to the homepage, which went up prior to the time the phrase was added to the interior page title. So those links are connected to the homepage, which is then connected to the second page. It's all related, and the change to the interior page had a noticeable impact.

The reverse is true for another site and another phrase. The homepage has the exact phrase in the title, while the interior page, the one that's indented, has the words of the phrase but separated by another word in between. It would be interesting to see whether removing that extra word would reverse which one is indented, since the phrase is closer to the beginning of the title. And two of the words are in the filename, which is a /directory/.

PageRank is page specific, but I've tended for a while to think in terms of optimizing individual pages for Inktomi and optimizing the whole site for Google. Regardless of PR, I've always looked at Google as site-wide rather than single pages. No hard evidence; it's what I've always assumed.

There are probably a lot of clues to look at with those indented results.

martinibuster

10:15 am on Jan 12, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



but I've tended for a while to think in terms of optimizing individual pages for Inktomi and optimizing the whole site for Google.

This makes sense, and hews to Brett's "inverted pyramid" directory structure (as I recall it). General keywords at the top, more specific as you dig into the site.

optimizing individual pages for Inktomi

Ink seems to give precedence to paid pages. Within Ink results, my paid index page ranks 11, which is better than any of the freebie pages that are in the Ink DB. In fact, none of my freebie pages rank.

[edited by: Marcia at 10:16 am (utc) on Jan. 12, 2003]

glengara

12:03 pm on Jan 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The much used quote from GG:
*If you work really hard to boost your authority-like score while trying to minimize your hub-like score, that sets your site apart from most domains.*

Would this not indicate SOME site wide measurements?

Dante_Maure

12:52 pm on Jan 12, 2003 (gmt 0)

10+ Year Member



The much used quote from GG:
*If you work really hard to boost your authority-like score while trying to minimize your hub-like score, that sets your site apart from most domains.*

Would this not indicate SOME site wide measurements?

I would go so far as to say the much abused quote from GG. :)

This one quote has been the source of a great deal of wild conjecture. Some of which might have merit, but much of which I believe is way off base.

The quote is almost always removed from the very specific context in which it was originally offered... in relationship to the use of javascript to hide links.

The only thing it clearly expresses is that if a page has a disproportionate ratio of inbound links to outbounds it could be a sign of "un-natural" linking procedures.

A closer look within original context shows that the quote in no way definitively states that Google currently does check for this... just that they could if they wanted to.

The discussion was about a means by which webmasters could intentionally manipulate PR which is obviously not in Google's best interest. GoogleGuy then prefaced the above quote by basically saying "Hey kids, you never know what we're going to factor into our scoring methodologies... think about this."

Given the context, I don't see where that quote is any clear indication that Google does in fact take any type of site wide scoring into consideration.

Do they? Perhaps to some small degree... or may choose to do so in the future, but GG's above statement does little to indicate it definitively one way or the other.

There are countless folks devoting enormous time and energy to studying Google's ranking methodologies, and I have yet to see anything even resembling solid evidence that such factors are currently at play to any significant degree. If they were, I have no doubt we would see much more evidence to support it given the enormous think tank which exists here.

Since theories are all that exist in this matter (outside of the Googleplex), I'll share mine...

Topic Sensitive PageRank [webmasterworld.com] will indeed come into play to some degree in the future... but isn't currently used beyond anchor text and the keyword relevance of other pages (not entire sites) directly linking to the target page being scored.

Furthermore, if Topical PageRank does get factored in, I don't see it being applied by domain at all.

As Marcia has suggested above, it is more likely to be based on "neighborhoods" (of pages, not sites) which will be defined by relational link geography... regardless of the domains where these pages and links are found.

Think about it. If MIT published the world's most definitive paper on artificial intelligence, would Google apply a scoring methodology that ensured it did not come up first because the "theme" of MIT's site is not definitively about AI?

More likely it would have the chance to come up first because many other pages devoted to AI spread out across many domains would all be linking to the MIT paper. This would in effect create an AI "neighborhood" that is very clearly defined by theme, but not limited at all to specific domains... just a topography of similarly themed pages.
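The neighborhood idea Dante describes matches the published Topic-Sensitive PageRank approach: the same power iteration as ordinary PageRank, but the random surfer teleports only to pages in a topic set, so scores are biased toward that topic's neighborhood regardless of domain. A toy sketch (page names and links invented; whether Google ever deploys this is exactly the open question of this thread):

```python
# Topic-biased PageRank: teleportation is restricted to the topic set,
# so the topic's link neighborhood dominates the scores.

def topic_pagerank(links, topic, iterations=50, d=0.85):
    pages = list(links)
    # Teleport vector: uniform over the topic pages, zero elsewhere.
    v = {p: (1.0 / len(topic) if p in topic else 0.0) for p in pages}
    pr = dict(v)
    for _ in range(iterations):
        new = {p: (1 - d) * v[p] for p in pages}
        for p, outs in links.items():
            share = pr[p] / len(outs) if outs else 0
            for q in outs:
                new[q] += d * share
        pr = new
    return pr

links = {
    "mit-ai-paper": [],             # the target page
    "ai-blog": ["mit-ai-paper"],
    "ai-course": ["mit-ai-paper"],
    "cooking-page": ["ai-blog"],    # off-topic page, links in anyway
}
pr = topic_pagerank(links, topic={"ai-blog", "ai-course"})
# "mit-ai-paper" comes out on top: the AI neighborhood carries it,
# even though its host domain as a whole isn't "about" AI.
```

Note the grouping here is by teleport set and link topography, not by domain, which is the distinction Dante is drawing.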

To bring this all full circle to the original topic at hand... (imagine that ;))

Even in the Topical PageRank environment described above, we see an emphasis on individual pages, as well as groups of pages, but not sites specifically.

In the event that a site is entirely devoted to a given theme, it would in effect be a "neighborhood" of pages in and of itself, but the domain would not be the determining factor... just the grouping of related pages.

annej

3:19 pm on Jan 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



In the event that a site is entirely devoted to a given theme, it would in effect be a "neighborhood" of pages in and of itself, but the domain would not be the determining factor... just the grouping of related pages.

This makes more sense to me. People here have asked if they should put everything in the root directory. My experience is that it doesn't seem to matter. Related links seem to have more influence than what directory a page is in.

Hmmm, we've talked about how Google gives a temp page rank based on the page it is linked from. I wonder if it matters if the link is from the same domain or not.

Anne

BigDave

5:59 pm on Jan 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I was never intending to refer to a domain as a site. The vast majority of sites on the web do NOT have their own domain name.

aol, msn, geocities and angelfire probably *each* host more sites than the total number of registered domain names.

There are many ways to define a group of pages. Certainly there is theme and commonality of keywords. There is similarity of navigation features. Framed sites would be fairly easy.

There would be advantages to using different groupings for different factors. Sometimes ranking by neighborhood would be too big, and sometimes it is a good unit to use.

BigDave

6:04 pm on Jan 12, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hmmm, we've talked about how Google gives a temp page rank based on the page it is linked from. I wonder if it matters if the link is from the same domain or not.

The guessed PageRank is a totally different matter. It is just the PR of the domain root minus the number of directory levels deep you are. Many of these "guesses" are on pages that have no incoming links, so it is hard for them to know what other sort of group to put it in. It is just a quick way for the toolbar to put up a guess; you shouldn't read any more into it than that.
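For what it's worth, the guessing heuristic being described (root PR minus one notch per directory level, as ciml also observes later in the thread) is trivial to sketch. This is a guess at an undocumented toolbar behaviour, not Google's actual code, and the domain is the hypothetical one from earlier in the thread:

```python
# Sketch of the Toolbar's guessed PR: root PR minus directory depth.

from urllib.parse import urlparse

def guessed_pr(url, root_pr):
    """Estimate Toolbar PR for an unranked page from the root page's PR."""
    path = urlparse(url).path
    # Count directory levels: /widgets/x.html is one level deep.
    depth = path.strip("/").count("/")
    return max(root_pr - depth, 0)

print(guessed_pr("http://www.wallys-widgets.com/", 7))                # 7
print(guessed_pr("http://www.wallys-widgets.com/widgets/x.html", 7))  # 6
```

This matches annej's observation below: PR7 root, PR6 subdirectory pages, PR5 one level further down.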

annej

3:08 am on Jan 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



PR of the domain name

How is the PR of the domain established? From my experience I would guess it is either the root-level index page or the top-ranked page, but I'm not at all sure.

Anne

BigDave

3:12 am on Jan 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I misspoke. It would be the PR of the root-level document of that domain. In other words, if you go to www.wallys-widgets.com/ it would be the PR value of the document you get.

annej

3:26 am on Jan 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That fits for me. The site I am most concerned about has PR7 on the root page, PR6 on all subdirectory root pages, and PR5 on all pages linked from the subdirectory root pages. Google's guess always sticks unless some of the third-level pages get outside links of their own, which seems to bring them up to PR6. At least that is how it seems to work this month.

Anne

BigDave

3:32 am on Jan 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That's probably why they make their guesses that way; it seems fairly accurate for your average small website.

stevenha

4:08 am on Jan 13, 2003 (gmt 0)

10+ Year Member



BigDave, You've made a good point. The PR guesses for subpages seem to be based on PR of the root domain, and that's a kind of evidence suggesting that Google keeps some site-level information, in addition to page-level info.

I've also thought that the penalties are sometimes a site-level thing too.

BigDave

4:13 am on Jan 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I really don't think the PR guess being based on the domain root has anything to do with any search factors. That is strictly a toolbar issue; it has nothing to do with the index or the search.

ciml

11:34 am on Jan 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Vitaplease's nice list of ways that PageRank can depend on the site all follow from the fact that the link map of the Web is quite domain-centric (a typical page has more internal links than external links). PageRank merely follows these links and therefore tends to reflect domains. The Toolbar guess of one less notch of PageRank per "/" happens to reflect typical Web site construction, so it is often a good indicator of PageRank too.

There's no reason to believe that Google can't use "domain"-based theming at some point; after all, domains are used to spot artificial link networks, so why not natural ones? Personally, though, I prefer mechanisms that work on the link map. That way it doesn't matter if a site covers many domains, or if many sites are on one domain.

On the other hand, Google have been found to use domain extensions for the country filters, and apparently IP geolocation for both the country filters and the boost/penalty used for some travel-related phrases.

jdMorgan:
> ...it is left to us to discern the inner workings of their "black box" using insufficient-quality data, sampled only sporadically

Indeed, and black box experiments are fun. My own lead me to say things like "PageRank is transferred within a domain just as it is between domains, to at least an accuracy of 1/30 of a notch on the Toolbar; but the anchor text benefit is quite different when the link isn't to a page on the same domain".

Part of my experiment (designed to quantify the latter) doesn't get spidered for some reason; I take that as a hint not to describe how I arrive at these statements in any detail.

Brett_Tabke

12:07 pm on Jan 13, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



>"Google only looks at pages, not at sites"

Fairly rare over here, as most have come to realize there is something to do with 'context' in the algo that isn't based on pages.

It's also funny that you won't hear: "my one page out of 8000 has a PR0, whatever will I do?" ;)