Forum Moderators: open
<added>
P.S. I won't be posting as often (gotta work, ya know :), but I will be checking this post and chiming in when there's something I can add.
</added>
While it's true that the 20 character identity would add to storage space, at this time (3 billion pages) it would "only" be 16*3 billion, or around 40-ish gigs of space. If you spread this out over, say, 40 machines that's a gig apiece: not a lot. And the index would be spread out over more than 50 machines.
Excuse me, but read the essay by Brin and Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," about the Google architecture. They use two inverted indexes, the "fancy index" and the "plain index." Between these two indexes, plus the other places in the system where the docID is used, it amounts to a total space requirement of two docIDs per word per document indexed.
Yes, that's not one docID per document, it's two docIDs per word per document.
You can use a 20-byte hash if you like, but I think a four-byte or five-byte docID would make just a little more sense.
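For a rough sense of the size tradeoff, here's a back-of-the-envelope sketch using the "two docIDs per word per document" figure from the Anatomy paper. The 500-words-per-page average is my own assumption for illustration, not a number from the paper or this thread:

```python
# Back-of-the-envelope index sizing, following the "two docIDs per
# word per document" figure from the Anatomy paper. The 500 words/page
# average is an assumed value for illustration.
PAGES = 3_000_000_000        # ~3 billion pages in the index
WORDS_PER_PAGE = 500         # assumption, not from the thread

def index_bytes(id_size):
    """Total ID storage: two docIDs per word occurrence."""
    return 2 * id_size * WORDS_PER_PAGE * PAGES

for id_size in (20, 5, 4):   # 20-byte hash vs. 5- and 4-byte docIDs
    tb = index_bytes(id_size) / 1e12
    print(f"{id_size}-byte ID: ~{tb:,.0f} TB of docID storage")
```

Whatever the exact per-page word count, the 20-byte hash costs five times the raw docID storage of a 4-byte ID, which is why a compact docID looks attractive.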
This is nothing new, and honestly, they are handling it correctly. www.domain.tld is not necessarily the same as domain.tld
Recap
- Update will take several days to finalize. Sites will flux in/out until then.
- PR on the toolbar is not reliable.
- PR from -fi is not reliable.
- Directory has not been updated or is glitchy at times.
- Sit back, relax, and enjoy the flight.
> Brett, any chance of a trophy or citation or something
> for GG on this one?
He'd never take a gratuity. (I've even tried to help with algo on many occasions ;-) no go.
OK, so the actual hashed IDs might only be 40GB, but you need to associate each of those IDs with every word on the page in order to have a searchable index, probably by associating words themselves with the pages on which they occur. That's where the big-time storage comes in, since you effectively have to multiply that ID by every word, but it's necessary to index in this manner for computational efficiency. Gigabytes become terabytes pretty quickly... and you'll need to keep most (if not all) of it in RAM to return 10-100 results in under a second. Further computational and storage efficiency would be gained by keeping the IDs as short as possible, and in that sense I can see some credibility in the theories about running out of address space.
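To make the "multiply that ID by every word" point concrete, here's a minimal inverted-index sketch (toy data, not Google's actual structures; their scheme per the Anatomy paper stores roughly two docIDs per word occurrence, not one):

```python
from collections import defaultdict

# Minimal inverted index: each word maps to the docIDs of the pages
# containing it, so every distinct word on a page costs at least one
# docID entry in the index.
def build_index(pages):
    """pages: dict of docID -> page text. Returns word -> sorted docID list."""
    index = defaultdict(set)
    for doc_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

pages = {1: "cheap widgets", 2: "widgets for sale", 3: "cheap gadgets"}
index = build_index(pages)
print(index["widgets"])  # -> [1, 2]
print(index["cheap"])    # -> [1, 3]
```

Answering a query is then just intersecting these docID lists, which is why the index has to be laid out this way (and kept largely in RAM) for speed.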
I've learned SO MUCH on this forum!
The most astonishing thing so far that I've learned on this go-round:
BRETT actually has a sense of humor, and a GOOD, DRY ONE @ that!
Just when I thought that the mods were nuns with rulers poised to slap the hands of anyone trying to be creative... thanks for the comic relief, Brett (and for the fantastic forum that you have created)!
BTW, if this post is deleted, please disregard the 'sense of humor' reference.
IFFF this PR sticks (and my hunch is that it will for most sites), it seems like Google is now awarding PR more conservatively than it has in the past. In the past (it's been my observation that) getting links from fewer high PR sites resulted in a greater boost in PR. Comparatively, it seems now that getting links from a greater number of high PR sites is resulting in a relatively smaller boost (or no change) in PR. This change seems to be across the board on all of my sites.
I always felt that Google's (pre-Dominic) algo was rather too easy to 'work with' in order to achieve a high PR. Simple trick seemed to be to have 1 or 2 PR-7 sites link to you to get a PR of 5 or 6 (all other on and off page factors constant).
If that is actually the case, then this change is welcome since it's in line with some recent observations that Google's algo seems to be laying less emphasis on PR now (as opposed to the past) when ranking pages in search results.
But if incoming links from high PR sites are less of a factor now, how does that fit with Google's own description of their service? Here are some excerpts from one of their help pages.
-------
The heart of our software is PageRank...
PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important."
-------
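The "votes weighted by importance" idea in the excerpt above can be sketched as a simple power iteration. This is the textbook PageRank formulation, not Google's production code, and the three-page link graph is made up for illustration:

```python
# Toy PageRank power iteration: each page splits its score among the
# pages it links to, plus a small damping term. Standard 0.85 damping
# factor from the original PageRank paper.
DAMPING = 0.85

def pagerank(links, iterations=50):
    """links: dict page -> list of pages it links to (no dangling pages)."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - DAMPING) / n for p in pages}
        for page, outlinks in links.items():
            share = rank[page] / len(outlinks)
            for target in outlinks:
                new[target] += DAMPING * share
        rank = new
    return rank

# A links to B and C; B and C each link back to A.
links = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # A collects votes from both B and C
```

Note that A ends up highest because it receives two "votes" while B and C receive half a vote each, which is the weighting behavior the help-page excerpt describes.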
So, is Google changing the *heart of their software*? Seems like this would be a strange thing to do. Almost like Microsoft deciding to dump Windows for Linux. Ok, maybe that's a weird analogy, but it just feels off to me. Yet, this dance hasn't helped my PR in -fi (yet) despite the highly ranked links I've gotten from related sites. So, I am definitely confused. Am still hoping the rest of the dance will change this.
and I am wondering, since it's new to me and having a bad effect...
ok, it's true that www.domain.tld is not necessarily the same as domain.tld
but shouldn't google be able to tell if they are the same thing?
shouldn't they know that index.shtml is the same in both instances too?
I understand your point that this is nothing new, but I can tell you for a fact that it's new to a lot of sites that have multiple listings of different paths to the same page, and it stands to reason that this has something to do with why those pages dropped significantly.
I'm not trying to start an argument with anyone, but it is probably not coincidence that having the same page indexed 4 or more times would dilute PR and ranking relevancy.
If google has to determine which page of a themed site should show up, and has ALWAYS chosen the main page, splitting this main page into 4 duplicate listings can't make this choice any easier.
I understand that there is a lot of update left, and I'm not rushing out to change my .htaccess file until I see if this is something google will fix or not. I'm not going to ignore the problem and say there's nothing wrong though.
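For anyone who does decide to consolidate on one hostname, a common .htaccess approach is a permanent redirect from the bare domain to www (a sketch only; assumes Apache with mod_rewrite enabled, and example.com is a placeholder for your own domain):

```apache
# Send non-www requests to the www host with a permanent (301)
# redirect, so only one version of each URL gets indexed.
# Assumes mod_rewrite is enabled; example.com is a placeholder.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

The 301 status matters: it tells crawlers the move is permanent, so link credit should follow to the canonical hostname rather than being split.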
Maybe it's pointless to advise people to bear in mind that results probably will change some (based on the digging that I've done today), but hopefully it will ease a few minds, too. :)
[edited by: GoogleGuy at 3:32 am (utc) on June 17, 2003]
There's absolutely no doubt that PageRank is at the heart of Google's algo(s); however, PageRank is only one of the variables that Google uses (in addition to other on page and off page factors: anchor text, kw density, etc.) in order to rank pages in search results. I would like to think that in order to determine SERPs, the importance of PageRank ('PR') in relation to other on page and off page factors is not a constant, and is controlled by Google's engineers who write that algo.
Let's not forget that PageRank has been bought and sold in the past (and probably still is); therefore it's important that, in order to deliver high quality results, the authors of an algo are able to fine-tune the importance of PR in determining the search engine result positions of web sites.
With content (and not PR) being the king, there are plenty of other reasons why PR alone must not be the dominant factor in determining the overall position of web sites in search results.
Regardless, best of luck (PR and otherwise) with this dance.
[edited by: rts5678 at 3:51 am (utc) on June 17, 2003]
> Maybe it's pointless to advise people to bear in mind that results probably will change some (based on the digging that I've done today), but hopefully it will ease a few minds, too. :)
Here is hoping and praying that you had a chance to dig into the ranking of an internal page vs. index page issue and hope things will change for better:). Thanks
Hey, I wanted to get a chance to ask about what searches were good or bad. :) It snowed me under enough that I'll probably leave stickies off in the future except to duck in, ask for specifics from someone, and duck back out.
What, you thought I had some outside life or something? Note to self: gotta work on getting an outside life. :)
As far as stickymail goes, I was heartened to hear a lot of really nice messages. Even the reports of searches people disliked weren't as bad as I'd expected from reading the boards here; that was another interesting surprise.
I do understand that PR isn't the only factor in determining search position, and really, I guess you and I might have been talking about different things. I was referring to backlinks increasing PR, rather than PR affecting SERP position. But in any case, GG says PR is still stewing, so I will just be patient (or become a patient in a mental hospital) and see what happens...
For me the update is good news, nice to see fresh backlinks in at last, especially as 2 months ago I had virtually no external links at all.
My index page is still not there for my main keyword in Google, but in Yahoo it's jumped a whole pile of places from around 150 to 41, all thanks to WW and a systematic 2-part campaign of content and link building. The jump is no doubt because of the links, but the question is does the better position in Yahoo bode well for Google too? As Yahoo uses Google it seems likely that I should see a similar jump, but on the other hand it's strange that my index page does well in Yahoo first.
What do people think are my chances of a similar jump in Google SERPS?
All the best everyone,
Jeremy
Obviously there are a bunch of other variables involved, but it seems strange to have an internal page with all internal links with a PR 2 higher than the home page. Did this guy crack the PR code and figure out how to make PageRank consolidate on an internal page through an interlinking scheme?
A night's sleep... and I feel much better... but still sad enough to get out of bed and head straight to the PC.
The above is absolutely correct. Many of those I received yesterday are back to more or less where they should be on their original index pages. A relief really, because there are some situations that just don't add up.
I'll study (and monitor) more today, including those that came in overnight. But it is clear that we are still in update flux (you'd think I'd know better than to think otherwise on day 1 of the dance... let's just call it therapy!)
Thanks GoogleGuy by the way for looking over those. They were very typical.
At least today has started more positively. I lost yesterday completely!
As an aside. All interior pages (35) are still in the top 5 for their key phrases and the new content is also right on top. Losing the home page's key phrases accounts for only about 10% of the traffic, if that. It is the yardstick by which potential clients judge the site, however. I can't really say "Join my site, look how it ranks on MSN when you search for 'company name', now can I?"
I'm not whining here, just giving an example of why diversification (within a site) is not always enough.
OTOH, the directory is contractually limited in size and scope, and almost full. All I really have to do is keep bringing qualified traffic, regardless of where I'm ranked. Almost.
[edited by: Powdork at 6:54 am (utc) on June 17, 2003]
> But if incoming links from high PR sites is less of a factor now, how does that fit into Google's own description of their service.
> The heart of our software is PageRank...
I think we have to be careful not to confuse marketing prose with engineering realities.
PageRank likely will always be an important part of the process, even while the approach to calculating it might change. But its importance to the people developing the algorithms is one thing, and its importance to the people writing "why you should choose Google" copy is another. :)
In my case I think I have about 18 hours left in my penalty. The first of the June 14 fresh tags have disappeared. The second set should be dropping off soon. After the June 15 tags go I should be back on top.
The new index is now showing up on SJ and DC.
Funny, for one of my search terms I get different results on -sj than on any of the other datacenters. It's been this way for the last two updates, but I can't figure it out. I've never seen the -sj result turn up on www.google.com outside of the 'dance' period.
Has anyone else noticed this?