Forum Moderators: open
----
I'm starting this thread because another member suggested it would be a good idea: the main Google update thread is cluttered with posts like "OMG, I've been dropped in the new index!" and "Yippee, I'm now #1 on a key SERP". This thread is ONLY for serious, generic discussion of changes that you are observing with the new algo in this update - things like "Looks to me like PR is less important this month, and anchor text of inbound links counts more." How your site is doing has no relevance here unless you can explain why you think so in terms of a general algo update.
Been looking at five different categories ... -sj versus -fi ... three of the categories we're in, two we're not. Four of the five show more consistent quality results in -sj. So from here, it looks like the spam filters are not yet fully on in -fi.
In one of these categories in particular, a large, quality site fell to page two in -fi; it's OK in -sj.
For what it's worth, we also see more resemblance to the current www in -sj. If Google is using www as some sort of benchmark, that would suggest -sj is still the place that best reflects G's direction.
But, if G is not happy with the current SERPs, then perhaps -fi reflects the newer algo better?
Yikes.
Others have noted that single keywords are dropping and multiple keywords rising and this might reflect such an attempt.
If so, quite an undertaking. IMHO, Google may need quite a bit more tweaking if "-fi" is any indication of such an attempt.
Thoughts anyone?
"From this exchange, it seems safe to make the following assumptions (IMO):
1. -sj does not contain the complete results from the deep crawl.
2. Pages from the deep crawl will be added to the index in coming days.
3. As pages are added, backlinks will be added.
4. (At least some) SPAM filters have not been applied to the index yet. Thus -sj is currently more spammy than the final index.
5. Given the preceding 4 assumptions, the SERPs of the final index could be drastically different from what we currently see on -sj (IMO)."
Swerve's post was based upon GoogleGuy's responses in this thread
[webmasterworld.com...]
An excellent summary of all the clues, Swerve!
I've been doing a fair amount of research for 2 days - and the 'process' for update Dominic is totally different to what we have seen previously. The whole update process is changing.
FORGET the '-sj SERPs' per se - look at the algorithm that has derived the results. READ Swerve's summary of GoogleGuy's posts.
GoogleGuy has already given us a heads up - reread the links originally provided by Swerve.
What you can do right now is see how the -sj results are derived - and learn some very valuable stuff. Ignore the spam - assume it will get killed by one or more spam filters, which are yet to be applied (see 4 above).
Just forget that your backlinks are wrong - see 3.
Just forget that your deepcrawled April pages aren't in the -sj index - see 1 & 2.
And I'll add to what Bikeman said - yes - the -sj index is based on an old (previous) ODP dump - whereas the 'current' www is from a more recent ODP dump. It will get changed later - add that as 6. above.
So you can't analyse 'Dominic' the way everything else has been previously analysed at this point in an update cycle - THERE IS NO FINAL INDEX/SERPS TO VIEW AS YET
THIS one is being built differently - and for the first time - we are getting an insight into how it is being built. Normally we see it after it's built - we see it getting replicated. This time we are seeing an index actually being built - ingredient by ingredient. Don't waste the opportunity!
Let's face it - how do you keep scaling/ sorting/ analysing/ ranking a database on some 16,000 servers, in 7 datacentres, with over 3 billion webpages? I don't know - but Google is doing it as we speak.....
My advice right now - look for the patterns on the 'good' sites, ranking highly in the -sj index. Ignore the spam - it gets culled later. Are the good high ranking sites the same as the 'good' sites in the current www index? Ignore the spam - look at the good sites......
Chris_D
[edited by: Chris_D at 6:46 am (utc) on May 8, 2003]
If you think the update has finished - think again.
Results on www3 are still very different from www.google.com or www.google.co.jp. There are even listings from the DMOZ that were removed months ago still showing up.
"I'm puzzled by that statement?"
Who thinks it's over? The time it's taking indicates that something very different is happening with the 'Dominic index update'. The whole process has changed.
The google dance tool shows:
7 datacentres with 681,000 links to Yahoo.com - and only ONE datacentre has 384,000 links to Yahoo - and that's www-sj
Hence - everyone is focussing on -sj
It also shows that www2 and www3 are showing 384,000 Yahoo links - and www is showing 681,000 links.
If you do some analysis on where the data is coming from - which rfgdxm1 alluded to earlier - it gets a little easier to see why...
www-sj 216.239.47.166
www2 216.239.47.166
www3 216.239.47.166
whereas:
www 216.239.48.242
www-ex 216.239.47.2
etc.
Does it seem a little unusual that www2 and www3 and www-sj are all returning the same IP address? It's the same index version - the process has changed. It's no longer a case of www2 and www3 being different, reflecting a completed, different index that then gets integrated - they're building this index right in front of us now....
So www.google.com and 7 out of 8 datacentres are still showing a 'current' index - with little freshbot bits in it - but www2 and www3 are showing some 'testing', based on some 'older' data, older DMOZ, spam filters turned off etc.
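If anyone wants to see the pattern for themselves, here's a rough sketch. The IPs are the ones observed above (May 2003); the `.google.com` hostnames are my assumption about how the dance-tool names map to DNS, and for a live check you'd replace the hard-coded mapping with `socket.gethostbyname(host)`:

```python
# Group datacentre hostnames by resolved IP to spot which front-ends
# are serving the same copy of the index. Mapping is the one observed
# in the thread, not a live lookup.
from collections import defaultdict

observed = {
    "www-sj.google.com": "216.239.47.166",
    "www2.google.com":   "216.239.47.166",
    "www3.google.com":   "216.239.47.166",
    "www.google.com":    "216.239.48.242",
    "www-ex.google.com": "216.239.47.2",
}

def group_by_ip(mapping):
    """Return {ip: [hostnames...]} so shared back-ends stand out."""
    groups = defaultdict(list)
    for host, ip in mapping.items():
        groups[ip].append(host)
    return {ip: sorted(hosts) for ip, hosts in groups.items()}

for ip, hosts in group_by_ip(observed).items():
    print(ip, "->", ", ".join(hosts))
```

Any IP that comes back with more than one hostname is one index answering under several names - which is exactly what www2, www3 and www-sj are doing here.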
Chris_D
I still think you cannot analyse the new algo until after the update has finished.
But what if "the update" never finishes again?
A very important goal in dealing with huge volumes of data is to find a way to avoid "batch" processes that require processing all the data to reach a new version. It's been common opinion that Google was aiming to abolish the monthly update in favor of a constantly-rolling update--this may in fact be required in order to grow further.
When you have figured out how to do it, how would you make that transition? You would keep a baseline index, and then you would start keeping an "update buffer" with the results of all crawls, directories, etc. (some minimum buffer size may be needed for practical updating). Then you would start applying the contents from the head of the "update buffer" to your baseline index: adding new sites, deleting others, and running the processes to calculate and propagate changes (links, PR, ...) to all sites, while you continue to crawl and refill the update buffer. Once you get this continuous update started, you never stop again.
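To make the shape of that concrete, here's a toy model of the baseline-plus-buffer idea. The names (`Index`, `apply_update`) and the update format are mine, not anything Google has described - it's just the skeleton of a continuous update: crawl results queue up, and the head of the queue is folded into the live index while crawling keeps appending to the tail:

```python
# Toy "baseline index + update buffer" model. Updates are tuples of
# (action, url, payload); real systems would batch these and also
# recompute PR/link data, which is omitted here.
from collections import deque

class Index:
    def __init__(self):
        self.pages = {}        # url -> page snapshot
        self.backlinks = {}    # url -> set of urls linking to it

    def apply_update(self, update):
        action, url, payload = update
        if action == "add":
            self.pages[url] = payload
        elif action == "delete":
            self.pages.pop(url, None)
        elif action == "link":
            self.backlinks.setdefault(url, set()).add(payload)

index = Index()
buffer = deque([                       # filled by crawlers, in arrival order
    ("add", "example.com/a", "page text"),
    ("link", "example.com/a", "example.com/b"),
    ("delete", "example.com/old", None),
])

# The continuous loop: drain from the head while crawlers keep
# appending to the tail. There is no "end of update" any more.
while buffer:
    index.apply_update(buffer.popleft())
```

The point of the sketch is that the index is never rebuilt wholesale - it only ever moves forward one buffered change at a time, which is consistent with what we seem to be watching on www2/www3.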
The process by which a new state of the index is copied to tens of thousands of machines in multiple datacentres is a separate question, and rather less interesting ("just operations"). Much of the speculation about different centres has failed to take into account both how flexible DNS resolution can be and how creatively "load balancing" can be used among a great number of servers at each centre, so the reports here are often faulty. The only interest, anyway, was to get an early sighting of "next month's" index. In a world of continuous updates, that interest will have gone away.
(Just a hypothesis, but it does explain why people observe that "this update is different in so many fundamental ways from any we have seen before!" It's not an update: it's the end of discrete updates.)
Chris_D - hats off to you, nicely done.
We're looking through the peephole of the Google construction site!
Since the update isn't complete, hopefully the 'widgeting' homepage will come back up - but who knows. The 'widgeting' page has a PR6 while the 'widget' one has a PR5, but we've seen that doesn't seem to mean much anymore.
Even a site titled "submit a site to widgeting" is now ahead of both sites. (groan)
I did notice that -sj is now showing up-to-date titles and cache. That's probably already been mentioned here.
As to the actual position - it's been leaping up and down.
The point is that I think Google has become more intelligent when it comes to related words.
Perhaps your site is now at 22 because of a surge in new widgeting sites? It's a very popular hobby...
There seems to be a bit of randomness to the serps. Maybe I should go study chaos theory or maybe even random-walk theory....Nah.
Got to say that, from my limited observations, if the SERPs we're seeing on www2 etc. are the new algo, then Google's taken a big backward step in the quality of its search results. (As a Google fan I want to be wrong here.)