Google SEO News and Discussion Forum

The "NEW Google" and How It Works
My musings...
caryl

10+ Year Member



 
Msg#: 3241770 posted 3:31 pm on Feb 3, 2007 (gmt 0)

Month after month we see new threads regarding the 'latest serp changes', penalties, and the general overall woes of webmasters trying to figure out - "what happened?".

One thing I have yet to read (maybe it is posted somewhere - but I've yet to find it) is an evaluation of just 'how Google works' since the introduction of their Big Daddy system.

Here are my musings...

History:

Years ago, Google performed what was referred to as the "Google Dance". The Google Dance happened approximately every 30 to 40 days, and it was only then that you could see what new pages had been added to Google's database and the resulting serp changes. At that time they had about 10 datacenters.

The next 'big' systematic change came around the time of the infamous "Florida Update". It was after that that Google started using what came to be referred to as 'Freshbot' and 'Deepbot', and Google moved away from their roughly monthly updates to what came to be known as "everflux". At this time, new pages added to your site would (sometimes) show up within hours of upload and linking. Google grew to 56 active datacenters at that time.

BUT - due to the unprecedented growth of the internet AND Google's inefficient system for purging their databases of 'dead - long gone - pages', their databases soon grew to an unmanageable size.

Current:

Google's solution to their bulging, out-of-control databases - the Big Daddy system of data handling!

Since the introduction of Big Daddy, month after month, websites (good websites) have been bouncing in and out of Google's serps AND sometimes even in and out of Google's index entirely.

I think it is about time that we consider just how this 'new system' works. Maybe then, we would have a better chance at understanding what is really happening to our sites.

Observations - since Big Daddy:

Supplemental database - usage of a Supplemental database on a level never seen before.
Google spiders - Google spidering has changed (been reduced); Google says it is relying more on caching of pages.
Data pushes - 'data refreshes', in the lingo used by Google employees.

Here are my speculations as to How the "New Google Works":

THE Google Index has now become 3 different indexes, OR a three-tiered index. There now is the Supplemental Index, the Secondary Index (where site:, link:, etc. searches come from), and the PRIMARY INDEX (where the serp results come from).

I think that it is THE WAY these indices are populated that is problematic for the Google search engine results.

I have come to believe that internal Google spiders crawl Google's Secondary Index, accumulating data for the 'next' data refresh. The Google spiders that you and I see in our server logs return their findings to either the Supplemental Index or the Secondary Index.

I speculate that these 'internal Google spiders' - Primary Index Bots (I'll refer to them as PIBots) - are either dropping data (pages) OR are not fast enough to accumulate ALL of the data from the Secondary Index in time for the next data refresh.

I speculate that each data refresh totally replaces the previous PRIMARY INDEX. Therefore, the resulting 'new' Primary Index is incomplete, and many websites/pages have inadvertently been OMITTED. This might explain why some webmasters report bobbing in and out and up and down in the Google serps.
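
To make the mechanics of that concrete, here is a toy sketch in Python. Everything in it - the three tiers, the 'PIBots', the wholesale replacement at each refresh, the crawl capacity figure - is purely my speculation from above, nothing Google has confirmed; it just shows how an internal crawl that can't keep up would make pages blink out of the serps even though they were never removed from the (hypothetical) Secondary Index.

    # Toy model of the speculation above - the tier names, the "PIBots" and the
    # crawl capacity are all assumptions, not confirmed Google behaviour.
    import random

    # The (hypothetical) Secondary Index: every URL Google knows about for a site.
    secondary_index = sorted(f"example.com/page{i}.html" for i in range(1, 101))

    def data_refresh(secondary, crawl_capacity):
        """Rebuild the Primary Index from scratch, from whatever the internal
        'PIBots' manage to pull out of the Secondary Index before the cutoff."""
        return set(random.sample(secondary, min(crawl_capacity, len(secondary))))

    primary_index = set()
    for refresh in range(1, 4):
        new_primary = data_refresh(secondary_index, crawl_capacity=80)
        dropped = primary_index - new_primary    # ranked last time, gone this time
        returned = new_primary - primary_index   # back in the serps this time
        print(f"refresh {refresh}: {len(new_primary)} pages in the Primary Index, "
              f"{len(dropped)} dropped, {len(returned)} (re)appeared")
        primary_index = new_primary

Run that a few times: on average about 16 of the 80 pages that made the previous Primary Index go missing at each refresh - and a different set each time - which looks a lot like the 'bobbing in and out' people describe.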

Spidering issues would also explain why some webmasters are seeing a correlation with sitemap SPIDERING. Perhaps it is the SPIDERING that is the issue - NOT the sitemaps - per se.

The major problem with this hypothesis is that IF this is indeed what is happening (an internal Google issue), it leaves us POWERLESS to change things on our end - and people can't deal with being powerless.

I am hoping that this will spark off some further discussion on this topic. Perhaps the 'needle in the haystack' we have been seeking (an answer to the un-stable Google) can be found in THIS haystack!

Caryl

PS - Yesterday I created a little tool to monitor my site for Google spidering and the IP addresses each spider originates from - perhaps I will find a 'clue' there - who knows...
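
(For anyone curious, the tool is nothing fancy - roughly along the lines of the sketch below, assuming a standard Apache 'combined' access log; the log filename is a placeholder. It just tallies fetches per IP for user-agents claiming to be Googlebot - a reverse-DNS check would still be needed to weed out fakes.)

    import re
    from collections import Counter

    # Apache "combined" log: ip ident user [time] "request" status bytes "referer" "agent"
    LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|HEAD) (\S+)[^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"')

    hits = Counter()
    with open("access.log") as log:        # placeholder path to your raw access log
        for raw in log:
            m = LINE.match(raw)
            if not m:
                continue
            ip, path, status, agent = m.groups()
            if "Googlebot" in agent:
                hits[ip] += 1

    for ip, count in hits.most_common(20):
        print(f"{ip:15}  {count} fetches")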

 

gehrlekrona

5+ Year Member



 
Msg#: 3241770 posted 8:45 pm on Feb 3, 2007 (gmt 0)

I think this is a very possible scenario of what is happening and what Google is doing "behind the scenes".
As a DBA and developer myself, I see this as a plausible way of indexing your databases and moving data between centers.
Google might have a "timing issue" with their bots and/or with the release of data from one index to the other.
If Google used our XML sitemap files to load data into their data centers, then they wouldn't have to crawl as many pages as they normally do and could concentrate on sites with no XML sitemaps.
Their talk about "Maybe we don't know about all your pages..." is just a bunch of you know what! Like I have said before, if they don't know about your pages, then there is something wrong with your site!
In my opinion there is no need for any sitemaps at all.

johnhh

5+ Year Member



 
Msg#: 3241770 posted 9:10 pm on Feb 3, 2007 (gmt 0)

You may be interested in this from a certain Mr Matt Cutts at WebmasterWorld PubCon Boston: [mattcutts.com...]

steveb

WebmasterWorld Senior Member steveb is a WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 3241770 posted 9:49 pm on Feb 3, 2007 (gmt 0)

Sadly, Matt's post just shows Google's delusional denial.

98% of webmasters are not sitting around worrying about Google's use of their bandwidth, and almost exactly zero with sites under 1000 pages care about this. Google, however, has responded to their OWN needs by attempting to crawl established URLs far less than previously, with disastrous effects on their index.

Five PR6 links get you crawled rarely. 5000 PR0 blog comment links get you crawled every day. Google spares itself bandwidth while deliberately shifting its crawl priorities from quality domains to spam domains, especially away from smaller (under 10,000 pages) niche domains.

On top of that, Google discards unique pages from its index for foolish reasons, strictly to save its own space. It's mind-boggling that it will only index 28 of 33 pages (all PR4) of a photo section on a 500-page domain, while gobbling up page after page of cms-generated, blog-comment-linked random text garbage. Those five photo pages apparently will make Google crash if they index them, so they take the deliberately anti-user position of not indexing something they *know* is unique content on a domain they generally respect, with moderate pagerank.

The "new google" sadly is based on a stupid decision that can't be justified except by Google's own inability first to index the web and then to discern poor quality content - since obviously, if they are incapable of storing everything, they should throw out the bottom 20%, not the bottom 15% plus a random 5% from the middle.

The "new google" is all about crawl paths, volume of links regardless of how crappy they are, even less valuation of niche authority, and most importantly for webmasters, a brave new world where Google's index is far weaker than previously -- which means a lot in terms of competing for rankings.

AndyA

5+ Year Member



 
Msg#: 3241770 posted 10:16 pm on Feb 3, 2007 (gmt 0)

You know, if Google really wanted to save bandwidth and preserve space, they would quit spidering bad/dead/old/incorrect URLs.

For instance, if a once active page is 301'd to a new page, after 90 days, or 6 months or so, it's probably safe to stop checking the old URL. This includes ALL URLs that have been changed from www to non-www, or vice versa. To just keep checking them forever seems to be a total waste, unless new links pop up from within the same site that show it's a good URL. (In case the webmaster decides to use that URL again in the future.)

Google should also add a feature in the Webmaster Tools that allows webmasters to tell Google when a URL is bad. For instance, URLs that can't be accessed, get a 404, or whatever, let the webmaster click a box to say "This is a bad URL, do not spider it again." And then don't.

This would free up a lot of bandwidth and save a lot of time - time that could be spent spidering active pages that don't get spidered as often as they should. I have someone linking to me with a typo, but I can't find the site. I've checked my own site, and it doesn't show up. Yahoo doesn't have it either. So Google, once a week or so, hits the "http://example.com/widdgets.html" page, which doesn't exist; "http://example.com/widgets.html" does exist. It would be nice to be able to tell Google to forget about the misspelled page.
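
In the meantime, one way to hunt down where a phantom URL like that comes from is to scan the raw access log for requests to the typo path and look at the Referer header on each hit. A rough sketch, assuming an Apache 'combined' log format; the log filename and the path are placeholders:

    import re

    TYPO_PATH = "/widdgets.html"           # the phantom path you keep seeing
    LOG_FILE = "access.log"                # placeholder path to your access log

    # Apache "combined" log: ip ident user [time] "request" status bytes "referer" "agent"
    LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|HEAD) (\S+)[^"]*" (\d{3}) \S+ "([^"]*)" "([^"]*)"')

    with open(LOG_FILE) as log:
        for raw in log:
            m = LINE.match(raw)
            if m and m.group(2).startswith(TYPO_PATH):
                ip, path, status, referer, agent = m.groups()
                print(f"{ip}  status={status}  referer={referer or '-'}")

Googlebot itself sends no Referer, but any human visitor who follows the bad link usually will - and that referer is the page hosting the typo.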

johnhh

5+ Year Member



 
Msg#: 3241770 posted 11:38 pm on Feb 3, 2007 (gmt 0)

Google .. has responded to their OWN needs

That was actually my point in pointing it out, and the new "Tools" reinforce it.

I do feel that the new "tools" and methods are really saying: "Tell us, as we can't be bothered to look; we need to speed all this up, and hey, if we miss a few sites or pages, well, never mind."

Going back to the topic, these are two new elements in what makes Google tick, and maybe their final outcome is yet to be seen.

Whether the world of webmasters likes it or not, this is what we have: "never mind the people outside doing the graft - look at the share price".

Reminds me of "never mind the quality, feel the width", if anyone remembers that catchphrase...

kevsh

5+ Year Member



 
Msg#: 3241770 posted 12:12 am on Feb 4, 2007 (gmt 0)

>>>>
You know, if Google really wanted to save bandwidth and preserve space, they would quit spidering bad/dead/old/incorrect URLs.
>>>

I agree completely with your thinking, and I would dearly love to be able to get my own pages out of their index in a more efficient manner. But it's clear Google wants to keep even outdated pages in their supplemental index for searchers.

Google got as big as it did - popularity and index-wise - because it was the best place to find even highly rare information. And in some cases, those outdated pages contain information that can't be found anywhere else.

tedster

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 3241770 posted 1:14 am on Feb 4, 2007 (gmt 0)

There's another Google fact we live with, and it causes needless upset for some webmasters. That's the secondary level of information Google tries to provide that is, speaking kindly now, often less than dependable. This includes things like:

1. Toolbar PR
2. Number of inbound links
3. Number of total results for a regular search
4. Number of total results for any "special operator" search - especially when it's over 1,000
5. Presence or absence of a "supplemental result" tag
6. Webmaster Tools feedback

Every one of these items has some usefulness. But Google's main purpose is providing primary search results to the end user, and not total precision in the above areas for webmasters or for those researching the competition. In fact, getting anywhere near precision in some of these areas is a monster of a technical challenge, given the way that Google "shards" their data.

So these kinds of data are there to help as a general guide, but they are not worth major obsession over every little bump and dip. Traffic matters the most, and then the rankings that you can see (which may not be what everyone else sees.)

I agree that the new "crawl caching proxy" that came in with Big Daddy feels a bit hinky at times. This is data, after all, and any database I've ever worked with ends up with bad data from time to time. I'm sure Google's does too.

caryl

10+ Year Member



 
Msg#: 3241770 posted 1:18 am on Feb 4, 2007 (gmt 0)

johnhh,

Thanks for that link!

I think that this "new spidering system" (crawl caching proxy) IS the main culprit in many a webmaster's woes.

From MC's blog (link above):
This crawl caching proxy was deployed with Bigdaddy, but it was working so smoothly that I didn't know it was live. :) That should tell you that this isn't some sort of webspam cloak-check; the goal here is to reduce crawl bandwidth.

Working Smoothly?

You need look no further than your nearest webmaster forum, and all the reports of sites bouncing in and out of the serps, to see that "Houston - we have a problem..."

Caryl

caryl

10+ Year Member



 
Msg#: 3241770 posted 1:24 am on Feb 4, 2007 (gmt 0)

There's another Google fact we live with, and it causes needless upset for some webmasters. That's the secondary level of information Google tries to provide that is, speaking kindly now, often less than dependable. This includes things like:

1. Toolbar PR
2. Number of inbound links
3. Number of total results for a regular search
4. Number of total results for any "special operator" search - especially when it's over 1,000
5. Presence or absence of a "supplemental result" tag
6. Webmaster Tools feedback

...

Tedster,

These are precisely the types of information I was thinking of as coming from - what I referred to as - the Secondary Index.

Caryl

AndyA

5+ Year Member



 
Msg#: 3241770 posted 1:42 am on Feb 4, 2007 (gmt 0)

kevsh wrote:
Google got as big as it did - popularity and index-wise - because it was the best place to find even highly rare information. And in some cases, those outdated pages contain information that can't be found anywhere else.

I agree. What I'm more concerned with are the pages that have never existed. You know the ones: they have strange URLs and you wonder how in the heck Google came up with them. In addition to the misspelled "widdget" page, there's always the fun "http://example.com/page1.html/maincontents.shtml" combo that combines parts of two different URLs. I get those a lot, and I know they didn't come from my site.

I think if a webmaster submits a sitemap to Google, they've verified their site, and Google knows who they are, then Google should index the URLs listed in that sitemap, and no others. If it finds some that it thinks belong, it can list them as an "Is this yours?" URL. Then you can say yea or nay. That would be an instant stop to the 302 redirects, the scrapers, and all the rest.

Google has some great ideas to help webmasters; they just aren't implemented properly, or they aren't thorough enough.

ALbino

10+ Year Member



 
Msg#: 3241770 posted 2:54 am on Feb 4, 2007 (gmt 0)

If Google keeps crawling a dead URL, is there any reason not to just use the Removal Tool? They won't spider it again. We use it on occasion, with no ill effects.

annej

WebmasterWorld Senior Member annej is a WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 3241770 posted 4:38 pm on Feb 4, 2007 (gmt 0)

So I'm beating my head against the wall trying to figure out why some of my pages are missing and why some came back and others didn't - and it could be just this. It leaves a person feeling kind of helpless.

ichthyous

10+ Year Member



 
Msg#: 3241770 posted 5:52 pm on Feb 4, 2007 (gmt 0)

You know, if Google really wanted to save bandwidth and preserve space, they would quit spidering bad/dead/old/incorrect URLs.

I agree with this. I 301'd pages in September that Google is still searching for. I recently decided to update my .htaccess, thinking that it was safe to drop some of the 301s, but then I was surprised to see that those pages were again reported as missing in Webmaster Tools.

Also, the 'New Google' makes it exceedingly hard to get new content indexed and get any kind of appreciable traffic, since anything that has low pagerank seems to get put into the supplemental results automatically. The only answer is to constantly be on the lookout for good quality links to every new section of content... and that's a full-time job in itself.

pageoneresults

WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 3241770 posted 5:58 pm on Feb 4, 2007 (gmt 0)

I 301'd pages in September that Google is still searching for. I recently decided to update my .htaccess, thinking that it was safe to drop some of the 301s, but then I was surprised to see that those pages were again reported as missing in Webmaster Tools.

Rarely should you remove 301s anytime soon. There are most likely links out there still pointing to the old URIs, and that is why Google is still searching for them.

g1smd

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 3241770 posted 5:38 pm on Feb 5, 2007 (gmt 0)

>> What I'm more concerned with are the pages that have never existed. <<

I find that Yahoo picks up many more of these than Google, and yes, I find it interesting to work out why these things get picked up in the first place. It often exposes deep problems within a website, not just incoming links with simple typos.

tedster

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 3241770 posted 5:44 pm on Feb 5, 2007 (gmt 0)

Google also seems to be using what we might call profiling - building statistical pictures of what is 'natural' and using that data to help in ranking. We had a good thread about this very recently:
[webmasterworld.com...]

SullySEO

5+ Year Member



 
Msg#: 3241770 posted 6:50 pm on Feb 6, 2007 (gmt 0)

Since the introduction of Big Daddy, month after month, websites (good websites) have been bouncing in and out of Google's serps AND sometimes even in and out of Google's index entirely.

I think it is about time that we consider just how this 'new system' works. Maybe then, we would have a better chance at understanding what is really happening to our sites.

Another less popular (and possibly feared) angle - it could simply be that the pages from (good websites) didn't score as well as others presented in a set of results, e.g., less activity for those sites; fewer clickthrus, bookmarking, return visits.

Google users will largely decide where your pages rank. And Google needs all that power to track their behavior and store personalized searches.

CainIV

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3241770 posted 8:20 pm on Feb 6, 2007 (gmt 0)

Another less popular (and possibly feared) angle - it could simply be that the pages from (good websites) didn't score as well as others presented in a set of results, e.g., less activity for those sites; fewer clickthrus, bookmarking, return visits.

Possibly, but this wouldn't explain why sites bounce in and out at the rate they do, as the stats you are referring to do not change drastically from day to day.

tedster

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 3241770 posted 9:39 pm on Feb 6, 2007 (gmt 0)

One more factor in the "new" Google - Human Editorial Evaluation - integrated into the algorithm [webmasterworld.com]. I think this kind of thing is playing a growing role.

g1smd

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 3241770 posted 12:42 pm on Feb 8, 2007 (gmt 0)

Hmm. I am seeing, for some sites that have had unique title tags and unique meta descriptions for quite a long time, that they now show "Results 1 to 2 of about 200", where the first result is normal and the second result is Supplemental. That is then followed by the usual "In order to show you the most relevant results, we have omitted some entries very similar to the 3 already displayed. If you like, you can repeat the search with the omitted results included." text.

So far, those have also been sites that drop the root index page from searches for "pages from the UK" at www.google.co.uk whereas the root page still shows up in "the web" searches.

Google has upped the filtering on "similar pages", especially on site:domain.com searches.

cabier

5+ Year Member



 
Msg#: 3241770 posted 12:58 pm on Feb 8, 2007 (gmt 0)

g1smd, same for me. My homepage was penalized yesterday, and out of 5500 pages (all with unique titles and meta tags) it shows only 8 results; the others are supplemental results. My website homepage was #1 for my target keyword; now it has suddenly disappeared from the Google index. Now my earnings depend only on my inside pages, since only inside pages show up in Google search :(

caryl

10+ Year Member



 
Msg#: 3241770 posted 2:54 pm on Feb 8, 2007 (gmt 0)

cabier,

Please check out this thread...

[webmasterworld.com...]

Caryl

cabier

5+ Year Member



 
Msg#: 3241770 posted 11:50 pm on Feb 8, 2007 (gmt 0)

Thank you caryl, my homepage is no longer penalized. After 24 hours, my homepage returned to the index in a "better" position.

claus

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3241770 posted 12:22 am on Feb 9, 2007 (gmt 0)

As for the funny URLs: they're tests to see how your site responds to non-existing URLs.

If you deliver a normal page (not a 404), you're most likely running a fully automated setup in which pages are constructed on the fly in response to the URL - generated, as it were.

As for Google saving space: that rumor is highly overrated... remember that we on the outside only see the tip of the Iceb0rg.

cabier

5+ Year Member



 
Msg#: 3241770 posted 1:16 am on Feb 9, 2007 (gmt 0)

claus: I was using 404 redirection with .htaccess (ErrorDocument 404 http://www.example.com). Is it better not to use it, as I understand so far? How about creating a 404.html and redirecting to this page? Or should I do nothing? (I removed my 404 redirection and submitted a sitemap after my homepage was penalized; 24 hours later my homepage came back. Before that, I hadn't submitted my website to Google Webmaster Tools. Does Google use this penalty to increase usage of Webmaster Tools? :) )

claus

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3241770 posted 2:28 am on Feb 9, 2007 (gmt 0)

>> 404 redirection with .htaccess (ErrorDocument 404 http://www.example.com)

You probably think about this like forwarding your phone. But it's not. You're creating a "worm hole":

Think about it: You visit your good friend, and you know the house quite well. You go to the kitchen to get some item, and as you open the kitchen door ... *slam* you're in a flower shop in Tokyo with no way to get back.

That's a 404 redirect.

I would use a document as the ErrorDocument - that is: YES to 404.html. You can still put links on your "404.html" document to the other domain.

A 404 *document* would be like leaving a note on the kitchen door saying "please don't use this door" and locking the door; that's the right way to do it.



Now back on topic, please. I was only being kind and answering a specific question; I don't want to take the discussion off track.

g1smd

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 3241770 posted 4:05 pm on Feb 9, 2007 (gmt 0)

A 404 response is not a redirect. The URL shown in the browser URL-bar should not change. It should simply be a response that generates an error code (404) for the bots, and a helpful screen of HTML to help the visitor make the right choice as to what to do next.

A "404 error" where you specify the domain name in the ErrorDocument URL returns a 302 response! That will cause a LOT of problems.

Make a custom Error Document with helpful links to other parts of your site, and put it somewhere like /errors/error.404.html on the same site as where it is used. Make sure that this page also contains a <meta name="robots" content="noindex"> tag.

NEVER use your real index page as an ErrorDocument. That confuses the bot as to what your error page really is, and might affect the indexing of your root index page.

NEVER put a domain name in an ErrorDocument directive. When the error occurs, the handler will not deliver the correct HTTP status code if you do that.
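
A quick way to sanity-check your own setup is to request a made-up URL on your site and look at the raw HTTP status code that comes back. A correctly configured ErrorDocument answers 404 directly; an ErrorDocument that points at a full URL (domain name included) typically shows up as a 302 to that URL instead. A minimal sketch, with the hostname and path as placeholders:

    import http.client

    conn = http.client.HTTPConnection("www.example.com")   # your own hostname here
    conn.request("GET", "/this-page-should-not-exist-12345.html")
    resp = conn.getresponse()

    print(resp.status, resp.reason)   # a correct setup returns 404 straight back
    if resp.status in (301, 302):
        # symptom of an ErrorDocument pointing at a full URL
        print("redirects to:", resp.getheader("Location"))
    conn.close()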

Just explaining that stuff in a different way.
Now back to your regularly scheduled programming...

MThiessen

10+ Year Member



 
Msg#: 3241770 posted 4:20 pm on Feb 9, 2007 (gmt 0)

Rarely should you remove 301s anytime soon. There are most likely links out there still pointing to the old URIs

I second this. I have a ten-year-old site that STILL has backlinks to pages that have not existed for years. Sometimes it's just not Google's fault; it's simply following links, business as usual.
