Forum Moderators: Robert Charlton & goodroi
Month after month we see new threads about the 'latest serp changes', penalties, and the general woes of webmasters trying to figure out: "what happened?".
One thing I have yet to read (maybe it is posted somewhere - but I've yet to find it) is an evaluation of just 'how Google works' since the introduction of their Big Daddy system.
Here are my musings...
History:
Years ago, Google performed what was referred to as the "Google Dance". The Google Dance happened approximately every 30 to 40 days, and it was then that you could see which new pages had been added to Google's database and the resulting serp changes. At that time they had about 10 datacenters.
The next 'big' systematic change came around the time of the infamous "Florida Update". It was after that that Google started using what came to be referred to as 'Freshbot' and 'Deepbot', and Google moved away from its roughly monthly updates to what came to be known as "everflux". At this point, new pages added to your site would (sometimes) show up within hours of upload and linking. Google grew to 56 active datacenters during that time.
BUT - due to the unprecedented growth of the internet AND Google's inefficient system for purging its databases of dead, long-gone pages, those databases soon grew to an unmanageable size.
Current:
Google's solution to its bulging, out-of-control databases - the Big Daddy system of data handling!
Since the introduction of Big Daddy, month after month, websites (good websites) have been bouncing in and out of Google's serps AND sometimes even in and out of Google's index entirely.
I think it is about time that we consider just how this 'new system' works. Maybe then, we would have a better chance at understanding what is really happening to our sites.
Observations - since Big Daddy:
Supplemental database - usage of a Supplemental database on a level never seen before.
Google spiders - Google spidering has changed (been reduced). Google has said it is relying more on caching of pages.
Data pushes - "data refreshes", in the lingo used by Google employees.
Here are my speculations as to How the "New Google Works":
THE Google Index has now become 3 different Indexes, OR a Three-Tiered Index. There now is the Supplemental Index, the Secondary Index (where site:, link:, etc. searches come from), and the PRIMARY INDEX (where the serps results come from).
I think that it is THE WAY these Indices are populated that is problematic for the Google Search Engine results.
I have come to believe that internal Google spiders crawl Google's Secondary Index, accumulating data for the 'next' data refresh. The Google spiders that you and I see in our server logs return their findings to either the Supplemental Index or the Secondary Index.
I speculate that these 'internal Google spiders' - Primary Index Bots (I'll refer to them as PIBots) either are dropping data (pages) OR are not fast enough to accumulate ALL of the data from the Secondary Index in time for the next data refresh.
I speculate that each data refresh totally replaces the previous PRIMARY INDEX. Therefore, the resulting 'new' Primary Index is incomplete, and many websites/pages have inadvertently been OMITTED. This might explain why some webmasters report bobbing in and out and up and down in the Google serps.
Spidering issues would also explain why some webmasters are seeing a correlation with sitemap SPIDERING. Perhaps it is the SPIDERING that is the issue - NOT the sitemaps - per se.
The major problem with this hypothesis is that IF this is what indeed is happening (an internal Google issue) it leaves us POWERLESS to change things on our end and people can't deal with being powerless.
I am hoping that this will spark some further discussion on this topic. Perhaps the 'needle in the haystack' we have been seeking (an answer to the unstable Google) can be found in THIS haystack!
Caryl
PS - Yesterday I created a little tool to monitor my site for Google spidering and the IP addresses each spider originates from - perhaps I will find a 'clue' there - who knows...
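For anyone who wants to build a similar monitoring tool, here is a minimal sketch in Python. It assumes an Apache "combined" format access log and a log file named access.log - both are assumptions, so adjust for your own setup. (It only checks the user-agent string, so a spoofed UA would slip through.)

```python
import os
import re
from collections import Counter

# Matches the start of an Apache "combined" log line:
# IP - - [timestamp] "GET /path HTTP/1.x" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|HEAD) (\S+)[^"]*" \d+ \S+ "[^"]*" "([^"]*)"'
)

def googlebot_hits(lines):
    """Return a Counter of (ip, path) pairs for requests whose UA claims to be Googlebot."""
    hits = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m and "Googlebot" in m.group(3):
            hits[(m.group(1), m.group(2))] += 1
    return hits

if __name__ == "__main__" and os.path.exists("access.log"):
    with open("access.log") as f:  # path is an assumption; point this at your own log
        for (ip, path), n in googlebot_hits(f).most_common(20):
            print(f"{n:5d}  {ip:15s}  {path}")
```

Cross-checking the IPs this surfaces against reverse DNS is one way to gather the kind of 'clues' mentioned above.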
98% of webmasters are not sitting around worrying about Google's use of their bandwidth, and almost exactly zero of those with sites under 1000 pages care about this. Google, however, has responded to its OWN needs by attempting to crawl established URLs far less than previously, with disastrous effect on its index.
Five PR6 links get you crawled rarely; 5000 PR0 blog comment links get you crawled every day. Google spares itself bandwidth while deliberately shifting its crawl priorities from quality domains to spam domains, especially smaller (under 10,000 pages) niche domains.
On top of that, Google discards unique pages from its index for foolish reasons, strictly to save its own space. It's mind-boggling that it will only index 28 of 33 pages (all PR4) of a photo section on a 500-page domain, while gobbling up page after page of cms-generated, blog-comment-linked random text garbage. Those five photo pages apparently will make Google crash if it indexes them, so it takes the deliberately anti-user position of not indexing something it *knows* is unique content on a domain it generally respects, with moderate pagerank.
The "new google" sadly is based on a stupid decision that can't be justified except in Google's own inability to first index the web, and then to discern poor quality content, since obviously if they are incapable of storing everything, they should through out the bottom 20% and not the botom 15% and randomly 5% from the middle.
The "new google" is all about crawl paths, volume of links regardless of how crappy, even less valuation of niche authority, and most importantly for webmasters, a brave new world where Google's index is far weaker than previously -- which means a lot in terms of competing for rankings.
For instance, if a once active page is 301'd to a new page, after 90 days, or 6 months or so, it's probably safe to stop checking the old URL. This includes ALL URLs that have been changed from www to non-www, or vice versa. To just keep checking them forever seems to be a total waste, unless new links pop up from within the same site that show it's a good URL. (In case the webmaster decides to use that URL again in the future.)
Google should also add a feature in the Webmaster Tools that allows webmasters to tell Google when a URL is bad. For instance, URLs that can't be accessed, get a 404, or whatever, let the webmaster click a box to say "This is a bad URL, do not spider it again." And then don't.
This would free up a lot of bandwidth and save a lot of time. Time that could be spent spidering active pages that don't get spidered as often as they should. I have someone linking to me with a typo, but I can't find the site. I've checked my own site, and it doesn't show up. Yahoo doesn't have it either. So, about once a week Google hits the "http://example.com/widdgets.html" page, which doesn't exist. "http://example.com/widgets.html" does exist. It would be nice to be able to tell Google to forget about the misspelled page.
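For what it's worth, a typo URL like that can at least be neutralized on your own end. Here is a minimal sketch, assuming an Apache server with mod_alias enabled and a .htaccess file; the paths are taken from the example above:

```apache
# .htaccess - permanently redirect the misspelled URL to the real page.
# Google should eventually replace the bad URL with the good one.
Redirect 301 /widdgets.html http://example.com/widgets.html
```

That doesn't stop the spidering, but at least the crawl lands on a real page instead of a 404.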
Google .. has responded to their OWN needs
That was actually my point in pointing it out, and the new "Tools" reinforce that.
I do feel that the new "tools" and methods are really saying "tell us, as we can't be bothered to look, as we need to speed all this up - and hey, if we miss a few sites or pages, well, never mind".
Going back to the topic these are two new elements in what makes Google tick and maybe the final outcome of these are yet to be seen.
Whether the world of webmasters likes it or not, this is what we have: "never mind the people outside doing the graft - look at the share price".
Reminds me of "never mind the quality, feel the width", if anyone remembers that catch phrase...
I agree completely with your thinking, and I would dearly love to be able to get my own pages out of their index in a more efficient manner. But it's clear Google wants to keep even outdated pages in their supplemental index for searchers.
Google got as big as it did - popularity and index-wise - because it was the best place to find even highly rare information. And in some cases, those outdated pages contain information that can't be found anywhere else.
There's another Google fact we live with, and it causes needless upset for some webmasters. That's the secondary level of information Google tries to provide that is, speaking kindly now, often less than dependable. This includes things like:
1. Toolbar PR
2. Number of inbound links
3. Number of total results for a regular search
4. Number of total results for any "special operator" search - especially when it's over 1,000
5. Presence or absence of a "supplemental result" tag
6. Webmaster Tools feedback
Every one of these items has some usefulness. But Google's main purpose is providing primary search results to the end user, and not total precision in the above areas for webmasters or for those researching the competition. In fact, getting anywhere near precision in some of these areas is a monster of a technical challenge, given the way that Google "shards" their data.
So these kind of data are there to help as a general guide, but they are not worth major obsession over every little bump and dip. Traffic matters the most, and then the rankings that you can see (which may not be what everyone else sees.)
I agree that sometimes it feels like the new "crawl caching proxy" that came in with Big Daddy gets a bit hinky. This is data, after all, and every database I've ever worked with ends up with bad data from time to time. I'm sure Google's does too.
Thanks for that link!
I think that this "new spidering system" (crawl caching proxy) IS the main culprit in many a webmaster's woes.
from MCs Blog (link above)
This crawl caching proxy was deployed with Bigdaddy, but it was working so smoothly that I didn’t know it was live. :) That should tell you that this isn’t some sort of webspam cloak-check; the goal here is to reduce crawl bandwidth.
Working Smoothly?
You need to look no further than your nearest webmaster forum to read of all the sites bouncing in and out of the serps to see that "Houston - We have a problem..."
Caryl
There's another Google fact we live with, and it causes needless upset for some webmasters. That's the secondary level of information Google tries to provide that is, speaking kindly now, often less than dependable. This includes things like:
1. Toolbar PR
2. Number of inbound links
3. Number of total results for a regular search
4. Number of total results for any "special operator" search - especially when it's over 1,000
5. Presence or absence of a "supplemental result" tag
6. Webmaster Tools feedback...
Tedster,
These are precisely the types of information I was thinking of as coming from - what I referred to as - the Secondary Index.
Caryl
Google got as big as it did - popularity and index-wise - because it was the best place to find even highly rare information. And in some cases, those outdated pages contain information that can't be found anywhere else.
I agree. What I'm more concerned with are the pages that have never existed. You know the ones - they have strange URLs and you wonder how in the heck Google came up with them. In addition to the misspelled "widdget" page, there's always the fun "http://example.com/page1.html/maincontents.shtml" combo that combines parts of two different URLs. I get those a lot, and I know they didn't come from my site.
I think if a webmaster submits a sitemap to Google, they've verified their site, and Google knows who they are, then Google should index the URLs listed in that sitemap, and no others. If it finds some that it thinks belong, it can list them as "Is this yours?" URLs. Then you can say yea or nay. That would be an instant stop to the 302 redirects, the scrapers, and all the others.
Google has some great ideas to help webmasters, they just aren't implemented properly, or they aren't thorough enough.
You know, if Google really wanted to save bandwidth and preserve space, they would quit spidering bad/dead/old/incorrect URLs.
I agree with this. I 301'd pages in September that Google is still searching for. I recently decided to update my htaccess, thinking that it was safe to drop some of the 301s, but then I was surprised to see that those pages were again reported as missing in Webmaster Tools.
Also, the 'New Google' makes it exceedingly hard to get new content indexed and get any kind of appreciable traffic, since anything that has a low pagerank seems to get put into the supplemental index automatically. The only answer is to constantly be on the lookout for good quality links to every new section of content... that's a full-time job in itself.
I 301'd pages in September that Google is still searching for. I recently decided to update my htaccess, thinking that it was safe to drop some of the 301s, but then I was surprised to see that those pages were again reported as missing in Webmaster Tools.
You should rarely remove 301s anytime soon. There are most likely links out there pointing to the old URIs, and that is why Google is still searching for them.
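If you're unsure whether an old 301 is still worth keeping, it's easy to spot-check what a bot actually sees when it requests the old URL. A minimal sketch in Python, using only the standard library (the URL in the docstring is illustrative):

```python
import http.client
from urllib.parse import urlsplit

def check_redirect(url):
    """Request `url` WITHOUT following redirects.

    Returns (status, location), e.g. (301, 'http://example.com/new.html').
    As long as old links out there still point at the URL, you want this
    to keep answering 301 with the correct Location.
    """
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(parts.hostname, parts.port or 80, timeout=10)
    try:
        conn.request("HEAD", parts.path or "/")
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()
```

If this stops returning a 301 (say, a 404 after trimming htaccess), the old pages start showing up as missing again, exactly as described above.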
I find that Yahoo picks up many more of these than Google, and yes, I find it interesting to work out why these things are picked up in the first place. It often exposes deep problems within a website, not just incoming links with simple typos.
Since the introduction of Big Daddy, month after month, websites (good websites) have been bouncing in and out of Google's serps AND sometimes even in and out of Google's index entirely. I think it is about time that we consider just how this 'new system' works. Maybe then, we would have a better chance at understanding what is really happening to our sites.
Another less popular (and possibly feared) angle - it could simply be that the pages from (good websites) didn't score as well as others presented in a set of results, e.g., less activity for those sites; fewer clickthrus, bookmarking, return visits.
Google users will largely decide where your pages rank. And Google needs all that power to track their behavior and store personalized searches.
Another less popular (and possibly feared) angle - it could simply be that the pages from (good websites) didn't score as well as others presented in a set of results, e.g., less activity for those sites; fewer clickthrus, bookmarking, return visits.
Possibly, but this wouldn't explain why sites bounce in and out at the rate they do, as the stats you are referring to are not changing drastically from day to day.
So far, those have also been sites that drop the root index page from searches for "pages from the UK" at www.google.co.uk whereas the root page still shows up in "the web" searches.
Google has upped the filtering on "similar pages", especially on site:domain.com searches.
If you deliver a normal page (not a 404), you're most likely running a fully automated setup in which pages are constructed - generated - on the fly in response to the URL.
As for Google saving space: that rumor is highly overrated... remember that we on the outside only see the tip of the iceberg.
You probably think about this like forwarding your phone. But it's not. You're creating a "worm hole":
Think about it: You visit your good friend, and you know the house quite well. You go to the kitchen to get some item, and as you open the kitchen door ... *slam* you're in a flower shop in Tokyo with no way to get back.
That's a 404 redirect.
I would use a Document as the ErrorDocument - that is: YES to 404.html. You can still put links on your "404.html" document to the other domain.
a 404 *document* would be like leaving a note on the kitchen door saying "please don't use this door" and locking the door; that's the right way to do it.
A "404 error" where you specify the domain name in the ErrorDocument URL returns a 302 response! That will cause a LOT of problems.
Make a custom Error Document with helpful links to other parts of your site, and put it somewhere like /errors/error.404.html on the same site as where it is used. Make sure that this page also contains a <meta name="robots" content="noindex"> tag.
NEVER use your real index page as an ErrorDocument. That confuses the bot as to what your error page really is, and might affect the indexing of your root index page.
NEVER put a domain name in an ErrorDocument directive. When the error occurs, the handler will not deliver the correct HTTP status code if you do that.
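Put concretely, assuming an Apache server (paths illustrative), the difference looks like this:

```apache
# Right: a local path on the same site. Apache serves the page
# itself and the true 404 status code stays intact.
ErrorDocument 404 /errors/error.404.html

# Wrong: a full URL (even your own domain). Apache answers with a
# 302 redirect to it instead, so the bot never sees the 404 at all.
# ErrorDocument 404 http://example.com/errors/error.404.html
```

And the /errors/error.404.html page itself would carry the noindex meta tag mentioned above.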
Just explaining that stuff in a different way.
Now back to your regularly scheduled programming...
Rarely do you remove 301s anytime soon. There are most likely links out there pointing to the old URIs
I second this. I have a ten-year-old site that STILL has backlinks to pages that have not existed for years. Sometimes it's just not Google's fault - it's simply following links, business as usual.