Forum Moderators: Robert Charlton & goodroi
I was reading the New York Times article Microsoft and Google Set to Wage Arms Race [nytimes.com], and a paragraph on page 2 caught my eye: it quotes Eric Schmidt (Google's CEO) admitting that they have problems storing more web site information because their "machines are full".
I am a webmaster who has had problems with getting / keeping my webpages indexed by Google. I follow Google's guidelines to the letter and I have not practiced any blackhat SEO techniques.
Here are some problems I have been having:
1. Established websites having 95%+ of their pages dropped from Google's index for no apparent reason.
2. New webpages published on established websites not being indexed (pages that were launched as long as 6-8 weeks ago).
3. New websites being launched and not showing up in the SERPs (for as long as 12 months).
We're all well aware that Google has algo problems handling simple directives such as 301 and 302 redirects, duplicate indexing of www and non-www webpages, canonical issues, etc.
Does anybody think that Google's "huge machine crisis" has anything to do with any of the problems I mentioned above?
[edited by: tedster at 5:03 pm (utc) on May 3, 2006]
[edit reason] fix side scroll potential [/edit]
But don't they need to index the duplicate content to know it's duplicate content?
Not really. In this instance, I'm referring to query string duplicate content. As g1smd pointed out in another topic, vBulletin presents 12 different URI scenarios to the spider that all lead to the same content. Googlebot has been digging deeper and deeper into queries over the past 18 months. Personally, I think it goes too deep into query strings, and that's where part of the problem lies.
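The kind of query-string duplication described above can be illustrated with a toy canonicalizer. This is only a hypothetical sketch, not anything Google has published: it drops session-style parameters (the names `s`, `sid`, and `styleid` are assumptions loosely based on vBulletin's defaults) and sorts the rest, so variant URLs collapse to one key.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Parameter names assumed for illustration (vBulletin-style session ids); real lists vary.
IGNORED_PARAMS = {"s", "sid", "styleid"}

def canonical_url(url: str) -> str:
    """Collapse query-string variants of the same page into one canonical form."""
    parts = urlsplit(url)
    # Drop session-style parameters and sort the rest for a stable key.
    params = sorted(
        (k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED_PARAMS
    )
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path,
                       urlencode(params), ""))

# Three spider-visible variants of the same thread collapse to one URL.
variants = [
    "http://forum.example.com/showthread.php?t=42&s=abc123",
    "http://forum.example.com/showthread.php?s=def456&t=42",
    "http://Forum.example.com/showthread.php?t=42",
]
canonical = {canonical_url(u) for u in variants}
print(canonical)  # {'http://forum.example.com/showthread.php?t=42'}
```

A spider that keyed its fetch queue on something like this would crawl the thread once instead of a dozen times.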
Q: Does this mean Google's SERPs quality will have to get worse soon?
Q: Are we talking about Google having to pull back on high-bandwidth products like Maps, Video, etc?
Q: Can we extrapolate from available data what G's infrastructure costs will be over the next 3 years if they continue business as usual?
Also, isn't there some role that P2P could play in saving G money on datacenter costs? If they developed a great G P2P client and distributed it, G's data serving costs would be distributed among its users. Does that idea have any merit?
Space needs multiply as you process the data: parsing operations, temp files, even disk space used for VM (swap) while processing. A crawl that is 1 TB would require roughly 2 TB to process, and of course you might want a backup, so it all gets a bit out of hand quickly if you do too much toying around with the data.
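As a rough back-of-the-envelope, the multiplication above can be written down; the 2x processing factor and the single backup copy are the poster's assumptions, not published figures:

```python
def working_storage_tb(crawl_tb: float,
                       processing_factor: float = 2.0,
                       backup_copies: int = 1) -> float:
    """Estimate total disk needed: processing scratch space plus backup copies."""
    return crawl_tb * processing_factor + crawl_tb * backup_copies

# A 1 TB crawl needs ~2 TB to process, plus 1 TB for one backup copy.
print(working_storage_tb(1.0))  # 3.0
```

So even under these mild assumptions, every terabyte crawled demands roughly three terabytes of disk somewhere.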
There is a well-known company out there that sells integrated data center modules: racks, cooling, power, and security, all in a pre-fab arrangement. This reduces the time to develop a datacenter from a year down to a few months.
Or if they need a new datacenter overnight, just park a few of these babies in the Google parking lot.
[apcc.com ]
It was us webmasters who got them to where they are today, and it will be us webmasters who influence people and determine the search engine of the future.
We as webmasters have been treated with contempt by Google, and unless they get their act together and start backing us up, we will desert in droves. That is a fact.
You don't bite the hand that feeds you!
[theregister.com...]
With that many people, and making that much money, you'd think they would have somebody at Google doing capacity planning.
Maybe their capacity plans didn't allow for a flood of multimillion-page, template-based sites from Webmaster World members. :-)
Because creating junk web pages is so cheap and easy to do, Google is engaged in an arms race with search engine optimizers. Each innovation designed to bring clarity to the web, such as tagging, is rapidly exploited by spammers or site owners wishing to harvest some classified advertising revenue.
Comments like the above from reporters surely don't help our industry's name much!
And lingering in the background is the question of whether the explosion of junk content - estimates put robot-generated spam at anywhere between one-fifth and one-third of the Google index - can be tamed.
1/5 to 1/3 of the Google index? That's a big chunk of luncheon meat.
Please tell me what you think in a reply to this post below on the thread. Really odd things are going on with one of our sites, despite huge publicity throughout the sector for us. It makes no sense that we would drop 4 pages in the results (30-40 spots) across the board!
I think something is wrong and will be worse soon for most of the big sites.
[webmasterworld.com...]
Seems as if the Herald Tribune had already commented on this interview by April 21st.
[iht.com...]
If you have a supplemental problem, just be happy you don't live in Lancashire, they had a "machine crisis" over there with their parking-ticket-machines;) So, what's worse?
I doubt that figure very much. I vaguely remember hearing that they used custom-built systems. If I were doing it I'd be using something like blade technology.
I've lost count of the number of times, during a 25-year career with mainframe manufacturers, that I've sat in discussions with user IT executives who had a self-written DBMS or major subsystem that had simply run out of scalability and left them high and dry. Originally they had turned down off-the-shelf DBMSes like DB2 and Oracle because they thought them too expensive and thought they could do a better job themselves. They were always wrong in the end.
pageonresults: So, Google has a storage problem eh? Could it be because they are out indexing every single thing they can get their hands on? Could it be that 30-40% of their index is duplicate content in one form or another? How about cleaning up the index first and then worry about increased storage. Forget about the BIG NUMBERS game for a while and focus on quality. That is what we've come to know Google for, the quality of the SERPs.
Agree with the first part, but where did "G$$gle known for quality" come from? 20% relevancy, tops, scraped content, paid ads on top of it.
hear, hear sukkah, store this... (presses a big red "Generate junk for Google" button repeatedly)...
Just love this: the content scraper, number one abuser of webmasters, finally running out of space to store all that scraped data. R.I.P. G$$gle :)
http://www.iht.com/articles/2006/04/21/business/GOOGLE.php
Only 6,790 employees can live off the work of, let's say, 100 million active web contributors out of 1 billion internet users. So to feed one Google employee, ~14,727 other people have to write some form of content. No wonder content isn't worth anything anymore.
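The ratio works out as stated; both input figures (the headcount and the 100 million active contributors) are the poster's estimates:

```python
employees = 6_790            # poster's figure for Google headcount
contributors = 100_000_000   # assumed active contributors out of ~1 billion users

ratio = contributors // employees
print(f"~{ratio:,} contributors per Google employee")  # ~14,727 contributors per Google employee
```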
It was during Bourbon that I finally understood that Google could goof, and that not all those sites that lost their rankings had done something wrong. I've never felt secure with Google since.
I have had a directory completely disappear that was an AdSense affiliate + Google search.
I have other sites and established message boards that have AdSense code that have lost hundreds of listings, are showing main page only and/or very old cache dates.
I have sites that do not have AdSense code and nothing to do with AdSense or Google at all which have also lost hundreds of listings.
According to a post on MC's blog, engineering dept. does not take into account a user's business relationship with Google. I don't know whether that's fact or not, but it seems to be so or else it's just further proof of an active meltdown.
.::DC:.
A few posts ago tedster gave at least six figures, and that seems very likely. The NYT talks about "estimations", so who, if not the people in here, should have the knowledge to estimate that? Maybe it's not a million yet, but given the growth of the internet, that is only a question of months, or perhaps a few years.
Whatever the exact figure may be, it should be clear that google (among others) is setting benchmarks in this respect. It has been mentioned before, that connecting such a huge mass of PCs in a reasonable way is far from easy.
I'm not an MBA, but something tells me there has to be more to it than just buying servers and calculating space. Infrastructure means DC maintenance and creation, buying the servers, modifying them, replacing them, maintaining them, bandwidth, etc. Plus, things go wrong, and eventually you reach a point where you don't get the same return on the investment (i.e. doubling the investment will not increase your "infrastructure" by 100%).
According to a post on MC's blog, engineering dept. does not take into account a user's business relationship with Google. I don't know whether that's fact or not, but it seems to be so or else it's just further proof of an active meltdown.
Which means the above-mentioned communication problems are in fact deliberate and planned. While obviously planned with the good intention of not having a corrupted search engine, I wonder about the trade-offs of this policy. :\
And don't forget about electricity. Each 1U server will cost a little over $300 a year in electricity alone. So:
1,000,000 servers = $25 million a month in electric bills.
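The arithmetic above checks out; the $300/year-per-server figure is the poster's estimate for one 1U box:

```python
cost_per_server_per_year = 300   # USD per 1U server, poster's estimate
servers = 1_000_000

monthly_bill = cost_per_server_per_year * servers / 12
print(f"${monthly_bill:,.0f} per month")  # $25,000,000 per month
```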
Infrastructure means DC maintenance and creation, buying the servers, modifying them, replacing them, maintaining them, bandwidth, etc.
And don't forget about electricity. Each 1U server will cost a little over $300 a year in electricity alone. So: 1,000,000 servers = $25 million a month in electric bills.
And ... shall we put all these new servers in a parking lot? Anyone have some spare room in their garage for these puppies? Infrastructure often actually requires "a structure"! Housing an additional 10,000 or so bits-o-hardware as well as the employees to maintain them adds somewhat more to the infrastructure requirements.
For a couple billion, I'll have my Mom come over and tell them to clean out their closets.
LOL! Good one ... and this suggestion likely has more than a small ring of truth to it!
Google has no storage problem. They have a conceptual problem. Massive amounts of their space are taken up by things that don't exist.
And most of the rest is taken up by things no one is interested in.
SERPs only ever return 1000 results. 99.99% of searchers are only interested in the first 20 of those (if that many), and Google keeps telling us that it has millions of pages of other crap that no one would be remotely interested in for that two-word term.
Surely they could have one (or a few) huge machine(s) full of all of the crap, and thousands of machines containing indexes of the good stuff that people actually search for and are interested in. Then, if someone searches for something and isn't happy with the top-of-the-barrel stuff, they could click a button and send the search off to scrape from the bottom of the big barrel.
Then, a bit like your local library who sell off books no one ever borrows, eventually Google could chuck out all of the stuff that no one has ever looked at via a search.
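Sid's two-tier idea could be sketched like this; everything here (the tier names, the opt-in fallback) is hypothetical, not how Google actually shards its index:

```python
class TieredIndex:
    """Serve queries from a small 'hot' index; consult the 'cold' bulk only on demand."""

    def __init__(self):
        self.hot = {}    # term -> top documents people actually look at
        self.cold = {}   # term -> the long tail no one ever clicks

    def search(self, term: str, include_cold: bool = False) -> list:
        results = list(self.hot.get(term, []))
        if include_cold:  # the "click a button to dig deeper" case
            results += self.cold.get(term, [])
        return results

idx = TieredIndex()
idx.hot["widgets"] = ["widgets-guide.html"]
idx.cold["widgets"] = ["junk-page-1.html", "junk-page-2.html"]

print(idx.search("widgets"))                          # ['widgets-guide.html']
print(len(idx.search("widgets", include_cold=True)))  # 3
```

The appeal is that the thousands of "hot" machines stay small and fast, while the crap sits on cheap, slow storage that most queries never touch.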
Best wishes
Sid
Is Google running out of space, so now they are dropping all the PR0 or low-PR pages? Or are they only indexing 2-3 levels deep on lower-PR pages?
I saw a post that a PR6 site was fine and increasing its index count, but PR5s and below are having problems.
I have a PR2 site with only 140 pages indexed, down from 35,000.
I have a PR4 site with only 120+ pages and every page is indexed fine.
For some reason, I have a site that has been banned from Google, but just today it suddenly has a PR4. It had a PR5 before it got banned or penalized, then went to PR0 until today. I still can't find the site in Google, but it suddenly has PR back?
Google has gone GaGa! HAHAHA!
Remember that we all saw a BigDaddy index with millions of pages indexed from our websites. That index ran on the infrastructure in Jan/Feb 2006. Why would the Googlers shut down so many machines that their core business, "searching and indexing websites", goes down to 10% of its power/capability?
So I think they hit a major bug or a huge problem merging some databases (the old one, aka "Supplemental #*$! 2004", and the new one, aka "BigDaddy Data").
Another idea could be that they are showing us a MiniGoogle (1-10% of the old index data) to keep enough idle server power for preparing the mega-super-spam-free BigDaddy index in the background.
Conclusion: sit down and wait, and don't waste time with DC-watchin'!
Greetings from sunny Germany,
Bonneville