|But don't they need to index the duplicate content to know it's duplicate content? |
Not really. In this instance, I'm referring to query string duplicate content. As g1smd pointed out in another topic, vBulletin presents 12 different URI scenarios to the spider that all lead to the same content. Googlebot has been digging deeper and deeper into queries over the past 18 months. Personally I think it goes too deep into query strings, and that's where part of the problem lies.
For those of us non-techies, could someone please summarize this entire discussion in terms that Joe Schmoe can understand?
Q: Does this mean Google's SERPs quality will have to get worse soon?
Q: Are we talking about Google having to pull back on high-bandwidth products like Maps, Video, etc?
Q: Can we extrapolate from available data what G's infrastructure costs will be over the next 3 years if they continue business as usual?
Also, isn't there some role that P2P could play in saving G money on datacenter costs? If they developed a great G P2P client and distributed it, G's data serving costs would be distributed among its users. Does that idea have any merit?
I doubt that it is as trivial as a crawling issue.
Space needs grow rapidly as you process the data: parsing operations, temp files, even disk space used for VM (swap) while processing. A crawl that is 1 TB would require roughly 2 TB to process, and of course you might want a backup, so it all gets a bit out of hand quickly if you do too much toying around with the data.
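The back-of-the-envelope reasoning above can be sketched out. The figures here are the post's own illustrative assumptions (1 TB crawl, roughly 2x overhead for processing, one backup copy), not Google's actual numbers:

```python
# Illustrative working-space estimate for processing a crawl.
# All figures are assumptions from the post, not real data.
crawl_tb = 1.0              # raw crawl size in TB (assumed)
processing_overhead = 2.0   # parsing, temp files, swap: roughly 2x
backup_factor = 2.0         # one full backup copy of the working set

working_tb = crawl_tb * processing_overhead   # 2.0 TB
total_tb = working_tb * backup_factor         # 4.0 TB

print(f"{crawl_tb} TB crawl -> {working_tb} TB working set "
      f"-> {total_tb} TB with backup")
```

So even under these modest assumptions, 1 TB of crawl data turns into 4 TB of disk before you have done anything clever with it.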
With that many people, and making that much money, you'd think they would have somebody at Google doing capacity planning.
There is a well-known company out there that sells integrated data center modules: racks, cooling, power and security, all in a pre-fab arrangement. This reduces the time to build a datacenter from a year down to a few months.
Or if they need a new datacenter overnight, just park a few of these babies in the Google parking lot.
The biggest worry for Google right now is not a lack of resources but the loss of trust and respect among webmasters.
It was us webmasters that got them to where they are today and it will be us webmasters that influence people and determine the search engine of the future.
We as webmasters have been treated with contempt by Google, and unless they get their act together and start backing us up, we will desert in droves. That is a fact.
You don't bite the hand that feeds you!
The Register is reporting on the problem and Search Engine Watch has FINALLY gotten around to a blog mention on the fiasco.
I did some calculations and I figured that this exact post will be the one that will fill up the last byte, of the last sector, of the last drive, on google's very la
|With that many people, and making that much money, you think, they would have somebody at Google, doing capacity planning. |
Maybe their capacity plans didn't allow for a flood of multimillion-page, template-based sites from Webmaster World members. :-)
|With creating junk web pages is so cheap and easy to do, Google is engaged in an arms race with search engine optimizers. Each innovation designed to bring clarity to the web, such as tagging, is rapidly exploited by spammers or site owners wishing to harvest some classified advertising revenue. |
Comments like the above from reporters surely don't help our industry's name much!
|And lingering in the background is the question of whether the explosion of junk content - estimates put robot-generated spam consists of anywhere between one-fifth and one-third of the Google index - can be tamed? |
1/5 to 1/3 of the Google index? That's a big chunk of luncheon meat.
Please tell me what you think as a reply to this post below on the thread. Really odd things going on with one of our sites with huge publicity throughout the sector for us. Makes no sense we would drop 4 pages in results (30-40 spots) all across the board!
I think something is wrong and will be worse soon for most of the big sites.
Seems as if the Herald Tribune had already commented on this interview by April 21st.
If you have a supplemental problem, just be happy you don't live in Lancashire, they had a "machine crisis" over there with their parking-ticket-machines;) So, what's worse?
> In figures: 1.000.000 PCs!
I doubt that figure very much. I vaguely remember hearing that they used custom-built systems. If I were doing it I'd be using something like blade technology.
I've lost count of the number of times, during a 25-year career with mainframe manufacturers, that I've sat in discussions with user IT executives who had a self-written DBMS or major subsystem that had simply run out of scalability and left them high and dry. Originally they had turned down off-the-shelf DBMSes like DB2 and Oracle because they thought them too expensive and thought they could do a better job themselves. They were always wrong in the end.
|pageonresults: So, Google has a storage problem eh? Could it be because they are out indexing every single thing they can get their hands on? Could it be that 30-40% of their index is duplicate content in one form or another? How about cleaning up the index first and then worry about increased storage. Forget about the BIG NUMBERS game for a while and focus on quality. That is what we've come to know Google for, the quality of the SERPs. |
Agree with the first part, but where did "G$$gle known for quality" come from? 20% relevancy, tops; scraped content; paid ads on top of it.
hear, hear sukkah, store this...(presses a big red "Generate junk for Google" button repeatedly)...
just love this. content scraper, number one abuser of webmasters finally running out of space to store all that scraped data. R.I.P. G$$gle :)
Only 6,790 employees get to live off the work of, let's say, 100 million active web contributors out of 1 billion internet users. So to feed one G employee, ~14,727 other people have to write some form of content. No wonder content isn't worth anything any more.
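Checking that ratio, using the post's own assumed figures (6,790 employees, 100 million active contributors):

```python
# The post's contributors-per-employee ratio, spelled out.
# Both figures are the post's assumptions, not verified data.
employees = 6_790
contributors = 100_000_000   # assumed active web contributors

ratio = contributors // employees   # integer part of the ratio
print(f"~{ratio:,} contributors per Google employee")
```

The integer part works out to 14,727, matching the post's figure.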
Given Google's reported current space constraints and the need to keep their revenue stream up, has anyone done research on the pages that are dropping out of the index, i.e., do they NOT contain AdSense, or is the site NOT signed up with AdWords, etc.? Seems to me they would retain those pages in the index.
Lorel, that would assume that what is happening is in Google's control. I wonder about that, with some sites having all but their homepage disappear and others having many times their total page count indexed.
It was during Bourbon that I finally understood that Google could goof, and that not all those sites that lost their rankings had done something wrong. I've never felt secure with Google since.
I have had a directory completely disappear that was an AdSense affiliate + Google search.
I have other sites and established message boards that have AdSense code that have lost hundreds of listings, are showing main page only and/or very old cache dates.
I have sites that do not have AdSense code and nothing to do with AdSense or Google at all which have also lost hundreds of listings.
According to a post on MC's blog, engineering dept. does not take into account a user's business relationship with Google. I don't know whether that's fact or not, but it seems to be so or else it's just further proof of an active meltdown.
> I doubt that figure very much.
A few posts ago tedster gave at least six figures, and that seems very, very likely. The NYT talks about "estimations", so who, if not the people in here, should have the knowledge to estimate that? Maybe it's not a million yet, but given the growth of the internet, this is only a question of months or perhaps a few years.
Whatever the exact figure may be, it should be clear that google (among others) is setting benchmarks in this respect. It has been mentioned before, that connecting such a huge mass of PCs in a reasonable way is far from easy.
>> If they use this to buy $10,000 servers(which would be a very bulky, name brand server - dual proc, couple gigs of ram, 6x300GB SCSI drives give or take) they can buy 150,000 servers. That's 225,000,000 Gigabytes of uncompressed information storage on servers that can handle thousands of requests per second EACH.
I'm not an MBA, but something tells me there has to be more to it than just buying servers and calculating space. Infrastructure means DC maintenance and creation, buying the servers, modifying them, replacing them, maintaining them, bandwidth, etc. Plus, things go wrong, and eventually you reach a point where you don't get the same return on the investment (i.e. doubling the investment will not increase your "infrastructure" by 100%).
|According to a post on MC's blog, engineering dept. does not take into account a user's business relationship with Google. I don't know whether that's fact or not, but it seems to be so or else it's just further proof of an active meltdown. |
Which means the above-mentioned communication problems are in fact deliberate and planned. While obviously planned with the good intention of not having a corrupted search engine, I wonder about the trade-offs of this policy. :\
>> Infrastructure means DC maintenance, creation, buying the servers, modifying them, replacing, maintaining, bandwidth etc. etc.
And don't forget about electricity. Each 1U server will cost a little over $300 a year in electricity alone. So:
1,000,000 servers = $25 million a month in electric bills.
Google will get huge discounts, though.
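The electricity arithmetic above checks out, assuming the round figure of $300 per 1U server per year (before any discounts):

```python
# The thread's electricity estimate, spelled out.
# $300/server/year is the post's assumed figure.
servers = 1_000_000
cost_per_server_per_year = 300   # USD, assumed

annual_cost = servers * cost_per_server_per_year   # $300,000,000/year
monthly_cost = annual_cost / 12                    # $25,000,000/month

print(f"${monthly_cost / 1e6:.0f} million a month in electricity")
```

So "a little over $300 a year" per server lands at exactly the $25 million a month the post quotes, with any extra over $300 pushing it higher still.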
Google has no storage problem. They have a conceptual problem. Massive amounts of their space are taken up by things that don't exist.
For a couple billion, I'll have my Mom come over and tell them to clean out their closets.
|Infrastructure means DC maintenance, creation, buying the servers, modifying them, replacing, maintaining, bandwidth etc. etc. |
|And don't forget about electricity. Each 1U server will cost a little over $300 a year in electricity alone. So: 1,000,000 servers = $25 million a month in electric bills. |
And ... shall we put all these new servers in a parking lot? Anyone have some spare room in their garage for these puppies? Infrastructure often actually requires "a structure"! Housing an additional 10,000 or so bits-o-hardware as well as the employees to maintain them adds somewhat more to the infrastructure requirements.
|For a couple billion, I'll have my Mom come over and tell them clean out their closets. |
LOL! Good one ... and this suggestion likely has more than a small ring of truth to it!
|Google has no storage problem. They have a conceptual problem. Massive amounts of their space are taken up by things that don't exist. |
And most of the rest is taken up by things no one is interested in.
SERPs only ever return 1000 results. 99.99% of searchers are only interested in the first 20 of those (if that many), and Google keeps telling us that it has millions of pages of other crap that no one would be remotely interested in for that two-word term.
Surely they could have one (or a few) huge machine(s) full of all of the crap, and thousands of machines containing indexes of the good stuff that people actually search for and are interested in. Then if someone searches for something and they are not happy with the top-of-the-barrel stuff, they could click a button and send the search off to scrape stuff from the bottom of the big barrel.
Then, a bit like your local library who sell off books no one ever borrows, eventually Google could chuck out all of the stuff that no one has ever looked at via a search.
So what is going on here?
Is Google running out of space, so now they are dropping all the PR0 or low-PR pages? Or are they only indexing 2-3 levels deep on lower-PR pages?
I saw a post that a PR6 site was fine and increasing its index count, but PR5s and below are having problems.
I have a PR2 site with only 140 pages indexed down from 35,000.
I have a PR4 site with only 120+ pages and every page is indexed fine.
For some reason, I have a site that was banned from Google, but just today it suddenly has a PR4? It had a PR5 before it got banned or penalized and then went to PR0 until today. I still can't find the site in Google, but it suddenly has PR back?
Google has gone GaGa! HAHAHA!
Crawl depth used to be a function of PR but now I would think it would be determined by what they describe as 'signals of quality'. We'll have to speculate what those are, as Google won't tell us any time soon.
Sorry, my mistake. A scraper. Here comes another Google six-month penalty for duplicate content.
"For a couple billion, I'll have my Mom come over and tell them clean out their closets."
Maybe your Mom can come over here and tell me how to recover the traffic I lost since Allegra update :-)
I don't think the "dropping sites fast" thing is a technical or "ran out of disk space" error.
Remember that we all saw a BigDaddy index with millions of pages indexed from our websites. This index ran on the infrastructure in Jan/Feb 2006. Why would the Googlers shut down so many machines that their core business, searching and indexing websites, drops to 10% of its power/capability?
So I think they hit a major bug or a huge problem merging some databases (the old one, aka "Supplemental #*$! 2004", and the new one, aka "BigDaddy Data").
Another idea could be that they are showing us a MiniGoogle (1-10% of the old index data) to keep enough idle capacity and server power for preparing the mega-super-spam-free BigDaddy index in the background.
Conclusion: sit down, wait, and don't waste time with DC-watchin'!
Greetings from sunny Germany,
I'm sorry, but what kind of publicly traded technology company is incapable of monitoring its storage needs and increasing capacity as needed over time? Why wait until the last minute and then tell everybody: sorry guys, we ran out of hard drive space? Don't they still have programmers working on the search algorithm who could see that coming?