I was reading the New York Times article Microsoft and Google Set to Wage Arms Race [nytimes.com], and a paragraph on page 2 caught my eye: it quotes Eric Schmidt (Google's CEO) admitting that Google has problems storing more web site information because their "machines are full".
I am a webmaster who has had problems getting and keeping my webpages indexed by Google. I follow Google's guidelines to the letter and have not practiced any black-hat SEO techniques.
Here are some problems I have been having:
1. Established websites having 95%+ of their pages dropped from Google's index for no reason.
2. New webpages published on established websites not being indexed (pages that were launched as long as 6-8 weeks ago).
3. New websites being launched and not showing up in the SERPs (for as long as 12 months).
We're all well aware that Google has algo problems handling simple directives such as 301 and 302 redirects, duplicate indexing of www and non-www webpages, canonical issues, etc.
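For what it's worth, the www vs. non-www issue is one we can at least guard against on our own end. Here is a minimal sketch (assuming a Python/WSGI-served site and a made-up canonical host of www.example.com, not anything from my actual setup) that 301-redirects every other hostname to the canonical one:

```python
# Minimal sketch: WSGI middleware that 301-redirects any non-canonical
# hostname (e.g. example.com) to the canonical www host, so crawlers
# only ever see one version of each URL. The host name is hypothetical.
def canonical_host(app, canonical="www.example.com"):
    def middleware(environ, start_response):
        host = environ.get("HTTP_HOST", "").split(":")[0]
        if host and host != canonical:
            location = "http://%s%s" % (canonical, environ.get("PATH_INFO", "/"))
            start_response("301 Moved Permanently", [("Location", location)])
            return [b"Moved Permanently"]
        return app(environ, start_response)
    return middleware
```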
Does anybody think that Google's "huge machine crisis" has anything to do with any of the problems I mentioned above?
35 trillion IP-enabled devices... potentially
That seems very high. This would be about 5,000 devices for every person on earth.
But a key question about content is how much of it is worthy of indexing? Seems to me the best search applications will be those that know what NOT to index in the first place rather than those that try to index everything and then sort it out later.
I'm sorry, but what kind of publicly traded technology company fails to monitor its storage needs and increase capacity as needed over time? Why wait until the last minute and then tell everybody: sorry guys, we ran out of hard drive space?
IMHO, it's usually a mistake to take hyperbolic remarks literally.
If the guy had said "We're getting killed by our electricity supplier," we'd probably see a thread here with the title: "Google staff electrocuted, bodies pile up at the plex." :-)
Example of an IPv6 address:
1080:0000:0000:0000:0000:0034:0000:417A
Here's the math:
2^128, or about 3.403 × 10^38 unique host interface addresses. That translates into 340,282,366,920,938,463,463,374,607,431,768,211,456 addresses.
That's a 39-digit number (roughly 340 undecillion).
IPv6 goes officially live in 2008 (though there is already an active IPv6 network in place, and all Linux distributions already support the protocol).
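The arithmetic above checks out; here is a quick Python sanity check (my own, not from the article):

```python
# Verify the IPv6 address-space figures quoted above: 2**128 addresses.
total = 2 ** 128
print(total)             # 340282366920938463463374607431768211456
print(len(str(total)))   # 39 digits
print(f"{total:.3e}")    # ~3.403e+38
```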
Matt Cutts did a post about the dropped pages today.
[mattcutts.com...]
He says all is fine! I think it's a lie.
Matt Cutts did a post about the dropped pages today.
He says all is fine! I think it's a lie.
I can't find that post. Can you identify it by timestamp?
maxD, last week when I checked there was a double-digit number of reports to the email address that GoogleGuy gave (bostonpubcon2006 [at] gmail.com with the subject line of “crawlpages”).
I asked someone to read through them in more detail and we looked at a few together. I feel comfortable saying that participation in Sitemaps is not causing this at all. One factor I saw was that several sites had a spam penalty and should consider doing a reinclusion request (I might do it through the webmaster console) but even that wasn’t a majority. There were a smattering of other reasons (one site appears to have changed its link structure to use more JavaScript), but I didn’t notice any definitive cause so far.
There will be cases where Bigdaddy has different crawl priorities, so that could partly account for things. But I was in a meeting on Wednesday with crawl/index folks, and I mentioned people giving us feedback about this. I pointed them to a file with domains that people had mentioned, and pointed them to the gmail account so that they could read the feedback in more detail.
So my (shorter) answer would be that if you’re in a potentially spammy area, you might consider doing a reinclusion request–that won’t hurt. In the mean time, I am asking someone to go through all the emails and check domains out. That person might be able to reply to all emails or just a sampling, but they are doing some replies, not only reading the feedback.
Sounds not so great
Given the amount of nastiness and bile that people like Matt Cutts and GoogleGuy have to take from unhappy Webmasters, it's a wonder they're willing to communicate at all.
As John Battelle, the editor of SearchBlog, stated:
"In the long run, it's about whether you have the best service."
Those of you pointing to recent shifts in the algo results as "evidence" of a storage problem would do well to remember that there is a shuffle every time algo changes are made, that we just experienced one, and that common sense tells us to wait until things stabilize at all of the data centers before freaking out.
Google does not disclose technical details, but estimates of the number of computer servers in its data centers range up to a million.
And that's before spending an additional $1.5 billion on more.
Has anyone EVER ... in the history of the Earth ... tried to manage a project like that? Anyone? No, you haven't, and all of your wisdom on that topic (predicting need, rolling out infrastructure, etc.) is pretty frail in the face of the simply staggering numbers that this enterprising company is dealing with.
YOU GO, GOOGLE! ROCK OUR WORLD!
Free search is free, and you get what you pay for; that's all. It's time for all of us to wake up, smell the coffee, and admit that Google has become a global advertising agency. Free / organic search is for collecting stats and demographics.
We are a bit crazed with Google, are we not? Is this normal?
Has anyone EVER ... in the history of the Earth ... tried to manage a project like that? Anyone? No, you haven't, and all of your wisdom on that topic (predicting need, rolling out infrastructure, etc.) is pretty frail in the face of the simply staggering numbers that this enterprising company is dealing with.
Well yes, they've assembled the largest pile of internet garbage to date, and yes, the numbers involved with that giant pile are truly staggering. But the thing about garbage is: even if you sort it carefully, and wash it before piling it, in the end it's still just garbage.
And EFV, I mostly agree with your view on all of this, but remember that Matt/GG/whoever are not posting those comments here and in blogs because they care about us webmasters; they're doing it as part of a PR operation for a giant, profit-minded company, and it's in their financial interest to do so.
For the last few days, on the "experimental" DC, a search for a phone number that was completely removed from the web years ago has occasionally returned zero results. That same search previously returned 900 vague supplemental results that didn't even match the query, instead of just a few dozen relevant supplemental results (for deleted pages and expired domains). Zero results is the correct outcome if Google has finally cleaned up the old supplementals.
Today, many DCs return zero results every time for this and several other similar queries for stuff that Google should have cleaned up long ago.
Now, is this a DC that has been cleaned of old Supplemental Results, or is it a DC where the Supplemental data is simply missing and Google is going to add it back in over the next few days?
Time will tell.
A large website whose domain expired two weeks ago had 12,000 pages listed, many of them supplemental, for the last few years. The root now shows a "domain expired" message, and all other pages are gone from the site.
Google reindexed the site and overnight the number of listed pages has been reduced to under 100 on the "experimental" DC. It seems like Google is aggressively throwing away old data, whereas before they would have held on to it for years and years...
On the old "normal" DCs, Google still shows 12,000 pages listed.
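For anyone who wants to compare the reported counts across datacenters themselves, here is a rough sketch. The DC IPs, the query URL, and the "of about N" results string are my assumptions about the 2006-era result pages, not anything Google documents, so treat it as illustrative only:

```python
# Rough sketch: compare the "site:" result count reported by individual
# Google datacenter IPs. The IPs below are placeholders, and the
# "of about N" text is assumed from the old result-page format.
import re
import urllib.request

def site_count(dc_ip, domain):
    url = "http://%s/search?q=site:%s" % (dc_ip, domain)
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    html = urllib.request.urlopen(req).read().decode("utf-8", "ignore")
    match = re.search(r"of about ([\d,]+)", html)
    return int(match.group(1).replace(",", "")) if match else 0

for ip in ["66.249.93.104", "64.233.179.104"]:  # placeholder DC IPs
    print(ip, site_count(ip, "example.com"))
```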
Maybe their capacity plans didn't allow for a flood of multimillion-page, template-based sites from Webmaster World members. :-)
I know you're kidding, but in all seriousness, can there be any doubt that the AdSense program has probably been THE major contributor to the development of millions of useless websites and pages that Google search then has to index?
Oops. Talk about the law of unintended consequences.
the AdSense program has probably been THE major contributor
SE spammers were already building massive throw-away sites well before AdSense. Any efficient monetization program would have had the same effect of increasing the quantity of those types of sites, regardless of whether it was Google's program or not.
Case in point: MFA sites are as much a problem for Yahoo and MSN as they are for Google, if not more so.
In the older "BigDaddy" datacentres, that page has failed to appear in the index for that search term (but still appears for the other search terms it has ranked for over the last 2 years).
I am guessing that those versions of the "BigDaddy" index are not being maintained, and will be phased out soon. Those "BigDaddy" datacentres usually return a higher number of pages, but are littered with ancient supplemental results.
In contrast, the "experimental" datacentres show a lower number of fully-indexed pages, but all of the supplemental results from before June 2005 are now gone. In their place, some sites now show supplemental results that are 2 to 10 months old instead. These supplemental results are for pages that no longer exist, or are ghosts of pages that still exist, preserving a previous version of those pages' content.