Back to the drawing board
"Without question google has lost a significant amount of information."
I don't see where Google lost any information.
Would you care to expand on that statement?
I looked over SJ a bit for a couple of sites I work on and found the page counts to come close to what my logs showed as being grabbed by freshbot.
There were no deepbot pages, but if Google does multiple partial updates, they can bring in the rest of the pages at a later time.
Cheers,
Depending on where they are in the process, they don't even need all of the page content.
It is possible for the PR calculations to be done without all of the page content. The same can be said about returning SERPs: on a serving box there's no need to keep all that HTML, XML, XHTML crud that would be required to render the cache.
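To illustrate why: PR is a function of the link graph alone. Here's a minimal, simplified PageRank sketch in Python; the toy graph and parameter values are made up for illustration, not anything we know about Google's internals.

```python
# Minimal PageRank sketch: the computation consumes only the link graph.
# No HTML, XML, or cached content is needed anywhere in the loop.

def pagerank(links, damping=0.85, iterations=20):
    """links: dict mapping each page ID to the list of page IDs it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue  # simplified: dangling pages just leak their share
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank

# Toy graph: three pages, nothing but links -- no page content stored.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(graph))
```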
Cheers,
Did the April deepcrawl do a better job than the March one? We'll see eventually.
-sj is just one more data point that something not-good has happened to Google's data the past two months. It may all correct itself and be fine and dandy when the update occurs in the next few days/hours, but at this point all we can see outside the Googleplex is an incomplete/not-good most recent deep crawl and the public showing of a very poor quality work in progress on -sj that was in no way ready for prime time.
It makes you go "hmmmm?", but beyond that we should wait for the update to actually update and draw some conclusions then.
I repeat myself... the update has not started. The past week we just got a glimpse into the internal workings of what makes up an update. Updates of such large magnitude are an incremental process. Knowing the amount of work that goes into our distributed databases, I know I am right here!
Not hardly. Google has 10,000 computers with 80 GB drives on each. I'll quote the rest:
[webmasterworld.com...]
We call it Everflux: it can act mysteriously at times.
Here's the short story on it:
Google is constantly crawling and updating selected pages that meet some predetermined criteria. That may involve last-modified dates and PR values. Google has many data centers and runs a distributed load-sharing system across more than 10k PCs running Linux with 80 gig drives, at last report. Somehow, the copy of the index must get transferred to all those hard drives in all those data centers. You ever transfer 80 gig across the net? And then distribute that 80 gig down onto thousands of hard drives?
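For scale, here's a rough back-of-envelope calculation. The link speeds are assumptions picked for illustration; we know nothing about Google's actual pipes, and this ignores the multiplier of repeating the copy out to thousands of machines.

```python
# Back-of-envelope: how long does 80 GB take to move over a single link?
# Link speeds below are assumptions, not Google's actual setup.

index_gb = 80
bits = index_gb * 8 * 1024**3  # size in bits (binary gigabytes)

for label, mbps in [("T1 (1.5 Mbps)", 1.5), ("100 Mbps", 100), ("1 Gbps", 1000)]:
    seconds = bits / (mbps * 1_000_000)
    print(f"{label}: {seconds / 3600:.1f} hours")

# Prints roughly 127 hours at T1 speed, ~1.9 hours at 100 Mbps,
# ~0.2 hours at 1 Gbps -- per copy, per link.
```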
All of that takes a great deal of time. It's a constant process for Google. More than likely, the daily updates copy out only those parts of the index that have actually changed. That's yet another point where new and old data could get mixed.
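For illustration, a hypothetical sketch of that kind of partial copy: only shards whose contents changed since the last sync get shipped. The shard layout and hash-based change detection are my assumptions, not anything known about Google's system.

```python
# Hypothetical delta sync: ship only the index shards that changed,
# instead of re-copying the whole 80 GB index every day.

import hashlib

def changed_shards(old_index, new_index):
    """Return shard IDs whose content hash differs between two index copies."""
    def digest(data):
        return hashlib.sha256(data).hexdigest()
    return [
        shard for shard, data in new_index.items()
        if shard not in old_index or digest(old_index[shard]) != digest(data)
    ]

old = {"shard-1": b"apple banana", "shard-2": b"cat dog"}
new = {"shard-1": b"apple banana", "shard-2": b"cat dog emu", "shard-3": b"fig"}
print(changed_shards(old, new))  # ['shard-2', 'shard-3'] -- only these get copied
```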
Load sharing works transparently. You do a search on Google and the request is routed via DNS magic to either the nearest data center or the nearest data center with the least load (we don't know their load-distribution criteria on that).
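A toy sketch of what that routing decision might look like. The data centers, latencies, and weighting below are entirely invented, since, as noted, we don't know Google's actual criteria.

```python
# Toy load-sharing decision: pick a data center by distance and load.
# All numbers and the scoring weight are made up for illustration.

datacenters = [
    {"name": "east", "distance_ms": 20,  "load": 0.9},
    {"name": "west", "distance_ms": 70,  "load": 0.3},
    {"name": "eu",   "distance_ms": 120, "load": 0.5},
]

def pick_datacenter(dcs):
    # Lower score wins: latency counts linearly, load is penalized heavily.
    return min(dcs, key=lambda dc: dc["distance_ms"] + 200 * dc["load"])

print(pick_datacenter(datacenters)["name"])  # -> "west" under these made-up numbers
```

Each data center may be serving a different copy of the index, which is exactly why two consecutive searches can disagree.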
Lastly, they could be working on the index, rolling indexes back, switching parts of the index, backing up parts of the index, rewriting some offending part of the index, deleting parts of an index - or a multitude of other actions or problems that only Google could know about.
Take those combinations of not knowing which box you are going to connect to and which index it may have, and the possibility of daily updating going on at the same time, and results may be unpredictable. There could be dozens of different indexes floating around various data centers - we have no clue.
One minute you'll get one copy of an index during a search, and the next you'll get another. Sometimes that could be yesterday's crawl, or last month's crawl, or a crawl from four months ago.
Losing data for Google would be like your credit card company losing the amount you owe; it ain't ever going to happen.
The March crawl seemed normal, but the page count that showed up on April 11, when the update kicked in, was 35,000 instead of the usual 50,000 to 60,000. What was more convincing is that my traffic has been cut in half since April 11.
Okay, so I look at www-sj. It showed 95,000 pages indexed a week ago. Yesterday it switched to 91,700 pages indexed. Sure, I'm excited. Even the slight dip didn't worry me. My backlinks are one-third what they should be, but so are everyone else's.
But now I decided to look deeper. Everything looks screwed up. The 90,000+ counts appear to be bogus. It's quite likely that the update will not be better for me at all.
In fact, I cannot explain the 35,000 number now. When I actually list the pages in one of the many directories, the current www situation looks like it beats the www2 situation easily. Not only are more pages showing up in www, but the ones that do show up don't have nearly as many URL-only links (such a link means that Google noticed the link, but didn't put the page in the index).
Am I confused? You betcha. Because the traffic still stinks.
Do I think Google engineers are confused? Yep, I think the whole thing is almost out of control out there. I don't know if it's a problem of bureaucratic non-communication, or if there are too many cooks in their gourmet kitchen, or if the Web has finally surpassed Google's ability to handle it all. But I'm losing a lot of confidence in anything I see on Google by way of numbers.
The only thing that counts is that little "percent of total traffic" number that comes from Google, by virtue of referrals in my logs. It's been flat-lined for the last 30 days. It's almost to the point where a Declaration of Independence is indicated. Until it creeps up and stays there, I'm going to stop obsessing over Google.
I have 2 sites with different URLs, hosted on the same IP; they are in completely different fields, not related whatsoever. Now both of them are missing on Google. It makes sense to me that Google crawls sites and collects (sorts) data according to IP address, and the data from my IP got lost, which caused both sites to vanish from sj.
If you say one of them is banned, then how do you explain the other one? And again, I don't spam!