Forum Moderators: Robert Charlton & goodroi
Much in the way that a RAID 5 storage array "stripes" data across multiple disks to gain protection from failures, GFS distributes files in fixed-size chunks which are replicated across a cluster of servers.... GFS is designed to be tolerant of that without losing (too much) data.
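The chunk-and-replicate idea above can be sketched in a few lines of Python. This is a toy illustration only, not GFS's actual placement logic: the 64 MB chunk size and 3-way replication match the published GFS defaults, but the hash-based server selection is an assumption made just to spread chunks deterministically for the example.

```python
import hashlib

CHUNK_SIZE = 64 * 1024 * 1024  # GFS used fixed 64 MB chunks
REPLICAS = 3                   # GFS's default replication factor

def place_chunks(file_size, servers, chunk_size=CHUNK_SIZE, replicas=REPLICAS):
    """Split a file into fixed-size chunks and pick `replicas` distinct
    servers for each chunk (hypothetical placement, for illustration)."""
    num_chunks = (file_size + chunk_size - 1) // chunk_size  # round up
    placement = {}
    for idx in range(num_chunks):
        # deterministic pseudo-random starting server for this chunk
        start = int(hashlib.md5(str(idx).encode()).hexdigest(), 16) % len(servers)
        placement[idx] = [servers[(start + r) % len(servers)] for r in range(replicas)]
    return placement

# A 200 MB file becomes 4 chunks, each stored on 3 of the 5 servers:
layout = place_chunks(200 * 1024 * 1024, ["s1", "s2", "s3", "s4", "s5"])
```

Because each chunk lives on several servers, losing one machine leaves at least two other copies of every chunk it held, which is the RAID-like failure tolerance the analogy is pointing at.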
...it may not re-run its calculations, and could therefore apply older copies of data against newer algorithm changes (even if no changes have occurred during the period when the temp drive dies).
To ensure that the data firehose is highly available, GFS trades off some other things - like consistency across replicas. GFS does enforce atomicity of writes - it will return an error if a write fails, for example, then roll the write back in metadata and promote a replica of the old data. But the master's lack of involvement in data writes means that as data gets written to the system, it doesn't immediately get replicated across the whole GFS cluster. The system follows what Google calls a "relaxed consistency model" out of the necessity of dealing with simultaneous access to data and the limits of the network.
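The all-or-nothing behavior described above - the client gets an error back if any part of the write fails, and retry/repair is left to the system - can be sketched roughly like this. The `Replica` class and `replicated_write` function are hypothetical stand-ins, not GFS's real write pipeline:

```python
class Replica:
    """Toy chunkserver: stores chunk data in a dict, can simulate failure."""
    def __init__(self, name, fail=False):
        self.name, self.fail, self.chunks = name, fail, {}

    def write(self, chunk_id, data):
        if self.fail:
            raise IOError(f"{self.name} unavailable")
        self.chunks[chunk_id] = data

def replicated_write(replicas, chunk_id, data):
    """Attempt the write on every replica. If any replica fails, the
    client simply sees an error; cleanup (rolling back metadata,
    promoting a good replica) is the master's job, not shown here."""
    for r in replicas:
        try:
            r.write(chunk_id, data)
        except IOError:
            return False  # client sees the failed write and may retry
    return True
```

The point of the sketch is the contract, not the mechanism: the client either hears "success" or "error", and replicas are allowed to disagree temporarily in between.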
This means that GFS is entirely okay with serving up stale data from an old replica if that's what's most available at the moment - so long as the data eventually gets updated. The master tracks changes, or "mutations," of data within chunks using version numbers to indicate when the changes happened. As some of the replicas get left behind (or grow "stale"), the GFS master makes sure those chunks aren't served up to clients until they're first brought up to date...
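The version-number check the master performs can be sketched as a simple comparison: any replica whose recorded version lags the master's is treated as stale and excluded from the list handed to clients. The data structures here are assumptions made for the example, not GFS's actual metadata layout:

```python
def fresh_replicas(master_version, replica_versions):
    """Split a chunk's replicas into fresh and stale sets by comparing
    each replica's version against the master's record for that chunk.
    `replica_versions` maps server name -> version seen on that server."""
    fresh, stale = [], []
    for server, version in replica_versions.items():
        (fresh if version == master_version else stale).append(server)
    # Clients are only told about `fresh`; `stale` replicas would be
    # garbage-collected or brought up to date before being served again.
    return sorted(fresh), sorted(stale)
```

A replica that missed a mutation (say it was down during a write) keeps its old version number, so it drops out of the fresh set automatically the next time the master compares notes.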