Google Snubs XML: Invents Its Own Data Exchange Format: Protocol Buffers - (deprecated) XML Development forum at WebmasterWorld - WebmasterWorld


Google Snubs XML: Invents Its Own Data Exchange Format: Protocol Buffers

XML is too complicated

httpwebwitch

3:08 pm on Jul 8, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Google thought of using XML as a lingua franca to send messages between its different servers. But XML can be complicated to work with and, more significantly, creates large files that can slow application performance.

source [news.cnet.com]

Google has invented this new data exchange format called Protocol Buffers [code.google.com], and it's definitely worth a look.

Why not use XML?
The Developer Guide [code.google.com] says:

Protocol Buffers

are simpler
are 3 to 10 times smaller
are 20 to 100 times faster
are less ambiguous
generate data access classes that are easier to use programmatically

An example, also from the Developer Guide:

let's say you want to model a person with a name and an email. In XML, you need to do:
<person>
<name>John Doe</name>
<email>jdoe@example.com</email>
</person>
while the corresponding protocol buffer message definition (in protocol buffer text format) is:
person {
name = "John Doe"
email = "jdoe@example.com"
}
In binary format, this message would probably be 28 bytes long and take around 100-200 nanoseconds to parse. The XML version is at least 69 bytes (if you remove whitespace) and would take around 5,000-10,000 nanoseconds to parse.
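The 28-byte and 69-byte figures are easy to check by hand. What follows is only a rough sketch, not Google's library: it hand-rolls the binary encoding in plain Python, and it assumes field numbers 1 for name and 2 for email (the quoted example gives none) and that both fields are strings. The helper names are made up for illustration.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Encode one string field: key, length, then the UTF-8 bytes."""
    payload = text.encode("utf-8")
    key = (field_number << 3) | 2  # wire type 2 = length-delimited
    return encode_varint(key) + encode_varint(len(payload)) + payload


# Assumed field numbers: name = 1, email = 2.
pb = encode_string_field(1, "John Doe") + encode_string_field(2, "jdoe@example.com")
xml = "<person><name>John Doe</name><email>jdoe@example.com</email></person>"

print(len(pb), len(xml.encode("utf-8")))
```

Each string field costs one key byte and one length byte on top of its text, so the two fields come to 10 + 18 = 28 bytes, against 69 bytes for the XML with whitespace removed.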

Does the syntax of PB look familiar?
yeah... it's like someone got inspired by JSON and C, then removed all the fluff.

Protocol Buffers even have their own DTD-like definition syntax, which you store in a ".proto" file. At first glance, it looks like an evolved DTD with newer syntax, but actually it's far less kludgy and far more interesting.

This is another strong contender in an arsenal of data exchange formats that already includes XML, CSV, JSON, etc.

cmarshall

3:46 pm on Jul 8, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Looks like it will end up nuking JSON, but XML is still pretty safe.

whoisgregg

4:03 pm on Jul 8, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

My first reaction was "Umm, this is JSON."

Then I wondered why Google would want to kill JSON. Still don't have an answer to that one... :/

httpwebwitch

4:03 pm on Jul 8, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Upon a little further reading, here's what Protocol Buffers offer:

- the message itself is a nicely compact definition of names and values
- the "proto" file defines what you expect to see in the message
- the proto is used to autogenerate parsing code in several languages (Java, C++, Python), with methods for get, set, et al.
- the autogenerated parser turns the message into a well-defined object or class
- after parsing, the data in the message is very easily accessed and manipulated

The Google developers don't assert that this is a "replacement" for XML. It's just a better data carrier in many situations. The things that XML claims to be good at are still accomplished best with XML:

- XML is better for describing a text-based document with markup, whereas PB doesn't easily interleave structure with text
- XML is human-readable, and human-editable, whereas PB is squished into binary
- XML is self-describing, whereas PB relies on a proto file to describe the content of a message

Protocol Buffers, to me, seem to accomplish for C++ and Java what JSON does for JavaScript, albeit in a necessarily more complex way. They encapsulate data in a tight, slim format that is easily parsed by the recipient language.

Bravo, Google!

IMHO, the name is horrid... What the heck does "Protocol Buffer" mean? I had to look it up, and even then I wasn't satisfied with the etymology. I guess the name, like PB itself, is not "self-describing" ;)

httpwebwitch

4:13 pm on Jul 8, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

PB isn't going to nuke JSON. JSON evaluates natively in JavaScript; it's as fluid as a data carrier can get. IMHO, JSON is firmly secure as the lingua franca of AJAX.

PB can offer the same kind of data fluidity to Java, C++, and Python. Maybe a PHP5 parser will emerge next! That would be a thrill.

g1smd

4:28 pm on Jul 8, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

How do you nest values in this new schema?

Can you have a new {} pair inside a "quoted" value?

In XML, you have nested <tag></tag> pairs, which you can extend almost infinitely.

In the new schema, you have { } pairings at one level and name = "data" structures at another (lower) level.
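On the nesting question: in the binary format a nested message is carried as a length-delimited field whose payload is itself an encoded message, so nesting can go as deep as needed. Here is a minimal sketch in plain Python, assuming a hypothetical Person with name = 1 and home = 2, where home is a hypothetical Address message with city = 1. None of these names or numbers come from Google's docs.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_bytes_field(field_number: int, payload: bytes) -> bytes:
    """Encode a length-delimited field (used for strings and nested messages)."""
    key = (field_number << 3) | 2  # wire type 2 = length-delimited
    return encode_varint(key) + encode_varint(len(payload)) + payload


# Inner message first: Address { city = 1 } (hypothetical).
address = encode_bytes_field(1, b"Springfield")

# Outer message: Person { name = 1, home = 2 } (hypothetical).
# The encoded Address is simply the payload of field 2.
person = encode_bytes_field(1, b"John Doe") + encode_bytes_field(2, address)

print(person.hex(" "))
```

The inner message needs no closing delimiter: the length prefix tells the reader where it ends.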

Commerce

4:30 pm on Jul 8, 2008 (gmt 0)

10+ Year Member

Actually, I think Google is right about XML, though I am not certain their answer is right.

I've been dealing with data records for many years and I think that there are far better ways to describe data and present variable or fixed type records in ways that the receiver can then use them than either XML or PB.

-Commerce

mikedee

5:29 pm on Jul 8, 2008 (gmt 0)

10+ Year Member

I think this sounds like a good idea for data interchange. At the end of the day, server 1 has an object in memory that it wants to transfer to an in-memory object on server 2. Why should it have to be piped through XML?

At the moment all PHBs have been sold the delusion that XML is the solution to all their worries. In most cases it is terrible because most language objects and XML do not map onto each other at all (for example, attributes vs. child nodes). Most of my XML parsing code ends up just squeezing some XML into native objects to pass to the next part of my script.

There should be a PHP parser since that is the most widely used language. I hope that this gets some hold, but I am very doubtful. The lack of popular parsers is a hindrance, plus it has a stupid name. Protocol Buffers sounds like it is to do with protocol handlers. They should have called it Hyper XML 2.0 if they want traction in the enterprise.

cmarshall

5:41 pm on Jul 8, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

They should have called it Hyper XML 2.0 if they want traction in the enterprise.

Kinda like JAVAscript?

Sorry. Couldn't resist...

Actually, I have to pick at one thing Yon Honorable Witch of The Web stated:

JSON evaluates natively in Javascript

Running a JSON object through an eval() is not what I call "native." They should have built in a more straightforward way of resolving it by now.

To wit:

var myGodThisIsSimple = someJSONParameter;

johnnie

7:06 pm on Jul 8, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

LOL! Somehow I'm feeling nostalgia towards:

typedef struct person {
char *name;
char *email;
} individual;

httpwebwitch

8:08 pm on Jul 8, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

johnnie, what is that nonsense you just typed? It looks like gibberish; it can't possibly be a real language. ;)

henry0

9:07 pm on Jul 8, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Maybe a PHP5 parser will emerge next! That would be a thrill.

Agreed. Coming from a PHP background, I feel I can learn this new one pretty swiftly.

Could you call Mr. G and ask for that PHP parser? :)

m0thman

1:07 am on Jul 9, 2008 (gmt 0)

10+ Year Member

And what, pray tell, was wrong with:

000100 03 PERSON.
000110 05 NAME PIC X(60) VALUE SPACES.
000120 05 EMAIL PIC X(100) VALUE SPACES.

Hester

10:07 am on Jul 9, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

How about improving on G's example to use this instead?

person { 
name:John Doe 
email:jdoe@example.com 
}

Commerce: "I've been dealing with data records for many years and I think that there are far better ways to describe data and present variable or fixed type records in ways that the receiver can then use them than either XML or PB."

Can you give an example?

henry0

11:01 am on Jul 9, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

person {
name:John Doe
email:jdoe@example.com
}

Don't you need a stop somewhere?
Won't this read as name:John Doeemail:etc...?
So why not:
person {
name:John Doe;
email:jdoe@example.com;
}

g1smd

11:57 am on Jul 9, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

I assume that [CR][LF] is it, but some systems might send just [CR] or just [LF].

How is that catered for?

mikedee

12:09 pm on Jul 9, 2008 (gmt 0)

10+ Year Member

I think there is some confusion here...

person {
name:John Doe
email:jdoe@example.com
}

This is not a valid .proto file because it actually has data. The proto file is supposed to define the class not actually supply data.

The original example should be

XML:

<person><name>Joe Blogs</name></person>

PB

0A 3C

It's a binary format, so none of it is readable except the proto files. The original post is confusing because the PB example is not correct. An actual example of a .proto file would look like this.

message Person {
required string name = 1;
required int32 id = 2;
optional string email = 3;
}

1, 2 and 3 are not data values; they are field tags that the parser uses to identify each attribute in the binary message.
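To make the role of those numbers concrete, here is a rough decoder sketch in plain Python. It is not the generated parser, just a hand-written illustration. The sample bytes are an assumed encoding of a Person with name "Joe Blogs" and id 42, and the sketch only handles the two wire types this message needs (varint and length-delimited).

```python
def read_varint(data: bytes, pos: int):
    """Read a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of the varint
            return result, pos
        shift += 7


def decode_message(data: bytes) -> dict:
    """Return {field_number: raw_value} for varint and length-delimited fields."""
    fields = {}
    pos = 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:  # varint (int32, bool, ...)
            value, pos = read_varint(data, pos)
        elif wire_type == 2:  # length-delimited (string, bytes, nested message)
            length, pos = read_varint(data, pos)
            value = data[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        fields[field_number] = value
    return fields


# Assumed sample: name (tag 1) = "Joe Blogs", id (tag 2) = 42.
wire = bytes.fromhex("0a 09 4a 6f 65 20 42 6c 6f 67 73 10 2a")
fields = decode_message(wire)

# Only the .proto tells us what the numbers mean.
names = {1: "name", 2: "id", 3: "email"}
print({names[number]: value for number, value in fields.items()})
```

Note that the field names never appear on the wire. Only the tag numbers do, which is why the receiver needs the .proto file to make sense of the message.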

httpwebwitch

1:23 pm on Jul 9, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Exactly right, mikedee. The PB example could be confusing because it shows the data rendered as text. The real PB data is not human-readable (at least not by the average human).

jezzer300

3:24 pm on Jul 9, 2008 (gmt 0)

10+ Year Member

I've worked with lots of big XML projects and it seems so ambiguous, and the messages can get so big with all these long meaningful names. Then you have to decode it.

For a human readable format PB looks good to me. Less memory, traffic and less decoding.

cmarshall

3:39 pm on Jul 9, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

For a human readable format PB looks good to me.

I don't think it's human readable. I think it's a compiled format, like X.409 (Sorry, I looked for an authoritative link, but the industry has worked so hard to bury the nightmare that was X.409 that it looks like it's time to repeat it).

MatthewHSE

4:49 pm on Jul 9, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

I don't know XML that well, but the protocol buffer example looks a lot simpler and more intuitive. Reminds me of CSS. ;)

aleksl

7:18 pm on Jul 9, 2008 (gmt 0)

I guess "invents" in the title should be in quotes, or meant as sarcasm. I bet there are a few people out there who will claim copyright infringement on this one.

Hester

7:48 pm on Jul 9, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

I assume that [CR][LF] is it, but some systems might send just [CR] or just [LF]. How is that catered for?

I assumed line endings, but the example using a semi-colon would do better.

The real PB data is not human-readable.

So how is that better than XML in terms of legibility? (Think of future generations trying to understand PB code.)

cmarshall

8:07 pm on Jul 9, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

For Google, it's all about brevity. Their home page probably gets a billion hits a day.

Every single byte that goes on their page needs to be championed. It is probably the most valuable real estate in the world, making the Emperor's Palace in Tokyo look like a double-wide at Lakeside Trailer Park.

I can definitely understand why they don't want to use XML.

However, I am not Google, and my property is more like a lean-to outside the Manila Dump. There is no valid reason for me to publish an SDK like PB. XML will do just fine for me.

henry0

10:12 pm on Jul 9, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

I don't know XML that well, but the protocol buffer example looks a lot simpler and more intuitive. Reminds me of CSS. ;)

That was my point: could it be easier on XML-"challenged" coders?

mikedee

11:39 am on Jul 10, 2008 (gmt 0)

10+ Year Member

If you think this is easier, then I don't think you really understand it. XML is just an overused data description format; this is a code generator (from the .proto files) and a binary data serialization format.

I think the CSS-like example and the comparison to XML are fairly confusing. It is more like CORBA or SOAP combined with PHP's serialize or Python's pickle, which can be read by any supported language. The fact that this serialized format can be transferred over the wire is incidental; the binary output could just as well be saved to disk.

It has nothing to do with being easier than XML; it's more about using the right tool for the job and cutting down on bloat.

Hester

8:27 am on Jul 14, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Mark Pilgrim has an interesting post about PB on his site.

Protocol buffers are "just" cross-platform data structures. All you have to write is the schema (a .proto file), then generate bindings in C++, Java, or Python. (Or Haskell. Or Perl.) The .proto file is just a schema; it doesn't contain any data except default values. All getting and setting is done in code. The serialized over-the-wire format is designed to minimize network traffic, and deserialization (especially in C++) is designed to maximize performance.

Besides being blindingly fast, protocol buffers have lots of neat features. A zero-size PB returns default values. You can nest PBs inside each other. And most importantly, PBs are both backward and forward compatible, which means you can upgrade servers gradually and they can still talk to each other in the interim. (When you have as many machines as Google has, it's always the interim somewhere.)

Comparisons to other data formats was, I suppose, inevitable. Old-timers may remember ASN.1 or IIOP. Kids these days seem to compare everything to XML or JSON. They're actually closer to Facebook's Thrift (written by ex-Googlers) or SQL Server's TDS. Protocol buffers won't kill XML (no matter how much you wish they would), nor will they replace JSON, ASN.1, or carrier pigeon. But they're simple and they're fast and they scale like crazy, and that's the way Google likes it.

source [diveintomark.org]
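The "zero-size PB returns default values" and backward/forward compatibility points can be illustrated with a toy reader. This is only a sketch under stated assumptions, not how the generated classes are implemented: the schema dictionary, the defaults, and the extra field 4 are all hypothetical, and every length-delimited field is treated as a UTF-8 string.

```python
def read_varint(data: bytes, pos: int):
    """Read a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of the varint
            return result, pos
        shift += 7


# What an "old" server knows: tag number -> (name, default). Hypothetical.
OLD_SCHEMA = {1: ("name", ""), 2: ("id", 0), 3: ("email", "")}


def parse(data: bytes, schema: dict) -> dict:
    """Start from defaults, fill in known fields, skip unknown tag numbers."""
    record = {name: default for name, default in schema.values()}
    pos = 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:  # varint
            value, pos = read_varint(data, pos)
        elif wire_type == 2:  # length-delimited, assumed to be a string here
            length, pos = read_varint(data, pos)
            value = data[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if field_number in schema:
            record[schema[field_number][0]] = value
        # Unknown tag numbers fall through: the bytes were consumed but ignored.
    return record


# A "new" server sends name = "Joe", id = 7, plus a field 4 the old schema lacks.
newer = bytes.fromhex("0a 03 4a 6f 65 10 07 22 03 35 35 35")

print(parse(newer, OLD_SCHEMA))  # field 4 is skipped, email keeps its default
print(parse(b"", OLD_SCHEMA))    # zero-size message: defaults only
```

Because every field carries its own tag and length, a reader can step over fields it has never heard of, and a missing field just leaves the default in place. That is the property that lets old and new servers keep talking during a gradual upgrade.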


cmarshall

10:36 am on Jul 14, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Ha-ha. No mention of X.409.

This one actually has a chance, as it is backed by a corporation (which always seems to help standards become standards). However, this is pretty much exactly like X.409.

X.409 was developed for exactly the same reason, but was about messaging only, and didn't have some of the nice bells and whistles Google has added.

I doubt many of the folks on this board will be directly using PB, as it is really a server-level protocol that actually sits below the level of most implementations.

However, if they add a PB compiler/decompiler (not parser) to PHP (it would need to be a PEAR or PECL extension, for backwards compatibility), then it would start being a lot more accessible.

PHP has a problem, in that C++ programmers hate it. They view it like VB. They have a point, but PHP delivers on the empty promises of VB, and needs to be taken very, very seriously. PHP5.1 is finally becoming a usable OO language (but is still kinda Playskool, compared to C++).