Google thought of using XML as a lingua franca to send messages between its different servers. But XML can be complicated to work with and, more significantly, creates large files that can slow application performance.
source [news.cnet.com]
Google has invented this new data exchange format called Protocol Buffers [code.google.com], and it's definitely worth a look.
Why not use XML?
The Developer Guide [code.google.com] says:
Protocol Buffers
An example, also from the Developer Guide:
let's say you want to model a person with a name and an email. In XML, you need to do:
<person>
  <name>John Doe</name>
  <email>jdoe@example.com</email>
</person>
while the corresponding protocol buffer message definition (in protocol buffer text format) is:
person {
  name = "John Doe"
  email = "jdoe@example.com"
}
In binary format, this message would probably be 28 bytes long and take around 100-200 nanoseconds to parse. The XML version is at least 69 bytes (if you remove whitespace) and would take around 5,000-10,000 nanoseconds to parse.
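For anyone who wants to check the 28 vs. 69 byte figures, here is a rough Python sketch that hand-encodes the message instead of using Google's library. It assumes field numbers 1 for name and 3 for email (the quote doesn't state them, so that's a guess) and that both are plain length-delimited strings. The helper names are made up for illustration.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Encode one length-delimited (wire type 2) string field."""
    payload = text.encode("utf-8")
    tag = (field_number << 3) | 2
    return encode_varint(tag) + encode_varint(len(payload)) + payload


# Assumed field numbers: name = 1, email = 3.
pb_bytes = (
    encode_string_field(1, "John Doe")
    + encode_string_field(3, "jdoe@example.com")
)

# The XML version with all whitespace between tags removed.
xml_bytes = (
    b"<person><name>John Doe</name>"
    b"<email>jdoe@example.com</email></person>"
)

print(len(pb_bytes))   # 28
print(len(xml_bytes))  # 69
```

Each string costs one tag byte plus one length byte on top of its text, which is where the 28 comes from (10 bytes for the name, 18 for the email). The parse-time numbers in the quote can't be checked this way.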
Does the syntax of PB look familiar?
yeah... it's like someone got inspired by JSON and C, then removed all the fluff.
Protocol Buffers even have their own DTD-like definition syntax, which you store in a ".proto" file. At first glance, it looks like an evolved DTD with newer syntax, but actually it's far less kludgy and far more interesting.
This is another strong contender in an arsenal of data exchange formats that already includes XML, CSV, JSON, etc.
- the message itself is a nicely compact definition of names and values
- the "proto" file defines what you expect to see in the message
- the proto is used to autogenerate parsing code in several languages (Java, C++, Python), with methods for get, set, et al.
- the autogenerated parser turns the message into a well-defined object or class
- after parsing, the data in the message is very easily accessed and manipulated
The Google developers don't assert that this is a "replacement" for XML. It's just a better data carrier in many situations. The things that XML claims to be good at are still accomplished best with XML:
- XML is better for describing a text-based document with markup, whereas PB doesn't easily interleave structure with text
- XML is human-readable, and human-editable, whereas PB is squished into binary
- XML is self-describing, whereas PB relies on a proto file to describe the content of a message
Protocol Buffers, to me, seem to accomplish for C++ and Java what JSON does for JavaScript, albeit in a necessarily more complex way. They encapsulate data in a tight, slim format that is easily parsed by the recipient language.
Bravo, Google!
IMHO, the name is horrid... What the heck does "Protocol Buffer" mean? I had to look it up, and even then I wasn't satisfied with the etymology. I guess the name, like PB itself, is not "self-describing" ;)
I've been dealing with data records for many years and I think that there are far better ways to describe data and present variable or fixed type records in ways that the receiver can then use them than either XML or PB.
-Commerce
At the moment all PHBs have been sold the delusion that XML is the solution to all their worries. In most cases it is terrible, because language objects and XML do not map onto each other cleanly (attributes vs. child nodes, for example). Most of my XML parsing code ends up just squeezing some XML into native objects to pass to the next part of my script.
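To illustrate the attributes vs. child nodes point, here is a small Python sketch using the standard library's ElementTree. The two XML layouts and the person_from_xml helper are invented for the example. Both documents describe the same person, yet the "squeeze it into a native object" code has to handle each layout separately.

```python
import xml.etree.ElementTree as ET

# Two equally plausible XML layouts for the same record.
as_children = (
    "<person><name>John Doe</name>"
    "<email>jdoe@example.com</email></person>"
)
as_attributes = '<person name="John Doe" email="jdoe@example.com"/>'


def person_from_xml(xml_text: str) -> dict:
    """Flatten either layout into a plain dict (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    person = dict(root.attrib)      # values stored as attributes
    for child in root:              # values stored as child elements
        person[child.tag] = child.text
    return person


print(person_from_xml(as_children))
print(person_from_xml(as_attributes))
```

Both calls produce the same dict, but only because the helper checks both places. Nothing in XML itself says which one a given producer will use.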
There should be a PHP parser since that is the most widely used language. I hope that this gets some hold, but I am very doubtful. The lack of popular parsers is a hindrance, plus it has a stupid name. Protocol Buffers sounds like it is to do with protocol handlers. They should have called it Hyper XML 2.0 if they want traction in the enterprise.
They should have called it Hyper XML 2.0 if they want traction in the enterprise.
Kinda like JAVAscript?
Sorry. Couldn't resist...
Actually, I have to pick at one thing Yon Honorable Witch of The Web stated:
JSON evaluates natively in Javascript
Running a JSON object through an eval() is not what I call "native." They should have built in a more straightforward way of resolving it by now.
To wit:
var myGodThisIsSimple = someJSONParameter;
person {
name:John Doe
email:jdoe@example.com
}
Commerce: "I've been dealing with data records for many years and I think that there are far better ways to describe data and present variable or fixed type records in ways that the receiver can then use them than either XML or PB."
Can you give an example?
person {
name:John Doe
email:jdoe@example.com
}
This is not a valid .proto file because it actually contains data. The proto file is supposed to define the class, not supply data.
The original example should be
XML:
<person><name>Joe Blogs</name></person>
PB
0A 3C
It's a binary format, so none of it is readable except the proto files. The original post is confusing because the PB example is not correct. An actual example of a .proto file would look like this:
message Person {
required string name = 1;
required int32 id = 2;
optional string email = 3;
}
1, 2 and 3 are not data values; they are field numbers (tags) that the parser uses to identify each field in the binary encoding.
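To make the role of those numbers concrete, here is a hedged Python sketch that walks a hand-built binary message without any .proto file. It only handles two wire types (varint and length-delimited), the function names are made up, and the id value 150 is invented for the example. This is not Google's parser, just the encoding rules applied by hand.

```python
def read_varint(data: bytes, pos: int) -> tuple[int, int]:
    """Read a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:          # high bit clear: last byte
            return result, pos
        shift += 7


def walk_fields(data: bytes) -> list:
    """List (field_number, wire_type, raw_value) without any schema."""
    fields = []
    pos = 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:           # varint (int32, bool, ...)
            value, pos = read_varint(data, pos)
        elif wire_type == 2:         # length-delimited (string, bytes, nested)
            length, pos = read_varint(data, pos)
            value = data[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        fields.append((field_number, wire_type, value))
    return fields


# name = "Joe Blogs" (field 1), id = 150 (field 2).
message = bytes.fromhex("0a094a6f6520426c6f6773") + bytes.fromhex("109601")
for field in walk_fields(message):
    print(field)
# (1, 2, b'Joe Blogs')
# (2, 0, 150)
```

Note that the output carries only field numbers, never the names "name" or "id". Those live in the .proto file, which is why the binary is not self-describing.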
I assume that [CR][LF] is it, but some systems might send just [CR] or just [LF]. How is that catered for?
I assumed line endings, but the example using a semi-colon would do better.
The real PB data is not human-readable.
So how is that better than XML in terms of legibility? (Think of future generations trying to understand PB code.)
Every single byte that goes on their page needs to be championed. It is probably the most valuable real estate in the world, making the Emperor's Palace in Tokyo look like a double-wide at Lakeside Trailer Park.
I can definitely understand why they don't want to use XML.
However, I am not Google, and my property is more like a lean-to outside the Manila Dump. There is no valid reason for me to publish an SDK like PB. XML will do just fine for me.
I think the CSS-like example and the comparison to XML are fairly confusing. It is more like CORBA or SOAP combined with PHP's serialize or Python's pickle, except that the output can be read by any supported language. The fact that this serialized format can be transferred over the wire is incidental; the binary output could just as well be saved to disk.
It has nothing to do with being easier than XML; it's more about using the right tool for the job and cutting down on bloat.
Protocol buffers are "just" cross-platform data structures. All you have to write is the schema (a .proto file), then generate bindings in C++, Java, or Python. (Or Haskell. Or Perl.) The .proto file is just a schema; it doesn’t contain any data except default values. All getting and setting is done in code. The serialized over-the-wire format is designed to minimize network traffic, and deserialization (especially in C++) is designed to maximize performance.
Besides being blindingly fast, protocol buffers have lots of neat features. A zero-size PB returns default values. You can nest PBs inside each other. And most importantly, PBs are both backward and forward compatible, which means you can upgrade servers gradually and they can still talk to each other in the interim. (When you have as many machines as Google has, it’s always the interim somewhere.)
Comparisons to other data formats was, I suppose, inevitable. Old-timers may remember ASN.1 or IIOP. Kids these days seem to compare everything to XML or JSON. They’re actually closer to Facebook’s Thrift (written by ex-Googlers) or SQL Server’s TDS. Protocol buffers won’t kill XML (no matter how much you wish they would), nor will they replace JSON, ASN.1, or carrier pigeon. But they’re simple and they’re fast and they scale like crazy, and that’s the way Google likes it.
source [diveintomark.org]
[edited by: Hester at 8:29 am (utc) on July 14, 2008]
[edited by: httpwebwitch at 3:45 pm (utc) on July 14, 2008]
[edit reason] added source citation [/edit]
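The "zero-size PB returns default values" and backward/forward compatibility claims in the quote above can be sketched in a few lines of Python. This is a toy reader, not the real library: the schema dicts, the DEFAULTS table and the parse function are all invented for illustration, and it assumes a reader simply skips field numbers it doesn't know.

```python
DEFAULTS = {"name": "", "id": 0, "email": ""}
OLD_SCHEMA = {1: "name", 2: "id"}               # older server, predates email
NEW_SCHEMA = {1: "name", 2: "id", 3: "email"}   # newer server


def read_varint(data: bytes, pos: int) -> tuple[int, int]:
    """Read a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse(data: bytes, schema: dict) -> dict:
    """Decode data against schema; missing fields keep their defaults."""
    record = {name: DEFAULTS[name] for name in schema.values()}
    pos = 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:           # varint
            value, pos = read_varint(data, pos)
        elif wire_type == 2:         # length-delimited, treated as a string here
            length, pos = read_varint(data, pos)
            value = data[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        name = schema.get(field_number)
        if name is not None:         # unknown field numbers are skipped
            record[name] = value
    return record


new_message = b"\x0a\x03Joe" + b"\x1a\x07j@x.com"   # name (1) and email (3)

print(parse(new_message, OLD_SCHEMA))   # old reader ignores field 3
print(parse(b"\x0a\x03Joe", NEW_SCHEMA))  # new reader, old message
print(parse(b"", NEW_SCHEMA))           # zero-size message: all defaults
```

The old reader gets name and a default id and never sees the email; the new reader fills in an empty email when given an old message. That is the whole trick behind upgrading servers one at a time.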
This one actually has a chance, as it is backed by a corporation (which always seems to help standards become standards). However, this is pretty much exactly like X.409.
X.409 was developed for exactly the same reason, but was about messaging only, and didn't have some of the nice bells and whistles Google has added.
I doubt many of the folks on this board will be directly using PB, as it is really a server-level protocol that actually sits below the level of most implementations.
However, if they add a PB compiler/decompiler (not parser) to PHP (it would need to be a PEAR or PECL extension, for backwards compatibility), then it would start being a lot more accessible.
PHP has a problem, in that C++ programmers hate it. They view it like VB. They have a point, but PHP delivers on the empty promises of VB, and needs to be taken very, very seriously. PHP 5.1 is finally becoming a usable OO language (but is still kinda Playskool, compared to C++).