
XML Development Forum

    
Google Snubs XML: Invents Its Own Data Exchange Format: Protocol Buffers
XML is too complicated
httpwebwitch




msg:3693287
 3:08 pm on Jul 8, 2008 (gmt 0)

Google thought of using XML as a lingua franca to send messages between its different servers. But XML can be complicated to work with and, more significantly, creates large files that can slow application performance.
source [news.cnet.com]

Google has invented this new data exchange format called Protocol Buffers [code.google.com], and it's definitely worth a look.

Why not use XML?
The Developer Guide [code.google.com] says:

Protocol Buffers

  • are simpler
  • are 3 to 10 times smaller
  • are 20 to 100 times faster
  • are less ambiguous
  • generate data access classes that are easier to use programmatically

An example, also from the Developer Guide:

let's say you want to model a person with a name and an email. In XML, you need to do:

<person>
<name>John Doe</name>
<email>jdoe@example.com</email>
</person>

while the corresponding protocol buffer message definition (in protocol buffer text format) is:

person {
name = "John Doe"
email = "jdoe@example.com"
}

In binary format, this message would probably be 28 bytes long and take around 100-200 nanoseconds to parse. The XML version is at least 69 bytes (if you remove whitespace) and would take around 5,000-10,000 nanoseconds to parse.
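
The byte counts quoted above can be checked by hand. The sketch below does not use Google's library; it encodes the two strings directly, assuming the published wire layout for short strings (a one-byte tag, a one-byte length, then the UTF-8 bytes). The field numbers 1 and 2 and the helper name are assumptions, since the example gives none.

```python
# Hand-rolled size comparison; field numbers 1 and 2 are assumed for illustration.

def encode_string_field(field_number: int, value: str) -> bytes:
    """Encode one length-delimited field (wire type 2) holding a short string."""
    payload = value.encode("utf-8")
    # This sketch only handles tags and lengths that fit in a single byte.
    assert field_number < 16 and len(payload) < 128
    tag = (field_number << 3) | 2
    return bytes([tag, len(payload)]) + payload

xml = "<person><name>John Doe</name><email>jdoe@example.com</email></person>"
binary = encode_string_field(1, "John Doe") + encode_string_field(2, "jdoe@example.com")

xml_size = len(xml.encode("utf-8"))
binary_size = len(binary)
print(xml_size, binary_size)  # 69 28
```

Each string costs only two bytes of overhead in the binary form, which is where most of the saving over the XML tags comes from.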

Does the syntax of PB look familiar?
Yeah... it's like someone got inspired by JSON and C, then removed all the fluff.

Protocol Buffers even have their own DTD-like definition syntax, which you store in a ".proto" file. At first glance, it looks like an evolved DTD with newer syntax, but actually it's far less kludgy and far more interesting.

This is another strong contender in an arsenal of data exchange formats that already includes XML, CSV, JSON, etc.

 

cmarshall




msg:3693327
 3:46 pm on Jul 8, 2008 (gmt 0)

Looks like it will end up nuking JSON, but XML is still pretty safe.

whoisgregg




msg:3693351
 4:03 pm on Jul 8, 2008 (gmt 0)

My first reaction was "Umm, this is JSON."

Then I wondered why Google would want to kill JSON. Still don't have an answer to that one... :/

httpwebwitch




msg:3693353
 4:03 pm on Jul 8, 2008 (gmt 0)

Upon a little further reading, here's what Protocol Buffers offer:

- the message itself is a nicely compact definition of names and values
- the "proto" file defines what you expect to see in the message
- the proto is used to autogenerate parsing code in several languages (Java, C++, Python), with methods for get, set, et al.
- the autogenerated parser turns the message into a well-defined object or class
- after parsing, the data in the message is very easily accessed and manipulated

The Google developers don't assert that this is a "replacement" for XML. It's just a better data carrier in many situations. The things that XML claims to be good at are still accomplished best with XML:

- XML is better for describing a text-based document with markup, whereas PB doesn't easily interleave structure with text
- XML is human-readable, and human-editable, whereas PB is squished into binary
- XML is self-describing, whereas PB relies on a proto file to describe the content of a message

Protocol Buffers, to me, seems to accomplish for C++ and Java what JSON does for Javascript, albeit in a necessarily more complex way. It encapsulates data in a tight, slim format that is easily parsed by the recipient language.

Bravo, Google!

IMHO, the name is horrid... What the heck does "Protocol Buffer" mean? I had to look it up, and even then I wasn't satisfied with the etymology. I guess the name, like PB itself, is not "self-describing" ;)

httpwebwitch




msg:3693363
 4:13 pm on Jul 8, 2008 (gmt 0)

PB isn't going to nuke JSON. JSON evaluates natively in Javascript; it's as fluid as a data carrier can get. IMHO, JSON is firmly secure as the lingua franca of AJAX.

PB can offer the same kind of data fluidity to Java, C++, and Python. Maybe a PHP5 parser will emerge next! That would be a thrill.

g1smd




msg:3693382
 4:28 pm on Jul 8, 2008 (gmt 0)

How do you nest values in this new schema?

Can you have a new {} pair inside a "quoted" value?

In XML, you have nested <tag></tag> pairs, which you can extend almost infinitely.

In the new schema, you have { } pairings at one level and name = "data" structures at another (lower) level.
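
On the nesting question: in the binary wire format, a nested message is just another length-delimited field whose payload is the inner message's bytes, so nesting can go as deep as needed. Here is a minimal hand-rolled sketch, not the official library. The schema is made up for illustration: field 1 is a name, field 2 is a nested address message, and the address's field 1 is a city.

```python
def encode_varint(n: int) -> bytes:
    """Base-128 varint: 7 bits per byte, high bit set on all but the last byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def length_delimited(field_number: int, payload: bytes) -> bytes:
    """Wire type 2: tag, payload length, payload (strings and nested messages)."""
    tag = (field_number << 3) | 2
    return encode_varint(tag) + encode_varint(len(payload)) + payload

# Hypothetical schema: Person {1: name, 2: Address}, Address {1: city}
address = length_delimited(1, b"London")
person = length_delimited(1, b"John Doe") + length_delimited(2, address)
print(person.hex())
```

The inner message is encoded first and then wrapped exactly like a string would be, so there is no quoting or escaping to worry about.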

Commerce




msg:3693388
 4:30 pm on Jul 8, 2008 (gmt 0)

Actually, I think Google is right about XML, though I am not certain their answer is right.

I've been dealing with data records for many years and I think that there are far better ways to describe data and present variable or fixed type records in ways that the receiver can then use them than either XML or PB.

-Commerce

mikedee




msg:3693470
 5:29 pm on Jul 8, 2008 (gmt 0)

I think this sounds like a good idea for data interchange. At the end of the day, there is server 1 which has an object in memory which they want to transfer to an in memory object on server 2. Why should it have to be piped through XML?

At the moment all PHBs have been sold the delusion that XML is the solution to all their worries. In most cases it is terrible because most language objects and XML do not map onto each other at all (for example, attributes vs. child nodes). Most of my XML parsing code ends up just squeezing some XML into native objects to pass to the next part of my script.

There should be a PHP parser since that is the most widely used language. I hope that this gets some hold, but I am very doubtful. The lack of popular parsers is a hindrance, plus it has a stupid name. Protocol Buffers sounds like it is to do with protocol handlers. They should have called it Hyper XML 2.0 if they want traction in the enterprise.

cmarshall




msg:3693486
 5:41 pm on Jul 8, 2008 (gmt 0)

They should have called it Hyper XML 2.0 if they want traction in the enterprise.

Kinda like JAVAscript?

Sorry. Couldn't resist...

Actually, I have to pick at one thing Yon Honorable Witch of The Web stated:

JSON evaluates natively in Javascript

Running a JSON object through an eval() is not what I call "native." They should have built in a more straightforward way of resolving it by now.

To wit:

var myGodThisIsSimple = someJSONParameter;

johnnie




msg:3693582
 7:06 pm on Jul 8, 2008 (gmt 0)

LOL! Somehow I'm feeling nostalgia towards:

typedef struct person {
char *name;
char *email;
} individual;

httpwebwitch




msg:3693648
 8:08 pm on Jul 8, 2008 (gmt 0)

johnnie, what is that nonsense you just typed? It looks like gibberish, it can't possibly be a real language. ;)

henry0




msg:3693711
 9:07 pm on Jul 8, 2008 (gmt 0)

Maybe a PHP5 parser will emerge next! That would be a thrill.

Agreed. Coming from a PHP background, I feel I could learn this new one pretty swiftly.

Could you call Mr. G and ask for that PHP parser? :)

m0thman




msg:3693913
 1:07 am on Jul 9, 2008 (gmt 0)

And what, pray tell, was wrong with:

000100 03 PERSON.
000110 05 NAME PIC X(60) VALUE SPACES.
000120 05 EMAIL PIC X(100) VALUE SPACES.

Hester




msg:3694195
 10:07 am on Jul 9, 2008 (gmt 0)

How about improving on G's example to use this instead?

person {
name:John Doe
email:jdoe@example.com
}

Commerce: "I've been dealing with data records for many years and I think that there are far better ways to describe data and present variable or fixed type records in ways that the receiver can then use them than either XML or PB."

Can you give an example?

henry0




msg:3694223
 11:01 am on Jul 9, 2008 (gmt 0)

person {
name:John Doe
email:jdoe@example.com
}

Don't you need a stop somewhere?
Won't this read as name:John Doeemail:etc...?
So why not:
person {
name:John Doe;
email:jdoe@example.com;
}

g1smd




msg:3694267
 11:57 am on Jul 9, 2008 (gmt 0)

I assume that [CR][LF] is it, but some systems might send just [CR] or just [LF].

How is that catered for?

mikedee




msg:3694281
 12:09 pm on Jul 9, 2008 (gmt 0)

I think there is some confusion here...

person {
name:John Doe
email:jdoe@example.com
}

This is not a valid .proto file because it actually has data. The proto file is supposed to define the class, not actually supply data.

The original example should be

XML:

<person><name>Joe Blogs</name></person>

PB

0A 3C

It's a binary format, so none of it is readable except the proto files. The original post is confusing because the PB example is not correct. An actual example of a .proto file would look like this:

message Person {
required string name = 1;
required int32 id = 2;
optional string email = 3;
}

1, 2 and 3 are not data values; they are field numbers that the parser uses to identify each field.
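
To show how those numbers end up on the wire: each field is prefixed with a key computed as (field number << 3) | wire type, and the reader uses the field number to work out which attribute it is looking at. The sketch below is a hand-written decoder for illustration, not the generated code. It only handles the two wire types needed here (0 for varints such as int32, 2 for length-delimited values such as strings), and the id value of 150 is made up.

```python
FIELD_NAMES = {1: "name", 2: "id", 3: "email"}  # numbers taken from the .proto above

def read_varint(data: bytes, pos: int):
    """Return (value, next position) for a base-128 varint starting at pos."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: this was the last byte
            return result, pos
        shift += 7

def decode(data: bytes) -> dict:
    fields, pos = {}, 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:    # varint, e.g. int32
            value, pos = read_varint(data, pos)
        elif wire_type == 2:  # length-delimited, e.g. string
            length, pos = read_varint(data, pos)
            value = data[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        fields[FIELD_NAMES[number]] = value
    return fields

# name = "Joe Blogs" (key 0x0A = field 1, type 2), id = 150 (key 0x10 = field 2, type 0)
message = bytes([0x0A, 9]) + b"Joe Blogs" + bytes([0x10, 0x96, 0x01])
print(decode(message))  # {'name': 'Joe Blogs', 'id': 150}
```

Note that the field names never appear in the message itself, which is why the receiver needs the .proto (or code generated from it) to make sense of the bytes.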

httpwebwitch




msg:3694338
 1:23 pm on Jul 9, 2008 (gmt 0)

Exactly right, mikedee. The PB example could be confusing because it shows the data rendered as text. The real PB data is not human-readable (at least not by the average human).

jezzer300




msg:3694459
 3:24 pm on Jul 9, 2008 (gmt 0)

I've worked with lots of big XML projects and it seems so ambiguous, and the messages can get so big with all these long meaningful names. Then you have to decode it.

For a human readable format PB looks good to me. Less memory, traffic and less decoding.

cmarshall




msg:3694486
 3:39 pm on Jul 9, 2008 (gmt 0)

For a human readable format PB looks good to me.

I don't think it's human readable. I think it's a compiled format, like X.409 (Sorry, I looked for an authoritative link, but the industry has worked so hard to bury the nightmare that was X.409 that it looks like it's time to repeat it).

MatthewHSE




msg:3694559
 4:49 pm on Jul 9, 2008 (gmt 0)

I don't know XML that well, but the protocol buffer example looks a lot simpler and more intuitive. Reminds me of CSS. ;)

aleksl




msg:3694709
 7:18 pm on Jul 9, 2008 (gmt 0)

I guess "invents" in the title should be in quotes, or meant as sarcasm. I bet there are a few people out there who will claim copyright infringement on this one.

Hester




msg:3694742
 7:48 pm on Jul 9, 2008 (gmt 0)

I assume that [CR][LF] is it, but some systems might send just [CR] or just [LF]. How is that catered for?

I assumed line endings, but the example using a semi-colon would do better.

The real PB data is not human-readable.

So how is that better than XML in terms of legibility? (Think of future generations trying to understand PB code.)

cmarshall




msg:3694772
 8:07 pm on Jul 9, 2008 (gmt 0)

For Google, it's all about brevity. Their home page probably gets a billion hits a day.

Every single byte that goes on their page needs to be championed. It is probably the most valuable real estate in the world, making the Emperor's Palace in Tokyo look like a double-wide at Lakeside Trailer Park.

I can definitely understand why they don't want to use XML.

However, I am not Google, and my property is more like a lean-to outside the Manila Dump. There is no valid reason for me to publish an SDK like PB. XML will do just fine for me.

henry0




msg:3694895
 10:12 pm on Jul 9, 2008 (gmt 0)

I don't know XML that well, but the protocol buffer example looks a lot simpler and more intuitive. Reminds me of CSS. ;)

That was my point: could it be easier on XML-"challenged" coders?

mikedee




msg:3695311
 11:39 am on Jul 10, 2008 (gmt 0)

If you think this is easier, then I don't think you really understand it. XML is just an overused data description format; this is a code generator (from the .proto files) and a binary data serialization format.

I think the CSS-like example and the comparison to XML are fairly confusing. It is more like CORBA or SOAP combined with PHP's serialize or Python's pickle, in a form that can be read by any supported language. The fact that this serialized format can be transferred over the wire is incidental; the binary output could just as well be saved to disk.

It has nothing to do with being easier than XML; it's more about using the right tool for the job and cutting down on bloat.

Hester




msg:3697812
 8:27 am on Jul 14, 2008 (gmt 0)

Mark Pilgrim has an interesting post about PB on his site.

Protocol buffers are "just" cross-platform data structures. All you have to write is the schema (a .proto file), then generate bindings in C++, Java, or Python. (Or Haskell. Or Perl.) The .proto file is just a schema; it doesn’t contain any data except default values. All getting and setting is done in code. The serialized over-the-wire format is designed to minimize network traffic, and deserialization (especially in C++) is designed to maximize performance.

Besides being blindingly fast, protocol buffers have lots of neat features. A zero-size PB returns default values. You can nest PBs inside each other. And most importantly, PBs are both backward and forward compatible, which means you can upgrade servers gradually and they can still talk to each other in the interim. (When you have as many machines as Google has, it’s always the interim somewhere.)

Comparisons to other data formats was, I suppose, inevitable. Old-timers may remember ASN.1 or IIOP. Kids these days seem to compare everything to XML or JSON. They’re actually closer to Facebook’s Thrift (written by ex-Googlers) or SQL Server’s TDS. Protocol buffers won’t kill XML (no matter how much you wish they would), nor will they replace JSON, ASN.1, or carrier pigeon. But they’re simple and they’re fast and they scale like crazy, and that’s the way Google likes it.

source [diveintomark.org]

[edited by: Hester at 8:29 am (utc) on July 14, 2008]

[edited by: httpwebwitch at 3:45 pm (utc) on July 14, 2008]
[edit reason] added source citation [/edit]
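
The defaults and forward-compatibility points in the quote can be illustrated with a toy reader. This is a sketch under assumptions, not how the real library is implemented: an "old" reader that only knows fields 1 and 3 starts from default values and silently skips any field number it does not recognise, so a message from a "newer" writer still parses. The field numbers, names, and values are made up.

```python
# Old reader's view of the schema: field number -> (name, default value).
KNOWN = {1: ("name", ""), 3: ("email", "")}

def read_varint(data: bytes, pos: int):
    """Return (value, next position) for a base-128 varint starting at pos."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def decode_old_reader(data: bytes) -> dict:
    # Start from defaults, so an empty message yields only default values.
    result = {name: default for name, default in KNOWN.values()}
    pos = 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:    # varint
            value, pos = read_varint(data, pos)
        elif wire_type == 2:  # length-delimited
            length, pos = read_varint(data, pos)
            value = data[pos:pos + length]
            pos += length
        else:
            raise ValueError("wire type not handled in this sketch")
        if number in KNOWN:
            # Both known fields are strings in this sketch.
            result[KNOWN[number][0]] = value.decode("utf-8")
        # Unknown field numbers are skipped: the wire type says how far to jump.
    return result

# A newer writer added field 2 (an int) and field 4 (a string).
new_message = (bytes([0x0A, 3]) + b"Ann"
               + bytes([0x10, 0x96, 0x01])
               + bytes([0x22, 3]) + b"555")
print(decode_old_reader(new_message))  # {'name': 'Ann', 'email': ''}
print(decode_old_reader(b""))          # {'name': '', 'email': ''}
```

Because every field carries its own number and wire type, a reader can step over fields it has never heard of, which is what lets servers be upgraded gradually.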

cmarshall




msg:3697863
 10:36 am on Jul 14, 2008 (gmt 0)

Ha-ha. No mention of X.409.

This one actually has a chance, as it is backed by a corporation (which always seems to help standards become standards). However, this is pretty much exactly like X.409.

X.409 was developed for exactly the same reason, but was about messaging only, and didn't have some of the nice bells and whistles Google has added.

I doubt many of the folks on this board will be directly using PB, as it is really a server-level protocol that actually sits below the level of most implementations.

However, if they add a PB compiler/decompiler (not parser) to PHP (It would need to be a PEAR or PECL extension, for backwards compatibility), then it would start being a lot more accessible.

PHP has a problem, in that C++ programmers hate it. They view it like VB. They have a point, but PHP delivers on the empty promises of VB, and needs to be taken very, very seriously. PHP5.1 is finally becoming a usable OO language (but is still kinda Playskool, compared to C++).

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved