> Hence it should not be surprising that PB is strongly typed, has a separate schema file, and also requires a compilation step to output the language-specific boilerplate to read and serialize messages.
I've spent the last two years working on a Protocol Buffer implementation that does not have these limitations. Using my implementation upb (https://github.com/haberman/upb/wiki) you can import schema definitions at runtime (or even define your own schema using a convenient API instead of writing a .proto file), with no loss of efficiency compared to pre-compiled solutions. (My implementation isn't quite usable yet, but I'm working on a Lua extension as we speak).
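To make the "define your own schema using a convenient API" idea concrete, here is a toy Python sketch of what runtime schema definition looks like in spirit. This is not upb's actual API (all names here are hypothetical); it only illustrates that a schema built at runtime can still yield struct-like, type-checked message objects with no code-generation step:

```python
# Toy illustration of runtime-defined schemas -- NOT upb's actual API.
# A schema built at runtime yields a struct-like message class,
# with no separate compilation step required.
from collections import namedtuple

def make_message_class(name, fields):
    """Build a message class from a runtime schema: {field_name: type}."""
    cls = namedtuple(name, fields.keys())

    def checked_new(**kwargs):
        # Enforce the schema's types, as a compiled protobuf class would.
        for fname, ftype in fields.items():
            if not isinstance(kwargs[fname], ftype):
                raise TypeError(f"{fname} must be {ftype.__name__}")
        return cls(**kwargs)

    return checked_new

# Schema defined at runtime, as if imported from a .proto file.
Person = make_message_class("Person", {"name": str, "id": int})
p = Person(name="Ada", id=1)
```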
I'm also working on making it easy to parse JSON into protocol buffer data structures, so that in cases where JSON is de facto typed (which I think is quite often the case) you can use Protocol Buffers as your data analysis platform even if your on-the-wire data is JSON. The benefits are greater efficiency (protobufs can be stored as structs instead of hash tables) and convenient type checking / schema validation.
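To show what "de facto typed" JSON means, here is a minimal Python sketch (the schema and field names are made up for illustration): if every message in a stream has the same shape, you can check it against a schema at parse time, exactly as decoding into a protobuf would:

```python
import json

# Hypothetical schema for a JSON payload that is "de facto typed":
# every message has the same shape, so we can check it the way
# decoding into a protobuf message would.
SCHEMA = {"user": str, "age": int, "tags": list}

def parse_typed(raw):
    """Parse JSON and enforce the schema, like decoding into a typed struct."""
    obj = json.loads(raw)
    for field, ftype in SCHEMA.items():
        if not isinstance(obj.get(field), ftype):
            raise ValueError(f"field {field!r} is not a {ftype.__name__}")
    return obj

msg = parse_typed('{"user": "ada", "age": 36, "tags": ["ops"]}')
```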
The question of how to represent and serialize data in an interoperable way has followed a path that began with XML, moved on to JSON, and IMO will converge on a mix of JSON and Protocol Buffers. Having a schema is useful: the evidence is that every format eventually grows a schema language to go along with it (XML Schema, JSON Schema). Protocol Buffers hit a sweet spot between simplicity and capability with their schema definition.
Once you have defined a data model and a schema language, serialization formats are commodities. Protocol Buffer binary format and JSON just happen to be two formats that can both serialize trees of data that conform to a .proto file. On the wire they have different advantages/disadvantages (size, parsing speed, human readability) but once you've decoded them into data structures the differences between them can disappear.
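Here is a small Python sketch of the "formats are commodities" point. I'm using the stdlib's plist format as a stand-in for the protobuf binary format, just to have two genuinely different wire encodings of the same tree:

```python
import json
import plistlib

# The same tree of data, conforming to one schema, serialized two ways.
tree = {"user": "ada", "age": 36, "tags": ["ops", "dev"]}

wire_json = json.dumps(tree)       # human-readable text format
wire_plist = plistlib.dumps(tree)  # XML plist, standing in for a binary format

# Once decoded back into data structures, the wire format no longer matters:
assert json.loads(wire_json) == plistlib.loads(wire_plist) == tree
```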
If you take this idea even further, you can consider column-striped databases just another serialization format for the same data. For example, Dremel, the database described in Google's paper (http://static.googleusercontent.com/external_content/untrust...), also uses Protocol Buffers as its native schema, so your core analysis code could iterate over either row-major logfiles or a column-major database like Dremel without knowing the difference, because in both cases you're just dealing with Protocol Buffer objects.
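A Python sketch of that row-major/column-major equivalence (the record layout here is invented for illustration, not Dremel's actual encoding): the same analysis loop runs over both layouts unchanged.

```python
# Row-major: a list of records, like a logfile of serialized messages.
rows = [{"path": "/a", "ms": 12}, {"path": "/b", "ms": 7}]

# Column-major: one array per field, like a column-striped store.
cols = {"path": ["/a", "/b"], "ms": [12, 7]}

def iter_rows_from_columns(cols):
    """Reassemble records so analysis code sees the same objects either way."""
    for values in zip(*cols.values()):
        yield dict(zip(cols.keys(), values))

# The same analysis code runs over both layouts without knowing the difference.
total_row_major = sum(r["ms"] for r in rows)
total_col_major = sum(r["ms"] for r in iter_rows_from_columns(cols))
assert total_row_major == total_col_major
```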
I think this is an extremely powerful idea, and it is the reason I have put so much work into upb. To take this one step further, I think that Protocol Buffers also represent parse trees very well: you can think of a domain-specific language as a human-friendly serialization of the parse tree for that DSL. You can model text-based protocols like HTTP really nicely as Protocol Buffer schemas.
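For instance, a sketch of what an HTTP request schema might look like (field names and numbers here are purely illustrative, not an official or standard schema):

```proto
// Illustrative sketch only -- not an official schema.
// proto2 syntax.
message HttpHeader {
  required string name = 1;
  required string value = 2;
}

message HttpRequest {
  required string method = 1;   // e.g. "GET"
  required string path = 2;     // e.g. "/index.html"
  required string version = 3;  // e.g. "HTTP/1.1"
  repeated HttpHeader header = 4;
  optional bytes body = 5;
}
```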
Everything is just trees of data structures. Protocol Buffers are just a convenient way of specifying a schema for those data structures. Parsers for both binary and text formats are just ways of turning a stream of bytes into trees of structured data.