As explained on its website and on Wikipedia, Protocol Buffers (or Protobuf) is "used to serialize structured data". The data structure is defined in a .proto file, which protoc compiles into code (.cc/.h, .py, .java...) that can be imported in several languages to manipulate and serialize the data.
My understanding is that .pb files contain that data in binary and .pbtxt files are an equivalent that contains it as ASCII text. Is that correct?
If so, why are .pbtxt files so readable? I've found some with comments (https://github.com/google/mediapipe/blob/master/mediapipe/graphs/hand_tracking/subgraphs/hand_renderer_cpu.pbtxt).
Also, are .pb/.pbtxt files enough to interpret the data, or do you need their .proto as well?
In the docs for FieldMask the paths use the field names (e.g., foo.bar.buzz), which means renaming the message field names can result in a breaking change.
Why doesn't FieldMask use the field numbers to define the path?
Something like 1.3.1?
You may want to consider filing an issue on the GitHub protocolbuffers repo for a definitive answer from the code's authors.
Your proposal seems logical. Using names may be a historical artifact. There's a possibly relevant comment on an issue thread in that repo:
https://github.com/protocolbuffers/protobuf/issues/3793#issuecomment-339734117
"You are right that if you use FieldMasks then you can't safely rename fields. But for that matter, if you use the JSON format or text format then you have the same issue that field names are significant and can't be changed easily. Changing field names really only works if you use the binary format only and avoid FieldMasks."
The answer to your question lies in the fact that FieldMasks are a convention/utility developed on top of the proto3 schema definition language, not a feature of it (and that utility is not present in all of the language bindings).
While you’re right in your observation that it can break easily (as schemas tend to evolve and change), you need to consider this design choice from a user-friendliness point of view:
If you’re building an API and want to let the user select the set of fields present in the response payload (the common use case for field masks), it is much more convenient to allow that using field paths rather than numeric field indices, as the latter would force the user of the gRPC/Protobuf generated code to be “aware” of the schema. That’s not always desirable when an API is shipped as a software package.
While implementing this as a proto schema feature could give the user the best of both worlds for binary encoding (specify field paths, have them encoded as binary indices), it would also:
Complicate code generation requirements
Still be an issue for plain text encoding.
So, you can understand why it was left as an “external utility”.
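To make the trade-off concrete, here is a minimal sketch of name-based versus number-based paths. It is not how any real FieldMask utility is implemented: the nested-dict schema layout, the two schema versions, and the helper names resolve_names and resolve_numbers are all hypothetical, chosen only to mirror the foo.bar.buzz and 1.3.1 examples from the question.

```python
# Hypothetical schema layout: field number -> (field name, sub-schema).
# In V2 the field "bar" was renamed to "baz"; its number (3) is unchanged.
SCHEMA_V1 = {1: ("foo", {3: ("bar", {1: ("buzz", {})})})}
SCHEMA_V2 = {1: ("foo", {3: ("baz", {1: ("buzz", {})})})}


def resolve_names(schema, path):
    """Translate a dotted name path into field numbers.

    Raises KeyError if a name does not exist at that level.
    """
    numbers = []
    for name in path.split("."):
        for number, (field_name, sub_schema) in schema.items():
            if field_name == name:
                numbers.append(number)
                schema = sub_schema  # descend one level
                break
        else:
            raise KeyError(name)
    return numbers


def resolve_numbers(schema, path):
    """Translate a dotted number path into field names."""
    names = []
    for part in path.split("."):
        name, schema = schema[int(part)]
        names.append(name)
    return names


print(resolve_names(SCHEMA_V1, "foo.bar.buzz"))   # name path works before the rename
print(resolve_numbers(SCHEMA_V2, "1.3.1"))        # number path still works after it
try:
    resolve_names(SCHEMA_V2, "foo.bar.buzz")      # name path breaks after the rename
except KeyError as missing:
    print("unknown field name:", missing)
```

The number path survives the rename, but the caller has to know that 1.3.1 means foo.bar.buzz, which is the usability cost described above.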
Without any encryption, if the recipient has the serialized Protobuf file but does not have the generated Protobuf class (they don't have access to the .proto file that defines its structure), is it possible for them to get any data in the Protobuf file from the binary?
If they have access to a part of the .proto file (for example, just one related message in the file) can they get a part of that data out from the entire file while skipping other unknown parts?
For the first question: yes, absolutely. The protoc tool can help with this (see --decode_raw), as can https://protogen.marcgravell.com/decode, so the data should not be treated as "secure" at all.
For the second question: yes, absolutely. That's a key part built into the protocol: messages are extensible, so a reader can decode the bits it understands and either ignore or just store (for round-trip or "extension" fields) the bits it doesn't understand.
Protobuf is not a security device. To someone with the right tools it is just as readable as XML or JSON, with the slight issue that it can be uncertain how to interpret some values; but you can infer, guess, and reverse engineer.
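As a rough illustration of why no .proto is needed, here is a minimal schema-less decoder written against the publicly documented wire format (tag = field number shifted left by 3, OR-ed with the wire type). It is a sketch, not what protoc --decode_raw does internally: it handles only wire types 0, 1, 2 and 5, does not recurse into nested messages, and the function names and the partial schema at the end are made up for the example.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of the varint
            return result, pos
        shift += 7


def decode_raw(buf):
    """Return a list of (field_number, wire_type, value) without any schema."""
    fields = []
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:    # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:  # fixed 64-bit, left as raw bytes
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:  # length-delimited: string, bytes, or nested message
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:  # fixed 32-bit, left as raw bytes
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields.append((field_number, wire_type, value))
    return fields


# Field 1 = varint 150, field 2 = the bytes "testing".
message = bytes.fromhex("089601120774657374696e67")
fields = decode_raw(message)
print(fields)

# Hypothetical partial schema: the reader only knows that field 2 is "name".
known = {2: "name"}
partial = {known[n]: v for n, _, v in fields if n in known}
print(partial)
```

The last two lines show the second point: a reader that knows only some field numbers can pick those out and step over everything else, because every field carries enough information to be skipped.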
Ok, I have found this page https://developers.google.com/protocol-buffers/docs/encoding
The encoded message discards all the field names and is just a sequence of field-number/value pairs. The generated class might offer some protection for reading this data safely, and it cannot interpret unknown data (naturally, because it was generated from a known structure, the .proto file).
But if I were an attacker I could consult that Encoding page and try to figure out which region of the binary corresponds to which data. For example, a varint might be easy to spot after changing some data. I could then write my own .proto file to attack this unknown data, or even a custom binary reader that selectively reads part of the binary.
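The "change some data and look for the varint" idea can be sketched as follows. This is an illustration under assumptions: the two-field message (field 1 and field 2, both varints) is invented, and the helper names are hypothetical; only the varint and tag layout come from the Encoding page.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_varint_field(field_number, value):
    """Encode one varint field: tag (number << 3 | wire type 0), then the value."""
    return encode_varint(field_number << 3) + encode_varint(value)


def diff_offsets(a, b):
    """Offsets at which two equal-length byte strings differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Same invented message, with field 1 changed from 150 to 151.
before = encode_varint_field(1, 150) + encode_varint_field(2, 7)
after = encode_varint_field(1, 151) + encode_varint_field(2, 7)

print(before.hex(), after.hex())
changed = diff_offsets(before, after)
print(changed)
```

Only the byte holding the low bits of the changed value differs, which is the kind of signal that lets someone locate a field without ever seeing the .proto.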
I want to store a large amount of data in Protobuf format, where each record includes a timestamp field, and I want to retrieve the data based on the timestamp value.
Thanks.
Protobuf is a sequential-access format. There's no way to jump into the middle of a message looking for data; you have to parse through the whole thing.
Some options:
Devise a framing format that allows you to break up your datastore into many small chunks, each of which is a separate protobuf message. This is a fairly large project.
Use SQLite or even an actual database.
Use a random-access-friendly format like Cap'n Proto instead. (Disclosure: I'm the author of Cap'n Proto, and also of Protobufs v2 (Google's open source release).)
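A very small sketch of the first option, the framing format, might look like this. Everything here is an assumption made for illustration: the 12-byte header layout (int64 timestamp plus uint32 payload length), the function names, and the in-memory index. The payloads are placeholder bytes standing in for serialized Protobuf messages, and a real store would also need durability, index persistence, and corruption handling, which is why it is a fairly large project.

```python
import bisect
import io
import struct

# Hypothetical frame header: little-endian int64 timestamp, uint32 payload length.
HEADER = struct.Struct("<qI")


def append_record(stream, timestamp, payload):
    """Write one frame: header followed by an opaque serialized message."""
    stream.write(HEADER.pack(timestamp, len(payload)))
    stream.write(payload)


def build_index(stream):
    """Scan headers only, seeking past payloads; return sorted (timestamp, offset) pairs."""
    index = []
    stream.seek(0)
    while True:
        offset = stream.tell()
        header = stream.read(HEADER.size)
        if len(header) < HEADER.size:
            break  # end of stream
        timestamp, length = HEADER.unpack(header)
        index.append((timestamp, offset))
        stream.seek(length, io.SEEK_CUR)  # skip the payload without parsing it
    index.sort()
    return index


def read_range(stream, index, start, end):
    """Return the payloads whose timestamp lies in [start, end]."""
    first = bisect.bisect_left(index, (start,))
    payloads = []
    for timestamp, offset in index[first:]:
        if timestamp > end:
            break
        stream.seek(offset)
        _, length = HEADER.unpack(stream.read(HEADER.size))
        payloads.append(stream.read(length))
    return payloads


store = io.BytesIO()  # stands in for a file opened in binary mode
append_record(store, 100, b"first")
append_record(store, 200, b"second")
append_record(store, 300, b"third")

index = build_index(store)
hits = read_range(store, index, 150, 300)
print(hits)
```

Because each chunk is a separate small message, only the frames in the requested time range are ever read and parsed.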
I keep running into a certain kind of data structure, and wonder if there is a name for it. It maps very closely to JSON, but not exactly. The rules are:
It is composed entirely of maps, arrays, and primitives.
It is hierarchical. Maps contain name/value pairs, where a value can be another map, an array, or a primitive. Arrays contain values with the same rules.
The top level is always a map.
The primitives are strings, integers, floats, booleans, and possibly dates.
Sometimes the map is just an unordered hash, and sometimes the order of the name/value pairs matters.
This is a really, really useful structure. You can use it to represent documents, database records, various messages, HTTP requests, lots of stuff. I've run into it in FreeMarker (as the 'data model'), Mongo, and anything that uses JSON.
It's not really JSON, because that's a file format, not a specification for a particular data structure. It's not an "object", because object trees can point to other things, like streams and functions. It's not a DOM.
What is it?
Around the office, we've started to call it a "garg", for "generalized argument".
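The rules listed above are precise enough to check mechanically. Here is a minimal sketch of such a check; the function names is_value and is_garg are invented for the example, and requiring map names to be strings is my assumption, since the rules only say "name/value pairs".

```python
from datetime import date

# Primitives from the rules: strings, integers, floats, booleans, possibly dates.
# datetime is a subclass of date, so it is accepted as well.
PRIMITIVES = (str, int, float, bool, date)


def is_value(value):
    """True if value is a map, an array, or a primitive, recursively."""
    if isinstance(value, dict):
        # Assumption: names are strings.
        return all(isinstance(k, str) and is_value(v) for k, v in value.items())
    if isinstance(value, list):
        return all(is_value(v) for v in value)
    return isinstance(value, PRIMITIVES)


def is_garg(value):
    """True if value follows the rules and its top level is a map."""
    return isinstance(value, dict) and is_value(value)


record = {"id": 7, "tags": ["a", "b"], "owner": {"name": "x", "since": date(2020, 1, 1)}}
print(is_garg(record))         # a nested map of arrays and primitives
print(is_garg([1, 2, 3]))      # top level is an array, not a map
print(is_garg({"f": len}))     # contains a function, so it is an object tree
```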
"It's not really JSON, because that's a file format, not a specification for a particular data structure."
It might not be JSON (since the specs include syntax rules), but your structure definition defines the same data structure as JSON does.
I don't think it's useful to name this structure. When you are talking about data, just call it data. When you need to interchange data you need a data-interchange format. Now JSON proves to be one damn good one.
JSON isn't just a file format. JSON is also a data structure.
From JSON.org
JSON is built on two structures:
A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.
An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.
These are universal data structures.
It is a generic data storage structure that carries around hierarchical data. I don't have a generic name for it, but if I were to implement such a beast in, say, C++, I'd probably call the abstract base class a Variant, and name the concrete types by their names: Integer, Array, Map, etc. I'd chuck them in a namespace that would relate to where I'd use them - or maybe I'd prefix the types themselves. I've seen such structures used as well, but I don't know if there is a name that I'd recognize. A DataStore, Environment, StorageBin, or anything that is generic and implies storage of data would do.
I don't see myself calling such a class hierarchy JSON, though. I would provide a JsonSerializer or some such to map this data to JSON, if I needed it.
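As a rough sketch of that layout, written in Python rather than C++ to keep it short: the class names Variant, Integer, Array, Map and JsonSerializer come from the description above, while String, the to_plain method, and the constructor signatures are assumptions added for the example.

```python
import json


class Variant:
    """Abstract base for every node in the tree."""

    def to_plain(self):
        """Return the node as plain Python data (dict, list, or primitive)."""
        raise NotImplementedError


class Integer(Variant):
    def __init__(self, value):
        self.value = int(value)

    def to_plain(self):
        return self.value


class String(Variant):
    def __init__(self, value):
        self.value = str(value)

    def to_plain(self):
        return self.value


class Array(Variant):
    def __init__(self, items):
        self.items = list(items)

    def to_plain(self):
        return [item.to_plain() for item in self.items]


class Map(Variant):
    def __init__(self, entries):
        self.entries = dict(entries)

    def to_plain(self):
        return {name: value.to_plain() for name, value in self.entries.items()}


class JsonSerializer:
    """Maps a Variant tree to JSON text; JSON is one output format, not the tree's name."""

    def serialize(self, variant):
        return json.dumps(variant.to_plain(), sort_keys=True)


doc = Map({"id": Integer(7), "tags": Array([String("a"), String("b")])})
text = JsonSerializer().serialize(doc)
print(text)
```

Keeping the serializer separate means the same tree could be written out in another format without touching the node classes.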
It sounds like you're describing an associative array, with optional ordering.
That's what JSON represents, except that (I believe) JSON doesn't impose an ordering requirement. Naturally, many other representations also describe associative arrays, which is why JSON is a popular text serialization.
Update 1: JSON isn't properly an associative array. It is a description of object properties. Because it is very often construed as an associative array, many people make the same mistake I did. In fact, "object notation" is the proper name for it - surprise, surprise. :) In addition, JSON isn't a file format - it's a text serialization or markup language, which is different from a file format.
The structure is a tree with different kinds of values stored at its leaves.
In Boost, a similar structure is called Property Tree.