I am aware of this question.
I am considering using Protobuf 3 for a file format, due to its efficient encoding, explicit schema and wide support. However, one very inconvenient aspect of the schema language is that it doesn't allow required fields.
Google has its reasons for removing required fields in Protobuf 3. The problem they had is real (removing required fields is a breaking change), but their solution is nonsense.
Anyway my question is: Protobuf 3 allows you to add custom options for fields. Has anyone used that (or another method) to add unofficial support for required fields to Protobuf 3?
Do you want some implicit validation for your file format by using Protobuf, i.e. do you want parsing of a file to fail if a required field does not exist in the data? If so, you may use the wrapper types; these will trigger an error if you try to read their value from a non-existing field.
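If you would rather check presence explicitly after parsing, a minimal sketch of the idea could look like the following. It assumes a plain dataclass as a stand-in for a parsed message, with None meaning "field was absent" (roughly how a wrapper-typed field behaves). The names Header, REQUIRED, missing_required and load_header are made up for illustration and are not part of any Protobuf API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Header:
    # Stand-in for a parsed message; None means the field was not present.
    version: Optional[int] = None
    title: Optional[str] = None
    comment: Optional[str] = None


# Hypothetical list of fields the file format treats as required.
REQUIRED = ("version", "title")


def missing_required(msg, required=REQUIRED):
    """Return the names of required fields that are absent from msg."""
    return [name for name in required if getattr(msg, name) is None]


def load_header(msg):
    """Reject a parsed message that lacks any required field."""
    missing = missing_required(msg)
    if missing:
        raise ValueError("missing required field(s): " + ", ".join(missing))
    return msg


print(missing_required(Header(version=1)))  # ['title']
```

The point of keeping the required list outside the message definition is that removing a field from it later does not change the wire format.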
Dropbox's Rust library supports annotations to do this!
(gogoproto.nullable)=false: generates non-nullable field types
(rust.nullable)=false: generates oneofs as non-nullable (fail on deserialization)
(rust.err_if_default_or_unknown)=true: generates enums as non-zeroable (fail on deserialization)
Are there any examples / references to see how protobuf data can be validated using json schema?
Apologies if I'm starting off with something too basic...
Protobuf data can be validated using protobuf deserialisers; if data is parsed by the parser that was generated for the message (and is part of the class representing that message), then it's valid data. To generate that parser / class, you'd have started with a protobuf schema and compiled that with protoc.
Generally speaking, I'd say that wanting to validate such data against a json schema is possibly not a good idea. Having a json schema for the same data as well means having "two versions of the truth", which tends to cause trouble. Which one is right: the .proto schema, or the json schema? If I edit one, have I accurately edited the other?
JSON Can Do More Than Protobuf
I can see why you may want to check such data against a json schema. In a json schema you can define things like value and size constraints that cannot be expressed in a protobuf schema. For example, a message field "bearing" might in the application have a valid range of 0 to 359. There is no way to express such a constraint in a protobuf schema, but if it is expressed in a json schema used to validate json data, the validator would object if "bearing" were set to 412.
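To make the "bearing" example concrete, here is a rough sketch of the kind of check a json schema validator performs for a range constraint. It is hand-rolled and only understands the "minimum" and "maximum" keywords; it is not a real JSON Schema validator, and the CONSTRAINTS table and validate function are names I made up for illustration.

```python
# Constraint table using the JSON Schema keywords "minimum" and "maximum".
# The field name comes from the example above; the table itself is hypothetical.
CONSTRAINTS = {"bearing": {"minimum": 0, "maximum": 359}}


def validate(record, constraints=CONSTRAINTS):
    """Return a list of error strings for values outside their allowed range."""
    errors = []
    for name, rule in constraints.items():
        if name not in record:
            continue  # absent fields are not checked in this sketch
        value = record[name]
        if value < rule["minimum"] or value > rule["maximum"]:
            errors.append(
                f"{name}={value} is outside {rule['minimum']}..{rule['maximum']}"
            )
    return errors


print(validate({"bearing": 90}))   # []
print(validate({"bearing": 412}))  # ['bearing=412 is outside 0..359']
```

A protobuf parser would accept both records, since 412 is a perfectly good integer as far as the .proto schema is concerned.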
So, why not generate code from the json schema? I have tried (some time ago - I'm out of date) code generators for languages like C# using json schema as input, but found the result unsatisfactory (the code generators I tried didn't want to implement all the things in my json schema, e.g. unions). Things may have got a lot better since then.
Is there a Better Solution?
If this is indeed the kind of thing you need to do, then it's likely that choosing protobuf is not ideal for the purpose (due to the lack of constraints in protobuf schema). The question then is, what are the alternatives?
In my experience, if you want to stick to the concept of starting with a schema and generating code, the best I've ever used is ASN.1 (where "best" assumes you're willing to pay for good commercial ASN.1 tools from companies like Objective Systems or Nokalva - I've been a customer of both).
These days, ASN.1 can even serialise to json (or xml in several flavours, or other text and packed / unpacked binary data formats). The ASN.1 schema language does have constraints on sizes of lists and/or values of fields. There is an official translation between ASN.1 schema and XML schema (XSD), with the better ASN.1 tools able to do that translation. There may now be a defined translation between ASN.1 and json schema too (I don't know), plus tools to do that.
The point of that is, with translation tools, one can then say that the ASN.1 schema and XSD (or json) schema are "one single truth" - one being automatically generated from the other which was hand written.
A Good Halfway House?
I notice (from a quick search) that there are various git* projects purporting to translate between protobuf and json schema, which if satisfactory means that your json and protocol buffer schema can be automatically translated between one and the other (which means that my 2nd para above is junk!).
Unless something has happened recently, those json / protobuf schema translations are going to be limited, or disappointing. ASN.1, XSD and json schema are broadly similar in terms of what their syntaxes allow to be expressed (including size and value constraints), so translation between them doesn't necessarily lose "information". However, the syntax of protobuf schema is a lot more limited than that of json schema, so a translation from json schema to protobuf might lose the very information that you want.
The good news though would be that the protobuf schema would still be a "form of the truth" having been translated from the json schema. If you were using protobuf to generate json data instead of protobuf binary format data, the "original form of the truth" (the json schema) can be used to validate the protobuf generated json, with constraints on value and size still intact. That would be a good result!
Good luck!
I am reading the official protobuf encoding doc. It states that a protobuf message encodes the type of each field in the message. But I thought the client side has the schema class file as well, so the client should already know the types. Why does protobuf even bother to send type info the client already knows?
It says right there in your linked docs:
When the message is being decoded, the parser needs to be able to skip fields that it doesn't recognize. This way, new fields can be added to a message without breaking old programs that do not know about them. To this end, the "key" for each pair in a wire-format message is actually two values – the field number from your .proto file, plus a wire type that provides just enough information to find the length of the following value.
(emphasis mine)
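To illustrate the quoted passage, here is a small sketch of a reader that only knows some field numbers and uses the wire type to step over everything else. It is a simplified, hand-written decoder, not the real protobuf parser: it only collects varint fields, it ignores the deprecated group wire types, and the function names are my own.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def skip_value(buf, pos, wire_type):
    """Return the position just past a value, using only its wire type."""
    if wire_type == 0:  # varint
        _, pos = read_varint(buf, pos)
        return pos
    if wire_type == 1:  # fixed 64-bit
        return pos + 8
    if wire_type == 2:  # length-delimited
        length, pos = read_varint(buf, pos)
        return pos + length
    if wire_type == 5:  # fixed 32-bit
        return pos + 4
    raise ValueError(f"unsupported wire type {wire_type}")


def read_known_varints(buf, known_fields):
    """Collect varint fields whose numbers are in known_fields; skip the rest."""
    found = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if field_number in known_fields and wire_type == 0:
            found[field_number], pos = read_varint(buf, pos)
        else:
            pos = skip_value(buf, pos, wire_type)
    return found


# Field 1 = 150 (varint), field 2 = "testing" (length-delimited), field 3 = 7.
data = bytes.fromhex("089601" "120774657374696e67" "1807")
print(read_known_varints(data, {1, 3}))  # {1: 150, 3: 7}
```

The "old program" here has never heard of field 2, yet it can still find field 3, because the wire type of field 2 tells it how many bytes to jump over.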
In the docs for FieldMask the paths use the field names (e.g., foo.bar.buzz), which means renaming the message field names can result in a breaking change.
Why doesn't FieldMask use the field numbers to define the path?
Something like 1.3.1?
You may want to consider filing an issue on the GitHub protocolbuffers repo for a definitive answer from the code's authors.
Your proposal seems logical. Using names may be a historical artifact. There's a possibly relevant comment on an issue thread in that repo:
https://github.com/protocolbuffers/protobuf/issues/3793#issuecomment-339734117
"You are right that if you use FieldMasks then you can't safely rename fields. But for that matter, if you use the JSON format or text format then you have the same issue that field names are significant and can't be changed easily. Changing field names really only works if you use the binary format only and avoid FieldMasks."
The answer to your question lies in the fact that FieldMasks are a convention/utility developed on top of the proto3 schema definition language, and not a feature of it (and that utility is not present in all of the language bindings).
While you’re right in your observation that it can break easily (as schemas tend to evolve and change), you need to consider this design choice from a user friendliness POV:
If you’re building an API and want to allow the user to select the field set present inside the response payload (the common use case for field masks), it’ll be much more convenient for you to allow that using field paths, rather than binary field indices, as the latter would force the user of the gRPC/protocol generated code to be “aware” of the schema. That’s not always desirable when providing an API as a software package.
While implementing this as a proto schema feature can allow the user to have the best of both worlds (specify field paths, have them encoded as binary indices) for binary encoding, it would also:
Complicate code generation requirements
Still be an issue for plain text encoding.
So, you can understand why it was left as an “external utility”.
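For what it's worth, the "best of both worlds" idea (users write field paths, the numbers go on the wire) can be sketched like this. The SCHEMA table is a hypothetical hand-written stand-in for what would normally come from message descriptors, chosen so that foo.bar.buzz maps to the 1.3.1 from the question; path_to_numbers is not part of any FieldMask utility.

```python
# Hypothetical schema table:
# message name -> {field name: (field number, nested message name or None)}
SCHEMA = {
    "Root": {"foo": (1, "Foo")},
    "Foo": {"bar": (3, "Bar")},
    "Bar": {"buzz": (1, None)},
}


def path_to_numbers(root, path, schema=SCHEMA):
    """Translate a dotted field-name path into a dotted field-number path."""
    numbers = []
    message = root
    for name in path.split("."):
        if message is None:
            raise KeyError(f"{name!r} follows a field that is not a message")
        number, message = schema[message][name]
        numbers.append(str(number))
    return ".".join(numbers)


print(path_to_numbers("Root", "foo.bar.buzz"))  # 1.3.1
```

Note that the translation needs the schema on whichever side performs it, which is exactly the extra requirement on code generation mentioned above, and it does nothing for the JSON or text encodings where names are on the wire anyway.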
I'm looking for a string representation of arbitrary fields inside protocol buffer messages. Is there any library that implements this? I've looked at using field masks, however they don't have strong support for repeated fields.
Protocol buffer message and field descriptors provide field access by name. This allows you to find a particular field using a path and to erase it, if that's what you are asking for (if not, I'd recommend expanding the question to include an example of what you'd like to do).
One corresponding Java method is getDescriptorForType (the return type is a message descriptor, where you'll find field descriptors).
There is a similar descriptor API for C++ (in Java, you could theoretically also use reflection).
This API is not available in lite mode.
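As a rough illustration of a path syntax that also covers repeated fields, here is a sketch that resolves paths such as items[1].sku. To stay self-contained it walks nested dicts and lists as a stand-in for a message; with real messages the lookups would go through the descriptor API instead. The path syntax and the names parse_path and get_by_path are my own invention, not an existing library.

```python
import re

# One path segment: a field name, optionally followed by [index].
_SEGMENT = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)(?:\[(\d+)\])?$")


def parse_path(path):
    """Split 'a.b[2].c' into [('a', None), ('b', 2), ('c', None)]."""
    steps = []
    for segment in path.split("."):
        match = _SEGMENT.match(segment)
        if match is None:
            raise ValueError(f"bad path segment: {segment!r}")
        name, index = match.groups()
        steps.append((name, int(index) if index is not None else None))
    return steps


def get_by_path(message, path):
    """Follow a parsed path through nested dicts (messages) and lists (repeated)."""
    current = message
    for name, index in parse_path(path):
        current = current[name]
        if index is not None:
            current = current[index]
    return current


# Stand-in for a message with a nested message and a repeated field.
order = {"customer": {"name": "Ada"}, "items": [{"sku": "A1"}, {"sku": "B2"}]}
print(get_by_path(order, "items[1].sku"))  # B2
```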
Without any encryption, if the recipient has the serialized Protobuf file but does not have the generated Protobuf class (they don't have access to the .proto file that define its structure), is it possible for them to get any data in the Protobuf file from the binary?
If they have access to a part of the .proto file (for example, just one related message in the file) can they get a part of that data out from the entire file while skipping other unknown parts?
yes, absolutely; the protoc tool can help with this (see: --decode_raw), as can https://protogen.marcgravell.com/decode - so it should not be treated as "secure" at all
yes, absolutely - that's a key part built into the protocol that allows messages to be extensible such that they can decode the bits they understand and either ignore or just store (for round-trip or "extension" fields) the bits they don't understand
protobuf is not a security device; to someone with the right tools it is just as readable as xml or json, with the slight issue that it can be uncertain how to interpret some values; but: you can infer and guess and reverse engineer
Ok, I have found this page https://developers.google.com/protocol-buffers/docs/encoding
The encoded message discards all the field names and is just a sequence of key number and value pairs. The generated class might offer some protection by reading this data safely, and it cannot interpret unknown data (naturally, because the generated class was generated from a known structure, the .proto file).
But if I am an attacker I could reference that Encoding page and try to figure out which area in the binary corresponds to which data. For example, a varint might be easy to spot after changing some data. I could then proceed to write my own .proto file to attack this unknown data, or even a custom binary reader that can selectively read part of the binary.
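To show how little effort that takes, here is a minimal sketch of such a custom reader, working from the wire format alone with no .proto file. It is similar in spirit to protoc --decode_raw but much cruder: it only lists top-level fields, does not handle the deprecated group wire types, and the function names are my own.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_raw(buf):
    """List (field_number, wire_type, value) for every top-level field."""
    fields = []
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:    # varint: decoded to an int
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:  # fixed 64-bit: raw bytes
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:  # length-delimited: raw payload bytes
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:  # fixed 32-bit: raw bytes
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields.append((number, wire_type, value))
    return fields


# Field 1 = 150 (varint), field 2 = "testing" (length-delimited).
data = bytes.fromhex("089601" "120774657374696e67")
print(decode_raw(data))  # [(1, 0, 150), (2, 2, b'testing')]
```

The guessing starts after this step: a length-delimited payload could be a string, raw bytes, a nested message or a packed repeated field, and a varint could be signed, unsigned, zigzag-encoded, a bool or an enum. That is the "uncertain how to interpret some values" part, and it is an inconvenience rather than protection.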