Globally unique option field numbers for protobufs - protocol-buffers

In the protobuf documentation (https://developers.google.com/protocol-buffers/docs/proto#customoptions) it says this about custom options:
One last thing: Since custom options are extensions, they must be
assigned field numbers like any other field or extension. In the
examples above, we have used field numbers in the range 50000-99999.
This range is reserved for internal use within individual
organizations, so you can use numbers in this range freely for
in-house applications. If you intend to use custom options in public
applications, however, then it is important that you make sure that
your field numbers are globally unique. To obtain globally unique
field numbers, please send a request to
protobuf-global-extension-registry#google.com. Simply provide your
project name (e.g. Object-C plugin) and your project website (if
available). Usually you only need one extension number.
Why do the options field numbers have to be globally unique for public applications? In what way can collisions be a problem?

Basically, because you wouldn't know whether the data you get is correct.
The protobuf binary wire format only stores the field numbers and the payload (which is itself, for complex types, just field numbers and sub-payloads). There is no name data. So: when you store and retrieve an extension field, all you're saying is "fetch field {field number}, interpret it as {type}". If two different systems have extended the same data using the same field number, then you don't have any way of knowing whether the data you're fetching was actually in that format.
Normally this isn't a problem - as it is rare to conflict like this on the same data; but custom options are different! I'm a library author; I might want to add a custom option that my schema parsing tools recognize, by extending (say) MessageOptions. MessageOptions is the extension point for DescriptorProto, which is to say: that's what option (foo) = "bar"; goes to inside a message.
To do that, I need to assign a number for foo. I choose 5000 arbitrarily (MessageOptions defines extensions 1000 to max;, so that's fine). All is good. My tooling works.
Unknown to me, another library author has chosen to do something similar and has also used 5000. Once the schema is compiled (by protoc or similar), all I have is numbers. If I ask for the data from field 5000, I don't know whether I'm getting my extension, or the other one. The meaning is lost. OK, at a push I could also check the dependency list on the FileDescriptorProto, but ... that's hit and miss.
I don't know whether the presense of value 1 in field 5000 is:
option (.mystuff.someext) = 1;
vs
option (.anotherlib.whatever) = -1; // stored as sint32
vs
option (.yetanother.library.option) = true;
If those extensions all have number 5000, they appear identically on the wire.

Related

Why does protobuf's FieldMask use field names instead of field numbers?

In the docs for FieldMask the paths use the field names (e.g., foo.bar.buzz), which means renaming the message field names can result in a breaking change.
Why doesn't FieldMask use the field numbers to define the path?
Something like 1.3.1?
You may want to consider filing an issue on the GitHub protocolbuffers repo for a definitive answer from the code's authors.
Your proposal seems logical. Using names may be a historical artifact. There's a possibly relevant comment on an issue thread in that repo:
https://github.com/protocolbuffers/protobuf/issues/3793#issuecomment-339734117
"You are right that if you use FieldMasks then you can't safely rename fields. But for that matter, if you use the JSON format or text format then you have the same issue that field names are significant and can't be changed easily. Changing field names really only works if you use the binary format only and avoid FieldMasks."
The answer for your question lies in the fact FieldMasks are a convention/utility developed on top of the proto3 schema definition language, and not a feature of it (and that utility is not present in all of the language bindings)
While you’re right in your observation that it can break easily (as schemas tend evolve and change), you need to consider this design choice from a user friendliness POV:
If you’re building an API and want to allow the user to select the field set present inside the response payload (the common use case for field masks), it’ll be much more convenient for you to allow that using field paths, rather then binary fields indices, as the latter would force the user of the gRPC/protocol generated code to be “aware” of the schema. That’s not always the desired case when providing API as a code software packages.
While implementing this as a proto schema feature can allow the user to have the best of both worlds (specify field paths, have them encoded as binary indices) for binary encoding, it would also:
Complicate code generation requirements
Still be an issue for plain text encoding.
So, you can understand why it was left as an “external utility”.

Way to Reference Arbitrary Field in Protobuf Message

I'm looking for a string representation of arbitrary fields inside protocol buffer messages. Is there any library that implements this? I've looked at using field masks, however they don't have a strong support for repeated fields.
Protocol buffer message and field descriptors provide field access by name. This allows you to find a particular field using a path and to erase it, if that's what you are asking for (if not, I'd recommend to expand the question to include an example for what you'd like to do).
One corresponding Java method is getDescriptorForType (the return type is a message descriptor, where you'll find field descriptors).
There is a similar descriptor API for C++ (in Java, you could theoretically also use reflection).
This API is not available in light mode.

HL7 FHIR mark resources as anonymized

I am trying to map an existing domain into HL7 FHIR.
So far it was pretty easy to find FHIR resources that more or less represent the same data and can be used for that purpose. But now I am running into a problem of which I am not sure how to solve it.
The existing domain allows that data can be anonymized depending on the users access level. e.g. a patient's name or address might be removed and marked as anonymized. Other data will be pseudonymised, for example a the birthdate in 1980 will be replaced with 01.01.1980. An Age of 37 will be replaced with a category of 30-40.
So I am unsure how to integrate that into the FHIR domain. I was thinking I could create an extension holding a boolean, indicating if a value was anonymized or not and always replace or remove the original value. This might work, but I will run into big problems when the anonymized value is of a different type than the original value (e.g. Age is replaced by a range of values)
Is that even a valid approach? I thought this might be common problem, but I could not find any examples where people described methods of how to mark data as altered. Unfortunately the documentation at http://build.fhir.org/extensibility-registry.html does not contain anything that would help my case.
You can use security labels for this purpose (Resource.meta.security). Take a look at REDACTED and SUBSETTED in the security label value set: https://www.hl7.org/fhir/valueset-security-labels.html
If you need to convey a data type other than the one allowed by the resource (e.g. wanting to convey a range rather than a birthdate), you'd need to use an extension. (Note that dates are valid even if you only include the year.)

What data structure is best to represent search filters?

I'm implementing a native iOS application, which provides a feature to search for articles using keywords and several search filters such as author, year, publisher etc. Each of these search filters has several options such as 2014, 2013, and 2012 for the year filter, or Academe Research Journals, Annex Publishers, and Elmer Press for the publisher filter. Each of these options has a BOOL stating whether the option is selected or not. I need an object that wraps the search keywords and search filters so that I can send it to the server, which is responsible for the search operation.
Which data structure should I use to represent these search filters in the wrapper class?
Something like XML comes to mind:
<year>2014</year>
<publisher>Annex Publishers</publisher>
Although I find it rather bulky. I'd probably prefer something like:
year=2014|publisher=Annex Publishers
You'll need to escape = and | appearing in the field names or values, but this is easy to do.
This could just be a single string sent across.
In terms of actual data structures, you could have a map of field name to value, only containing entries where the option is selected. Or you could have a class containing pointers / references for each field, set to null if the option is not selected.
Another totally different consideration is to use an enumerated type, i.e. mapping each possible value to an integer, typically resulting in less memory used and faster (and possibly more robust) code, depending on how exactly this is done.
You could map it as follows, for example:
Academe Research Journals 0
Annex Publishers 1
Elmer Press 2
Then, rather than sending "Annex Publishers" as publisher, you could just send 1.
year=2014|publisher=1
The extension for multiple possible values for a field can be done in various ways, but it's fairly easy to do:
<year>2014</year>
<year>2013</year>
<publisher>Annex Publishers</publisher>
Or:
year=2014,2013|publisher=Annex Publishers

Searching through the protocol buffer file

I'm new to protocol buffers and I was wondering whether it was possible to search a protocol buffers binary file and read the data in a structured format. For example if a message in my .proto file has 4 fields I would like to serialize the message and write multiple messages into a file and then search for a particular field in the file. If I find the field I would like to read back the message in the same structured format as it was written. Is this possible with protocol buffers ? If possible any sample code or examples would be very helpful. Thank you
You should treat protobuf library as one serialization protocol, not an all-in-one library which supports complex operations (such as querying, indexing, picking up particular data). Google has various libraries on top of open-sourced portion of protobuf to do so, but they are not released as open source, as they are tied with their unique infrastructure. That being said, what you want is certainly possible, yet you need to write some code.
Anyhow, some of your requirements are:
one file contains various serialized binaries.
search a particular field in each serialized binary and extract that chunk.
There are several ways to achieve them.
The most popular way for serial read/write is that the file contains a series of [size, type, serialization output]. That is, one serialized output is always prefixed by size and type (either 4/8 byte or variable-length) to help reading and parsing. So you just repeat this procedure: 1) read size and type, 2) read binary with given size, 3) parse with given type 4) goto 1). If you use union type or one file shares same type, you may skip type. You cannot drop size, as there is no way know the end of output by itself. If you want random read/write, other type of data structure is necessary.
'search field' in binary file is more tricky. One way is to read/parse output one by one and to check the existance of field by HasField(). It is most obvious and slow yet straightforward way to do so. If you want to search field by number (say, you want to search 'optional string email = 3;'), thus search by binary blob (like 0x1A, field number 3, wire type 2), it is not possible. In a serialized binary stream, field information is saved merely a number. Without an exact context (.proto scheme or binary file's structure), the number alone doesn't mean anything. There is no guarantee that 0x1A is from field information, or field information from other message type, or actually number 26, or part of other number, etc. That is, you need to maintain the information by yourself. You may create another file or database with necessary information to fetch particular message (like the location of serialization output with given field).
Long story short, what you ask is beyond what open-sourced protobuf library itself does, yet you can write them with your requirements.
I hope, this is what you are looking for:
http://temk.github.io/protobuf-utils/
This is a command line utility for searching within protobuf file.

Resources