Protocol buffers - unique numbered tag - clarification?

I'm using protocol buffers and everything is working fine, except for one thing I don't understand: why do I need the numbered tags in the proto file:
message SearchRequest {
required string query = 1;
optional int32 page_number = 2;
optional int32 result_per_page = 3;
}
Sure, I've read the docs:
As you can see, each field in the message definition has a unique
numbered tag. These tags are used to identify your fields in the
message binary format, and should not be changed once your message
type is in use.
I don't understand what difference it makes if I change it. (I will create a new proto and compile it, so why does it matter?)
Another article states that:
Numbered fields in proto definitions obviate the need for version
checks which is one of the explicitly stated motivations for the
design and implementation of Protocol Buffers. As the developer
documentation states, the protocol was designed in part to avoid “ugly
code” like this for checking protocol versions:
if (version == 3) {
...
} else if (version > 4) {
if (version == 5) {
...
}
...
}
Question
Is it just me, or is this completely unclear?
Let me ask it in a different way:
If I have a proto file like the one above, and then I change it to:
message SearchRequest {
required string query = 3; //reversed order
optional int32 page_number = 2;
optional int32 result_per_page = 1;
}
Why does it matter? I re-compile and add the file (I've done it multiple times in the last week).
What am I missing? Can you please supply a human-to-human explanation of these numbered tags?

The numbered tags are used to match fields when serializing and deserializing the data.
Obviously, if you change the numbering scheme, and apply this change to both serializer and deserializer, there is no issue.
Consider, though, what happens if you saved data with the first numbering scheme and loaded it with the second one: the parser would try to load query into result_per_page, and deserialization would likely fail.
Now, why is this useful?
Let's say you need to add another field to your data, long after the schema is already in use:
message SearchRequest {
required string query = 1;
optional int32 page_number = 2;
optional int32 result_per_page = 3;
optional int32 new_data = 4;
}
Because you explicitly give it a number, your deserializer is still able to load data serialized with the old numbering scheme, simply leaving the new field unset when it is absent from the data.
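A minimal C++ sketch of what that looks like in practice, assuming (hypothetically) that the old and new definitions are compiled into separate packages old_api and new_api:

#include <string>
#include "old_api/search_request.pb.h"  // hypothetical: old schema, fields 1-3
#include "new_api/search_request.pb.h"  // hypothetical: new schema, adds new_data = 4

int main() {
  // Serialize with the old schema.
  old_api::SearchRequest old_req;
  old_req.set_query("protobuf tags");
  old_req.set_page_number(1);
  std::string wire = old_req.SerializeAsString();

  // Parse the same bytes with the new schema: query and page_number are
  // matched by their numbers (1 and 2); new_data (4) is simply absent
  // and reads back as its default value, 0.
  new_api::SearchRequest new_req;
  new_req.ParseFromString(wire);
  return new_req.new_data() == 0 ? 0 : 1;
}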

These field numbers are used by protobuf while encoding and decoding. See the protobuf encoding documentation for more details.
Every encoded field starts with a key that combines the field number and a wire type: key = (field_number << 3) | wire_type. An int32 has wire type 0 (varint), so a field with number 2 is encoded as (2 << 3) | 0 = 0001 0000 in binary, i.e. 0x10.
When decoding, the three least significant bits give the wire type (000, i.e. varint), and shifting right by three bits gives the field number (00010, i.e. 2). So the decoder knows this is field 2 with wire type 0 (a varint such as int32).
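As a quick illustration (not tied to any particular protobuf library), the key byte can be computed and taken apart like this:

#include <cstdint>
#include <cstdio>

int main() {
  const uint32_t field_number = 2;
  const uint32_t wire_type = 0;  // 0 = varint, used for int32
  // Encoding: key = (field_number << 3) | wire_type.
  const uint32_t key = (field_number << 3) | wire_type;
  std::printf("key byte: 0x%02X\n", key);  // prints 0x10

  // Decoding: the low three bits are the wire type, the rest is the field number.
  std::printf("wire type: %u, field number: %u\n", key & 0x7, key >> 3);  // 0 and 2
  return 0;
}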

Related

How to handle a change in the interpretation of a field in a protobuf message?

If a field stores a specific value and is interpreted in a specific manner, is it possible to change this interpretation in a backwards compatible way?
Let's say I have a field that stores values of different data types.
The most generic case is to store it as a byte array and let the apps encode and decode it to the correct data type.
Common cases for data types are integers and strings, so support for those types is present.
Using a oneof structure this looks as follows:
message Foo
{
...
oneof value
{
uint32 integer = 1;
string text = 2;
bytes data = 3;
}
}
Applications that want to store an ip prefix in the value field have to use the generic data field and do the encoding and decoding correctly.
If I now want to add support for ip prefixes to the Foo message itself so the apps don't have to deal with the encoding and decoding anymore, I could add a new field to the oneof structure with an IpPrefix datatype:
message Foo
{
...
oneof value
{
uint32 integer = 1;
string text = 2;
bytes data = 3;
IpPrefix ip_prefix = 4;
}
}
Even though this makes life easier for the apps, I believe it breaks backwards compatibility.
If a sending app has support for the new field, it will put its ip prefix value in the ip_prefix field.
But if a receiving app does not have support for this new field yet, it will ignore the field.
It will look for the ip prefix value in the data field, as it always did, but it won't find it there.
So the receiving app can no longer correctly read the ip prefix value.
Is there a way to make this scenario somehow backwards compatible?
PS: I realize this is a somewhat vague and perhaps unrealistic example. The exact case I need it for is the representation of RADIUS attributes in a protobuf message. These attributes are in essence a byte array that is sent over the network, but the bytes in the array have meaning and could be stored as different fields in the protobuf message. A basic attribute consists of a Type field and a Value field, where the value field can be a string, integer, ip address... From time to time new datatypes (even complex ones) are added and I would like to be able to add new datatypes in a backwards compatible way.
There are two ways to go about this:
1. Enforce an update schedule, readers before writers
Add the new type of field to the .proto definition, but document that it should not be used except for testing and reception. Document that all readers of the message must support both the old and the new field by a specific milestone/date, after which the writers can start using it. Eventually you can deprecate the old field and new readers don't need to support it anymore.
2. Have both fields during the transition period
message Foo
{
...
oneof value
{
uint32 integer = 1;
string text = 2;
bytes data = 3;
}
IpPrefix ip_prefix = 4;
}
Document that writers should set both data and ip_prefix during the transition period. The readers can start using ip_prefix as soon as writers have added support, and can optionally fall back to data.
Later, you can deprecate data and move ip_prefix to inside the oneof without breaking compatibility.
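A hedged C++ sketch of the writer side during that transition period, assuming generated code for the Foo and IpPrefix messages above and a hypothetical helper LegacyEncodePrefix that produces the raw byte encoding the old readers expect:

#include <string>
#include "foo.pb.h"  // hypothetical path to the generated code for Foo / IpPrefix

// Hypothetical helper: encodes the prefix the way old readers already decode it.
std::string LegacyEncodePrefix(const IpPrefix& prefix);

// During the transition, populate both fields so that old readers (which only
// know about data) and new readers (which prefer ip_prefix) both find the value.
void WritePrefix(Foo* foo, const IpPrefix& prefix) {
  *foo->mutable_ip_prefix() = prefix;           // new readers use this
  foo->set_data(LegacyEncodePrefix(prefix));    // old readers keep using this
}

Note that this only works because ip_prefix sits outside the oneof in this variant; inside the oneof, setting ip_prefix would clear data.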

Can simple protobuf types be migrated to "optional"

In protobuf version 3, the required and optional keywords were initially removed, since required often caused problems (see protobuf issue 2497).
Recently, the 'optional' keyword has been reintroduced in protobuf v3.15.0.
Is it possible to simply add the optional keyword to an existing message?
I.e. change
message Test {
int32 int32_value = 1;
string text_value = 2;
}
to
message Test {
optional int32 int32_value = 1;
optional string text_value = 2;
}
Or will this break the binary format?
Non-optional primitive types in protobuf don't accept null values and normally also map to non-nullable types like int in Java or C#.
But this doesn't mean that the field is always included in the binary representation.
In fact, if a field contains the default value for its type, the field is omitted from the binary representation.
Thus the following message
message Test {
int32 int32_value = 1;
string text_value = 2;
}
Test test = new Test();
byte[] buffer = test.ToByteArray();
gets serialized such that buffer is an empty byte[].
So missing fields default to default values without the use of optional.
The optional keyword changes this behaviour, both for missing fields in the binary format and for explicitly set default values:
a missing field now indicates that the value has not been specified (i.e. null/unset), and explicitly setting a field to its default value no longer results in an empty byte[] but in the default value being serialized.
Thus changing a primitive field to optional won't break the format, but will change the semantics:
All fields of old messages that have been specified with the default value will be interpreted as null. Other values are not affected.
The same holds for optional being removed from a field:
the API won't break, but the semantics change. Unspecified fields will then default to the default values for the corresponding type.
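A small C++ sketch of that semantic difference, assuming the Test message has been compiled with the optional keyword (the generated header name is hypothetical):

#include <string>
#include "test.pb.h"  // hypothetical: generated code for the "optional" variant of Test

int main() {
  Test original;
  original.set_int32_value(0);  // explicitly set to the type's default value
  std::string wire = original.SerializeAsString();
  // With "optional", the field is tracked as present and the 0 is serialized;
  // without "optional", the same call would produce an empty byte array.

  Test parsed;
  parsed.ParseFromString(wire);
  // has_int32_value() only exists for fields with explicit presence (optional).
  return parsed.has_int32_value() ? 0 : 1;
}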

What is the relation of protobuf message field id and field order?

I want to understand if the messages below are compatible from the perspective of protobuf and serialization/deserialization.
message HelloReply {
string message = 1;
string personalized_message = 2;
}
message HelloReply {
string personalized_message = 2;
string message = 1;
}
Does the order matter for compatibility in any situation?
The textual order is largely irrelevant. It may impact some code-generation tooling, but most languages don't care about declaration order, so even that won't matter. The fields are still semantically equivalent: the numbers match the existing meaning (name) and type. It is the number that is the determining feature in identifying a field.
At the protocol level:
parsers must allow fields in any order
serializers should (but not must) write fields in ascending numerical field order
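A short C++ sketch of that equivalence, assuming (hypothetically) that the two definitions above are compiled into separate packages v1 and v2:

#include <string>
#include "v1/hello_reply.pb.h"  // hypothetical: message = 1 declared first
#include "v2/hello_reply.pb.h"  // hypothetical: same numbers, opposite declaration order

int main() {
  v2::HelloReply reply_v2;
  reply_v2.set_message("hello");
  reply_v2.set_personalized_message("hello, Alice");
  std::string wire = reply_v2.SerializeAsString();

  // The wire format carries field numbers, not declaration positions,
  // so the v1-generated class reads the same bytes back unchanged.
  v1::HelloReply reply_v1;
  reply_v1.ParseFromString(wire);
  return reply_v1.message() == "hello" ? 0 : 1;
}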

grpc and protobuf - How to handle a new field when the other side is not releasing in sync

I've got a situation where the other end of the gRPC communication is not in sync with their releases. My higher-ups would therefore like me to add a field that will work whether or not the other side fills it out, for a short time period (like two weeks).
I believe I can do this by adding it to the end of the proto message so that the indices for the other fields do not change. From what I've Googled, the optional keyword is not available prior to version 3.15, so I have to use a workaround.
The workaround that was described to me was to use oneof. However, I am not 100% sure what that looks like. All examples show the oneof field by itself. Are the indices that belong to the oneof values independent of the indices that belong to the rest of the message?
message TestMessage {
string somefield = 1;
int32 someotherfield = 2;
oneof mynewoptionalfield
{
string mynewfield = ???? Does this have to be 3 or is it 1?
int32 ifihadanother = ???? Does this need to be 4 or 2?
}
}
Questions:
What are the indices I use where the ???? marks are?
Is this the proper work around to use when the other side isn't going to recompile and deploy with the changes to the protofile?
How do I then check if the field was filled in my C++ code?
Your use-case is exactly what protobufs were designed to handle. All you need to do is add a new field to the message. In the easiest case, the client application code simply doesn't look at the new field until the server roll-out is complete, and so doesn't notice that it is sometimes present and sometimes missing.
You are correct that you should not change the indices (field ids) of the pre-existing fields. Although I'll note that you can add your new field anywhere within the message; the order the fields are written in does not matter for protobuf.
So you'd just add another field like:
message TestMessage {
string somefield = 1;
int32 someotherfield = 2;
string mynewfield = 3;
}
You don't have to use 3 as the id. You could use 4, or 10, or 10000. But small numbers are more efficient for protobuf and it is typical to just choose the "next" id. On-the-wire protobuf uses the id to identify the field, so it is important you don't change the id later.
In protobuf 3, all fields are "optional" in the protobuf 2 sense; there are no "required" fields. However, protobuf 2 also provided "field presence" for all fields. Protobuf 3 only provided field presence for oneofs and messages... until the recent re-introduction of the "optional" keyword.
In protobuf 3, if you call testMessage.getMynewfield() it will always return a non-null string. If the string was not sent, it will use the empty string (""). For integers 0 is returned, and for messages the "default message" (all defaults) is returned. This is plenty for many use-cases, and may be enough for you.
But let's say you need to distinguish between "" and <notsent>. That's what field presence provides. Messages in protobuf 3 have "has" methods that return true if a value is present. But primitives don't have that presence information. One option is to "box" the primitive with standard wrappers that make the primitive a message. Another option, available in newer versions of protobuf, is the optional keyword. Both options will provide a method like testMessage.hasMynewfield().
message TestMessage {
string somefield = 1;
int someotherfield = 2;
google.protobuf.StringValue mynewfield = 3;
// -or-
optional string mynewfield = 3;
}
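And to the third question: a C++ sketch of the presence check, assuming the optional variant of the message above (the generated header name is hypothetical); with the StringValue wrapper the accessor additionally needs .value():

#include <string>
#include "test_message.pb.h"  // hypothetical: generated code for TestMessage

void Handle(const TestMessage& msg) {
  // "optional string mynewfield = 3;" gives the generated C++ class a has_ accessor.
  if (msg.has_mynewfield()) {
    const std::string& value = msg.mynewfield();  // the sender filled in the field
    // ... use value ...
  } else {
    // The field was never set (e.g. an older sender): fall back to the old behaviour.
  }
}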

Protocol buffer: does changing field name break the message?

With protocol buffers, does changing a field name in a message keep it backward compatible? I couldn't find any citation about that.
E.g., original message:
message Person {
required string name = 1;
required int32 id = 2;
optional string email = 3;
}
Change to:
message Person {
required string full_name = 1;
required int32 id = 2;
optional string email = 3;
}
Changing a field name will not affect the protobuf encoding or compatibility between applications that use proto definitions which differ only by field names.
The binary protobuf encoding is based on tag numbers, so that is what you need to preserve.
You can even change a field's type to some extent (check the type table at https://developers.google.com/protocol-buffers/docs/encoding#structure), provided its wire type stays the same; but that requires additional consideration of whether, for example, changing uint32 to uint64 is safe from the point of view of your application code and, for some definition of 'better', is better than simply defining a new field.
Changing a field name will affect the JSON representation, if you use that feature.
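A brief C++ sketch of the rename scenario, assuming (hypothetically) that the original and renamed definitions are compiled into separate packages v1 and v2:

#include <string>
#include "v1/person.pb.h"  // hypothetical: field 1 is "name"
#include "v2/person.pb.h"  // hypothetical: field 1 is "full_name"

int main() {
  v1::Person old_person;
  old_person.set_name("Ada Lovelace");
  old_person.set_id(42);
  std::string wire = old_person.SerializeAsString();

  // Only tag number 1 is on the wire, not the name, so the renamed
  // definition parses the same bytes without any trouble.
  v2::Person new_person;
  new_person.ParseFromString(wire);
  return new_person.full_name() == "Ada Lovelace" ? 0 : 1;
}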
