Can I retire a required protobuf field? - protocol-buffers

I have a client-server application where the server transmits serialized objects in protobuf format to a client, and I would like to retire a required field. Unfortunately I can't change both client and server at the same time to use the new .proto definition.
If I change a required field to be optional, but only for code that serializes messages (i.e. deserializing code has not been rebuilt and still thinks it's a required field), can I continue to publish messages that can be deserialized as long as I populate a value for the now-optional field?
(It appears to be fine to do so, at least for the few trivial cases I experimented with using Java, but I'm interested in whether this is a generally sensible approach, and whether there are any edge cases etc. I should worry about.)
Motivation: My goal is to retire a required field in a client-server application where the server publishes messages that are deserialized by the client. My intended approach is:
Change required field to optional on the trunk.
If it's necessary to deploy new server code (for unrelated features/fixes), ensure that the optional field continues to be populated in the message.
Gradually deploy updated code for all clients (this will take time as it requires involvement of other teams with their own release schedules)
Confirm that all clients have been updated.
Begin to retire (i.e. not populate) the optional field.

According to the encoding format documentation, whether a field is required or not is not encoded in the serialized byte stream itself. That is, optional or required makes no difference to the encoded serialized message.
I've confirmed this in practice, using the Java generated code, by writing serialized messages to disk and comparing the output - using a message containing all of the supported primitive types as well as fields representing other types.
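For illustration, the change under discussion only touches the field label; a made-up example (Event and legacy_id are hypothetical names):

// Old definition, what the not-yet-rebuilt clients were generated from:
message Event {
  required string legacy_id = 1;
}

// New definition used when rebuilding the server; only the label changes,
// the field number and type stay the same, so the wire encoding is identical:
message Event {
  optional string legacy_id = 1;
}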

As long as the field is set, using the parseFrom(byte[]) method to deserialize will still work, because the byte[] will be the same.
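A minimal sketch of that round trip, assuming a hypothetical generated class Event with a now-optional legacy_id field (server side rebuilt, client side still compiled with the field required):

import com.google.protobuf.InvalidProtocolBufferException;

// Sketch only: Event stands in for the class generated from the .proto.
static void roundTrip() throws InvalidProtocolBufferException {
    // Server side, rebuilt with the field marked optional but still populating it:
    byte[] bytes = Event.newBuilder()
            .setLegacyId("still-populated")
            .build()
            .toByteArray();

    // Client side, old generated code that still considers the field required:
    Event parsed = Event.parseFrom(bytes); // succeeds, the bytes are identical

    // If the server ever stops setting the field before the clients are updated,
    // this parse fails with an InvalidProtocolBufferException about the missing
    // required field.
}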
However, one might wonder why you would change the field from required to optional before you are ready to actually treat it as optional. Basically you are just marking it "optional" in the .proto file while still enforcing that it is required by always populating it. Just a thought.

Related

Spring Boot REST(ful) API and lookup values

I want to write some Rest(ful) application with Spring Boot and Spring Data JPA.
Let's assume that for business reasons I have a database with the following tables:
customer(id number, first_name text, last_name text, type text);
customer_type(type text, description text);
where:
id is generated by the database at insertion time
the type column in the customer table is a foreign key to the type column in the customer_type table; from the microservice's point of view it is immutable, just a lookup table.
Assuming I want to create APIs for CRUD operations on a customer, but want to minimize API calls when just reading, I suppose I need the following operations:
GET /customer/{id}
POST /customer
PUT /customer/{id}
DELETE /customer/{id}
How should the body be structured?
For the GET operation the response should be:
{
  "id": 123,
  "firstName": "John",
  "lastName": "Doe",
  "customerType": {
    "type": "P",
    "description": "Premium Customer"
  }
}
But for POST I imagine I need to avoid sending the id and send just the customer type code, since the description is immutable and the client only needs it for displaying the information on screen; this, however, leads to a request body that differs from the one returned by the GET operation.
For the PUT operation it is the same, but should the id field also be sent? And how should I handle the case where the id in the API path differs from the id in the request body, if one is sent?
DELETE should not be a problem since it just deletes the row in the customer table.
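To make the discussion concrete, the controller I have in mind would look roughly like this (a sketch only; CustomerDto and CustomerService are placeholder names):

import org.springframework.http.HttpStatus;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/customer")
public class CustomerController {

    private final CustomerService service; // placeholder service

    public CustomerController(CustomerService service) {
        this.service = service;
    }

    @GetMapping("/{id}")
    public CustomerDto get(@PathVariable Long id) {
        return service.findById(id);
    }

    @PostMapping
    @ResponseStatus(HttpStatus.CREATED)
    public CustomerDto create(@RequestBody CustomerDto body) {
        return service.create(body);
    }

    @PutMapping("/{id}")
    public CustomerDto update(@PathVariable Long id, @RequestBody CustomerDto body) {
        return service.update(id, body);
    }

    @DeleteMapping("/{id}")
    @ResponseStatus(HttpStatus.NO_CONTENT)
    public void delete(@PathVariable Long id) {
        service.delete(id);
    }
}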
Thank you
How should the body be structured?
Let's take a step back first and quickly discuss what you are basically trying to achieve when following a REST architecture, and why and how REST provides mechanisms for it.
REST is an architectural style that helps decouple clients from servers by introducing indirection mechanisms which may seem odd at first, but which in the end provide the level of decoupling that allows servers to introduce changes that clients will naturally adapt to. Such indirection mechanisms include attaching URIs to link-relation names, using form-based representation formats to tell a client how to construct requests, content-type negotiation to return representations supported and understood by the other side, and so forth. If you don't need such properties, i.e. client and server always go hand in hand with regard to changes and communicate via predefined messages, REST is probably not the best style to follow. If, though, you have a server that is contacted by various clients not under your control, or a client that has to contact various servers, also not under your direct control, this is where REST truly starts to shine, provided all parties adhere to these concepts.
One of REST's premises is that the server will teach clients everything they need to know in order to construct requests. If you look at the Web, where HTML is used basically everywhere, you will see that HTML defines forms which allow a server to explain to a client which properties of a resource it expects as input. On top of that, the form also tells the client which HTTP operation to use, which target URI to send the request to, and which media type to represent the state in. In HTML this is usually implicitly given as application/x-www-form-urlencoded, which chains properties together like this:
firstName=Roman&lastName=Vottner&role=Dev
or the like. This is in essence what HATEOAS, or hypertext as the engine of application state, is all about. You use the built-in controls of the exchanged media type to allow your client to progress through its task instead of having to consult external documentation to look up the "API" of some service. For example, a form could state that an input only allows numeric values, that a sub-portion of the form represents a date/time picker widget which a client could render to a user accordingly, or that an element represents a slider with a given range of admissible values, and the like.
What the actual representation you have to send to the server looks like depends on the instructed media type. HAL forms, for example, uses application/json by default and also specifies that application/x-www-form-urlencoded needs to be supported. Other media types have to be explicitly negotiated between client and server; Ion states that application/json or application/ion+json have to be negotiated via the Content-Type request header.
In plain application/json the url-encoded payload from above could simply be expressed as:
{
"firstName": "Roman",
"lastName": "Vottner",
"role": "Dev"
}
and this is OK as the server basically instructed you to send this data in that format.
There are further media types available that are worth a closer look to see whether they could fit your needs or not. Hydra, for example, has a somewhat different take on this matter, connecting Linked Data to REST via affordances called operations, and allows you to describe resources and their properties through Linked Data classes. So the presence of an affordance for a certain resource tells you what you can do with that resource, such as updating its state, and therefore which class it belongs to and thus which properties it has.
This should just illustrate how the negotiated media type ultimately decides what the representation that has to be sent to the server needs to look like.
As to whether or not to put resource identifiers in the payload: it depends. Usually resources are identified by the URI/IRI, and this, as a whole, is the identifier of the resource. In your application, though, you will reference related domain objects through their IDs, which do not necessarily need to be, and probably also should not be, part of the IRI itself. For example, let's assume we retrieve a resource that represents an order. That order contains the user's name and address, and the various items that were ordered, including some metadata describing those items and whatnot. It usually makes sense in such a case to add the orderId that you use in your application, even though the URI may already contain that information. Users of that API are usually not interested in those URIs but in the actual content, and might also never see those URIs if they are hidden behind automated processes or user interfaces. If a user now wants to print out that order, he or she has all the information needed to, say, file a complaint later on by phone. In other cases, e.g. if you design a resource to be an all-purpose, clipboard-like copy&paste location, an ID does not make any sense unless you allow the user to explicitly reference one of those states directly.
The reason why such IDs should not be part of the URI itself stems from the fact that a URI shouldn't change if the actual resource does not change. For example, we have a customer who went through a merger a couple of years ago. They used to expose all their products via their own URIs that included the productId as part of the URI. During the merger they tried to combine the various data models to reduce the number of systems they had to operate, while serving each of their customers with the same data as before, as the underlying products didn't change. As they tried to stay "backwards" compatible in order to support their customers' legacy systems, they quickly noticed that exposing those productIds as part of the URI was causing them trouble. If they had earlier used a mapping table of, say, exposed UUIDs to internal productIds (again an introduction of indirection), they could have reduced their whole data model, and thus its complexity, by a lot, while being able to change the mapping from internal productId to UUID on the fly and still allowing their clients to look up the product information.
Long story short, as can hopefully be seen, the structure of a representation depends on the exchanged media type. There are loads of different media types available. Use the ones that allow you to describe resources to clients, such as HAL/HAL forms, Ion, Hydra, and so on. With regard to URIs: don't overengineer them. They are, as a whole, just a pointer to a resource, and clients are usually interested in the content, not the URI! As such, make use of indirection features like link-relation names, content-type negotiation and so forth to help remove the direct coupling of clients to services, and instead rely more on the document type exchanged. The media type here basically becomes the contract of the message. Through mappings on the client and server side, resources in various representations can be "translated" into an object which you can use in your application.
As you've tagged your question with spring-boot and spring-data-jpa, you might want to look into spring-hateoas. It supports HAL out of the box; HAL forms can be used via affordances, though the media type needs to be enabled explicitly, otherwise you might miss out on the form template in the responses. Hydra support in spring-hateoas seems to be added through hydra-java, which implements the Spring HATEOAS SPI. While Amazon provides Ion implementations for various programming languages, including Java, it does not yet support Spring HATEOAS or Spring in general; here a custom SPI implementation may be necessary.
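A rough sketch of what this could look like with spring-hateoas for the customer resource (CustomerDto and the lookup are placeholders; only the EntityModel and WebMvcLinkBuilder calls reflect the library's actual API):

import static org.springframework.hateoas.server.mvc.WebMvcLinkBuilder.linkTo;
import static org.springframework.hateoas.server.mvc.WebMvcLinkBuilder.methodOn;

import org.springframework.hateoas.EntityModel;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class CustomerHalController {

    @GetMapping("/customer/{id}")
    public EntityModel<CustomerDto> getCustomer(@PathVariable Long id) {
        CustomerDto dto = findCustomer(id); // placeholder lookup
        // Wrap the DTO and add a self link; with HAL enabled the response is
        // rendered as application/hal+json including a _links section.
        return EntityModel.of(dto,
                linkTo(methodOn(CustomerHalController.class).getCustomer(id)).withSelfRel());
    }

    private CustomerDto findCustomer(Long id) {
        return new CustomerDto(); // placeholder
    }
}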
For PUT operations you need to send the id of the entity that you want to update.
If you want to generate the same response as you would get in GET, then you need to write a DTO and map details accordingly.
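For example, a pair of DTOs along these lines (names made up) lets the GET response carry the full customerType object while POST/PUT only accept the type code, with the id always taken from the path:

// Returned by GET: full representation including the resolved lookup data.
class CustomerResponse {
    Long id;
    String firstName;
    String lastName;
    CustomerTypeDto customerType; // { "type": "P", "description": "Premium Customer" }
}

// Accepted by POST/PUT: no id (generated on POST, taken from the path on PUT)
// and only the type code, since the description is immutable lookup data.
class CustomerRequest {
    String firstName;
    String lastName;
    String type; // e.g. "P"
}

class CustomerTypeDto {
    String type;
    String description;
}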

Is there a way to check for new fields in the protobuf message?

What I want to do is to validate the data inside a protobuf message before I send it to an external network. This is providing a security check.
The problem is that protobufs allow sending additional fields using an updated proto file, which allows backwards compatibility.
What this means is that when I go to check a message, my autogenerated code parses the object but drops the unknown fields. So the transmitted bytes could contain information I don't know about.
A workaround would be to transmit only the version of the data I have parsed and checked, which would mean dropping the new fields. That's the right thing to do from a security standpoint, but I still won't know that someone is sending me a newer version of the messages. It would be nice to log that and be told I might need to update. I also want to communicate back to the sender that some of their data is being dropped.
Is there a way to know whether the format of the message I received differs from the format I expect to receive?
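For what it's worth, if the runtime retains unknown fields after parsing (recent protobuf Java versions do), a check along these lines seems possible (MyMessage is a stand-in for the generated class):

import com.google.protobuf.InvalidProtocolBufferException;
import com.google.protobuf.UnknownFieldSet;

// Sketch only: MyMessage stands in for the generated message class.
static void checkForUnknownFields(byte[] bytes) throws InvalidProtocolBufferException {
    MyMessage msg = MyMessage.parseFrom(bytes);

    // Fields that were on the wire but are not in the compiled-in schema end up
    // here, provided the runtime keeps them around instead of discarding them.
    UnknownFieldSet unknown = msg.getUnknownFields();
    if (!unknown.asMap().isEmpty()) {
        // Log the unexpected field numbers so someone knows an update is due,
        // and/or report back to the sender that this data is being dropped.
        System.out.println("Unknown field numbers: " + unknown.asMap().keySet());
    }
}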

How to do routing and avoid deserialization in gRPC / Protobuf?

In our Java app we need to accept a (large) gRPC message, extract a field, and then based on the value of that field forward the message on to another server.
I'm trying to avoid the overhead of completely deserializing the message before passing it on.
One way to do this would be to send the field as a separate query or header parameter, but gRPC doesn't support them.
Another way would be to extract just the field of interest from the payload, but Protobuf doesn't support partial or selective deserialization.
How else can I do this?
One way you can do this is by doing it on the server side. When the server is about to send a response, it can extract the field and set it as part of the initial headers sent. You can do this by using a ServerInterceptor to extract the field that you want from the response and add it to the Metadata.
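A rough sketch of that interceptor idea (RouteMessage and getRoutingKey() are hypothetical generated names; the wrapper holds the headers back until the first response message is available so a value extracted from it can still be added):

import io.grpc.ForwardingServerCall;
import io.grpc.Metadata;
import io.grpc.ServerCall;
import io.grpc.ServerCallHandler;
import io.grpc.ServerInterceptor;

public class RoutingHeaderInterceptor implements ServerInterceptor {

    static final Metadata.Key<String> ROUTING_KEY =
            Metadata.Key.of("x-routing-key", Metadata.ASCII_STRING_MARSHALLER);

    @Override
    public <ReqT, RespT> ServerCall.Listener<ReqT> interceptCall(
            ServerCall<ReqT, RespT> call, Metadata headers, ServerCallHandler<ReqT, RespT> next) {

        ServerCall<ReqT, RespT> wrapped =
                new ForwardingServerCall.SimpleForwardingServerCall<ReqT, RespT>(call) {

            private Metadata pendingHeaders;

            @Override
            public void sendHeaders(Metadata responseHeaders) {
                // Hold the headers back until the first message is available,
                // so a value extracted from the response body can still be added.
                pendingHeaders = responseHeaders;
            }

            @Override
            public void sendMessage(RespT message) {
                if (pendingHeaders != null) {
                    if (message instanceof RouteMessage) {
                        pendingHeaders.put(ROUTING_KEY, ((RouteMessage) message).getRoutingKey());
                    }
                    super.sendHeaders(pendingHeaders);
                    pendingHeaders = null;
                }
                super.sendMessage(message);
            }
        };

        return next.startCall(wrapped, headers);
    }
}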
Aside from that, Protocol buffers currently require that you parse the message before accessing the internal fields.

Can field tags be dropped from protobuf/thrift messages?

I understand that protobuf/thrift need unique numerical field tags to provide version compatibility. They provide this compatibility by serializing messages (kind of) in this fashion:
<tag1> <value1> ... <tagN> <valueN>
When deserializing, they pick up the tag value, look up the message schema, and know which field to fill the value into. In this way, as long as we add new fields with different tag values, the messages remain compatible.
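To make the <tag> <value> picture concrete: for a varint field with tag 1 and value 150, protobuf emits the three bytes 08 96 01, where the first byte encodes the field number and wire type. A small demonstration using the protobuf Java runtime:

import com.google.protobuf.CodedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class TagEncodingDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        CodedOutputStream coded = CodedOutputStream.newInstance(out);

        // Write "field number 1 = 150": the tag byte (field_number << 3 | wire_type)
        // comes first, followed by the varint-encoded value.
        coded.writeInt32(1, 150);
        coded.flush();

        for (byte b : out.toByteArray()) {
            System.out.printf("%02x ", b); // prints: 08 96 01
        }
    }
}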
But I don't think this is a very good design:
The tag value has to be encoded within the message. This has some overhead.
For example, when a client invokes an RPC method on a remote server many times, the tag values in every request/response are the same. It would be nice to send <tag1> <value1> ... <tagN> <valueN> only once, and after that send only <value1> ... <valueN>.
When changing the type of a field, we also need to change the tag value. Forgetting to do this will lead to bugs.
Developers have to ensure tag values are unique. Usually people track the last used tag id and increase it when adding new fields. But when two people add fields in separate branches and then merge, it's hard to resolve the conflict.
I think a better design could be:
Create a compact schema for each message type, like this:
<field_name_1> <field_type_1> ... <field_name_N> <field_type_N> (sorted according to field_name)
To address issue 1, exchange the message schema before doing anything else. For the RPC example, the client will send its message schema before the first RPC; in the following RPCs it only sends <value_1> ... <value_N>. The server will have the message schema when a request arrives and will know how to deserialize it.
To address issue 2, when the field type is changed, the compact message schema changes too. Programs will be able to detect that the old and new schemas do not match and report an error.
To address issue 3, developers no longer need to take care of assigning unique tag values. They still need to take care of assigning unique field names, but this should be easier, and less likely to lead to merge conflicts.
Could this be a usable design? And what will be the problems of it?
I believe Apache Avro works like you describe, so perhaps you should try that.
However, I would argue that the upfront schema negotiation adds a huge amount of complication to the protocol which outweighs any benefit. It may seem easy enough in the simple case, but in a large-scale system where you have proxies (that don't know what they're proxying), dedicated storage servers, messages composed from pieces received from multiple senders with different protocol versions, etc., the complexity of tracking schema versions becomes a huge burden.

HTTP POST - nameless data VS named data

Our server A notifies a 3rd-party server B with an XML-formatted message, sent as an HTTP POST request. It is us who specify the message format and other aspects of the interaction.
We can specify that the XML is sent as
a) raw data (just the XML)
b) single POST parameter having some specific name (say, xml=XML)
The question is which way is better for the 3rd party in general, if we don't know the platform and language they are using.
I thought I had seen problems in certain languages with easily parsing the nameless raw data, though I don't remember any specific case. My colleague, meanwhile, insists that the parameter name is redundant, and that it's really better to send the raw data without any name.
If you don't need to send extra information in other POST parameters, the xml parameter name is redundant and unnecessary, as your teammate said. If the 3rd party expects only XML data, just send the raw data in the POST body with the correct MIME type and encoding, and don't overcomplicate things.
Getting the raw data is easy in most application server containers, so you don't need to worry about that; most of them use a Reader to get the received data and manipulate it.
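For example, reading the raw (nameless) XML body in a plain servlet is nothing more than this (a sketch, using the javax.servlet API):

import java.io.BufferedReader;
import java.io.IOException;
import java.util.stream.Collectors;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class NotificationServlet extends HttpServlet {

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        // With a raw body there is no parameter name to look up;
        // the whole request body is the XML document.
        String xml;
        try (BufferedReader reader = req.getReader()) {
            xml = reader.lines().collect(Collectors.joining("\n"));
        }

        // ... validate/parse the XML here ...

        resp.setStatus(HttpServletResponse.SC_NO_CONTENT);
    }
}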
