Difference in proto messages based on the relative positions of their attributes - protocol-buffers

What is the difference if an attribute in a protobuf message is the first or the second member? The Request proto message I have has 2 fields, and I have been asked to interchange the positions of the attributes:
message SomeRequest {
  SomeMessage1 message1 = 1;
  SomeMessage2 message2 = 2;
}
Changed to:
message SomeRequest {
  SomeMessage2 message2 = 1;
  SomeMessage1 message1 = 2;
}
What could be the possible reasons for such advice? Is message2 expected to be searched more predominantly?
Also, can I have a few more scenarios to understand it better?
Thanks in advance!!

Ultimately, whoever asked for this change - maybe ask them for their reasons?
There are scenarios where lower field numbers are cheaper (space-wise) and so should be preferred, but fields 1 and 2 are identical in terms of space, so this isn't a concern here.
If this was for byte compatibility with a pre-existing similar type, it might be a reasonable request.
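To see why fields 1 and 2 cost exactly the same, here is a minimal Java sketch (class and method names are just illustrative) of how the field header - the only part of the encoding that depends on the field number - is sized:
// A protobuf field header is varint((fieldNumber << 3) | wireType);
// for an embedded message the wire type is 2 (length-delimited).
public class FieldHeaderSize {
    static int headerBytes(int fieldNumber) {
        int header = (fieldNumber << 3) | 2;
        int bytes = 0;
        do { bytes++; header >>>= 7; } while (header != 0); // varint: 7 payload bits per byte
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(headerBytes(1));  // 1 byte
        System.out.println(headerBytes(2));  // 1 byte - identical to field 1
        System.out.println(headerBytes(16)); // 2 bytes - only field numbers 16+ cost more
    }
}
Both fields produce a single header byte either way, so the swap makes no difference to encoded size.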

Related

Can I define an ordered mixed message array in proto3?

I want to define in proto3 an ordered list of unrelated classes (messages) like this:
Frog
Dirt
Air
Computer 1
Computer 2
Politics
Is it possible? I can also live with having a base class (base message) if that exists in proto3... It's not clear to me if the feature set of proto3 allows this. Thanks!
The typical way of representing this would be
message Wrapper {
  oneof Thing {
    Frog frog = 1;
    //...
    Politics politics = 6;
  }
}
and use repeated Wrapper for the list/array. There is no one-step repeated oneof.
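For illustration, assuming Java classes generated from a schema like the one above (the getThingCase/ThingCase names are what the standard Java codegen would produce for a oneof called Thing), the ordered mixed list could be built and read back like this:
import java.util.ArrayList;
import java.util.List;

// Sketch only: Wrapper, Frog and Politics are the classes generated from the schema above.
List<Wrapper> things = new ArrayList<>();
things.add(Wrapper.newBuilder().setFrog(Frog.newBuilder().build()).build());
things.add(Wrapper.newBuilder().setPolitics(Politics.newBuilder().build()).build());

// Order is preserved; the generated case enum says which member each Wrapper holds.
for (Wrapper w : things) {
    switch (w.getThingCase()) {
        case FROG:     /* handle w.getFrog() */     break;
        case POLITICS: /* handle w.getPolitics() */ break;
        default:       /* THING_NOT_SET */          break;
    }
}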
Alternatively, you could just use
repeated Frog frogs = 1;
//...
repeated Politics politics = 6;
However, this second layout cannot preserve the order between different kinds of elements.

proto3 - oneof vs fields with identifier

I am writing a proto3 class for an object which currently has around 2 variations and will grow to 6 or 7. Only one of them would be used in a message. These variations do not share common fields. They will be encoded as a submessage in a parent message. These messages would be written once and read tens of thousands of times.
I was wondering what would be the most performant way, memory- and parsing-time-wise, to achieve this, so that performance is not lost as more variations are added.
Consider the following variations.
message B1 {
  repeated string value = 1;
  bool hasMeta = 2;
}
message B2 {
  repeated int32 value = 1;
  map<string, string> foo = 2;
}
First option: define a oneof field that refers to the specific subtype.
message P1 {
  oneof parents {
    B1 boo = 1;
    B2 baz = 2;
    // add more variations here in future...
  }
  // other non-related fields...
}
Second option: define an integer that acts as an identifier for the available variation. At runtime this integer can be used to determine which variation has been set (another way is to null-check the variations and use the first non-null one).
message P1 {
  int32 type = 1;
  B1 boo = 2;
  B2 baz = 3;
  // other non-related fields...
}
I am particularly interested in the wire size and performance.
In the second option, considering only one of the variations would be set (enforced in the app layer), will the wire size be larger than in the first? Is memory reserved for null fields as well?
With respect to processing power and wire size, the oneof method is slightly better than the message where you define the variation type yourself. Protobuf always serializes the tag number before a nested message, so for the oneof message it is not required to serialize a variable like type, making its wire size slightly smaller than the second message definition.
With respect to memory allocation, this depends highly on the programming language you are using and how it has implemented oneofs and nested messages. If I am not mistaken, the default C++ implementation dynamically allocates memory for submessages, so I suspect no difference there between either of your suggestions. Looking at nanopb, however, oneofs are implemented as unions, allocating memory only for the larger message, while your second option would allocate memory for both B1 and B2.
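To make the first point concrete, here is a rough Java sketch assuming classes generated from the first option (getParentsCase/ParentsCase are the names the standard Java codegen would derive from a oneof called parents); the variation is recovered from the field tag on the wire, so no separate type field ever needs to be serialized:
P1 p = P1.newBuilder()
        .setBoo(B1.newBuilder().addValue("x").setHasMeta(true))
        .build();

switch (p.getParentsCase()) {
    case BOO:             /* it holds a B1 */ break;
    case BAZ:             /* it holds a B2 */ break;
    case PARENTS_NOT_SET: /* nothing set   */ break;
}
// The second option would additionally serialize the int32 type field
// (a one-byte tag plus a varint value), and you would have to keep it
// in sync with whichever submessage is set.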

Is using >= and <= to specify enum values good code practice?

I am working on a project with about 8 other people, and would like to know the best code practice here, given that other people will be working on this code for years to come.
Say I have an enum with 10 values:
typedef enum {
  Tag1 = 1,
  Tag2,
  Tag3,
  Tag4,
  Tag5,
  Tag6,
  Tag7,
  Tag8,
  Tag9,
  Tag10
} Tag;
If I wanted to check whether a tag is equal to Tag6, Tag7, Tag8, Tag9, or Tag10, is it good practice to use a comparison like:
if(myTag >= Tag6 && myTag <= Tag10) {
//Do something
}
Or is it better to use an OR and check for each tag?
Using >= and <= looks nicer and is less clunky, but if, down the line, someone were to insert a new Tag between Tag7 and Tag8, it would mess up all the logic.
Can I expect that someone wouldn't add a new Tag between other Tags?
Yes, but only for enums that express a scale of values, for instance:
enum Priority {
  None = 0,
  Low,
  Medium,
  High,
  Critical
}
Then this code makes sense and is readable:
if(message.Priority >= Priority.Medium) {
// Notify user
}
If the enum doesn't express a scale like this then avoid using < or > as they can be rather confusing. Use bit flags instead.
Flags enums use binary values so that values can be combined:
enum UserAudiences {
  // Basic values:          // binary
  None = 0,                 // 0000
  Client = 1,               // 0001
  Employee = 2,             // 0010
  Contractor = 4,           // 0100
  Key = 8,                  // 1000
  // Combined values:       // binary
  KeyClient = 9,            // 1001 : Key + Client
  BoardMember = 10,         // 1010 : Key + Employee
  CounterParty = 5,         // 0101 : Client + Contractor
  BusinessPartner = 13      // 1101 : Key + Client + Contractor
}
Then, when checking for a combined enum value, we look at the binary number and whether the appropriate bit is set. For instance, if we want to check for UserAudiences.Employee, we can just look for the bit that represents 2; if it is set, then we have one of the enum values that includes it:
if((message.Audience & UserAudiences.Employee) != 0) {
// Post on intranet
} else {
// Send externally
}
There's no way to set that bit through any combination of Key, Client or Contractor values; it can only be set if Employee is one of the 'source' values.
Most languages have helpers for this (or you can write your own):
if(message.Audience.HasFlag(UserAudiences.Employee)) { ...
The maths could work in any base - you could use 1, 10, 100, etc in decimal. However, you'd need much bigger numbers much sooner.
Finally, there's a convention to use singular names for regular enums, and plural names for flagged enums, hinting to the programmer whether to use equality or bitwise checks.
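For example, a hand-rolled Java version of the same idea, using plain int masks since Java enum constants don't carry flag values natively (all names here are illustrative):
public class AudienceFlags {
    // Basic bit values, mirroring the UserAudiences example above.
    static final int CLIENT     = 1;  // 0001
    static final int EMPLOYEE   = 2;  // 0010
    static final int CONTRACTOR = 4;  // 0100
    static final int KEY        = 8;  // 1000

    // The kind of "has flag" helper you can write yourself.
    static boolean hasFlag(int audience, int flag) {
        return (audience & flag) != 0;
    }

    public static void main(String[] args) {
        int boardMember = KEY | EMPLOYEE; // combined value: binary 1010 = 10

        if (hasFlag(boardMember, EMPLOYEE)) {
            System.out.println("Post on intranet");
        } else {
            System.out.println("Send externally");
        }
    }
}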
Can I expect that someone wouldn't add a new Tag between other Tags?
I wouldn't bet on it. Unless the enum's ordinal/underlying values have some inherent meaning or order, I would avoid relying on them too much.
I would only use range checks if I actually wanted somebody to be able to insert additional enums without adapting all the checks. This is probably a rather rare case, though. Keith gives a good example with the Priority enum; another example I can think of is log levels.
The exact syntax depends on the language of course but I would usually consider something like this as most readable:
if(myTag in [Tag6, Tag7, Tag8]) {
// ...
}
Or, even better, use descriptive variable names which make it obvious what the other tags are:
topTags = [Tag6, Tag7, Tag8]
if(myTag in topTags) {
// ...
}
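In Java, for instance, the same membership check could be written with an EnumSet, assuming Tag were declared as a Java enum with the same constants (a sketch, not a full program):
import java.util.EnumSet;

EnumSet<Tag> topTags = EnumSet.of(Tag.Tag6, Tag.Tag7, Tag.Tag8);
if (topTags.contains(myTag)) {
    // ...
}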

How bytes are used to store information in protobuf

I am trying to understand protocol buffers. Here is the sample; what I am not able to understand is how bytes are used in the following messages. I don't know what the numbers 1, 2, 3 are used for.
message Point {
  required int32 x = 1;
  required int32 y = 2;
  optional string label = 3;
}
message Line {
  required Point start = 1;
  required Point end = 2;
  optional string label = 3;
}
message Polyline {
  repeated Point point = 1;
  optional string label = 2;
}
I read the following paragraph in the Google protobuf documentation but am not able to understand what is being said here. Can anyone help me understand how bytes are used to store the information?
The " = 1", " = 2" markers on each element identify the unique "tag" that field uses in the binary encoding. Tag numbers 1-15 require one less byte to encode than higher numbers, so as an optimization you can decide to use those tags for the commonly used or repeated elements, leaving tags 16 and higher for less-commonly used optional element.
The general form of a protobuf message is that it is a sequence of pairs of the form:
field header
payload
For your question, we can largely forget about the payload - that isn't the bit that relates to the 1/2/3 field numbers and the 1-15 guidance - all of that is in the field header. The field header is a "varint"-encoded integer; "varint" uses the most-significant bit as an optional continuation bit, so small values (<=127, assuming unsigned and not zig-zag) require one byte to encode, while larger values require multiple bytes. In other words, you get 7 useful bits to play with before you need to set the continuation bit, requiring at least 2 bytes.
However! The field header itself is composed of two things:
the wire-type
the field-number / "tag"
The wire-type is the lowest 3 bits of the header, and indicates the fundamental format of the payload - "length-delimited", "64-bit", "32-bit", "varint", "start-group", "end-group". That means that of the 7 useful bits we had, only 4 are left; 4 bits is enough to encode the field numbers 1 through 15. This is why field-numbers <= 15 are suggested (as an optimisation) for your most common elements.
In your question, the 1 / 2 / 3 is the field-number; at the time of encoding this is left-shifted by 3 and composed with the payload's wire-type; then this composed value is varint-encoded.
Protobuf stores a message like a map from an id (the =1, =2, which they call tags) to the actual value. This makes it easier to extend than if it transferred data like a struct with fixed offsets. So a Point message, for instance, would look something like this at a high level:
1 -> 100,
2 -> 500
which is then interpreted as x=100, y=500 and label=not set. At a lower level, protobuf serializes this tag-value mapping in a highly compact format which, among other things, stores integers with variable-length encoding. The paragraph you quoted just highlights exactly this in the case of tags, which can be stored more compactly if they are < 16; the same holds, for instance, for integer values in your protobuf definition.
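Putting the two answers together, here is a minimal hand-rolled Java sketch (no protobuf library involved; the names are only for illustration) that encodes Point{x=100, y=500, label unset} exactly as the wire format does:
import java.io.ByteArrayOutputStream;

public class PointWireFormat {
    // Write an unsigned varint: 7 payload bits per byte, MSB = continuation bit.
    static void writeVarint(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarint(out, (1 << 3) | 0); // field 1 (x), wire type 0 (varint) -> 0x08
        writeVarint(out, 100);          // x = 100 -> 0x64
        writeVarint(out, (2 << 3) | 0); // field 2 (y), wire type 0 -> 0x10
        writeVarint(out, 500);          // y = 500 -> 0xF4 0x03
        // label (field 3) is simply absent when it is not set.

        for (byte b : out.toByteArray()) {
            System.out.printf("%02X ", b & 0xFF);
        }
        // Prints: 08 64 10 F4 03
    }
}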

Comparison of streaming message implementations in protobuf

What are the trade-offs, advantages and disadvantages of each of these streaming implementations where multiple messages of the same type are encoded?
Are they any different at all? What I want to achieve is to store a vector of boxes in a protobuf.
Impl 1:
package foo;
message Boxes {
  message Box {
    required int32 w = 1;
    required int32 h = 2;
  }
  repeated Box boxes = 1;
}
Impl 2:
package foo;
message Box {
  required int32 w = 1;
  required int32 h = 2;
}
message Boxes {
  repeated Box boxes = 1;
}
Impl 3: Stream multiple of these messages into the same file.
package foo;
message Box {
  required int32 w = 1;
  required int32 h = 2;
}
Marc Gravell's answer is certainly correct, but one point he missed is:
options 1 & 2 (the repeated option) will serialise/deserialise all the boxes at once;
option 3 (multiple messages in the file) will serialise/deserialise box by box.
If using Java, you can use delimited messages, which add a varint length at the start of each message (see the sketch at the end of this answer).
Most of the time it will not matter whether you use a repeated field or multiple messages, but if there are millions/billions of boxes, memory will be an issue for options 1 and 2 (repeated), and option 3 (multiple messages in the file) would be the best choice.
So in summary:
If there are millions/billions of Boxes, use option 3 (multiple messages in the file).
Otherwise use one of the repeated options (1/2), because it is simpler and supported across all Protocol Buffers versions.
Personally, I would like to see a "standard" multiple-message format.
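A rough sketch of option 3 in Java; Box is assumed to be the class generated from the Impl 3 schema (its package and outer class name depend on your java_package/java_outer_classname options):
import java.io.FileInputStream;
import java.io.FileOutputStream;

public class BoxStream {
    public static void main(String[] args) throws Exception {
        // Write boxes one at a time; each message is prefixed with its varint length.
        try (FileOutputStream out = new FileOutputStream("boxes.bin")) {
            for (int i = 1; i <= 3; i++) {
                Box.newBuilder().setW(i).setH(i * 2).build().writeDelimitedTo(out);
            }
        }

        // Read them back box by box, without holding the whole collection in memory.
        try (FileInputStream in = new FileInputStream("boxes.bin")) {
            Box box;
            while ((box = Box.parseDelimitedFrom(in)) != null) {
                System.out.println(box.getW() + " x " + box.getH());
            }
        }
    }
}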
1 & 2 only change where / how the types are declared. The work itself will be identical.
3 is more interesting: you can't just stream Box after Box after Box, because the root object in protobuf is not terminated (to allow concat === merge). If you only write Boxes, when you deserialize you will have exactly one Box with the last w and h that were written. You need to add a length-prefix; you could do that arbitrarily, but: if you happen to choose to "varint"-encode the length, you're close to what the repeated gives you - except the repeated also includes a field-header (field 1, type 2 - so binary 1010 = decimal 10) before each "varint" length.
If I were you, I'd just use the repeated for simplicity. Which of 1 / 2 you choose would depend on personal choice.
