About protobuf repeating varint decoding - protocol-buffers

I used Charles to capture a protobuf HTTP message from another iOS application. Now I want to generate the same HTTP packet, but my output is not the same.
My protobuf file:
message TTCreateConversationBody
{
    repeated uint32 imUid1 = 2;
}
I'm using Objective-C:
TTCreateConversationBody *body = [TTCreateConversationBody new];
GPBUInt32Array *arr = [[GPBUInt32Array alloc] initWithCapacity:2];
[arr addValue:123123];
[arr addValue:9999999];
body.imUid1Array = arr;
Charles decodes my output as a length-delimited string.
Here is the original app's raw data (first line) and mine (second line):
8A-26-10-08-01-10-AE-F7-81-80-9F-03-10-D4-E4-82-F0-D2-01
8A-26-10-08-01-12-0C-F9-F6-C3-9D-FA-02-AE-F7-81-80-9F-03
What's the correct protobuf file format?

They're actually both valid... ish.
This comes down to "packed" fields; without "packed", your two integers are encoded as
[header, varint][value][header, varint][value]
[10][AE-F7-81-80-9F-03][10][D4-E4-82-F0-D2-01]
whereas with "packed", it becomes
[header, string][length][value][value]
[12][0C][F9-F6-C3-9D-FA-02][AE-F7-81-80-9F-03]
note: the actual values look very different in the two runs... I'm assuming that is accidental.
To quote from the specification:
Protocol buffer parsers must be able to parse repeated fields that were compiled as packed as if they were not packed, and vice versa. This permits adding [packed=true] to existing fields in a forward- and backward-compatible way.
So: serializers should write the layout that matches whether the field is declared "packed" or not, but decoders must be able to handle either. Some libraries, when writing a field that is declared "packed", work out which layout will actually be shorter and choose on that basis. In practice, this can be approximated as "use packed encoding whenever there are at least two items".
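To check that the two captures differ only in layout, the bytes can be walked by hand. This is a minimal sketch in Python, not the code of any protobuf library: the helper names read_varint and decode_repeated are made up for illustration, and only the two wire types that appear in the captures are handled.

```python
def read_varint(buf, pos):
    """Decode one base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7


def decode_repeated(buf, field_number):
    """Collect the varints of one repeated field, packed or not."""
    values = []
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:  # varint: one value per header (unpacked layout)
            value, pos = read_varint(buf, pos)
            if number == field_number:
                values.append(value)
        elif wire_type == 2:  # length-delimited: possibly a packed run
            length, pos = read_varint(buf, pos)
            end = pos + length
            if number == field_number:
                while pos < end:
                    value, pos = read_varint(buf, pos)
                    values.append(value)
            pos = end
        else:
            raise ValueError("wire type not handled in this sketch")
    return values


# Only the part of each capture that follows the shared 8A-26-10-08-01 prefix.
unpacked = bytes.fromhex("10AEF781809F0310D4E482F0D201")
packed = bytes.fromhex("120CF9F6C39DFA02AEF781809F03")

unpacked_values = decode_repeated(unpacked, 2)
packed_values = decode_repeated(packed, 2)
print(unpacked_values, packed_values)
```

The same decoder returns two values for field 2 from either capture, and the run AE-F7-81-80-9F-03 decodes to the same number in both. Each value occupies six bytes, more than the five a 32-bit varint can need, so the captured field may really be a 64-bit type; that is an observation from the bytes, not something the thread confirms.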

Related

how to put multiple protobufs on a wire

We have a communication channel on which we send protobufs. In order to be able to send more than one type of protobuf, we double-serialise:
message Coprolite {
    enum Bezoar {
        KBezoarUndef = 0;
        KBezoarFriedEggs = 1;
        KBezoarHam = 2;
    }
    Bezoar payload_type = 1;
    bytes payload = 2;
}
If I have a FriedEggs protobuf, I serialise it, assign it to the payload of a Coprolite, set the payload_type to KBezoarFriedEggs, serialise the Coprolite, and send it on its way.
On receipt, I deserialise, check what I've got, and deserialise that.
This works on all of our platforms. I've not, however, found examples of others doing it this way (nor any other way, really). So this suggests I should ask for advice. Is there a better strategy, or a reason that I should be wary of this?
If you want to avoid having to set payload_type, you can use a oneof. A oneof implicitly serializes the payload type by putting the field's tag number in front of the payload, just like any other serialized field, so you have less administrative code to write.
There is, however, one advantage of your approach over a oneof, depending on your programming language and on how oneofs and byte arrays are implemented on your platform and in your protobuf library. A oneof implemented as a union might allocate memory the size of the largest nested message. So, depending on your situation, dynamically allocating the bytes array might use less memory when you send a lot of small messages and only occasionally a big one.
There are two common approaches here; the simplest (especially when the types are predictable in advance) is usually oneof:
message Coprolite {
    oneof payload_type {
        FriedEggs eggs = 1;
        Ham ham = 2;
    }
}
This acts like a discriminated union, where you can check the embedded type.
In some niche scenarios, you may prefer to use Any
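To make the comparison concrete, here is a rough sketch in Python that hand-encodes both layouts. It does not use a protobuf library; the helpers encode_varint, varint_field and length_delimited are invented for illustration, and the two bytes standing in for a serialised FriedEggs message are made up.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def varint_field(field_number, value):
    """Encode one varint field (wire type 0): key, then value."""
    return encode_varint(field_number << 3) + encode_varint(value)


def length_delimited(field_number, payload):
    """Encode one length-delimited field (wire type 2): key, length, payload."""
    key = (field_number << 3) | 2
    return encode_varint(key) + encode_varint(len(payload)) + payload


# Stand-in for an already serialised FriedEggs message (contents are made up).
fried_eggs = b"\x08\x03"

# Envelope from the question: payload_type = KBezoarFriedEggs (1), payload = bytes.
envelope = varint_field(1, 1) + length_delimited(2, fried_eggs)

# oneof from the answer: FriedEggs is field 1, so the key alone names the type.
oneof = length_delimited(1, fried_eggs)

print(envelope.hex(), oneof.hex())
```

Both forms carry the nested message as a length-delimited field; the oneof simply lets the field number do the job of payload_type, which saves the two bytes of the enum field in this example.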

What is the point of google.protobuf.StringValue?

I've recently encountered all sorts of wrappers in Google's protobuf package. I'm struggling to imagine the use case. Can anyone shed the light: what problem were these intended to solve?
Here's one of the documentation links: https://developers.google.com/protocol-buffers/docs/reference/csharp/class/google/protobuf/well-known-types/string-value (it says nothing about what this can be used for).
One thing that differs in behavior between this and a simple string type is that this field will be written less efficiently (a couple of extra bytes, plus a redundant memory allocation). For the other wrappers the story is even worse, since the repeated variants of those fields will be written inefficiently (Google's official Protobuf serializer doesn't support packed encoding for non-numeric types).
Neither seems to be desirable. So, what's this all about?
There are a few reasons, mostly to do with where these are used; see wrappers.proto, where StringValue is defined, and the related struct.proto.
StringValue can be null, whereas a string often can't be in a language interfacing with protobufs. For example, in Go strings are always set; the "zero value" for a string is "", the empty string, so it's impossible to distinguish between "this value is intentionally set to the empty string" and "there was no value present". StringValue can be null and so solves this problem. The same distinction is especially important when representing arbitrary JSON, as the Struct message in struct.proto does: there you need to distinguish between a JSON key which was set to the empty string and a JSON key which wasn't set at all.
Also, if you look at struct.proto, you'll see that its values aren't separate fully fledged message types: they're all cases of message Value, which has a oneof kind { number_value, string_value, bool_value, ... }. By using a oneof, struct.proto can represent a variety of different values in one field. Again, this makes sense considering what struct.proto is designed to handle, namely arbitrary JSON: you don't know ahead of time what type of value a given JSON key has.
In addition to George's answer, you can't use a Protobuf primitive as the parameter or return value of a gRPC procedure.
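The presence argument and the "couple of extra bytes" from the question can both be seen on the wire. The sketch below hand-encodes a plain proto3 string field and a StringValue field in Python; the helper names and the choice of field number 1 for the outer field are assumptions made for the example, and no protobuf library is involved.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def length_delimited(field_number, payload):
    """Encode one length-delimited field (wire type 2): key, length, payload."""
    key = (field_number << 3) | 2
    return encode_varint(key) + encode_varint(len(payload)) + payload


def plain_string_field(field_number, text):
    """proto3 string field: the empty string is the default and is not written."""
    data = text.encode("utf-8")
    return length_delimited(field_number, data) if data else b""


def wrapped_string_field(field_number, text):
    """StringValue field: None means absent; otherwise the wrapper message is
    written, with the string as field 1 inside it."""
    if text is None:
        return b""
    return length_delimited(field_number, plain_string_field(1, text))


for label, data in [
    ("string ''", plain_string_field(1, "")),
    ("string 'hi'", plain_string_field(1, "hi")),
    ("StringValue absent", wrapped_string_field(1, None)),
    ("StringValue ''", wrapped_string_field(1, "")),
    ("StringValue 'hi'", wrapped_string_field(1, "hi")),
]:
    print(f"{label:20} {data.hex() or '(no bytes)'}")
```

A plain empty string and an absent string produce the same (empty) output, while an empty StringValue still writes its key and a zero length, which is what lets a reader tell "set to empty" from "not set". The price is two extra bytes per non-empty value in this example.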

Is there an off the shelf binary format that allows string caching

I am investigating migrating a highly customized and efficient binary format to one of the available binary formats. The data is stored on low-powered mobile devices, among other places, so performance is an important requirement.
An advantage of the current format is that all strings are stored in a pool. This means that we don't repeat the same string hundreds of times in the file; we read it only once during deserialization, and all objects reference it by its index. It also means that we keep only one copy in memory. So a lot of advantages :)
I was not able to find a way for capnproto or flatbuffers to support this. Or would I need to build a layer on top, and use integer indexes to strings explicitly in the generated objects?
Thank you!
FlatBuffers supports string pooling. Simply serialize a string once, then refer to that string multiple times in other objects. The string will only occur in memory once.
Simplest example, schema:
table MyObject { name: string; id: string; }
code (C++):
FlatBufferBuilder fbb;
auto s = fbb.CreateString("MyPooledString");
// Both string fields point to the same data:
auto o = CreateMyObject(fbb, s, s);
fbb.Finish(o);
With Cap'n Proto, you can always do this manually, like:
struct MyMessage {
  stringTable @0 :List(Text);
  # Now encode string fields as integer indexes into the string table.
  someString @1 :UInt32;
  otherString @2 :UInt32;
}
Cap'n Proto could in theory allow multiple pointers to point at the same object, but currently prohibits this for security reasons: it would be too easy to DoS servers that don't expect it by sending messages that are cyclic or contain lots of overlapping references. See the section on amplification attacks in the docs.
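The manual string-table layer is independent of the serialization format. As a rough sketch in Python, assuming records are plain dictionaries and reusing the field names someString and otherString from the schema above (the functions build_string_table and resolve are hypothetical helpers, not part of any library):

```python
def build_string_table(records, string_fields):
    """Replace string fields with indexes into a shared, de-duplicated table."""
    table = []
    index_of = {}
    encoded = []
    for record in records:
        row = dict(record)
        for field in string_fields:
            text = row[field]
            if text not in index_of:
                index_of[text] = len(table)
                table.append(text)
            row[field] = index_of[text]
        encoded.append(row)
    return table, encoded


def resolve(table, encoded, string_fields):
    """Inverse step: look each index up in the table again."""
    return [
        {k: (table[v] if k in string_fields else v) for k, v in row.items()}
        for row in encoded
    ]


records = [
    {"someString": "alpha", "otherString": "beta"},
    {"someString": "beta", "otherString": "alpha"},
    {"someString": "alpha", "otherString": "alpha"},
]
fields = ("someString", "otherString")

table, encoded = build_string_table(records, fields)
print(table)
print(encoded)
```

Each distinct string is stored once in the table and every record carries only small integers, at the cost of one extra lookup per field when reading.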

Why are there no custom default values in proto3?

The proto2 version of Protocol Buffers allows you to specify default values for message elements:
optional double scaling_factor = 3 [default = 1.0];
Why is this no longer possible in proto3? I consider this a neat feature to save additional bytes on the wire without the need of writing any wrapper code.
My understanding is that proto3 no longer allows you to detect field presence and no longer supports non-zero default values because this makes it easier to implement protobufs in terms of "plain old structs" in various languages, without the need to generate accessor methods. This is perceived as making Protobuf easier to use in those languages.
(I personally think that languages which lack accessors and properties aren't very good languages and protobuf should not design down to them, but it's not my project anymore.)
This is a workaround rather than a direct answer to your question, but I've found myself using the optional wrapper values from wrappers.proto and then setting the default value myself programmatically when I absolutely must know whether a value was the default or was explicitly set.
It's not optimal that your code has to enforce the value instead of the generated code itself, but if you own both sides, it's at least a viable alternative to having no idea whether the value was the default or explicitly set as such, especially when looking at a bool set to false.
I am unclear how this affects bytes on the wire. For the instances where I've used it, message length was not a design constraint.
Proto File
import "google/protobuf/wrappers.proto";
google.protobuf.BoolValue optional_bool = 1;
Java code
// load or receive the message here; messages are immutable, so rebuild via a builder
if (!message.hasOptionalBool())
    message = message.toBuilder().setOptionalBool(BoolValue.newBuilder().setValue(true)).build();
In my autogenerated .pb.cc file I see a few places like this:
if (this->myint() != 0) {
and a few like this:
myint_ = 0;
So why not enable default values and generate
static ::google::protobuf::int32 myint_defaultvalue = 5;
...
if (this->myint() != myint_defaultvalue) {
...
...
myint_ = myint_defaultvalue;
...
instead?
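The idea in that generated-code fragment, skipping the field when it equals a custom default and starting from that default when reading, can be sketched in a few lines of Python. This is only an illustration of the proposal, not how protoc behaves and not a statement about why the feature was dropped; the constants DEFAULT and FIELD and both functions are invented for the example, and the decoder assumes the message contains only varint fields.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def read_varint(buf, pos):
    """Decode one varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7


DEFAULT = 5  # hypothetical custom default, like myint_defaultvalue above
FIELD = 1    # hypothetical field number for myint


def encode_myint(value, default=DEFAULT):
    """Write the field only when it differs from the agreed default."""
    if value == default:
        return b""
    return encode_varint(FIELD << 3) + encode_varint(value)


def decode_myint(buf, default=DEFAULT):
    """Start from the default and overwrite it if the field is on the wire."""
    value = default
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_value, pos = read_varint(buf, pos)
        if key >> 3 == FIELD:
            value = field_value
    return value


print(encode_myint(5).hex() or "(no bytes)", encode_myint(0).hex())
print(decode_myint(encode_myint(5)), decode_myint(encode_myint(0)))
```

With a default of 5, the value 5 costs no bytes but 0 now has to be written. The scheme also only works if both sides agree on the default: decode_myint(encode_myint(5), default=0) returns 0, not 5.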

protocol buffer uint32 field with data always in [0,255]

In a Google protocol buffer, I'm going to use a field to store values that will be integers in [0,255]. From http://code.google.com/apis/protocolbuffers/docs/proto.html#scalar, it looks like the uint32 will be the appropriate value type to use. Despite the field being able to hold up to 32-bit integers, those extra bits will not be wasted in my case due to the variable length encoding. (Correct me if I'm wrong up to here.)
My question is: how should I indicate that the reader of a serialized message can assume that the largest value in that field will be 255? Just a comment in the protocol buffer specification? Is there any other way?
In .proto there is no such specification; you must simply document it (and presumably cast it appropriately at the consuming code).
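For a feel of what the variable-length encoding costs in this range, and what "cast it appropriately at the consuming code" might look like, here is a small Python sketch. The function checked_byte is a hypothetical consumer-side guard, not part of any protobuf API, and the sizes count only the value bytes, not the field key.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def checked_byte(value):
    """Consumer-side check: reject anything outside the documented 0..255 range."""
    if not 0 <= value <= 255:
        raise ValueError(f"expected a value in [0, 255], got {value}")
    return value


# Number of value bytes a uint32 varint takes for a few sample values.
sizes = {v: len(encode_varint(v)) for v in (0, 127, 128, 255, 256, 2**32 - 1)}
print(sizes)
```

Values up to 127 take one byte and values from 128 to 255 take two, so the upper half of the range costs one byte more than a fixed-width byte would, but nowhere near the five bytes of a full 32-bit value.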
Aside: if you happen to be using the C# protobuf-net implementation, then you can do this by working outside a .proto definition (protobuf-net allows code-first):
[ProtoMember(3)] // <=== field number
public byte SomeValue {get;set;}
This is then obviously constrained to 0-255, but is encoded on the wire as you expect (like a uint32). It also does a checked conversion when deserializing, to sanity-check the values.
In .proto, the above is closest to:
optional uint32 someValue = 3;
