How to modify a serialized protobuf byte array in place?

Consider a protobuf message:
message DataMessage {
  int32 custId = 1;
  string uuid = 2;
  int32 version = 3;
  string firmName = 4;
  google.protobuf.Timestamp date = 5;
  int32 accountNo = 6;
  string firmName = 7;
  bytes payload = 8;
}
I populate this, marshal it to a byte array, and publish it as a Kafka message. All this works wonderfully.
Now to the question: the Kafka handler gets the byte array. I'd like to modify fields 2 and 3 ONLY, without having to unmarshal the byte array back into a DataMessage, modify fields 2 and 3, and then marshal it back into another byte array so it can be published to Kafka for the next hop. Field 2 is a string, but it holds an ISO-formatted GUID, so the modification will not increase or decrease the field length.
proto.Buffer doesn't seem to facilitate this. It seems to allow encode/decode on single fields at a time.
The payload (field 8) is 99% of the message by size and won't be modified. I'd like to skip all the temporary copies and marshal/unmarshal work to modify the 1%.
Possible? I guess if I knew the offsets into the byte array for fields 2 and 3 I could modify the bytes in place. However, this undermines the integrity of the marshaled bytes to some extent. It might be made easier by using fixed32 instead of int32, so that the offsets into the byte array are consistent.
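The offset idea can be sketched by walking the wire format without building a message object. The following is a minimal sketch in Python, not a library API: read_varint, encode_varint and field_spans are hypothetical helpers, the sample message is hand-encoded, and it only covers top-level fields that appear once. It assumes the replacement values encode to exactly the same number of bytes, which holds for a fixed-length GUID but for an int32 only while the varint width stays the same.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint at pos; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7


def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        b = value & 0x7F
        value >>= 7
        if value:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)


def field_spans(buf):
    """Map field number -> (wire_type, value_start, value_end), top level only."""
    spans, pos = {}, 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        number, wire_type = key >> 3, key & 0x07
        start = pos
        if wire_type == 0:        # varint
            _, pos = read_varint(buf, pos)
        elif wire_type == 1:      # 64-bit
            pos += 8
        elif wire_type == 2:      # length-delimited: skip the length prefix
            length, start = read_varint(buf, pos)
            pos = start + length
        elif wire_type == 5:      # 32-bit
            pos += 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        spans[number] = (wire_type, start, pos)
    return spans


# Hand-encoded stand-in for a marshaled DataMessage (fields 1, 2, 3 and 8 only).
old_uuid = b"00000000-0000-0000-0000-000000000000"
new_uuid = b"123e4567-e89b-12d3-a456-426614174000"
payload = b"\xab" * 1000
msg = bytearray(
    b"\x08\x07"                                          # field 1 (custId) = 7
    + b"\x12" + encode_varint(len(old_uuid)) + old_uuid  # field 2 (uuid)
    + b"\x18\x05"                                        # field 3 (version) = 5
    + b"\x42" + encode_varint(len(payload)) + payload    # field 8 (payload)
)
original_length = len(msg)

spans = field_spans(msg)

# Field 2: overwrite the string bytes; the length prefix is left untouched.
_, u_start, u_end = spans[2]
assert u_end - u_start == len(new_uuid)
msg[u_start:u_end] = new_uuid

# Field 3: only safe while the new varint is as wide as the old one.
_, v_start, v_end = spans[3]
new_version = encode_varint(6)
assert len(new_version) == v_end - v_start
msg[v_start:v_end] = new_version
```

The scan touches only the field headers and length prefixes, so the large payload is skipped in one step rather than copied. If a changed value could need a different encoded width, the bytes after it would have to shift, and a normal unmarshal/marshal round trip (or a fixed-width type such as fixed32) is the safer choice.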

Related

What is the memory difference between serializing a value as string and serializing a value as other data type in protobuf

Let's say I have a variable a, to which I want to assign a decimal value.
If my proto file is
syntax = "proto3";
message Test {
  string a = 1;
}
how much memory will that take, and how much will the difference be if I change string a = 1 to float a = 1?
Is there documentation that shows how much memory is assigned to the different data types?

Read FT2332H FIFO Data

I tried to read the FIFO buffer in the FT2332H and it was successful, but the data arrives in a format that makes it difficult to process or plot. Here is an example; I use the ftd2xx library:
# d is an already-opened ftd2xx device handle
while True:
    rxn = d.getQueueStatus()
    if rxn > 1024:
        print(bytearray(d.read(1024)))
The output is below. Each 4 is a byte received from the buffer. How to get each
bytearray(b'4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444')
This is the result without bytearray
print((d.read(1024)))
b'4444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444'

Assuming this is Python:
You can index each individual byte in the byte array with []:
my_buffer = bytearray(d.read(1024))
Now my_buffer[0] holds the value of the first byte in your byte array, represented as an integer in the range 0-255. To build a character array or string, you additionally need to convert this integer to a character (for example with chr()); ASCII is the typical mapping between an integer value and its character representation. The order of the bytes in your FIFO buffer depends on whatever is putting bytes into the FIFO on the non-USB side of the FT232. Many devices send data most-significant byte first, but you should verify this against that device's data sheet.
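As a rough illustration of the indexing and conversion described above, here is a small sketch. The variable raw is a stand-in for the result of d.read(1024), since the real call needs an attached device, and the assumption that the device sends ASCII text is only a guess based on the output shown in the question.

```python
# Stand-in for d.read(1024); replace with the real read when a device is attached.
raw = b"4" * 1024

my_buffer = bytearray(raw)

first_value = my_buffer[0]         # integer in 0-255; 52 is the ASCII code for "4"
first_char = chr(first_value)      # the same byte as a one-character string
values = list(my_buffer)           # plain list of integers, convenient for plotting
text = my_buffer.decode("ascii")   # whole buffer as a string, assuming ASCII data

print(first_value, first_char, len(values))
```

If the device actually sends multi-byte binary samples rather than ASCII characters, the bytes would need to be grouped and unpacked according to the device's data sheet instead of decoded as text.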

proto3 - oneof vs fields with identifier

I am writing a proto3 message for an object that currently has around 2 variations and will grow to 6 or 7. Only one of them would be used in a message. These variations do not share common fields. They will be encoded as a submessage in a parent message. These messages would be written once and read tens of thousands of times.
I was wondering what the most performant way to achieve this would be, in terms of memory and parsing time, so that performance is not lost as more variations are added.
Consider the following variations.
message B1 {
  repeated string value = 1;
  bool hasMeta = 2;
}
message B2 {
  repeated int32 value = 1;
  map<string, string> foo = 2;
}
First option: define a oneof field that refers to the specific subtype.
message P1 {
  oneof parents {
    B1 boo = 1;
    B2 baz = 2;
    // add more variations here in future..
  }
  // other non-related fields...
}
Second option: define an integer that acts as an identifier for the available variation. At runtime this integer can be used to determine which variation has been set (another way is to null-check the variations and use the first non-null one).
message P1 {
  int32 type = 1;
  B1 boo = 2;
  B2 baz = 3;
  // other non-related fields...
}
I am particularly interested in the wire size and performance.
In the second option, considering that only one of the variations would be set (enforced in the app layer), will the wire size be larger than in the first? Is memory reserved for null fields as well?

The oneof approach is slightly better than the message with an explicit type field, in both processing cost and wire size. Protobuf always serializes the tag number before a nested message, so with the oneof there is no need to serialize a separate variable such as type, making its wire size slightly smaller than the second message definition.
Memory allocation depends heavily on the programming language you are using and how it implements oneofs and nested messages. If I am not mistaken, the default C++ implementation dynamically allocates memory for sub-messages, so I suspect there is no difference between your two options there. In NanoPB, however, oneofs are implemented as unions, which allocate only enough memory for the largest member, whereas your second option would allocate memory for both B1 and B2.
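The wire-size difference can be checked by hand-encoding both layouts. This is a sketch under stated assumptions: the helpers encode_varint, varint_field and length_delimited are hypothetical, the sample B1 contents are made up, and the value 1 for type is an arbitrary discriminator meaning "B1".

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        b = value & 0x7F
        value >>= 7
        if value:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)


def varint_field(field_number, value):
    """Field header with wire type 0, followed by the varint value."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


def length_delimited(field_number, body):
    """Field header with wire type 2, followed by length and body."""
    return encode_varint((field_number << 3) | 2) + encode_varint(len(body)) + body


# B1 { value: ["a", "b"], hasMeta: true }
b1_body = (
    length_delimited(1, b"a")
    + length_delimited(1, b"b")
    + varint_field(2, 1)
)

# First option: oneof, with boo = 1. Only the chosen member is written.
option1 = length_delimited(1, b1_body)

# Second option: type = 1 (hypothetical discriminator), boo = 2. baz is unset,
# so it contributes no bytes at all.
option2 = varint_field(1, 1) + length_delimited(2, b1_body)

print(len(option1), len(option2))
```

In this sketch the second option costs two extra bytes, one for the header of type and one for its value. Unset fields take no space on the wire in either layout, so the remaining difference is in-memory representation, which is implementation specific.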

RCFile - emitting GZip compressed int columns

For some reason, Hive is not recognizing columns emitted as integers, but does recognize columns emitted as strings.
Is there something about Hive or RCFile or GZ that is preventing proper rendering of int?
My Hive DDL looks like:
create external table if not exists db.table (intField int, strField string) stored as rcfile location '/path/to/my/data';
And the relevant portion of my Java looks like:
BytesRefArrayWritable dataWrite = new BytesRefArrayWritable(2);
byte[] byteArray;
BytesRefWritable bytesRefWritable = new BytesRefWritable();
intWritable.set(myObj.getIntField());
byteArray = WritableUtils.toByteArray(intWritable.get());
bytesRefWritable.set(byteArray, 0, byteArray.length);
dataWrite.set(0, bytesRefWritable); // sets int field as column 0
bytesRefWritable = new BytesRefWritable();
textWritable.set(myObj.getStrField());
bytesRefWritable.set(textWritable.getBytes(), 0, textWritable.getLength());
dataWrite.set(1, bytesRefWritable); // sets str field as column 1
The code runs fine, and through logging I can see the various Writables have bytes within them.
Hive can read the external table as well, but the int field shows up as NULL, indicating some error.
SELECT * from db.table;
OK
NULL my string field
Time taken: 0.647 seconds
Any idea what might be going on here?

So, I'm not sure exactly why this is the case, but I got it working using the following method:
In the code that writes the byte array representing the integer value, instead of using WritableUtils.toByteArray(), I call Text.set(Integer.toString(intVal)) and then take the bytes with getBytes().
In other words, I convert the integer to its String representation, and use the Text writable object to get the byte array as if it were a string.
Then, in my Hive DDL, I can call the column an int and it interprets it correctly.
I'm not sure what was initially causing the problem, be it a bug in WritableUtils, some incompatibility with compressed integer byte arrays, or a faulty understanding of how this stuff works on my part. In any event, the solution described above successfully meets the task's needs.
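To make the difference between the two byte layouts concrete, here is a small Python sketch. It assumes that the binary form is the 4-byte big-endian integer that Java's DataOutput.writeInt produces, and that Hive's default RCFile handling reads column bytes as text. The second point is an assumption offered as a possible explanation, not a confirmed diagnosis.

```python
import struct

int_val = 42

# Binary form: 4 bytes, big-endian, as DataOutput.writeInt would write it.
binary_form = struct.pack(">i", int_val)

# Text form: the decimal digits as UTF-8, as in the workaround above.
text_form = str(int_val).encode("utf-8")

print(binary_form.hex())  # 0000002a
print(text_form.hex())    # 3432

# A reader that expects decimal text can parse the text form...
parsed = int(text_form.decode("utf-8"))

# ...but the binary form is not a valid decimal string.
try:
    int(binary_form.decode("utf-8"))
    binary_parses_as_text = True
except ValueError:
    binary_parses_as_text = False
```

If that assumption is right, a reader expecting text would fail to parse the binary bytes and report NULL, which matches the behaviour described in the question.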

how bytes are used to store information in protobuf

I am trying to understand protocol buffers. Here is a sample; what I am not able to understand is how bytes are being used in the following messages. I don't know what the numbers 1, 2, 3 are used for.
message Point {
  required int32 x = 1;
  required int32 y = 2;
  optional string label = 3;
}
message Line {
  required Point start = 1;
  required Point end = 2;
  optional string label = 3;
}
message Polyline {
  repeated Point point = 1;
  optional string label = 2;
}
I read the following paragraph in the Google protobuf documentation but am not able to understand what is being said. Can anyone help me understand how bytes are being used to store the information?
The " = 1", " = 2" markers on each element identify the unique "tag" that field uses in the binary encoding. Tag numbers 1-15 require one less byte to encode than higher numbers, so as an optimization you can decide to use those tags for the commonly used or repeated elements, leaving tags 16 and higher for less-commonly used optional element.

The general form of a protobuf message is a sequence of pairs of the form:
field header
payload
For your question, we can largely forget about the payload; that isn't the bit that relates to the 1/2/3 and the 1-15 recommendation. All of that is in the field header. The field header is a "varint"-encoded integer; "varint" uses the most significant bit of each byte as a continuation bit, so small values (<=127, assuming unsigned and not zig-zag) need one byte to encode, and larger values need multiple bytes. In other words, you get 7 useful bits to play with before you need to set the continuation bit, which requires at least 2 bytes.
However! The field header itself is composed of two things:
the wire-type
the field-number / "tag"
The wire-type occupies the lowest 3 bits and indicates the fundamental format of the payload: "length-delimited", "64-bit", "32-bit", "varint", "start-group", "end-group". That means that of the 7 useful bits we had, only 4 are left; 4 bits is enough to encode numbers 0 through 15, and 0 is not a valid field number. This is why field-numbers 1 through 15 are suggested (as an optimisation) for your most common elements.
In your question, the 1 / 2 / 3 is the field-number; at the time of encoding this is left-shifted by 3 and composed with the payload's wire-type; then this composed value is varint-encoded.
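The header construction described above can be sketched in a few lines of Python. The helper names encode_varint and field_header are hypothetical; the wire-type numbers (0 for varint, 2 for length-delimited) are the ones protobuf defines.

```python
WIRE_VARINT = 0
WIRE_LENGTH_DELIMITED = 2


def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        b = value & 0x7F
        value >>= 7
        if value:
            out.append(b | 0x80)   # more bytes follow: set the continuation bit
        else:
            out.append(b)
            return bytes(out)


def field_header(field_number, wire_type):
    """Shift the field number left by 3, add the wire type, then varint-encode."""
    return encode_varint((field_number << 3) | wire_type)


print(field_header(1, WIRE_VARINT).hex())            # 08
print(field_header(3, WIRE_LENGTH_DELIMITED).hex())  # 1a
print(field_header(15, WIRE_VARINT).hex())           # 78
print(field_header(16, WIRE_VARINT).hex())           # 8001
```

Field number 15 still fits in a single header byte, while 16 spills into a second byte, which is the one-byte saving the quoted paragraph refers to.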

Protobuf stores a message like a map from an id (the =1, =2, which they call tags) to the actual value. This makes it easier to extend than a struct-like layout with fixed offsets would be. So a Point message, for instance, would look something like this at a high level:
1 -> 100,
2 -> 500
This is then interpreted as x=100, y=500 and label=not set. At a lower level, protobuf serializes this tag-value mapping in a highly compact format which, among other things, stores integers with a variable-length encoding. The paragraph you quoted highlights exactly this for tags, which can be stored more compactly if they are < 16, but the same holds, for instance, for integer values in your protobuf definition.
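The 1 -> 100, 2 -> 500 example can be turned into actual bytes with a short sketch. encode_varint and encode_point are hypothetical helpers written for this illustration, and the sketch only handles non-negative values, since negative int32 values use a longer encoding that is not covered here.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    out = bytearray()
    while True:
        b = value & 0x7F
        value >>= 7
        if value:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)


def encode_point(x, y):
    """Encode Point { x, y } with label left unset (so it adds no bytes)."""
    out = bytearray()
    out += encode_varint((1 << 3) | 0) + encode_varint(x)   # tag 1, varint
    out += encode_varint((2 << 3) | 0) + encode_varint(y)   # tag 2, varint
    return bytes(out)


encoded = encode_point(100, 500)
print(encoded.hex())  # 086410f403
```

The value 100 fits in one byte (0x64) while 500 needs two (0xf4 0x03), which shows the variable-length encoding applying to values as well as to tags.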
