How do I use BinData? - ruby

I'm trying to parse a binary file, and I can't quite figure out how to use BinData properly.
The binary file is laid out like this:
first 4 bytes (UINT32) represent the length of the property name.
next 8 * length represent the property name (as a string).
next 4 bytes (UINT32) represent the length of the property type.
next 8 * length represent the property type (as a string).
next 8 bytes (UINT64) represent the length of the data.
after this, the data can be any number of bytes depending on its type: int (4), string (4 * len), float (4), or an array.
After this the process repeats with the next property.
I guess my questions are:
When I call MyBinDataClass.read(), how do I feed it the correct portion to read rather than the whole file? And since I don't know in advance how long each property is (they vary by type), how can I split the file up properly?
How do I make a single BinData handle the different property types?
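For reference, a minimal sketch of how a layout like this is often modeled with BinData, assuming little-endian integers and that each length counts bytes (the class and field names are illustrative, not from the original post):

require 'bindata'

# One property record. Reading from an IO consumes only the bytes this
# record needs, so Property.read(io) can be called repeatedly on one file.
class Property < BinData::Record
  endian :little

  uint32 :name_len
  string :name, read_length: :name_len

  uint32 :type_len
  string :prop_type, read_length: :type_len

  uint64 :data_len

  # Pick a parser for the payload based on the property type string.
  choice :data, selection: :prop_type do
    int32  'int'
    float  'float'
    string 'string', read_length: :data_len
    string :default, read_length: :data_len   # unknown types: keep the raw bytes
  end
end

File.open('my.bin', 'rb') do |io|
  properties = []
  properties << Property.read(io) until io.eof?
end

Because read only consumes what the record declares, you never have to slice the file yourself, and a choice field is one way to let a single record handle the differing payload types.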

Related

When to use fixed-value protobuf types? Or in what scenarios?

I want to transfer a serialized protobuf message over TCP and I've tried to use the first field to indicate the total length of the serialized message.
I know that an int32 will change its length after encoding, so maybe a fixed32 is a good choice.
But at the end of the Encoding chapter, I found that I can't depend on it even if I use a fixed32 with field number #1, because the Field Order section says that the order may change.
My question is when do I use fixed value types? Are there any example scenarios?
"My question is when do I use fixed value types?"
When it comes to serializing values, there's always a tradeoff. If we look at the Protobuf documentation, we have a few options when it comes to 32-bit integers:
int32: Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead.
uint32: Uses variable-length encoding.
sint32: Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s.
fixed32: Always four bytes. More efficient than uint32 if values are often greater than 2^28.
sfixed32: Always four bytes.
int32 is a variable-length data type. Any information that is not implied by the type itself needs to be expressed somehow. To deserialize a variable-length number, we need to know its length; that is contained in the serialized message as well, which requires additional storage space. The same goes for an optional negative sign. The resulting message may be smaller because of this, but it may be larger as well.
Say we have a lot of integers between 0 and 255 to encode. It would be cheaper to send this information as two bytes (one byte with the actual value, and one byte to indicate that we only have one byte) than to send a full 32-bit (4-byte) integer [fictional values, the actual implementation may differ]. On the other hand, if we want to serialize a large value that only just fits in 4 bytes, the result may be larger (4 bytes plus an additional byte to indicate that the value is 4 bytes long; a total of 5 bytes). In that case it is more efficient to use a fixed32: we simply know a fixed32 is 4 bytes, so we don't need to serialize the fact that it is a 4-byte number.
And if we look at fixed32 it actually mentions that the tradeoff point is around 2^28 (for unsigned integers).
So some types are good [as in, more efficient in terms of storage space] for large values, some for small values, some for positive/negative values. It all depends on what the actual values represent.
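To make the tradeoff concrete, here is a small Ruby sketch (illustrative only) that counts how many bytes protobuf's base-128 varint encoding needs for an unsigned value, versus the constant four bytes of a fixed32:

# Each varint byte carries 7 payload bits plus 1 continuation bit.
def varint_size(value)
  bytes = 1
  bytes += 1 while (value >>= 7) > 0
  bytes
end

varint_size(127)        # => 1 byte  (fixed32 would spend 4)
varint_size(300)        # => 2 bytes
varint_size(2**28 - 1)  # => 4 bytes (still no worse than fixed32)
varint_size(2**28)      # => 5 bytes (now fixed32's constant 4 bytes wins)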
"Are there any example scenarios?"
32-bit hashes (e.g. CRC-32) and IPv4 addresses/masks. A predictable message size can also be relevant.

maximum field number in protobuf message

The official documentation for protocol buffers, https://developers.google.com/protocol-buffers/docs/proto3, says the maximum field number for fields in a protobuf message is 2^29-1. But why this limit?
Can anyone explain it in some detail? I am a newbie to this.
I read the answers to this question, why 2^29-1 is the biggest key in protocol buffers, but I am still not clear on it.
Each field in an encoded protocol buffer has a header (called key or tag) prefixed to the actual encoded value. The encoding spec defines this key:
Each key in the streamed message is a varint with the value (field_number << 3) | wire_type – in other words, the last three bits of the number store the wire type.
Here the spec says the key is a varint whose lowest 3 bits encode the wire type. A varint can encode a 64-bit value, so just going by this definition the limit would be 2^61-1.
In addition to this, the Language Guide narrows this down to a 32-bit value at most.
The smallest field number you can specify is 1, and the largest is 2^29 - 1, or 536,870,911.
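A quick Ruby sketch (illustrative) of the key computation described above, showing why capping the key at 32 bits leaves exactly 29 bits for the field number:

# key = (field_number << 3) | wire_type, where the wire type uses the low 3 bits
def make_key(field_number, wire_type)
  (field_number << 3) | wire_type
end

max_field = 2**29 - 1                    # 536_870_911
make_key(max_field, 7)     == 2**32 - 1  # => true: the largest key just fits in 32 bits
make_key(max_field + 1, 0) <= 2**32 - 1  # => false: one more field number overflows 32 bits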
The reasons for this are not given. I can only speculate for the reasons behind this:
Artificial limit as no one is expecting a message to have that many fields. Just think about fitting a message with that many fields into memory.
As the key is a varint, it isn't simply the next 4 bytes in the raw buffer, but rather a variable number of bytes (see the Java code reading a varint32). Each byte holds 7 bits of actual data and 1 bit indicating whether the end has been reached. It could be that, for performance reasons, it was deemed better to limit the range.
Since proto3 is the 3rd version of protocol buffers, it could be that either proto1 or proto2 defined the tag to be a varint32. To keep backwards compatibility this limit is still true in proto3 today.
Because of this line:
#define GOOGLE_PROTOBUF_WIRE_FORMAT_MAKE_TAG(FIELD_NUMBER, TYPE) \
static_cast<uint32>((static_cast<uint32>(FIELD_NUMBER) << 3) | (TYPE))
This line creates a "tag", which leaves only 29 (32 - 3) bits for the field number.
I don't know why Google used uint32 instead of uint64 here, though; since the field number is a varint, maybe they thought 2^29-1 fields is large enough for a single message declaration.
I suspect this is simply so that a field-header (wire-type and tag-number) can be decoded and handled as a 32-bit value. The wire-type is always the 3 least significant bits, leaving 29 bits for the tag number. Technically "varint" should support 64 bits, but it makes sense to limit it to reasonable numbers, not least because "varint" encoding means that larger numbers take more bytes to encode.
Edit: I realise now that this is similar to the linked post, but... it remains true! Each field in protobuf is prefixed by a "varint" that expresses which field (tag-number) follows, and what data type it is (wire-type). The latter is important especially so that unexpected fields (version differences) can be stored or skipped correctly. It is convenient for that field-header to be trivially processed by most frameworks, and most frameworks are fine with 32-bit integers.
This is another question rather than a comment; in the document it says:
Field numbers in the range 16 through 2047 take two bytes. So you should reserve the numbers 1 through 15 for very frequently occurring message elements. Remember to leave some room for frequently occurring elements that might be added in the future.
Because for the first byte the top 5 bits are used for the field number and the bottom 3 bits for the field type, isn't it field numbers from 31 (because zero is not used) to 2047 that take two bytes? (I also guess the second byte's lower 3 bits are used for the field type as well... I'm in the middle of reading, so I'll fix this when I know it.)
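For what it's worth, the one-byte/two-byte boundary follows from the whole key being a single varint of (field_number << 3) | wire_type rather than from a separate 5-bit field. A quick Ruby check (illustrative):

# A varint byte carries 7 payload bits, so:
#   1 byte  holds keys up to 2**7  - 1 = 127,   i.e. field numbers 1..15    (15 << 3 | 7 == 127)
#   2 bytes hold  keys up to 2**14 - 1 = 16383, i.e. field numbers 16..2047 (2047 << 3 | 7 == 16383)
def key_bytes(field_number, wire_type = 0)
  key = (field_number << 3) | wire_type
  bytes = 1
  bytes += 1 while (key >>= 7) > 0
  bytes
end

key_bytes(15)    # => 1
key_bytes(16)    # => 2
key_bytes(2047)  # => 2
key_bytes(2048)  # => 3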

What size Integer is guaranteed to be 4 bytes?

How big does an Integer have to be in Java to definitely be 4 bytes long when converted into a byte[] using ByteBuffer.allocate(int_value).array()?
I ask this because I use Integers for entity IDs in a game I'm working on, and it's much cheaper to generate 4-byte IDs than to fill each byte[] with bytes that hold the value 0x00.
As far as I understand you, you're making wrong assumptions here. There is no conversion/truncation/expansion done by allocate() or array() - you just allocate int_value bytes, and get a byte[int_value]-sized array from the array() call, if array() is supported at all. https://docs.oracle.com/javase/7/docs/api/java/nio/ByteBuffer.html#allocate%28int%29
To make the array 4 bytes long, simply use ByteBuffer.allocate(4), that's all. Then, if you want, use putInt(somevalue), and you get a 4-byte buffer filled with the given int, because that's the size of a Java int (32 bits, as per https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html), regardless of its value.
Note: you're probably approaching this from the wrong angle, by the way. It's best to use big buffers, giving you continuous memory regions, and to simply segment them based on some metric, e.g. for 4-byte (int) cells, allocate 4 * totalInts and then use get(4 * i) etc., or use bulk getting.
An integer (the primitive int) in Java is always 4 bytes long, since the type is not dynamically sized. See Primitive Data Types.
However, if your purpose is just to create an empty byte array, then just create it. There is no need to fill it with zeros, since in Java the default value for a byte is 0.
If you want to ensure that the byte array has the length 4, you could use
ByteBuffer.allocate(4).putInt(int_value).array();

LZ4 compression algorithm explanation

Description from Wikipedia:
The LZ4 algorithm represents the data as a series of sequences. Each sequence begins with a one byte token that is broken into two 4 bit fields. The first field represents the number of literal bytes that are to be copied to the output. The second field represents the number of bytes to copy from the already decoded output buffer (with 0 representing the minimum match length of 4 bytes). A value of 15 in either of the bitfields indicates that the length is larger and there is an extra byte of data that is to be added to the length. A value of 255 in these extra bytes indicates that yet another byte is to be added. Hence arbitrary lengths are represented by a series of extra bytes containing the value 255. After the string of literals comes the token and any extra bytes needed to indicate string length. This is followed by an offset that indicates how far back in the output buffer to begin copying. The extra bytes (if any) of the match-length come at the end of the sequence.
I didn't understand that at all! Does anyone have an easy way to understand example?
For example, in the above explanation what is a literal byte and what is a match? How can we have a decoded output buffer when we're just beginning to compress? Length of what?
The explanation here was also impenetrable to me.
A simple example would be nice unless you have a better way of explaining it.
First, read about LZ77, the core approach being used. The text is a description of a particular way to code a series of literals and string matches in the preceding data.
A match is when the next bytes in the uncompressed data occur in the previously decompressed data. So instead of sending those bytes directly, a length and an offset are sent. Then you go offset bytes backwards and copy length bytes to the output.
Yes, you can't have a match at the beginning of the stream. You have to start with literals. (Unless there is a preset dictionary, which is another topic.)
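To make the token / literals / offset / match-length structure concrete, here is a rough Ruby sketch of decoding one raw LZ4 block (simplified: no frame header, no error handling, and it assumes well-formed input):

# Each sequence: token | extra literal-length bytes | literals | 2-byte offset | extra match-length bytes
def lz4_decode_block(src)
  out = "".b
  i = 0
  while i < src.bytesize
    token = src.getbyte(i); i += 1

    # High nibble: literal count (15 means "read extra bytes and add them on").
    lit_len = token >> 4
    if lit_len == 15
      loop { b = src.getbyte(i); i += 1; lit_len += b; break if b != 255 }
    end
    out << src.byteslice(i, lit_len); i += lit_len
    break if i >= src.bytesize          # the last sequence ends with literals only

    # Little-endian 2-byte offset: how far back in the already-decoded output to copy from.
    offset = src.getbyte(i) | (src.getbyte(i + 1) << 8); i += 2

    # Low nibble: match length minus the 4-byte minimum match.
    match_len = token & 0x0F
    if match_len == 15
      loop { b = src.getbyte(i); i += 1; match_len += b; break if b != 255 }
    end
    match_len += 4

    # Copy byte by byte so overlapping matches (offset < length) repeat correctly.
    match_len.times { out << out.getbyte(out.bytesize - offset) }
  end
  out
end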

Efficient encoding of basic numeric datatypes in protobuffers

The Protocol Buffers documentation says:
"For historical reasons, repeated fields of basic numeric types aren't encoded as efficiently as they could be. New code should use the special option [packed=true] to get a more efficient encoding. For example:
repeated int32 samples = 4 [packed=true];"
Can someone clearly explain how the option [packed=true] improves the efficiency of encoding basic numeric datatypes?
Basically, under the original encoding the field header (which is composed of the wire type combined with the field-number, bit-shifted and or'd) occurs for every element. Because the header is varint encoded, it is at least one byte per element, but possibly more. So 10 4-byte floats would be at least 50 bytes and quite possibly 90 bytes if the header takes 5 bytes (large field numbers take more space than small field numbers).
With the packed encoding, the field header occurs only once, followed by a varint that indicates the number of bytes to follow. So for 10 floats, the payload length is 40, which is varint-encoded in a single byte for the length prefix. At deserialization time it simply consumes that-many bytes, reading elements as it does so. Therefore for the same data (50 to 90 bytes previously) we are now using 42 to 46 bytes (again, for the range of field numbers that take 1 to 5 bytes each).
These two layouts are very different on the wire, and code expecting one cannot usually decode the other. As such, packing needs to be explicitly enabled to prevent breaking existing messages.
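As a rough Ruby sketch of the arithmetic above (illustrative; real sizes depend on the field number and element type), compare the two layouts for the samples field from the documentation example, with 10 four-byte elements:

# Field keys and length prefixes are varints; each varint byte carries 7 payload bits.
def varint_size(v)
  n = 1
  n += 1 while (v >>= 7) > 0
  n
end

count     = 10
elem_size = 4                            # e.g. a float or fixed32 element
key_size  = varint_size((4 << 3) | 5)    # key for field number 4 is 1 byte either way

unpacked = count * (key_size + elem_size)                                  # => 50 (key repeated per element)
packed   = key_size + varint_size(count * elem_size) + count * elem_size   # => 42 (one key + length + payload)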

Resources