How to Create LMDB for Caffe Using C

I need to create LMDBs dynamically that can be read by Caffe's data layer, and the constraint is that only C is available for doing so. No Python.
Another person examined the byte-level contents of a Caffe-ready LMDB file here: Caffe: Understanding expected lmdb datastructure for blobs
This is a good illustrative example but obviously not comprehensive. Drilling down led me to the Datum message type defined by caffe.proto, and to the caffe.pb.h file generated from it by protoc, but this is where I hit a dead end.
The Datum class in the .h file defines a method that appears to be a promising lead:
void SerializeWithCachedSizes(::google::protobuf::io::CodedOutputStream* output) const
I'm guessing this is where the byte-level magic happens for encoding messages before they're sent.
Question: can anyone point me to documentation (or anything) that describes how the encoding works, so I can replicate an abridged version of it? In the illustrative example, the LMDB file contains MNIST data and metadata: 0x08 seems to signify that the next value is the number of channels, 0x10 and 0x18 designate height and width respectively, 0x28 appears to designate that an integer label comes next, and so on.
I'd like to gain a comprehensive understanding of all possible bytes and their meanings.

Additional digging yielded answers on the following page: https://developers.google.com/protocol-buffers/docs/encoding
caffe.proto defines Datum as:
optional int32 channels = 1;
optional int32 height = 2;
optional int32 width = 3;
optional bytes data = 4;
optional int32 label = 5;
repeated float float_data = 6;
optional bool encoded = 7;
The LMDB record's header in the illustrative example cited above is "08 01 10 1C 18 1C 22 90 06", so with the Google documentation's decoder ring, these hexadecimal values begin to make sense:
08 = field 1, wire type 0 (varint), since each tag is encoded as (field_number << 3) | wire_type
01 = value of field 1 (i.e., number of channels) is 1
10 = field 2, wire type 0 (varint)
1C = value of field 2 (i.e., height) is 28
18 = field 3, wire type 0 (varint)
1C = value of field 3 (i.e., width) is 28
22 = field 4, wire type 2 (length-delimited bytes)
90 06 = length of field 4's payload is 784 bytes, using the varint encoding (0x10 | 0x06 << 7 = 784 = 28 × 28)
Given this, it becomes straightforward to create LMDB entries for custom, non-image data sets directly in C in a form that Caffe's data layer can read.
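
To double-check that decoding, here is a minimal sketch in C of the varint/tag encoding (put_varint and put_tag are names of my own, not Caffe's or protobuf's API) that emits exactly that nine-byte header:

#include <stdint.h>
#include <stdio.h>

/* Append v as a base-128 varint; returns the number of bytes written. */
static int put_varint(uint8_t *out, uint32_t v)
{
    int n = 0;
    while (v >= 0x80) {
        out[n++] = (uint8_t)(v | 0x80);  /* low 7 bits plus continuation bit */
        v >>= 7;
    }
    out[n++] = (uint8_t)v;
    return n;
}

/* Append a field tag: (field_number << 3) | wire_type. */
static int put_tag(uint8_t *out, uint32_t field, uint32_t wire_type)
{
    return put_varint(out, (field << 3) | wire_type);
}

int main(void)
{
    uint8_t buf[32];
    int n = 0;
    n += put_tag(buf + n, 1, 0); n += put_varint(buf + n, 1);   /* channels = 1  */
    n += put_tag(buf + n, 2, 0); n += put_varint(buf + n, 28);  /* height   = 28 */
    n += put_tag(buf + n, 3, 0); n += put_varint(buf + n, 28);  /* width    = 28 */
    n += put_tag(buf + n, 4, 2); n += put_varint(buf + n, 784); /* 784 data bytes follow */
    for (int i = 0; i < n; i++)
        printf("%02X ", buf[i]);   /* prints: 08 01 10 1C 18 1C 22 90 06 */
    printf("\n");
    return 0;
}

The 784 pixel bytes, the label field (tag 0x28 plus a varint), and the LMDB put of the finished buffer would follow the same pattern.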

Related

Casting Primitive Values

The value 1,921,222 is too large to be stored as a short, so numeric overflow occurs and it becomes 20,678.
Can anyone demonstrate the process of 1,921,222 becoming 20,678? How does it "wrap around" to the next lowest value and count up from there to get 20,678?
Thank you in advance.
In C, the short type has 2 bytes. An integer literal is treated by the compiler as a 32-bit, or 4-byte, int (this can vary depending on the compiler).
short s = 1921222;
In this statement you are losing 2 bytes of data:

00000000 00011101   01010000 11000110   <- total data (4 bytes, 32 bits)
^^^^^^^^^^^^^^^^^   ^^^^^^^^^^^^^^^^^
discarded           remains in the variable (2 bytes)

The upper bytes are discarded when you put this value in a short type. In other words, you "crop" the data, leaving only the part that fits the specified type:
01010000 11000110
"01010000 11000110" is 20,678.
This site can help you understand better how this process works:
https://hexed.it/
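
A short C program makes the truncation visible; on the usual two's-complement implementations, the conversion simply keeps the low 16 bits:

#include <stdio.h>

int main(void)
{
    int   i = 1921222;            /* 0x001D50C6 */
    short s = (short)i;           /* keeps only the low 16 bits: 0x50C6 */

    printf("%hd\n", s);           /* 20678 */
    printf("%d\n",  i & 0xFFFF);  /* 20678 again: masking shows the same bits */
    return 0;
}

(Strictly speaking, the out-of-range conversion to short is implementation-defined in C, but every mainstream compiler performs this bit truncation.)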

How do you determine the length of the string in an OUTPUT_DEBUG_STRING_INFO?

The documentation for the OUTPUT_DEBUG_STRING_INFO structure doesn't explain how to determine the length (or size) of the string value it points to. Specifically, the documentation for nDebugStringLength is confusing:
The lower 16 bits of the length of the string in bytes. As nDebugStringLength is of type WORD, this does not always contain the full length of the string in bytes.
For example, if the original output string is longer than 65536 bytes, this field will contain a value that is less than the actual string length in bytes.
As I understand it, the true size can be any value that's a solution to the equation:
size = nDebugStringLength + (n * 65536)
for any n in [0..65536).
Question:
How do I determine the correct size of the string? Unless I'm overlooking something, the documentation appears to be insufficient in this regard.
Initially the debug event comes in the form of a DBGUI_WAIT_STATE_CHANGE.
If you use the WaitForDebugEvent[Ex] API, it internally converts the DBGUI_WAIT_STATE_CHANGE to a DEBUG_EVENT by using DbgUiConvertStateChangeStructure[Ex].
The DbgExceptionStateChange (in NewState) event with DBG_PRINTEXCEPTION_WIDE_C or DBG_PRINTEXCEPTION_C (in ExceptionCode) is converted to OUTPUT_DEBUG_STRING_INFO. The nDebugStringLength is taken from Exception.ExceptionRecord.ExceptionInformation[0], or from ExceptionInformation[3] (in the case of DBG_PRINTEXCEPTION_C and the API version without Ex). But because nDebugStringLength is only 16 bits wide, when the original value is 32/64 bits wide it is truncated: only the low 16 bits of ExceptionInformation[0] (or [3]) are used.
Note that ExceptionInformation[0] (and [3] in the case of DBG_PRINTEXCEPTION_WIDE_C) contains the string length in characters, including the terminating 0.
In contrast, nDebugStringLength is in bytes (if we use WaitForDebugEventEx and the DBG_PRINTEXCEPTION_WIDE_C exception, nDebugStringLength = (WORD)(ExceptionInformation[0] * sizeof(WCHAR))).
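
Since the reported length can thus be truncated modulo 2^16, one workaround is to ignore nDebugStringLength and read the debuggee's string until its terminating 0. A minimal sketch in C (ReadDebugStringA is a hypothetical helper name; page handling is simplified):

#include <windows.h>
#include <stdlib.h>
#include <string.h>

/* Read a NUL-terminated ANSI debug string from the debuggee's address space,
 * ignoring the possibly-truncated nDebugStringLength. Caller frees the result. */
char *ReadDebugStringA(HANDLE hProcess, const OUTPUT_DEBUG_STRING_INFO *info)
{
    size_t cap = 0, len = 0;
    char *buf = NULL;
    char chunk[4096];

    for (;;) {
        SIZE_T got = 0;
        if (!ReadProcessMemory(hProcess,
                               (const char *)info->lpDebugStringData + len,
                               chunk, sizeof(chunk), &got) || got == 0)
            break;  /* unreadable page: stop with whatever was read so far */

        const char *nul = memchr(chunk, '\0', got);
        size_t take = nul ? (size_t)(nul - chunk) : got;

        if (len + take + 1 > cap) {
            char *tmp = realloc(buf, (len + take + 1) * 2);
            if (!tmp) { free(buf); return NULL; }
            buf = tmp;
            cap = (len + take + 1) * 2;
        }
        memcpy(buf + len, chunk, take);
        len += take;
        if (nul) break;  /* found the terminating 0 */
    }
    if (buf) buf[len] = '\0';
    return buf;
}

For DBG_PRINTEXCEPTION_WIDE_C (fUnicode set), the same loop applies but scanning for a WCHAR terminator; a production version would also clamp each read so it never crosses into an unmapped page.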

How to modify a serialized protobuf byte array in place?

Consider a protobuf message:
message DataMessage {
  int32 custId = 1;
  string uuid = 2;
  int32 version = 3;
  string firmName = 4;
  google.protobuf.Timestamp date = 5;
  int32 accountNo = 6;
  string firmName2 = 7;
  bytes payload = 8;
}
I populate this and Marshal it to a byte array and publish this as a Kafka message. All this works wonderfully.
Now to the question: the Kafka handler gets the byte array. What I'd like to do is modify fields 2 and 3 ONLY, without having to unmarshal the byte array back into a DataMessage, modify fields 2 and 3, and then marshal it back into another byte array so it can be Kafka-published to the next hop. Field 2 is a string, but it is an ISO-formatted GUID, and the modification here will not increase or decrease the field length.
protob.Buffer doesn't seem to facilitate this; it seems to allow encoding/decoding only single fields at a time.
The payload (field 8) makes up 99% of the message by size and won't be modified. I'd like to skip all the temporary copies and marshal/unmarshal work just to modify the 1%.
Possible? I guess if I knew the offsets into the byte array for fields 2 and 3 I could modify the bytes in place ... however, this undermines the integrity of the marshaled bytes to some extent. This might be made easier by using fixed32 instead of int32 ... so the offsets into the byte array are consistent.
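
This is possible in principle precisely because the wire format is self-describing: you can walk the tag/payload pairs and patch a same-length, length-delimited field in place. A minimal sketch of that walk in C (the idea is language-agnostic; find_field and patch_uuid are hypothetical helpers, and bounds checking is abbreviated):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Decode the varint at buf[*pos] and advance *pos past it. */
static uint64_t read_varint(const uint8_t *buf, size_t *pos)
{
    uint64_t v = 0;
    int shift = 0;
    while (buf[*pos] & 0x80) {
        v |= (uint64_t)(buf[(*pos)++] & 0x7F) << shift;
        shift += 7;
    }
    v |= (uint64_t)buf[(*pos)++] << shift;
    return v;
}

/* Return the offset of a length-delimited field's payload, or -1. */
static ptrdiff_t find_field(const uint8_t *buf, size_t len,
                            uint32_t want_field, size_t *out_len)
{
    size_t pos = 0;
    while (pos < len) {
        uint64_t tag   = read_varint(buf, &pos);
        uint32_t field = (uint32_t)(tag >> 3);
        uint32_t wire  = (uint32_t)(tag & 7);
        switch (wire) {
        case 0: read_varint(buf, &pos); break;       /* varint payload      */
        case 1: pos += 8; break;                     /* 64-bit payload      */
        case 2: {                                    /* length-delimited    */
            size_t plen = (size_t)read_varint(buf, &pos);
            if (field == want_field) { *out_len = plen; return (ptrdiff_t)pos; }
            pos += plen;
            break;
        }
        case 5: pos += 4; break;                     /* 32-bit payload      */
        default: return -1;                          /* groups/corrupt data */
        }
    }
    return -1;
}

/* Overwrite field 2 (uuid) in place with a same-length replacement. */
int patch_uuid(uint8_t *msg, size_t msg_len, const char *new_uuid, size_t new_len)
{
    size_t plen = 0;
    ptrdiff_t off = find_field(msg, msg_len, 2, &plen);
    if (off < 0 || plen != new_len)   /* only same-length swaps keep offsets valid */
        return -1;
    memcpy(msg + off, new_uuid, new_len);
    return 0;
}

Patching the int32 version (field 3) the same way is only safe when the new value's varint encoding is exactly as long as the old one, which is why the fixed32 idea helps: a fixed32 payload always occupies four bytes.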

How bytes are used to store information in protobuf

I am trying to understand protocol buffers. Here is the sample. What I am not able to understand is how bytes are being used in the following messages; I don't know what the numbers 1, 2, 3 are used for.
message Point {
  required int32 x = 1;
  required int32 y = 2;
  optional string label = 3;
}
message Line {
  required Point start = 1;
  required Point end = 2;
  optional string label = 3;
}
message Polyline {
  repeated Point point = 1;
  optional string label = 2;
}
I read the following paragraph in the Google protobuf documentation but am not able to understand what is being said. Can anyone help me understand how bytes are used to store the info?
The " = 1", " = 2" markers on each element identify the unique "tag" that field uses in the binary encoding. Tag numbers 1-15 require one less byte to encode than higher numbers, so as an optimization you can decide to use those tags for the commonly used or repeated elements, leaving tags 16 and higher for less-commonly used optional element.
The general form of a protobuf message is that it is a sequence of pairs of the form:
field header
payload
For your question, we can largely forget about the payload - that isn't the bit that relates to the 1/2/3 and the <=16 restriction - all of that is in the field header. The field header is a "varint" encoded integer; "varint" uses the most-significant-bit as an optional continuation bit, so small values (<=127, assuming unsigned and not zig-zag) require one byte to encode - larger values require multiple bytes. Or in other words, you get 7 useful bits to play with before you need to set the continuation bit, requiring at least 2 bytes.
However! The field header itself is composed of two things:
the wire-type
the field-number / "tag"
The wire-type is the lowest 3 bits, and indicates the fundamental format of the payload - "length-delimited", "64-bit", "32-bit", "varint", "start-group", "end-group". That means that of the 7 useful bits we had, only 4 are left; 4 bits is enough to encode field numbers up to 15. This is why field-numbers 1-15 are suggested (as an optimisation) for your most common elements.
In your question, the 1 / 2 / 3 is the field-number; at the time of encoding this is left-shifted by 3 and composed with the payload's wire-type; then this composed value is varint-encoded.
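
To make the byte counts concrete, here is a small C sketch (encode_field_header is a name of my own invention) that varint-encodes the field header; field number 15 fits in one byte, while 16 spills into two:

#include <stdint.h>
#include <stdio.h>

/* Varint-encode (field_number << 3) | wire_type; returns bytes written. */
static int encode_field_header(uint32_t field_number, uint32_t wire_type,
                               uint8_t *out)
{
    uint32_t v = (field_number << 3) | wire_type;
    int n = 0;
    while (v >= 0x80) {               /* 7 payload bits per byte, LSB group first */
        out[n++] = (uint8_t)(v | 0x80);
        v >>= 7;
    }
    out[n++] = (uint8_t)v;
    return n;
}

int main(void)
{
    uint8_t buf[8];
    printf("field 15, varint wire type: %d byte(s)\n", encode_field_header(15, 0, buf)); /* 1 */
    printf("field 16, varint wire type: %d byte(s)\n", encode_field_header(16, 0, buf)); /* 2 */
    return 0;
}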
Protobuf stores the messages like a map from an id (the =1, =2, which they call tags) to the actual value. This makes it easier to extend than if it transferred data more like a struct with fixed offsets. So a message Point, for instance, would look something like this on a high level:
1 -> 100,
2 -> 500
Which then is interpreted as x=100, y=500 and label=not set. On a lower level, protobuf serializes this tag-value mapping in a highly compact format, which among other things, stores integers with variable-length encoding. The paragraph you quoted just highlights exactly this in the case of tags, which can be stored more compactly if they are < 16, but the same for instance holds for integer values in your protobuf definition.
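
As a worked example (assuming the varint wire type that int32 uses), a Point with x=100 and y=500 and no label serializes to just five bytes:

08 64      field 1 (x), wire type 0 (varint); value 0x64 = 100
10 F4 03   field 2 (y), wire type 0 (varint); value 0x74 | (0x03 << 7) = 500

The unset optional label (field 3) is simply absent from the stream.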

UUID format: 8-4-4-4-12 - Why?

Why are UUID's presented in the format "8-4-4-4-12" (digits)? I've had a look around for the reason but can't find the decision that calls for it.
Example of UUID formatted as hex string:
58D5E212-165B-4CA0-909B-C86B9CEE0111
It's separated into time, version, clock_seq_hi, clock_seq_lo, and node fields, as indicated in the following RFC.
From the IETF RFC4122:
4.1.2. Layout and Byte Order
To minimize confusion about bit assignments within octets, the UUID
record definition is defined only in terms of fields that are
integral numbers of octets. The fields are presented with the most
significant one first.
Field                      Data Type     Octet  Note
                                         #
time_low                   unsigned 32   0-3    The low field of the
                           bit integer          timestamp
time_mid                   unsigned 16   4-5    The middle field of the
                           bit integer          timestamp
time_hi_and_version        unsigned 16   6-7    The high field of the
                           bit integer          timestamp multiplexed
                                                with the version number
clock_seq_hi_and_reserved  unsigned 8    8      The high field of the
                           bit integer          clock sequence
                                                multiplexed with the
                                                variant
clock_seq_low              unsigned 8    9      The low field of the
                           bit integer          clock sequence
node                       unsigned 48   10-15  The spatially unique
                           bit integer          node identifier
In the absence of explicit application or presentation protocol
specification to the contrary, a UUID is encoded as a 128-bit object,
as follows:
The fields are encoded as 16 octets, with the sizes and order of the
fields defined above, and with each field encoded with the Most
Significant Byte first (known as network byte order). Note that the
field names, particularly for multiplexed fields, follow historical
practice.
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| time_low |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| time_mid | time_hi_and_version |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|clk_seq_hi_res | clk_seq_low | node (0-1) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| node (2-5) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The format is defined in IETF RFC4122 in section 3. The output format is defined where it says "UUID = ..."
3. Namespace Registration Template
Namespace ID: UUID
Registration Information:
Registration date: 2003-10-01
Declared registrant of the namespace:
JTC 1/SC6 (ASN.1 Rapporteur Group)
Declaration of syntactic structure:
A UUID is an identifier that is unique across both space and time,
with respect to the space of all UUIDs. Since a UUID is a fixed
size and contains a time field, it is possible for values to
rollover (around A.D. 3400, depending on the specific algorithm
used). A UUID can be used for multiple purposes, from tagging
objects with an extremely short lifetime, to reliably identifying
very persistent objects across a network.
The internal representation of a UUID is a specific sequence of
bits in memory, as described in Section 4. To accurately
represent a UUID as a URN, it is necessary to convert the bit
sequence to a string representation.
Each field is treated as an integer and has its value printed as a
zero-filled hexadecimal digit string with the most significant
digit first. The hexadecimal values "a" through "f" are output as
lower case characters and are case insensitive on input.
The formal definition of the UUID string representation is
provided by the following ABNF [7]:
UUID = time-low "-" time-mid "-"
time-high-and-version "-"
clock-seq-and-reserved
clock-seq-low "-" node
time-low = 4hexOctet
time-mid = 2hexOctet
time-high-and-version = 2hexOctet
clock-seq-and-reserved = hexOctet
clock-seq-low = hexOctet
node = 6hexOctet
hexOctet = hexDigit hexDigit
hexDigit =
"0" / "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" / "9" /
"a" / "b" / "c" / "d" / "e" / "f" /
"A" / "B" / "C" / "D" / "E" / "F"
A UUID is 128 bits. The "8-4-4-4-12" format is just for reading by humans; the UUID is really a 128-bit number.
Consider that the string format requires more than double the bytes of the 128-bit number when stored in memory or on disk. I would suggest using the number internally, and when it needs to be shown in a UI or exported to a file, using the string format.
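
A minimal sketch of that conversion in C, assuming the 16 octets are already in the network byte order defined above (uuid_to_string is a hypothetical helper name):

#include <stdint.h>
#include <stdio.h>

/* Format a 16-octet UUID as the canonical 8-4-4-4-12 string.
 * out must have room for 37 bytes (36 characters plus the NUL). */
void uuid_to_string(const uint8_t u[16], char out[37])
{
    snprintf(out, 37,
             "%02x%02x%02x%02x-"          /* time_low                  */
             "%02x%02x-"                  /* time_mid                  */
             "%02x%02x-"                  /* time_hi_and_version       */
             "%02x%02x-"                  /* clock_seq_hi_and_reserved,
                                             clock_seq_low             */
             "%02x%02x%02x%02x%02x%02x",  /* node                      */
             u[0], u[1], u[2],  u[3],  u[4],  u[5],  u[6],  u[7],
             u[8], u[9], u[10], u[11], u[12], u[13], u[14], u[15]);
}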
