Why are UUIDs presented in the format "8-4-4-4-12" (digits)? I've had a look around for the reason but can't find the decision that calls for it.
Example of UUID formatted as hex string:
58D5E212-165B-4CA0-909B-C86B9CEE0111
It's separated into the time_low, time_mid, time_hi_and_version, clock_seq, and node fields, as indicated in the following RFC.
From the IETF RFC4122:
4.1.2. Layout and Byte Order
To minimize confusion about bit assignments within octets, the UUID
record definition is defined only in terms of fields that are
integral numbers of octets. The fields are presented with the most
significant one first.
Field                      Data Type                Octet  Note
time_low                   unsigned 32-bit integer  0-3    The low field of the timestamp
time_mid                   unsigned 16-bit integer  4-5    The middle field of the timestamp
time_hi_and_version        unsigned 16-bit integer  6-7    The high field of the timestamp multiplexed with the version number
clock_seq_hi_and_reserved  unsigned 8-bit integer   8      The high field of the clock sequence multiplexed with the variant
clock_seq_low              unsigned 8-bit integer   9      The low field of the clock sequence
node                       unsigned 48-bit integer  10-15  The spatially unique node identifier
In the absence of explicit application or presentation protocol
specification to the contrary, a UUID is encoded as a 128-bit object,
as follows:
The fields are encoded as 16 octets, with the sizes and order of the
fields defined above, and with each field encoded with the Most
Significant Byte first (known as network byte order). Note that the
field names, particularly for multiplexed fields, follow historical
practice.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          time_low                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       time_mid                |         time_hi_and_version   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|clk_seq_hi_res |  clk_seq_low  |         node (0-1)            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         node (2-5)                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The format is defined in IETF RFC 4122, Section 3. The output format is defined where it says "UUID = ...".
3. Namespace Registration Template
Namespace ID: UUID
Registration Information:
Registration date: 2003-10-01
Declared registrant of the namespace:
JTC 1/SC6 (ASN.1 Rapporteur Group)
Declaration of syntactic structure:
A UUID is an identifier that is unique across both space and time,
with respect to the space of all UUIDs. Since a UUID is a fixed
size and contains a time field, it is possible for values to
rollover (around A.D. 3400, depending on the specific algorithm
used). A UUID can be used for multiple purposes, from tagging
objects with an extremely short lifetime, to reliably identifying
very persistent objects across a network.
The internal representation of a UUID is a specific sequence of
bits in memory, as described in Section 4. To accurately
represent a UUID as a URN, it is necessary to convert the bit
sequence to a string representation.
Each field is treated as an integer and has its value printed as a
zero-filled hexadecimal digit string with the most significant
digit first. The hexadecimal values "a" through "f" are output as
lower case characters and are case insensitive on input.
The formal definition of the UUID string representation is
provided by the following ABNF [7]:
UUID                   = time-low "-" time-mid "-"
                         time-high-and-version "-"
                         clock-seq-and-reserved
                         clock-seq-low "-" node
time-low               = 4hexOctet
time-mid               = 2hexOctet
time-high-and-version  = 2hexOctet
clock-seq-and-reserved = hexOctet
clock-seq-low          = hexOctet
node                   = 6hexOctet
hexOctet               = hexDigit hexDigit
hexDigit =
      "0" / "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" / "9" /
      "a" / "b" / "c" / "d" / "e" / "f" /
      "A" / "B" / "C" / "D" / "E" / "F"
128 bits
The "8-4-4-4-12" format is just for reading by humans. The UUID is really a 128-bit number.
Note that the string format requires more than twice the bytes of the 128-bit number when stored or kept in memory. I would suggest using the number internally and switching to the string format only when it needs to be shown in a UI or exported to a file.
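As an illustration, converting between the two is trivial. Here is a minimal sketch (the function name is mine, not from the RFC) that formats the 16 octets, in the byte order given in Section 4.1.2, as the canonical string:

// Minimal sketch (illustrative): format a 128-bit UUID, held as 16 octets in
// network byte order, as the canonical 8-4-4-4-12 string.
#include <cstdint>
#include <cstdio>
#include <string>

std::string uuid_to_string(const std::uint8_t b[16]) {
    char buf[37];                                   // 32 hex digits + 4 hyphens + '\0'
    std::snprintf(buf, sizeof buf,
        "%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-%02x%02x%02x%02x%02x%02x",
        b[0], b[1], b[2], b[3],                     // time_low                  (4 octets -> 8 digits)
        b[4], b[5],                                 // time_mid                  (2 octets -> 4 digits)
        b[6], b[7],                                 // time_hi_and_version       (2 octets -> 4 digits)
        b[8], b[9],                                 // clock_seq_hi_and_reserved, clock_seq_low (1+1 -> 4 digits)
        b[10], b[11], b[12], b[13], b[14], b[15]);  // node                      (6 octets -> 12 digits)
    return std::string(buf);
}

The hyphens simply fall on the field boundaries from the layout table above: 4 + 2 + 2 + (1 + 1) + 6 octets become the 8-4-4-4-12 groups of hex digits.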
The documentation for the OUTPUT_DEBUG_STRING_INFO structure doesn't explain how to determine the length (or size) of the string value it points to. Specifically, the documentation for nDebugStringLength is confusing:
The lower 16 bits of the length of the string in bytes. As nDebugStringLength is of type WORD, this does not always contain the full length of the string in bytes.
For example, if the original output string is longer than 65536 bytes, this field will contain a value that is less than the actual string length in bytes.
As I understand it, the true size can be any value that's a solution to the equation:
size = nDebugStringLength + (n * 65536)
for any n in [0..65536).
Question:
How do I determine the correct size of the string? Unless I'm overlooking something, the documentation appears to be insufficient in this regard.
Initially the debug event arrives in the form of a DBGUI_WAIT_STATE_CHANGE.
If you use the WaitForDebugEvent[Ex] API, it internally converts the DBGUI_WAIT_STATE_CHANGE to a DEBUG_EVENT by using DbgUiConvertStateChangeStructure[Ex].
A DbgExceptionStateChange (in NewState) event with DBG_PRINTEXCEPTION_WIDE_C or DBG_PRINTEXCEPTION_C (in ExceptionCode) is converted to OUTPUT_DEBUG_STRING_INFO. nDebugStringLength is taken from Exception.ExceptionRecord.ExceptionInformation[0], or from ExceptionInformation[3] (when the API version without Ex is used for a DBG_PRINTEXCEPTION_WIDE_C exception). But because nDebugStringLength is only 16 bits wide while the original value is 32/64 bits wide, it gets truncated: only the low 16 bits of ExceptionInformation[0] (or [3]) are used.
Note that ExceptionInformation[0] (and [3] in the case of DBG_PRINTEXCEPTION_WIDE_C) contains the string length in characters, including the terminating 0.
In contrast, nDebugStringLength is in bytes: if you use WaitForDebugEventEx and the exception is DBG_PRINTEXCEPTION_WIDE_C, then nDebugStringLength = (WORD)(ExceptionInformation[0] * sizeof(WCHAR)).
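A practical workaround (just a sketch under the assumptions above, not documented API behavior; the function name is illustrative): since lpDebugStringData points into the debuggee's address space, you can ignore the truncated nDebugStringLength and read with ReadProcessMemory until you hit the terminating null:

#include <windows.h>
#include <cstring>
#include <string>

// Reads the full ANSI debug string from the debuggee, page by page, so that one
// partially unreadable range doesn't abort the whole read.
std::string ReadDebugStringA(HANDLE hProcess, const OUTPUT_DEBUG_STRING_INFO& info)
{
    // Note: if info.fUnicode is nonzero the buffer holds UTF-16 text; a real
    // implementation would scan for a two-byte terminator instead (omitted here).
    std::string result;
    const char* remote = info.lpDebugStringData;
    char buf[4096];
    for (;;) {
        SIZE_T offset = reinterpret_cast<ULONG_PTR>(remote) & 0xFFF;  // position within 4 KiB page
        SIZE_T toRead = sizeof(buf) - offset;                         // stop at the page boundary
        SIZE_T read = 0;
        if (!ReadProcessMemory(hProcess, remote, buf, toRead, &read) || read == 0)
            break;                                                    // unreadable: keep what we have
        const char* nul = static_cast<const char*>(std::memchr(buf, '\0', read));
        if (nul) {                                                    // found the terminating null
            result.append(buf, nul - buf);
            break;
        }
        result.append(buf, read);
        remote += read;
    }
    return result;
}

Once the whole string has been read this way, its true size is simply result.size(), regardless of what nDebugStringLength says.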
I need to create LMDBs dynamically that can be read by Caffe's data layer, and the constraint is that only C is available for doing so. No Python.
Another person examined the byte-level contents of a Caffe-ready LMDB file here: Caffe: Understanding expected lmdb datastructure for blobs
This is a good illustrative example but obviously not comprehensive. Drilling down led me to the Datum message type, defined by caffe.proto, and the ensuing caffe.pb.h file created by protoc from caffe.proto, but this is where I hit a dead end.
The Datum class in the .h file defines a method that appears to be a promising lead:
void SerializeWithCachedSizes(::google::protobuf::io::CodedOutputStream* output) const
I'm guessing this is where the byte-level magic happens for encoding messages before they're sent.
Question: can anyone point me to documentation (or anything) that describes how the encoding works, so I can replicate an abridged version of it? In the illustrative example, the LMDB file contains MNIST data and metadata, and 0x08 seems to signify that the next value is "Number of Channels". And 0x10 and 0x18 designate heights and widths, respectively. 0x28 appears to designate an integer label being next. And so on, and so forth.
I'd like to gain a comprehensive understanding of all possible bytes and their meanings.
Additional digging yielded answers on the following page: https://developers.google.com/protocol-buffers/docs/encoding
caffe.proto defines Datum as:
optional int32 channels = 1;
optional int32 height = 2;
optional int32 width = 3;
optional bytes data = 4;
optional int32 label = 5;
repeated float float_data = 6;
optional bool encoded = 7;
The LMDB record's header in the illustrative example cited above is "08 01 10 1C 18 1C 22 90 06", so with the Google documentation's decoder ring these hexadecimal values begin to make sense:
08 = Field 1, wire type 0 (varint), since each tag byte is encoded as (field_number << 3) | wire_type
01 = Value of Field 1 (i.e., number of channels) is 1
10 = Field 2, wire type 0 (varint)
1C = Value of Field 2 (i.e., height) is 28
18 = Field 3, wire type 0 (varint)
1C = Value of Field 3 (i.e., width) is 28
22 = Field 4, wire type 2 (length-delimited bytes)
90 06 = Length of Field 4 (i.e., number of data bytes that follow) is 784 (28 x 28), using the varint encoding
Given this, efficiently creating LMDB entries directly with C for custom, non-image data sets that are readable by Caffe's data layer becomes straightforward.
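To make that concrete, below is a hedged sketch of building exactly that header byte sequence. It sticks to plain C (per the question's constraint; it is also valid C++), and the function names and buffer handling are illustrative, not from Caffe or protobuf.

/* Build the Datum header shown above for a channels x height x width image whose
   raw pixel bytes follow as field 4. */
#include <stdint.h>
#include <stddef.h>

/* base-128 varint: 7 payload bits per byte, high bit set while more bytes follow */
static size_t write_varint(uint8_t *out, uint64_t v) {
    size_t n = 0;
    while (v >= 0x80) { out[n++] = (uint8_t)(v | 0x80); v >>= 7; }
    out[n++] = (uint8_t)v;
    return n;
}

/* field key = varint of (field_number << 3) | wire_type */
static size_t write_key(uint8_t *out, uint32_t field, uint32_t wire_type) {
    return write_varint(out, ((uint64_t)field << 3) | wire_type);
}

/* Emits channels (field 1), height (field 2), width (field 3), then the key and
   length prefix for the raw "data" bytes (field 4). The pixel bytes themselves and
   the label (field 5, key 0x28) would be written after this header. Returns the
   number of header bytes written. */
static size_t write_datum_header(uint8_t *out, int32_t channels, int32_t height,
                                 int32_t width, size_t data_len) {
    size_t n = 0;
    n += write_key(out + n, 1, 0); n += write_varint(out + n, (uint64_t)channels);
    n += write_key(out + n, 2, 0); n += write_varint(out + n, (uint64_t)height);
    n += write_key(out + n, 3, 0); n += write_varint(out + n, (uint64_t)width);
    n += write_key(out + n, 4, 2); n += write_varint(out + n, (uint64_t)data_len);
    return n;  /* for 1 x 28 x 28 with 784 data bytes: 08 01 10 1C 18 1C 22 90 06 */
}

Calling write_datum_header(buf, 1, 28, 28, 784) on a sufficiently large buffer reproduces the nine header bytes listed above.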
I want to do a demo for some peers, and I want to force IRB to return values in only binary.
For example it currently returns the result in base 10 no matter which base the input is in:
0b1111
# => 15 #should return 1111 or 0b1111
0b0001 | 0b0011
# => 3 #should return 0011 or 0b0011
Is there a way to force this? I want to demo bitwise operators, and it is much easier for them to understand if they see the bits flowing around rather than base-10 numbers being returned, which I would then have to convert to base 2 afterwards.
Also, I would like all results to be in multiples of 4 bits, if possible with underscores or spaces separating the half-byte (nibble) groupings.
For example:
0b0001_0101
# => 0b0001_0101 #or 0b0001 0101, or 0001 0101, or 0001_0101
If a result does not need 4 bits to be represented (for example 3, which is 11 in binary), pad it to 4, 8, or 16 bits in length depending on the number.
If you write 0b0001.class you will find that it is Fixnum.
Writing puts 0b1000 shows 8: the Fixnum doesn't remember which base its literal was written in, and it is displayed in base 10 by default.
So as far as I'm aware, there isn't any way to prevent that conversion to a base-10 display, other than changing how the value is shown.
If you want to control the way that Fixnum objects are displayed in IRB, you can implement a custom inspect method for the class:
class Fixnum
  def inspect
    unpadded_binary_string = to_s(2)
    binary_width = if unpadded_binary_string.length % 4 == 0
      unpadded_binary_string.length
    else
      ((unpadded_binary_string.length / 4) * 4) + 4
    end
    padded_binary_string = "%0#{binary_width}d" % unpadded_binary_string
    # join groups of 4 with an underscore
    padded_binary_string = padded_binary_string.scan(/.{4}/).join("_")
    "0b" + padded_binary_string
  end
end
results:
irb(main):007:0> 0b1000
=> 0b1000
irb(main):011:0> 99999999
=> 0b0101_1111_0101_1110_0000_1111_1111
The inspect method uses to_s(2), which takes an integer and produces a string representation of its binary form. But leading zeroes in the binary literal are not preserved in the integer value, which is why the inspect method needs to manually add zeroes to the front of the string.
There's no way I can think of to add the correct number of zeroes to the front of the string in a completely dynamic way.
What I'm doing here is calculating the minimum width (in a multiple of 4) that can contain the unpadded binary string. So if the unpadded length is 5 characters, the final width will be 8. If the unpadded length is 2, the final length is 4.
Instead of calculating it on-the-go, you could alternatively set the binary_width as an external variable that you change at runtime, then reference it from the inspect function.
I have quite a specific data set that I need to store in the most compact way as a byte array. It is a live stream of integers that are constantly increasing, often by one, but not always by one. Each integer value has a tag that is a byte value. There may be entries with the same value and tag, but I need to store only distinct pairs. The only supported operations are adding new elements, removing them, and checking whether an element exists - I keep this data set to check if some pair has been 'seen' recently.
Some sample data:
# | value | tag |
1 | 1000 | 0 |
2 | 1000 | 1 |
3 | 1000 | 2 |
4 | 1001 | 0 |
5 | 1002 | 2 |
6 | 1004 | 1 |
7 | 1004 | 2 |
8 | 1005 | 0 |
As I said, this is a live stream, but I can tolerate storing only the last few thousand entries. The goal is to make it as memory efficient as possible in storage (and in RAM); operations can be expensive.
If I had no tags, I could store ranges of values: (1000-1002), (1002-1005), etc. There are usually about 5-6 values in a row without gaps. But the tags mess all this up.
My current approach is to encode each value + tag pair in a few bytes - one byte for the tag and 1 or more bytes for the 'delta' from the previous value.
This way I need to store the first value, 1000 in the above case, and then I store deltas - 0 for #2 and #3, 1 for #4, 1 for #5, 2 for #6, and so on.
Most deltas are small (1-10), so I can store them in one byte only: the first bit is a flag indicating whether the value is small enough to fit in 7 bits; if not, the next 7 bits store how many bytes the delta occupies.
Maybe there is a better, more compact, approach?
Since the tag is a single byte, there are at most 256 different tag values, so you could maintain 256 different tables, one for each tag, thus saving yourself from having to store the tags. In each table you could still use your nifty trick with deltas.
Let the pair (value, tag) where value is a uint32 and tag is a uint8 be a typical item stored in your data structure.
Use an associative array data structure that maps uint32 to an array list of uint16. In C++ terms, the data structure is the following.
std::map<std::uint32_t, std::vector<std::uint16_t>>
Each array list stays sorted with distinct values and never exceeds a size of 2^16.
Let D be an instance of this data structure. We store (value, tag) in the array list D[value >> 8] as (static_cast<std::uint16_t>(value) << 8) + tag.
The idea is basically that the data is paged. The most-significant 3 bytes of value determine the page, and then the least-significant byte of value and the single byte of tag are stored in the page.
This should exploit the structure of your data very efficiently because, assuming each page is holding many values, you're using 2 bytes per item.
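For concreteness, here is a minimal sketch of that paged structure (the class and method names are mine, purely illustrative); each page is a sorted std::vector<std::uint16_t> searched with binary search:

// Page key = top 3 bytes of value; each page packs the low byte of value with the tag.
#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

class SeenSet {
public:
    void add(std::uint32_t value, std::uint8_t tag) {
        std::vector<std::uint16_t>& page = pages_[value >> 8];
        const std::uint16_t item = pack(value, tag);
        auto it = std::lower_bound(page.begin(), page.end(), item);
        if (it == page.end() || *it != item)
            page.insert(it, item);                    // keep the page sorted and distinct
    }

    bool contains(std::uint32_t value, std::uint8_t tag) const {
        auto it = pages_.find(value >> 8);
        return it != pages_.end() &&
               std::binary_search(it->second.begin(), it->second.end(), pack(value, tag));
    }

    void remove(std::uint32_t value, std::uint8_t tag) {
        auto it = pages_.find(value >> 8);
        if (it == pages_.end()) return;
        std::vector<std::uint16_t>& page = it->second;
        auto pos = std::lower_bound(page.begin(), page.end(), pack(value, tag));
        if (pos != page.end() && *pos == pack(value, tag)) page.erase(pos);
        if (page.empty()) pages_.erase(it);           // drop empty pages entirely
    }

private:
    // low byte of value in the high half, tag in the low half
    static std::uint16_t pack(std::uint32_t value, std::uint8_t tag) {
        return static_cast<std::uint16_t>(((value & 0xFFu) << 8) | tag);
    }

    std::map<std::uint32_t, std::vector<std::uint16_t>> pages_;
};

Because std::map keeps the pages ordered by value, evicting the oldest entries (the smallest keys) is also easy when you only want to keep the last few thousand.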
I am trying to understand protocol buffers. Here is a sample; what I am not able to understand is how bytes are being used in the following messages. I don't know what the numbers 1, 2, 3 are used for.
message Point {
  required int32 x = 1;
  required int32 y = 2;
  optional string label = 3;
}
message Line {
  required Point start = 1;
  required Point end = 2;
  optional string label = 3;
}
message Polyline {
  repeated Point point = 1;
  optional string label = 2;
}
I read the following paragraph in the Google protobuf documentation but am not able to understand what is being said here. Can anyone help me understand how bytes are being used to store the info?
The " = 1", " = 2" markers on each element identify the unique "tag" that field uses in the binary encoding. Tag numbers 1-15 require one less byte to encode than higher numbers, so as an optimization you can decide to use those tags for the commonly used or repeated elements, leaving tags 16 and higher for less-commonly used optional elements.
The general form of a protobuf message is that it is a sequence of pairs of the form:
field header
payload
For your question, we can largely forget about the payload - that isn't the bit that relates to the 1/2/3 and the <=16 restriction - all of that is in the field header. The field header is a "varint" encoded integer; "varint" uses the most-significant-bit as an optional continuation bit, so small values (<=127, assuming unsigned and not zig-zag) require one byte to encode - larger values require multiple bytes. Or in other words, you get 7 useful bits to play with before you need to set the continuation bit, requiring at least 2 bytes.
However! The field header itself is composed of two things:
the wire-type
the field-number / "tag"
The wire-type is the lowest 3 bits, and indicates the fundamental format of the payload - "length-delimited", "64-bit", "32-bit", "varint", "start-group", "end-group". That means that of the 7 useful bits we had, only 4 are left; 4 bits is enough to encode field-numbers up to 15. This is why field-numbers 1-15 are suggested (as an optimisation) for your most common elements.
In your question, the 1 / 2 / 3 is the field-number; at the time of encoding this is left-shifted by 3 and composed with the payload's wire-type; then this composed value is varint-encoded.
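To see those numbers in action, here is a small hand-rolled sketch (not using the protobuf library) that computes the header byte for each field of Point, plus the two-byte header that a hypothetical field 16 would need:

#include <cstdio>

int main() {
    // A field header is (field_number << 3) | wire_type, then varint-encoded.
    // Wire type 0 = varint payload (int32), wire type 2 = length-delimited (string).
    std::printf("x     (field 1, wire type 0): 0x%02x\n", (1u << 3) | 0u);  // 0x08
    std::printf("y     (field 2, wire type 0): 0x%02x\n", (2u << 3) | 0u);  // 0x10
    std::printf("label (field 3, wire type 2): 0x%02x\n", (3u << 3) | 2u);  // 0x1a

    // Field 16 no longer fits in the 4 bits left beside the wire type:
    // (16 << 3) | 0 = 128, and a varint needs two bytes once the value reaches 128.
    unsigned key = (16u << 3) | 0u;
    std::printf("field 16: 0x%02x 0x%02x\n", (key & 0x7fu) | 0x80u, key >> 7);  // 0x80 0x01
    return 0;
}

Running it prints 0x08, 0x10, and 0x1a for the three Point fields - exactly the single-byte headers the 1-15 guidance is about - and 0x80 0x01 for field 16.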
Protobuf stores the messages like a map from an id (the =1, =2, which they call tags) to the actual value. This makes it easier to extend than if it transferred data more like a struct with fixed offsets. So a message Point, for instance, would look something like this at a high level:
1 -> 100,
2 -> 500
Which then is interpreted as x=100, y=500 and label=not set. On a lower level, protobuf serializes this tag-value mapping in a highly compact format, which among other things, stores integers with variable-length encoding. The paragraph you quoted just highlights exactly this in the case of tags, which can be stored more compactly if they are < 16, but the same for instance holds for integer values in your protobuf definition.