I want to ask about the encoding described in the Google Protocol Buffers documentation:

https://developers.google.com/protocol-buffers/docs/encoding?hl=eng#structure
I understand why the MSB is dropped from the "value" bytes.
However, I don't understand why the MSB is dropped on the "key" side:
? 000 1000: key
1 001 0110: value (MSB: 1)
0 000 0001: value (MSB: 0 = end)
Since the key is followed by more bytes, I would expect its MSB to be 1 (1 000 1000).
Or, since the key is only one byte, maybe there is no reason to have an MSB at all (0000 1000).
But the MSB is dropped, and it is 0.
Why is the key's MSB 0?

You need to think of the key, in isolation, as an integer, and varint encode that integer by itself using varint encoding. The integer shown (b1000) only needs 4 bits, so in varint encoding that is a single byte with a 0 MSB. But keys can be much much larger. Think of field number 1708 - the number (not including the wire type) is b11010101100 - so that's 11 bits for the number, plus 3 bits for the wire type, so we have 14 bits of data to encode. Varint is a 7-bit encoding, so we can store that in 2 bytes: the first will have the MSB 1, and the second will have the MSB 0.
Or in other words: any MSB when discussing the key is there to help you find where the key stops being encoded. It has nothing to do with the payload that comes afterwards.
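To make that concrete, here is a minimal sketch (not the protobuf library itself, just an illustration of the varint rule described above) that encodes the key for field 1 and for field 1708, both with wire type 0:
#include <stdint.h>
#include <stdio.h>

/* Varint-encode an unsigned value into buf and return the number of bytes
   written. Each output byte carries 7 payload bits; the MSB is 1 while more
   bytes follow and 0 on the final byte. */
static size_t encode_varint(uint64_t value, uint8_t *buf)
{
    size_t n = 0;
    do {
        uint8_t byte = value & 0x7f;
        value >>= 7;
        if (value != 0)
            byte |= 0x80;          /* more bytes follow */
        buf[n++] = byte;
    } while (value != 0);
    return n;
}

int main(void)
{
    uint8_t buf[10];

    /* Field 1, wire type 0: key = (1 << 3) | 0 = 8 -> fits in 7 bits,
       so a single byte with MSB 0 (0000 1000, as in the question). */
    size_t n = encode_varint((1u << 3) | 0u, buf);
    printf("field 1:    %zu byte(s): 0x%02x\n", n, buf[0]);

    /* Field 1708, wire type 0: key = (1708 << 3) | 0 = 13664 -> 14 bits,
       so two bytes: the first with MSB 1, the second with MSB 0. */
    n = encode_varint((1708u << 3) | 0u, buf);
    printf("field 1708: %zu byte(s): 0x%02x 0x%02x\n", n, buf[0], buf[1]);
    return 0;
}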

Related

Questions about websocket framing

According to the RFC 6455 specification for WebSockets, the data frame structure is as follows:
frame-fin ; 1 bit in length
frame-rsv1 ; 1 bit in length
frame-rsv2 ; 1 bit in length
frame-rsv3 ; 1 bit in length
frame-opcode ; 4 bits in length
frame-masked ; 1 bit in length
frame-payload-length ; either 7, 7+16,
; or 7+64 bits in
; length
[ frame-masking-key ] ; 32 bits in length
frame-payload-data ; n*8 bits in
; length, where
; n >= 0
So would the minimum length of a byte array to hold a frame be 224 bytes (56 bits)? As I read on the internet, to represent a bit in a byte array we need 4 bytes (1000).
How do I mask data? And what data should I mask? Only the frame-payload-data, or the whole frame except the masking key?
The frame-masking-key field is only present when the frame is masked, which is only done for frames sent by a client to a server. And the frame-payload-data is optional; a frame may be empty, containing no data. Therefore the minimum length of a frame in the client-to-server direction is (1+1+1+1+4+1+7+32)=48 bits or 6 bytes, and the minimum length of a frame in the server-to-client direction is (1+1+1+1+4+1+7)=16 bits or 2 bytes.
Those would be frames that carry no payload. Obviously frames that carry payload data will require additional space.
As I read on the internet, to represent a bit in a byte array we need 4 bytes (1000).
Umm, no, each byte holds 8 bits. It might be convenient within a program to use larger data units to represent bit values, but that is completely independent of the format that is used in the actual frame.
How do I mask data? And what data should I mask? Only the frame-payload-data, or the whole frame except the masking key?
You mask by XOR-ing the frame-masking-key over the frame-payload-data. This is described in section 5.3 of RFC 6455.
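Concretely, each payload octet is XOR-ed with one octet of the 4-byte masking key, cycling through the key. A minimal sketch (the function name is mine):
#include <stddef.h>
#include <stdint.h>

/* Mask (or unmask -- the operation is its own inverse) a WebSocket payload
   in place: octet i of the payload is XOR-ed with octet (i mod 4) of the
   masking key, per RFC 6455 section 5.3. */
void ws_mask_payload(uint8_t *payload, size_t len, const uint8_t key[4])
{
    for (size_t i = 0; i < len; i++)
        payload[i] ^= key[i % 4];
}
Because XOR is its own inverse, the server applies exactly the same operation to recover the original payload.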

What is the maximum size of LENGTH field in SNMP frames?

Implementing an SNMP v1 decoder and working with some Wireshark captures, I can see that sometimes the length field of a BER element is coded with one byte and other times with two bytes.
Reading the BER rules: if the most significant bit is set to 1, then the length value must be extended with the next byte to represent values bigger than 255.
So, if the first byte is 0x81 and the next byte is 0x9F, then the extended length field should take the value 0x9F... OK.
My question is:
If the second byte is 0x9F, its most significant bit is 1 again.
Wireshark only takes two bytes for this length.
Why, in this case, is the size of the length field only two bytes?
Are length fields restricted to 2 bytes?
Thanks.
According to the BER rules, the length field can be multiple bytes (many more than 2):
http://en.wikipedia.org/wiki/KLV
Length Field
Following the bytes for the Key are bytes for the Length field, which will tell you how many bytes follow the length field and make up the Value portion. There are four kinds of encoding for the Length field: 1-byte, 2-byte, 4-byte and Basic Encoding Rules (BER). The 1-, 2-, and 4-byte variants are pretty straightforward: make an unsigned integer out of the bytes, and that integer is the number of bytes that follow.
BER length encoding is a bit more complicated but the most flexible. If the first byte in the length field does not have the high bit set (0x80), then that single byte represents an integer between 0 and 127 and indicates the number of Value bytes that immediately follow. If the high bit is set, then the lower seven bits indicate how many bytes follow that themselves make up a length field.
For example, if the first byte of a BER length field is binary 10000010, that would indicate that the next two bytes make up an integer that then indicates how many Value bytes follow. Therefore a total of three bytes were taken up to specify a length.
"If second byte is 0x9F, the more significative bit is 1 again." Is that a question? Only the first byte in the bytes determines how many following bytes are used to determine the length. So you never need to care about the most significant bit of the second byte. Never.
How Wireshark represents the bytes is not very critical. Unless Wireshark shows you a wrong value for length, you should not pay much attention to it.
In the ASN.1 BER length encoding rules:
a) If the number of content octets is <= 127, then the length octet encodes the number of content octets.
b) Otherwise, the most significant bit of the first length octet is set to 1 and the other 7 bits describe the number of length octets following.
c) The following length octets encode the length of the content octets in big-endian byte order.
Example:
Length 126: 01111110
Length 127: 01111111
Length 128: 10000001 10000000
Length 1031: 10000010 00000100 00000111
Number | MSB of 1st Byte | Bytes to represent the Number | BE Binary
128 | 1 | 0000001 (=1) | 10000000 (=128+0+0+0+0+0+0+0)
1031 | 1 | 0000010 (=2) | 00000100 00000111 (=1024+0+0+0+0+0+0+4+2+1)
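A rough decoding sketch, assuming the whole length fits in a uint32_t (more than enough for SNMP over UDP) and that the buffer already contains the complete length field (the names are mine):
#include <stddef.h>
#include <stdint.h>

/* Decode a BER length field starting at buf. Returns the number of length
   octets consumed and stores the decoded length in *length; returns 0 for
   the indefinite form (0x80) or if the length is too large for a uint32_t. */
size_t ber_decode_length(const uint8_t *buf, uint32_t *length)
{
    if ((buf[0] & 0x80) == 0) {          /* short form: 0..127 */
        *length = buf[0];
        return 1;
    }
    size_t num_octets = buf[0] & 0x7f;   /* long form: count of length octets */
    if (num_octets == 0 || num_octets > sizeof(uint32_t))
        return 0;
    uint32_t value = 0;
    for (size_t i = 1; i <= num_octets; i++)
        value = (value << 8) | buf[i];   /* big-endian accumulation */
    *length = value;
    return 1 + num_octets;
}
Feeding it 0x82 0x04 0x07 from the example above yields a length of 1031 and consumes 3 octets; the MSBs of the second and third octets are ordinary data bits and are never inspected.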
I'd add that SNMP (usually) uses UDP datagrams for transport, which are limited to 65535 bytes, that is 0xFFFF. So at most 2 octets are needed to encode the length value.

Ruby - How to represent message length as 2 binary bytes

I'm using Ruby and I'm communicating with a network endpoint that requires the formatting of a 'header' prior to sending the message itself.
The first field in the header must be the message length which is defined as a 2 binary byte message length in network byte order.
For example, my message is 1024 in length. How do I represent 1024 as binary two-bytes?
The standard tools for byte wrangling in Ruby (and Perl and Python and ...) are pack and unpack. Ruby's pack is in Array. You have a length that should be two bytes long and in network byte order, that sounds like a job for the n format specifier:
n | Integer | 16-bit unsigned, network (big-endian) byte order
So if the length is in length, you'd get your two bytes thusly:
two_bytes = [ length ].pack('n')
If you need to do the opposite, have a look at String#unpack:
length = two_bytes.unpack('n').first
See Array#pack.
[1024].pack("n")
This packs the number as the network-order byte sequence \x04\x00.
The way this works is that each byte is 8 binary bits. 1024 in binary is 10000000000. If we break this up into octets of 8 (8 bits per byte), we get: 00000100 00000000.
A byte can represent (2 states) ^ (8 positions) = 256 unique values. However, since we don't have 256 ascii-printable characters, we visually represent bytes as hexadecimal pairs, since a hexadecimal digit can represent 16 different values and 16 * 16 = 256. Thus, we can take the first byte, 00000100 and break it into two hexadecimal quads as 0000 0100. Translating binary to hex gives us 0x04. The second byte is trivial, as 0000 0000 is 0x00. This gives us our hexadecimal representation of the two-byte string.
It's worth noting that because you are constrained to a 2-byte (16-bit) header, you are limited to a maximum value of 11111111 11111111, or 2^16 - 1 = 65535 bytes. Any message larger than that cannot accurately represent its length in two bytes.
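For comparison, this is essentially what pack('n') does at the byte level; a small C sketch, purely for illustration:
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint16_t length = 1024;
    uint8_t header[2];

    /* Network byte order (big-endian): most significant byte first. */
    header[0] = (uint8_t)(length >> 8);    /* 0x04 */
    header[1] = (uint8_t)(length & 0xff);  /* 0x00 */

    printf("%02x %02x\n", header[0], header[1]);   /* prints "04 00" */
    return 0;
}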

Huffman code tables

I don't understand what the Huffman tables in a JPEG contain; could someone explain this to me?
Thanks
Huffman encoding is a variable-length data compression method. It works by assigning the most frequent values in an input stream to the encodings with the smallest bit lengths.
For example, the input Seems every eel eeks elegantly. may encode the letter e as binary 1 and all other letters as various other longer codes, all starting with 0. That way, the resultant bit stream would be smaller than if every letter was a fixed size. By way of example, let's examine the quantities of each character and construct a tree that puts the common ones at the top.
Letter Count
------ -----
e 10
<SPC> 4
l 3
sy 2
Smvrkgant. 1
<EOF> 1
The end of file marker EOF is there since you generally have to have a multiple of eight bits in your file. It's to stop any padding at the end from being treated as a real character.
__________#__________
________________/______________ \
________/________ ____\____ e
__/__ __\__ __/__ \
/ \ / \ / \ / \
/ \ / \ / SPC l s
/ \ / \ / \ / \ / \
y S m v / k g \ n t
/\ / \
r . a EOF
Now this isn't necessarily the most efficient tree but it's enough to establish how the encodings are done. Let's first look at the uncompressed data. Assuming an eight-bit encoding, those thirty-one characters (we don't need the EOF for the uncompressed data) are going to take up 248 bits.
But, if you use the tree above to locate the characters, outputting a zero bit if you take the left sub-tree and a one bit if you take the right, you get the following:
Section Encoding
---------- --------
Seems<SPC> 00001 1 1 00010 0111 0101 (20 bits)
every<SPC> 1 00011 1 001000 00000 0101 (22 bits)
eel<SPC> 1 1 0110 0101 (10 bits)
eeks<SPC> 1 1 00101 0111 0101 (15 bits)
elegantly 1 0110 1 00110 001110 01000 01001 0110 00000 (36 bits)
.<EOF> 001001 001111 (12 bits)
That gives a grand total of 115 bits, rounded up to 120 since it needs to be a multiple of a byte, but that's still about half the size of the uncompressed data.
Now that's usually not worth it for a small file like this, since you have to add the space taken up by the actual tree itself(a), otherwise you cannot decode it at the other end. But certainly, for larger files where the distribution of characters isn't even, it can lead to impressive savings in space.
So, after all that, the Huffman tables in a JPEG are simply the tables that allow you to uncompress the stream into usable information.
The encoding process for JPEG consists of a few different steps (color conversion, chroma resolution reduction, block-based discrete cosine transforms, and so on) but the final step is a lossless Huffman encoding on each block which is what those tables are used to reverse when reading the image.
(a) Probably the best case for minimal storage of this table would be something like:
Size of length section (8-bits) = 3 (longest bit length of 6 takes 3 bits)
Repeated for each byte:
Actual length (3 bits, holding value between 1..6 inclusive)
Encoding (n bits, where n is the actual length)
Byte (8 bits)
End of table marker (3 bits) = 0 to distinguish from actual length above
For the text above, that would be:
00000011 8 bits
n bits byte
--- ------ -----
001 1 'e' 12 bits
100 0101 <SPC> 15 bits
101 00001 'S' 16 bits
101 00010 'm' 16 bits
100 0111 's' 15 bits
101 00011 'v' 16 bits
110 001000 'r' 17 bits
101 00000 'y' 16 bits
101 00101 'k' 16 bits
100 0110 'l' 15 bits
101 00110 'g' 16 bits
110 001110 'a' 17 bits
101 01000 'n' 16 bits
101 01001 't' 16 bits
110 001001 '.' 17 bits
110 001111 <EOF> 17 bits
000 3 bits
That makes the table 264 bits which totally wipes out the savings from compression. However, as stated, the impact of the table becomes far less as the input file becomes larger and there's a way to avoid the table altogether.
That way involves the use of another variant of Huffman, called Adaptive Huffman. This is where the table isn't actually stored in the compressed data.
Instead, during compression, the table starts with just EOF and a special bit sequence meant to introduce a new real byte into the table.
When introducing a new byte into the table, you would output the introducer bit sequence followed by the full eight bits of that byte.
Then, after each byte is output and the counts updated, the table/tree is rebalanced based on the new counts to be the most space-efficient. (The rebalancing may be deferred to improve speed; you just have to ensure the same deferral happens during decompression. For example, rebalance every time you add a byte for the first 1K of input, then every 10K of input after that, assuming you've added new bytes since the last rebalance.)
This means that the table itself can be built in exactly the same way at the other end (decompression), starting with the same minimal table with just the EOF and introducer sequence.
During decompression, when you see the introducer sequence, you can add the byte following it (the next eight bits) to the table with a count of zero, output the byte, then adjust the count and re-balance (or defer as previously mentioned).
That way, you do not have to have the table shipped with the compressed file. This, of course, costs a little more time during compression and decompression in that you're periodically rebalancing the table but, as with most things in life, it's a trade-off.
The DHT marker doesn't specify directly which symbol is associated with a code. It contains a vector with counts of how many codes there are of each length. After that, it contains a vector with the symbol values.
So when you want to decode, you have to generate the Huffman codes from the first vector and then associate every code with a symbol in the second vector.
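A rough sketch of that first step, assuming the 16 per-length counts and the symbol list have already been pulled out of the DHT segment (the function name is mine):
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Rebuild the Huffman codes from a DHT segment: counts[i] holds the number
   of codes of bit length i+1 (16 entries) and symbols[] lists the symbol
   values in order of increasing code length. Codes are assigned canonically:
   count up within a length, then shift left by one when the length grows. */
void build_codes_from_dht(const uint8_t counts[16], const uint8_t *symbols)
{
    uint16_t code = 0;
    size_t sym = 0;

    for (int length = 1; length <= 16; length++) {
        for (int i = 0; i < counts[length - 1]; i++) {
            printf("symbol 0x%02x -> %2d-bit code 0x%04x\n",
                   symbols[sym], length, code);
            sym++;
            code++;
        }
        code <<= 1;   /* codes at the next length are one bit longer */
    }
}
A real decoder would store the (length, code, symbol) triples in a lookup structure instead of printing them.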

What is the best way of sending the data to serial port?

This is related to microcontrollers, but I thought I'd post it here because it is a problem with algorithms and data types and not with any hardware stuff. I'll explain the problem so that someone who doesn't have any hardware knowledge can still participate :)
In the microcontroller there is an analog-to-digital converter with 10-bit resolution. (It will output a value between 0 and 1023.)
I need to send this value to a PC using the serial port.
But you can only write 8 bits at once (you need to write bytes); that is a limitation of the microcontroller.
So in the above case I need to send at least 2 bytes.
My PC application just reads a sequence of numbers for plotting, so it should capture two consecutive bytes and build the number back. But then we will need a delimiter character as well; and since the delimiter character also has a value between 0 and 255, it could get mixed up with the data.
So what is the simplest way to do this? Should I send the values as a sequence of chars?
Ex: 1023 = "1""0""2""3" vs. "Char(255)Char(4)"
In summary, I need to send a sequence of 10-bit numbers over serial in the fastest way. :)
You need to send 10 bits, and because you send a byte at a time, you have to send 16 bits. The big question is how much is speed a priority, and how synchronised are the sender and receiver? I can think of 3 answers, depending on these conditions.
Regular sampling, unknown join point
If the device is running all the time, you aren't sure when you are going to connect (you could join at any point in the sequence), and the sampling rate is slower than the communication speed so you don't care much about size, then I'd probably do the following. Suppose you are trying to send the ten bits abcdefghij (each letter one bit).
I'd send pq0abcde then pq1fghij, where p and q are error checking bits. This way:
no delimiter is needed (you can tell which byte you are reading by the 0 or 1)
you can definitely spot any 1 bit error, so you know about bad data
I'm struggling to find a good two-bit error-correcting code, so I guess I'd just make p a parity bit for bits 2, 3 and 4 (the 0, a, b above) and q a parity bit for bits 5, 6 and 7 (c, d, e above). This might be clearer with an example.
Suppose I want to send 714 = 1011001010.
Split in 2 10110 , 01010
Add bits to indicate first and second byte 010110, 101010
calculate parity for each half: p0=par(010)=1, q0=par(110)=0, p1=par(101)=0, q1=par(010)=1
bytes are then 10010110, 01101010
You then can detect a lot of different error conditions, quickly check which byte you are being sent if you lose synchronisation, and none of the operations take very long in a microcontroller (I'd do the parity with an 8 entry lookup table).
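A minimal sketch of the encoder side of this scheme (the function names and the even-parity convention follow the worked example above; nothing here comes from a library):
#include <stdint.h>

/* Even parity over the low 3 bits of x: returns 1 if they contain an odd
   number of 1s, so that adding the parity bit makes the total even. */
static uint8_t parity3(uint8_t x)
{
    x &= 0x07;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1;
}

/* Encode a 10-bit sample as two bytes of the form  p q s d d d d d,
   where s is 0 for the first (high) half and 1 for the second (low) half,
   p covers (s, d4, d3) and q covers (d2, d1, d0). */
void encode_sample(uint16_t sample, uint8_t out[2])
{
    for (int half = 0; half < 2; half++) {
        uint8_t data = (half == 0) ? (uint8_t)((sample >> 5) & 0x1f)  /* abcde */
                                   : (uint8_t)(sample & 0x1f);        /* fghij */
        uint8_t six  = (uint8_t)((half << 5) | data);   /* s d4 d3 d2 d1 d0 */
        uint8_t p    = parity3(six >> 3);               /* parity of s, d4, d3 */
        uint8_t q    = parity3(six);                    /* parity of d2, d1, d0 */
        out[half]    = (uint8_t)((p << 7) | (q << 6) | six);
    }
}
Running this on 714 reproduces the two bytes worked out above, 10010110 and 01101010.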
Dense data, known join point
If you know that the reader starts at the same time as the writer, just send the 4 ten-bit values as 5 bytes. If you always read 5 bytes at a time then there are no problems. If you want even more space saving, and already have good sample data, I'd compress using Huffman coding.
Dense data, unknown join point
In 7 bytes you can send 5 ten bit values with 6 spare bits. Send 5 values like this:
byte 0: 0 (7 bits)
byte 1: 1 (7 bits)
byte 2: 1 (7 bits)
byte 3: 1 (7 bits)
byte 4: 0 (7 bits)
byte 5: 0 (7 bits)
byte 6: (8 bits)
Then whenever you see 3 1's in a row for the most significant bit, you know you have bytes 1, 2 and 3. This idea wastes 1 bit in 56, so could be made even more efficient, but you'd have to send more data at a time. Eg (5 consecutive ones, 120 bits sent in 16 bytes):
byte 0: 0 (7 bits) 7
byte 1: 1 (7 bits) 14
byte 2: 1 (7 bits) 21
byte 3: 1 (7 bits) 28
byte 4: 1 (7 bits) 35
byte 5: 1 (7 bits) 42
byte 6: 0 (7 bits) 49
byte 7: (8 bits) 57
byte 8: (8 bits) 65
byte 9: (8 bits) 73
byte 10: (8 bits) 81
byte 11: 0 (7 bits) 88
byte 12: (8 bits) 96
byte 13: (8 bits) 104
byte 14: (8 bits) 112
byte 15: (8 bits) 120
This is quite a fun problem!
The best method is to convert the data to an ASCII string and send it that way - it makes debugging a lot easier and it avoids various communication issues (special meaning of certain control characters etc).
If you really need to use all the available bandwidth though then you can pack 4 10 bit values into 5 consecutive 8 bit bytes. You will need to be careful about synchronization.
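A sketch of that 4-into-5 packing (the names are mine; this assumes the "known join point" case, since the output carries no sync marker):
#include <stdint.h>

/* Pack four 10-bit samples (0..1023) into five bytes, most significant
   bits first. The receiver must already be byte-aligned with the sender,
   because nothing in the output marks where a group of five bytes starts. */
void pack4x10(const uint16_t in[4], uint8_t out[5])
{
    uint64_t bits = 0;
    for (int i = 0; i < 4; i++)
        bits = (bits << 10) | (in[i] & 0x3ff);      /* 40 bits total */

    for (int i = 0; i < 5; i++)
        out[i] = (uint8_t)(bits >> (32 - 8 * i));   /* emit bytes, MSB first */
}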
Since you specified "the fastest way", I think expanding the numbers to ASCII is ruled out.
In my opinion a good compromise between code simplicity and performance can be obtained by the following encoding:
Two 10-bit values are encoded in 3 bytes like this.
first 10-bit value, bits := abcdefghij
second 10-bit value, bits := klmnopqrst
Bytes to encode:
1abcdefg
0hijklmn
0_opqrst
There is one more bit (_) available that could be used as a parity bit over all 20 bits for error checking, or just set to a fixed value.
Some example code (puts 0 at the position _):
#include <assert.h>
#include <inttypes.h>

void write_byte(uint8_t byte);   /* writes a byte to the serial port */

void
encode(uint16_t a, uint16_t b)
{
    write_byte(((a >> 3) & 0x7f) | 0x80);               /* 1abcdefg */
    write_byte(((a & 0x07) << 4) | ((b >> 6) & 0x0f));  /* 0hijklmn */
    write_byte(b & 0x3f);                               /* 00opqrst */
}

uint8_t read_byte(void);         /* reads a byte from the serial port */

void
decode(uint16_t *a, uint16_t *b)
{
    uint16_t x;
    while (((x = read_byte()) & 0x80) == 0) {}  /* sync on the marker bit */
    *a = (x & 0x7f) << 3;
    x = read_byte();
    assert((x & 0x80) == 0);     /* put better error handling here */
    *a |= (x >> 4) & 0x07;
    *b = (x & 0x0f) << 6;
    x = read_byte();
    assert((x & 0xc0) == 0);     /* put better error handling here */
    *b |= x;
}
I normally use a start byte and a checksum, and in this case a fixed length, so send 4 bytes. The receiver can look for the start byte, and if the next three bytes add up to a known quantity then it is a good packet; take out the middle two bytes. If not, keep looking. The receiver can always re-sync, and it doesn't waste the bandwidth of ASCII. ASCII is your other option: a start byte that is not a number, and perhaps four numbers for decimal. Decimal is definitely not fun in a microcontroller, so start with something non-hex like "X" for example, and then three bytes with the hex ASCII values for your number. Search for the X, examine the next three bytes, and hope for the best.
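A sketch of that fixed-length packet (the start byte value and the additive checksum are arbitrary choices of mine; the idea only requires that the receiver can recognise the start byte and verify that the remaining bytes add up to a known quantity):
#include <stdint.h>

#define START_BYTE 0xA5   /* arbitrary marker value for this sketch */

/* Build a 4-byte packet: [ start ] [ high byte ] [ low byte ] [ checksum ]. */
void build_packet(uint16_t sample, uint8_t out[4])
{
    out[0] = START_BYTE;
    out[1] = (uint8_t)(sample >> 8);       /* top 2 bits of the 10-bit value */
    out[2] = (uint8_t)(sample & 0xff);     /* low 8 bits */
    out[3] = (uint8_t)(out[1] + out[2]);   /* simple additive checksum */
}

/* Returns 1 and stores the sample if buf holds a valid packet, 0 otherwise;
   on failure the receiver slides forward one byte and tries again. */
int parse_packet(const uint8_t buf[4], uint16_t *sample)
{
    if (buf[0] != START_BYTE)
        return 0;
    if ((uint8_t)(buf[1] + buf[2]) != buf[3])
        return 0;
    *sample = (uint16_t)(((uint16_t)buf[1] << 8) | buf[2]);
    return 1;
}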

Resources