Endianness: how will these bytes be represented?

Given the hexadecimal value 0x12345678, copy the bytes to memory using big-endian order.
Address Content
0x00400003 0x78
0x00400002 0x56
0x00400001 0x34
0x00400000 0x12
Is that right?

In big-endian, the most significant byte (0x12) comes first, at the lowest address, and the rest follow in decreasing order of significance.
Your solution is right; laid out by increasing address it looks like this:
00400000|00400001|00400002|00400003
--------+--------+--------+--------
12 | 34 | 56 | 78
If you had to arrange the bytes in little endian, the arrangement would be reversed:
00400000|00400001|00400002|00400003
--------+--------+--------+--------
78 | 56 | 34 | 12
Note that in this arrangement, only the order of bytes is reversed, but the order of nibbles (4-bit regions = hexadecimal digits) remains the same.
You can read more on the Wikipedia page about endianness.
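If you want to check this programmatically, here is a minimal C sketch (the helper name store_be32 is made up for illustration) that writes a 32-bit value in big-endian order regardless of the host CPU's native endianness:

#include <stdint.h>
#include <stdio.h>

/* Store a 32-bit value in big-endian byte order, independent of
   the host CPU's native endianness. */
static void store_be32(uint8_t *mem, uint32_t value) {
    mem[0] = (uint8_t)(value >> 24); /* most significant byte first */
    mem[1] = (uint8_t)(value >> 16);
    mem[2] = (uint8_t)(value >> 8);
    mem[3] = (uint8_t)(value);       /* least significant byte last */
}

int main(void) {
    uint8_t mem[4];
    store_be32(mem, 0x12345678);
    for (int i = 0; i < 4; i++)
        printf("0x%08x: 0x%02x\n", 0x00400000 + i, mem[i]);
    return 0;
}

This prints 0x12, 0x34, 0x56, 0x78 at the four increasing addresses, matching the big-endian table above.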

Related

Character encoding - how UTF-8 handles characters

So, I know that when we type characters, each character maps to a number in a character set, and then that number is transformed into a binary format so a computer can understand it. The way that number is transformed into binary (how many bits get allocated) depends on the character encoding.
So, if I type L, it represents 76. Then 76 gets transformed into a 1-byte binary format because of, let's say, UTF-8.
Now, I've read the following somewhere:
The Devanagari character क, with code point 2325 (which is 915 in
hexadecimal notation), will be represented by two bytes when using the
UTF-16 encoding (09 15), three bytes with UTF-8 (E0 A4 95), or four
bytes with UTF-32 (00 00 09 15).
So, as you can see, it says three bytes with UTF-8 (E0 A4 95). How is E0 A4 95 bytes? I am asking because I have no idea where E0 A4 95 came from... Why do we need this? If we know the code point is 2325, shouldn't it be enough to know that UTF-8 will need 3 bytes to transform 2325 into binary? Why do we need E0 A4 95, and what is it?
E0 A4 95 is the 3-byte UTF-8 encoding of U+0915. In binary:
E   0      A   4      9   5      (hex)
11100000   10100100   10010101   (binary)
1110xxxx   10xxxxxx   10xxxxxx   (3-byte UTF-8 encoding pattern)
    0000     100100     010101   (data bits)
    00001001   00010101          (data bits regrouped into 8-bit bytes)
       0 9        1 5            (hex)
         U+0915                  (Unicode code point)
The first byte's binary pattern 1110xxxx is a lead byte indicating a 3-byte encoding and carrying 4 bits of data. Continuation bytes start with 10xxxxxx and each provide 6 more bits of data. A 3-byte lead byte is therefore followed by exactly two continuation bytes.
For more information, read the Wikipedia article on UTF-8 and the standard, RFC 3629.
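To make the bit-shuffling concrete, here is a minimal C sketch of the 3-byte case only (the function name is hypothetical; a real encoder would also handle the 1-, 2-, and 4-byte ranges and reject surrogates):

#include <stdint.h>
#include <stdio.h>

/* Encode a code point in the range U+0800..U+FFFF as three
   UTF-8 bytes; this is the case that applies to U+0915. */
static void utf8_encode_3byte(uint32_t cp, uint8_t out[3]) {
    out[0] = 0xE0 | (uint8_t)((cp >> 12) & 0x0F); /* 1110xxxx: top 4 data bits */
    out[1] = 0x80 | (uint8_t)((cp >> 6) & 0x3F);  /* 10xxxxxx: middle 6 bits */
    out[2] = 0x80 | (uint8_t)(cp & 0x3F);         /* 10xxxxxx: low 6 bits */
}

int main(void) {
    uint8_t b[3];
    utf8_encode_3byte(0x0915, b);                 /* Devanagari ka */
    printf("%02X %02X %02X\n", b[0], b[1], b[2]); /* prints E0 A4 95 */
    return 0;
}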

What is the maximum size of the LENGTH field in SNMP frames?

Implementing an SNMP v1 decoder and working with some Wireshark captures, I can see that the length field of a BER element is sometimes coded with one byte and sometimes with two bytes.
Reading the BER rules: if the most significant bit is set to 1, the length must be extended with the next byte to represent values bigger than 127.
So, if the first byte is 0x81 and the next byte is 0x9F, then the extended length field should take the value 0x9F... OK.
My question is:
If the second byte is 0x9F, the most significant bit is 1 again.
Wireshark only takes two bytes for this length.
Why, in this case, is the size of the length field only two bytes?
Are length fields restricted to 2 bytes?
Thanks.
According to the BER rules, the length field can span multiple bytes (well more than 2):
http://en.wikipedia.org/wiki/KLV
Length Field
Following the bytes for the Key are bytes for the
Length field which will tell you how many bytes follow the length
field and make up the Value portion. There are four kinds of encoding
for the Length field: 1-byte, 2-byte, 4-byte and Basic Encoding Rules
(BER). The 1-, 2-, and 4-byte variants are pretty straightforward:
make an unsigned integer out of the bytes, and that integer is the
number of bytes that follow.
BER length encoding is a bit more
complicated but the most flexible. If the first byte in the length
field does not have the high bit set (0x80), then that single byte
represents an integer between 0 and 127 and indicates the number of
Value bytes that immediately follows. If the high bit is set, then the
lower seven bits indicate how many bytes follow that themselves make
up a length field.
For example if the first byte of a BER length field
is binary 10000010, that would indicate that the next two bytes make
up an integer that then indicates how many Value bytes follow.
Therefore a total of three bytes were taken up to specify a length.
"If second byte is 0x9F, the more significative bit is 1 again." Is that a question? Only the first byte in the bytes determines how many following bytes are used to determine the length. So you never need to care about the most significant bit of the second byte. Never.
How Wireshark represents the bytes is not very critical. Unless Wireshark shows you a wrong value for length, you should not pay much attention to it.
In the ASN.1 BER length encoding rules:
a) If the number of content octets is <= 127, a single length octet encodes the number of content octets.
b) Otherwise, the most significant bit of the first length octet is set to 1 and the other 7 bits give the number of length octets that follow.
c) The following length octets encode the length of the content octets in big-endian byte order.
Example:
Length 126: 01111110
Length 127: 01111111
Length 128: 10000001 10000000
Length 1031: 10000010 00000100 00000111
Number | MSB of 1st byte | Length octets following | BE binary
-------+-----------------+-------------------------+------------------------------------------
  128  |        1        | 0000001 (= 1)           | 10000000 (= 128)
 1031  |        1        | 0000010 (= 2)           | 00000100 00000111 (= 1024 + 4 + 2 + 1)
I'd add that SNMP usually uses UDP datagrams for transport, which are limited to 65535 bytes, that is 0xffff. So at most 2 bytes are needed to encode the length.
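Putting the rules together, here is a minimal C sketch of definite-length BER length decoding (the function name is made up; a real decoder should also reject non-minimal encodings):

#include <stddef.h>
#include <stdint.h>

/* Read a BER definite length from buf. Stores the decoded length in
   *out_len and returns the number of bytes consumed, or 0 on error. */
static size_t ber_read_length(const uint8_t *buf, size_t buflen, size_t *out_len) {
    if (buflen < 1) return 0;
    if ((buf[0] & 0x80) == 0) {     /* short form: one octet, 0..127 */
        *out_len = buf[0];
        return 1;
    }
    size_t n = buf[0] & 0x7F;       /* long form: n length octets follow */
    if (n == 0 || n > sizeof(size_t) || buflen < 1 + n) return 0;
    size_t len = 0;
    for (size_t i = 0; i < n; i++)  /* accumulate big-endian */
        len = (len << 8) | buf[1 + i];
    *out_len = len;
    return 1 + n;
}

For the bytes 0x81 0x9F, the first byte says "one length octet follows", so the decoded length is 0x9F = 159; the high bit of the second byte is simply part of the value.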

Ruby - How to represent message length as 2 binary bytes

I'm using Ruby and I'm communicating with a network endpoint that requires the formatting of a 'header' prior to sending the message itself.
The first field in the header must be the message length, defined as a two-byte binary length in network byte order.
For example, my message is 1024 bytes in length. How do I represent 1024 as two binary bytes?
The standard tools for byte wrangling in Ruby (and Perl and Python and ...) are pack and unpack. Ruby's pack is in Array. You have a length that should be two bytes long and in network byte order, that sounds like a job for the n format specifier:
n | Integer | 16-bit unsigned, network (big-endian) byte order
So if the length is in length, you'd get your two bytes thusly:
two_bytes = [ length ].pack('n')
If you need to do the opposite, have a look at String#unpack:
length = two_bytes.unpack('n').first
See Array#pack.
[1024].pack("n")
This packs the number as the network-order byte sequence \x04\x00.
The way this works is that each byte is 8 binary bits. 1024 in binary is 10000000000. If we break this up into octets of 8 (8 bits per byte), we get: 00000100 00000000.
A byte can represent (2 states) ^ (8 positions) = 256 unique values. However, since we don't have 256 printable ASCII characters, we conventionally show bytes as hexadecimal pairs: one hex digit represents 16 values, and 16 * 16 = 256. So we can take the first byte, 00000100, split it into two nibbles, 0000 0100, and translate each to hex, giving 0x04. The second byte is trivial: 0000 0000 is 0x00. This gives us the hexadecimal representation of the two-byte string.
It's worth noting that because you are constrained to a 2-byte (16-bit) header, you are limited to a maximum value of 11111111 11111111, or 2^16 - 1 = 65535 bytes. Any message larger than that cannot accurately represent its length in two bytes.
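For the curious, what pack('n') does can be sketched in C with plain shifts and masks (the helper name pack_n is hypothetical):

#include <stdint.h>
#include <stdio.h>

/* Pack a 16-bit length in network (big-endian) byte order,
   the same result as Ruby's [length].pack('n'). */
static void pack_n(uint16_t length, uint8_t out[2]) {
    out[0] = (uint8_t)(length >> 8);   /* high byte first */
    out[1] = (uint8_t)(length & 0xFF); /* low byte second */
}

int main(void) {
    uint8_t hdr[2];
    pack_n(1024, hdr);
    printf("\\x%02x\\x%02x\n", hdr[0], hdr[1]); /* prints \x04\x00 */
    return 0;
}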

Huffman code tables

I don't understand what the Huffman tables of JPEG contain. Could someone explain this to me?
Thanks
Huffman encoding is a variable-length data compression method. It works by assigning the most frequent values in an input stream to the encodings with the smallest bit lengths.
For example, the input Seems every eel eeks elegantly. may encode the letter e as binary 1 and all other letters as various longer codes, all starting with 0. That way, the resultant bit stream is smaller than if every letter were a fixed size. To see how, let's count the occurrences of each character and construct a tree that puts the common ones at the top.
Letter                 Count
------                 -----
e                         10
<SPC>                      4
l                          3
s y                   2 each
S m v r k g a n t .   1 each
<EOF>                      1
The end of file marker EOF is there since you generally have to have a multiple of eight bits in your file. It's to stop any padding at the end from being treated as a real character.
__________#__________
________________/______________ \
________/________ ____\____ e
__/__ __\__ __/__ \
/ \ / \ / \ / \
/ \ / \ / SPC l s
/ \ / \ / \ / \ / \
y S m v / k g \ n t
/\ / \
r . a EOF
Now this isn't necessarily the most efficient tree but it's enough to establish how the encodings are done. Let's first look at the uncompressed data. Assuming an eight-bit encoding, those thirty-one characters (we don't need the EOF for the uncompressed data) are going to take up 248 bits.
But, if you use the tree above to locate the characters, outputting a zero bit if you take the left sub-tree and a one bit if you take the right, you get the following:
Section Encoding
---------- --------
Seems<SPC> 00001 1 1 00010 0111 0101 (20 bits)
every<SPC> 1 00011 1 001000 00000 0101 (22 bits)
eel<SPC> 1 1 0110 0101 (10 bits)
eeks<SPC> 1 1 00101 0111 0101 (15 bits)
elegantly 1 0110 1 00110 001110 01000 01001 0110 00000 (36 bits)
.<EOF> 001001 001111 (12 bits)
That gives a grand total of 115 bits, rounded up to 120 since it needs to be a multiple of a byte, but that's still about half the size of the uncompressed data.
Now that's usually not worth it for a small file like this, since you have to add the space taken up by the actual tree itself (a); otherwise you cannot decode it at the other end. But certainly, for larger files where the distribution of characters isn't even, it can lead to impressive savings in space.
So, after all that, the Huffman tables in a JPEG are simply the tables that allow you to uncompress the stream into usable information.
The encoding process for JPEG consists of a few different steps (color conversion, chroma resolution reduction, block-based discrete cosine transforms, and so on) but the final step is a lossless Huffman encoding on each block which is what those tables are used to reverse when reading the image.
(a) Probably the best case for minimal storage of this table would be something like:
Size of length section (8-bits) = 3 (longest bit length of 6 takes 3 bits)
Repeated for each byte:
Actual length (3 bits, holding value between 1..6 inclusive)
Encoding (n bits, where n is the actual length)
Byte (8 bits)
End of table marker (3 bits) = 0 to distinguish from actual length above
For the text above, that would be:
00000011 8 bits
n bits byte
--- ------ -----
001 1 'e' 12 bits
100 0101 <SPC> 15 bits
101 00001 'S' 16 bits
101 00010 'm' 16 bits
100 0111 's' 15 bits
101 00011 'v' 16 bits
110 001000 'r' 17 bits
101 00000 'y' 16 bits
101 00101 'k' 16 bits
100 0110 'l' 15 bits
101 00110 'g' 16 bits
110 001110 'a' 17 bits
101 01000 'n' 16 bits
101 01001 't' 16 bits
110 001001 '.' 17 bits
110 001111 <EOF> 17 bits
000 3 bits
That makes the table 264 bits which totally wipes out the savings from compression. However, as stated, the impact of the table becomes far less as the input file becomes larger and there's a way to avoid the table altogether.
That way involves the use of another variant of Huffman, called Adaptive Huffman. This is where the table isn't actually stored in the compressed data.
Instead, during compression, the table starts with just EOF and a special bit sequence meant to introduce a new real byte into the table.
When introducing a new byte into the table, you would output the introducer bit sequence followed by the full eight bits of that byte.
Then, after each byte is output and the counts updated, the table/tree is rebalanced based on the new counts to be the most space-efficient. The rebalancing may be deferred to improve speed; you just have to ensure the same deferral happens during decompression. For example, rebalance every time you add a byte for the first 1K of input, then every 10K of input after that (assuming you've added new bytes since the last rebalance).
This means that the table itself can be built in exactly the same way at the other end (decompression), starting with the same minimal table with just the EOF and introducer sequence.
During decompression, when you see the introducer sequence, you can add the byte following it (the next eight bits) to the table with a count of zero, output the byte, then adjust the count and re-balance (or defer as previously mentioned).
That way, you do not have to have the table shipped with the compressed file. This, of course, costs a little more time during compression and decompression in that you're periodically rebalancing the table but, as with most things in life, it's a trade-off.
The DHT marker doesn't specify directly which symbol is associated with which code. It contains a vector with counts of how many codes there are of each length, followed by a vector of symbol values.
So when you want to decode, you have to generate the Huffman codes from the first vector and then associate every code with a symbol from the second vector.
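As an illustration of that two-vector scheme, here is a small C sketch (names invented, not the actual JPEG library API) that derives canonical Huffman codes from DHT-style data, where counts[i] is the number of codes of length i+1 bits and symbols[] lists the symbol values in order of increasing code length:

#include <stdint.h>
#include <stdio.h>

/* Print each symbol with its canonical Huffman code, derived
   from per-length code counts as stored in a DHT segment. */
static void build_codes(const uint8_t counts[16], const uint8_t *symbols) {
    uint32_t code = 0;
    int k = 0;
    for (int len = 1; len <= 16; len++) {
        for (int i = 0; i < counts[len - 1]; i++, k++, code++) {
            printf("symbol 0x%02X -> ", (unsigned)symbols[k]);
            for (int b = len - 1; b >= 0; b--) /* print code bits, MSB first */
                putchar(((code >> b) & 1) ? '1' : '0');
            putchar('\n');
        }
        code <<= 1; /* codes of the next length continue from here */
    }
}

int main(void) {
    /* hypothetical tiny table: two 2-bit codes, then two 3-bit codes */
    uint8_t counts[16] = { 0, 2, 2 };
    uint8_t symbols[] = { 0x01, 0x02, 0x03, 0x11 };
    build_codes(counts, symbols); /* prints codes 00, 01, 100, 101 */
    return 0;
}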

Types of endianness

What is the difference between the following types of endianness?
byte (8b) invariant big and little endianness
half-word (16b) invariant big and little endianness
word (32b) invariant big and little endianness
double-word (64b) invariant big and little endianness
Are there other types/variations?
There are two approaches to endian mapping: address invariance and data invariance.
Address Invariance
In this type of mapping, the addresses of bytes are always preserved between big and little. This has the side effect of reversing the order of significance (most significant to least significant) of a particular datum (e.g. a 2- or 4-byte word) and therefore the interpretation of data. Specifically, in little-endian, the interpretation of data is least-significant to most-significant byte, whilst in big-endian, the interpretation is most-significant to least-significant. In both cases, the set of bytes accessed remains the same.
Example
Address invariance (also known as byte invariance): the byte address is constant but byte significance is reversed.
Addr Memory
7 0
| | (LE) (BE)
|----|
+0 | aa | lsb msb
|----|
+1 | bb | : :
|----|
+2 | cc | : :
|----|
+3 | dd | msb lsb
|----|
| |
At Addr=0: Little-endian Big-endian
Read 1 byte: 0xaa 0xaa (preserved)
Read 2 bytes: 0xbbaa 0xaabb
Read 4 bytes: 0xddccbbaa 0xaabbccdd
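You can reproduce that table on a real machine with a sketch like this (memcpy is used to avoid pointer-aliasing issues); the 2- and 4-byte results depend on the host's endianness:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    const uint8_t mem[4] = { 0xaa, 0xbb, 0xcc, 0xdd }; /* bytes at +0..+3 */
    uint16_t h;
    uint32_t w;
    memcpy(&h, mem, 2); /* read 2 bytes at offset 0 */
    memcpy(&w, mem, 4); /* read 4 bytes at offset 0 */
    printf("1 byte:  0x%02x\n", mem[0]); /* 0xaa either way */
    printf("2 bytes: 0x%04x\n", h);      /* 0xbbaa on LE, 0xaabb on BE */
    printf("4 bytes: 0x%08x\n", w);      /* 0xddccbbaa on LE, 0xaabbccdd on BE */
    return 0;
}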
Data Invariance
In this type of mapping, the relative byte significance is preserved for a datum of a particular size. There are therefore different data invariant endian mappings for different datum sizes; for example, a 32-bit word invariant mapping would be used for a datum size of 32 bits. The effect of preserving the value of a particular-sized datum is that the byte addresses of the bytes within the datum are reversed between big- and little-endian mappings.
Example
32-bit data invariance (also known as word invariance): The datum is a 32-bit word which always has the value 0xddccbbaa, independent of endianness. However, for accesses smaller than a word, the addresses of the bytes are reversed between big- and little-endian mappings.
Addr Memory
| +3 +2 +1 +0 | <- LE
|-------------------|
+0 msb | dd | cc | bb | aa | lsb
|-------------------|
+4 msb | 99 | 88 | 77 | 66 | lsb
|-------------------|
BE -> | +0 +1 +2 +3 |
At Addr=0: Little-endian Big-endian
Read 1 byte: 0xaa 0xdd
Read 2 bytes: 0xbbaa 0xddcc
Read 4 bytes: 0xddccbbaa 0xddccbbaa (preserved)
Read 8 bytes: 0x99887766ddccbbaa 0x99887766ddccbbaa (preserved)
Example
16-bit data invariance (also known as half-word invariance): The datum is a 16-bit half-word which always has the value 0xbbaa, independent of endianness. However, for accesses smaller than a half-word, the addresses of the bytes are reversed between big- and little-endian mappings.
Addr Memory
| +1 +0 | <- LE
|---------|
+0 msb | bb | aa | lsb
|---------|
+2 msb | dd | cc | lsb
|---------|
+4 msb | 77 | 66 | lsb
|---------|
+6 msb | 99 | 88 | lsb
|---------|
BE -> | +0 +1 |
At Addr=0: Little-endian Big-endian
Read 1 byte: 0xaa 0xbb
Read 2 bytes: 0xbbaa 0xbbaa (preserved)
Read 4 bytes: 0xddccbbaa 0xddccbbaa (preserved)
Read 8 bytes: 0x99887766ddccbbaa 0x99887766ddccbbaa (preserved)
Example
64-bit data invariance (also known as double-word invariance): The datum is a 64-bit word which always has the value 0x99887766ddccbbaa, independent of endianness. However, for accesses smaller than a double-word, the addresses of the bytes are reversed between big- and little-endian mappings.
Addr Memory
| +7 +6 +5 +4 +3 +2 +1 +0 | <- LE
|---------------------------------------|
+0 msb | 99 | 88 | 77 | 66 | dd | cc | bb | aa | lsb
|---------------------------------------|
BE -> | +0 +1 +2 +3 +4 +5 +6 +7 |
At Addr=0: Little-endian Big-endian
Read 1 byte: 0xaa 0x99
Read 2 bytes: 0xbbaa 0x9988
Read 4 bytes: 0xddccbbaa 0x99887766
Read 8 bytes: 0x99887766ddccbbaa 0x99887766ddccbbaa (preserved)
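A practical consequence of the above: converting a datum between the little- and big-endian views is a byte reversal over exactly that datum's width. A minimal sketch (many compilers also provide intrinsics such as __builtin_bswap32 for this):

#include <stdint.h>

/* Reverse byte order for 16-, 32- and 64-bit data. */
static uint16_t swap16(uint16_t x) {
    return (uint16_t)((x >> 8) | (x << 8));
}

static uint32_t swap32(uint32_t x) {
    return ((x >> 24) & 0x000000FF) |
           ((x >> 8)  & 0x0000FF00) |
           ((x << 8)  & 0x00FF0000) |
           ((x << 24) & 0xFF000000);
}

static uint64_t swap64(uint64_t x) {
    /* swap the two 32-bit halves, then the bytes within each half */
    return ((uint64_t)swap32((uint32_t)x) << 32) | swap32((uint32_t)(x >> 32));
}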
There's also middle-endian or mixed-endian. See Wikipedia for details.
The only time I had to worry about this was when writing some networking code in C. Networking typically uses big-endian IIRC. Most languages either abstract the whole thing or offer libraries to guarantee that you're using the right endian-ness though.
Philibert said,
bits were actually inverted
I doubt any architecture would break byte-value invariance. What may need inversion is the order of bit-fields when mapping structs containing them onto data. Such direct mapping relies on compiler specifics outside the C99 standard, though it may still be common. Direct mapping is faster, but it does not comply with the C99 standard, which does not stipulate packing, alignment, or byte order. C99-compliant code should use slower mapping based on values rather than addresses. That is, instead of doing this,
#if LITTLE_ENDIAN
struct breakdown_t {
int least_significant_bit: 1;
int middle_bits: 10;
int most_significant_bits: 21;
};
#elif BIG_ENDIAN
struct breakdown_t {
int most_significant_bits: 21;
int middle_bits: 10;
int least_significant_bit: 1;
};
#else
#error Huh
#endif
uint32_t data = ...;
struct breakdown_t *b = (struct breakdown_t *)&data;
one should write this (and this is how the compiler would generate code anyway, even for the above "direct mapping"):
uint32_t data = ...;
uint32_t least_significant_bit = data & 0x00000001;
uint32_t middle_bits = (data >> 1) & 0x000003FF;
uint32_t most_significant_bits = (data >> 11) & 0x001fffff;
The reason you need to invert the order of bit-fields in each endian-neutral, application-specific data storage unit is that compilers pack bit-fields into bytes at increasing addresses.
The "order of bits" in each byte does not matter, as the only way to extract them is by applying value masks and shifting toward the least-significant-bit or most-significant-bit end. The "order of bits" would only become important in imaginary architectures with a notion of bit addresses. I believe all existing architectures hide this notion in hardware and provide only least- vs. most-significant-bit extraction, which is a notion based on endian-neutral byte values.
Best article I read about endianness "Understanding Big and Little Endian Byte Order".
Actually, I'd describe the endianness of a machine as the order of bytes inside of a word, and not the order of bits.
By "bytes" up there I mean the "smallest unit of memory the architecture can manage individually". So, if the smallest unit is 16 bits long (what in x86 would be called a word) then a 32 bit "word" representing the value 0xFFFF0000 could be stored like this:
FFFF 0000
or this:
0000 FFFF
in memory, depending on endianness.
So, if you have 8-bit endianness, it means that every word consisting of 16 bits will be stored as:
FF 00
or:
00 FF
and so on.
Practically speaking, endianness refers to the way the processor interprets the contents of a given memory location. For example, say we have memory location 0x100 with the following content (hex bytes):
0x100: 12 34 56 78 90 ab cd ef
Reads    Little Endian              Big Endian
 8-bit:  12                         12
16-bit:  34 12                      12 34
32-bit:  78 56 34 12                12 34 56 78
64-bit:  ef cd ab 90 78 56 34 12    12 34 56 78 90 ab cd ef
The two situations where you need to mind endianness are networking code and downcasting with pointers.
TCP/IP specifies that data on the wire should be big-endian. If you transmit types other than byte arrays (like pointers to structures), you should make sure to use the ntoh/hton macros to ensure the data is sent big-endian. If you send from a little-endian processor to a big-endian processor (or vice versa) without converting, the data will be garbled...
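For example, a minimal sketch using the POSIX <arpa/inet.h> conversion functions:

#include <arpa/inet.h> /* htons, htonl, ntohs, ntohl */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint16_t port = 8080;       /* host byte order */
    uint32_t addr = 0xC0A80001; /* 192.168.0.1, host byte order */

    uint16_t port_net = htons(port); /* convert before sending... */
    uint32_t addr_net = htonl(addr);

    /* ...and convert back after receiving */
    printf("port: %u, addr: 0x%08X\n",
           (unsigned)ntohs(port_net), (unsigned)ntohl(addr_net));
    return 0;
}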
Casting issues:
uint32_t *lptr = (uint32_t *)0x100; /* a pointer to some 32-bit location */
uint16_t data;
*lptr = 0x0000FFFF;
data = *((uint16_t *)lptr);
What will be the value of data?
On a big-endian system, it would be 0. On a little-endian system, it would be 0xFFFF.
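Here is a runnable version of the same experiment (note that reading a uint32_t through a uint16_t pointer is formally undefined behavior in C, so this sketch uses memcpy to the same effect):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    uint32_t value = 0x0000FFFF;
    uint16_t data;
    memcpy(&data, &value, sizeof data); /* take the first 2 bytes of value */
    printf("0x%04X\n", (unsigned)data); /* 0xFFFF on little-endian, 0x0000 on big-endian */
    return 0;
}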
13 years ago I worked on a tool portable to both a DEC ALPHA system and a PC. On this DEC ALPHA the bits were actually inverted. That is:
1010 0011
actually translated to
1100 0101
It was almost transparent and seamless in the C code except that I had a bitfield declared like
typedef struct {
    int firstbit:1;
    int middlebits:10;
    int lastbits:21;
} breakdown_t;
that needed to be translated to (using #ifdef conditional compiling)
typedef struct {
    int lastbits:21;
    int middlebits:10;
    int firstbit:1;
} breakdown_t;
As @erik-van-brakel answered in this post, be careful when communicating with certain PLCs: mixed-endian is still alive!
Indeed, I need to communicate with a PLC (from a well-known manufacturer) over the (Modbus-TCP) OPC protocol, and it seems that it returns mixed-endian data on every half word. So mixed-endian is still used by some of the larger manufacturers.
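If it helps, for a 32-bit Modbus register "mixed-endian on every half word" usually means the two big-endian 16-bit halves arrive in swapped order; undoing it is a single 16-bit rotate. A sketch based on that assumption:

#include <stdint.h>

/* Undo a word-swapped ("mixed-endian") 32-bit register value:
   each 16-bit half is big-endian, but the halves are reversed. */
static uint32_t fix_word_swap(uint32_t x) {
    return (x << 16) | (x >> 16); /* exchange the two 16-bit halves */
}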
The basic concept is the ordering of bits:
1010 0011
in little-endian is the same as
0011 1010
in big-endian (and vice-versa).
You'll notice the order changes by grouping, not by individual bit. I don't know of a system, for example, where
1100 0101
would be the "other-endian" version of the first value.

Resources