Big-endian architecture 101 - endianness

I only know what I learned in my Computer Architecture course: little-endian stores the LSB on the right and the MSB on the left, and big-endian is the other way around.
That would seem to mean the byte representation of 18 is 0001 0010, and on big-endian it would be 0100 1000.

No, it is not like that. Say you have 3,168,415,017 as a 32-bit unsigned number (that's 0xBCDA2D29). Written with the most significant byte first, which is how a big-endian machine lays it out in memory, it looks like this:
10111100 11011010 00101101 00101001
A little-endian machine flips the BYTES, but not the BITS inside each byte:
00101001 00101101 11011010 10111100
Note that the bytes are reversed, but the order of bits inside each byte remains the same.
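
To make the byte order tangible, here is a minimal Go sketch (not part of the original post) that writes the same 32-bit value in both byte orders using the standard encoding/binary package; the printBits helper is just for display:

    package main

    import (
        "encoding/binary"
        "fmt"
    )

    // printBits prints each byte of buf as 8 binary digits, in memory order.
    func printBits(label string, buf []byte) {
        fmt.Printf("%-15s", label)
        for _, b := range buf {
            fmt.Printf("%08b ", b)
        }
        fmt.Println()
    }

    func main() {
        const v uint32 = 3168415017 // 0xBCDA2D29

        big := make([]byte, 4)
        little := make([]byte, 4)
        binary.BigEndian.PutUint32(big, v)       // most significant byte at index 0
        binary.LittleEndian.PutUint32(little, v) // least significant byte at index 0

        printBits("big-endian:", big)       // 10111100 11011010 00101101 00101001
        printBits("little-endian:", little) // 00101001 00101101 11011010 10111100
    }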

Related

Convert packed bits to two's complement int in Go

I have a 38-bit two's complement integer packed inside a byte slice. How do I extract this value correctly? If I simply pull the 38 bits out into an int64, the result is wrong for negative numbers (because the upper 26 bits are always all 0).
Here's an example: https://play.golang.org/p/BRvrihYAJ80
Bytes 4 to 8 make up 40 bits; I ignore the first two bits in byte 4 and then OR+shift the rest into an int64. This works for positive numbers, but not for negative ones. The binary is correct and, when interpreted as a 38-bit two's complement integer, comes out (correctly) as -40520517670, but when interpreted (incorrectly) as a plain int64 it comes out as 234357389274.
How can I take these 38 bits and convert them to a 64 bit int correctly?
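
The thread's answer isn't reproduced here. A common way to handle this (shown as a sketch, not the accepted solution; signExtend38 is an illustrative helper name) is to sign-extend the extracted 38-bit value: shift it up so bit 37 becomes the sign bit of the int64, then arithmetic-shift it back down:

    package main

    import "fmt"

    // signExtend38 interprets the low 38 bits of raw as a two's complement
    // number and returns it as an int64.
    func signExtend38(raw uint64) int64 {
        const k = 64 - 38 // number of unused high bits
        // Shift the 38-bit value into the top of the word, then use an
        // arithmetic right shift to copy the sign bit back down.
        return int64(raw<<k) >> k
    }

    func main() {
        // Unsigned value of the 38 bits from the question.
        const raw uint64 = 234357389274

        fmt.Println(int64(raw))        // 234357389274 (wrong for negative values)
        fmt.Println(signExtend38(raw)) // -40520517670 (correct)
    }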

How does UTF-8 represent characters?

I'm reading UTF-8 Encoding, and I don't understand the following sentence.
For characters equal to or below 2047 (hex 0x07FF), the UTF-8
representation is spread across two bytes. The first byte will have
the two high bits set and the third bit clear (i.e. 0xC2 to 0xDF). The
second byte will have the top bit set and the second bit clear (i.e.
0x80 to 0xBF).
If I'm not mistaken, this means UTF-8 uses two bytes to represent 2048 characters. In other words, we need to choose 2048 candidates out of the 2 to the power of 16 possible two-byte sequences to represent these characters.
For characters equal to or below 2047 (hex 0x07FF), the UTF-8
representation is spread across two bytes.
What's the big deal about choosing 2048 out of 65,536? The catch is that UTF-8 explicitly sets boundaries on each byte.
According to the following statements, the number of possibilities is 30 (0xDF - 0xC2 + 0x01) for the first byte and 64 (0xBF - 0x80 + 0x01) for the second byte.
The first byte will have
the two high bits set and the third bit clear (i.e. 0xC2 to 0xDF). The
second byte will have the top bit set and the second bit clear (i.e.
0x80 to 0xBF).
How do 1920 combinations (30 times 64) accommodate 2048 characters?
As you already know, 2047 (0x07FF) contains the raw bits
00000111 11111111
If you look at the bit distribution chart for UTF-8, you will see that 0x07FF falls in the second row (the two-byte range), so it is encoded as 2 bytes using this bit pattern:
110xxxxx 10xxxxxx
Substitute the raw bits into the xs and you get this result:
11011111 10111111 (0xDF 0xBF)
Which is exactly what the description you quoted says:
The first byte will have the two high bits set and the third bit clear (11011111). The second byte will have the top bit set and the second bit clear (10111111).
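
As an illustration of that substitution (not part of the original answer; encode2 is an illustrative helper name), here is a small Go sketch that packs the 11 bits of a code point into the 110xxxxx 10xxxxxx template by hand and prints the result for U+07FF:

    package main

    import "fmt"

    // encode2 packs an 11-bit code point (0x80..0x7FF) into the two-byte
    // UTF-8 template 110xxxxx 10xxxxxx.
    func encode2(cp rune) [2]byte {
        b1 := byte(0xC0 | (cp>>6)&0x1F) // 110 + top 5 of the 11 bits
        b2 := byte(0x80 | cp&0x3F)      // 10  + low 6 bits
        return [2]byte{b1, b2}
    }

    func main() {
        b := encode2(0x07FF)
        fmt.Printf("%08b %08b (0x%X 0x%X)\n", b[0], b[1], b[0], b[1])
        // Output: 11011111 10111111 (0xDF 0xBF)
    }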
Think of it as a container, where the encoding reserves a few bits for its own synchronization, and you get to use the remaining bits.
So for the range in question, the encoding "template" is
110 abcde  10 fghijk
(where I have left a single space to mark the boundary between the template and the value from the code point we want to encode, and two spaces between the actual bytes)
and you get to use the 11 bits abcdefghijk for the value you actually want to transmit.
So for the code point U+07EB you get
0x07 00000111
0xEB 11101011
where the top five zero bits are masked out (remember, we only get 11 -- because the maximum value that the encoding can accommodate in two bytes is 0x07FF. If you have a larger value, the encoding will use a different template, which is three bytes) and so
0x07 = _____ 111 (template: _____ abc)
0xEB = 11 101011 (template: de fghijk)
abc de = 111 11 (where the first three come from 0x07, and the next two from 0xEB)
fghijk = 101011 (the remaining bits from 0xEB)
yielding the value
110 11111  10 101011
aka 0xDF 0xAB.
Wikipedia's article on UTF-8 contains more examples with nicely colored numbers to see what comes from where.
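
To check the worked example against Go's standard library (a sketch added here, not part of the original answer), unicode/utf8 produces the same two bytes for U+07EB:

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    func main() {
        buf := make([]byte, utf8.UTFMax)
        n := utf8.EncodeRune(buf, 0x07EB) // encode U+07EB into buf
        fmt.Printf("% X\n", buf[:n])      // DF AB
    }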
The range 0x00-0x7F, which can be represented in a single byte, contains 128 code points; the two-byte range thus needs to accommodate 1920 = 2048-128 code points.
The raw encoding would allow values in the range 0xC0-0xDF in the first byte, but the first-byte values 0xC0 and 0xC1 are never needed, because they would represent code points that fit in a single byte, and such overlong encodings are invalid per the spec. In other words, the 0x02 in 0xC2 comes from the fact that at least one of the high four bits of the 11 available bits (one of abcd) must be a one bit in order for the value to actually require two bytes.

Difference between an 8-bit number in little-endian and an 8-bit number in big-endian

What is the difference between an 8-bit number in little-endian and an 8-bit number in big-endian on Intel 64 systems?
There is no difference. A sequence of only one byte looks the same when read from left to right or right to left.
I would like to mention that loading a single byte into a larger register might apply the endianness to the register:
With little-endianness the byte would keep its numeric value.
With big-endianness, in a 4-byte register, it would be shifted left by 24 bits.
Though I know of no such instruction, as I'm not very familiar with Intel.

Actual length of input vector in VHDL

I am running HDL code written in VHDL, and I have an input vector with a maximum length of 512 bits. Some of my inputs are shorter than the maximum size, so I want to determine the actual length of every input in order to cut the unwanted zeros at the most significant bits of the input vector. Is there any way to do this?
I guess you are looking for an unambiguous padding method for your data. What I would recommend in your case is an adaptation of ISO/IEC 9797-1 padding method 2, as follows:
For every input (even if it already has 512 bits), you add a leading '1' bit. Then you add leading '0' bits (possibly none) to fill up your vector.
To implement this scheme you would have to enlarge your input vector to 513 bits (because you always have to add at least one bit).
To remove the padding, you simply go through the vector starting at the MSB and find the first '1' bit, which marks the end of the padding.
Example (for 8+1 bit):
input: 10101
padded: 0001 10101
input: 00000000
padded: 1 00000000
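
The original answer gives no code. As a language-neutral illustration of the scheme (a sketch using bit strings rather than VHDL vectors; pad and unpad are illustrative helper names), padding and unpadding could look like this:

    package main

    import (
        "fmt"
        "strings"
    )

    // pad prepends a '1' marker and then leading '0's until the bit string
    // reaches width bits (width must be at least len(bits)+1).
    func pad(bits string, width int) string {
        marked := "1" + bits
        return strings.Repeat("0", width-len(marked)) + marked
    }

    // unpad drops the leading '0's and the '1' marker, recovering the
    // original bits (and thus their actual length).
    func unpad(padded string) string {
        i := strings.IndexByte(padded, '1') // the first '1' ends the padding
        return padded[i+1:]
    }

    func main() {
        const width = 9 // 8 data bits + 1 marker bit, as in the example above

        for _, in := range []string{"10101", "00000000"} {
            p := pad(in, width)
            fmt.Printf("input: %-9s padded: %s  unpadded: %s\n", in, p, unpad(p))
        }
    }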

Not quite understanding Endianness

I understand that 0x12345678 in big endian is 0x12 0x34 0x56 0x78 and 0x78 0x56 0x34 0x12 in little endian.
But what is this needed for? I don't fully understand how it works: it seems deceptively simple.
Is it really as simple as byte order; no other difference?
Your understanding of endianness appears to be correct.
I would like to additionally point out the implicit, conventional nature of endianness and its role in interpreting a byte sequence as some intended value.
0x12345678 in big endian is 0x12 0x34 0x56 0x78 and 0x78 0x56 0x34 0x12 in little endian.
Interestingly, you did not explicitly state what these 0x… entities above are supposed to mean. Most programmers who are familiar with a C-style language are likely to interpret 0x12345678 as a numeric value presented in hexadecimal form, and both 0x12 0x34 0x56 0x78 and 0x78 0x56 0x34 0x12 as byte sequences (where each byte is presented in hexadecimal form, and the left-most byte is located at the lowest memory address). And that is probably exactly what you meant.
Perhaps without even thinking, you have relied on a well-known convention (i.e. the assumption that your target audience will apply the same common knowledge as you would) to convey the meaning of these 0x… entities.
Endianness is very similar to this: a rule that defines for a given computer architecture, data transmission protocol, file format, etc. how to convert between a value and its representation as a byte sequence. Endianness is usually implied: Just as you did not have to explicitly tell us what you meant by 0x12345678, usually it is not necessary to accompany each byte sequence such as 0x12 0x34 0x56 0x78 with explicit instructions how to convert it back to a multi-byte value, because that knowledge (the endianness) is built into, or defined in, a specific computer architecture, file format, data transmission protocol, etc.
As to when endianness is necessary: Basically for all data types whose values don't fit in a single byte. That's because computer memory is conceptually a linear array of slots, each of which has a capacity of 8 bits (an octet, or byte). Values of data types whose representation requires more than 8 bits must therefore be spread out over several slots; and that's where the importance of the byte order comes in.
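
As a concrete illustration of that last point (added here as a sketch, not part of the original answer), the same four bytes from the question yield two different values depending on which byte-order convention you apply when reading them back:

    package main

    import (
        "encoding/binary"
        "fmt"
    )

    func main() {
        // The byte sequence from the question, in memory order.
        buf := []byte{0x12, 0x34, 0x56, 0x78}

        fmt.Printf("read as big-endian:    0x%08X\n", binary.BigEndian.Uint32(buf))    // 0x12345678
        fmt.Printf("read as little-endian: 0x%08X\n", binary.LittleEndian.Uint32(buf)) // 0x78563412
    }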
P.S.: Studying the Unicode character encodings UTF-16 and UTF-8 helped me build a deeper understanding of endianness.
While both encodings are for the exact same kind of data, endianness only plays a role in UTF-16, but not in UTF-8. How can that be?
UTF-16 streams typically carry a byte order mark (BOM), while UTF-8 doesn't need one. Why?
Once you understand the reasons, chances are you'll have a very good understanding of endianness issues.
It appears that your understanding of endianness is just fine.
Since there is more than one possible byte ordering for representing multi-byte data types' values in a linear address space, different CPU / computer manufacturers apparently chose different byte orderings in the past. Thus we have Big and Little Endian today (and perhaps other byte orderings that haven't got their own name).
Wikipedia has a good article on the matter, btw.

Resources