Why does utf-16 only support 2^20 code points? - utf-8

I'm starting to study Unicode and have several doubts. Right now I'm learning what a plane is: I saw that a plane is a set of 2^16 code points, and that the UTF-16 encoding supports 17 planes numbered from 0 to 16. My question is the following: if UTF-16 can use up to 32 bits, why does it in practice only encode about 2^20 code points? Where does the 20 come from? I know that if a code point requires more than 2 bytes, UTF-16 uses two 16-bit units, but how does that fit into all of this? In short, where does this 2^20 come from, and why not 2^32? Thanks :)

Have a look at how surrogate pairs encode a character U >= 0x10000:
U' = yyyyyyyyyyxxxxxxxxxx // U - 0x10000
W1 = 110110yyyyyyyyyy // 0xD800 + yyyyyyyyyy
W2 = 110111xxxxxxxxxx // 0xDC00 + xxxxxxxxxx
(source)
As you can see, from the 32 bits of the 2x16 surrogate pair, 2x6 = 12 bits are used "only" to convey the information that this is indeed a surrogate pair (and not simply two characters with a value < 0x10000). This leaves you with 32 - 12 = 20 bits to store U'.
(Technically, you additionally have some values for U < 0x10000, of which again some are reserved for low and high surrogates, which means you end up slightly above 2^20 codepoints which can be encoded by UTF-16 (but still well below 2^21), considering that the highest possible codepoint that is supported by UTF-16 is U+10FFFF and not 2^20 = 0x100000.)
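To make the bit layout above concrete, here is a minimal Python sketch of the surrogate-pair split and its inverse. The function names are mine, chosen for illustration; they are not from any library.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into (W1, W2)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000        # U': fits in 20 bits
    high = 0xD800 + (offset >> 10)       # W1: top 10 bits (the y bits)
    low = 0xDC00 + (offset & 0x3FF)      # W2: bottom 10 bits (the x bits)
    return high, low


def from_surrogate_pair(high, low):
    """Inverse of to_surrogate_pair."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# 2^16 BMP values plus 2^20 supplementary values give the upper bound U+10FFFF.
total_code_points = 2**16 + 2**20

print([hex(w) for w in to_surrogate_pair(0x1F600)])
print(hex(total_code_points - 1))
```

The 20 payload bits are exactly the two 10-bit halves carried by W1 and W2; everything else in the pair is the fixed 110110 / 110111 marker.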

The original form of Unicode only supported 64k code points (16 bits). The intention was to support all commonly used, modern characters, and 64k really is enough for that (yes, even including Chinese). As the introduction notes (emphasis mine):
Completeness. The coded character set would be large enough to encompass all characters that were likely to be used in general text interchange.
But Unicode grew to encompass almost all human writing, including historic and lesser-used writing systems, and 64k characters was too small to handle that. (Unicode 14 has ~145k characters.) As the Unicode 2.0 introduction says (again, emphasis mine):
The Unicode Standard, Version 2.0 contains 38,885 characters from the world's scripts. These characters are more than sufficient not only for modern communication, but also for the classical forms of many languages.
In Unicode 1.x, the typical encoding was UCS-2, which is just a simple 16-bit number defining the code-point. When they decided that they were going to need more (during the Unicode 1.1 timeframe), there were only ~34k code points assigned.
Originally the thought was to create a 32-bit encoding (UCS-4) that could encode 2^31 values with one bit left-over, but this would have doubled the size of encoding, wasting a lot of space, and wouldn't have been backward compatible with UCS-2.
So they decided for Unicode 2.0 to invent a system backward-compatible with all defined UCS-2 code points, but that allowed them to scale larger. That's why they invented the surrogate pair system (which LMD's answer explains well). This created the UTF-16 encoding which completely replaces UCS-2.
The full thinking on how much space was needed for various areas is explained in the Unicode 2.0 Introduction:
There are over 18,000 unassigned code positions that are available for future allocation. This number far exceeds anticipated character coding requirements for modern and most archaic characters.
One million additional characters are accessible through the surrogate extension mechanism.... This number far exceeds anticipated encoding requirements for all world characters and symbols.
The goal was to keep "common" characters in the Basic Multilingual Plane (BMP), and to place lesser-used characters into the surrogate extension area.
The surrogate system "wastes" a lot of code points that could be used for real characters. You could imagine replacing it with a more naïve system with a single "the next character is in the surrogate space" code point. But that would create ambiguity between byte sequences. You couldn't just search for 0x0041 to find the letter A. You'd have to scan backwards to make sure it wasn't a surrogate character, making certain kinds of problems much harder.
That design choice has been pretty solid. In 20 years, with steady additions of more and more obscure scripts and characters, we've used less than 15% of the available space. We definitely didn't need another 10 bits.

Thinking in terms of multiples and powers of 4 helps a lot with understanding UTF-8 and UTF-16:
BMP/ASCII start : = 0
Supp plane start : 4 ^ ( 4 + 4 ) = 65,536
Size of BMP : 4 ^ ( 4 + 4 ) = 65,536 ( 4 ^ 8 )
Size of Supp plane : 4 * 4 * 4 ^ ( 4 + 4 ) = 1,048,576 ( 4 ^ 10 )
————————————————————————————————————————————————————————
Unicode (unadj) ( 4*4 + 4^4 ) * ( 4 + 4 )^4
= 4^8 + 4^10
= 1,114,112
UTF-8
2-byte UTF-8 start : 4 * 4 * ( 4 + 4 ) = 128
3-byte UTF-8 start : ( 4 ^ 4 ) * ( 4 + 4 ) = 2,048
4-byte UTF-8 start : 4 ^ ( 4 + 4 ) = 65,536
UTF-8 Multi-byte scale factors
trailing x 1 : 4 ^ 3 = 4 * ( 4 ) * 4 = 64
trailing x 2 : 4 ^ 6 = ( 4 + 4 ) ^ 4 = 4,096
trailing x 3 : 4 ^ 9 = 4 ^ ( 4 + 4 ) * 4 = 262,144
UTF-16
Hi surrogate start : ( 4 ^ 5 ) * 54 = 55,296 ( 0xD800 )
per surrogate width : ( 4 ^ 5 ) = 1,024 ( 0x 400 )
Lo surrogate start : ( 4 ^ 5 ) * 55 = 56,320 ( 0xDC00 )
Total surr. combos : ( 4 ^ 5 ) * ( 4 ^ 5 ) = 1,048,576 ( 4 ^ 10 )
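As a quick sanity check of the UTF-8 start thresholds in the table above, a small Python sketch can compare them with what Python's own encoder produces. The helper name utf8_length is made up for this example.

```python
def utf8_length(code_point):
    """Bytes UTF-8 needs for a code point, using the start thresholds above."""
    if code_point < 4 * 4 * (4 + 4):       # 128: 2-byte forms start here
        return 1
    if code_point < (4 ** 4) * (4 + 4):    # 2,048: 3-byte forms start here
        return 2
    if code_point < 4 ** (4 + 4):          # 65,536: 4-byte forms start here
        return 3
    return 4


# Boundary values on both sides of each threshold (surrogates are avoided,
# since they cannot be encoded on their own).
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    print(hex(cp), utf8_length(cp), len(chr(cp).encode("utf-8")))
```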

Related

hpack encoding integer significance

After reading this, https://httpwg.org/specs/rfc7541.html#integer.representation
I am confused about quite a few things, although I seem to have the overall gist of the idea.
For one, What are the 'prefixes' exactly/what is their purpose?
For two:
C.1.1. Example 1: Encoding 10 Using a 5-Bit Prefix
The value 10 is to be encoded with a 5-bit prefix.
10 is less than 31 (2^5 - 1) and is represented using the 5-bit prefix.
0 1 2 3 4 5 6 7
+---+---+---+---+---+---+---+---+
| X | X | X | 0 | 1 | 0 | 1 | 0 | 10 stored on 5 bits
+---+---+---+---+---+---+---+---+
What are the leading Xs? What is the starting 0 for?
>>> bin(10)
'0b1010'
>>>
Typing this in the python IDE, you see almost the same output... Why does it differ?
This is when the number fits within the number of prefix bits though, making it seemingly simple.
C.1.2. Example 2: Encoding 1337 Using a 5-Bit Prefix
The value I=1337 is to be encoded with a 5-bit prefix.
1337 is greater than 31 (2^5 - 1).
The 5-bit prefix is filled with its max value (31).
I = 1337 - (2^5 - 1) = 1306.
I (1306) is greater than or equal to 128, so the while loop body executes:
I % 128 == 26
26 + 128 == 154
154 is encoded in 8 bits as: 10011010
I is set to 10 (1306 / 128 == 10)
I is no longer greater than or equal to 128, so the while loop terminates.
I, now 10, is encoded in 8 bits as: 00001010.
The process ends.
0 1 2 3 4 5 6 7
+---+---+---+---+---+---+---+---+
| X | X | X | 1 | 1 | 1 | 1 | 1 | Prefix = 31, I = 1306
| 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1306>=128, encode(154), I=1306/128
| 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 10<128, encode(10), done
+---+---+---+---+---+---+---+---+
The octet-like diagram shows three different numbers being produced... Since the numbers are produced throughout the loop, how do you replicate this octet-like diagram within an integer? What is the actual final result? The diagram or "I" being 10, or 00001010.
def f(a, b):
    if a < 2**b - 1:
        print(a)
    else:
        c = 2**b - 1
        remain = a - c
        print(c)
        if remain >= 128:
            while 1:
                e = remain % 128
                g = e + 128
                remain = remain / 128
                if remain >= 128:
                    continue
                else:
                    print(remain)
                    c+=int(remain)
                    print(c)
                    break
As im trying to figure this out, I wrote a quick python implementation of it, It seems that i am left with a few useless variables, one being g which in the documentation is the 26 + 128 == 154.
Lastly, where does 128 come from? I can't find any relation between the numbers besides the fact 2 raised to the 7th power is 128, but why is that significant? Is this because the first bit is reserved as a continuation flag? and an octet contains 8 bits so 8 - 1 = 7?
For one, What are the 'prefixes' exactly/what is their purpose?
Integers are used in a few places in HPACK messages, and often they have leading bits that cannot be used for the actual integer. Those unavailable leading bits are represented by the X. For the purposes of this calculation it doesn't matter what those Xs are: they could be 000, or 111, or 010, etc. Also, there will not always be 3 Xs; that is just an example. There could be only one leading X, or two, or four, etc.
For example, to look up a previous HPACK decoded header, we use 6.1. Indexed Header Field Representation, which starts with a leading 1, followed by the table index value. Therefore that 1 is the X in the previous example. We have 7 bits (instead of only 5 bits in the original example in your question). If the table index value is 126 or less we can represent it using those 7 bits. If it's >= 127 then we need to do some extra work (we'll come back to this).
If it's a new value we want to add to the table (to reuse in future requests), but we already have that header name in the table (so it's just a new value for that name we want as a new entry) then we use 6.2.1. Literal Header Field with Incremental Indexing. This has 2 bits at the beginning (01 - which are the Xs), and we only have 6-bits this time to represent the index of the name we want to reuse. So in this case there are two Xs.
So don't worry about there being 3 Xs - that's just an example. In the above examples there was one X (as first bit had to be 1), and two Xs (as first two bits had to be 01) respectively. The Integer Representation section is telling you how to handle any prefixed integer, whether prefixed by 1, 2, 3... etc unusable "X" bits.
What are the leading Xs? What is the starting 0 for?
The leading Xs are discussed above. The starting 0 is just because, in this example we have 5-bits to represent the integers and only need 4-bits. So we pad it with 0. If the value to encode was 20 it would be 10100. If the value was 40, we couldn't fit it in 5-bits so need to do something else.
Typing this in the python IDE, you see almost the same output... Why does it differ?
Python uses 0b to show it's a binary number. It doesn't bother showing any leading zeros. So 0b1010 is the same as 0b01010 and also the same as 0b00001010.
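If you want Python to show the padding the diagram shows, you can ask for a fixed width. A tiny sketch using the built-in format:

```python
value = 10

print(bin(value))             # marker prefix, no leading zeros
print(format(value, "05b"))   # padded to the 5-bit prefix width
print(format(value, "08b"))   # padded to a full octet
```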
This is when the number fits within the number of prefix bits though, making it seemingly simple.
Exactly. If you need more than the number of bits you have, you don't have space for it. You can't just use more bits as HPACK will not know whether you are intending to use more bits (so should look at next byte) or if it's just a straight number (so only look at this one byte). It needs a signal to know that. That signal is using all 1s.
So to encode 40 in 5 bits, we need to use 11111 to say "it's not big enough", overflow to next byte. 11111 in binary is 31, so we know it's bigger than that, so we'll not waste that, and instead use it, and subtract it from the 40 to give 9 left to encode in the next byte. A new additional byte gives us 8 new bits to play with (well actually only 7 as we'll soon discover, as the first bit is used to signal a further overflow). This is enough so we can use 00001001 to encode our 9. So our complex number is represented in two bytes: XXX11111 and 00001001.
If we want to encode a value bigger than can fit in the prefix bits of the first byte, AND the leftover is bigger than the 127 that would fit into the available 7 bits of the second byte, then we can't use this two-byte overflow mechanism. Instead we use another "overflow, overflow" mechanism using three bytes:
For this "overflow, overflow" mechanism, we set the first byte bits to 1s as usual for an overflow (XXX11111) and then set the first bit of the second byte to 1. This leaves 7 bits available to encode the value, plus the next 8 bits in the third byte we're going to have to use (actually only 7 bits of the third byte, because again it uses the first bit to indicate another overflow).
There are various ways they could have gone about this using the second and third bytes. What they decided to do was encode this as two numbers: the 128 mod, and the 128 multiplier.
1337 = 31 + (128 * 10) + 26
So that means the first byte is set to 31 as per the previous example, the second byte is set to 26 (which is 0011010 in 7 bits) plus the leading 1 to show we're using the overflow overflow method (so 10011010), and the third byte is set to 10 (or 00001010).
So 1337 is encoded in three bytes: XXX11111 10011010 00001010 (including setting X to whatever those values were).
Using 128 mod and multiplier is quite efficient and means this large number (and in fact any number whose leftover, after subtracting the prefix maximum, is up to 16,383) can be represented in three bytes. That limit is, not coincidentally, also the max integer that can be represented in 7 + 7 = 14 bits. But it does take a bit of getting your head around!
If the leftover is bigger than 16,383 then we need to do another round of overflow in a similar manner.
All this seems horrendously complex but is actually relatively simply, and efficiently, coded up. Computers can do this pretty easily and quickly.
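As a rough illustration of how little code this takes, here is a Python sketch that follows the encoding pseudocode in RFC 7541 section 5.1. The function name and the flags parameter (standing in for the "X" bits) are my own choices for the example, not anything defined by the spec.

```python
def encode_integer(value, prefix_bits, flags=0):
    """Encode a non-negative integer with an N-bit prefix.

    `flags` holds the leading "X" bits, already shifted into position.
    It defaults to 0 here purely for illustration.
    """
    max_prefix = (1 << prefix_bits) - 1
    if value < max_prefix:
        # Fits in the prefix: a single byte is enough.
        return bytes([flags | value])
    out = [flags | max_prefix]            # all 1s signals "overflow"
    value -= max_prefix
    while value >= 128:
        out.append(value % 128 + 128)     # low 7 bits plus continuation bit
        value //= 128                     # integer division, not "/"
    out.append(value)                     # last byte: continuation bit clear
    return bytes(out)


print(list(encode_integer(10, 5)))
print(list(encode_integer(40, 5)))
print(list(encode_integer(1337, 5)))
```

Note that every byte is appended as it is produced, including the 154. The final result is the whole byte sequence, not just the last value of I.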
It seems that i am left with a few useless variables, one being g
You are not printing this value in the if statement, only the leftover value in the else. You need to print both.
which in the documentation is the 26 + 128 == 154.
Lastly, where does 128 come from? I can't find any relation between the numbers besides the fact 2 raised to the 7th power is 128, but why is that significant? Is this because the first bit is reserved as a continuation flag? and an octet contains 8 bits so 8 - 1 = 7?
Exactly, it's because the first bit (value 128) needs to be set as per explanation above, to show we are continuing/overflowing into needing a third byte.
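Going the other way may also help show what the continuation bit is for. Below is a minimal decoding sketch in Python, again modelled on the pseudocode in RFC 7541 section 5.1. The function name is hypothetical, and it assumes the input is complete (a truncated buffer would raise IndexError).

```python
def decode_integer(data, prefix_bits):
    """Decode a prefixed integer from the start of `data`.

    Returns (value, bytes_consumed). The leading "X" bits are masked off.
    """
    max_prefix = (1 << prefix_bits) - 1
    value = data[0] & max_prefix
    if value < max_prefix:
        return value, 1                    # it fit in the prefix
    shift = 0
    index = 1
    while True:
        octet = data[index]
        value += (octet & 0x7F) << shift   # add the 7 payload bits
        shift += 7
        index += 1
        if not octet & 0x80:               # top bit clear: last byte
            return value, index


print(decode_integer(bytes([0b11111111, 0b10011010, 0b00001010]), 5))
```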

What is the offset size in FlatBuffers?

The major part of our data is strings with possible substring duplication (e.g. domains: "some.thing.com" and "thing.com"). We'd like to reuse the substrings to reduce file size and memory consumption with FlatBuffers, so I'm planning to use [string], as I can just reference existing substrings. E.g. thing.com will be just a string created with let substr_offset = builder.create_string("thing.com"), and "some.thing.com" will be stored as [builder.create_string("some."), substr_offset].
However, it seems referencing has its costs, so there is probably no benefit in referencing if the string is too short (shorter than the offset variable size). Is that correct? Is the offset type just usize? What are better alternatives for prefix/postfix string representations with FlatBuffers?
PS. BTW, what does a string array cost compared with just a string? Is it just one more offset?
Both strings and vectors are addressed over a 32-bit offset to them, and also have a 32-bit size field prefixed. So:
"some.thing.com" 14 chars + 1 terminator + 4 size bytes == 19.
Or:
"thing.com" 9 chars + 1 terminator + 4 size bytes == 14.
"some." 5 chars + 1 terminator + 4 size bytes == 10.
vector of 2 strings: 2x4 bytes of offsets + 4 size bytes = 12.
total: 36
of those 36, 14 are shared, leaving 22 bytes of unique data which is larger than the original. So the shared string needs to be 13 bytes or larger for this technique to be worth it, assuming it is shared often.
For details: https://google.github.io/flatbuffers/flatbuffers_internals.html
The offset seems to be uint32 (4 bytes, not usize) according to the doc.
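The size arithmetic from the first answer can be reproduced with a short Python sketch. This only models the byte counts quoted there (4-byte size field, 4-byte offset, 1-byte terminator) and deliberately ignores alignment padding; the helper names are invented for the example and are not part of the FlatBuffers API.

```python
SIZE_FIELD = 4    # 32-bit length prefix on strings and vectors
OFFSET = 4        # 32-bit offset used to reference a string
TERMINATOR = 1    # trailing NUL on every string


def string_bytes(length):
    return length + TERMINATOR + SIZE_FIELD


def vector_of_strings_bytes(count):
    return count * OFFSET + SIZE_FIELD


def unique_bytes_when_split(prefix_len):
    """Bytes that are NOT shared when a string is stored as [prefix, shared]."""
    return string_bytes(prefix_len) + vector_of_strings_bytes(2)


def min_shared_length(prefix_len):
    """Smallest shared-part length for which splitting beats one plain string."""
    n = 0
    while unique_bytes_when_split(prefix_len) >= string_bytes(prefix_len + n):
        n += 1
    return n


print(string_bytes(14))              # plain "some.thing.com"
print(unique_bytes_when_split(5))    # "some." plus the 2-element vector
print(min_shared_length(5))
```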

Direct mapped cache example

I am really confused on the topic of direct mapped caches. I've been looking around for an example with a good explanation, and it's making me more confused than ever.
For example: I have
2048 byte memory
64 byte big cache
8 byte cache lines
With a direct mapped cache, how do I determine the 'LINE', 'TAG' and 'Byte offset'?
I believe that the total number of addressing bits is 11, because 2048 = 2^11
2048/64 = 2^5 = 32 blocks (0 to 31) (5bits needed) (tag)
64/8 = 8 = 2^3 = 3 bits for the index
8 byte cache lines = 2^3 which means i need 3 bits for the byte offset
so the address would be like this: 5 bits for the tag, 3 for the index and 3 for the byte offset
Do I have this figured out correctly?
Do I have this figured out correctly? YES
Explanation
1) Main memory size is 2048 bytes = 2^11. So you need 11 bits to address a byte (if your word size is 1 byte) [word = smallest individual unit that will be accessed with the address].
2) You can calculate the tag bits in direct mapping by taking log2 of (main memory size / cache size). But I will explain a little more about tag bits.
Here the size of a cache line (which is always the same as the size of a main memory block) is 8 bytes, which is 2^3 bytes. So you need 3 bits to represent a byte within a cache line. Now you have 8 bits (11 - 3) remaining in the address.
Now the total number of lines present in the cache is (cache size / line size) = 2^6 / 2^3 = 2^3.
So you have 3 bits to represent the line in which your required byte is present.
The number of remaining bits is now 5 (8 - 3).
These 5 bits can be used to represent a tag. :)
3) 3 bits for the index. If by index you mean the number of bits needed to represent a line, then yes, you are right.
4) 3 bits will be used to access a byte within a cache line. (8 = 2^3)
So,
11 bits total address length = 5 tag bits + 3 bits to represent a line + 3 bits to represent a byte (word) within a line
Hope there is no confusion now.
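The same breakdown can be sketched in Python. This assumes byte addressing and power-of-two sizes, as in the question; the function names are made up for illustration.

```python
def field_widths(memory_bytes, cache_bytes, line_bytes):
    """Return (tag_bits, index_bits, offset_bits); sizes must be powers of two."""
    address_bits = memory_bytes.bit_length() - 1               # log2(memory size)
    offset_bits = line_bytes.bit_length() - 1                  # log2(line size)
    index_bits = (cache_bytes // line_bytes).bit_length() - 1  # log2(number of lines)
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits


def split_address(address, index_bits, offset_bits):
    """Split an address into (tag, line index, byte offset)."""
    offset = address & ((1 << offset_bits) - 1)
    index = (address >> offset_bits) & ((1 << index_bits) - 1)
    tag = address >> (offset_bits + index_bits)
    return tag, index, offset


tag_bits, index_bits, offset_bits = field_widths(2048, 64, 8)
print(tag_bits, index_bits, offset_bits)
# Example address 0b10110_101_011: tag 10110, line 101, byte 011.
print(split_address(0b10110101011, index_bits, offset_bits))
```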

Compression of Integers

how can I compress a row of integers into something shorter?
E.g. Input: '1 2 4 9 8 5 2 7 6 2 3 4' -> Algorithm -> Output: 'X Y Z'
and can get it back the other way around? ('X Y Z' -> '1 2 4 9 8 5 2 7 6 2 3 4')
Input contains 12 digits max, numbers only. Output can be alphanumeric and should be 3-4 digits max.
Thanks in advance.
Edit: Each input digit 0-9; Output 0-9a-Z
Unless your input comes from a specific domain, where many inputs are unlikely/unacceptable - you cannot do it.
You can encode 62^4 ~= 1.4*10^7 different series with 4 alphanumeric characters.
On the other hand, an input of 12 digits can have 10^12 possible different values.
From the pigeonhole principle, there must be 2 inputs that are mapped to the same "compression".
But since you need to recreate the original sequence, you cannot differentiate the two inputs behind identical compressions.
So such a compression does not exist.
In fact, to compress a 12 digits number into 4 characters, you are going to need the alphabet for the characters to be of size 1000:
x^4 = 10^12, x>0
x = 1000
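The counting argument above is easy to check with exact integer arithmetic. A small Python sketch (min_symbols is a name invented for this example):

```python
def min_symbols(num_inputs, alphabet_size):
    """Smallest k such that alphabet_size**k >= num_inputs."""
    k = 0
    capacity = 1
    while capacity < num_inputs:
        capacity *= alphabet_size
        k += 1
    return k


inputs = 10 ** 12                      # all 12-digit sequences
print(62 ** 4)                         # outputs available with 4 characters
print(min_symbols(inputs, 62))         # characters needed with 0-9a-zA-Z
print(min_symbols(inputs, 1000))       # alphabet of size 1000
```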
First, you could just use any existing compression algorithm, via some library. However, knowing that your input is very specialized, you can also write a special algorithm adapted to your case.
But let's first analyze how much you can compress this input. To simplify, I'll first consider compressing exactly 12 digits from 0 to 9 (you didn't explicitly write what the input range is however). There are 10^12 possible combinations, which is a little less than 2^40. So what you basically want to do is compress 40 bits.
Let's now analyze how you can compress these 40 bits. If you understand alphanumeric as [0-9A-Z], you have 36 characters available. Each character can encode log_2(36)=5.1 bits. Encoding your 40 bits therefore needs 8 alphanumeric characters.
A better alternative would be to use base64. Here, you have 64 characters, which means each character can encode 6 bits, so you could encode your input with only 40/6=6.666 => 7 characters.
If you consider compressing your input to binary, you will obviously need 40 bits. This can be written with 5 8-bit ASCII characters, with 2 32-bit integers or with 1 64-bit integer. However this probably isn't what you want to achieve.
Conclusion: you can't compress data arbitrarily much and the data that you want to compress can't be compressed as much as you like to compress it.
As an example, to encode 12 digits from 0 to 9 into ASCII characters, you could simply convert them into one big number, convert it to binary, then take this binary number in portions of 8 bits and convert them to ASCII characters.
Example:
Input: 1 2 4 9 8 5 2 7 6 2 3 4
One number: 124985276234
Binary: 1110100011001101100111111011101001010
Grouped: 11101 00011001 10110011 11110111 01001010
ASCII: <GS><EM><0xB3><0xF7>J
Note that some ASCII-symbols are not printable. If that is important to you, you'll have to use an alternative encoding, as for example base 64, which only has 64 different characters, but they are all printable.
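Here is a hedged Python sketch of both variants: the raw 5-byte form from the example above, and a printable form. For the printable form I use base 62 (0-9a-zA-Z, matching the question's edit) rather than base 64; that is my substitution, and the helper names are hypothetical. It assumes a fixed input width of 12 digits so leading zeros can be restored.

```python
import string

ALPHABET = string.digits + string.ascii_letters    # 62 printable symbols


def digits_to_text(digits):
    """Pack a string of decimal digits into base-62 text."""
    number = int(digits)
    if number == 0:
        return ALPHABET[0]
    out = []
    while number:
        number, remainder = divmod(number, 62)
        out.append(ALPHABET[remainder])
    return "".join(reversed(out))


def text_to_digits(text, width=12):
    """Inverse of digits_to_text; pads back to the assumed fixed width."""
    number = 0
    for ch in text:
        number = number * 62 + ALPHABET.index(ch)
    return str(number).zfill(width)


digits = "124985276234"
raw = int(digits).to_bytes(5, "big")   # the 5 bytes from the example above
print(raw.hex())
packed = digits_to_text(digits)
print(packed, text_to_digits(packed))
```

Even in the best case this needs 7 printable characters, so the 3-4 character target from the question stays out of reach.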
Similar discussion
Compressing a set of large integers
PHP Compress array of bits into shortest string possible
$val = pack('H*', "124985276234");
echo '#'. $val . '#';
print_r(unpack('H*', $val));
die;
#Issue
00011001 => 25
11001 => 25
I tried to implement @Misch's algorithm in PHP, but some bits were wrong when I used decbin, which gave bad results when unpacking. Then I found the pack function, which works similarly. But the numbers from 0 to 9 come back wrong when unpacking, and in a test over 9,000,000 values, 8,090,899 were decompressed to a wrong value, although no collision was found.
set_time_limit(0);
ini_set('memory_limit', '5000M');
ini_set("max_execution_time",0);
$collision = [];
$err = [];
for ($i=0; $i < 9000000; $i++) {
    $packed = pack('H*', $i);
    $unpacked = unpack('H*', $packed)[1];
    if ( array_key_exists($i, $collision) ) {
        die("Collision:". $i .' !!!!'. $packed .'!!!!'. $unpacked);
    }
    if ( $i != $unpacked ) {
        $e = "Collision2:". $i .' !!!!'. $packed .'!!!!'. $unpacked . "\n";
        #echo $e;
        $err[] = $e;
    }
    $collision[] = $packed;
    #echo '#'. $i .'#' . $unpacked . '#' . $unpacked . "#\n";
}

Consolidate 10 bit Value into a Unique Byte

As part of an algorithm I'm writing, I need to find a way to convert a 10-bit word into a unique 8-bit word. The 10-bit word is made up of 5 pairs, where each pair can only ever equal 0, 1 or 2 (never 3). For example:
|00|10|00|01|10|
This value needs to somehow be consolidated into a single, unique byte.
As each pair can never equal 3, there are a wide range of values that this 10-bit word will never represent, which makes me think that it is possible to create an algorithm to perform this conversion. The simplest way to do this would be to use a lookup table, but it seems like a waste of resources to store ~680 values which will only be used once in my program. I've already tried to incorporate one of the pairs into the others somehow, but every attempt I've made has resulted in a non-unique value, and I'm now very quickly running out of ideas!
Any help?
The number you have is essentially base 3. You just need to convert this to base 2.
There are 5 pairs, so 3^5 = 243 numbers. And 8 bits is 2^8 = 256 numbers, so it's possible.
The simplest way to convert between bases is to go to base 10 first.
So, for your example:
00|10|00|01|10
Base 3: 02012
Base 10: 2*3^3 + 1*3^1 + 2*3^0
= 54 + 3 + 2
= 59
Base 2:
59 % 2 = 1
/2 29 % 2 = 1
/2 14 % 2 = 0
/2 7 % 2 = 1
/2 3 % 2 = 1
/2 1 % 2 = 1
So 111011 is your number in binary
This explains the above process in a bit more detail.
Note that once you have 59 above stored in a 1-byte integer, you'll probably already have what you want, thus explicitly converting to base 2 might not be necessary.
What you basically have is a base 3 number and you want to convert this to a single number 0 - 255, luckily 5 digits in ternary (base 3) gives 243 combinations.
What you'll need to do is:
Digit Action
( 1st x 3^4)
+ (2nd x 3^3)
+ (3rd x 3^2)
+ (4th x 3)
+ (5th)
This will give you a number 0 to 242.
You want to store some information in a byte. A byte can hold at most 2 ^ 8 = 256 states.
Your total number of states is 3 ^ 5 = 243 < 256. That makes the conversion possible.
Say your pairs are ABCDE (each character can be 0, 1 or 2).
You can just calculate A*3^4 + B*3^3 + C*3^2 + D*3 + E as your result. The result is guaranteed to be in the range 0 to 242, so it fits in a byte.
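The base-3 idea from the answers above can be sketched in Python, working directly on the 10-bit word and including the reverse direction. Function names are mine, for illustration only.

```python
def pack_pairs(word10):
    """Map a 10-bit word made of five 2-bit pairs (each 0..2) to 0..242."""
    value = 0
    for shift in (8, 6, 4, 2, 0):          # most significant pair first
        pair = (word10 >> shift) & 0b11
        if pair == 3:
            raise ValueError("a pair may only be 0, 1 or 2")
        value = value * 3 + pair           # Horner form of A*3^4 + ... + E
    return value


def unpack_pairs(value):
    """Inverse of pack_pairs: rebuild the 10-bit word from 0..242."""
    word10 = 0
    for shift in (0, 2, 4, 6, 8):          # least significant pair first
        value, pair = divmod(value, 3)
        word10 |= pair << shift
    return word10


word = 0b0010000110                        # |00|10|00|01|10|
print(pack_pairs(word))
print(format(unpack_pairs(pack_pairs(word)), "010b"))
```

No lookup table is needed in either direction; each conversion is five small multiply or divide steps.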
