I don't understand how "xa" gets converted to 10 in Pascal. I just call:
Val('xa',value,return);
The result is value = 10 and return = 0. I'm a newbie; can anybody explain this? I know it isn't the ASCII code, because that applies only to a single character.
And I'm using Free Pascal :)
I tested xa, 0xa and $xa in Free Pascal. So I think it understands special characters like "$" and "0" without being told to. Is that right?
Since early versions of Delphi, the core integer conversion routines accept not just plain digit sequences but also some special forms, such as Pascal-style "$924" for hex or C-style "0x02".
Free Pascal adopted this when it later started adding Delphi compatibility (roughly 1997-2003). Besides this, another difference is that the last parameter (RETURN in your example) changed from WORD (in Turbo Pascal) to integer/longint in Delphi.
IOW, the routine accepts the x and thinks you mean to convert a C style hex number, and then interprets the "a" according to Stuart's table.
It also interprets % as binary, and & as octal.
Try
val('$10',value,return);
writeln(value,' ' ,return); // 16 0
val('&10',value,return);
writeln(value,' ' ,return); // 8 0
val('%10',value,return);
writeln(value,' ' ,return); // 2 0
and compare the results.
Note that this probably won't work for very old Pascals like Turbo Pascal, or for Free Pascal versions from before the year 2000. The % and & are FPC specific, to match the literal notation extensions (analogous to $, but for binary and octal):
var x : Integer;
begin
  x:=%101010; //42
  x:=&101; //65
end.
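To make the prefix handling concrete, here is a rough Python sketch of the idea. It is not FPC's actual implementation: the function name `val_sketch` is made up, it only covers the prefixes mentioned above ($, x, 0x, %, &), it ignores signs, and the error code convention (0 for success, otherwise the 1-based position of the first bad character) is an assumption modelled on how Val reports errors.

```python
def val_sketch(text):
    """Return (value, code); code 0 means success (hypothetical helper)."""
    # Prefix table taken from the thread: hex, binary and octal markers.
    prefixes = [("0x", 16), ("0X", 16), ("$", 16), ("x", 16), ("X", 16),
                ("%", 2), ("&", 8)]
    base, start = 10, 0
    for prefix, prefix_base in prefixes:
        if text.startswith(prefix):
            base, start = prefix_base, len(prefix)
            break
    digits = "0123456789abcdef"[:base]      # digits that are legal in this base
    if start == len(text):
        return 0, len(text) + 1             # nothing to convert
    value = 0
    for pos in range(start, len(text)):
        d = digits.find(text[pos].lower())
        if d < 0:
            return 0, pos + 1               # position of the offending character
        value = value * base + d
    return value, 0


print(val_sketch("xa"))    # hex digit a
print(val_sketch("$10"))   # hex
print(val_sketch("&10"))   # octal
print(val_sketch("%10"))   # binary
```

So in 'xa' the x only selects base 16, and the a is then looked up as the digit with value 10.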
This won't work with all Pascal compilers, and you didn't say what Pascal compiler you are using, but it looks like the x in 'xa' says that this is a hexadecimal (base 16) number. The values of the digits in a hexadecimal number are as follows:
Digit Value
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
a 10
A 10
b 11
B 11
c 12
C 12
d 13
D 13
e 14
E 14
f 15
F 15
After reading this, https://httpwg.org/specs/rfc7541.html#integer.representation
I am confused about quite a few things, although I seem to have the overall gist of the idea.
For one, What are the 'prefixes' exactly/what is their purpose?
For two:
C.1.1. Example 1: Encoding 10 Using a 5-Bit Prefix
The value 10 is to be encoded with a 5-bit prefix.
10 is less than 31 (2^5 - 1) and is represented using the 5-bit prefix.
0 1 2 3 4 5 6 7
+---+---+---+---+---+---+---+---+
| X | X | X | 0 | 1 | 0 | 1 | 0 | 10 stored on 5 bits
+---+---+---+---+---+---+---+---+
What are the leading Xs? What is the starting 0 for?
>>> bin(10)
'0b1010'
>>>
Typing this in the python IDE, you see almost the same output... Why does it differ?
This is when the number fits within the number of prefix bits though, making it seemingly simple.
C.1.2. Example 2: Encoding 1337 Using a 5-Bit Prefix
The value I=1337 is to be encoded with a 5-bit prefix.
1337 is greater than 31 (2^5 - 1).
The 5-bit prefix is filled with its max value (31).
I = 1337 - (2^5 - 1) = 1306.
I (1306) is greater than or equal to 128, so the while loop body executes:
I % 128 == 26
26 + 128 == 154
154 is encoded in 8 bits as: 10011010
I is set to 10 (1306 / 128 == 10)
I is no longer greater than or equal to 128, so the while loop terminates.
I, now 10, is encoded in 8 bits as: 00001010.
The process ends.
0 1 2 3 4 5 6 7
+---+---+---+---+---+---+---+---+
| X | X | X | 1 | 1 | 1 | 1 | 1 | Prefix = 31, I = 1306
| 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1306>=128, encode(154), I=1306/128
| 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 10<128, encode(10), done
+---+---+---+---+---+---+---+---+
The octet-like diagram shows three different numbers being produced... Since the numbers are produced throughout the loop, how do you replicate this octet-like diagram within an integer? What is the actual final result? The diagram or "I" being 10, or 00001010.
def f(a, b):
    if a < 2**b - 1:
        print(a)
    else:
        c = 2**b - 1
        remain = a - c
        print(c)
        if remain >= 128:
            while 1:
                e = remain % 128
                g = e + 128
                remain = remain / 128
                if remain >= 128:
                    continue
                else:
                    print(remain)
                    c+=int(remain)
                    print(c)
                    break
As I'm trying to figure this out, I wrote a quick Python implementation of it. It seems that I am left with a few useless variables, one being g, which in the documentation is the 26 + 128 == 154.
Lastly, where does 128 come from? I can't find any relation between the numbers besides the fact 2 raised to the 7th power is 128, but why is that significant? Is this because the first bit is reserved as a continuation flag? and an octet contains 8 bits so 8 - 1 = 7?
For one, What are the 'prefixes' exactly/what is their purpose?
Integers are used in a few places in HPACK messages, and often they have leading bits that cannot be used for the actual integer. Therefore, there will often be a few leading bits that are unavailable to the integer itself. They are represented by the X. For the purposes of this calculation it doesn't matter what those Xs are: could be 000, or 111, or 010 or...etc. Also, there will not always be 3 Xs - that is just an example. There could only be one leading X, or two, or four...etc.
For example, to look up a previous HPACK decoded header, we use 6.1. Indexed Header Field Representation which starts with a leading 1, followed by the table index value. Therefore that 1 is the X in the previous example. We have 7 bits (instead of only 5 bits in the original example in your question). If the table index value is 126 or less we can represent it using those 7 bits. If it's >= 127 then we need to do some extra work (we'll come back to this).
If it's a new value we want to add to the table (to reuse in future requests), but we already have that header name in the table (so it's just a new value for that name we want as a new entry) then we use 6.2.1. Literal Header Field with Incremental Indexing. This has 2 bits at the beginning (01 - which are the Xs), and we only have 6-bits this time to represent the index of the name we want to reuse. So in this case there are two Xs.
So don't worry about there being 3 Xs - that's just an example. In the above examples there was one X (as first bit had to be 1), and two Xs (as first two bits had to be 01) respectively. The Integer Representation section is telling you how to handle any prefixed integer, whether prefixed by 1, 2, 3... etc unusable "X" bits.
What are the leading Xs? What is the starting 0 for?
The leading Xs are discussed above. The starting 0 is just because, in this example we have 5-bits to represent the integers and only need 4-bits. So we pad it with 0. If the value to encode was 20 it would be 10100. If the value was 40, we couldn't fit it in 5-bits so need to do something else.
Typing this in the python IDE, you see almost the same output... Why does it differ?
Python uses 0b to show it's a binary number. It doesn't bother showing any leading zeros. So 0b1010 is the same as 0b01010 and also the same as 0b00001010.
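If you want to see the padded form in Python, format the number to a fixed width. A small sketch (the helper name `bits` is made up for illustration):

```python
def bits(value, width):
    """Render value in binary, left-padded with zeros to `width` digits."""
    return format(value, "0{}b".format(width))


print(bin(10))       # Python's default form: 0b marker, no leading zeros
print(bits(10, 5))   # the value as it sits in a 5-bit prefix
print(bits(10, 8))   # the same value in a full octet
```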
This is when the number fits within the number of prefix bits though, making it seemingly simple.
Exactly. If you need more than the number of bits you have, you don't have space for it. You can't just use more bits as HPACK will not know whether you are intending to use more bits (so should look at next byte) or if it's just a straight number (so only look at this one byte). It needs a signal to know that. That signal is using all 1s.
So to encode 40 in 5 bits, we need to use 11111 to say "it's not big enough", overflow to next byte. 11111 in binary is 31, so we know it's bigger than that, so we'll not waste that, and instead use it, and subtract it from the 40 to give 9 left to encode in the next byte. A new additional byte gives us 8 new bits to play with (well actually only 7 as we'll soon discover, as the first bit is used to signal a further overflow). This is enough so we can use 00001001 to encode our 9. So our complex number is represented in two bytes: XXX11111 and 00001001.
If we want to encode a value bigger than can fit in the prefix bits of the first byte, AND the leftover is bigger than the 127 that would fit into the available 7 bits of the second byte, then we can't use this overflow mechanism using two bytes. Instead we use another "overflow, overflow" mechanism using three bytes:
For this "overflow, overflow" mechanism, we set the first byte bits to 1s as usual for an overflow (XXX11111) and then set the first bit of the second byte to 1. This leaves 7 bits available to encode the value, plus the next 8 bits in the third byte we're going to have to use (actually only 7 bits of the third byte, because again it uses the first bit to indicate another overflow).
There are various ways they could have gone about this using the second and third bytes. What they decided to do was encode this as two numbers: the 128 mod, and the 128 multiplier.
1337 = 31 + (128 * 10) + 26
So that means the first byte is set to 31 as per the previous example, the second byte is set to 26 (which is 0011010 in 7 bits) plus the leading 1 to show we're using the overflow overflow method (so 10011010), and the third byte is set to 10 (or 00001010).
So 1337 is encoded in three bytes: XXX11111 10011010 00001010 (with each X set to whatever those bits were).
Using 128 mod and multiplier is quite efficient and means this large number (and in fact any leftover value up to 16,383) can be represented in three bytes. That is, not coincidentally, also the max integer that can be represented in 7 + 7 = 14 bits. But it does take a bit of getting your head around!
If the leftover is bigger than 16,383 then we need to do another round of overflow in a similar manner.
All this seems horrendously complex but is actually relatively simply, and efficiently, coded up. Computers can do this pretty easily and quickly.
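To show how little code it takes, here is a minimal Python sketch of the scheme described above, following the pseudocode in RFC 7541 section 5.1. The function names and the `flags` parameter (the X bits, already shifted into place) are my own choices for illustration, not anything HPACK defines.

```python
def encode_int(value, prefix_bits, flags=0):
    """Encode value as an HPACK prefixed integer; returns the bytes."""
    limit = (1 << prefix_bits) - 1        # all prefix bits set, e.g. 31 for 5 bits
    if value < limit:
        return bytes([flags | value])     # fits in the prefix
    out = [flags | limit]                 # prefix of all 1s signals overflow
    value -= limit
    while value >= 128:
        out.append(value % 128 + 128)     # low 7 bits plus continuation flag
        value //= 128
    out.append(value)                     # final byte, top bit clear
    return bytes(out)


def decode_int(data, prefix_bits):
    """Decode a prefixed integer from the start of data: (value, bytes_used)."""
    limit = (1 << prefix_bits) - 1
    value = data[0] & limit               # mask off the X bits
    if value < limit:
        return value, 1
    shift = 0
    for used, byte in enumerate(data[1:], start=2):
        value += (byte & 0x7F) << shift
        shift += 7
        if (byte & 0x80) == 0:            # continuation flag clear: done
            return value, used
    raise ValueError("truncated integer")


print(list(encode_int(10, 5)))     # one byte
print(list(encode_int(40, 5)))     # two bytes
print(list(encode_int(1337, 5)))   # three bytes
```

The final result is the whole byte sequence, not just the last value of I: for 1337 that is the three bytes from the diagram.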
It seems that i am left with a few useless variables, one being g
You are not printing this value in the if branch, only the leftover value in the else. You need to print both.
which in the documentation is the 26 + 128 == 154.
Lastly, where does 128 come from? I can't find any relation between the numbers besides the fact 2 raised to the 7th power is 128, but why is that significant? Is this because the first bit is reserved as a continuation flag? and an octet contains 8 bits so 8 - 1 = 7?
Exactly, it's because the first bit (value 128) needs to be set as per explanation above, to show we are continuing/overflowing into needing a third byte.
I'm starting to study Unicode now and have several doubts. At the moment I'm learning what a plane is: I saw that a plane is a set of 2^16 code points, and that the UTF-16 encoding supports 17 planes numbered from 0 to 16. My question is the following: if UTF-16 can use up to 32 bits, why does it in practice only encode about 2^20 code points? Where does the 20 come from? I know that if a code point requires more than 2 bytes, UTF-16 uses two 16-bit units, but how does that fit into all of this? In short, where does this 2^20 come from, and why not 2^32? Thanks :)
Have a look at how surrogate pairs encode a character U >= 0x10000:
U' = yyyyyyyyyyxxxxxxxxxx // U - 0x10000
W1 = 110110yyyyyyyyyy // 0xD800 + yyyyyyyyyy
W2 = 110111xxxxxxxxxx // 0xDC00 + xxxxxxxxxx
(source)
As you can see, from the 32 bits of the 2x16 surrogate pair, 2x6 = 12 bits are used "only" to convey the information that this is indeed a surrogate pair (and not simply two characters with a value < 0x10000). This leaves you with 32 - 12 = 20 bits to store U'.
(Technically, you additionally have some values for U < 0x10000, of which again some are reserved for low and high surrogates, which means you end up slightly above 2^20 codepoints which can be encoded by UTF-16 (but still well below 2^21), considering that the highest possible codepoint that is supported by UTF-16 is U+10FFFF and not 2^20 = 0x100000.)
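The bit layout above can be tried out with a short Python sketch. The function names are mine; the constants 0x10000, 0xD800 and 0xDC00 are the ones from the layout.

```python
def to_surrogates(code_point):
    """Split a code point >= 0x10000 into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary planes")
    u = code_point - 0x10000          # U', at most 20 bits
    high = 0xD800 + (u >> 10)         # W1 carries the top 10 bits
    low = 0xDC00 + (u & 0x3FF)        # W2 carries the bottom 10 bits
    return high, low


def from_surrogates(high, low):
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))
print(hex(from_surrogates(high, low)))
```

The largest pair (0xDBFF, 0xDFFF) decodes to 0x10FFFF, which is where the upper limit of the code space comes from.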
The original form of Unicode only supported 64k code points (16 bits). The intention was to support all commonly used, modern characters, and 64k really is enough for that (yes, even including Chinese). As the introduction notes (emphasis mine):
Completeness. The coded character set would be large enough to encompass all characters that were likely to be used in general text interchange.
But Unicode grew to encompass almost all human writing, including historic and lesser-used writing systems, and 64k characters was too small to handle that. (Unicode 14 has ~145k characters.) As the Unicode 2.0 introduction says (again, emphasis mine):
The Unicode Standard, Version 2.0 contains 38,885 characters from the world's scripts. These characters are more than sufficient not only for modern communication, but also for the classical forms of many languages.
In Unicode 1.x, the typical encoding was UCS-2, which is just a simple 16-bit number defining the code-point. When they decided that they were going to need more (during the Unicode 1.1 timeframe), there were only ~34k code points assigned.
Originally the thought was to create a 32-bit encoding (UCS-4) that could encode 2^31 values with one bit left over, but this would have doubled the size of the encoding, wasting a lot of space, and wouldn't have been backward compatible with UCS-2.
So they decided for Unicode 2.0 to invent a system backward-compatible with all defined UCS-2 code points, but that allowed them to scale larger. That's why they invented the surrogate pair system (which LMD's answer explains well). This created the UTF-16 encoding which completely replaces UCS-2.
The full thinking on how much space was needed for various areas is explained in the Unicode 2.0 Introduction:
There are over 18,000 unassigned code positions that are available for future allocation. This number far exceeds anticipated character coding requirements for modern and most archaic characters.
One million additional characters are accessible through the surrogate extension mechanism.... This number far exceeds anticipated encoding requirements for all world characters and symbols.
The goal was to keep "common" characters in the Basic Multilingual Plane (BMP), and to place lesser-used characters into the surrogate extension area.
The surrogate system "wastes" a lot of code points that could be used for real characters. You could imagine replacing it with a more naïve system with a single "the next character is in the surrogate space" code point. But that would create ambiguity between byte sequences. You couldn't just search for 0x0041 to find the letter A. You'd have to scan backwards to make sure it wasn't a surrogate character, making certain kinds of problems much harder.
That design choice has been pretty solid. In 20 years, with steady additions of more and more obscure scripts and characters, we've used less than 15% of the available space. We definitely didn't need another 10 bits.
Thinking in terms of multiples and powers of 4 helps a lot with understanding UTF-8 and UTF-16:
BMP/ASCII start : = 0
Supp plane start : 4 ^ ( 4 + 4 ) = 65,536
Size of BMP : 4 ^ ( 4 + 4 ) = 65,536 ( 4 ^ 8 )
Size of Supp plane : 4 * 4 * 4 ^ ( 4 + 4 ) = 1,048,576 ( 4 ^ 10 )
————————————————————————————————————————————————————————
Unicode (unadj) ( 4*4 + 4^4 ) * ( 4 + 4 )^4
= 4^8 + 4^10
= 1,114,112
UTF-8
2-byte UTF-8 start : 4 * 4 * ( 4 + 4 ) = 128
3-byte UTF-8 start : ( 4 ^ 4 ) * ( 4 + 4 ) = 2,048
4-byte UTF-8 start : 4 ^ ( 4 + 4 ) = 65,536
UTF-8 Multi-byte scale factors
trailing x 1 : 4 ^ 3 = 4 * ( 4 ) * 4 = 64
trailing x 2 : 4 ^ 6 = ( 4 + 4 ) ^ 4 = 4,096
trailing x 3 : 4 ^ 9 = 4 ^ ( 4 + 4 ) * 4 = 262,144
UTF-16
Hi surrogate start : ( 4 ^ 5 ) * 54 = 55,296 ( 0xD800 )
per surrogate width : ( 4 ^ 5 ) = 1,024 ( 0x 400 )
Lo surrogate start : ( 4 ^ 5 ) * 55 = 56,320 ( 0xDC00 )
Total surr. combos : ( 4 ^ 5 ) * ( 4 ^ 5 ) = 1,048,576 ( 4 ^ 10 )
Consider Microsoft Excel's column-numbering system. Columns are "numbered" A, B, C, ... , Y, Z, AA, AB, AC, ... where A is 1.
The column system is similar to the base-10 numbering system that we're familiar with in that when any digit has its maximum value and is incremented, its value is set to the lowest possible digit value and the digit to its left is incremented, or a new digit is added at the minimum value. The difference is that there isn't a digit that represents zero in the letter numbering system. So if the "digit alphabet" contained ABC or 123, we could count like this:
(base 3 with zeros added for comparison)
base 3 no 0    base 3 with 0    base 10 with 0
-----------    -------------    --------------
-     -        0                0
A     1        1                1
B     2        2                2
C     3        10               3
AA    11       11               4
AB    12       12               5
AC    13       20               6
BA    21       21               7
BB    22       22               8
BC    23       100              9
CA    31       101              10
CB    32       102              11
CC    33       110              12
AAA   111      111              13
Converting from the zeroless system to our base 10 system is fairly simple; it's still a matter of multiplying the power of that space by the value in that space and adding it to the total. So in the case of AAA with the alphabet ABC, it's equivalent to (1*3^2) + (1*3^1) + (1*3^0) = 9 + 3 + 1 = 13.
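In Python, that direction might look like the sketch below. The function name and the `alphabet` parameter are just for illustration; with the default A-Z alphabet it gives Excel-style column numbers.

```python
import string


def column_to_number(label, alphabet=string.ascii_uppercase):
    """Convert a zeroless label such as 'AA' to its 1-based number."""
    base = len(alphabet)
    number = 0
    for ch in label:
        # Digits run from 1 to base, so add 1 to the 0-based alphabet index.
        number = number * base + alphabet.index(ch) + 1
    return number


print(column_to_number("AAA", "ABC"))   # the worked example with alphabet ABC
print(column_to_number("AA"))           # first two-letter Excel column
```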
I'm having trouble converting inversely, though. With a zero-based system, you can use a greedy algorithm moving from largest to smallest digit and grabbing whatever fits. This will not work for a zeroless system, however. For example, converting the base-10 number 10 to the base-3 zeroless system: Though 9 (the third digit slot: 3^2) would fit into 10, this would leave no possible configuration of the final two digits since their minimum values are 1*3^1 = 3 and 1*3^0 = 1 respectively.
Realistically, my digit alphabet will contain A-Z, so I'm looking for a quick, generalized conversion method that can do this without trial and error or counting up from zero.
Edit
The accepted answer by n.m. is primarily a string-manipulation-based solution.
For a purely mathematical solution see kennytm's links:
What is the algorithm to convert an Excel Column Letter into its Number?
How to convert a column number (eg. 127) into an excel column (eg. AA)
Convert to base-3-with-zeroes first (digits 0AB), and from there, convert to base-3-without-zeroes (ABC), using these string substitutions:
A0 => 0C
B0 => AC
C0 => BC
Each substitution either removes a zero, or pushes one to the left. In the end, discard leading zeroes.
It is also possible, as an optimisation, to process longer strings of zeros at once:
A000...000 = 0BBB...BBC
B000...000 = ABBB...BBC
C000...000 = BBBB...BBC
Generalizable to any base.
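Here is one way the substitution idea could be sketched in Python, working on digit values instead of strings. The names are hypothetical. Each "borrow" step is the A0 => 0C rule expressed numerically: a zero becomes the highest digit and its left neighbour drops by one.

```python
import string


def number_to_column(number, alphabet=string.ascii_uppercase):
    """Convert a positive integer to its zeroless label."""
    if number < 1:
        raise ValueError("zeroless systems cannot represent values below 1")
    base = len(alphabet)
    # Step 1: ordinary base-k digits (with zeros), most significant first.
    digits = []
    while number:
        number, remainder = divmod(number, base)
        digits.append(remainder)
    digits.reverse()
    # Step 2: remove zeros right to left by borrowing from the left neighbour.
    # A neighbour that was already 0 becomes -1 and is fixed on the next pass.
    for i in range(len(digits) - 1, 0, -1):
        if digits[i] <= 0:
            digits[i] += base
            digits[i - 1] -= 1
    # Step 3: discard the leading zero if borrowing produced one.
    if digits[0] == 0:
        digits = digits[1:]
    return "".join(alphabet[d - 1] for d in digits)


print(number_to_column(10, "ABC"))   # the problem case from the question
print(number_to_column(703))         # first three-letter Excel column
```

The purely arithmetic alternative from the linked questions (repeated divmod on number - 1) gives the same labels; this version just stays closer to the substitutions.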
how can I compress a row of integers into something shorter?
E.g. Input: '1 2 4 9 8 5 2 7 6 2 3 4' -> Algorithm -> Output: 'X Y Z'
and can get it back the other way around? ('X Y Z' -> '1 2 4 9 8 5 2 7 6 2 3 4')
Input contains 12 digits max, numbers only. Output can be alphanumeric and should be 3-4 digits max.
Thanks in advance.
Edit: Each input digit 0-9; Output 0-9a-Z
Unless your input comes from a specific domain, where many inputs are unlikely/unacceptable - you cannot do it.
You can encode 62^4 ~= 1.48*10^7 different series with 4 alphanumeric characters.
On the other hand, an input of 12 digits can take 10^12 different values.
From the pigeonhole principle, there must be 2 inputs that are mapped to the same "compression".
But since you need to recreate the original sequence, two inputs cannot share one compression: there would be no way to tell them apart.
So such a compression does not exist.
In fact, to compress a 12-digit number into 4 characters, you are going to need an alphabet of size 1000 for the characters:
x^4 = 10^12, x>0
x = 1000
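The same counting argument can be checked with a few lines of Python. This is only a sketch; `min_chars` is a made-up helper that finds the shortest output length for a given alphabet size.

```python
def min_chars(digit_count, alphabet_size):
    """Smallest length L with alphabet_size**L >= 10**digit_count."""
    inputs = 10 ** digit_count
    length, capacity = 0, 1
    while capacity < inputs:
        capacity *= alphabet_size
        length += 1
    return length


print(62 ** 4, "outputs versus", 10 ** 12, "inputs")
print(min_chars(12, 62))     # characters needed with 0-9a-zA-Z
print(min_chars(12, 1000))   # characters needed with a 1000-symbol alphabet
```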
First, you could just use any existing compression algorithm, via some library. However, knowing that your input is very specialized, you can also write a special algorithm adapted to your case.
But let's first analyze how much you can compress this input. To simplify, I'll first consider compressing exactly 12 digits from 0 to 9 (you didn't explicitly write what the input range is however). There are 10^12 possible combinations, which is a little less than 2^40. So what you basically want to do is compress 40 bits.
Let's now analyze how you can compress these 40 bits. If you understand alphanumeric as [0-9A-Z], you have 36 characters available. Each character can encode log_2(36)=5.1 bits. Encoding your 40 bits therefore needs 8 alphanumeric characters.
A better alternative would be to use base64. Here, you have 64 characters, which means each character can encode 6 bits, so you could encode your input with only 40/6=6.666 => 7 characters.
If you consider compressing your input to binary, you will obviously need 40 bits. This can be written with 5 8-bit ASCII characters, with 2 32-bit integers or with 1 64-bit integer. However this probably isn't what you want to achieve.
Conclusion: you can't compress data arbitrarily much, and the data that you want to compress can't be compressed as much as you would like.
As an example, to encode 12 digits from 0 to 9 into ASCII characters, you could simply convert them into one big number, convert it to binary, then take this binary number in portions of 8 bits and convert them to ASCII characters.
Example:
Input: 1 2 4 9 8 5 2 7 6 2 3 4
One number: 124985276234
Binary: 1110100011001101100111111011101001010
Grouped: 11101 00011001 10110011 11110111 01001010
ASCII: <GS><EM><0xB3><0xF7>J
Note that some ASCII-symbols are not printable. If that is important to you, you'll have to use an alternative encoding, as for example base 64, which only has 64 different characters, but they are all printable.
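As a rough Python sketch of that idea: pack the digits into 5 bytes (40 bits) and then print them with the standard library's URL-safe base64, which yields 7 printable characters. The function names are mine, and note that this base64 variant uses '-' and '_' in addition to letters and digits, so it is not strictly alphanumeric.

```python
import base64


def pack_digits(digits):
    """Pack up to 12 decimal digits into 7 printable base64 characters."""
    if not digits.isdigit() or len(digits) > 12:
        raise ValueError("expected 1 to 12 decimal digits")
    raw = int(digits).to_bytes(5, "big")    # any value below 10**12 fits in 40 bits
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


def unpack_digits(text, width=12):
    """Invert pack_digits, restoring leading zeros up to `width` digits."""
    raw = base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))
    return str(int.from_bytes(raw, "big")).zfill(width)


packed = pack_digits("124985276234")
print(packed, len(packed))
print(unpack_digits(packed))
```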
Similar discussion
Compressing a set of large integers
PHP Compress array of bits into shortest string possible
$val = pack('H*', "124985276234");
echo '#'. $val . '#';
print_r(unpack('H*', $val));
die;
#Issue
00011001 => 25
11001 => 25
I was trying to implement Misch's algorithm in PHP, but some bits came out wrong when using decbin and gave me bad results when unpacking. Then I found the pack function, which works similarly. But numbers from 0 to 9 are wrong after unpacking, and in a test of 9,000,000 values, 8,090,899 were decompressed to the wrong value, although no collision was found.
set_time_limit(0);
ini_set('memory_limit', '5000M');
ini_set("max_execution_time",0);
$collision = [];
$err = [];
for ($i=0; $i < 9000000; $i++) {
    $packed = pack('H*', $i);
    $unpacked = unpack('H*', $packed)[1];
    if ( array_key_exists($i, $collision) ) {
        die("Collision:". $i .' !!!!'. $packed .'!!!!'. $unpacked);
    }
    if ( $i != $unpacked ) {
        $e = "Collision2:". $i .' !!!!'. $packed .'!!!!'. $unpacked . "\n";
        #echo $e;
        $err[] = $e;
    }
    $collision[] = $packed;
    #echo '#'. $i .'#' . $unpacked . '#' . $unpacked . "#\n";
}
I'm using the following set of values to create a 9 character long base 31 value:
0123456789ABCDEFGHJKLMNPQRTUWXY
I was looking at modifying the Luhn algorithm to work with my base.
My question is:
In base 10, the Luhn algorithm doubles each value in an even position and then, if the result is greater than 9, adds the individual digits of the result together.
Should I still be doubling my even position values, or using a higher multiplier?
I'm trying to protect against transposed characters, missing characters, extra characters and just plain wrong digits.
I looked into the Luhn mod N algorithm, but it is very limited in what it can validate against.
I decided to use a modified version of the Freight Container system.
The shipping container system multiplies each value by 2^[position] (position starting from 0) and then performs a modulus 11 on the sum to get a base 10 check digit (a result of 10 is not recommended).
In this case, the trick is to find values in the range x^0 to x^[length] which are not evenly divisible by the figure you use on the modulus.
I've decided to use 3^[position] as the multiplier and to perform a modulus 31 on the sum to get the check digit.
As an example: 0369CFJMK
Character     0    3    6    9    C     F      J      M    K
Value         0    3    6    9   12    15     18     21   19
--------------------------------------------------------------
Multiplier    1    3    9   27   81   243    729   2187
Result        0    9   54  243  972  3645  13122  45927
Total         63972 MOD 31 = 19
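The calculation in the table can be reproduced with a short Python sketch. The function names are made up for illustration; the alphabet, the 3^[position] multiplier and the modulus 31 are the ones described above.

```python
ALPHABET = "0123456789ABCDEFGHJKLMNPQRTUWXY"   # the 31 symbols from the question


def check_char(payload, multiplier=3):
    """Check character: sum(value * multiplier**position) mod 31."""
    total = sum(ALPHABET.index(ch) * multiplier ** pos
                for pos, ch in enumerate(payload))
    return ALPHABET[total % len(ALPHABET)]


def is_valid(code):
    """True if the last character is the check character of the rest."""
    return len(code) > 1 and check_char(code[:-1]) == code[-1]


print(check_char("0369CFJM"))   # check character for the worked example
print(is_valid("0369CFJMK"))    # original code
print(is_valid("0369CFMJK"))    # J and M transposed
```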
It seems that with this sort of algorithm, the main requirement is that the multiplier is not evenly divisible by the base and that the pattern of the remainders doesn't repeat within the length of the code you want to validate.
Don't reinvent the wheel - use Luhn mod N instead.