Using integers to encode short strings - algorithm

Suppose I were limited to using only 32-bit unsigned integers to express strings. Obviously, I could use individual u8 numbers and allocate enough separate values to describe a short string, but say compute and time aren’t important, this being for my curiosity, not necessarily for a real world use.
I observe that a 32-bit number is the same size as 4 strict u8 chars. In decimal, there’s space to encode 4 of any character-encoding that could be indexed by a 2-digit decimal as their decimal equivalent, while 5 ECMA-1 characters could fit in the same bitsize.
Suppose I want the range of printable characters, using a mapped ASCII table, where I subtract 32 to get the printable characters into 2 decimal digits (32 to 126 become 0 to 94). Suppose a mapping function similar to |c,i|c-31*(10^((i+1)*2)), where c is the ASCII value and i is the position: 45769502. In ASCII values as a u8 array [66, 97, 116, 33], or the string “Bat!”
Clearly this is not computationally efficient. I’m not necessarily shooting for that? Just pure curiosity here.
Supposing compute is arbitrary, so even being totally absurd, how might I encode a longer string in a 32-bit unsigned integer?

First you need to decide on which characters you want to encode. Suppose you have chosen k characters which you have mapped to the numbers 0 to k-1. Then every integer n is mapped to a unique non-empty string by expressing n in base k and mapping each k-ary digit to the corresponding character. You could reserve the maximum integer for the empty string.
So you just need a mapping table for the k characters and a function to convert an integer from one base to another, that's simple and efficient, and the encoding is also optimally dense (since every integer maps to a unique string).

Related

LC-3 How to store a number large than 16-bit and print it out to console?

I'm having difficulty storing and displaying numbers greater than 32767 in LC-3 since a register can only hold values from -32768 to 32767. My apology for not being able to come up with any idea for the algorithm. Please give me some suggestion. Thanks!
You'll need a representation to store the larger number in a pair or more of words.
There are several approaches to how big integers are stored: in a fixed number of words, and in a variable number of words or bytes.  The critical part is being able to detect the presence and amount of overflow/carry on mathematical operations like *10.
For that reason, one simple approach is to use a variable number of words/bytes (for a single number), and store only one decimal digit in each of the words/bytes.  That way multiplication by 10, means simply adding a digit on the end (which has the effect of moving each existing digit to the next higher power of ten position).  Adding numbers of this form numbers is fairly easy as well, we need to line up the digits and then, we add them up and detect when the sum is >= 10, then there is a carry (of 1) to be added to the next higher order digit of the sum.  (If adding two such (variable length) numbers is desired, I would store the decimal digits in reverse order, because then the low order numbers are already lined up for addition.)  See also https://en.wikipedia.org/wiki/Binary-coded_decimal .  (In some sense, this is like storing numbers in a form like string, but using binary values instead of ascii characters.)
To simplify this approach for your needs, you can fix the number of words to use, e.g. at 7, for 7 digits.
A variation on (unpacked) Binary-coded Decimal to pack them two decimal digits per byte.  Its a bit more complicated but saves some storage.
Another approach is to store as many decimal digits as will fit full in a word, minus 1.  Which is to say if we can store 65536 in 16-bits that's only 4 full decimal digits, which means putting 3 digits at a time into a word.  You'd need 3 words for 9 digits.  Multiplication by 10 means multiplying each word by 10 numerically, and then checking for larger than 999, and if larger, then carry the 1 to the next higher order word while also subtracting 10,000 from the overflowing word.
This approach will require actual multiplication and division by 10 on each of the individual words.
There are other approaches, such as using all 16-bits in a word as magnitude, but the difficulty there is determining the amount of overflow/carry on *10 operations.  It is not a monumental task but will require work.  See https://stackoverflow.com/a/1815371/471129, for example.
(If you also want to store negative numbers, that is also an issue for representation.  We can either store the sign as separately known as sign-magnitude form (as in stored its own word/byte or packed into the highest byte) or store the number in a compliment form.  The former is better for variable length implementations and the latter can be made to work for fixed length implementations.)

How can one byte be more significant than another?

The difference between little endian and big endian was explained to me like this: "In little endian the least significant byte goes into the low address, and in big endian the most significant byte goes into the low address". What exactly makes one byte more significant than the other?
In the number 222, you could regard the first 2 as most significant because it represents the value 200; the second 2 is less significant because it represents the value 20; and the third 2 is the least significant digit because it represents the value 2.
So, although the digits are the same, the magnitude of the number they represent is used to determine the significance of a digit.
It is the same as when a value is rounded to a number of significant figures ("S.F." or "SF"): 123.321 to 3SF is 123, to 2SF it is 120, to 4SF it is 123.3. That terminology has been used since before electronics were invented.
In any positional numeric system, each digit has a different weight in creating the overall value of the number.
Consider the number 51354 (in base 10): the first 5 is more significant than the second 5, as it stands for 5 multiplied by 10000, while the second 5 is just 5 multiplied by 10.
In computers number are generally fixed-width: for example, a 16 bit unsigned integer can be thought as a sequence of exactly 16 binary digits, with the leftmost one being unambiguously the most significant (it is worth exactly 32768, more than any other bit in the number), and the rightmost the least significant (it is worth just one).
As long as integers are in the CPU registers we don't really need to care about their representation - the CPU will happily perform operations on them as required. But when they are saved to memory (which generally is some random-access bytes store), they need to be represented as bytes in some way.
If we consider "normal" computers, representing a number (bigger than one byte) as a sequence of bytes means essentially representing it in base 256, each byte being a digit in base 256, and each base-256 digit being more or less significant.
Let's see an example: take the value 54321 as a 16 bit integer. If you write it in base 256, it'll be two base-256 digits: the digit 0xD41 (which is worth 0xD4 multiplied by 256) and the digit 0x31 (which is worth 0x31 multiplied by 1). It's clear that the first one is more significant than the second one, as indeed the leftmost "digit" is worth 256 times more than the one at its right.
Now, little endian machines will write in memory the least significant digit first, big endian ones will do the opposite.
Incidentally, there's a nice relationship between binary, hexadecimal and base-256: 4 bits are mapped straight to a hexadecimal digit, and 2 hexadecimal digits are mapped straight to a byte. For this reason you can also see that 54321 is in binary
1101010000110001 = 0xD431
can be split straight into two groups of 8 bits
11010100 00110001
which are the 0xD4 and 0x31 above. So you can see as well that the most significant byte is the one that contains the most significant bits.
Here I'm using the corresponding hexadecimal values to represent each base-256 digit, as there's no good way to represent them symbolically. I could use their ASCII character value, but I 0xD4 is outside ASCII, and 0x31 is 1, which would only add confusion.

Best way to represent numbers of unbounded length?

What's the most optimal (space efficient) way to represent integers of unbounded length?
(The numbers range from zero to positive-infinity)
Some sample number inputs can be found here (each number is shown on it's own line).
Is there a compression algorithm that is specialized in compressing numbers?
You've basically got two alternatives for variable-length integers:
Use 1 bit of every k as an end terminator. That's the way Google protobuf does it, for example (in their case, one bit from every byte, so there are 7 useful bits in every byte).
Output the bit-length first, and then the bits. That's how ASN.1 works, except for OIDs which are represented in form 1.
If the numbers can be really big, Option 2 is better, although it's more complicated and you have to apply it recursively, since you may have to output the length of the length, and then the length, and then the number. A common technique is to use a Option 1 (bit markers) for the length field.
For smallish numbers, option 1 is better. Consider the case where most numbers would fit in 64 bits. The overhead of storing them 7 bits per byte is 1/7; with eight bytes, you'd represent 56 bits. Using even the 7/8 representation for length would also represent 56 bits in eight bytes: one length byte and seven data bytes. Any number shorter than 48 bits would benefit from the self-terminating code.
"Truly random numbers" of unbounded length are, on average, infinitely long, so that's probably not what you've got. More likely, you have some idea of the probability distribution of number sizes, and could choose between the above options.
Note that none of these "compress" (except relative to the bloated ascii-decimal format). The asymptote of log n/n is 0, so as the numbers get bigger the size of the size of the numbers tends to occupy no (relative) space. But it still needs to be represented somehow, so the total representation will always be a bit bigger than log2 of the number.
You cannot compress per se, but you can encode, which may be what you're looking for. You have files with sequences of ASCII decimal digits separated by line feeds. You should simply Huffman encode the characters. You won't do much better than about 3.5 bits per character.

Map string to unique 0..1 float value, while keeping order

I would like to use Redis to sort string values (using sorted sets), but I can only use floats for that purpose. I am looking for an algorithm to convert string to a float 0..1 value, while keeping order.
I mean that s1 < s2 (alphabetically) should imply that f(s1) < f(s2).
Is there such an algorithm?
P.S. I will use such an algorithm for sorting usernames and in the most cases players with matching scores would have quite different usernames. So in the most cases either approach should work, but there is still room for collisions. On the other hand strings will be sorted moreless properly and it's acceptable if almost the same usernames are sorted incorrectly.
Each character can be mapped to its ASCII number. If you convert each string to its float equivalent concatenating all the ASCII numbers (with eventually zeros in front of them so that all characters will be mapped to three numbers) you will keep ordering. But if your strings are long, your floats will be huge and your mapping might not be unique (if several strings begin with the same characters, due to rounding inside the floats).
For example:
'hello' -> 104101108108111
If you know which subsets of characters your strings contain (for instance, only lowercase letters, or only uppercase letters and numbers) you can create your own mapping to use less numbers per character.
Mathematically, such an algorithm exists and is trivial: Simply put a radix point (“.”) before the string and interpret it as a base-256 numeral (assuming your string uses 8-bit characters). Analogously, if your string had just the characters “0” to “9”, you would read it as a decimal numeral, such as .58229 for the string “58229”. You are doing the same thing, just with base 256 instead of base 10.
Practically, this is not possible without a severely restricted set of potential strings or special floating-point software. Since a typical floating-point object has a finite size, it has a finite number of possible values. E.g., a floating-point object with 64 bits has at most 264 values, even neglecting those that stand for special notions such as NaN. Conversely, a string of arbitrary length has infinitely many potential values. Even if you limit the string to something reasonable in today’s computer memories, it has hugely more potential values than a normal floating-point object does.
To solve this, you must either decrease the number of potential strings (by limiting their length or otherwise restricting which strings are allowed) or increase the number of potential floating-point values (perhaps by using special arbitrary-precision floating-point software).

How to convert a variable length string to a 16 character string?

I need to convert a variable length string of length 60 - 100 characters into a 16 character unique string. Please note that the input will be unique as well. Is there some existing hashing algorithm i can use? Or is there another way this can be achieved?
Variable length 60 string unique can be a lot more than 16 length strings.
So a general, good algorithm is not possible. It is like generate from each letter unique ( a-z) a digit (0-9)
Hash functions don't have unique outputs by the pigeonhole principle. If you there are actually less than < 2^16 distinct inputs, it is possible, but this is not a design goal of any hash functions I know of, and you'd have to create the hash function with knowledge of all inputs. So you'd have to emulate it. The simplest scheme that's reasonable efficient seems to be:
Keep an associative array from input strings to integers
When hashing a string, look it up in the associative array
If it's in the associative array, return the associated value
Else, let map[input] = map.entry_count() and return that
This assigns a unique integer to every input, and does hashing in either O(1) expected time or O(log n) time, but it takes some space. You can then convert this integer into a string if you must, for example with base64 encoding, hexadecimal notation, or by interpreting it as string in some character encoding (though you have to take care to end up with valid strings). Each of those gives you far more than 10^16 outcomes, and makes the strings less likely to be confused for numbers.

Resources