Map string to unique 0..1 float value, while keeping order - algorithm

I would like to use Redis to sort string values (using sorted sets), but I can only use floats for that purpose. I am looking for an algorithm to convert a string to a float value in 0..1 while keeping order.
I mean that s1 < s2 (alphabetically) should imply that f(s1) < f(s2).
Is there such an algorithm?
P.S. I will use such an algorithm for sorting usernames, and in most cases players with matching scores will have quite different usernames. So in most cases either approach should work, but there is still room for collisions. On the other hand, the strings will be sorted more or less properly, and it's acceptable if nearly identical usernames are sorted incorrectly.

Each character can be mapped to its ASCII code. If you convert each string to its numeric equivalent by concatenating all the ASCII codes (zero-padding each code to three digits so that every character occupies the same width), you will keep the ordering. But if your strings are long, the resulting numbers will be huge, and the mapping might not stay unique once rounded into a float (strings that begin with the same characters can collide due to the limited precision of floats).
For example:
'hello' -> 104101108108111
If you know which subset of characters your strings contain (for instance, only lowercase letters, or only uppercase letters and digits), you can create your own mapping that uses fewer digits per character.
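For illustration, here is a minimal Python sketch of this idea, assuming printable ASCII input and zero-padding each code to three digits; note that a 64-bit float keeps only about 15-16 significant digits, so roughly the first five characters, and longer shared prefixes will collide:

    def string_to_score(s):
        # Concatenate zero-padded ASCII codes and read them as a decimal
        # fraction in [0, 1); 'hello' -> '104101108108111' -> 0.104101108108111
        digits = "".join("%03d" % ord(c) for c in s)
        return float("0." + digits) if digits else 0.0

    names = ["alice", "bob", "carol", "dave"]
    scores = [string_to_score(n) for n in names]
    assert scores == sorted(scores)  # alphabetical order is preserved here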

Mathematically, such an algorithm exists and is trivial: Simply put a radix point (“.”) before the string and interpret it as a base-256 numeral (assuming your string uses 8-bit characters). Analogously, if your string had just the characters “0” to “9”, you would read it as a decimal numeral, such as .58229 for the string “58229”. You are doing the same thing, just with base 256 instead of base 10.
Practically, this is not possible without a severely restricted set of potential strings or special floating-point software. Since a typical floating-point object has a finite size, it has a finite number of possible values. E.g., a floating-point object with 64 bits has at most 2^64 values, even neglecting those that stand for special notions such as NaN. In contrast, a string of arbitrary length has infinitely many potential values. Even if you limit the string to something reasonable in today's computer memories, it has hugely more potential values than a normal floating-point object does.
To solve this, you must either decrease the number of potential strings (by limiting their length or otherwise restricting which strings are allowed) or increase the number of potential floating-point values (perhaps by using special arbitrary-precision floating-point software).
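As a rough illustration of both points, here is a small Python sketch (assuming ASCII input, with illustrative function names): the base-256 fraction is exact when kept as a rational number, but collapses once rounded to a 64-bit float for strings that share a long prefix:

    from fractions import Fraction

    def base256_fraction(s):
        # Read the bytes of the string as the digits of a base-256 fraction in [0, 1)
        value, scale = Fraction(0), Fraction(1)
        for b in s.encode("ascii"):
            scale /= 256
            value += b * scale
        return value

    a, b = "username_aaaaaaa1", "username_aaaaaaa2"
    print(base256_fraction(a) < base256_fraction(b))                 # True: exact values keep order
    print(float(base256_fraction(a)) == float(base256_fraction(b)))  # True: a 64-bit float runs out of precision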

Related

LC-3 How to store a number larger than 16 bits and print it out to the console?

I'm having difficulty storing and displaying numbers greater than 32767 in LC-3, since a register can only hold values from -32768 to 32767. My apologies for not being able to come up with any idea for the algorithm. Please give me some suggestions. Thanks!
You'll need a representation to store the larger number in a pair or more of words.
There are several approaches to how big integers are stored: in a fixed number of words, and in a variable number of words or bytes.  The critical part is being able to detect the presence and amount of overflow/carry on mathematical operations like *10.
For that reason, one simple approach is to use a variable number of words/bytes (for a single number) and store only one decimal digit in each word/byte. That way, multiplication by 10 simply means appending a digit at the end (which has the effect of moving each existing digit to the next higher power-of-ten position). Adding numbers in this form is fairly easy as well: line up the digits and add them, and whenever a digit sum is >= 10 there is a carry (of 1) to add to the next higher-order digit of the sum. (If adding two such variable-length numbers is desired, I would store the decimal digits in reverse order, because then the low-order digits are already lined up for addition.) See also https://en.wikipedia.org/wiki/Binary-coded_decimal . (In some sense, this is like storing numbers in a string-like form, but using binary values instead of ASCII characters.)
To simplify this approach for your needs, you can fix the number of words to use, e.g. at 7, for 7 digits.
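A minimal sketch of this digit-per-word scheme, written in Python rather than LC-3 assembly just to show the logic (function names are illustrative); digits are stored lowest-first, so addition lines them up and multiplying by 10 is just prepending a zero digit:

    def times_ten(digits):                # digits[0] is the ones place
        return [0] + digits if digits != [0] else digits

    def add(a, b):
        result, carry = [], 0
        for i in range(max(len(a), len(b))):
            s = (a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0) + carry
            result.append(s % 10)         # keep one decimal digit per "word"
            carry = s // 10               # carry of 1 into the next higher digit
        if carry:
            result.append(carry)
        return result

    # build 655 from 65 via *10 then +5; the same loops work far beyond 16 bits
    n = add(times_ten([5, 6]), [5])
    print("".join(str(d) for d in reversed(n)))   # 655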
A variation on (unpacked) binary-coded decimal is to pack two decimal digits per byte. It's a bit more complicated but saves some storage.
Another approach is to store as many decimal digits as will fit fully in a word, minus one. Since 16 bits can hold values up to 65535, that is only 4 full decimal digits, so you would put 3 digits at a time into a word (leaving room for the multiply-by-10 step); you'd need 3 words for 9 digits. Multiplication by 10 means multiplying each word by 10 numerically, then checking whether the result exceeds 999; if so, carry the overflowing high digit to the next higher-order word and keep only the low three digits (the value mod 1000) in the overflowing word.
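A short sketch of that approach (assuming 3 decimal digits per 16-bit word, least significant group first): multiply each word by 10 and propagate the overflowing high digit to the next word.

    def times_ten_base1000(words):        # words[0] is the least significant group
        carry = 0
        for i in range(len(words)):
            v = words[i] * 10 + carry     # at most 9990 + 9, still fits in 16 bits
            words[i] = v % 1000           # keep the low three digits
            carry = v // 1000             # 0..9, carried to the next word
        if carry:
            words.append(carry)
        return words

    # 654321 stored as [321, 654]; times ten -> 6543210 stored as [210, 543, 6]
    print(times_ten_base1000([321, 654]))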
This approach will require actual multiplication and division by 10 on each of the individual words.
There are other approaches, such as using all 16-bits in a word as magnitude, but the difficulty there is determining the amount of overflow/carry on *10 operations.  It is not a monumental task but will require work.  See https://stackoverflow.com/a/1815371/471129, for example.
(If you also want to store negative numbers, that is a representation issue as well. We can either store the sign separately, known as sign-magnitude form (e.g. in its own word/byte, or packed into the highest byte), or store the number in a complement form. The former is better for variable-length implementations and the latter can be made to work for fixed-length implementations.)

Using integers to encode short strings

Suppose I were limited to using only 32-bit unsigned integers to express strings. Obviously, I could use individual u8 numbers and allocate enough separate values to describe a short string, but say compute and time aren't important; this is for my curiosity, not necessarily for real-world use.
I observe that a 32-bit number is the same size as 4 strict u8 chars. In decimal, there’s space to encode 4 of any character-encoding that could be indexed by a 2-digit decimal as their decimal equivalent, while 5 ECMA-1 characters could fit in the same bitsize.
Suppose I want the range of printable characters, using a mapped ASCII table, where I subtract 32 to get the printable characters into 2 decimal digits (32 to 126 become 0 to 94). Suppose a mapping function similar to |c,i|c-31*(10^((i+1)*2)), where c is the ASCII value and i is the position: 45769502. In ASCII values as a u8 array [66, 97, 116, 33], or the string “Bat!”
Clearly this is not computationally efficient. I’m not necessarily shooting for that? Just pure curiosity here.
Supposing compute is arbitrary, so even being totally absurd, how might I encode a longer string in a 32-bit unsigned integer?
First you need to decide on which characters you want to encode. Suppose you have chosen k characters which you have mapped to the numbers 0 to k-1. Then every integer n is mapped to a unique non-empty string by expressing n in base k and mapping each k-ary digit to the corresponding character. You could reserve the maximum integer for the empty string.
So you just need a mapping table for the k characters and a function to convert an integer from one base to another, that's simple and efficient, and the encoding is also optimally dense (since every integer maps to a unique string).
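Here is a small Python sketch of that scheme, assuming k = 26 lowercase letters and a 32-bit integer; one caveat worth noting is that with a plain base-k expansion, strings starting with the character mapped to digit 0 ('a' here) are never produced, so the map is injective from integers to strings rather than a full bijection:

    ALPHABET = "abcdefghijklmnopqrstuvwxyz"
    K = len(ALPHABET)
    EMPTY_MARKER = 2**32 - 1              # reserve the maximum integer for the empty string

    def int_to_string(n):
        if n == EMPTY_MARKER:
            return ""
        digits = []
        while True:
            digits.append(ALPHABET[n % K])
            n //= K
            if n == 0:
                break
        return "".join(reversed(digits))

    def string_to_int(s):
        if s == "":
            return EMPTY_MARKER
        n = 0
        for c in s:
            n = n * K + ALPHABET.index(c)
        return n

    print(int_to_string(string_to_int("bat")))   # 'bat'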

How good is hash function that is linear combination of values?

I was reading a text about hashing and found that the naive hash code of a character string can be implemented as a polynomial hash function
h(S0, S1, S2, ..., SN-1) = S0*A^(N-1) + S1*A^(N-2) + S2*A^(N-3) + ... + SN-1*A^0, where Si is the character at index i and A is some integer.
But can't we instead sum straightforwardly, as
h(S0, S1, S2, ..., SN-1) = S0*N + S1*(N-1) + S2*(N-2) + ... + SN-1*1.
This function also seems good to me, since the two values 2*S0 + S1 and 2*S1 + S0 (a string and its reverse) do not hash to the same value. But nowhere have I seen this type of hash function used.
Suppose we work with strings of 30 characters. That's not long, but it's not so short that problems with the hash should arise purely because the strings are too short.
The sum of the weights is 465 (1+2+...+30); with printable ASCII characters that makes the maximum hash 58590, attained by "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~". There are a lot more possible printable ASCII strings of 30 characters than that (95^30 ≈ 2E59), but they all hash into the range of 0 to 58590. Naturally you cannot actually have that many strings at the same time, but you could have a lot more than 58590, and that would guarantee collisions just based on counting (it is very likely to happen much sooner of course).
The maximum hash grows only slowly; you'd need a weight sum of about 34 million, i.e. strings of roughly 8,000 characters, before the entire range of a 32-bit integer is covered (126 * N*(N+1)/2 first exceeds 2^32 around N ≈ 8,257).
The other way, multiplying by powers of A, does not have this problem (it can be evaluated with Horner's scheme, so no powers need to be calculated explicitly; it still costs only an addition and a multiplication per character, though the naive way is not the fastest way to compute that hash). The powers of A quickly get big (and start wrapping, which is fine as long as A is odd), so strings of 30 characters stand a good chance of covering the entire range of whatever integer type you're using.
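A quick Python sketch contrasting the two hashes on 30-character printable-ASCII input (A = 31 is just a commonly used example multiplier, and both results are reduced to 32 bits):

    def linear_hash(s):
        n = len(s)
        return sum(ord(c) * (n - i) for i, c in enumerate(s)) & 0xFFFFFFFF

    def polynomial_hash(s, a=31):
        h = 0
        for c in s:                       # Horner's scheme: no explicit powers of A
            h = (h * a + ord(c)) & 0xFFFFFFFF
        return h

    s = "~" * 30
    print(linear_hash(s))      # 58590, the largest value any 30-char printable string can reach
    print(polynomial_hash(s))  # some value anywhere in the full 32-bit range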
The problem with a linear hash function is that it's much easier to generate collisions.
Consider a string with 3 chars: S0, S1, S2.
The proposed hash code would be 3 * S0 + 2 * S1 + S2.
Every time we decrease char S2 by two (e.g. e --> c), and increase char S1 by one (e.g. m --> n), we obtain the same hash code.
The mere fact that it is so easy to describe a hash-preserving operation should be alarming (because some algorithm might process the string in exactly that manner). As a more extreme case, consider just summing the characters: all the anagrams of the original string would then generate the same hash code (so this hash would be useless in an application processing anagrams).
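A tiny demonstration of the collision described above (weights 3, 2, 1 on a 3-character string): raise S1 by one and lower S2 by two and the hash is unchanged.

    def weighted_hash(s):
        n = len(s)
        return sum(ord(c) * (n - i) for i, c in enumerate(s))

    # 'm' -> 'n' (+1 at weight 2) and 'e' -> 'c' (-2 at weight 1) cancel out
    print(weighted_hash("ame"), weighted_hash("anc"))   # both 610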

adversary argument for finding n-bit strings

Given:
S, a set of an odd number of n-bit strings
A, a particular n-bit string
show that any algorithm that decides whether A is in S must examine all n bits of A in the worst case.
Usually of course we would expect to have to look at all the parts of a string to do the matching, but there's something particular about S having an odd size that's escaping me.
Let's say we have an algorithm M that decides membership in S correctly: for any input n-bit string, it says whether that string is in S or not.
Suppose that for a given input n-bit string s1, the algorithm M never looks at bit i of s1 and goes on to say "s1 is in (not in) S". Then the string s2, equal to s1 except with bit i flipped, must also be in (not in) S, because M behaves identically on s1 and s2 and is assumed correct. That is, for any string we feed into M, if M doesn't look at a particular bit, then there is a second string with that bit flipped that has the same membership status.
Then what is special about odd-sized sets S? We can't pair up the strings in S evenly. That is, there must be a string s3 that M examines and decides is in S, for which no single bit can be flipped to form another string in S. So M must look at all the bits of s3 (otherwise we could construct such a paired string, as we did before).
I guess the odd number clue is to find the end of your set or array in memory.
Assume you are using a 32-bit system.
Perhaps the compiler aligns your program's data structures in memory on eight-byte boundaries. You have a whole load of string pointers in your data segment. If there is an odd number of strings, the next thing that needs eight-byte alignment has four bytes of padding in front of it. If there is an even number of strings, there is no padding.
If I understand this correctly, it's irrelevant whether S has an odd or even number of strings. To check that any particular string in S matches the arbitrary string A, you must compare them character by character. You can stop early if one string is shorter than the other or if a character you're checking doesn't match.

How to convert a variable length string to a 16 character string?

I need to convert a variable-length string of 60 - 100 characters into a unique 16-character string. Please note that the input will be unique as well. Is there some existing hashing algorithm I can use? Or is there another way this can be achieved?
There are many more unique variable-length strings of 60-100 characters than there are 16-character strings.
So a general, collision-free algorithm is not possible. It is like trying to map each unique letter (a-z) to a unique digit (0-9).
Hash functions don't have unique outputs, by the pigeonhole principle. If there are actually fewer than 2^16 distinct inputs, it is possible, but this is not a design goal of any hash function I know of, and you'd have to create the hash function with knowledge of all inputs. So you'd have to emulate it. The simplest scheme that's reasonably efficient seems to be:
Keep an associative array from input strings to integers
When hashing a string, look it up in the associative array
If it's in the associative array, return the associated value
Else, let map[input] = map.entry_count() and return that
This assigns a unique integer to every input, and does the hashing in either O(1) expected time or O(log n) time, but it takes some space. You can then convert this integer into a string if you must, for example with base64 encoding, hexadecimal notation, or by interpreting it as a string in some character encoding (though you have to take care to end up with valid strings). Each of those gives you far more than 10^16 possible outputs and makes the strings less likely to be confused with numbers.
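A minimal Python sketch of this emulation, handing out sequential integers and rendering them as zero-padded 16-character hexadecimal strings (the class and method names here are just for illustration):

    class SequentialHasher:
        def __init__(self):
            self.table = {}                       # input string -> assigned integer

        def hash16(self, s):
            if s not in self.table:
                self.table[s] = len(self.table)   # next unused integer
            return format(self.table[s], "016x")  # zero-padded, 16 hex characters

    h = SequentialHasher()
    print(h.hash16("first unique input string, 60 to 100 characters long ..."))   # '0000000000000000'
    print(h.hash16("second unique input string, also somewhere in that range"))   # '0000000000000001'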
