I need to convert a variable-length string of 60-100 characters into a 16-character unique string. Please note that the input will be unique as well. Is there some existing hashing algorithm I can use? Or is there another way this can be achieved?
There are far more unique strings of length 60-100 than there are strings of length 16, so a general, good algorithm is not possible. It is like trying to map each letter (a-z) to a unique digit (0-9).
Hash functions don't have unique outputs, by the pigeonhole principle. If there are actually fewer than 2^16 distinct inputs, it is possible, but this is not a design goal of any hash function I know of, and you'd have to create the hash function with knowledge of all inputs. So you'd have to emulate it. The simplest scheme that's reasonably efficient seems to be:
Keep an associative array from input strings to integers
When hashing a string, look it up in the associative array
If it's in the associative array, return the associated value
Else, let map[input] = map.entry_count() and return that
This assigns a unique integer to every input, and does hashing in either O(1) expected time or O(log n) time, but it takes some space. You can then convert this integer into a string if you must, for example with base64 encoding, hexadecimal notation, or by interpreting it as a string in some character encoding (though you have to take care to end up with valid strings). Each of those gives you far more than 10^16 outcomes, and makes the strings less likely to be confused for numbers.
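A minimal sketch of this scheme in Python (the dict plays the associative array; the lowercase alphabet and base-26 rendering are just illustrative choices for producing a fixed 16-character string):

import string

def make_unique_mapper(width=16, alphabet=string.ascii_lowercase):
    # Assign each distinct input the next integer, then render that
    # integer as a fixed-width base-k string (k = len(alphabet)).
    seen = {}  # associative array: input string -> integer

    def unique_key(s):
        if s not in seen:
            seen[s] = len(seen)  # map[input] = map.entry_count()
        n = seen[s]
        digits = []
        for _ in range(width):  # always emit exactly 16 characters
            n, r = divmod(n, len(alphabet))
            digits.append(alphabet[r])
        return "".join(reversed(digits))

    return unique_key

unique_key = make_unique_mapper()
assert unique_key("first input") == unique_key("first input") != unique_key("second input")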
Suppose I were limited to using only 32-bit unsigned integers to express strings. Obviously, I could use individual u8 numbers and allocate enough separate values to describe a short string, but say compute and time aren’t important, this being for my curiosity, not necessarily for a real world use.
I observe that a 32-bit number is the same size as four strict u8 chars. In decimal, there's space to encode four characters of any encoding that can be indexed by a two-digit decimal number, while five characters of the 6-bit ECMA-1 code could fit in the same bit size.
Suppose I want the range of printable characters, using a mapped ASCII table where I subtract 32 to bring the printable characters into two decimal digits (32 to 126 become 0 to 94). Suppose a mapping function similar to |c, i| (c - 32) * 100^(N-1-i), where c is the ASCII value, i is the position, and N is the length: 34658401, which as a u8 array of ASCII values is [66, 97, 116, 33], or the string “Bat!”
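In code, the packing I have in mind looks something like this (a rough Python sketch of my own mapping):

def pack_printable(s):
    # Two decimal digits per character: printable ASCII 32..126
    # shifts down to 0..94, and the pairs are concatenated.
    n = 0
    for c in s:
        n = n * 100 + (ord(c) - 32)
    return n

print(pack_printable("Bat!"))  # 34658401, i.e. pairs 34|65|84|01 <-> [66, 97, 116, 33]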
Clearly this is not computationally efficient. I'm not necessarily shooting for that; just pure curiosity here.
Supposing compute is arbitrary, so even being totally absurd, how might I encode a longer string in a 32-bit unsigned integer?
First you need to decide on which characters you want to encode. Suppose you have chosen k characters which you have mapped to the numbers 0 to k-1. Then every integer n is mapped to a unique non-empty string by expressing n in base k and mapping each k-ary digit to the corresponding character. You could reserve the maximum integer for the empty string.
So you just need a mapping table for the k characters and a function to convert an integer from one base to another, that's simple and efficient, and the encoding is also optimally dense (since every integer maps to a unique string).
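A sketch of that conversion in Python. I use bijective base-k numeration (digits run 1..k rather than 0..k-1), which avoids the leading-zero ambiguity of plain base-k and makes the integer-to-string mapping a true bijection; the lowercase alphabet is just an example choice:

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # the k chosen characters (k = 26 here)

def int_to_string(n):
    # Bijective base-k: every non-negative integer gets a distinct
    # non-empty string, and every non-empty string is reachable.
    k = len(ALPHABET)
    n += 1  # shift so 0 maps to the first one-character string
    chars = []
    while n > 0:
        n, r = divmod(n - 1, k)
        chars.append(ALPHABET[r])
    return "".join(reversed(chars))

def string_to_int(s):
    k = len(ALPHABET)
    n = 0
    for ch in s:
        n = n * k + ALPHABET.index(ch) + 1
    return n - 1

assert string_to_int(int_to_string(12345)) == 12345
print(int_to_string(2**32 - 1))  # the string for the largest 32-bit value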
I was reading a text about hashing, and I found out that a naive hash code of a character string can be implemented as a polynomial hash function:
h(S0, S1, S2, ..., S(N-1)) = S0*A^(N-1) + S1*A^(N-2) + S2*A^(N-3) + ... + S(N-1)*A^0, where Si is the character at index i and A is some integer.
But can't we straightaway sum as
h(S0, S1, S2, ..., S(N-1)) = S0*N + S1*(N-1) + S2*(N-2) + ... + S(N-1)*1?
I see this function as good too, since two reversed strings are not hashed to the same value: 2*S0 + S1 != 2*S1 + S0. But nowhere do I find this type of hash function.
Suppose we work with strings of 30 characters. That's not long, but it's not so short that problems with the hash should arise purely because the strings are too short.
The sum of the weights is 465 (1+2+...+30); with printable ASCII characters that makes the maximum hash 58590, attained by "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~". There are a lot more possible printable ASCII strings of 30 characters than that (95^30 ≈ 2E59), but they all hash into the range of 0 to 58590. Naturally you cannot actually have that many strings at the same time, but you could have a lot more than 58590, and that would guarantee collisions just based on counting (it is very likely to happen much sooner, of course).
The maximum hash grows only slowly: it is 126 * N(N+1)/2 for printable-ASCII strings of length N, so you'd need strings of roughly 8,300 characters before the entire range of a 32-bit integer is used.
The other way, multiplying by powers of A, does not have this problem. It can be evaluated with Horner's scheme, so no powers need to be calculated explicitly; it still costs only an addition and a multiplication per character (though the naive way is not the fastest way to compute that hash). The powers of A quickly get big (and start wrapping, which is fine as long as A is odd), so strings with 30 characters stand a good chance of covering the entire range of whatever integer type you're using.
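For concreteness, a sketch of the Horner evaluation (A = 31 and a 32-bit modulus are illustrative choices):

def poly_hash(s, a=31, mod=2**32):
    # Horner's scheme for S0*A^(N-1) + S1*A^(N-2) + ... + S(N-1)*A^0:
    # one multiply and one add per character, no explicit powers.
    h = 0
    for ch in s:
        h = (h * a + ord(ch)) % mod
    return h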
The problem with a linear hash function is that it's much easier to generate collisions.
Consider a string with 3 chars: S0, S1, S2.
The proposed hash code would be 3 * S0 + 2 * S1 + S2.
Every time we decrease char S2 by two (e.g. e --> c), and increase char S1 by one (e.g. m --> n), we obtain the same hash code.
The mere fact that it is so easy to describe a hash-preserving operation should be alarming (because some algorithm might process the string in exactly that manner). As a more extreme case, consider just summing the characters: all the anagrams of the original string would then generate the same hash code (so this hash would be useless in an application processing anagrams).
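To make this concrete, a quick sketch of such a collision (the name weak_hash is just for illustration):

def weak_hash(s):
    # The proposed linear hash: S0*N + S1*(N-1) + ... + S(N-1)*1.
    n = len(s)
    return sum(ord(ch) * (n - i) for i, ch in enumerate(s))

# Increase S1 by one (+2 to the hash) and decrease S2 by two (-2):
print(weak_hash("ame"), weak_hash("anc"))  # 610 610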
Now I have some sets of integers, say:
set1 = {int1, int2, int3};
set2 = {int2, int3, int1};
set3 = {int1, int4, int2};
The order of the numbers is not taken into consideration, so set1 and set2 are the same, while set3 is different from the other two.
Now I want to generate a unique key for these sets to distinguish them, in that way, set1 and set2 should generate the same key.
I've thought about this for a while; ideas such as summing up the integers came to mind but are easily proved wrong. Sorting the set and computing
key = n1 + n2*2^16 + n3*2^32
may be a possible way, but I wonder if this can be solved more elegantly.
The key can be either integer or string.
So does anyone have an idea about solving this as fast as possible? Any reading material is welcome.
More info:
The numbers are in fact colors, so each integer is less than 0xffffff.
If these were small integers (all within the range 0-63, for example) then you could represent each set as a bitstring (1 for any integer that's present in the set; 0 for any that's absent). For sparse sets of large integers this would be horrendously expensive in terms of storage/memory.
One other method that comes to mind would be to sort the set and form the key as the concatenation of each number's decimal representation, separated by some delimiter. So the set {2,1,3} -> "1/2/3" (using "/" as the delimiter) and {30,1,2,4} -> "1/2/4/30".
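A quick sketch of that key function (Python; names illustrative):

def set_key(ints):
    # Sort, then join the decimal representations with "/".
    # Equal sets produce identical keys regardless of element order.
    return "/".join(str(n) for n in sorted(ints))

print(set_key({2, 1, 3}), set_key({30, 1, 2, 4}))  # 1/2/3 1/2/4/30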
I suppose you could also use a hybrid approach. All elements < 63 are encoded into a hex string and all others are encoded into a string as described. Then your final resulting key is formed by: HEXxA/B/c ... (with the "x" separating the small int hex string from the larger ints in the set).
If the numbers in your sets are not too large, I think hashing each set into one string can be a proper solution.
If they are larger, you can make them smaller with a mod function or something similar, and then they can be dealt with in the same way.
Hope this helps if there is no better idea.
I think a key of practical size can only be a hash value - there will always be a few pairs of inputs that hash to the same key, but you can make this unlikely.
I think the idea of sorting and then applying a standard hash function is good, but I don't like your hash multipliers. If arithmetic is mod 2^32, then multiplying by 2^32 is multiplying by zero. If it is mod 2^64, then multiplying by 2^32 will lose the top 32 bits of the input.
I would use a hash function like that described in Why chose 31 to do the multiplication in the hashcode() implementation ?, where you keep a running total, multiplying the hash value by some odd number before you add the next item into it. Multiplying by an odd number mod 2^n will at least not lose information immediately. I would suggest 131, but Java has a tradition of using 31.
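A rough sketch of that running hash (131 as the multiplier and arithmetic mod 2^32 are the suggested choices; the function name is mine):

def set_hash(ints, multiplier=131, mod=2**32):
    # Sort first so the key is order-independent, then fold each element
    # into a running hash, multiplying by an odd constant before each add.
    h = 0
    for n in sorted(ints):
        h = (h * multiplier + n) % mod
    return h

assert set_hash({1, 2, 3}) == set_hash([3, 2, 1])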
I would like to use Redis to sort string values (using sorted sets), but I can only use floats for that purpose. I am looking for an algorithm to convert a string to a float value in 0..1 while keeping order.
I mean that s1 < s2 (alphabetically) should imply that f(s1) < f(s2).
Is there such an algorithm?
P.S. I will use such an algorithm for sorting usernames, and in most cases players with matching scores will have quite different usernames. So in most cases either approach should work, but there is still room for collisions. On the other hand, strings will be sorted more or less properly, and it's acceptable if nearly identical usernames are sorted incorrectly.
Each character can be mapped to its ASCII number. If you convert each string to its float equivalent by concatenating all the ASCII numbers (with zeros padded in front where needed, so that every character maps to three digits), you will keep ordering. But if your strings are long, your floats will be huge, and your mapping might not be unique (if several strings begin with the same characters, due to rounding inside the floats).
For example:
'hello' -> 104101108108111
If you know which subset of characters your strings contain (for instance, only lowercase letters, or only uppercase letters and numbers) you can create your own mapping to use fewer digits per character.
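A rough sketch of this mapping, reading the padded codes as a decimal fraction so that ordering survives (and showing where precision gives out):

def str_to_float_decimal(s):
    # Zero-pad each ASCII code to three digits and read the concatenation
    # as a decimal fraction, e.g. 'hello' -> 0.104101108108111.
    # A 64-bit double keeps only about the first five characters exactly.
    return float("0." + "".join(f"{ord(c):03d}" for c in s))

print(str_to_float_decimal("hello"))  # ~0.104101108108111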
Mathematically, such an algorithm exists and is trivial: Simply put a radix point (“.”) before the string and interpret it as a base-256 numeral (assuming your string uses 8-bit characters). Analogously, if your string had just the characters “0” to “9”, you would read it as a decimal numeral, such as .58229 for the string “58229”. You are doing the same thing, just with base 256 instead of base 10.
Practically, this is not possible without a severely restricted set of potential strings or special floating-point software. Since a typical floating-point object has a finite size, it has a finite number of possible values. E.g., a floating-point object with 64 bits has at most 2^64 values, even neglecting those that stand for special notions such as NaN. Conversely, a string of arbitrary length has infinitely many potential values. Even if you limit the string to something reasonable in today’s computer memories, it has hugely more potential values than a normal floating-point object does.
To solve this, you must either decrease the number of potential strings (by limiting their length or otherwise restricting which strings are allowed) or increase the number of potential floating-point values (perhaps by using special arbitrary-precision floating-point software).
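A sketch of the base-256 reading described above, which also demonstrates where double precision runs out (function name illustrative):

def str_to_float_base256(s):
    # Put a radix point before the string and read it as a base-256
    # numeral (8-bit characters): 'a' -> 97/256, 'ab' -> 97/256 + 98/256**2, ...
    f = 0.0
    for i, ch in enumerate(s):
        f += ord(ch) / 256.0 ** (i + 1)
    return f

print(str_to_float_base256("apple") < str_to_float_base256("banana"))  # True
# A 64-bit double resolves only ~6-7 leading characters; beyond that, collisions:
print(str_to_float_base256("aaaaaaaab") == str_to_float_base256("aaaaaaaac"))  # True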
I'm trying to rate the efficiency of a function where the input is an array of strings. The algorithm always iterates through every item in this array. The strings contained in this array are of variable length. In this initial for loop, a character-replace function is called on each string. I believe the replace function on its own would be O(n), where n is the length of the string.
So I'm confused how to evaluate big o efficiency here. If n is the size of the array, I know it will at least be O(n). But with variable string lengths, how would you rate the overall efficiency with the string replacement? Would you say n is the size of the array and use other variables to represent the different sizes of each string?
Personally, I would express efficiency through the size of the input data (as opposed to the length of the array). So, if the input is t bytes, the running time will be O(t). And t here is also the combined length of all the strings.
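In code terms, the pattern under discussion is something like this (a sketch; names are illustrative):

def replace_in_all(strings, old, new):
    # One pass over the array; each replace costs O(len(s)),
    # so the total work is O(t), the combined length of all strings.
    return [s.replace(old, new) for s in strings]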
I see two ways to say this (out of many possible).
The first is to say it is O(N) where N is the total number of characters.
The other way you could say it is O(N*M), where N is the number of strings and M is the average number of characters per string. Note that this is actually the same as the first way: you could say M = k/N, so you get O(N*k/N) or O(k), where k is the total number of characters in all the strings.