This is a hard one (for me), so I hope people can help me. I have some text and I need to convert it to a number, but the number has to be unique, just as the text is unique.
For example:
The word 'kitty' could produce 12432, but only the word kitty produces that number. The text could be anything and a proper number should be given.
One problem: the resulting integer must be a 32-bit unsigned integer, and the largest number I can use is 2147483647 (strictly, that is the signed 32-bit maximum; the unsigned maximum is 4294967295). I don't mind if there is a text length restriction, but I hope it can be as large as possible.
My attempts: you have the letters A-Z and 0-9, so one character can have a number between 1 and 36. But if A = 1 and B = 2 and the text is AB, adding the values gives 3. The problem is that the text BA produces the same result, so this algorithm won't work.
Any ideas to point me in the right direction or is it impossible to do?
Your idea is generally sane, only needs to be developed a little.
Let f(c) be a function converting character c to a unique number in the range [0..M-1]. Then you can calculate the resulting number for the whole string like this:
f(s[0]) + f(s[1])*M + f(s[2])*M^2 + ... + f(s[n])*M^n
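You can easily prove that the number will be unique for a particular string of a given length (and you can get the string back from the number). Strings of different lengths can collide if f maps some character to 0, because that character at the end of a string adds nothing to the sum.
Obviously, you can't use very long strings here, as 36^n grows fast: 36^6 = 2176782336 still fits in an unsigned 32-bit integer, so 6 characters is the limit in that case, but it already exceeds 2147483647, which leaves 5 characters under that bound.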
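The formula above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the alphabet ordering (A-Z followed by 0-9) and the names `ALPHABET`, `f`, `encode` and `decode` are assumptions made for the example, and `decode` needs to be told the string length.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"  # assumed character set
M = len(ALPHABET)  # 36

def f(c):
    """Map a character to its unique number in [0, M-1]."""
    return ALPHABET.index(c)

def encode(s):
    """Compute f(s[0]) + f(s[1])*M + f(s[2])*M^2 + ..."""
    return sum(f(c) * M**i for i, c in enumerate(s))

def decode(number, length):
    """Recover the string of the given length from its number."""
    chars = []
    for _ in range(length):
        number, digit = divmod(number, M)  # peel off the lowest base-M digit
        chars.append(ALPHABET[digit])
    return "".join(chars)

print(encode("KITTY"))
print(decode(encode("KITTY"), 5))
```

Because `s[0]` is the least significant digit here, `encode("AB")` is 36 while `encode("BA")` is 1, so the order of the characters matters, which is exactly what plain addition was missing.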
Imagine you were trying to store Strings from the character set "0-9" only in a number (the equivalent of obtaining a number of a string of digits). What would you do?
Char 9 8 7 6 5 4 3 2 1 0
Str 0 5 2 1 2 5 4 1 2 6
Num = 6 * 10^0 + 2 * 10^1 + 1 * 10^2...
Apply the same thing to your characters.
Char 5 4 3 2 1 0
Str A B C D E F
L = 36
C(I): transforms character to number: C(0)=0, C(A)=10, C(B)=11, ...
Num = C(F) * L ^ 0 + C(E) * L ^ 1 + ...
Build a dictionary out of words mapped to unique numbers and use that, that's the best you can do.
I doubt there are more than 2^32 number of words in use, but this is not the problem you're facing, the problem is that you need to map numbers back to words.
If you were only mapping words over to numbers, some hash algorithm might work, although you'd have to work a bit to guarantee that you have one that won't produce collisions.
However, for numbers back to words, that's quite a different problem, and the easiest solution to this is to just build a dictionary and map both ways.
In other words:
AARDUANI = 0
AARDVARK = 1
...
If you want to map numbers to base 26 characters, you can only store 6 characters (or 5 or 7 if I miscalculated), but not 12 and certainly not 20.
Unless you only count actual words, and they don't follow any good countable rules. The only way to do that is to just put all the words in a long list, and start assigning numbers from the start.
If it's correctly spelled text in some language, you can have a number for each word. However you'd need to consider all possible plurals, place and people names etc. which is generally impossible. What sort of text are we talking about? There's usually going to be some existing words that can't be coded in 32 bits in any way without prior knowledge of them.
Can you build a list of words as you go along? Just give the first word you see the number 1, second number 2 and check if a word has a number already or it needs a new one. Then save your newly created dictionary somewhere. This would likely be the only workable solution if you require 100% reliable, reversible mapping from the numbers back to original words given new unknown text that doesn't follow any known pattern.
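Building the list as you go could look something like the sketch below. The class name `WordIds` and its methods are hypothetical, chosen only for illustration, and saving the dictionary to disk is left out.

```python
class WordIds:
    """Assigns consecutive numbers to words in the order they are first seen."""

    def __init__(self):
        self.word_to_id = {}
        self.id_to_word = [None]  # index 0 is unused so that numbering starts at 1

    def get_id(self, word):
        """Return the number for a word, assigning a new one if needed."""
        if word not in self.word_to_id:
            self.word_to_id[word] = len(self.id_to_word)
            self.id_to_word.append(word)
        return self.word_to_id[word]

    def get_word(self, number):
        """Map a number back to the original word."""
        return self.id_to_word[number]

ids = WordIds()
print(ids.get_id("kitty"))     # first word seen
print(ids.get_id("aardvark"))  # second word seen
print(ids.get_word(1))
```

The mapping is only reversible for whoever holds the same dictionary, which is the trade-off this approach makes for having no length limit on the words.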
With 64 bits and a sufficiently good hash like MD5 it's extremely unlikely to have collisions, but for 32 bits it doesn't seem likely that a safe hash would exist.
Just treat each character as a digit in base 36, and calculate the decimal equivalent?
So:
'A' = 0
'B' = 1
[...]
'Z' = 25
'0' = 26
[...]
'9' = 35
'BA' = 36
'BB' = 37
[...]
'CAB' = 2593
I am studying hash algorithms in a book and came across this line:
the index is 3 + 2 + 17 % 10 = 22 % 10 = 2
I understand that it says the index is 2 and that % is the remainder operator, but the rest is too esoteric for me to follow.
Below is the context that explains where this index comes from; it is an exercise from the book:
Suppose you have these four hash functions that work with strings: A.
Return “1” for all input. B. Use the length of the string as the
index. C. Use the first character of the string as the index. So, all
strings starting with a are hashed together, and so on. D. Map every
letter to a prime number: a = 2, b = 3, c = 5, d = 7, e = 11, and so
on. For a string, the hash function is the sum of all the characters
modulo the size of the hash. For example, if your hash size is 10, and
the string is “bag”, the index is 3 + 2 + 17 % 10 = 22 % 10 = 2. For
each of the following examples, which hash functions would provide a
good distribution? Assume a hash table size of 10 slots.
5.5 A phonebook where the keys are names and values are phone numbers. The names are as follows: Esther, Ben, Bob, and Dan. Answer: Hash
functions C and D would give a good distribution.
Your example shows 4 hash functions (A-D). Each of the functions returns a value used to place the word (string) in a bucket.
A. All words will go to the same bucket, because the function always returns 1. That is not a good distribution.
B. All words will be sorted by the length of the string. There will not be many words in the first bucket, only "a" or "I" as far as I can think of, whereas buckets 3-6 will be heavily occupied.
C. Now the words are sorted by their first letter. We have 26 letters in the alphabet, but not all letters are evenly represented (far fewer words start with x than with h).
D. This most likely has the best distribution, as it uses prime numbers. The value for the word "bag" is the sum (b=3 + a=2 + g=17) = 22.
For functions B-D, the result can be 10 or larger. So in the example, the result is divided by 10 and the remainder decides which bucket the word goes to: the sum is taken first, then the remainder, so 22 % 10 = 2.
The slot index must be an integer in range [0,9]. The remainder function (for divisor 10) maps all integers to that range in the most uniform way.
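As a small illustration of hash function D, here is a Python sketch. The name `hash_d` and the choice to lowercase the input are assumptions for the example; the letter-to-prime table simply continues the book's a = 2, b = 3, c = 5, ... pattern through z.

```python
# The first 26 primes, so that a = 2, b = 3, c = 5, ..., z = 101
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

def hash_d(word, table_size=10):
    """Sum the primes of all letters, then take the remainder by the table size."""
    total = sum(PRIMES[ord(ch) - ord("a")] for ch in word.lower())
    return total % table_size

print(hash_d("bag"))  # 3 + 2 + 17 = 22, and 22 % 10 = 2
for name in ["Esther", "Ben", "Bob", "Dan"]:
    print(name, hash_d(name))
```

With a table size of 10, the four phonebook names land in slots 0, 7, 3 and 2 respectively, so each gets its own slot.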
I want to develop a way to represent all combinations of b bits with k bits set (equal to 1). Given an index, it needs to quickly produce the related binary sequence, and the other way around too. For instance, the traditional approach I thought of would be to generate the numbers in order, like:
For b=4 and k=2:
0- 0011
1- 0101
2- 0110
3- 1001
4- 1010
5- 1100
If I am given the sequence '1010', I want to be able to quickly generate the number 4 as a response, and if I give the number 4, I want to be able to quickly generate the sequence '1010'. However I can't figure out a way to do these things without having to generate all the sequences that come before (or after).
It is not necessary to generate the sequences in that order, you could do 0-1001, 1-0110, 2-0011 and so on, but there has to be no repetition between 0 and the (combination of b choose k) - 1 and all sequences have to be represented.
How would you approach this? Is there a better algorithm than the one I'm using?
pkpnd's suggestion is on the right track, essentially process one digit at a time and if it's a 1, count the number of options that exist below it via standard combinatorics.
nCr() can be replaced by a table precomputation requiring O(n^2) storage/time. There may be another property you can exploit to reduce the number of nCr's you need to store by leveraging the absorption property along with the standard recursive formula.
Even with 1000's of bits, that table shouldn't be intractably large. Storing the answer also shouldn't be too bad, as 2^1000 is ~300 digits. If you meant hundreds of thousands, then that would be a different question. :)
import math

def nCr(n,r):
    return math.factorial(n) // math.factorial(r) // math.factorial(n-r)

def get_index(value):
    b = len(value)
    k = sum(c == '1' for c in value)
    count = 0
    for digit in value:
        b -= 1
        if digit == '1':
            if b >= k:
                count += nCr(b, k)
            k -= 1
    return count

print(get_index('0011')) # 0
print(get_index('0101')) # 1
print(get_index('0110')) # 2
print(get_index('1001')) # 3
print(get_index('1010')) # 4
print(get_index('1100')) # 5
Nice question, btw.
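The question also asks for the reverse direction, from an index back to the sequence. A possible sketch, using the same counting idea, is shown below. The function name `get_sequence` is an assumption for illustration, and `math.comb` requires Python 3.8 or later (it returns 0 when k exceeds n, which the code relies on).

```python
import math

def get_sequence(index, b, k):
    """Return the b-bit string with k ones that sits at the given index."""
    bits = []
    for pos in range(b):
        remaining = b - pos - 1  # number of positions to the right of this one
        # Sequences with a 0 here must fit all k ones into the remaining positions
        below = math.comb(remaining, k)
        if k > 0 and index >= below:
            bits.append('1')
            index -= below  # skip all sequences that have a 0 at this position
            k -= 1
        else:
            bits.append('0')
    return ''.join(bits)

for i in range(6):
    print(i, get_sequence(i, 4, 2))
```

For b=4 and k=2 this reproduces the ordering in the question, from 0011 at index 0 up to 1100 at index 5, without generating the sequences in between.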
I have 6 variables 0 ≤ n₁,...,n₆ ≤ 12 and I'd like to build a hash function to do the direct mapping D(n₁,n₂,n₃,n₄,n₅,n₆) = S and another function to do the inverse mapping I(S) = (n₁,n₂,n₃,n₄,n₅,n₆), where S is a string (a-z, A-Z, 0-9).
My goal is to keep the length of S to 3 or less.
I thought as the variables have 13 possible values, a single letter (a-z) should be able to represent 2 of them, but I realized that 1 + 12 = m and 2 + 11 = m, so I still don't know how to write a function.
Is there any approach to build a function that does this mapping and returns a small string?
Using the whole ASCII to represent S is an option if it's necessary.
You can convert a set of numbers in any given range to numbers in any other range using base conversion.
Binary is base 2 (0-1), decimal is base 10 (0-9). Your 6 numbers are base 13 (0-12).
Checking whether a conversion would be possible involves counting the number of possible combinations of values for each set. With each number in the range [0,n] (thus base n+1), we can go from all 0's to all n's, thus each number can take on n+1 values and the total number of possibilities is (n+1)^numberCount. For 6 decimal digits, for example, it would be 10^6 = 1000000, which checks out, since there are 1000000 possible numbers with (at most) 6 digits, i.e. numbers < 1000000.
Lower- and uppercase letters and numbers (26+26+10) would be base 62 (0-61), but, following from the above, 3 such values would be insufficient to represent your 6 numbers (13^6 = 4826809 > 62^3 = 238328); 4 would be enough (62^4 = 14776336). To do conversion from/to these, you can do the conversion to a set of base 62 numbers, then have appropriate if-statements to convert 0-9 <=> 0-9, a-z <=> 10-35, A-Z <=> 36-61.
You can represent your data in 3 bytes (since 256^3 >= 13^6), although these wouldn't necessarily be printable characters: 32-126 is considered the standard printable range (which is still too small of a range), and 128-255 is the extended range, which may not be displayed properly in any given environment. To give the best chance of properly displaying it, you should at least avoid 0-31 and 127, which are control characters; you can convert 0-... to the above ranges by adding 32 and then adding another 1 if the value is >= 127.
Many / most languages should allow you to give a numeric value to represent a character, so it should be fairly simple to output it once you do the base conversion. Although some may use Unicode to represent characters, which could make it a bit less trivial to work with ASCII.
If the numbers had specific constraints, that would reduce the number of possible combinations, thus possibly making it fit into a smaller set or range of numbers.
To do the actual base conversion:
It might be simplest to first convert it to a regular integral type (typically binary or decimal), where we don't have to worry about the base, and then convert it to the target base (although first make sure your value will fit in whichever data type you're using).
Consider how binary works:
1101 is 13 = 2^3 + 2^2 + 2^0
13 % 2 = 1 13 / 2 = 6
6 % 2 = 0 6 / 2 = 3
3 % 2 = 1 3 / 2 = 1
1 % 2 = 1
The remainders above, read from bottom to top: 1101 = our number
Using the same idea, we can convert to/from any base as follows: (pseudo-code)
int convertFromBase(array, base):
    output = 0
    for each i in array
        output = base*output + i
    return output

int[] convertToBase(num, base):
    output = []
    while num > 0
        output.append(num % base)
        num /= base
    output.reverse()
    return output
You can also extend this logic to situations where each number is in a different range by changing what you divide or multiply by at each step (a detailed explanation of that is perhaps a bit beyond the scope of the question).
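Here is a rough Python sketch of the base-conversion idea applied to the question, using D and I as named by the asker. The digit ordering in `DIGITS` (0-9, then a-z, then A-Z), the helper names, and the fixed output length of 4 characters are assumptions made for the example; 4 is used because 62^3 < 13^6 <= 62^4. Unlike the pseudo-code, `convert_to_base` always emits a fixed number of digits so that leading zeros are kept.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"  # base 62

def convert_from_base(values, base):
    """Interpret a sequence of digits (most significant first) in the given base."""
    output = 0
    for v in values:
        output = base * output + v
    return output

def convert_to_base(num, base, length):
    """Split num into exactly `length` digits of the given base, most significant first."""
    output = []
    for _ in range(length):
        num, digit = divmod(num, base)
        output.append(digit)
    output.reverse()
    return output

def D(numbers):
    """Direct mapping: six values in 0..12 to a 4-character string."""
    value = convert_from_base(numbers, 13)
    return "".join(DIGITS[d] for d in convert_to_base(value, 62, 4))

def I(s):
    """Inverse mapping: a 4-character string back to the six values."""
    value = convert_from_base([DIGITS.index(ch) for ch in s], 62)
    return tuple(convert_to_base(value, 13, 6))

code = D((1, 2, 3, 4, 5, 6))
print(code, I(code))
```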
I thought as the variables have 13 possible values, a single letter
(a-z) should be able to represent 2 of them
This reasoning is wrong. In fact to represent two variables (=any combination these variables might take) you will need 13x13 = 169 symbols.
For your example, the 6 variables can take 13^6 (=4826809) different combinations. To represent all possible combinations you will need 5 letters (a-z), since 26^5 (=11881376) is the smallest power of 26 that exceeds 13^6.
For ASCII characters 3 symbols should suffice since 256^3 > 13^6.
If you are still interested in code that does the conversion, I will be happy to help.
I have a number X and I want to find the exponent of the largest power of 2 that does not exceed it.
For example:
N=7: the answer is 2 (2*2)
N=20: the answer is 4 (2*2*2*2)
Similarly, I want to find the next power of 2.
For example:
N=14: the answer is 16
Is there a bit hack for this that does not use for loops?
Something like the one-line check for whether X is a power of 2, (X & (X-1)) == 0?
GCC has a built-in function called __builtin_clz() that returns the number of leading zero bits in an unsigned int (the result is undefined for 0). So, for example, assuming a 32-bit int, the expression p = 32 - __builtin_clz(n) tells you how many bits are needed to store the integer n, so p - 1 is the exponent of the largest power of 2 not exceeding n, and 1 << p gives you the next power of 2 above n (provided p < 32, of course).
There are also equivalent functions that work with long and long long integers.
Alternatively, math.h defines a function called frexp() that returns the base-2 exponent of a double-precision number. This is likely to be less efficient because your integer will have to be converted to a double-precision value before it is passed to this function.
A number is a power of two if it has only a single '1' in its binary representation. For example, 2 = 00000010, 4 = 00000100, 8 = 00001000 and so on. So you can check it by counting the number of 1s in its binary representation: if the count is 1, the number is a power of 2, and otherwise it is not.
You can take help from here and here to avoid for loops for counting set bits.
If the count is not 1 (meaning that the value is not a power of 2), find the position of its first set bit from the MSB; the next power of 2 is the value whose only set bit is at position + 1. For example, take the number 3 = 00000011. Counting from the right, its highest set bit is the 2nd bit. Therefore the next power of 2 is the value whose only set bit is at the 3rd position, i.e. 00000100 = 4.
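For readers who want to try the idea outside C, here is a small Python sketch. It assumes Python's `int.bit_length()` as a stand-in for `32 - __builtin_clz(n)`; the function names are made up for the example, and all three expect a positive integer.

```python
def is_power_of_two(x):
    """True if exactly one bit is set."""
    return x > 0 and (x & (x - 1)) == 0

def floor_log2(x):
    """Exponent of the largest power of 2 that does not exceed x (x must be positive)."""
    return x.bit_length() - 1

def next_power_of_two(x):
    """Smallest power of 2 strictly greater than x (x must be positive)."""
    return 1 << x.bit_length()

print(floor_log2(7), floor_log2(20))  # 2 4
print(next_power_of_two(14))          # 16
```

Note that `next_power_of_two` is strictly greater than its argument, so an input that is already a power of 2, such as 16, yields 32.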
Given a byte array with a length of two, we have two possibilities for a shuffle: 01 and 10.
A length of 3 would allow these shuffle options: 012, 021, 102, 120, 201, 210. A total of 2x3=6 options.
A length of 4 would have 6x4=24. Length of 5 would have 24x5=120 options, etc.
So once you have randomly picked one of these shuffle options, how do you store it? You could store 23104 to indicate how to shuffle five bytes. But that takes 5x3=15 bits. I know it can be done in 7 bits because there are only 120 possibilities.
Any ideas how to more efficiently store a shuffle instruction? It should be an algorithm that will scale in length.
Edit: See my own answer below before you post a new one. I am sure that there is good information in many of these already existing answers, but I just could not understand much of it.
If you have a well-ordering of the set of elements you are shuffling, then you can create a well-ordering for the set of all the permutations and just store a single integer representing which place in the order a permutation falls.
Example:
Shuffling 1 4 5: the possibilities are:
1 4 5 [0]
1 5 4 [1]
4 1 5 [2]
4 5 1 [3]
5 1 4 [4]
5 4 1 [5]
To store the permutation 415, you would just store 2 (zero indexed).
If you have a well-ordering for the original set of elements, you can make a well-ordering for the set of permutations by iterating through the elements from least order to greatest for the leftmost element, while iterating through the leftover elements for the next place to the right and so on until you get to the rightmost element. You wouldn't need to store this array, you would just need to be able to generate the permutations in the same order again to "unpack" the stored integer.
However, attempting to generate all the permutations one by one will take a considerable amount of time beyond the smallest of sets. You can use the observation that the first (N-1)! permutations start with the 1st element, the second (N-1)! permutations start with the second element, then for each permutation that starts with a specific element, the 1st (N-2)! permutations start with the first of the leftover elements and so on and so forth. This will allow you to "pack" or "unpack" the elements in O(n), excepting the complexity of actually generating the factorials and the division and modulus of arbitrary length integers, which will be somewhat substantial.
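The packing and unpacking described above might be sketched in Python like this. The function names `pack` and `unpack` are assumptions for illustration, the elements are assumed to be distinct and sortable, and the list operations make this version quadratic rather than linear, which keeps the example short.

```python
from math import factorial

def pack(perm):
    """Return the position of perm in the ordered list of all permutations."""
    remaining = sorted(perm)
    index = 0
    for item in perm:
        pos = remaining.index(item)
        # Each earlier choice for this place accounts for (len - 1)! permutations
        index += pos * factorial(len(remaining) - 1)
        remaining.pop(pos)
    return index

def unpack(index, elements):
    """Rebuild the permutation of elements stored at the given index."""
    remaining = sorted(elements)
    perm = []
    while remaining:
        block = factorial(len(remaining) - 1)
        pos, index = divmod(index, block)
        perm.append(remaining.pop(pos))
    return perm

print(pack([4, 1, 5]))       # 2, as in the example above
print(unpack(2, [1, 4, 5]))  # [4, 1, 5]
```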
You are right that to store just a permutation of data, and not the data itself, you will need only as many bits as ceil(log2(permutations)). For N items, the number of permutations is factorial(N) or N!, so you would need ceil(log2(factorial(N))) bits to store just the permutation of N items without also storing the items.
In whatever language you're familiar with, there should be a ready way to make a big array of M bits, fill it up with a permutation, and then store it on a storage device.
A common shuffling algorithm, and one of the few unbiased ones, is the Fisher-Yates shuffle. Each iteration of the algorithm takes a random number and swaps two places based on that number. By storing a list of those random numbers, you can later reproduce the exact same permutation.
Furthermore, since the valid range for each of those numbers is known in advance, you can pack them all into a big integer by multiplying each number by the product of the lower number's valid ranges, like a kind of variable-base positional notation.
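One way to picture this variable-base packing is the Python sketch below. The names `make_key` and `apply_key` are hypothetical, and the sketch uses Python's arbitrary-size integers, so it sidesteps the big-integer arithmetic a fixed-width language would need.

```python
import random

def make_key(n, rng=random):
    """Draw the random choices of a Fisher-Yates shuffle of n items and pack them into one integer."""
    key = 0
    multiplier = 1
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i + 1)  # the valid range at this step is 0..i
        key += j * multiplier
        multiplier *= i + 1       # product of the ranges used so far
    return key                    # always less than n!

def apply_key(items, key):
    """Replay the shuffle encoded in key on a copy of items."""
    shuffled = list(items)
    for i in range(len(shuffled) - 1, 0, -1):
        key, j = divmod(key, i + 1)  # recover the choice made at this step
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return shuffled

key = make_key(5)
print(key, apply_key("abcde", key))
```

For five items the key is always below 120, so it fits in the 7 bits the question was hoping for, and every key value replays to a different ordering.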
For an array of L items, why not pack the order into L*ceil(log2(L)) bits? (ceil(log2(L)) is the number of bits needed to hold L unique values). For example, here is the representation of the "unshuffled" shuffle, taking the items in order:
L=2: 0 1 (2 bits)
L=3: 00 01 10 (6 bits)
L=4: 00 01 10 11 (8 bits)
L=5: 000 001 010 011 100 (15 bits)
...
L=8: 000 001 010 011 100 101 110 111 (24 bits)
L=9: 0000 0001 0010 0011 0100 0101 0110 0111 1000 (36 bits)
...
L=16: 0000 0001 ... 1111 (64 bits)
L=128: 0000000 0000001 ... 1111111 (896 bits)
The main advantage of this scheme compared to @user470379's answer is that it is really easy to extract the indexes: just shift and mask. No need to regenerate the permutation table. This should be a big win for large L (for 128 items, there are 128! = 3.8562e+215 possible permutations).
(Permutations == "possibilities"; factorial = L! = L * (L-1) * ... * 1 = exactly the way you are calculating possibilities)
This method also isn't that much larger than storing the permutation index. A 128-item shuffle needs 128 x 7 = 896 bits (28 x 32-bit integers), or 1024 bits (32 integers) if each index is kept inside a single word, as the implementation below does. It takes 717 bits (23 ints) to store 128!.
Between the faster decoding speed and the fact that no temporary storage is required for calculating the permutation, storing the extra 5 to 9 ints may be well worth their cost.
Here is an implementation in Ruby that should work for arbitrary sizes. The "shuffle instruction" is contained in the array instruction. The first part calculates the shuffle using a version of the Fisher-Yates algorithm that @Theran mentioned.
# Some Setup and utilities
sizeofInt = 32 # fix for your language/platform
N = 16
BitsPerIndex = Math.log2(N).ceil
IdsPerWord = sizeofInt/BitsPerIndex
# sets the n'th bitfield in array a to v
def setBitfield a,n,v
  mask = (2**BitsPerIndex)-1
  idx = n/IdsPerWord
  shift = (n-idx*IdsPerWord)*BitsPerIndex
  a[idx]&=~(mask<<shift)
  a[idx]|=(v&mask)<<shift
end
# returns the n'th bitfield in array a
def getBitfield a,n
  mask = (2**BitsPerIndex)-1
  idx = n/IdsPerWord
  shift = (n-idx*IdsPerWord)*BitsPerIndex
  return (a[idx]>>shift)&mask
end
#create the shuffle instruction in linear time
nwords = (N.to_f/IdsPerWord).ceil # num words required to hold instruction
instruction = Array.new(nwords){0} # array initialized to 0
#the "inside-out" Fisher–Yates shuffle
for i in (1..N-1)
  j = rand(i+1)
  setBitfield(instruction,i,getBitfield(instruction,j))
  setBitfield(instruction,j,i)
end
#Here is a way to visualize the shuffle order
#delete ".reverse.map{|s|s.to_i(2)}" to visualize the way it's really stored
p instruction.map{|v|v.to_s(2).rjust(BitsPerIndex*IdsPerWord,'0').scan(
Regexp.new('.'*BitsPerIndex)).reverse.map{|s|s.to_i(2)}}
Here is an example of applying the shuffle to an array of characters:
A=(0...N).map{|v|('A'.ord+v).chr}
puts A*''
#Apply the shuffle to A in linear time
for i in (0...N)
  print A[getBitfield(instruction,i)]
end
print "\n"
#example: for N=20, produces
> ABCDEFGHIJKLMNOPQRST
> MSNOLGRQCTHDEPIAJFKB
Hopefully this won't be too hard to convert to javascript, or any other language.
I am sorry if this was already covered in a previous answer, but for the first time, these answers are completely foreign to me. I might have mentioned that I know Java and JavaScript and that I know nothing of mathematics... So log2, permutations, factorial and well-ordering are all unknown words to me.
And on top of that I ended up (again) using StackOverflow as a whiteboard to write out my question, and answered the question in my head 20 minutes later. I was tied up in non-computer life and, knowing StackOverflow, I figured it was too late to save more than 20% of everybody's easily wasted time.
Anyway, having gotten lost in all three existing answers, here is the answer I know of
(written in JavaScript, but it should be easy to translate 20 lines of foreign code to your language of choice)
(see it in action here: http://jsfiddle.net/M3vHC)
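Edit: Thanks to AShelly for this catch: This will fail (become highly biased) when given a key length of more than 12, assuming your ints are 32 bit (more than 20 if your ints are 64 bit, since 20! still fits). In JavaScript itself, numbers are exact only up to 2^53, so the limit there is a key length of 18.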
var keyLength = 5
var possibilities = 1
for(var i = 0; i < keyLength ; i++)
    possibilities *= i+1 // Calculate the number of possibilities to create an unbiased key
var randomKey = Math.floor(Math.random()*possibilities) // Your shuffle instruction. Random number with correct number of possibilities starting with zero as the first possibility
var keyArray = new Array(keyLength) // This will contain the new locations of existing indexes. [0,1,2,3,4] means no shuffle [4,3,2,1,0] means reverse order. etcetera
var remainsOfKey = randomKey // Our "working" key. This is disposable / single use.
var taken = new Array(keyLength) // Tells if an index has already been accounted for in the keyArray
for(var i = keyArray.length;i > 0;i--) { // The number of possibilities for the first item in the key array is the number of blanks in key array.
    var add = remainsOfKey % i + 1 // Grab a number from 1 up to the number of blanks left in the keyArray
    remainsOfKey = Math.floor(remainsOfKey / i) // Remove the part of the working key we just used
    for(var j = 0; add; j++) // Walk forward to the add-th index that is not already taken
        if(!taken[j])
            add--
    taken[keyArray[i-1] = --j] = true // Take what we have because it is right
}
alert('Based on a key length of ' + keyLength + ' and a random key of ' + randomKey + ' the new indexes are ... ' + keyArray.join(',') + ' !')