Modifying the range of a uniform random number generator - algorithm

I am given a function rand5() that generates, with a uniform distribution, a random integer in the closed interval [1,5]. How can I use rand5(), and nothing else, to create a function rand7(), which generates integers in [1,7] (again, uniformly distributed) ?
I searched stackoverflow, and found many similar questions, but not exactly like this one.
My initial attempt was rand5() + 0.5*rand5() + 0.5*rand5(). But this won't generate integers from 1 to 7 with uniform probability. Any answers, or links to answers, are very welcome.

Note that a perfect uniform distribution cannot be achieved with a bounded number of rand5() invocations, because 5^k % 7 != 0 for every k, so there will always be some "spare" outcomes.
Here is a solution with an unbounded number of rand5() calls:
Draw two numbers, x1,x2. There are 5*5=25 possible outcomes for this.
Note that 25/7 ~= 3.57. Choose 3*7=21 combinations, so that exactly three combinations map to each number in [1,7]; for the other 4 combinations, redraw.
For example:
(1,1),(1,2),(2,1) : 1
(3,1),(1,3),(3,2): 2
(3,3),(1,4),(4,1): 3
(2,4),(4,2),(3,4): 4
(4,3), (4,4), (1,5): 5
(5,1), (2,5), (5,2) : 6
(5,3), (3,5), (4,5) : 7
(5,4),(5,5),(2,3), (2,2) : redraw
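A minimal Python sketch of this redraw idea, under a couple of assumptions: rand5() is simulated here with random.randint, and instead of the hand-written table above the 25 pairs are numbered 0..24 arithmetically, with cells 21..24 playing the role of the four "redraw" pairs. The particular assignment of pairs to outputs differs from the table, but each output still gets exactly three pairs.

```python
import random

def rand5():
    # Stand-in for the given generator: uniform integer in [1, 5].
    return random.randint(1, 5)

def rand7():
    while True:
        # Two draws pick one of 25 equally likely cells, numbered 0..24.
        cell = 5 * (rand5() - 1) + (rand5() - 1)
        if cell < 21:
            # 21 accepted cells, three per output value.
            return cell % 7 + 1
        # Cells 21..24 are the four leftover outcomes: draw again.
```

The loop accepts with probability 21/25 on each pass, so the expected number of rand5() calls is 2 * 25/21, a little under 2.4.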

Here's a simple way:
Use rand5() to generate a sequence of three random integers from the set { 1, 2, 4, 5 } (i.e., throw away any 3 that is generated).
If all three numbers are in the set { 1, 2 }, discard the sequence and return to step 1.
For each number in the sequence, map { 1, 2} to 0 and { 4, 5 } to 1. Use these as the three bit values for a 3-bit number. Because the bits cannot all be 0, the number will be in the range [1, 7]. Because each bit is 0 or 1 with equal probability, the distribution over [1, 7] should be uniform.
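The three steps above can be sketched in Python roughly as follows. This is only an illustration: rand5() is simulated with random.randint, and the helper names fair_bit and rand7_bits are made up for the example.

```python
import random

def rand5():
    # Stand-in for the given generator: uniform integer in [1, 5].
    return random.randint(1, 5)

def fair_bit():
    # Throw away every 3, then map {1, 2} -> 0 and {4, 5} -> 1.
    while True:
        r = rand5()
        if r != 3:
            return 0 if r < 3 else 1

def rand7_bits():
    while True:
        value = 4 * fair_bit() + 2 * fair_bit() + fair_bit()
        if value != 0:
            # 000 is discarded, so the result lies in [1, 7].
            return value
```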

OK, I had to think about it for a while, but it is actually not that hard. Imagine that instead of rand5 you had rand2, which outputs either 0 or 1. You can make rand2 out of rand5 by simply doing
rand2() {
    r = rand5()
    while (r == 5) r = rand5()   // redraw on 5, otherwise 1 would come up 3 times out of 5
    if (r > 2) return 1
    else return 0
}
Now use rand2 several times to walk down a tree and get rand7. For example, start with the candidates [1,2,3,4,5,6,7,redraw], padded to eight entries so that every split is even. After a throw of rand2 which gives 0 you narrow down to [1,2,3,4], after another throw of rand2 which gives 1 you narrow down to [3,4], and a final throw of 1 gives the output of rand7 as 4. If you land on the padding entry, start over. In general this tree trick can take a rand2 and map it to randx where x is any integer, as long as the tree is padded to a power of two and the padding leaves trigger a redraw.

Here's one meta-trick which comes in handy for lots of these problems: the bias is introduced when we treat the terms differently in some fashion, so if we treat them all the same at each step and perform operations only on the set, we'll stay out of trouble.
We have to call rand5() at least once (obviously!), but if we branch on that bad things happen unless we're clever. So instead let's call it once for each of the 7 possibilities:
In [126]: import random
In [127]: def r5():
.....:     return random.randint(1, 5)
.....:
In [128]: [r5() for i in range(7)]
Out[128]: [3, 1, 3, 4, 1, 1, 2]
Clearly each of these terms was equally likely to be any of these numbers, but only one of them happened to be 2, so if our rule had been "choose whichever term rand5() returns 2 for" then it would have worked. Or 4, or whatever, and if we simply looped long enough that would happen. So there are lots of ways to come up with something that works. Here (in pseudocode -- this is terrible Python) is one way:
import random, collections
def r5():
    return random.randint(1, 5)
def r7():
    left = list(range(1, 8))  # candidates still in play
    while True:
        if len(left) == 1:
            return left[0]
        rs = [r5() for n in left]  # one draw per remaining candidate
        m = max(rs)
        how_many_at_max = rs.count(m)
        if how_many_at_max == len(rs):
            # all the same: try again
            continue
        elif how_many_at_max == 1:
            # hooray!
            return left[rs.index(m)]
        # keep only the non-maximals
        left = [l for l,r in zip(left, rs) if r != m]
which gives
In [189]: collections.Counter(r7() for _ in xrange(10**6))
Out[189]: Counter({7: 143570, 5: 143206, 4: 142827, 2: 142673, 6: 142604, 1: 142573, 3: 142547})

Related

Minimum Delete operations to empty the vector

My friend was asked this question in an interview:
We have a vector of integers consisting only of 0s and 1s. A delete consists of selecting consecutive equal numbers and removing them. The remaining parts are then attached to each other. For example, if the vector is [0,1,1,0] then after removing [1,1] we get [0,0]. Removing a single element that has no equal neighbour also costs one delete.
We need to write a function that returns the minimum number of deletes to make the vector empty.
Example 1:
Input: [0,1,1,0]
Output: 2
Explanation: [0,1,1,0] -> [0,0] -> []
Example 2:
Input: [1,0,1,0]
Output: 3
Explanation: [1,0,1,0] -> [0,1,0] -> [0,0] -> [].
Example 3:
Input: [1,1,1]
Output: 1
Explanation: [1,1,1] -> []
I am unsure of how to solve this question. I feel that we can use a greedy approach:
Remove all consecutive equal elements and increment the delete counter for each;
Remove elements of the form <a, b, c> where a==c and a!=b, because if we had multiple consecutive bs, they would have been deleted in step (1) above. Increment the delete counter once as we delete one b.
Repeat steps (1) and (2) as long as we can.
Increment delete counter once for each of the remaining elements in the vector.
But I am not sure if this would work. Could someone please confirm if this is the right approach? If not, how do we solve this?
Hint
You can simplify this problem greatly by noticing the following fact: a chain of consecutive zeros or ones can be shortened or lengthened without changing the final solution. By example, the two vectors have the same solution:
[1, 0, 1]
[1, 0, 0, 0, 0, 0, 0, 1]
With that in mind, the solution becomes simpler. So I encourage you to pause and try to figure it out!
Solution
With the previous remark, we can reduce the problem to vectors of alternating zeros and ones. In fact, since zero and one have no special meaning here, it suffices to solve for all such vectors which start with, say, a one.
[] # number of steps: 0
[1] # number of steps: 1
[1, 0] # number of steps: 2
[1, 0, 1] # number of steps: 2
[1, 0, 1, 0] # number of steps: 3
[1, 0, 1, 0, 1] # number of steps: 3
[1, 0, 1, 0, 1, 0] # number of steps: 4
[1, 0, 1, 0, 1, 0, 1] # number of steps: 4
We notice a pattern: the solution seems to be floor(n / 2) + 1 for n > 1, where n is the length of the sequence. But can we prove it?
Proof
We will proceed by induction. Suppose you have a solution for a vector of length n - 2, then any move you do (except for deleting the two characters on the edges of the vector) will have the following result.
[..., 0, 1, 0, 1, 0, ...]
            ^------------ delete this one
Result:
[..., 0, 1, 1, 0, ...]
But we already mentioned that a chain of consecutive zeros or ones can be shortened or lengthened without changing the final solution. So the result of the deletion is in fact equivalent to now having to solve for:
[..., 0, 1, 0, ...]
What we did is one deletion in n elements, and we arrived at a case which is equivalent to having to solve for n - 2 elements. So the solution for a vector of size n is...
Solution(n) = Solution(n - 2) + 1
            = [floor((n - 2) / 2) + 1] + 1
            = floor(n / 2) + 1
Keeping in mind that the solutions for [1] and [1, 0] are respectively 1 and 2, this concludes our proof. Notice here, that [] turns out to be an edge case.
Interestingly enough, this proof also shows us that the optimal sequence of deletions for a given vector is highly non-unique. You can simply delete any block of ones or zeros, except for the first and last ones, and you will end up with an optimal solution.
Conclusion
In conclusion, given an arbitrary vector of ones and zeros, the smallest number of deletions you will need can be computed by counting the number n of groups of consecutive ones or zeros. The answer is then floor(n / 2) + 1 for n > 1, and n itself for n <= 1.
Just for fun, here is a Python implementation to solve this problem.
from itertools import groupby
def solution(vector):
    n = 0
    # each group is one maximal run of equal values
    for _ in groupby(vector):
        n += 1
    return n // 2 + 1 if n > 1 else n
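If you do not fully trust the induction, the formula can be checked against an exhaustive search on short vectors. The sketch below is only a sanity check, not part of the proof: brute_force (a name made up here) tries every possible delete recursively, where a delete removes any run of consecutive equal values, and formula repeats the group-counting rule.

```python
from functools import lru_cache
from itertools import groupby, product

def formula(vector):
    # Count maximal runs, then apply floor(n / 2) + 1 (or n itself for n <= 1).
    groups = sum(1 for _ in groupby(vector))
    return groups // 2 + 1 if groups > 1 else groups

@lru_cache(maxsize=None)
def brute_force(vector):
    # vector is a tuple; try removing every run of consecutive equal values.
    if not vector:
        return 0
    best = len(vector)  # deleting one element at a time always works
    for start in range(len(vector)):
        for end in range(start + 1, len(vector) + 1):
            if vector[end - 1] != vector[start]:
                break  # the run of equal values ends here
            rest = vector[:start] + vector[end:]
            best = min(best, 1 + brute_force(rest))
    return best
```

Comparing the two on every 0/1 tuple up to a small length (for instance with itertools.product) is cheap, because the search is memoised.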
Intuition: If we remove every segment of one of the two integers, all the remaining integers are of one type, and they can be removed with a single operation.
Removing the segments of the integer that does not start the vector gives the optimal result.
Solution:
Take the integer other than the one that is starting as a flag.
Count the number of contiguous segments of the flag in a vector.
The answer will be the above count + 1(one operation for removing a segment of starting integer)
So, the answer is:
answer = Count of contiguous segments of flag + 1
Example 1:
[0,1,1,0]
flag = 1
Count of subsegments with flag = 1
So, answer = 1 + 1 = 2
Example 2:
[1,0,1,0]
flag = 0
Count of subsegments with flag = 2
So, answer = 2 + 1 = 3
Example 3:
[1,1,1]
flag = 0
Count of subsegments with flag = 0
So, answer = 0 + 1 = 1
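The flag-counting rule could be written in Python along these lines. It is a sketch that assumes the vector contains only 0s and 1s; the function name is made up for the example, and the empty vector is treated as needing zero deletes.

```python
def min_deletes_by_flag(vector):
    if not vector:
        return 0
    flag = 1 - vector[0]  # the value that does not start the vector
    segments = 0
    previous = None
    for value in vector:
        # A new flag segment starts whenever we enter a run of flag values.
        if value == flag and previous != flag:
            segments += 1
        previous = value
    # One delete per flag segment, plus one for the merged starting value.
    return segments + 1
```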

permutations without repetition

I would like to know, what is the best approach to solve this problem:
Given x, y and y integers: a1, a2, a3 .. ay find all combinations of
a1 ± a2 ± ... ± ay = x, y < 20.
My recent approach is to find all permutations of 1 and 0 stored in table T and then, depending on whether number T[i] is 1 and 0, add or subtract ai from sum. The problem is that there are n! permutations of n-element array. Hence, for 20-element array, I have to check 20! possibilities where most of them are repeated. Could you please suggest me any potential approach to solving my problem?
There are only 2^20 (just over a million) binary vectors of length 20 rather than the infeasible 20!. You should be able to brute-force that few in less than a second, especially if you use a Gray code, which would allow you to pass from one candidate sum to another in a single step (e.g. to go from a + b - c - d to a + b - c + d just add 2*d).
The excellent branch and bound idea of @MikeWise would be good if y gets much larger. Generate a tree starting with a root node of 0. Give it children of -a1 and +a1. Then 4 grandchildren by adding and subtracting a2, etc. If you ever get farther than the sum of the remaining ai from the target x, you can prune that branch. In the worst case, this might be slightly worse than the Gray-code based brute force (because you need to do so much more processing at each node), but in the best case you might be able to prune away most possibilities.
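A rough sketch of that branch-and-bound tree in Python, for illustration only. The function name and the remaining table are assumptions of this example; the table holds, for each depth, the largest amount the not-yet-used numbers can still move the running total.

```python
def signed_sums_pruned(nums, target):
    # remaining[i] = sum of |nums[i:]|, the furthest the tail can shift the total.
    remaining = [0] * (len(nums) + 1)
    for i in range(len(nums) - 1, -1, -1):
        remaining[i] = remaining[i + 1] + abs(nums[i])

    solutions = []
    signs = []

    def visit(i, total):
        if abs(target - total) > remaining[i]:
            return  # the target is out of reach: prune this branch
        if i == len(nums):
            # remaining[i] is 0 here, so total == target
            solutions.append([s * a for s, a in zip(signs, nums)])
            return
        for s in (1, -1):
            signs.append(s)
            visit(i + 1, total + s * nums[i])
            signs.pop()

    visit(0, 0)
    return solutions
```

Sorting the numbers in decreasing order of absolute value before the search tends to make the pruning bite earlier, at the cost of reporting the terms in that order.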
On Edit: Here is some Python code. First I define a generator which, given an integer n, successively returns which bit position needs to flip to step through a Gray code:
def grayBit(n):
    code = [0]*n
    odd = True
    done = False
    while not done:
        if odd:
            code[0] = 1 - code[0] #flip bit
            odd = False
            yield 0
        else:
            i = code.index(1)
            if i == n-1:
                done = True
            else:
                code[i+1] = 1 - code[i+1]
                odd = True
                yield i+1
(This uses an algorithm which I learned years ago in the excellent book "Constructive Combinatorics" by Stanton and White).
Then -- I use this to return all solutions (as lists consisting of the input list of numbers with negative signs inserted as needed). The key point is that I can take the current bit-to-flip and either add or subtract twice the corresponding number:
def signedSums(nums, target):
    n = len(nums)
    patterns = []
    total = sum(nums)
    pattern = [1]*n
    if target == total: patterns.append([x*y for x,y in zip(nums,pattern)])
    deltas = [2*i for i in nums]
    for i in grayBit(n):
        if pattern[i] == 1:
            total -= deltas[i]
        else:
            total += deltas[i]
        pattern[i] = -1 * pattern[i]
        if target == total: patterns.append([x*y for x,y in zip(nums,pattern)])
    return patterns
Typical output:
>>> signedSums([1,2,3,4,5,9],6)
[[1, -2, -3, -4, 5, 9], [1, 2, 3, -4, -5, 9], [-1, 2, -3, 4, -5, 9], [1, 2, 3, 4, 5, -9]]
It only takes about a second to evaluate:
>>> len(signedSums([i for i in range(1,21)],100))
2865
Hence there are 2865 ways to add or subtract the integers in the range 1,2,..,20 to get a net sum of 100.
I assumed that a1 can be either added or subtracted (instead of just added, which is what your question implies if taken literally). Note that if you really want to insist that a1 occurs positively, then you could just subtract it from x and apply the above algorithm to the rest of the list and the adjusted target.
Finally, it is not too hard to see that if you solve the subset-sum problem with the set of weights {2*a1, 2*a2, 2*a3, .... 2*ay} and with a target sum of x + a1 + a2 + ... + ay then the subsets selected will correspond exactly to the subsets where the positive signs occur in the solution to the original problem. The translation also works in the other direction, so your problem is equivalent to the subset-sum problem, and it is thus NP-complete to determine if it has any solutions (and NP-hard to list them all).
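To make the reduction concrete, here is a small Python sketch that counts the sign assignments by solving the corresponding subset-sum instance with the textbook counting table. It assumes the ai are non-negative integers, and the function name is invented for this example. Halving both the weights 2*ai and the target x + a1 + ... + ay gives the same instance with smaller numbers, which is what the code does.

```python
def count_via_subset_sum(nums, x):
    total = sum(nums)
    goal = x + total  # target for the doubled weights 2*a_i
    if goal < 0 or goal % 2:
        return 0  # doubled weights can only reach even, non-negative sums
    goal //= 2  # same instance with weights a_i and target (x + total) / 2
    ways = [1] + [0] * goal  # ways[s] = number of subsets summing to s
    for a in nums:
        for s in range(goal, a - 1, -1):
            ways[s] += ways[s - a]
    return ways[goal]
```

Each counted subset is the set of terms that receive a plus sign; every other term gets a minus sign.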
We have conditions:
a1 ± a2 ± ... ± ay = x, y<20 [1]
First of all, I would generalize the condition [1], allowing all 'a' including 'a1' to be ±:
±a1 ± a2 ± ... ± ay = x [2]
If we have solution for [2], we can easily get solution for [1]
To solve [2] we can use the following approach:
combinations list x
    | x == 0 && null list = [[]]
    | null list = []
    | otherwise = plusCombinations ++ minusCombinations where
        a = head list
        rest = tail list
        plusCombinations = map (\c -> a:c) $ combinations rest (x-a)
        minusCombinations = map (\c -> -a:c) $ combinations rest (x+a)
Explanation:
The first condition checks whether x has reached zero and all numbers from the list have been used. This means that a solution has been found, and we return a single solution: [[]]
The second condition checks whether the list is empty; since x is not 0 at that point, no solution can be found, so we return the empty list of solutions: []
The third branch means that we have two alternatives, using ai with '+' or with '-', so we concatenate the plus and minus combinations
Example output:
*Main> combinations [1,2,3,4] 2
[[1,2,3,-4],[-1,2,-3,4]]
*Main> combinations [1,2,3,4] 3
[]
*Main> combinations [1,2,3,4] 4
[[1,2,-3,4],[-1,-2,3,4]]

Better algorithm to find the maximum number whose square divides K

Given a number K which is a product of two different numbers (A,B), find the maximum number (<=A & <=B) whose square divides K.
Eg : K = 54 (6*9) . Both the numbers are available i.e 6 and 9.
My approach is fairly simple, even trivial:
Take the smaller of the two (6 in this case). Let's call it A.
Square the number and divide K by it; if it divides K exactly, that's the number.
Else A = A-1, until A = 1.
For the given example, 3*3 = 9 divides K, and hence 3 is the answer.
Looking for a better algorithm, than the trivial solution.
Note : The test cases are in 1000's so the best possible approach is needed.
I am sure someone else will come up with a nice answer involving modulus arithmetic. Here is a naive approach...
Each of the factors can themselves be factored (though it might be an expensive operation).
Given the factors, you can then look for groups of repeated factors.
For instance, using your example:
Prime factors of 9: 3, 3
Prime factors of 6: 2, 3
All prime factors: 2, 3, 3, 3
There are two 3s, so you have your answer (the square of 3 divides 54).
Second example of 36 x 9 = 324
Prime factors of 36: 2, 2, 3, 3
Prime factors of 9: 3, 3
All prime factors: 2, 2, 3, 3, 3, 3
So you have two 2s and four 3s, which means 2x3x3 is repeated. 2x3x3 = 18, so the square of 18 divides 324.
Edit: python prototype
import math
def factors(num, counts):
    """ This finds the factors of a number recursively.
    It is not the most efficient algorithm, and I
    have not tested it a lot. You should probably
    use another one. counts is a dictionary which looks
    like {factor: occurrences, factor: occurrences, ...}
    It must contain at least {2: 0} but need not have
    any other pre-populated elements. Factors will be added
    to this dictionary as they are found.
    """
    while (num % 2 == 0):
        num //= 2  # integer division keeps num an int
        counts[2] += 1
    i = 3
    found = False
    while (not found and (i <= int(math.sqrt(num)))):
        if (num % i == 0):
            found = True
            factors(i, counts)
            factors(num // i, counts)
        else:
            i += 2
    # no odd divisor found: whatever is left is prime (or 1, which we skip)
    if (not found and num > 1):
        if (num in counts):
            counts[num] += 1
        else:
            counts[num] = 1
    return 0
#MAIN ROUTINE IS HERE
n1 = 37 # first number (6 in your example)
n2 = 41 # second number (9 in your example)
counts = {2: 0} # initialise factors (start with "no factors of 2")
factors(n1, counts) # find the factors of n1 and add them to the dictionary
factors(n2, counts) # find the factors of n2 and add them to the dictionary
sqfac = 1
# now find all factors repeated twice and multiply them together
for k in counts.keys():
    counts[k] //= 2
    sqfac *= k ** counts[k]
# here is the result
print(sqfac)
Answer in C++
#include <iostream>
using namespace std;
const int k = 54;
// Reports whether i*i divides k; swapped marks the second (and last) attempt.
void func(int i, int j, bool swapped = false)
{
    if (k % (i * i) == 0)
    {
        if (i < j && !swapped && k % (j * j) == 0)
        {
            func(j, i, true);
        }
        else
        {
            cout << "Number is correct: " << i << endl;
        }
    }
    else
    {
        cout << "Number is wrong" << endl;
        if (!swapped)
        {
            func(j, i, true);
        }
    }
}
Explanation:
Test whether the square of i divides k. If it does, check whether the other number is larger and also works; if so, recurse on it, otherwise i is correct. If the square of i does not divide k, print "Number is wrong" and recurse once to test j. The swapped flag stops the recursion after both numbers have been tried.
If I understood the problem correctly, you have a rectangle of length A, width B and area K,
and you want to convert it to a square while losing the minimum possible area.
If this is the case, the problem with your algorithm is not the cost of iterating many times until you get the output.
Rather, the problem is that your algorithm depends heavily on the length A and width B of the input rectangle,
while it should depend only on the area K.
For example:
Assume A = 1, B = 25.
Then K = 25 (the rectangle's area).
Your algorithm will take the minimum value, which is A, and accept it as the answer after a single iteration. That is fast but leads to the wrong answer, as it results in a square of area 1 and wastes the remaining 24 (in whatever unit, cm or m).
The correct answer here should be 5, which will never be reached by your algorithm.
So, in my solution I assume a single input K.
My idea is as follows:
x = floor(sqrt(K))
if (x * x == K) .. x is the answer
else loop from x down to 1, x--
    if K / x^2 is an integer, x is the answer
This might take extra iterations but guarantees an accurate answer.
Also, there might be some concerns about the cost of sqrt(K),
but it is called just once, and it avoids the misleading length and width input.
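In Python the idea might look like the sketch below. The function name and the optional limit argument are additions made for this example: limit lets you cap the search at min(A, B) if you want to keep the constraint from the question, and leaving it out gives the K-only version described above. math.isqrt needs Python 3.8 or later.

```python
from math import isqrt

def largest_square_divisor(k, limit=None):
    # Start from floor(sqrt(k)), optionally capped, and walk downwards.
    start = isqrt(k)
    if limit is not None:
        start = min(start, limit)
    for x in range(start, 0, -1):
        if k % (x * x) == 0:
            return x
    return None  # only reached when k < 1 or limit < 1
```

For K = 54 the loop starts at 7 and stops at 3, and for K = 25 it returns 5 immediately.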

Maximum continuous achievable number

The problem
Definitions
Let's define a natural number N as a writable number (WN) for number set in M numeral system, if it can be written in this numeral system from members of U using each member no more than once. More strict definition of 'written': - here CONCAT means concatenation.
Let's define a natural number N as a continuous achievable number (CAN) for symbol set in M numeral system if it is a WN-number for U and M and also N-1 is a CAN-number for U and M (Another definition may be N is CAN for U and M if all 0 .. N numbers are WN for U and M). More strict:
Issue
Let us have a set of S natural numbers: (we are treating zero as a natural number) and a natural number M, M>1. The problem is to find the maximum CAN (MCAN) for given U and M. The given set U may contain duplicates, but each duplicate can be used no more than once, of course (i.e. if U contains {x, y, y, z}, then each y can be used 0 or 1 times, so y can be used 0..2 times in total). Also, U is expected to be valid in the M numeral system (i.e. it cannot contain symbols 8 or 9 in any member if M=8). And, of course, members of U are numbers, not symbols for M (so 11 is valid for M=10); otherwise the problem would be trivial.
My approach
I have in mind a simple algorithm now, which is simply checking if current number is CAN via:
Check if 0 is WN for given U and M? Go to 2: We're done, MCAN is null
Check if 1 is WN for given U and M? Go to 3: We're done, MCAN is 0
...
So, this algorithm tries to build the whole sequence. I doubt this part can be improved, but maybe it can? Now, how do we check whether a number is a WN? This is also a kind of 'substitution brute-force'. I have an implementation of that for M=10 (in fact, since we're dealing with strings, any other M is not a problem) as a PHP function:
//$mNumber is our N, $rgNumbers is our U
function isWriteable($mNumber, $rgNumbers)
{
    if(in_array((string)$mNumber, $rgNumbers=array_map('strval', $rgNumbers), true))
    {
        return true;
    }
    for($i=1; $i<=strlen((string)$mNumber); $i++)
    {
        foreach($rgKeys = array_keys(array_filter($rgNumbers, function($sX) use ($mNumber, $i)
        {
            return $sX==substr((string)$mNumber, 0, $i);
        })) as $iKey)
        {
            $rgTemp = $rgNumbers;
            unset($rgTemp[$iKey]);
            if(isWriteable(substr((string)$mNumber, $i), $rgTemp))
            {
                return true;
            }
        }
    }
    return false;
}
So we try one piece and then check recursively whether the rest can be written. If it cannot be written, we try the next member of U. I think this is a point which can be improved.
Specifics
As you see, an algorithm is trying to build all numbers before N and check if they are WN. But the only question is - to find MCAN, so, question is:
May be constructive algorithm is excessive here? And, if yes, what other options could be used?
Is there more quick way to determine if number is WN for given U and M? (this point may have no sense if previous point has positive answer and we'll not build and check all numbers before N).
Samples
U = {4, 1, 5, 2, 0}
M = 10
then MCAN = 2 (3 couldn't be reached)
U = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11}
M = 10
then MCAN = 21 (all before could be reached, for 22 there are no two 2 symbols total).
Hash the digit count for digits from 0 to m-1. Hash the numbers greater than m that are composed of one repeated digit.
MCAN is bound by the smallest digit for which all combinations of that digit for a given digit count cannot be constructed (e.g., X000,X00X,X0XX,XX0X,XXX0,XXXX), or (digit count - 1) in the case of zero (for example, for all combinations of four digits, combinations are needed for only three zeros; for a zero count of zero, MCAN is null). Digit counts are evaluated in ascending order.
Examples:
1. MCAN (10, {4, 1, 5, 2, 0})
3 is the smallest digit for which a digit-count of one cannot be constructed.
MCAN = 2
2. MCAN (10, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11})
2 is the smallest digit for which a digit-count of two cannot be constructed.
MCAN = 21
3. (from Alma Do Mundo's comment below) MCAN (2, {0,0,0,1,1,1})
1 is the smallest digit for which all combinations for a digit-count of four
cannot be constructed.
MCAN = 1110
4. (example from No One in Particular's answer)
MCAN (2, {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1111,11111111})
1 is the smallest digit for which all combinations for a digit-count of five
cannot be constructed.
MCAN = 10101
The recursion steps I've made are:
If the digit string is available in your alphabet, mark it used and return immediately
If the digit string is of length 1, return failure
Split the string in two and try each part
This is my code:
$u = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11];
echo ncan($u), "\n"; // 21
// the functions
// returns the remaining stock after writing $n, or false if $n cannot be written
function satisfy($n, array $u)
{
    if (!empty($u[$n])) { // step 1
        --$u[$n];
        return $u;
    } elseif (strlen($n) == 1) { // step 2
        return false;
    }
    // step 3
    for ($i = 1; $i < strlen($n); ++$i) {
        $u2 = satisfy(substr($n, 0, $i), $u);
        // pass the reduced stock on, and return the stock left after both halves
        if ($u2 && ($u3 = satisfy(substr($n, $i), $u2))) {
            return $u3;
        }
    }
    return false;
}
function is_can($n, $u)
{
    return satisfy($n, $u) !== false;
}
function ncan($u)
{
    // count how many copies of each member are available
    $umap = array_reduce($u, function($result, $item) {
        $result[$item] = ($result[$item] ?? 0) + 1;
        return $result;
    }, []);
    $i = -1;
    while (is_can($i + 1, $umap)) {
        ++$i;
    }
    return $i;
}
Here is another approach:
1) Order the set U with regards to the usual numerical ordering for base M.
2) If there is a symbol between 0 and (M-1) which is missing, then that is the first number which is NOT MCAN.
3) Find the first symbol which has the least number of entries in the set U. From this we have an upper bound on the first number which is NOT MCAN. That number would be {xxxx} N times. For example, if M = 4 and U = { 0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3}, then the number 333 is not MCAN. This gives us our upper bound.
4) So, if the first element of the set U which has the smallest number of occurrences is x and it has C occurrences, then we can clearly represent any number with C digits. (Since every element has at least C entries).
5) Now we ask if there is any number less than (C+1)x which can't be MCAN? Well, any (C+1) digit number can have either (C+1) of the same symbol or only at most (C) of the same symbol. Since x is minimal from step 3, (C+1)y for y < x can be done and (C)a + b can be done for any distinct a, b since they have (C) copies at least.
The above method works for set elements of only 1 symbol. However, we now see that it becomes more complex if multi-symbol elements are allowed. Consider the following case:
U = { 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1111,11111111}
Define c(A,B) = the number of 'A' symbols of 'B' length.
So for our example, c(0,1) = 15, c(0,2) = 0, c(0,3) = 0, c(0,4) = 0, ...
c(1,1) = 3, c(1,2) = 0, c(1,3) = 0, c(1,4) = 1, c(1,5) = 0, ..., c(1,8) = 1
The shortest 0 string we can't make has length 16. The shortest 1 string we can't make also has length 16.
1 = 1
11 = 1+1
111 = 1+1+1
1111 = 1111
11111 = 1+1111
111111 = 1+1+1111
1111111 = 1+1+1+1111
11111111 = 11111111
111111111 = 1+11111111
1111111111 = 1+1+11111111
11111111111 = 1+1+1+11111111
111111111111 = 1111+11111111
1111111111111 = 1+1111+11111111
11111111111111 = 1+1+1111+11111111
111111111111111 = 1+1+1+1111+11111111
But can we make the string 11111101111? We can't because the last 1 string (1111) needs the only set of 1's with the 4 in a row. Once we take that, we can't make the first 1 string (111111) because we only have an 8 (which is too big) or 3 1-lengths which are too small.
So for multi-symbols, we need another approach.
We know from sorting and ordering our strings what is the minimum length we can't do for a given symbol. (In the example above, it would be 16 zeros or 16 ones.) So this is our upper bound for an answer.
What we have to do now is start at 1 and count up in base M. For each number we write it in base M and then determine if we can make it from our set U. We do this by using the same approach used in the coin change problem: dynamic programming. (See for example http://www.geeksforgeeks.org/dynamic-programming-set-7-coin-change/ for the algorithm.) The only difference is that in our case we only have a finite number of each element, not an infinite supply.
Instead of subtracting the amount we are using like in the coin change problem, we strip the matching symbol off of the front of the string we are trying to match. (This is the opposite of our addition - concatenation.)
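A small Python sketch of the prefix-stripping check, with some assumptions spelled out: members of U are passed as strings already written in base M, the helper names (to_base, is_writable, mcan) are invented for this example, and plain backtracking is used instead of a memoised dynamic-programming table, which is enough for inputs of the size shown in the question.

```python
from collections import Counter

def to_base(n, base):
    # Write n in the given base (2..36) without leading zeros.
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    if n == 0:
        return "0"
    out = []
    while n:
        n, r = divmod(n, base)
        out.append(digits[r])
    return "".join(reversed(out))

def is_writable(text, stock):
    # stock maps each member of U (a string) to its number of unused copies.
    if not text:
        return True
    for member, copies in stock.items():
        if copies > 0 and text.startswith(member):
            stock[member] -= 1  # use one copy and strip it off the front
            ok = is_writable(text[len(member):], stock)
            stock[member] += 1  # put it back before trying the next member
            if ok:
                return True
    return False

def mcan(members, base):
    stock = Counter(members)
    n = 0
    while is_writable(to_base(n, base), stock):
        n += 1
    return n - 1 if n else None  # None: even 0 cannot be written
```

The result is returned as an ordinary integer; for the base-2 sample U = {0,0,0,1,1,1} that integer is 14, which is 1110 in binary.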

How can you compare to what extent two lists are in the same order?

I have two arrays containing the same elements, but in different orders, and I want to know the extent to which their orders differ.
The method I tried didn't work. It was as follows:
For each list I built a matrix which recorded for each pair of elements whether they were above or below each other in the list. I then calculated a pearson correlation coefficient of these two matrices. This worked extremely badly. Here's a trivial example:
list 1:
1
2
3
4
list 2:
1
3
2
4
The method I described above produced matrices like this (where 1 means the row number is higher than the column, and 0 vice-versa):
list 1:
  1 2 3 4
1   1 1 1
2     1 1
3       1
4
list 2:
  1 2 3 4
1   1 1 1
2     0 1
3       1
4
Since the only difference is the order of elements 2 and 3, these should be deemed to be very similar. The Pearson Correlation Coefficient for those two matrices is 0, suggesting they are not correlated at all. I guess the problem is that what I'm looking for is not really a correlation coefficient, but some other kind of similarity measure. Edit distance, perhaps?
Can anyone suggest anything better?
Mean square of differences of indices of each element.
List 1: A B C D E
List 2: A D C B E
Indices of each element of List 1 in List 2 (zero based)
A B C D E
0 3 2 1 4
Indices of each element of List 1 in List 1 (zero based)
A B C D E
0 1 2 3 4
Differences:
A B C D E
0 -2 0 2 0
Square of differences:
A B C D E
0 4 0 4 0
Average differentness = 8 / 5.
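As a quick illustration, the same calculation in Python, assuming both lists contain the same distinct elements (the function name is made up):

```python
def mean_square_index_difference(list1, list2):
    # Where does each element sit in list2?
    position = {item: i for i, item in enumerate(list2)}
    # Index in list1 minus index in list2, for every element of list1.
    diffs = [i - position[item] for i, item in enumerate(list1)]
    return sum(d * d for d in diffs) / len(list1)
```

Called with "ABCDE" and "ADCBE" it gives 8 / 5, matching the hand calculation.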
Just an idea, but is there any mileage in adapting a standard sort algorithm to count the number of swap operations needed to transform list1 into list2?
I think that defining the compare function may be difficult though (perhaps even just as difficult as the original problem!), and this may be inefficient.
edit: thinking about this a bit more, the compare function would essentially be defined by the target list itself. So for example if list 2 is:
1 4 6 5 3
...then the compare function should result in 1 < 4 < 6 < 5 < 3 (and return equality where entries are equal).
Then the swap function just needs to be extended to count the swap operations.
A bit late for the party here, but just for the record, I think Ben almost had it... if you'd looked further into correlation coefficients, I think you'd have found that Spearman's rank correlation coefficient might have been the way to go.
Interestingly, jamesh seems to have derived a similar measure, but not normalized.
See this recent SO answer.
You might consider how many changes it takes to transform one string into another (which I guess is what you were getting at when you mentioned edit distance).
See: http://en.wikipedia.org/wiki/Levenshtein_distance
Although I don't think l-distance takes into account rotation. If you allow rotation as an operation then:
1, 2, 3, 4
and
2, 3, 4, 1
Are pretty similar.
There is a branch-and-bound algorithm that should work for any set of operators you like. It may not be real fast. The pseudocode goes something like this:
bool bounded_recursive_compare_routine(int* a, int* b, int level, int bound){
    if (level > bound) return false;
    // if at end of a and b, return true
    // apply rule 0, like no-change
    if (*a == *b){
        bounded_recursive_compare_routine(a+1, b+1, level+0, bound);
        // if it returns true, return true;
    }
    // if can apply rule 1, like rotation, to b, try that and recur
    bounded_recursive_compare_routine(a+1, b+1, level+cost_of_rotation, bound);
    // if it returns true, return true;
    ...
    return false;
}
int get_minimum_cost(int* a, int* b){
    int bound;
    for (bound=0; ; bound++){
        if (bounded_recursive_compare_routine(a, b, 0, bound)) break;
    }
    return bound;
}
The time it takes is roughly exponential in the answer, because it is dominated by the last bound that works.
Added: This can be extended to find the nearest-matching string stored in a trie. I did that years ago in a spelling-correction algorithm.
I'm not sure exactly what formula it uses under the hood, but difflib.SequenceMatcher.ratio() does exactly this:
ratio(self) method of difflib.SequenceMatcher instance:
Return a measure of the sequences' similarity (float in [0,1]).
Code example:
from difflib import SequenceMatcher
sm = SequenceMatcher(None, '1234', '1324')
print(sm.ratio())
>>> 0.75
Another approach that is based on a little bit of mathematics is to count the number of inversions to convert one of the arrays into the other one. An inversion is the exchange of two neighboring array elements. In ruby it is done like this:
# extend class array by new method
class Array
  def dist(other)
    raise 'can calculate distance only to array with same length' if length != other.length
    # initialize count of inversions to 0
    count = 0
    # loop over all pairs of indices i, j with i<j
    length.times do |i|
      (i+1).upto(length - 1) do |j|
        # increase count if i-th and j-th element have different order
        count += 1 if (self[i] <=> self[j]) != (other[i] <=> other[j])
      end
    end
    return count
  end
end
l1 = [1, 2, 3, 4]
l2 = [1, 3, 2, 4]
# try an example (prints 1)
puts l1.dist(l2)
The distance between two arrays of length n can be between 0 (they are the same) and n*(n-1)/2, the number of pairs i<j (reversing the first array one gets the second). If you prefer to have distances always between 0 and 1, to be able to compare distances of pairs of arrays of different length, just divide by n*(n-1)/2.
A disadvantage of this algorithm is its running time of O(n^2). It also assumes that the arrays don't have double entries, but it could be adapted.
A remark about the code line "count += 1 if ...": the count is increased only if either the i-th element of the first list is smaller than its j-th element and the i-th element of the second list is bigger than its j-th element or vice versa (meaning that the i-th element of the first list is bigger than its j-th element and the i-th element of the second list is smaller than its j-th element). In short: (l1[i] < l1[j] and l2[i] > l2[j]) or (l1[i] > l1[j] and l2[i] < l2[j])
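If the quadratic running time is a concern, the same kind of count can be obtained in O(n log n) with a merge sort. The Python sketch below is one way to do it, under the assumption that both lists hold the same distinct items; the function names are invented here. It counts the pairs of items that appear in opposite relative order in the two lists.

```python
def inversion_distance(list1, list2):
    # Rewrite list1 as positions in list2; inversions of that sequence are
    # exactly the pairs that the two lists order differently.
    position = {item: i for i, item in enumerate(list2)}
    seq = [position[item] for item in list1]

    def sort_count(a):
        # Returns (sorted copy of a, number of inversions in a).
        if len(a) <= 1:
            return a, 0
        mid = len(a) // 2
        left, inv_left = sort_count(a[:mid])
        right, inv_right = sort_count(a[mid:])
        merged = []
        inv = inv_left + inv_right
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                merged.append(right[j])
                j += 1
                inv += len(left) - i  # right[j] jumps over the rest of left
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged, inv

    return sort_count(seq)[1]
```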
If one has two orders one should look at two important ranking correlation coefficients:
Spearman's rank correlation coefficient: https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient
This is almost the same as Jamesh answer but scaled in the range -1 to 1.
It is defined as:
1 - ( 6 * sum_of_squared_distances ) / ( n_samples * (n_samples**2 - 1) )
Kendalls tau: https://nl.wikipedia.org/wiki/Kendalls_tau
When using python one could use:
from scipy import stats
order1 = [ 1, 2, 3, 4]
order2 = [ 1, 3, 2, 4]
print(stats.spearmanr(order1, order2)[0])
>> 0.8000
print(stats.kendalltau(order1, order2)[0])
>> 0.6667
If anyone is using the R language, I've implemented a function that computes the "spearman rank correlation coefficient" using the method described above by @bubake here:
get_spearman_coef <- function(objectA, objectB) {
  #getting the spearman rho rank test
  spearman_data <- data.frame(listA = objectA, listB = objectB)
  spearman_data$rankA <- 1:nrow(spearman_data)
  rankB <- c()
  for (index_valueA in 1:nrow(spearman_data)) {
    for (index_valueB in 1:nrow(spearman_data)) {
      if (spearman_data$listA[index_valueA] == spearman_data$listB[index_valueB]) {
        rankB <- append(rankB, index_valueB)
      }
    }
  }
  spearman_data$rankB <- rankB
  spearman_data$distance <-(spearman_data$rankA - spearman_data$rankB)**2
  spearman <- 1 - ( (6 * sum(spearman_data$distance)) / (nrow(spearman_data) * ( nrow(spearman_data)**2 -1) ) )
  print(paste("spearman's rank correlation coefficient"))
  return( spearman)
}
results :
get_spearman_coef(c("a","b","c","d","e"), c("a","b","c","d","e"))
spearman's rank correlation coefficient: 1
get_spearman_coef(c("a","b","c","d","e"), c("b","a","d","c","e"))
spearman's rank correlation coefficient: 0.8
