How does this algorithm correspond to roulette wheel selection? - genetic-algorithm

I am trying to implement a roulette wheel selection. I have understood this algorithm:
Calculate the sum S of all chromosome fitnesses in the population.
Generate a random number r from the interval (0, S).
Loop through the population, adding each chromosome's fitness to a running total that starts at r; call this partial sum P.
When P > S, stop and return the corresponding chromosome.
What I don't understand is how this corresponds to doing this instead: Roulette wheel selection algorithm
(the answer with 44 votes). This makes sense to me, but not the one above.

The following version works with the raw sum S:
def choose_parent_using_RWS(genes, S, points):
    P = randint(0, int(S))          # start the running total at a random point r in [0, S]
    for x in genes:
        P += evaluate(x, points)    # add this chromosome's fitness
        if P > S:                   # the cumulative total has crossed S
            return x
    return genes[-1]
The following version normalizes everything to between 0 and 1:
def choose_parent_using_RWS(genes, S, points):
    P = randint(0, int(S))/S        # r normalized to [0, 1]
    for x in genes:
        P += evaluate(x, points)/S  # normalized fitness
        if P > S/S:                 # i.e. P > 1
            return x
    return genes[-1]

In the answer with 44 votes, the range has been normalised to between 0 and 1, which is easier to understand but requires extra calculation steps.
You can implement the approach you mentioned. While calculating the sum, each chromosome contributes its own fitness, so each chromosome effectively owns a sub-interval of (0, S) whose width equals its fitness. When a random number r is generated between 0 and S, the chromosome whose sub-interval contains r is chosen, so each one is chosen with probability proportional to its fitness: the bigger the fitness, the more likely r falls in its range.
For example, let's say that a chromosome with a fitness of 23 (assumption) is the 5th chromosome you iterate over, and the total sum S is 130. Say the sum of the first 4 chromosomes is 54. Then if the random r is between 55 and 77 (both inclusive), this chromosome is chosen.
After normalisation, 55/130 ≈ 0.423 and 77/130 ≈ 0.592 bound the range a random number r2 (between 0 and 1) must fall in for this chromosome to be selected.
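To see that the two variants really behave the same, here is a small Monte Carlo check (my own sketch, not part of the original answer; fitnesses stands in for evaluate(x, points)):

import random
from collections import Counter

fitnesses = [23, 54, 13, 30, 10]             # hypothetical fitness values
S = sum(fitnesses)                           # total fitness, here 130

def select_raw():
    P = random.uniform(0, S)                 # r drawn from (0, S)
    for i, f in enumerate(fitnesses):
        P += f
        if P > S:                            # same as "cumulative fitness exceeds S - r"
            return i
    return len(fitnesses) - 1

def select_normalized():
    P = random.uniform(0, 1)                 # r drawn from (0, 1)
    for i, f in enumerate(fitnesses):
        P += f / S
        if P > 1:
            return i
    return len(fitnesses) - 1

counts = Counter(select_raw() for _ in range(100_000))
print([round(counts[i] / 100_000, 3) for i in range(len(fitnesses))])
print([round(f / S, 3) for f in fitnesses])  # the two lines should roughly agree

Running the same check with select_normalized gives the same proportions, which is the point of the highly voted answer: normalizing only rescales the wheel, it does not change which slice r lands in.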

Related

Conditional sampling of binary vectors (?)

I'm trying to find a name for my problem, so I don't have to re-invent the wheel when coding an algorithm which solves it...
I have, say, 2,000 binary (row) vectors and I need to pick 500 of them. In the picked sample I take the column sums, and I want my sample to be as close as possible to a pre-defined distribution of the column sums. I'll be working with 20 to 60 columns.
A tiny example:
Out of the vectors:
110
010
011
110
100
I need to pick 2 to get column sums 2, 1, 0. The solution (exact in this case) would be
110
100
My ideas so far
one could maybe call this a binary multidimensional knapsack, but I did not find any algos for that
Linear Programming could help, but I'd need some step by step explanation as I got no experience with it
as an exact solution is not always feasible, something like simulated annealing or brute force could work well
a hacky way using constraint solvers comes to mind - first set the constraints tight and gradually loosen them until some solution is found - given that CSP should be much faster than ILP...?
My concrete, practical suggestion (if the approximation guarantee works out for you) would be to apply the maximum entropy method (Chapter 7 of Boyd and Vandenberghe's book Convex Optimization; you can probably find several implementations with your favorite search engine): find the maximum-entropy probability distribution on row indexes such that (1) no row index is more likely than 1/500 and (2) the expected value of the chosen row vector is 1/500th of the predefined column-sum distribution. Given this distribution, choose each row independently with probability 500 times its likelihood under the distribution, which will give you 500 rows on average. If you need exactly 500, repeat until you get exactly 500 (it shouldn't take too many tries, due to concentration bounds).
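To make the suggestion concrete, here is a rough sketch (mine, not the answerer's; rows, M and the toy sizes are made up for illustration) of the maximum-entropy step using scipy's generic SLSQP solver. A real 2000 x 500 instance would call for a dedicated convex-optimization package, as the book describes.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
rows = rng.integers(0, 2, size=(60, 5))   # toy stand-in for the 2000 binary vectors
M = 15                                    # sample size (500 in the question)
target = rows[:M].sum(axis=0)             # toy target column sums
N = len(rows)

def neg_entropy(p):
    p = np.clip(p, 1e-12, None)           # avoid log(0)
    return float(np.sum(p * np.log(p)))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},            # p is a distribution
    {"type": "eq", "fun": lambda p: rows.T @ p - target / M},  # expected row = target / M
]
bounds = [(0.0, 1.0 / M)] * N             # no index more likely than 1/M
p0 = np.full(N, 1.0 / N)

res = minimize(neg_entropy, p0, method="SLSQP", bounds=bounds,
               constraints=constraints, options={"maxiter": 500})
p = np.clip(res.x, 0.0, 1.0 / M)

# choose each row independently with probability M * p_i (about M rows on average)
picked = [i for i in range(N) if rng.random() < M * p[i]]
print(len(picked), rows[picked].sum(axis=0), target)

If the number of picked rows is not exactly M, repeat the last two lines until it is, as suggested above.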
Firstly, I will make some assumptions regarding this problem:
Regardless of whether the column sum of the selected solution is over or under the target, it weighs the same.
The sums of the first, second, and third columns are equally weighted in the solution (i.e. if there is a solution where the first column sum is off by 1 and another where the third column sum is off by 1, the two solutions are equally good).
The closest problem I can think of is the subset sum problem, which itself can be thought of as a special case of the knapsack problem.
However, both of these problems are NP-complete. This means there is no known polynomial-time algorithm that can solve them, even though it is easy to verify a solution.
If I were you, the two arguably most practical approaches to this problem are linear programming and machine learning.
Depending on how many columns you are optimising, linear programming lets you control how finely tuned you want the solution to be, in exchange for time. You should read up on this, because it is fairly simple and efficient.
With machine learning, you need a lot of data sets (the set of vectors and the set of solutions). You don't even need to specify what you want; many machine learning algorithms can generally deduce what you want them to optimise based on your data set.
Both approaches have pros and cons; you should decide which one to use based on your circumstances and problem set.
This can definitely be modeled as an (integer!) linear program (many problems can). Once you have the model, you can use a program such as lp_solve to solve it.
We model "vector i is selected" as a binary variable x_i, which can be 0 or 1.
Then for each column c, we have a constraint:
sum over all i of (x_i * value of vector i in column c) = target for column c
Taking your example, in lp_solve this could look like:
min: ;
+x1 +x4 +x5 >= 2;
+x1 +x4 +x5 <= 2;
+x1 +x2 +x3 +x4 <= 1;
+x1 +x2 +x3 +x4 >= 1;
+x3 <= 0;
+x3 >= 0;
bin x1, x2, x3, x4, x5;
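For completeness, a sketch of the same toy model using PuLP from Python (my choice of library, not part of the answer); for the full problem you would presumably also fix the sample size with a constraint such as lpSum(x) == 500.

from pulp import LpProblem, LpVariable, LpMinimize, lpSum, value

vectors = [(1, 1, 0), (0, 1, 0), (0, 1, 1), (1, 1, 0), (1, 0, 0)]
target = (2, 1, 0)

prob = LpProblem("pick_rows", LpMinimize)
x = [LpVariable(f"x{i+1}", cat="Binary") for i in range(len(vectors))]
prob += lpSum(x)                          # any objective will do; this one prefers few rows
for c, t in enumerate(target):            # one equality constraint per column
    prob += lpSum(v[c] * x[i] for i, v in enumerate(vectors)) == t

prob.solve()
print([i + 1 for i in range(len(vectors)) if value(x[i]) == 1])   # e.g. [1, 5]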
If you are fine with a heuristic based search approach, here is one.
Go over the list and score each bit string by the squared sum of its digit-wise difference from the goal, keeping the minimum. For example, if we are looking for 2, 1, 0 and we are scoring 0, 1, 1, we would do it in the following way:
Take the digit wise difference:
2, 0, 1
Square the digit wise difference:
4, 0, 1
Sum:
5
As a side note, squaring the difference when scoring is a common method when doing heuristic search. In your case it makes sense because bit strings that have a 1 as the first digit are a lot more interesting to us. In your case this simple algorithm would pick 110 first, then 100, which is the best solution.
In any case, there are some optimizations that could be made to this; I will post them here if this kind of approach is what you are looking for, but this is the core of the algorithm.
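A minimal sketch of this greedy scoring as I read it (not the answerer's code): repeatedly pick the vector with the smallest squared digit-wise difference from the remaining goal, then subtract it from the goal.

def greedy_pick(vectors, target, k):
    remaining = list(target)
    pool = list(vectors)
    chosen = []
    for _ in range(k):
        # score = squared digit-wise difference from what is still needed
        best = min(pool, key=lambda v: sum((t - b) ** 2 for t, b in zip(remaining, v)))
        chosen.append(best)
        pool.remove(best)
        remaining = [t - b for t, b in zip(remaining, best)]
    return chosen

vectors = [(1, 1, 0), (0, 1, 0), (0, 1, 1), (1, 1, 0), (1, 0, 0)]
print(greedy_pick(vectors, (2, 1, 0), 2))   # -> [(1, 1, 0), (1, 0, 0)]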
You have a given target binary vector. You want to select M vectors out of N that have the closest sum to the target. Let's say you use the Euclidean distance to measure whether one selection is better than another.
If you want an exact sum, have a look at the k-sum problem, which is a generalization of the 3SUM problem. The problem is harder than the subset sum problem, because you want an exact number of elements to add up to a target value. There is a solution in O(N^(M/2) * lg N), but that means more than 2000^250 * 7.6 > 10^826 operations in your case (in the favorable case where vector operations have a cost of 1).
First conclusion: do not try to get an exact result unless your vectors have some characteristics that may reduce the complexity.
Here's a hill climbing approach:
sort the vectors by number of 1's: 111... first, 000... last;
use the polynomial time approximate algorithm for the subset sum;
you have an approximate solution with K elements. Because of the order of the elements (the big ones come first), K should be as small as possible:
if K >= M, you take the M first vectors of the solution and that's probably near the best you can do.
if K < M, you can remove the first vector and try to replace it with 2 or more vectors from the rest of the N vectors, using the same technique, until you have M vectors. To summarize: split the big vectors into smaller ones until you reach the correct number of vectors.
Here's a proof of concept with numbers, in Python:
import random

def distance(x, y):
    return abs(x - y)

def show(ls):
    if len(ls) < 10:
        return str(ls)
    else:
        return ", ".join(map(str, ls[:5] + ("...",) + ls[-5:]))

def find(is_xs, target):
    # see https://en.wikipedia.org/wiki/Subset_sum_problem#Pseudo-polynomial_time_dynamic_programming_solution
    S = [(0, ())]  # we store indices along with values to get the path
    for i, x in is_xs:
        T = [(x + t, js + (i,)) for t, js in S]
        U = sorted(S + T)
        y, ks = U[0]
        S = [(y, ks)]
        for z, ls in U:
            if z == target:  # use the euclidean distance here if you want an approximation
                return ls
            if z != y and z < target:
                y, ks = z, ls
                S.append((z, ls))
    ls = S[-1][1]  # take the closest element to target
    return ls

N = 2000
M = 500
target = 1000
xs = [random.randint(0, 10) for _ in range(N)]
print("Take {} numbers out of {} to make a sum of {}".format(M, N, target))
xs = sorted(xs, reverse=True)
is_xs = list(enumerate(xs))
print("Sorted numbers: {}".format(show(tuple(is_xs))))
ls = find(is_xs, target)
print("FIRST TRY: {} elements ({}) -> {}".format(len(ls), show(ls), sum(x for i, x in is_xs if i in ls)))
splits = 0
while len(ls) < M:
    first_x = xs[ls[0]]
    js_ys = [(i, x) for i, x in is_xs if i not in ls and x != first_x]
    replace = find(js_ys, first_x)
    splits += 1
    if len(replace) < 2 or len(replace) + len(ls) - 1 > M or sum(xs[i] for i in replace) != first_x:
        print("Give up: can't replace {}.\nAdd the lowest elements.".format(first_x))
        ls += tuple([i for i, x in is_xs if i not in ls][len(ls) - M:])
        break
    print("Replace {} (={}) by {} (={})".format(ls[:1], first_x, replace, sum(xs[i] for i in replace)))
    ls = tuple(sorted(ls[1:] + replace))  # use a heap?
    print("{} elements ({}) -> {}".format(len(ls), show(ls), sum(x for i, x in is_xs if i in ls)))
print("AFTER {} splits, {} -> {}".format(splits, ls, sum(x for i, x in is_xs if i in ls)))
The result is obviously not guaranteed to be optimal.
Remarks:
Complexity: find has pseudo-polynomial time complexity (see the Wikipedia page) and is called at most M^2 times, so the overall complexity remains pseudo-polynomial. In practice, the process is reasonably fast (split calls have a small target).
Vectors: to ensure that you reach the target with the minimum number of elements, you can improve the ordering of the elements. Your target is (t_1, ..., t_c): if you sort the t_j from max to min, you get the most important columns first. You can then sort the vectors: by number of 1s, and then by the presence of a 1 in the most important columns. E.g. target = 4 8 6 => 1 1 1 > 0 1 1 > 1 1 0 > 1 0 1 > 0 1 0 > 0 0 1 > 1 0 0 > 0 0 0.
find (vectors): if the current sum exceeds the target in all the columns, then you cannot get closer to the target (any vector you add to the current sum will bring you farther from it): don't add that sum to S (the z >= target case for numbers).
I propose a simple ad hoc algorithm, which, broadly speaking, is a kind of gradient descent algorithm. It seems to work relatively well for input vectors which have a distribution of 1s “similar” to the target sum vector, and probably also for all “nice” input vectors, as defined in a comment of yours. The solution is not exact, but the approximation seems good.
The distance between the sum vector of the output vectors and the target vector is taken to be Euclidean. To minimize it means minimizing the sum of the squared differences between sum vector and target vector (the square root is not needed because it is monotonic). The algorithm does not guarantee to yield the sample that minimizes the distance from the target, but anyway makes a serious attempt at doing so, by always moving in some locally optimal direction.
The algorithm can be split into 3 parts.
First of all the first M candidate output vectors out of the N input vectors (e.g., N=2000, M=500) are put in a list, and the remaining vectors are put in another.
Then "approximately optimal" swaps between vectors in the two lists are done, until either the distance would not decrease any more, or a predefined maximum number of iterations is reached. An approximately optimal swap is one where removing the first vector from the list of output vectors causes a maximal decrease or minimal increase of the distance, and then, after the removal of the first vector, adding the second vector to the same list causes a maximal decrease of the distance. The whole swap is avoided if the net result is not a decrease of the distance.
Then, as a last phase, "optimal" swaps are done, again stopping on no decrease in distance or maximum number of iterations reached. Optimal swaps cause a maximal decrease of the distance, without requiring the removal of the first vector to be optimal in itself. To find an optimal swap all vector pairs have to be checked. This phase is much more expensive, being O(M(N-M)), while the previous "approximate" phase is O(M+(N-M))=O(N). Luckily, when entering this phase, most of the work has already been done by the previous phase.
from typing import List, Tuple

def get_sample(vects: List[Tuple[int]], target: Tuple[int], n_out: int,
               max_approx_swaps: int = None, max_optimal_swaps: int = None,
               verbose: bool = False) -> List[Tuple[int]]:
    """
    Get a sample of the input vectors having a sum close to the target vector.
    Closeness is measured in Euclidean metrics. The output is not guaranteed to be
    optimal (minimum square distance from target), but a serious attempt is made.
    The max_* parameters can be used to avoid too long execution times,
    tune them to your needs by setting verbose to True, or leave them None (∞).
    :param vects: the list of vectors (tuples) with the same number of "columns"
    :param target: the target vector, with the same number of "columns"
    :param n_out: the requested sample size
    :param max_approx_swaps: the max number of approximately optimal vector swaps,
        None means unlimited (default: None)
    :param max_optimal_swaps: the max number of optimal vector swaps,
        None means unlimited (default: None)
    :param verbose: print some info if True (default: False)
    :return: the sample of n_out vectors having a sum close to the target vector
    """
    def square_distance(v1, v2):
        return sum((e1 - e2) ** 2 for e1, e2 in zip(v1, v2))

    n_vec = len(vects)
    assert n_vec > 0
    assert n_out > 0
    n_rem = n_vec - n_out
    assert n_rem > 0
    output = vects[:n_out]
    remain = vects[n_out:]
    n_col = len(vects[0])
    assert n_col == len(target) > 0
    sumvect = (0,) * n_col
    for outvect in output:
        sumvect = tuple(map(int.__add__, sumvect, outvect))
    sqdist = square_distance(sumvect, target)
    if verbose:
        print(f"sqdist = {sqdist:4} after"
              f" picking the first {n_out} vectors out of {n_vec}")
    if max_approx_swaps is None:
        max_approx_swaps = sqdist
    n_approx_swaps = 0
    while sqdist and n_approx_swaps < max_approx_swaps:
        # find the best vect to subtract (the square distance MAY increase)
        sqdist_0 = None
        index_0 = None
        sumvect_0 = None
        for index in range(n_out):
            tmp_sumvect = tuple(map(int.__sub__, sumvect, output[index]))
            tmp_sqdist = square_distance(tmp_sumvect, target)
            if sqdist_0 is None or sqdist_0 > tmp_sqdist:
                sqdist_0 = tmp_sqdist
                index_0 = index
                sumvect_0 = tmp_sumvect
        # find the best vect to add,
        # but only if there is a net decrease of the square distance
        sqdist_1 = sqdist
        index_1 = None
        sumvect_1 = None
        for index in range(n_rem):
            tmp_sumvect = tuple(map(int.__add__, sumvect_0, remain[index]))
            tmp_sqdist = square_distance(tmp_sumvect, target)
            if sqdist_1 > tmp_sqdist:
                sqdist_1 = tmp_sqdist
                index_1 = index
                sumvect_1 = tmp_sumvect
        if sumvect_1:
            tmp = output[index_0]
            output[index_0] = remain[index_1]
            remain[index_1] = tmp
            sqdist = sqdist_1
            sumvect = sumvect_1
            n_approx_swaps += 1
        else:
            break
    if verbose:
        print(f"sqdist = {sqdist:4} after {n_approx_swaps}"
              f" approximately optimal swap{'s'[n_approx_swaps == 1:]}")
    diffvect = tuple(map(int.__sub__, sumvect, target))
    if max_optimal_swaps is None:
        max_optimal_swaps = sqdist
    n_optimal_swaps = 0
    while sqdist and n_optimal_swaps < max_optimal_swaps:
        # find the best pair to swap,
        # but only if the square distance decreases
        best_sqdist = sqdist
        best_diffvect = diffvect
        best_pair = None
        for i0 in range(n_out):   # note: the original had range(M), relying on the global M
            tmp_diffvect = tuple(map(int.__sub__, diffvect, output[i0]))
            for i1 in range(n_rem):
                new_diffvect = tuple(map(int.__add__, tmp_diffvect, remain[i1]))
                new_sqdist = sum(d * d for d in new_diffvect)
                if best_sqdist > new_sqdist:
                    best_sqdist = new_sqdist
                    best_diffvect = new_diffvect
                    best_pair = (i0, i1)
        if best_pair:
            tmp = output[best_pair[0]]
            output[best_pair[0]] = remain[best_pair[1]]
            remain[best_pair[1]] = tmp
            sqdist = best_sqdist
            diffvect = best_diffvect
            n_optimal_swaps += 1
        else:
            break
    if verbose:
        print(f"sqdist = {sqdist:4} after {n_optimal_swaps}"
              f" optimal swap{'s'[n_optimal_swaps == 1:]}")
    return output

from random import randrange

C = 30     # number of columns
N = 2000   # total number of vectors
M = 500    # number of output vectors
F = 0.9    # fill factor of the target sum vector
T = int(M * F)  # maximum value + 1 that can appear in the target sum vector
A = 10000  # maximum number of approximately optimal swaps, may be None (∞)
B = 10     # maximum number of optimal swaps, may be None (unlimited)
target = tuple(randrange(T) for _ in range(C))
vects = [tuple(int(randrange(M) < t) for t in target) for _ in range(N)]
sample = get_sample(vects, target, M, A, B, True)
Typical output:
sqdist = 2639 after picking the first 500 vectors out of 2000
sqdist = 9 after 27 approximately optimal swaps
sqdist = 1 after 4 optimal swaps
P.S.: As it stands, this algorithm is not limited to binary input vectors; integer vectors would work too. Intuitively I suspect that the quality of the optimization could suffer, though, and that this algorithm is more appropriate for binary vectors.
P.P.S.: Execution times with your kind of data are probably acceptable with standard CPython, but get better (like a couple of seconds, almost a factor of 10) with PyPy. To handle bigger sets of data, the algorithm would have to be translated to C or some other language, which should not be difficult at all.

Sliding weighted randomization from number range?

When picking a random number from a range, I'm doing rand(0..100). That works all well and good, but I'd like it to favor the lower end of the range.
So, there's the highest probability of picking 0 and the lowest probability of picking 100 (and then everything in between), based on some weighted scale.
How would I implement that in Ruby?
You could try taking the lower of two random numbers. That would favour smaller numbers.
[rand(0..100), rand(0..100)].min
If your first number is 5, the chance of the second number being lower (and being the one kept) is only about 5 in 100.
If your first number is 95, the chance of the second number being lower is about 95 in 100, so it is very likely to be replaced by the lower number.
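As a quick sanity check (my addition, in Python for brevity): the minimum of two independent draws follows a linearly decreasing distribution, and for rand(0..100) the exact probability of getting k works out to (2*(100 - k) + 1) / 101**2.

import random
from collections import Counter

draws = Counter(min(random.randint(0, 100), random.randint(0, 100))
                for _ in range(200_000))
for k in (0, 50, 100):
    observed = draws[k] / 200_000
    expected = (2 * (100 - k) + 1) / 101**2
    print(k, round(observed, 4), round(expected, 4))   # the two columns should roughly match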
My answer concerns the generation of random variates from underlying probability distributions generally, not just those distributions that give greater weight to smaller random variates.
You need to identify a (probability) density function f that has the desired shape. Then construct its (cumulative) distribution function F and the latter's inverse function G (the quantile function), meaning that G(F(x)) = x for all x in the sample space. f can be continuous or discrete.
For example, f and F could be the (negative) exponential density and distribution functions, which give higher weight to smaller values, as shown below (source: Wiki for Exponential Distribution).
[plots: exponential PDF and exponential CDF]
These functions are given by f(x) = λe**(−λx) and F(x) = 1 − e**(−λx), respectively, where e is the base of natural logarithms and λ is a rate parameter.
To generate random variates for this distribution we would draw a (pseudo-) random number between 0 and 1, mark that on the vertical axis of the CDF graph and draw a horizontal line from that point. The random variate is the point on the horizontal axis where the CDF intersects the horizontal line. If y is the random number between 0 and 1, we have
y = 1 − e**(−λx)
Solving for x,
x = -log(1 - y)/λ
so the inverse CDF is seen to be
g(y) = -log(1 - y)/λ
Here are some random variates for λ = 1.
def g(y)
  -Math.log(1 - y)
end
5.times { y = rand; puts "y = #{y.round(2)}, x = #{g(y).round(2)}" }
y = 0.09, x = 0.10
y = 0.67, x = 1.09
y = 0.35, x = 0.43
y = 0.55, x = 0.79
y = 0.19, x = 0.21
Most CDFs do not have closed-form inverse functions, but if the CDF is continuous, a binary search can be performed to compute an arbitrarily-close approximation to the random variate (x on the graph) for a given y = rand.
The Weibull Distribution is one of the few other continuous distributions (besides uniform and triangular) that has a closed-form inverse function. Having two parameters, it offers greater scope than the single-parameter exponential distribution for modelling a desired shape.
For discrete CDFs, one can use if statements (or, better, a case statement) to compute the random variate for a given y = rand.
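If you want to see the bisection idea in code, here is a small sketch (mine, in Python rather than Ruby) that inverts a monotone CDF numerically; the exponential CDF is used as a stand-in so the result can be checked against the closed-form inverse derived above.

import math
import random

def inverse_cdf(F, y, lo=0.0, hi=1e6, tol=1e-9):
    # binary search for x with F(x) ~= y, assuming F is non-decreasing on [lo, hi]
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if F(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def F(x):
    return 1 - math.exp(-x)                 # exponential CDF with rate 1, as a stand-in

y = random.random()
print(y, inverse_cdf(F, y), -math.log(1 - y))   # numeric vs closed-form inverse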
I'd do something like this:
low_end_of_range = 1
high_end_of_range = 100
weighted_range = []
(low_end_of_range..high_end_of_range).each do |num|
  weight = (high_end_of_range - num) + 1
  weight.times do
    weighted_range << num
  end
end
weighted_range.sample
This will give:
1 the highest probability of being picked, as it would appear 100 times in the weighted_range array,
2 the second highest probability of being picked, as it would appear 99 times in the weighted_range array,
100 the lowest probability of being picked, as it would appear only once in the weighted_range array, and
99 the second lowest probability of being picked, as it would appear twice in the weighted_range array,
etc.
And if you don't need any flexibility in the size of your sampling (i.e. low_end_of_range / high_end_of_range), you can do it in a nice one-liner:
(1..100).map { |i| (101 - i).times.map { i } }.flatten.sample

Coin change, dynamic programming, but coin value reduces after first use

There are a lot of coin change questions and answers around but I couldn't find this one, and am wondering if it's even a coin change problem.
Basically I have a bunch of different coins, and an infinite amount of each coin.
So say there are stacks of various denominations. Each stack is infinite. (So infinite number of 25c coins, infinite number of 2c coins etc).
However, on top of each stack of coins is a special coin which has a value greater than (or equal) to the coins below. I can't access the coins below without using this coin on top.
What I'm trying to work out is the minimum number of coins required to make a certain sum.
I think this is solvable dynamic programming but I'm unsure how to add this limitation to the traditional solutions.
I was wondering if I should just remove the special coin in my list once I use it and replace it with the normal coins, but I can't seem to reason if that would break the algorithm or not.
Looks like a classic dynamic programming problem, where the challenge is to choose state correctly.
Usually, we choose the sum of the selected coins as the problem state, and the number of selected coins as the state value. Transitions are every possible coin we can take. If we have 25c and 5c coins, we can move from state Sum with value Count to states (Sum+25, Count+1) and (Sum+5, Count+1).
For your limitation, the state should be augmented with information about which special (top) coins were taken, so you add a bit for each stack of coins. Then you just need to define the possible transitions from every state. It is simple: if the bit for a stack is set, it means the top coin was already taken and you can add a non-top coin from that stack to the state sum, keeping all bits the same. Otherwise, you can take the top coin from that stack, add its value to the state sum, and set the related bit.
You start from the state with sum 0, all bits clear and value 0, then build states from the lowest sum up to the target.
At the end, you should iterate over all possible combinations of bits and compare values for the state with the target sum and that bit combination. Choose the minimum - that is the answer.
Example solution code:
# Available coins: (top coin value, other coins value)
stacks = [(17, 8), (5, 3), (11, 1), (6, 4)]
# Target sum
target_value = 70

states = dict()
states[(0, 0)] = (0, None, None)

# DP going from sum 0 to target sum, bottom up:
for value in range(0, target_value):
    # For each sum, consider every possible combination of bits
    for bits in range(0, 2 ** len(stacks)):
        # Can we get to this sum with these bits?
        if (value, bits) in states:
            count = states[(value, bits)][0]
            # Let's take a coin from each stack
            for stack in range(0, len(stacks)):
                stack_bit = (1 << stack)
                if bits & stack_bit:
                    # Top coin already used, take the second
                    cost = stacks[stack][1]
                    transition = (value + cost, bits)
                else:
                    # Top coin not yet used
                    cost = stacks[stack][0]
                    transition = (value + cost, bits | stack_bit)
                # If we can get a better solution for the state with this sum
                # and bits, update it
                if transition not in states or states[transition][0] > count + 1:
                    # First element is the number of coins
                    # Second is a 'backtrack reference'
                    # Third is the coin value for this step
                    states[transition] = (count + 1, (value, bits), cost)

min_count = target_value + 1
min_state = None
# Find the best solution over all states with sum == target_value
for bits in range(0, 2 ** len(stacks)):
    if (target_value, bits) in states:
        count = states[(target_value, bits)][0]
        if count < min_count:
            min_count = count
            min_state = (target_value, bits)

collected_coins = []
state = min_state
if state is None:
    print("No solution")
else:
    # Follow backtrack references to get individual coins
    while state != (0, 0):
        collected_coins.append(states[state][2])
        state = states[state][1]
    print("Solution: %s" % list(reversed(collected_coins)))
The solution below is based on two facts:
1) there is an infinite number of coins in all available denominations
Algorithm:
Let the given number be x.
Call the method below repeatedly until you have worked through the coin denominations; on each iteration, pass the remaining value to be covered in coins along with the next coin denomination.
The method receives the number and the coin value.
Parameters: x - the amount to be allocated in coins
z - the value of the coin
if (x > z) {
    integer y = x/z * z
    if (y == x) {
        x is divisible by z, hence allocate x/z coins of value z.
    }
    if (y != x) {
        x is not a multiple of z, hence allocate x/z coins of value z and then, for the remaining amount, repeat the same logic with the next greatest denomination of coin.
    }
}
else if (x < z) {
    return from this method;
}
else if (x == z) {
    x is divisible by z, hence allocate x/z coins of value z
}
Example iteration:
Given number = 48c
Coins denomination: 25c, 10c, 5c, 2c, 1c
First Iteration:
Parameter x = 48c and z = 25c
return value would be a map of 1 coin of 25c, [25c , 1]
calculate the remaining amount 48c - 25c = 23c
23c is not equal to zero and not equal to 1 continue the loop.
Second Iteration:
Parameter x = 23c and z = 10c
return value would be a map of 2 coins of 10c, [10c, 2]
calculate the remaining amount 23c - 20c = 3c
3c is not equal to zero and not equal to 1 continue the loop.
Third Iteration:
Parameter x = 3c and z = 5c
No coins allocated
3c is not equal to zero and not equal to 1 continue the loop.
Fourth Iteration:
Parameter x = 3c and z = 2c
return value would be a map of 1 coin of 2c, [2c, 1]
Remaining amount 3c - 2c = 1c
Remaining amount Equals to 1 add an entry in map [1c, 1]
Final Map Entries:
[25c, 1]
[10c, 2]
[2c, 1]
[1c, 1]
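For what it's worth, here is a compact sketch of the greedy allocation traced above (my own wording, not the answerer's); like the walkthrough, it ignores the special-top-coin twist from the question and is not guaranteed to minimise the number of coins.

def greedy_allocate(amount, denominations):
    allocation = {}                           # coin value -> count
    for z in sorted(denominations, reverse=True):
        if amount >= z:
            allocation[z] = amount // z       # take as many coins of this value as fit
            amount -= allocation[z] * z
        if amount == 0:
            break
    return allocation

print(greedy_allocate(48, [25, 10, 5, 2, 1]))   # -> {25: 1, 10: 2, 2: 1, 1: 1}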

Identify this sampling algorithm? (R sample() function)

I'd like to read more about an algorithm that's used in R for unequal probability sampling, but after a few hours of searching I haven't been able to turn anything up on it. I thought it might have been an Art of Computer Programming algorithm, but I haven't been able to substantiate that either. The particular function in R's random.c is called ProbSampleNoReplace().
Given a vector of probabilities prob[] of length n, and a desired sample size nans, with a vector of selected items ans[]:

For each element j in prob[] assign an index perm[j]
Sort the list in order of probability value, largest first
totalmass = 1
for (h = 0, n1 = n - 1; h < nans; h++, n1--)
    rt = totalmass * rand(in 0:1)
    mass = 0
    ** sum the probabilities, largest first, until the sum exceeds rt **
    for (j = 0; j < n1; j++)
        mass += prob[j]
        if rt <= mass then break
    ans[h] = perm[j]
    ** reduce totalmass to reflect the removed item **
    totalmass -= prob[j]
    ** shift the remaining entries down so the indices stay sequential **
    for (k = j; k < n1; k++)
        prob[k] = prob[k+1]
        perm[k] = perm[k+1]
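For readers who do not want to parse the C-style pseudocode, here is a rough Python transcription of the same procedure (mine, not R's actual source): draw an index with probability proportional to the remaining weights, remove it, and repeat.

import random

def prob_sample_no_replace(prob, nans):
    # sort items by probability, largest first, keeping their original indices
    items = sorted(range(len(prob)), key=lambda i: prob[i], reverse=True)
    weights = sorted(prob, reverse=True)
    total = sum(weights)
    ans = []
    for _ in range(nans):
        rt = total * random.random()
        mass = 0.0
        for j, w in enumerate(weights):
            mass += w
            if rt <= mass:
                break
        ans.append(items[j])
        total -= weights[j]      # shrink the total to reflect the removed item
        items.pop(j)
        weights.pop(j)
    return ans

print(prob_sample_no_replace([0.4, 0.3, 0.2, 0.1], 2))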
The sample function supports unequal probability arguments. Your code fragment is not clear as to its intent to those of us who do not read C.
> table( sample(1:4, 100, repl=TRUE, prob=4:1) )
1 2 3 4
46 23 24 7
There is another SO Q&A that may be useful (found by an SO search with arguments):
random.c ProbSampleNoReplace
Faster weighted sampling without replacement

Maximizing number of factors contributing in the sum of sorted array bounded by a value

I have a sorted array of integers of size n. These values are not unique. What I need to do is
: given a bound B, I need to find an i < A[n] and as many elements A[j] as possible such that the sum of |A[j] - i| over those elements is less than B. I have some ideas, but I can't seem to find anything better than the naive n*B and n*n algorithms. Any ideas for O(n log n) or O(n)?
For example, imagine
A[n] = 1 2 10 10 12 14 and the sum must stay below B = 7; then the best i is 12, because I get 4 A[j]s contributing to my sum. 10 and 11 are equally good choices of i, because for i = 10 I get (10-10) + (10-10) + (12-10) + (14-10) = 6 < 7.
A solution in O(n): start from the end and compute a[n] - a[n-1]:
let d = 14 - 12 => d = 2 and r = B - d => r = 5,
then repeat the operation, but multiplying d by 2 this time:
d = 12 - 10 => d = 2 and r = r - 2*d => r = 1.
The algorithm ends when r would drop to 0 or below, because the sum must stay less than B:
with an array indexed 0..n-1:
i = 1
r = B
while (r > 0 && n - i > 1) {
    d = a[n-i] - a[n-i-1];
    r -= i * d;
    i++;
}
return a[n-i+1];
maybe a drawing explains better
14 x
13 x -> 2
12 xx
11 xx -> 2*2
10 xxxx -> 3*0
9 xxxx
8 xxxx
7 xxxx
6 xxxx
5 xxxx
4 xxxxx
3 xxxxx
2 xxxxxx
1 xxxxxxx
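For readers who prefer running code, here is a direct Python rendering of the pseudocode above (my transcription), applied to the question's example:

def best_i(a, B):
    n = len(a)
    i, r = 1, B
    while r > 0 and n - i > 1:
        d = a[n - i] - a[n - i - 1]   # gap to the next smaller value
        r -= i * d                    # lowering the level costs i times the gap
        i += 1
    return a[n - i + 1]

print(best_i([1, 2, 10, 10, 12, 14], 7))   # -> 10 (10, 11 and 12 are all equally good)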
I think you can do it in O(n) using these three tricks:
CUMULATIVE SUM
Precompute an array C[k] that stores sum(A[0:k]).
This can be done recursively via C[k]=C[k-1]+A[k] in time O(n).
The benefit of this array is that you can then compute sum(A[a:b]) via C[b]-C[a-1].
BEST MIDPOINT
Because your elements are sorted, then it is easy to compute the best i to minimise the sum of absolute values. In fact, the best i will always be given by the middle entry.
If the length of the list is even, then all values of i between the two central elements will always give the minimum absolute value.
e.g. for your list 10,10,12,14 the central elements are 10 and 12, so any value for i between 10 and 12 will minimise the sum.
ITERATIVE SEARCH
You can now scan over the elements a single time to find the best value.
1. Init s=0,e=0
2. if the score for A[s:e] is less than B increase e by 1
3. else increase s by 1
4. if e<n return to step 2
Keep track of the largest value for e-s seen which has a score < B and this is your answer.
This loop can go around at most 2n times so it is O(n).
The score for A[s:e] is given by sum |A[s:e]-A[(s+e)/2]|.
Let m=(s+e)/2.
score = sum |A[s:e]-A[(s+e)/2]|
= sum |A[s:e]-A[m]|
= sum (A[m]-A[s:m]) + sum (A[m+1:e]-A[m])
= (m-s+1)*A[m]-sum(A[s:m]) + sum(A[m+1:e])-(e-m)*A[m]
and we can compute the sums in this expression using the precomputed array C[k].
EDIT
If the endpoint must always be n, then you can use this alternative algorithm:
1. Init s=0,e=n
2. while the score for A[s:e] is greater than B, increase s by 1
PYTHON CODE
Here is a python implementation of the algorithm:
def fast(A, B):
    # cumulative sums: C[k] = sum(A[0:k+1])
    C = []
    t = 0
    for a in A:
        t += a
        C.append(t)

    def fastsum(s, e):
        if s == 0:
            return C[e]
        else:
            return C[e] - C[s-1]

    def fastscore(s, e):
        m = (s + e) // 2
        return (m-s+1)*A[m] - fastsum(s, m) + fastsum(m+1, e) - (e-m)*A[m]

    s = 0
    e = 0
    best = -1
    while e < len(A):
        if fastscore(s, e) < B:
            best = max(best, e-s+1)
            e += 1
        elif s == e:
            e += 1
        else:
            s += 1
    return best

print(fast([1, 2, 10, 10, 12, 14], 7))
# this returns 4, as the 4 elements 10,10,12,14 can be chosen
Try it this way for an O(N) approach (N = size of the array):
minpos = position of the value closest to B in the array (binary search, O(log N))
min = array[minpos]
if (min >= B) EXIT, no solution
// now, we just add the smallest elements from the left or the right
// until we are greater than B
leftindex = minpos - 1
rightindex = minpos + 1
while we have a valid leftindex or a valid rightindex:
    add = min(abs(array[leftindex (if valid)] - B), abs(array[rightindex (if valid)] - B))
    if (min + add >= B)
        break
    min += add
    decrease leftindex or increase rightindex according to which one was used
min is now our sum, rightindex the requested i (leftindex the start)
(It could happen that some indices are not correct; this is just the idea, not the implementation.)
I would guess the average case for small B is O(log N). The linear case only happens if we can use the whole array.
I'm not sure, but perhaps this can be done in O(log(N)*k), with N the size of the array and k < N, too. We would have to use the binary search in a clever way to find leftindex and rightindex in every iteration, such that the possible result range gets smaller in every iteration. This could easily be done, but we have to take care of duplicates, because they could destroy our binary search reductions.
