What would be the time complexity of the below-given code? - data-structures

What would be the time complexity of findMaxArea function in the below-given code?
Question source:
https://practice.geeksforgeeks.org/problems/length-of-largest-region-of-1s-1587115620/1#
class Solution:
    def dfs(self, i, j, grid, row, colm):
        if i < 0 or j < 0 or i >= row or j >= colm or grid[i][j] != 1:
            return 0
        res = 0
        grid[i][j] = 2
        res += self.dfs(i+1, j, grid, row, colm)
        res += self.dfs(i, j+1, grid, row, colm)
        res += self.dfs(i+1, j+1, grid, row, colm)
        res += self.dfs(i-1, j-1, grid, row, colm)
        res += self.dfs(i-1, j, grid, row, colm)
        res += self.dfs(i, j-1, grid, row, colm)
        res += self.dfs(i+1, j-1, grid, row, colm)
        res += self.dfs(i-1, j+1, grid, row, colm)
        return res + 1

    # Function to find unit area of the largest region of 1s.
    def findMaxArea(self, grid):
        # Code here
        row = len(grid)
        colm = len(grid[0])
        maximum = -1
        for i in range(row):
            for j in range(colm):
                if grid[i][j] == 1:
                    maximum = max(maximum, self.dfs(i, j, grid, row, colm))
        return maximum

# {
# Driver Code Starts
if __name__ == '__main__':
    # T is number of test cases
    T = int(input())
    for i in range(T):
        # size of matrix n x m
        n, m = map(int, input().split())
        grid = []
        for _ in range(n):
            a = list(map(int, input().split()))
            grid.append(a)
        obj = Solution()
        ans = obj.findMaxArea(grid)
        print(ans)
# } Driver Code Ends

You are doing a DFS on a matrix to find the largest region of 1s.
Although you make recursive calls, each cell is expanded only once, because it is marked (set to 2) the first time it is reached; even in the worst case, e.g. when the entire matrix is filled with 1s or entirely with 0s, every cell contributes only a constant number of extra calls.
So the time complexity is O(m*n), where m is the number of rows and n is the number of columns in the matrix.
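If you want to convince yourself of this empirically, here is a small sketch (my own, not part of the question or answer) that counts the total number of calls on an all-1s grid; the count stays within a small constant factor of rows*cols, because each cell is expanded exactly once and makes at most 8 constant-time calls:

calls = 0

def dfs(i, j, grid, rows, cols):
    global calls
    calls += 1
    if i < 0 or j < 0 or i >= rows or j >= cols or grid[i][j] != 1:
        return 0
    grid[i][j] = 2                       # mark visited, so the cell is expanded only once
    area = 1
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di or dj:
                area += dfs(i + di, j + dj, grid, rows, cols)
    return area

rows = cols = 20
grid = [[1] * cols for _ in range(rows)]
print(dfs(0, 0, grid, rows, cols), calls)    # prints 400 3201, i.e. 1 + 8 * rows * cols calls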

Related

Boolean expression for modified Queens problem

I saw the boolean expressions for the N Queens problem from here.
My modified N queens rules are simpler:
For a p*p chessboard I want to place N queens in such a way so that
Queens will be placed adjacently, rows will be filled first.
p*p chessboard size will be adjusted until it can hold N queens
For example, say N = 17, then we need a 5*5 chessboard and the placement will be:
Q_Q_Q_Q_Q
Q_Q_Q_Q_Q
Q_Q_Q_Q_Q
Q_Q_*_*_*
*_*_*_*_*
The question: I am trying to come up with a boolean expression for this problem.
This problem can be solved using the Python packages humanize and omega.
"""Solve variable size square fitting."""
import humanize
from omega.symbolic.fol import Context
def pick_chessboard(q):
ctx = Context()
# compute size of chessboard
#
# picking a domain for `p`
# requires partially solving the
# problem of computing `p`
ctx.declare(p=(0, q))
s = f'''
(p * p >= {q}) # chessboard fits the queens, and
/\ ((p - 1) * (p - 1) < {q}) # is the smallest such board
'''
u = ctx.add_expr(s)
d, = list(ctx.pick_iter(u)) # assert unique solution
p = d['p']
print(f'chessboard size: {p}')
# compute number of full rows
ctx.declare(x=(0, p))
s = f'x = {q} / {p}' # integer division
u = ctx.add_expr(s)
d, = list(ctx.pick_iter(u))
r = d['x']
print(f'{r} rows are full')
# compute number of queens on the last row
s = f'x = {q} % {p}' # modulo
u = ctx.add_expr(s)
d, = list(ctx.pick_iter(u))
n = d['x']
k = r + 1
kword = humanize.ordinal(k)
print(f'{n} queens on the {kword} row')
if __name__ == '__main__':
q = 10 # number of queens
pick_chessboard(q)
Representing multiplication (and integer division and modulo) with binary decision diagrams has complexity exponential in the number of variables, as proved in: https://doi.org/10.1109/12.73590

Need help in understanding Dynamic Programming approach for "balanced 0-1 matrix"?

Problem: I am struggling to understand/visualize the Dynamic Programming approach for "A type of balanced 0-1 matrix" in the Wikipedia article on Dynamic Programming.
Wikipedia Link: https://en.wikipedia.org/wiki/Dynamic_programming#A_type_of_balanced_0.E2.80.931_matrix
I couldn't understand how the memoization works when dealing with a multidimensional array. For example, when trying to solve the Fibonacci series with DP, using an array to store previous state results is easy, as the index value of the array stores the solution for that state.
Can someone explain the DP approach for the "0-1 balanced matrix" in a simpler manner?
Wikipedia offered both a crappy explanation and an algorithm that is not ideal. But let's work with it as a starting place.
First let's take the backtracking algorithm. Rather than put the cells of the matrix "in some order", let's do everything in the first row, then everything in the second row, then everything in the third row, and so on. Clearly that will work.
Now let's modify the backtracking algorithm slightly. Instead of going cell by cell, we'll go row by row. So we make a list of the n choose n/2 possible rows which are half 0 and half 1. Then we have a recursive function that looks something like this:
def count_0_1_matrices(n, filled_rows=None):
    if filled_rows is None:
        filled_rows = []
    if some_column_exceeds_threshold(n, filled_rows):
        # Cannot have more than n/2 0s or 1s in any column
        return 0
    elif len(filled_rows) == n:
        # All n rows placed without violating any column constraint
        return 1
    else:
        answer = 0
        for row in possible_rows(n):
            answer = answer + count_0_1_matrices(n, filled_rows + [row])
        return answer
This is a backtracking algorithm like what we had before. We are just doing whole rows at a time, not cells.
But notice, we're passing around more information than we need. There is no need to pass in the exact arrangement of rows. All that we need to know is how many 1s are needed in each remaining column. So we can make the algorithm look more like this:
def count_0_1_matrices(n, still_needed=None):
    if still_needed is None:
        still_needed = [int(n/2) for _ in range(n)]
    # Did we overrun any column?
    for i in still_needed:
        if i < 0:
            return 0
    # Did we reach the end of our matrix?
    if 0 == sum(still_needed):
        return 1
    # Calculate the answer by recursion.
    answer = 0
    for row in possible_rows(n):
        next_still_needed = [still_needed[i] - row[i] for i in range(n)]
        answer = answer + count_0_1_matrices(n, next_still_needed)
    return answer
This version is almost the recursive function in the Wikipedia version. The main difference is that our base case is that after every row is finished, we need nothing, while Wikipedia would have us code up the base case to check the last row after every other is done.
To get from this to a top-down DP, you only need to memoize the function, which in Python you can do by defining a memoize decorator and then applying it with @memoize. Like this:
from functools import wraps

def memoize(func):
    cache = {}
    @wraps(func)
    def wrap(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrap
But remember that I criticized the Wikipedia algorithm? Let's start improving it! The first big improvement is this. Do you notice that the order of the elements of still_needed can't matter, just their values? So just sorting the elements will stop you from doing the calculation separately for each permutation. (There can be a lot of permutations!)
@memoize
def count_0_1_matrices(n, still_needed=None):
    if still_needed is None:
        still_needed = [int(n/2) for _ in range(n)]
    # Did we overrun any column?
    for i in still_needed:
        if i < 0:
            return 0
    # Did we reach the end of our matrix?
    if 0 == sum(still_needed):
        return 1
    # Calculate the answer by recursion.
    answer = 0
    for row in possible_rows(n):
        next_still_needed = [still_needed[i] - row[i] for i in range(n)]
        answer = answer + count_0_1_matrices(n, tuple(sorted(next_still_needed)))
    return answer
That little innocuous sorted doesn't look important, but it saves a lot of work! And now that we know that still_needed is always sorted, we can simplify our checks for whether we are done, and whether anything went negative. Plus we can add an easy check to filter out the case where we have too many 0s in a column.
@memoize
def count_0_1_matrices(n, still_needed=None):
    if still_needed is None:
        still_needed = [int(n/2) for _ in range(n)]
    # Did we overrun any column? (still_needed is sorted ascending, so check the smallest)
    if still_needed[0] < 0:
        return 0
    total = sum(still_needed)
    if 0 == total:
        # We reached the end of our matrix.
        return 1
    elif total*2/n < still_needed[-1]:
        # We have total*2/n rows left, but won't get enough 1s for a
        # column.
        return 0
    # Calculate the answer by recursion.
    answer = 0
    for row in possible_rows(n):
        next_still_needed = [still_needed[i] - row[i] for i in range(n)]
        answer = answer + count_0_1_matrices(n, tuple(sorted(next_still_needed)))
    return answer
And, assuming you implement possible_rows, this should both work and be significantly more efficient than what Wikipedia offered.
=====
Here is a complete working implementation. On my machine it calculated the 6'th term in under 4 seconds.
#! /usr/bin/env python
from sys import argv
from functools import wraps

def memoize(func):
    cache = {}
    @wraps(func)
    def wrap(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrap

@memoize
def count_0_1_matrices(n, still_needed=None):
    if 0 == n:
        return 1
    if still_needed is None:
        still_needed = [int(n/2) for _ in range(n)]
    # Did we overrun any column?
    if still_needed[0] < 0:
        return 0
    total = sum(still_needed)
    if 0 == total:
        # We reached the end of our matrix.
        return 1
    elif total*2/n < still_needed[-1]:
        # We have total*2/n rows left, but won't get enough 1s for a
        # column.
        return 0
    # Calculate the answer by recursion.
    answer = 0
    for row in possible_rows(n):
        next_still_needed = [still_needed[i] - row[i] for i in range(n)]
        answer = answer + count_0_1_matrices(n, tuple(sorted(next_still_needed)))
    return answer

@memoize
def possible_rows(n):
    return [row for row in _possible_rows(n, n/2)]

def _possible_rows(n, k):
    if 0 == n:
        yield tuple()
    else:
        if k < n:
            for row in _possible_rows(n-1, k):
                yield tuple(row + (0,))
        if 0 < k:
            for row in _possible_rows(n-1, k-1):
                yield tuple(row + (1,))

n = 2
if 1 < len(argv):
    n = int(argv[1])

print(count_0_1_matrices(2*n))
You're memoizing states that are likely to be repeated. The state that needs to be remembered in this case is the vector (k is implicit). Let's look at one of the examples you linked to. Each pair in the vector argument (of length n) represents "the number of zeros and ones that have yet to be placed in that column."
Take the example on the left, where the vector is ((1, 1) (1, 1) (1, 1) (1, 1)) at k = 2 and the assignments leading to it were 1 0 1 0 at k = 3 and 0 1 0 1 at k = 4. But we could get to the same state, ((1, 1) (1, 1) (1, 1) (1, 1)) at k = 2, from a different set of assignments, for example 0 1 0 1 at k = 3 and 1 0 1 0 at k = 4. If we memoize the result for the state ((1, 1) (1, 1) (1, 1) (1, 1)), we can avoid recalculating the recursion for that branch again.
Please let me know if there's anything I could better clarify.
Further elaboration in response to your comment:
The Wikipedia example seems to be pretty much a brute force with memoization. The algorithm attempts to enumerate all the matrices but uses memoization to exit early from repeated states. How do we enumerate all possibilities? To take their example, n = 4, we start with the vector [(2,2),(2,2),(2,2),(2,2)] where zeros and ones are yet to be placed. (Since the sum of each tuple in the vector is k, we could have a simpler vector where k and the count of either ones or zeros is maintained.)
At every stage, k, in the recursion, we enumerate all possible configurations for the next vector. If the state exists in our hash, we simply return the value for that key. Otherwise, we assign the vector as a new key in the hash (in which case this recursion branch will continue).
For example:
Vector: [(2,2),(2,2),(2,2),(2,2)]
Possible assignments of 1's: [1 1 0 0], [1 0 1 0], [1 0 0 1] ... etc.
First branch: [(2,1),(2,1),(1,2),(1,2)]
    Is this vector a key in the hash?
        If yes, return the value looked up for that key.
        Else, assign this vector as a key in the hash, where the value is the sum
        of the function calls with the next possible vectors as their arguments.
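A minimal sketch of that state-hashing idea (my own illustration, not the Wikipedia code): the memo key is the tuple of per-column counts of 1s still needed, so different assignment histories that reach the same counts share one cached result.

from functools import lru_cache
from itertools import combinations

def count_balanced(n):
    # all rows of length n containing exactly n/2 ones
    rows = [tuple(1 if i in ones else 0 for i in range(n))
            for ones in combinations(range(n), n // 2)]

    @lru_cache(maxsize=None)
    def go(rows_left, needed):           # needed[j] = 1s still needed in column j
        if rows_left == 0:
            return 1 if all(c == 0 for c in needed) else 0
        total = 0
        for row in rows:
            nxt = tuple(c - r for c, r in zip(needed, row))
            if all(c >= 0 for c in nxt):
                total += go(rows_left - 1, nxt)
        return total

    return go(n, tuple([n // 2] * n))

print(count_balanced(4))   # 90: each row and column of the 4x4 matrix gets exactly two 1s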
Building on the excellent answer by https://stackoverflow.com/users/585411/btilly, I've updated their algorithm to exclude "0" cases in the still_needed tuple. The code is about 50% faster, largely because of more cache hits using the collapsible tuple.
import time
from typing import Tuple
from sys import argv
from functools import cache

@cache
def possible_rows(n, k=None) -> Tuple[int]:
    if k is None:
        k = n / 2
    return [row for row in _possible_rows(n, k)]

def _possible_rows(n, k) -> Tuple[int]:
    if 0 == n:
        yield tuple()
    else:
        if k < n:
            for row in _possible_rows(n-1, k):
                yield tuple(row + (0,))
        if 0 < k:
            for row in _possible_rows(n-1, k-1):
                yield tuple(row + (1,))

def count(n: int, k: int) -> int:
    if n == 0:
        return 1
    still_needed = tuple([k] * n)
    return count_0_1_matrices(k, still_needed)

@cache
def count_0_1_matrices(k: int, still_needed: Tuple[int]):
    """
    Assume still_needed contains only positive ints, and is sorted ascending
    """
    # Calculate the answer by recursion.
    answer = 0
    for row in possible_rows(len(still_needed), k):
        # Decrement the still_needed value tuple by the row tuple and only keep
        # positive results. Sorting is important for cache hits.
        next_still_needed = tuple(sorted([sn - r for sn, r in zip(still_needed, row) if sn > r]))
        # Only continue if we still need values and there are enough rows left
        if not next_still_needed:
            answer += 1
        elif len(next_still_needed) >= k and sum(next_still_needed) >= next_still_needed[-1] * k:
            # sum / k -> how many rows left. We need enough rows left to continue down this path.
            answer += count_0_1_matrices(k, next_still_needed)
    return answer

if __name__ == "__main__":
    n = 7
    if 1 < len(argv):
        n = int(argv[1])
    start = time.time()
    result = count(2*n, n)
    print(f"{result} in {time.time() - start} seconds")

Combinatorial game

Here's the game:
There is a string of 0s and 1s and in each turn a player is allowed to
convert a set of contiguous 1s to 0s. A player can convert at most k
contiguous 1s to 0s and has to convert at least one 1 to 0 in his
move. The player who is unable to make a move loses.
Example:
10100111 (k=2)
Here the winning move would be: 10100101 (converted the 2nd last 1 to 0)
It's a 2 player impartial game and I tried to analyse it as a variant of nim game. There are n heaps each heap with ai marbles (n sets of contiguous 1s). A player can split a heap into 2 heaps by removing at most k marbles from anywhere in that heap. Supposing a heap has 5 marbles (*****) and you split the heap by removing k=2 marbles from position 2 (* **). Also, if you would remove the first or last k marbles, the heap wouldn't split, only its size would be reduced by k.
Can this model help find the strategy for the original game? If yes, what would be the optimal strategy?
Any help would be appreciated!
As ypercube mentioned, the game can be solved, and for each position it is possible to determine whether it is a winning (N-) or a losing (P-) position.
It is enough to consider:
The initial losing (P-) position is the string of n zeros,
A winning (N-) position is any position from which there is a move to some P-position,
A P-position is a position from which every move leads to an N-position.
With that it is easy to find the value of each position: start with the initial position, find the next N-positions, from these N-positions find (possible) P-positions, and so on.
Here is Python code that solves this game:
from itertools import product
from collections import defaultdict

class Game(object):
    def __init__(self, n, k):
        self.n, self.k = n, k

    def states(self):  # All strings of 0|1 of length n
        return (''.join(x) for x in product(('0', '1'), repeat=self.n))

    def set_zeros(self, c, i, l):  # Set zeros in c from position i with length l
        return c[:i] + '0'*l + c[i+l:]

    def next_positions(self, c):  # All moves from given position
        for i in range(self.n):
            if c[i] == '1':  # First '1'
                yield self.set_zeros(c, i, 1)
                for j in range(1, self.k):
                    if i+j < self.n and c[i+j] == '1':
                        yield self.set_zeros(c, i, j+1)
                    else:
                        break

    def lost_positions(self):  # Initial lost position(s)
        return ['0'*self.n]

    def solve(self):
        next_pos = {}                # Maps position to possible positions after a move
        prev_pos = defaultdict(set)  # Maps position to possible positions before that move
        win_lose = {}                # True - win/N-position, False - lose/P-position, None - not decided
        for s in self.states():
            win_lose[s] = None
            next_pos[s] = set(self.next_positions(s))
            for n in next_pos[s]:
                prev_pos[n].add(s)
        # Initial losing positions
        loses_to_check = set(self.lost_positions())
        for c in loses_to_check:
            win_lose[c] = False
        #
        while loses_to_check:
            lost_c = loses_to_check.pop()
            for w_pos in prev_pos[lost_c]:  # Winning moves
                if win_lose[w_pos] is None:
                    win_lose[w_pos] = True
                    for x in prev_pos[w_pos]:  # Check positions before w_pos for P-position
                        if all(win_lose[i] for i in next_pos[x]):
                            win_lose[x] = False
                            loses_to_check.add(x)
        return win_lose

comb = '10100111'
g = Game(len(comb), 2)
win_lose = g.solve()
print(comb, win_lose[comb])
Note: changing/overriding the methods states(), next_positions(c), and lost_positions() is enough to implement a solver for similar games.
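For example, a hypothetical circular variant (my own illustration), where the string wraps around so a run of 1s may continue from the last character back to the first, only needs next_positions() and a wrapping helper:

class CircularGame(Game):
    def next_positions(self, c):
        for i in range(self.n):
            if c[i] == '1':
                for j in range(self.k):
                    if c[(i + j) % self.n] != '1':
                        break
                    yield self.set_zeros_wrap(c, i, j + 1)

    def set_zeros_wrap(self, c, i, l):
        chars = list(c)
        for j in range(l):
            chars[(i + j) % self.n] = '0'
        return ''.join(chars)

print(CircularGame(8, 2).solve()['10100111'])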

What is the name of the algorithm that returns order statistics quickly?

I have an idea of what I want the algorithm to do, but I'm looking to see if this has been implemented for me in python and I need to know the algorithm's name.
I want to insert a set of values in the range [0,65535] into the container. I want to then ask the container for an arbitrary order statistic. Both insert and query should be logarithmic.
The standard way to do this is by using an augmented binary search tree. Essentially, in addition to keeping the set of keys stored in the tree, you keep a count of the nodes stored in each subtree. This lets you compute order statistics efficiently.
Since you're dealing with bounded integers, you can just keep a binary search tree with the 65536 values stored in it, and keep a count of the number of elements stored in each subtree. This yields a running time of O(lg 65536) instead of O(lg n).
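The same augmented-count idea over a fixed range can also be realized with a Fenwick (binary indexed) tree; this is my own sketch, not the answerer's code:

class OrderStatTree:
    """Counts over the fixed integer range [0, size); insert and k-th smallest
    query are both O(log size)."""
    def __init__(self, size):
        self.size = size
        self.tree = [0] * (size + 1)            # 1-based Fenwick tree of counts

    def insert(self, x, count=1):
        i = x + 1
        while i <= self.size:
            self.tree[i] += count
            i += i & -i

    def kth(self, k):                           # k = 1 returns the minimum
        pos = 0
        bit = 1 << self.size.bit_length()
        while bit:
            nxt = pos + bit
            if nxt <= self.size and self.tree[nxt] < k:
                pos = nxt
                k -= self.tree[nxt]
            bit >>= 1
        return pos                              # 0-based value

t = OrderStatTree(65536)
for v in (5, 1, 9, 1):
    t.insert(v)
print(t.kth(1), t.kth(2), t.kth(3), t.kth(4))   # 1 1 5 9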
I think you're looking for the quickselect or median-of-medians algorithm.
Here's the algorithm. However, I still don't know what it's called.
#!/usr/bin/env python
"""
This module declares Histogram, a class that maintains a histogram.
It can insert and remove values from a predetermined range, and return order
statistics. Insert, remove, and query are logarithmic time in the size of the
range. The space requirement is linear in the size of the range.
"""
import numpy as np

class Histogram:
    def __init__(self, size):
        """Create the data structure that holds elements in the range
        [0, size)."""
        self.__data = np.zeros(size, np.int32)
        self.total = 0

    def size(self):
        return self.__data.shape[0]

    def __find(self, o, a, b):
        if b == a + 1:
            return a
        mid = (b - a) // 2 + a
        if o > self.__data[mid]:
            return self.__find(o - self.__data[mid], mid, b)
        return self.__find(o, a, mid)

    def find(self, o):
        """Return the o'th smallest element in the data structure. Takes
        O(log(size)) time."""
        return self.__find(o + 1, 0, self.size())

    def __alter(self, x, a, b, delta):
        if b == a + 1:
            self.total += delta
            return
        mid = (b - a) // 2 + a
        if x >= mid:
            self.__alter(x, mid, b, delta)
        else:
            self.__data[mid] += delta
            self.__alter(x, a, mid, delta)

    def insert(self, x):
        """Inserts element x into the data structure in O(log(size)) time."""
        assert 0 <= x < self.size()
        self.__alter(x, 0, self.size(), +1)

    def remove(self, x):
        """Removes element x from the data structure in O(log(size)) time."""
        assert 0 <= x < self.size()
        self.__alter(x, 0, self.size(), -1)

    def display(self):
        print(self.__data)

def histogram_test():
    size = 100
    total = 100
    h = Histogram(size)
    data = np.random.randint(0, size, total)
    for x in data:
        h.insert(x)
    data.sort()
    for i in range(total):
        assert h.find(i) == data[i]
    assert h.find(total + 1) == size - 1
    h.display()

Select k random elements from a list whose elements have weights

Selecting without any weights (equal probabilities) is beautifully described here.
I was wondering if there is a way to convert this approach to a weighted one.
I am also interested in other approaches as well.
Update: Sampling without replacement
If the sampling is with replacement, you can use this algorithm (implemented here in Python):
import random

items = [(10, "low"),
         (100, "mid"),
         (890, "large")]

def weighted_sample(items, n):
    total = float(sum(w for w, v in items))
    i = 0
    w, v = items[0]
    while n:
        x = total * (1 - random.random() ** (1.0 / n))
        total -= x
        while x > w:
            x -= w
            i += 1
            w, v = items[i]
        w -= x
        yield v
        n -= 1
This is O(n + m) where m is the number of items.
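For example, a quick usage sketch (my addition), assuming the items list and weighted_sample defined above:

from collections import Counter

counts = Counter(v for _ in range(10000) for v in weighted_sample(items, 2))
print(counts)   # 'large' dominates the samples; 'low' shows up only rarely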
Why does this work? It is based on the following algorithm:
def n_random_numbers_decreasing(v, n):
    """Like reversed(sorted(v * random() for i in range(n))),
    but faster because we avoid sorting."""
    while n:
        v *= random.random() ** (1.0 / n)
        yield v
        n -= 1
The function weighted_sample is just this algorithm fused with a walk of the items list to pick out the items selected by those random numbers.
This in turn works because the probability that n random numbers 0..v will all happen to be less than z is P = (z/v)^n. Solve for z, and you get z = v * P^(1/n). Substituting a random number for P picks the largest number with the correct distribution; and we can just repeat the process to select all the other numbers.
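A quick empirical check of that claim (my own sketch, assuming n_random_numbers_decreasing from above is in scope): the first value it yields is distributed like the maximum of n uniforms on [0, v], whose mean is v*n/(n + 1).

n, trials = 5, 100000
mean_largest = sum(next(n_random_numbers_decreasing(1.0, n)) for _ in range(trials)) / trials
print(mean_largest)   # should be close to n / (n + 1) = 0.8333...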
If the sampling is without replacement, you can put all the items into a binary heap, where each node caches the total of the weights of all items in that subheap. Building the heap is O(m). Selecting a random item from the heap, respecting the weights, is O(log m). Removing that item and updating the cached totals is also O(log m). So you can pick n items in O(m + n log m) time.
(Note: "weight" here means that every time an element is selected, the remaining possibilities are chosen with probability proportional to their weights. It does not mean that elements appear in the output with a likelihood proportional to their weights.)
Here's an implementation of that, plentifully commented:
import random

class Node:
    # Each node in the heap has a weight, value, and total weight.
    # The total weight, self.tw, is self.w plus the weight of any children.
    __slots__ = ['w', 'v', 'tw']
    def __init__(self, w, v, tw):
        self.w, self.v, self.tw = w, v, tw

def rws_heap(items):
    # h is the heap. It's like a binary tree that lives in an array.
    # It has a Node for each pair in `items`. h[1] is the root. Each
    # other Node h[i] has a parent at h[i>>1]. Each node has up to 2
    # children, h[i<<1] and h[(i<<1)+1]. To get this nice simple
    # arithmetic, we have to leave h[0] vacant.
    h = [None]                          # leave h[0] vacant
    for w, v in items:
        h.append(Node(w, v, w))
    for i in range(len(h) - 1, 1, -1):  # total up the tws
        h[i>>1].tw += h[i].tw           # add h[i]'s total to its parent
    return h

def rws_heap_pop(h):
    gas = h[1].tw * random.random()     # start with a random amount of gas

    i = 1                     # start driving at the root
    while gas >= h[i].w:      # while we have enough gas to get past node i:
        gas -= h[i].w         #   drive past node i
        i <<= 1               #   move to first child
        if gas >= h[i].tw:    #   if we have enough gas:
            gas -= h[i].tw    #     drive past first child and descendants
            i += 1            #     move to second child
    w = h[i].w                # out of gas! h[i] is the selected node.
    v = h[i].v

    h[i].w = 0                # make sure this node isn't chosen again
    while i:                  # fix up total weights
        h[i].tw -= w
        i >>= 1

    return v

def random_weighted_sample_no_replacement(items, n):
    heap = rws_heap(items)              # just make a heap...
    for i in range(n):
        yield rws_heap_pop(heap)        # and pop n items off it.
If the sampling is with replacement, use the roulette-wheel selection technique (often used in genetic algorithms):
sort the weights
compute the cumulative weights
pick a random number in [0,1]*totalWeight
find the interval in which this number falls into
select the element corresponding to that interval
repeat k times
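A minimal Python sketch of those steps for the with-replacement case (my own illustration; bisect on the cumulative weights does the "find the interval" step):

import bisect
import itertools
import random

def roulette_wheel(items, weights, k):
    cumulative = list(itertools.accumulate(weights))    # interval boundaries
    total = cumulative[-1]
    return [items[bisect.bisect_right(cumulative, random.random() * total)]
            for _ in range(k)]

print(roulette_wheel(["low", "mid", "large"], [10, 100, 890], 5))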
If the sampling is without replacement, you can adapt the above technique by removing the selected element from the list after each iteration and then re-normalizing the weights so that they sum to 1 (a valid probability distribution).
I know this is a very old question, but I think there's a neat trick to do this in O(n) time if you apply a little math!
The exponential distribution has two very useful properties.
Given n samples from different exponential distributions with different rate parameters, the probability that a given sample is the minimum is equal to its rate parameter divided by the sum of all rate parameters.
It is "memoryless". So if you already know the minimum, then the probability that any of the remaining elements is the 2nd-to-min is the same as the probability that if the true min were removed (and never generated), that element would have been the new min. This seems obvious, but I think because of some conditional probability issues, it might not be true of other distributions.
Using fact 1, we know that choosing a single element can be done by generating these exponential distribution samples with rate parameter equal to the weight, and then choosing the one with minimum value.
Using fact 2, we know that we don't have to re-generate the exponential samples. Instead, just generate one for each element, and take the k elements with lowest samples.
Finding the lowest k can be done in O(n). Use the Quickselect algorithm to find the k-th element, then simply take another pass through all elements and output all lower than the k-th.
A useful note: if you don't have immediate access to a library to generate exponential distribution samples, it can be easily done by: -ln(rand())/weight
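Putting those two facts together, here is a small sketch (my own, not the answerer's code) of weighted sampling without replacement via exponential keys; heapq.nsmallest makes it O(n log k), and a quickselect pass instead would give the O(n) mentioned above:

import heapq
import math
import random

def weighted_sample_without_replacement(items, weights, k):
    # one exponential draw per item, with rate equal to its weight;
    # the k items with the smallest draws form the sample
    keys = [(-math.log(1.0 - random.random()) / w, item)
            for item, w in zip(items, weights)]
    return [item for _, item in heapq.nsmallest(k, keys)]

print(weighted_sample_without_replacement(["a", "b", "c", "d"], [1, 2, 3, 4], 2))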
I've done this in Ruby
https://github.com/fl00r/pickup
require 'pickup'
pond = {
"selmon" => 1,
"carp" => 4,
"crucian" => 3,
"herring" => 6,
"sturgeon" => 8,
"gudgeon" => 10,
"minnow" => 20
}
pickup = Pickup.new(pond, uniq: true)
pickup.pick(3)
#=> [ "gudgeon", "herring", "minnow" ]
pickup.pick
#=> "herring"
pickup.pick
#=> "gudgeon"
pickup.pick
#=> "sturgeon"
If you want to generate large arrays of random integers with replacement, you can use piecewise linear interpolation. For example, using NumPy/SciPy:
import numpy
import scipy.interpolate

def weighted_randint(weights, size=None):
    """Given an n-element vector of weights, randomly sample
    integers up to n with probabilities proportional to weights"""
    n = weights.size
    # normalize so that the weights sum to unity
    weights = weights / numpy.linalg.norm(weights, 1)
    # cumulative sum of weights
    cumulative_weights = weights.cumsum()
    # piecewise-linear interpolating function whose domain is
    # the unit interval and whose range is the integers up to n
    f = scipy.interpolate.interp1d(
        numpy.hstack((0.0, cumulative_weights)),
        numpy.arange(n + 1), kind='linear')
    return f(numpy.random.random(size=size)).astype(int)
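A usage sketch for the function above (my addition), just to show the expected call shape and frequencies:

samples = weighted_randint(numpy.array([1.0, 2.0, 7.0]), size=100000)
print(numpy.bincount(samples) / samples.size)   # roughly [0.1, 0.2, 0.7]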
This is not effective if you want to sample without replacement.
Here's a Go implementation from geodns:
package foo

import (
    "log"
    "math/rand"
)

type server struct {
    Weight int
    data   interface{}
}

func foo(servers []server) []server {
    // servers list is already sorted by the Weight attribute

    // number of items to pick
    max := 4

    result := make([]server, max)

    sum := 0
    for _, r := range servers {
        sum += r.Weight
    }

    for si := 0; si < max; si++ {
        n := rand.Intn(sum + 1)
        s := 0

        for i := range servers {
            s += int(servers[i].Weight)
            if s >= n {
                log.Println("Picked record", i, servers[i])
                sum -= servers[i].Weight
                result[si] = servers[i]

                // remove the server from the list
                servers = append(servers[:i], servers[i+1:]...)
                break
            }
        }
    }

    return result
}
If you want to pick x elements from a weighted set without replacement such that elements are chosen with a probability proportional to their weights:
import random

def weighted_choose_subset(weighted_set, count):
    """Return a random sample of count elements from a weighted set.

    weighted_set should be a sequence of tuples of the form
    (item, weight), for example: [('a', 1), ('b', 2), ('c', 3)]

    Each element from weighted_set shows up at most once in the
    result, and the relative likelihood of two particular elements
    showing up is equal to the ratio of their weights.

    This works as follows:

    1.) Line up the items along the number line from [0, the sum
    of all weights) such that each item occupies a segment of
    length equal to its weight.

    2.) Randomly pick a number "start" in the range [0, total
    weight / count).

    3.) Find all the points "start + n/count" (for all integers n
    such that the point is within our segments) and yield the set
    containing the items marked by those points.

    Note that this implementation may not return each possible
    subset. For example, with the input ([('a', 1), ('b', 1),
    ('c', 1), ('d', 1)], 2), it may only produce the sets ['a',
    'c'] and ['b', 'd'], but it will do so such that the weights
    are respected.

    This implementation only works for nonnegative integral
    weights. The highest weight in the input set must be less
    than the total weight divided by the count; otherwise it would
    be impossible to respect the weights while never returning
    that element more than once per invocation.
    """
    if count == 0:
        return []

    total_weight = 0
    max_weight = 0
    borders = []
    for item, weight in weighted_set:
        if weight < 0:
            raise RuntimeError("All weights must be positive integers")
        # Scale up weights so dividing total_weight / count doesn't truncate:
        weight *= count
        total_weight += weight
        borders.append(total_weight)
        max_weight = max(max_weight, weight)

    step = int(total_weight / count)

    if max_weight > step:
        raise RuntimeError(
            "Each weight must be less than total weight / count")

    next_stop = random.randint(0, step - 1)

    results = []
    current = 0
    for i in range(count):
        while borders[current] <= next_stop:
            current += 1
        results.append(weighted_set[current][0])
        next_stop += step

    return results
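For example (my addition, using the function above):

print(weighted_choose_subset([('a', 1), ('b', 2), ('c', 3), ('d', 4)], 2))
# e.g. ['c', 'd']; over many calls 'd' is included about four times as often as 'a'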
In the question you linked to, Kyle's solution would work with a trivial generalization.
Scan the list and sum the total weights. Then the probability to choose an element should be:
1 - (1 - (#needed/(weight left)))/(weight at n). After visiting a node, subtract its weight from the total. Also, if you need n and have n left, you have to stop explicitly.
You can check that with everything having weight 1, this simplifies to Kyle's solution.
Edited: (had to rethink what twice as likely meant)
This one does exactly that with O(n) and no excess memory usage. I believe this is a clever and efficient solution easy to port to any language. The first two lines are just to populate sample data in Drupal.
function getNrandomGuysWithWeight($numitems){
    $q = db_query('SELECT id, weight FROM theTableWithTheData');
    $q = $q->fetchAll();

    $accum = 0;
    foreach($q as $r){
        $accum += $r->weight;
        $r->weight = $accum;
    }

    $out = array();

    while(count($out) < $numitems && count($q)){
        $n = rand(0, $accum);
        $lessaccum = NULL;
        $prevaccum = 0;
        $idxrm = 0;
        foreach($q as $i => $r){
            if(($lessaccum == NULL) && ($n <= $r->weight)){
                $out[] = $r->id;
                $lessaccum = $r->weight - $prevaccum;
                $accum -= $lessaccum;
                $idxrm = $i;
            }else if($lessaccum){
                $r->weight -= $lessaccum;
            }
            $prevaccum = $r->weight;
        }
        unset($q[$idxrm]);
    }
    return $out;
}
I'm putting here a simple solution for picking 1 item; you can easily expand it for k items (Java style):
double random = Math.random();
double sum = 0;
for (int i = 0; i < items.length; i++) {
    val = items[i];
    sum += val.getValue();
    if (sum > random) {
        selected = val;
        break;
    }
}
I have implemented an algorithm similar to Jason Orendorff's idea in Rust here. My version additionally supports bulk operations: insert and remove (when you want to remove a bunch of items given by their ids, not through the weighted selection path) from the data structure in O(m + log n) time, where m is the number of items to remove and n the number of items stored.
Sampling without replacement with recursion - an elegant and very short solution in C#
//how many ways we can choose 4 out of 60 students, so that every time we choose a different 4
class Program
{
    static void Main(string[] args)
    {
        int group = 60;
        int studentsToChoose = 4;

        Console.WriteLine(FindNumberOfStudents(studentsToChoose, group));
    }

    private static int FindNumberOfStudents(int studentsToChoose, int group)
    {
        if (studentsToChoose == group || studentsToChoose == 0)
            return 1;

        return FindNumberOfStudents(studentsToChoose, group - 1) + FindNumberOfStudents(studentsToChoose - 1, group - 1);
    }
}
I just spent a few hours trying to get behind the algorithms underlying sampling without replacement out there, and this topic is more complex than I initially thought. That's exciting! For the benefit of future readers (have a good day!) I document my insights here, including a ready-to-use function which respects the given inclusion probabilities further below. A nice and quick mathematical overview of the various methods can be found here: Tillé: Algorithms of sampling with equal or unequal probabilities. For example, Jason's method can be found on page 46. The caveat with his method is that the weights are not proportional to the inclusion probabilities, as also noted in the document. Actually, the i-th inclusion probabilities can be recursively computed as follows:
def inclusion_probability(i, weights, k):
    """
    Computes the inclusion probability of the i-th element
    in a randomly sampled k-tuple using Jason's algorithm
    (see https://stackoverflow.com/a/2149533/7729124)
    """
    if k <= 0:
        return 0
    cum_p = 0
    for j, weight in enumerate(weights):
        # compute the probability of j being selected considering the weights
        p = weight / sum(weights)

        if i == j:
            # if this is the target element, we don't have to go deeper,
            # since we know that i is included
            cum_p += p
        else:
            # if this is not the target element, then we compute the conditional
            # inclusion probability of i under the constraint that j is included
            cond_i = i if i < j else i-1
            cond_weights = weights[:j] + weights[j+1:]
            cond_p = inclusion_probability(cond_i, cond_weights, k-1)
            cum_p += p * cond_p
    return cum_p
And we can check the validity of the function above by comparing
In : for i in range(3): print(i, inclusion_probability(i, [1,2,3], 2))
0 0.41666666666666663
1 0.7333333333333333
2 0.85
to
In : import collections, itertools
In : sample_tester = lambda f: collections.Counter(itertools.chain(*(f() for _ in range(10000))))
In : sample_tester(lambda: random_weighted_sample_no_replacement([(1,'a'),(2,'b'),(3,'c')],2))
Out: Counter({'a': 4198, 'b': 7268, 'c': 8534})
One way - also suggested in the document above - to specify the inclusion probabilities is to compute the weights from them. The whole complexity of the question at hand stems from the fact that one cannot do that directly, since one basically has to invert the recursion formula; symbolically, I claim this is impossible. Numerically it can be done using all kinds of methods, e.g. Newton's method. However, the complexity of inverting the Jacobian using plain Python quickly becomes unbearable; I really recommend looking into numpy.random.choice in this case.
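For reference, a minimal numpy.random.choice call for this setting (my addition, not from the original answer):

import numpy as np

weights = np.array([1.0, 2.0, 3.0, 4.0])
p = weights / weights.sum()
# draw 2 distinct indices according to p; note that, much like Jason's method,
# the resulting inclusion probabilities are not guaranteed to be exactly 2 * p
sample = np.random.choice(len(weights), size=2, replace=False, p=p)
print(sample)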
Luckily there is a method using plain Python which might or might not be sufficiently performant for your purposes; it works great if there aren't that many different weights. You can find the algorithm on pages 75-76. It works by splitting up the sampling process into parts with the same inclusion probabilities, i.e. we can use random.sample again! I am not going to explain the principle here since the basics are nicely presented on page 69. Here is the code with hopefully a sufficient amount of comments:
import random

def sample_no_replacement_exact(items, k, best_effort=False, random_=None, ε=1e-9):
    """
    Returns a random sample of k elements from items, where items is a list of
    tuples (weight, element). The inclusion probability of an element in the
    final sample is given by
        k * weight / sum(weights).

    Note that the function raises if an inclusion probability cannot be
    satisfied, e.g. the following call is obviously illegal:
        sample_no_replacement_exact([(1,'a'),(2,'b')],2)

    Since selecting two elements means selecting both all the time,
    'b' cannot be selected twice as often as 'a'. In general it can be hard to
    spot if the weights are illegal and the function does *not* always raise
    an exception in that case. To remedy the situation you can pass
    best_effort=True which redistributes the inclusion probability mass
    if necessary. Note that the inclusion probabilities will change
    if deemed necessary.

    The algorithm is based on the splitting procedure on page 75/76 in:
    http://www.eustat.eus/productosServicios/52.1_Unequal_prob_sampling.pdf
    Additional information can be found here:
    https://stackoverflow.com/questions/2140787/

    :param items: list of tuples of type weight,element
    :param k: length of resulting sample
    :param best_effort: fix inclusion probabilities if necessary,
                        (optional, defaults to False)
    :param random_: random module to use (optional, defaults to the
                    standard random module)
    :param ε: fuzziness parameter when testing for zero in the context
              of floating point arithmetic (optional, defaults to 1e-9)
    :return: random sample set of size k
    :exception: throws ValueError in case of bad parameters,
                throws AssertionError in case of algorithmic impossibilities
    """
    # random_ defaults to the random submodule
    if not random_:
        random_ = random

    # special case empty return set
    if k <= 0:
        return set()

    if k > len(items):
        raise ValueError("resulting tuple length exceeds number of elements (k > n)")

    # sort items by weight
    items = sorted(items, key=lambda item: item[0])

    # extract the weights and elements
    weights, elements = list(zip(*items))

    # compute the inclusion probabilities (short: π) of the elements
    scaling_factor = k / sum(weights)
    π = [scaling_factor * weight for weight in weights]

    # in case of best_effort: if an inclusion probability exceeds 1,
    # try to rebalance the probabilities such that:
    # a) no probability exceeds 1,
    # b) the probabilities still sum to k, and
    # c) the probability masses flow from top to bottom:
    #    [0.2, 0.3, 1.5] -> [0.2, 0.8, 1]
    # (remember that π is sorted)
    if best_effort and π[-1] > 1 + ε:
        # probability mass we still have to distribute
        debt = 0.
        for i in reversed(range(len(π))):
            if π[i] > 1.:
                # an 'offender', take away excess
                debt += π[i] - 1.
                π[i] = 1.
            else:
                # case π[i] < 1, i.e. 'safe' element
                # maximum we can transfer from debt to π[i] and still not
                # exceed 1 is computed by the minimum of:
                # a) 1 - π[i], and
                # b) debt
                max_transfer = min(debt, 1. - π[i])
                debt -= max_transfer
                π[i] += max_transfer
        assert debt < ε, "best effort rebalancing failed (impossible)"

    # make sure we are talking about probabilities
    if any(not (0 - ε <= π_i <= 1 + ε) for π_i in π):
        raise ValueError("inclusion probabilities not satisfiable: {}"
                         .format(list(zip(π, elements))))

    # special case equal probabilities
    # (up to fuzziness parameter, remember that π is sorted)
    if π[-1] < π[0] + ε:
        return set(random_.sample(elements, k))

    # compute the two possible lambda values, see formula 7 on page 75
    # (remember that π is sorted)
    λ1 = π[0] * len(π) / k
    λ2 = (1 - π[-1]) * len(π) / (len(π) - k)
    λ = min(λ1, λ2)

    # there are two cases now, see also page 69
    # CASE 1
    # with probability λ we are in the equal probability case
    # where all elements have the same inclusion probability
    if random_.random() < λ:
        return set(random_.sample(elements, k))

    # CASE 2:
    # with probability 1-λ we are in the case of a new sample without
    # replacement problem which is strictly simpler,
    # it has the following new probabilities (see page 75, π^{(2)}):
    new_π = [
        (π_i - λ * k / len(π))
        /
        (1 - λ)
        for π_i in π
    ]
    new_items = list(zip(new_π, elements))

    # the first few probabilities might be 0, remove them
    # NOTE: we make sure that floating point issues do not arise
    #       by using the fuzziness parameter
    while new_items and new_items[0][0] < ε:
        new_items = new_items[1:]

    # the last few probabilities might be 1, remove them and mark them as selected
    # NOTE: we make sure that floating point issues do not arise
    #       by using the fuzziness parameter
    selected_elements = set()
    while new_items and new_items[-1][0] > 1 - ε:
        selected_elements.add(new_items[-1][1])
        new_items = new_items[:-1]

    # the algorithm reduces the length of the sample problem,
    # it is guaranteed that:
    # if λ = λ1: the first item has probability 0
    # if λ = λ2: the last item has probability 1
    assert len(new_items) < len(items), "problem was not simplified (impossible)"

    # recursive call with the simpler sample problem
    # NOTE: we have to make sure that the selected elements are included
    return sample_no_replacement_exact(
        new_items,
        k - len(selected_elements),
        best_effort=best_effort,
        random_=random_,
        ε=ε
    ) | selected_elements
Example:
In : sample_no_replacement_exact([(1,'a'),(2,'b'),(3,'c')],2)
Out: {'b', 'c'}
In : import collections, itertools
In : sample_tester = lambda f: collections.Counter(itertools.chain(*(f() for _ in range(10000))))
In : sample_tester(lambda: sample_no_replacement_exact([(1,'a'),(2,'b'),(3,'c'),(4,'d')],2))
Out: Counter({'a': 2048, 'b': 4051, 'c': 5979, 'd': 7922})
The weights sum up to 10, hence the inclusion probabilities compute to: a → 20%, b → 40%, c → 60%, d → 80%. (Sum: 200% = k.) It works!
Just one word of caution for production use of this function: it can be very hard to spot illegal inputs for the weights. An obvious illegal example is
In: sample_no_replacement_exact([(1,'a'),(2,'b')],2)
ValueError: inclusion probabilities not satisfiable: [(0.6666666666666666, 'a'), (1.3333333333333333, 'b')]
b cannot appear twice as often as a, since both always have to be selected. There are more subtle examples. To avoid an exception in production, just use best_effort=True, which rebalances the inclusion probability mass such that we always have a valid distribution. Obviously this might change the inclusion probabilities.
I used an associative map (weight, object), for example:
{
    (10, "low"),
    (100, "mid"),
    (10000, "large")
}

total = 10110
Pick a random number between 0 and 'total' and iterate over the keys until this number falls into the corresponding range.
