Find minimum distance between points - algorithm

I have a set of points (x, y).
I need to return the two points with the minimal distance between them.
I am using this:
http://www.cs.ucsb.edu/~suri/cs235/ClosestPair.pdf
but I don't really understand how the algorithm works.
Can someone explain more simply how it works, or suggest another idea?
Thanks!

If the number of points is small, you can use the brute-force approach, i.e.:
for each point, find the closest point among the other points and keep track of the minimum distance and the two indices that achieve it (see the sketch just below).
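For reference, a minimal brute-force sketch in Python (function and variable names are my own, not from the linked notes); it compares every pair and runs in O(n^2):

import math

def closest_pair_brute_force(points):
    """Return the pair of points with the smallest Euclidean distance, in O(n^2) time."""
    best_pair, best_dist = None, float('inf')
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.hypot(points[i][0] - points[j][0], points[i][1] - points[j][1])
            if d < best_dist:
                best_dist, best_pair = d, (points[i], points[j])
    return best_pair

print(closest_pair_brute_force([(0, 0), (7, 6), (2, 20), (12, 5), (5, 8)]))  # ((7, 6), (5, 8))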
If the number of points is large, I think you may find the answer in this thread:
Shortest distance between points algorithm

The solution to the closest-pair problem with optimal time complexity O(n log n) is the divide-and-conquer approach, as mentioned in the document you have read.
Divide-and-conquer Approach for Closest-Pair Problem
The easiest way to understand this algorithm is to read an implementation of it in a high-level language such as Python (sometimes an algorithm or its pseudocode is harder to follow than real code):
# closest pairs by divide and conquer
# David Eppstein, UC Irvine, 7 Mar 2002

from __future__ import generators

def closestpair(L):
    def square(x): return x*x
    def sqdist(p,q): return square(p[0]-q[0])+square(p[1]-q[1])

    # Work around ridiculous Python inability to change variables in outer scopes
    # by storing a list "best", where best[0] = smallest sqdist found so far and
    # best[1] = pair of points giving that value of sqdist. Then best itself is never
    # changed, but its elements best[0] and best[1] can be.
    #
    # We use the pair L[0],L[1] as our initial guess at a small distance.
    best = [sqdist(L[0],L[1]), (L[0],L[1])]

    # check whether pair (p,q) forms a closer pair than one seen already
    def testpair(p,q):
        d = sqdist(p,q)
        if d < best[0]:
            best[0] = d
            best[1] = p,q

    # merge two sorted lists by y-coordinate
    def merge(A,B):
        i = 0
        j = 0
        while i < len(A) or j < len(B):
            if j >= len(B) or (i < len(A) and A[i][1] <= B[j][1]):
                yield A[i]
                i += 1
            else:
                yield B[j]
                j += 1

    # Find closest pair recursively; returns all points sorted by y coordinate
    def recur(L):
        if len(L) < 2:
            return L
        split = len(L)//2
        splitx = L[split][0]
        L = list(merge(recur(L[:split]), recur(L[split:])))

        # Find possible closest pair across split line
        # Note: this is not quite the same as the algorithm described in class, because
        # we use the global minimum distance found so far (best[0]), instead of
        # the best distance found within the recursive calls made by this call to recur().
        E = [p for p in L if abs(p[0]-splitx) < best[0]]
        for i in range(len(E)):
            for j in range(1,8):
                if i+j < len(E):
                    testpair(E[i],E[i+j])
        return L

    L.sort()
    recur(L)
    return best[1]

closestpair([(0,0),(7,6),(2,20),(12,5),(16,16),(5,8),\
             (19,7),(14,22),(8,19),(7,29),(10,11),(1,13)])
# returns: (7,6),(5,8)
Taken from: https://www.ics.uci.edu/~eppstein/161/python/closestpair.py
Detailed explanation:
First we define a squared Euclidean distance function to avoid code repetition.
def square(x): return x*x # Define square function
def sqdist(p,q): return square(p[0]-q[0])+square(p[1]-q[1]) # Define squared Euclidean distance function
Then we are taking the first two points as our initial best guess:
best = [sqdist(L[0],L[1]), (L[0],L[1])]
This function compares the squared distance of a new pair (p,q) with our current best pair and updates the best pair if needed:
def testpair(p,q):
    d = sqdist(p,q)
    if d < best[0]:
        best[0] = d
        best[1] = p,q
def merge(A,B): simply merges two lists that are already sorted by y-coordinate (the two halves produced by the split) back into one y-sorted list.
def recur(L): is the actual body of the algorithm, so I will explain it in more detail:
if len(L) < 2:
    return L
This part terminates the recursion when fewer than two points are left in the list.
Split the list in half: split = len(L)//2
Recurse on each half (the function calls itself), then merge the two y-sorted results: L = list(merge(recur(L[:split]), recur(L[split:])))
Finally, these nested loops test each point of the strip E (the points lying within the current best distance of the split line) against the next few points above it in y order:
E = [p for p in L if abs(p[0]-splitx) < best[0]]
for i in range(len(E)):
    for j in range(1,8):
        if i+j < len(E):
            testpair(E[i],E[i+j])
As a result, if a closer pair is found, the best pair is updated.

So they solve the problem in many dimensions using a divide-and-conquer approach. Divide-and-conquer (like binary search) is mega fast: basically, if you can split a dataset into two halves, and keep doing that until you find the info you want, you are usually doing it about as fast as humanly and computerly possible.
For this question, it means that we divide the data set of points into two sets, S1 and S2.
All the points are numerical, right? So we have to pick some number at which to divide the dataset.
So we pick some number m and call it the median.
So let's take a look at an example:
(14, 2)
(11, 2)
(5, 2)
(15, 2)
(0, 2)
What's the closest pair?
Well, they all have the same Y coordinate, so we can look at the Xs only... the shortest X distance is from 14 to 15, a distance of 1.
How can we figure that out using divide-and-conquer?
We look at the greatest value of X and the smallest value of X and choose the point halfway between them (call it the median) as the dividing line that makes our two sets.
Our median is (0 + 15) / 2 = 7.5 in this example.
We then make 2 sets
S1: (0, 2) and (5, 2)
S2: (11, 2) and (14, 2) and (15, 2)
Median: 7.5
We must keep track of the median for every split, because that is actually a vital piece of knowledge in this algorithm. They don't show it very clearly on the slides, but knowing the median value (where you split a set to make two sets) is essential to solving this question quickly.
We keep track of a value they call delta in the algorithm. Ugh, I don't know why most computer scientists are so bad at naming variables; you need descriptive names when you code so you don't forget what you coded 10 years ago. So instead of delta, let's call this value our-shortest-twig-from-the-median-so-far.
Since we have the median value of 7.5, let's go and see what our-shortest-twig-from-the-median-so-far is for Set 1 and Set 2, respectively:
Set 1: shortest-twig-from-the-median-so-far is 2.5 (from 5 to m, where m is 7.5)
Set 2: shortest-twig-from-the-median-so-far is 3.5 (looking at 11 to m)
So I think the key take-away from the algorithm is that this shortest-twig-from-the-median-so-far is something that you're trying to improve upon every time you divide a set.
Since S1 in our case has 2 elements only, we are done with the left set, and we have 3 in the right set, so we continue dividing:
S2 = { (11,2) (14,2) (15,2) }
What do you do? You make a new median, call it S2-median
S2-median is halfway between 15 and 11... or 13, right? My math may be fuzzy, but I think that's right so far.
So let's look at the shortest-twig-so-far-for-our-right-side-with-median-thirteen ...
15 to 13 is... 2
11 to 13 is .... 2
14 to 13 is ... 1 (!!!)
So our m value or shortest-twig-from-the-median-so-far is improved (where we updated our median from before because we're in a new chunk or Set...)
Now that we've found it we know that (14, 2) is one of the points that satisfies the shortest pair equation. You can then check exhaustively against the points in this subset (15, 11, 14) to see which one is the closer one.
Clearly, (15,2) and (14,2) are the winning pair in this case.
Does that make sense? You must keep track of the median when you cut the set, and keep a new median every time you cut the set, until you have only 2 elements remaining on each side (or in our case 3).
The magic is in the median or shortest-twig-from-the-median-so-far
Thanks for asking this question, I went in not knowing how this algorithm worked but found the right highlighted bullet point on the slide and rolled with it. Do you get it now? I don't know how to explain the median magic other than binary search is f000ing awesome.

Related

Shortest Path for n entities

I'm trying to solve a problem that is about minimizing the distance traveled by a group of n entities who have to go through a group of x points in a given order.
The n entities all start in the same position (1,1) and then I'm given x points that are in a queue and have to be "answered" in the correct order. However, I want the distance to be minimal.
My approach so far was to order the entities by increasing distance to the point that is next in line. Then, going from the closest to the furthest, I'd check whether an entity's distance to the next point in line is bigger than its distance to the point that comes afterwards. The closest entity that did not fulfill this condition went to answer. If all of them were closer to the point that came afterwards, I'd reorder them by increasing distance to that later point and send the one furthest from it to answer the current point. Since this is a test problem I'm doing as practice for a competition, I know what the result should be for my test case, and it seems I'm doing this wrong.
How should I implement such an algorithm that guarantees that the distance is minimal?
The algorithm you've described sounds like a greedy search algorithm. Greedy algorithms are not guaranteed to find optimal solutions except under specific conditions that don't seem to hold here.
This looks like a candidate for a dynamic programming formulation. Alternatively, you can use heuristic search such as the A* search algorithm. I'd go with the latter if I was in the competition. See the link for a description of how the algorithm works, how to implement it, and how you might apply it to your problem.
Although the points need to be visited in order, which bounds the number of possible arrangements, I couldn't think of a more efficient way than the following formulation. Let f(ns, i) represent the optimal arrangement up to the ith point, where ns is the list of the last chosen point for each entity that has at least one point. Then we have at most two kinds of choices: either start a new entity if we haven't run out, or try the current point as the next visit for each existing entity.
Python recursion:
import math

def d(p1, p2):
    return math.sqrt(math.pow(p1[0] - p2[0], 2) + math.pow(p1[1] - p2[1], 2))

def f(ps, n):
    def g(ns, i):
        if i == len(ps):
            return 0
        # start a new entity if we haven't run out
        best = d((0,0), ps[i]) + g(ns[:] + [i], i + 1) if len(ns) < n else float('inf')
        # try the current point as the next visit for each entity
        for entity_idx, point_idx in enumerate(ns):
            _ns = ns[:]
            _ns[entity_idx] = i
            best = min(best, d(ps[point_idx], ps[i]) + g(_ns, i + 1))
        return best
    return d((0,0), ps[0]) + g([0], 1)

Example layout and result:
"""
 p5
 p3
 p1
(0,0) p4 p6
 p2
"""
points = [(0,1), (0,-1), (0,2), (2,0), (0,3), (3,0)]
print f(points, 3) # 7.0

Find North-East path with most points [duplicate]

In Cracking the Coding Interview, Fourth Edition, there is such a problem:
A circus is designing a tower routine consisting of people standing
atop one another's shoulders. For practical and aesthetic reasons,
each person must be both shorter and lighter than the person below him
or her. Given the heights and weights of each person in the circus,
write a method to compute the largest possible number of people in
such a tower.
EXAMPLE: Input (ht, wt): (65, 100) (70, 150) (56, 90)
(75, 190) (60, 95) (68, 110)
Output: The longest tower is length 6 and
includes from top to bottom: (56, 90) (60,95) (65,100) (68,110)
(70,150) (75,190)
Here is its solution in the book:
Step 1: Sort all items by height first, and then by weight. This means that if all the heights are unique, the items will be sorted by their height; if heights are the same, items will be sorted by their weight.
Step 2: Find the longest sequence which contains increasing heights and increasing weights.
To do this, we:
a) Start at the beginning of the sequence. Currently, max_sequence is empty.
b) If, for the next item, the height and the weight are not greater than those of the previous item, we mark this item as "unfit".
c) If the sequence found has more items than "max sequence", it becomes "max sequence".
d) After that the search is repeated from the "unfit item", until we reach the end of the original sequence.
I have some questions about its solutions.
Q1
I believe this solution is wrong.
For example
(3,2) (5,9) (6,7) (7,8)
Obviously, (6,7) is an unfit item, but how about (7,8)? According to the solution, it is NOT unfit, as its h and w are both bigger than those of (6,7); however, it cannot be added to the sequence, because (7,8) does not fit with (5,9).
Am I right?
If I am right, what is the fix?
Q2
I believe that even if there is a fix for the above solution, this style of solution will take at least O(n^2), because it needs to iterate again and again, according to step 2-d.
So is it possible to have an O(n log n) solution?
You can solve the problem with dynamic programming.
Sort the troupe by height. For simplicity, assume all the heights h_i and weights w_j are distinct. Thus h_i is an increasing sequence.
We compute a sequence T_i, where T_i is a maximal-size tower with person i at the bottom. T_1 is simply {1}. We can deduce each subsequent T_k from the earlier T_j: find the largest tower T_j whose bottom person is light enough to stand on k (w_j < w_k), and place that whole tower on top of k.
The largest possible tower from the troupe is then the largest of the T_i.
This algorithm takes O(n**2) time, where n is the cardinality of the troupe.
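A minimal sketch of that O(n**2) dynamic program in Python (function and variable names are my own, not from the book or the answer above); size[k] is the size of the largest tower with person k at the bottom, and prev[k] lets us reconstruct it:

def largest_tower(people):
    """people: list of (height, weight) tuples.
    Returns one largest valid tower, listed from top (smallest) to bottom (largest)."""
    people = sorted(people)              # sort by height, then by weight
    n = len(people)
    size = [1] * n                       # size[k]: people in the largest tower with person k at the bottom
    prev = [-1] * n                      # prev[k]: the person standing directly on top of k in that tower
    for k in range(n):
        for j in range(k):
            fits = people[j][0] < people[k][0] and people[j][1] < people[k][1]
            if fits and size[j] + 1 > size[k]:
                size[k] = size[j] + 1
                prev[k] = j
    # reconstruct the tower that ends at the best bottom person
    k = max(range(n), key=lambda i: size[i])
    tower = []
    while k != -1:
        tower.append(people[k])
        k = prev[k]
    return list(reversed(tower))

print(largest_tower([(65, 100), (70, 150), (56, 90), (75, 190), (60, 95), (68, 110)]))
# [(56, 90), (60, 95), (65, 100), (68, 110), (70, 150), (75, 190)]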
I tried solving this myself; I did not mean to give a ready-made solution, but I'm posting it anyway, more to check my own understanding and whether my code (Python) is OK and would work for all test cases. I tried 3 cases and it seemed to give the correct answer.
#!/usr/bin/python
#This function takes a list of tuples. Tuple(n):(height,weight) of nth person
def htower_len(ht_wt):
    ht_sorted = sorted(ht_wt,reverse=True)
    wt_sorted = sorted(ht_wt,key=lambda ht_wt:ht_wt[1])
    max_len = 1
    len1 = len(ht_sorted)
    i=0
    j=0
    while i < (len1-1):
        if(ht_sorted[i+1][1] < ht_sorted[0][1]):
            max_len = max_len+1
        i=i+1
    print "maximum tower length :" ,max_len
###Called above function with below sample app code.
testcase =1
print "Result of Test case ",testcase
htower_len([(5,75),(6.7,83),(4,78),(5.2,90)])
testcase = testcase + 1
print "Result of Test case ",testcase
htower_len([(65, 100),(70, 150),(56, 90),(75, 190),(60, 95),(68, 110)])
testcase = testcase + 1
print "Result of Test case ",testcase
htower_len([(3,2),(5,9),(6,7),(7,8)])
For example
(3,2) (5,9) (6,7) (7,8)
Obviously, (6,7) is an unfit item, but how about (7,8)?
In answer to your question: the algorithm first runs starting with (3,2) and gets the sequence (3,2) (5,9), marking (6,7) and (7,8) as unfit.
It then starts again at (6,7) (the first unfit item) and gets (6,7) (7,8), which also has length 2. Since there are no more "unfit" items, the search terminates with maximum length 2.
After first sorting the array by height and weight, my code checks what the largest tower would be if we grabbed any of the remaining tuples in the array (and possible subsequent tuples). In order to avoid re-computing sub-problems, solution_a is used to store the optimal max length from the tail of the input_array.
The beginning_index is the index from which we can consider grabbing elements from (the index from which we can consider people who could go below on the human stack), and beginning_tuple refers to the element/person higher up on the stack.
The sort takes O(n log n). The space used is O(n) for the solution_a array and the copy of the input_array.
def determine_largest_tower(beginning_index, a, beginning_tuple, solution_a):
    # base case
    if beginning_index >= len(a):
        return 0
    if solution_a[beginning_index] != -1: # already computed
        return solution_a[beginning_index]
    # recursive case
    max_len = 0
    for i in range(beginning_index, len(a)):
        # if we can grab that value, check what the max would be
        if a[i][0] >= beginning_tuple[0] and a[i][1] >= beginning_tuple[1]:
            max_len = max(1 + determine_largest_tower(i+1, a, a[i], solution_a), max_len)
    solution_a[beginning_index] = max_len
    return max_len

def algorithm_for_human_towering(input_array):
    a = sorted(input_array)
    return determine_largest_tower(0, a, (-1,-1), [-1] * len(a))

a = [(3,2),(5,9),(6,7),(7,8)]
print algorithm_for_human_towering(a)
Here is another way to approach the problem altogether, with code:
Algorithm
Sorting first by height and then by weight
Sorted array:
[(56, 90), (60, 95), (65, 100), (68, 110), (70, 150), (75, 190)]
Finding the length of the longest increasing subsequence of weights
Why is the longest increasing subsequence of weights the answer?
The people are sorted by increasing height,
so when we find a subsequence of people whose weights are also increasing,
the selected people are in increasing order of both height and weight and can therefore form a human tower.
For example:
[(56, 90) (60,95) (65,100) (68,110) (70,150) (75,190)]
Efficient Implementation
In the implementation below we maintain a list of increasing numbers and use bisect_left, which is implemented under the hood using binary search, to find the proper index for insertion.
Please note: the sequence built by the longest_increasing_sequence method might not be an actual longest increasing subsequence; however, its length will equal the length of the longest increasing subsequence.
Kindly refer to Longest increasing subsequence Efficient algorithms for more details.
The overall time complexity is O(n log(n)) as desired.
Code
from bisect import bisect_left

def human_tower(height, weight):
    def longest_increasing_sequence(A, get_property):
        lis = []
        for i in range(len(A)):
            x = get_property(A[i])
            j = bisect_left(lis, x)
            if j == len(lis):
                lis.append(x)
            else:
                lis[j] = x
        return len(lis)

    # Edge case, no people
    if 0 == len(height):
        return 0

    # Creating array of heights and weights
    people = [(h, w) for h, w in zip(height, weight)]

    # Sorting array first by height and then by weight
    people.sort()

    # Returning the length of the longest increasing sequence of weights
    return longest_increasing_sequence(people, lambda t : t[1])
assert 6 == human_tower([65,70,56,75,60,68], [100,150,90,190,95,110])

What data structure is conducive to discrete sampling? [duplicate]

Recently I needed to do weighted random selection of elements from a list, both with and without replacement. While there are well known and good algorithms for unweighted selection, and some for weighted selection without replacement (such as modifications of the reservoir algorithm), I couldn't find any good algorithms for weighted selection with replacement. I also wanted to avoid the reservoir method, as I was selecting a significant fraction of the list, which is small enough to hold in memory.
Does anyone have any suggestions on the best approach in this situation? I have my own solutions, but I'm hoping to find something more efficient, simpler, or both.
One of the fastest ways to draw many samples with replacement from an unchanging list is the alias method. The core intuition is that we can create a set of equal-sized bins for the weighted list that can be indexed very efficiently through bit operations, to avoid a binary search. It will turn out that, done correctly, we will need to store only two items from the original list per bin, and thus can represent the split with a single percentage.
Let's take the example of five equally weighted choices, (a:1, b:1, c:1, d:1, e:1).
To create the alias lookup:
Normalize the weights such that they sum to 1.0. (a:0.2 b:0.2 c:0.2 d:0.2 e:0.2) This is the probability of choosing each weight.
Find the smallest power of 2 greater than or equal to the number of variables, and create this number of partitions, |p|. Each partition represents a probability mass of 1/|p|. In this case, we create 8 partitions, each able to contain 0.125.
Take the variable with the least remaining weight, and place as much of its mass as possible in an empty partition. In this example, we see that a fills the first partition. (p1{a|null,1.0},p2,p3,p4,p5,p6,p7,p8) with (a:0.075, b:0.2 c:0.2 d:0.2 e:0.2)
If the partition is not filled, take the variable with the most weight, and fill the partition with that variable.
Repeat steps 3 and 4 until all of the weight from the original distribution has been assigned to partitions.
For example, if we run another iteration of 3 and 4, we see
(p1{a|null,1.0},p2{a|b,0.6},p3,p4,p5,p6,p7,p8) with (a:0, b:0.15 c:0.2 d:0.2 e:0.2) left to be assigned
At runtime:
Get a U(0,1) random number, say binary 0.001100000
bitshift it by lg2(|p|) to find the partition index. Thus, we shift it by 3, yielding 001.1, or position 1, and thus partition 2.
If the partition is split, use the decimal portion of the shifted random number to decide the split. In this case, the value is 0.5, and 0.5 < 0.6, so return a.
Here is some code and another explanation, but unfortunately it doesn't use the bitshifting technique, nor have I actually verified it.
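As a rough illustration only (this is my own sketch, not the code referenced above; names and tolerances are mine, and I use a multiply-and-truncate in place of the literal bit shift, which is equivalent when the table size is a power of two), the construction and lookup described above might look roughly like this:

import random

def build_alias(weights):
    """weights: dict mapping item -> positive weight. Returns a list of
    (first, second, split) partitions, each holding probability mass 1/len(table)."""
    total = float(sum(weights.values()))
    remaining = {k: v / total for k, v in weights.items()}   # step 1: normalize
    p = 1
    while p < len(weights):                                  # step 2: power-of-two partition count
        p *= 2
    cap = 1.0 / p
    table = []
    for _ in range(p):
        # step 3: the item with the least remaining mass goes in first
        first = min(remaining, key=remaining.get)
        placed = min(remaining[first], cap)
        remaining[first] -= placed
        if remaining[first] <= 1e-12:
            del remaining[first]
        split = placed / cap
        second = None
        if split < 1.0 and remaining:
            # step 4: top up the partition with the item that has the most mass left
            second = max(remaining, key=remaining.get)
            remaining[second] -= cap - placed
            if remaining[second] <= 1e-12:
                del remaining[second]
        table.append((first, second, split))
    return table

def draw(table):
    r = random.random() * len(table)      # multiply-and-truncate ~ the bit shift for power-of-two sizes
    idx = int(r)
    first, second, split = table[idx]
    return first if (r - idx) < split or second is None else second

table = build_alias({'a': 1, 'b': 1, 'c': 1, 'd': 1, 'e': 1})
print(draw(table))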
A simple approach that hasn't been mentioned here is one proposed in Efraimidis and Spirakis. In Python you could select m items from n >= m weighted items with strictly positive weights stored in weights, returning the selected indices, with:
import heapq
import math
import random
def WeightedSelectionWithoutReplacement(weights, m):
    elt = [(math.log(random.random()) / weights[i], i) for i in range(len(weights))]
    return [x[1] for x in heapq.nlargest(m, elt)]
This is very similar in structure to the first approach proposed by Nick Johnson. Unfortunately, that approach is biased in selecting the elements (see the comments on the method). Efraimidis and Spirakis proved that their approach is equivalent to random sampling without replacement in the linked paper.
Here's what I came up with for weighted selection without replacement:
import random

def WeightedSelectionWithoutReplacement(l, n):
    """Selects without replacement n random elements from a list of (weight, item) tuples."""
    l = sorted((random.random() * x[0], x[1]) for x in l)
    return l[-n:]
This is O(m log m) on the number of items in the list to be selected from. I'm fairly certain this will weight items correctly, though I haven't verified it in any formal sense.
Here's what I came up with for weighted selection with replacement:
import bisect
import random

def WeightedSelectionWithReplacement(l, n):
    """Selects with replacement n random elements from a list of (weight, item) tuples."""
    cuml = []
    total_weight = 0.0
    for weight, item in l:
        total_weight += weight
        cuml.append((total_weight, item))
    # bisect with a one-element tuple so tuples are compared with tuples, and return the item itself
    return [cuml[bisect.bisect(cuml, (random.random() * total_weight,))][1] for x in range(n)]
This is O(m + n log m), where m is the number of items in the input list, and n is the number of items to be selected.
I'd recommend you start by looking at section 3.4.2 of Donald Knuth's Seminumerical Algorithms.
If your arrays are large, there are more efficient algorithms in chapter 3 of Principles of Random Variate Generation by John Dagpunar. If your arrays are not terribly large or you're not concerned with squeezing out as much efficiency as possible, the simpler algorithms in Knuth are probably fine.
It is possible to do Weighted Random Selection with replacement in O(1) time, after first creating an additional O(N)-sized data structure in O(N) time. The algorithm is based on the Alias Method developed by Walker and Vose, which is well described here.
The essential idea is that each bin in a histogram would be chosen with probability 1/N by a uniform RNG. So we walk through it, and for any under-weighted bin, which would receive excess hits, we assign the excess to an over-weighted bin. For each bin, we store the percentage of hits which belong to it, and the partner bin for the excess. This version tracks small and large bins in place, removing the need for an additional stack. It uses the index of the partner (stored in bucket[1]) as an indicator that a bin has already been processed.
Here is a minimal Python implementation, based on the C implementation here:
import random

def prep(weights):
    data_sz = len(weights)
    factor = data_sz/float(sum(weights))
    data = [[w*factor, i] for i,w in enumerate(weights)]
    big = 0
    while big < data_sz and data[big][0] <= 1.0: big += 1
    for small, bucket in enumerate(data):
        if bucket[1] != small: continue   # already processed (a partner index was assigned)
        excess = 1.0 - bucket[0]
        while excess > 0:
            if big == data_sz: break
            bucket[1] = big
            bucket = data[big]
            bucket[0] -= excess
            excess = 1.0 - bucket[0]
            if excess >= 0:
                big += 1
                while big < data_sz and data[big][0] <= 1: big += 1
    return data

def sample(data):
    r = random.random()*len(data)
    idx = int(r)
    return data[idx][1] if r-idx > data[idx][0] else idx
Example usage:
TRIALS=1000
weights = [20,1.5,9.8,10,15,10,15.5,10,8,.2];
samples = [0]*len(weights)
data = prep(weights)
for _ in range(int(sum(weights)*TRIALS)):
    samples[sample(data)] += 1
result = [float(s)/TRIALS for s in samples]
err = [a-b for a,b in zip(result,weights)]
print(result)
print([round(e,5) for e in err])
print(sum([e*e for e in err]))
The following is a description of random weighted selection of an element of a
set (or multiset, if repeats are allowed), both with and without replacement in O(n) space
and O(log n) time.
It consists of implementing a binary search tree, sorted by the elements to be
selected, where each node of the tree contains:
the element itself (element)
the un-normalized weight of the element (elementweight), and
the sum of all the un-normalized weights of the left-child node and all of
its children (leftbranchweight).
the sum of all the un-normalized weights of the right-child node and all of
its children (rightbranchweight).
Then we randomly select an element from the BST by descending down the tree. A
rough description of the algorithm follows. The algorithm is given a node of
the tree. The values of leftbranchweight, rightbranchweight,
and elementweight of the node are summed, and each weight is divided by this
sum, resulting in the values leftbranchprobability,
rightbranchprobability, and elementprobability, respectively. Then a
random number between 0 and 1 (randomnumber) is obtained.
if the number is less than elementprobability,
remove the element from the BST as normal, updating leftbranchweight
and rightbranchweight of all the necessary nodes, and return the
element.
else if the number is less than (elementprobability + leftbranchprobability),
recurse on leftchild (run the algorithm using leftchild as node)
else
recurse on rightchild
When we finally find, using these weights, which element is to be returned, we either simply return it (with replacement) or we remove it and update relevant weights in the tree (without replacement).
DISCLAIMER: The algorithm is rough, and a treatise on the proper implementation
of a BST is not attempted here; rather, it is hoped that this answer will help
those who really need fast weighted selection without replacement (like I do).
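A minimal sketch of such a tree in Python (class and function names are my own, not from the answer above; removal without replacement is simplified to zeroing the chosen node's weight rather than performing a true BST delete):

import random

class Node:
    """BST node keyed by element, carrying un-normalized weights."""
    def __init__(self, element, weight):
        self.element = element
        self.weight = weight          # elementweight
        self.left = None
        self.right = None
        self.left_weight = 0.0        # leftbranchweight: total weight in the left subtree
        self.right_weight = 0.0       # rightbranchweight: total weight in the right subtree

def insert(node, element, weight):
    """Insert an element, updating branch weights along the path."""
    if node is None:
        return Node(element, weight)
    if element < node.element:
        node.left_weight += weight
        node.left = insert(node.left, element, weight)
    else:
        node.right_weight += weight
        node.right = insert(node.right, element, weight)
    return node

def select(node, replace=True):
    """Return (element, removed_weight); removed_weight is 0 when sampling with replacement."""
    total = node.weight + node.left_weight + node.right_weight
    r = random.random() * total
    if r < node.weight:
        w = 0.0 if replace else node.weight
        node.weight -= w              # without replacement: zero out this element's weight
        return node.element, w
    if r < node.weight + node.left_weight:
        element, w = select(node.left, replace)
        node.left_weight -= w
        return element, w
    element, w = select(node.right, replace)
    node.right_weight -= w
    return element, w

root = None
for elem, w in [("a", 1.0), ("b", 3.0), ("c", 6.0)]:
    root = insert(root, elem, w)
print(select(root)[0])                 # with replacement
print(select(root, replace=False)[0])  # without replacement: the chosen weight is removed from the tree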
This is an old question for which numpy now offers an easy solution, so I thought I would mention it. numpy.random.choice allows the sampling to be done with or without replacement and with given weights.
Suppose you want to sample 3 elements without replacement from the list ['white','blue','black','yellow','green'] with the probability distribution [0.1, 0.2, 0.4, 0.1, 0.2]. Using the numpy.random module it is as easy as this:
import numpy.random as rnd
sampling_size = 3
domain = ['white','blue','black','yellow','green']
probs = [.1, .2, .4, .1, .2]
sample = rnd.choice(domain, size=sampling_size, replace=False, p=probs)
# in short: rnd.choice(domain, sampling_size, False, probs)
print(sample)
# Possible output: ['white' 'black' 'blue']
Setting the replace flag to True, you have a sampling with replacement.
More info here:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html#numpy.random.choice
We faced the problem of randomly selecting K validators out of N candidates once per epoch, proportionally to their stakes. But this gives us the following problem:
Imagine probabilities of each candidate:
0.1
0.1
0.8
The empirical probabilities of each candidate after 1,000,000 selections of 2 out of 3 without replacement became:
0.254315
0.256755
0.488930
Note that the original probabilities are not achievable for a 2-of-3 selection without replacement.
But we want the initial probabilities to be the profit-distribution probabilities; otherwise small candidate pools become more profitable. So we realized that random selection with replacement would help us: randomly select >K of N and also store the weight of each validator for reward distribution:
// n is the number of candidates and m the number of validators to select;
// likehoods[i] (candidate i's stake weight) and likehoodsSum (their total)
// are assumed to be defined elsewhere.
std::vector<int> validators;
std::vector<int> weights(n);
int totalWeights = 0;
for (int j = 0; validators.size() < m; j++) {
    int value = rand() % likehoodsSum;
    for (int i = 0; i < n; i++) {
        if (value < likehoods[i]) {
            if (weights[i] == 0) {
                validators.push_back(i);
            }
            weights[i]++;
            totalWeights++;
            break;
        }
        value -= likehoods[i];
    }
}
It gives an almost original distribution of rewards on millions of samples:
0.101230
0.099113
0.799657

Count ways to take atleast one stick

There are N sticks placed in a straight line. Bob is planning to take a few of these sticks. But whatever number of sticks he takes, he will take no two successive sticks (i.e. if he takes stick i, he will not take sticks i-1 and i+1).
So given N, we need to calculate how many different sets of sticks he could select. He needs to take at least one stick.
Example : Let N=3 then answer is 4.
The 4 sets are: (1, 3), (1), (2), and (3)
The main problem is that I want a solution better than simple recursion. Is there a formula for it? I am not able to crack it.
It's almost identical to Fibonacci. The final solution is actually fibonacci(N)-1, but let's explain it in terms of actual sticks.
To begin with we disregard from the fact that he needs to pick up at least 1 stick. The solution in this case looks as follows:
If N = 0, there is 1 solution (the solution where he picks up 0 sticks)
If N = 1, there are 2 solutions (pick up the stick, or don't)
Otherwise he can choose to either
pick up the first stick and recurse on N-2 (since the second stick needs to be discarded), or
leave the first stick and recurse on N-1
After this computation is finished, we remove 1 from the result to avoid counting the case where he picks up 0 sticks in total.
Final solution in pseudo code:
int numSticks(int N) {
    return N == 0 ? 1
         : N == 1 ? 2
         : numSticks(N-2) + numSticks(N-1);
}
solution = numSticks(X) - 1;
As you can see numSticks is actually Fibonacci, which can be solved efficiently using for instance memoization.
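For example, a memoized version in Python (a quick sketch of mine, not part of the original answer) could look like this:

from functools import lru_cache

@lru_cache(maxsize=None)
def num_sticks(n):
    """Number of valid selections of non-adjacent sticks, including the empty selection."""
    if n == 0:
        return 1
    if n == 1:
        return 2
    return num_sticks(n - 2) + num_sticks(n - 1)

def count_selections(n):
    return num_sticks(n) - 1   # subtract the empty selection

print(count_selections(3))     # 4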
Let the number of sticks taken by Bob be r.
The problem has a bijection to the number of binary vectors with exactly r 1's, and no two adjacent 1's.
This is solvable by first placing the r 1's; you are then left with exactly n-r 0's to place between them and on the sides. However, you must place at least one 0 between each pair of adjacent 1's, which uses up r-1 of them, so you are left with exactly n-r-(r-1) = n-2r+1 "free" 0's.
The number of ways to arrange such vectors is now given as:
(1) Choose(n-2r+1 + (r+1) - 1, n-2r+1) = Choose(n-r+1, n-2r+1)
Formula (1) derives from the number of ways of choosing n-2r+1 elements from r+1 distinct possibilities with replacement (stars and bars).
Since we solved it for a specific value of r, and you are interested in all r>=1, you need to sum for each 1<=r<=n
So, the solution of the problem is given by the closed formula:
(2) Sum{ Choose(n-r+1, n-2r+1) : 1 <= r <= n }
Disclaimer:
A close variant of the problem with a fixed r was given as homework in the course I am TAing this semester; the main difference here is the need to sum over the various values of r.
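As a quick sanity check (my own sketch, not part of the answer), the closed formula agrees with the Fibonacci-style recurrence from the other answer for small N:

import math

def count_by_formula(n):
    # sum of C(n-r+1, n-2r+1) over the values of r for which the term is defined (n-2r+1 >= 0)
    total = 0
    for r in range(1, n + 1):
        k = n - 2 * r + 1
        if k < 0:
            break
        total += math.comb(n - r + 1, k)
    return total

def count_by_recurrence(n):
    a, b = 1, 2          # numSticks(0), numSticks(1)
    for _ in range(n):
        a, b = b, a + b
    return a - 1         # subtract the empty selection

for n in range(1, 10):
    assert count_by_formula(n) == count_by_recurrence(n)
print(count_by_formula(3))   # 4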

Distinct sub sequences summing to given number in an array

During my current interview preparation, I encountered a question for which I am having some difficulty getting an optimal solution.
We are given an array A and an integer Sum; we need to find all distinct subsequences of A whose sum equals Sum.
For example, A={1,2,3,5,6}, Sum=6; then the answer should be
{1,2,3}
{1,5}
{6}
Presently I can think of two ways of doing this:
Use recursion (which I suppose should be the last thing to consider for an interview question)
Use integer partitioning to partition Sum and check whether the elements of the partition are present in A
Please guide my thoughts.
I agree with Jason. This solution comes to mind:
(complexity is O(sum*|A|) if you represent the map as an array)
Call the input set A and the target sum sum
Have a map of elements B, with each element being x:y, where x (the map key) is the sum and y (the map value) is the number of ways to get to it.
Starting off, add 0:1 to the map: there is 1 way to get to 0 (obviously by using no elements).
For each element a in A, consider each element x:y in B.
If x+a > sum, don't do anything.
If an element with the key x+a already exists in B, say that element is x+a:z, modify it to x+a:y+z.
If an element with the key doesn't exist, simply add x+a:y to the set.
Look up the element with key sum, i.e. sum:x; x is our desired value.
If B is sorted (or an array), you can simply skip the rest of the elements in B during the "don't do anything" step.
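A minimal sketch of the counting step described above (my own Python, with B kept as a plain dict rather than an array):

def count_subsequences(A, target):
    # B maps each reachable sum to the number of distinct subsequences producing it
    B = {0: 1}
    for a in A:
        for x, y in list(B.items()):      # snapshot, so sums added for `a` are not reused for `a`
            if x + a > target:
                continue
            B[x + a] = B.get(x + a, 0) + y
    return B.get(target, 0)

print(count_subsequences([1, 2, 3, 5, 6], 6))   # 3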
Tracing it back:
The above just gives the count, this will modify it to give the actual subsequences.
At each element in B, instead of the count, store all the source sums and the elements used to get there (so have a list of pairs at each element in B).
For 0:{} there are no source elements.
For x+a, the source sum is x and the element used to get there is a.
During the above process, if an element with the key already exists, enqueue the pair x/a to the element x+a (enqueue is an O(1) operation).
If an element with the key doesn't exist, simply create a list with one pair x/a at the element x+a.
To reconstruct, simply start at sum and recursively trace your way back.
We have to be careful of duplicate sequences (do we?) and sequences with duplicate elements here.
Example - not tracing it back:
A={1,2,3,5,6}
sum = 6
B = 0:1
Consider 1
Add 0+1
B = 0:1, 1:1
Consider 2
Add 0+2:1, 1+2:1
B = 0:1, 1:1, 2:1, 3:1
Consider 3
Add 0+3:1 (3 already exists -> add 1 to it), 1+3:1, 2+3:1, 3+3:1
B = 0:1, 1:1, 2:1, 3:2, 4:1, 5:1, 6:1
Consider 5
B = 0:1, 1:1, 2:1, 3:2, 4:1, 5:2, 6:2
Generated sums thrown away = 7:1, 8:2, 9:1, 10:1, 11:1
Consider 6
B = 0:1, 1:1, 2:1, 3:2, 4:1, 5:2, 6:3
Generated sums thrown away = 7:1, 8:1, 9:2, 10:1, 11:2, 12:2
Then, from 6:3, we know we have 3 ways to get to 6.
Example - tracing it back:
A={1,2,3,5,6}
sum = 6
B = 0:{}
Consider 1
B = 0:{}, 1:{0/1}
Consider 2
B = 0:{}, 1:{0/1}, 2:{0/2}, 3:{1/2}
Consider 3
B = 0:{}, 1:{0/1}, 2:{0/2}, 3:{1/2,0/3}, 4:{1/3}, 5:{2/3}, 6:{3/3}
Consider 5
B = 0:{}, 1:{0/1}, 2:{0/2}, 3:{1/2,0/3}, 4:{1/3}, 5:{2/3,0/5}, 6:{3/3,1/5}
Generated sums thrown away = 7, 8, 9, 10, 11
Consider 6
B = 0:{}, 1:{0/1}, 2:{0/2}, 3:{1/2,0/3}, 4:{1/3}, 5:{2/3,0/5}, 6:{3/3,1/5,0/6}
Generated sums thrown away = 7, 8, 9, 10, 11, 12
Then, tracing back from 6: (not in {} means an actual element, in {} means a map entry)
{6}
{3}+3
{1}+2+3
{0}+1+2+3
1+2+3
Output {1,2,3}
{0}+3+3
3+3
Invalid - 3 is duplicate
{1}+5
{0}+1+5
1+5
Output {1,5}
{0}+6
6
Output {6}
This is a variant of the subset-sum problem. The subset-sum problem asks if there is a subset that sums to a given value. You are asking for all of the subsets that sum to a given value.
The subset-sum problem is hard (more precisely, it's NP-Complete) which means that your variant is hard too (it's not NP-Complete, because it's not a decision problem, but it is NP-Hard).
The classic approach to the subset-sum problem is either recursion or dynamic programming. It's obvious how to modify the recursive solution to the subset-sum problem to answer your variant. I suggest that you also take a look at the dynamic programming solution to subset-sum and see if you can modify it for your variant (tbc: I do not know if this is actually possible). That would certainly be a very valuable learning exercise, whether or not it is possible, as it would enhance your understanding of dynamic programming either way.
It would surprise me though, if the expected answer to your question is anything but the recursive solution. It's easy to come up with, and an acceptable approach to the problem. Asking for the dynamic programming solution on-the-fly is a bit much to ask.
You did, however, neglect to mention a very naïve approach to this problem: generate all subsets, and for each subset check if it sums to the given value or not. Obviously that's exponential, but it does solve the problem.
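For completeness, a brute-force sketch of that naïve approach (my own code, not the answerer's):

from itertools import combinations

def subsets_with_sum(A, target):
    """Enumerate every subset of A and keep those summing to target (exponential time)."""
    found = set()
    for r in range(1, len(A) + 1):
        for combo in combinations(A, r):
            if sum(combo) == target:
                found.add(tuple(sorted(combo)))
    return sorted(found)

print(subsets_with_sum([1, 2, 3, 5, 6], 6))   # [(1, 2, 3), (1, 5), (6,)]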
I assume that the given array contains distinct numbers.
Let's define a function f(i, s): the number of ways to choose some of the numbers with indices in [1, i] so that their sum is s.
Let's store all values in a 2-dimensional matrix, i.e. in cell (i, s) we will have the value of f(i, s). If we have already calculated the values for the cells above and to the left of cell (i, s), we can calculate f(i, s): f(i, s) = f(i - 1, s) (not taking the i-th number), and if s >= a[i], then also f(i, s) += f(i - 1, s - a[i]). We can use a bottom-up approach to fill the whole matrix, setting f(0, 0) = 1; f(0, s) = 0 for 1 <= s <= S; f(i, 0) = 1 for 1 <= i <= n. Once the matrix is filled, the answer is in cell f(n, S). The total time complexity is O(n*S) and the memory complexity is O(n*S).
We can improve the memory complexity if we note that in every iteration we only need information from the previous row, which means we can store a matrix of size 2 x S instead of n x S, reducing the memory to linear in S. The underlying subset-sum problem is NP-complete, so no polynomial algorithm is known, and this pseudo-polynomial approach is about the best we can do.
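A sketch of this table in Python (my own code; it uses the standard single-row variant of the 2 x S trick, iterating sums downwards so that each number is used at most once):

def count_subsets(a, S):
    # f[s] = number of ways to reach sum s using the items considered so far
    f = [0] * (S + 1)
    f[0] = 1
    for x in a:
        for s in range(S, x - 1, -1):   # iterate downwards so each item is used at most once
            f[s] += f[s - x]
    return f[S]

print(count_subsets([1, 2, 3, 5, 6], 6))   # 3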
