Approximation-tolerant map - algorithm

I'm working with arrays of integers, all of the same size l.
I have a static set of them and I need to build a function to efficiently look them up.
The tricky part is that the elements in the array I need to search might be off by 1.
Given the arrays {A_1, A_2, ..., A_n}, and an array S, I need a function search such that:
search(S)=x iff ∀i: A_x[i] ∈ {S[i]-1, S[i], S[i]+1}.
A possible solution is treating each vector as a point in an l-dimensional space and looking for the closest point, but it'd cost something like O(l*n) in space and O(l*log(n)) in time.
Would there be a solution with a better space complexity (and/or time, of course)?
My arrays are pretty different from each other, and good heuristics might be enough.

Consider a search array S with the values:
S = [s1, s2, s3, ... , sl]
and the average value:
s̅ = (s1 + s2 + s3 + ... + sl) / l
and two matching arrays, one where every value is one greater than the corresponding value in S, and one where every value is one smaller:
A1 = [s1+1, s2+1, s3+1, ... , sl+1]
A2 = [s1−1, s2−1, s3−1, ... , sl−1]
These two arrays would have the average values:
a̅1 = (s1 + 1 + s2 + 1 + s3 + 1 + ... + sl + 1) / l = s̅ + 1
a̅2 = (s1 − 1 + s2 − 1 + s3 − 1 + ... + sl − 1) / l = s̅ − 1
So every matching array, whose values are at most 1 away from the corresponding values in the search array, has an average value that is at most 1 away from the average value of the search array.
If you calculate and store the average value of each array, and then sort the arrays based on their average value (or use an extra data structure that enables you to find all arrays with a certain average value), you can quickly identify which arrays have an average value within 1 of the search array's average value. Depending on the data, this could drastically reduce the number of arrays you have to check for similarity.
After having pre-processed the arrays and stored their average values, performing a search would mean iterating over the search array to calculate the average value, looking up which arrays have a similar average value, and then iterating over those arrays to check every value.
If you expect many arrays to have a similar average value, you could use several averages to detect arrays that are locally very different but similar on average. You could e.g. calculate these four averages:
the first half of the array
the second half of the array
the odd-numbered elements
the even-numbered elements
Analysis of the actual data should give you more information about how to divide the array and combine different averages to be most effective.
If the total sum of an array cannot exceed the integer size, you could store the total sum of each array, and check whether it is within l of the total sum of the search array, instead of using averages. This would avoid having to use floats and divisions.
(You could expand this idea by also storing other properties which are easily calculated and don't take up much space to store, such as the highest and lowest value, the biggest jump, ... They could help create a fingerprint of each array that is near-unique, depending on the data.)
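As a rough illustration of the sum-based variant described above, here is a minimal Python sketch. The names build_sum_index and search_by_sum are made up for this example, and it assumes every stored array has the same length as the search array. It sorts the arrays by total sum, uses binary search to find the candidates whose sum is within l of the search array's sum, and then checks only those candidates element by element.

```python
import bisect


def build_sum_index(arrays):
    """Pre-processing: sorted list of (total sum, position in arrays)."""
    return sorted((sum(a), i) for i, a in enumerate(arrays))


def search_by_sum(arrays, index, s):
    """Return the positions of all arrays whose elements are all within 1 of s."""
    l = len(s)
    total = sum(s)
    # Each element may differ by at most 1, so the sums differ by at most l.
    lo = bisect.bisect_left(index, (total - l, -1))
    hi = bisect.bisect_right(index, (total + l, len(arrays)))
    return [i for _, i in index[lo:hi]
            if all(abs(x - y) <= 1 for x, y in zip(arrays[i], s))]


arrays = [[1, 2, 3], [2, 3, 4], [10, 20, 30]]
index = build_sum_index(arrays)
print(search_by_sum(arrays, index, [2, 2, 3]))     # [0, 1]
print(search_by_sum(arrays, index, [10, 21, 29]))  # [2]
```

How much this helps depends entirely on how spread out the sums are; if many arrays share similar sums, the element-by-element check still dominates.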

If the number of dimensions is not very small, then probably the best solution will be to build a decision tree that recursively partitions the set along different dimensions.
Each node, including the root, would be a hash table from the possible values for some dimension to either:
The list of points that match that value within tolerance, if it's small enough; or
Those same points in a similar tree partitioning on the remaining dimensions.
Since each level completely eliminates one dimension, the depth of the tree is at most L, and search takes O(L) time.
The order in which the dimensions are chosen along each path is important, of course -- the wrong choice could explode the size of the data structure, with each point appearing many times.
Since your points are "pretty different", though, it should be possible to build a tree with minimal duplication. I would try the ID3 algorithm to choose the dimensions: https://en.wikipedia.org/wiki/ID3_algorithm. That basically means you greedily choose the dimension that maximizes the overall reduction in set size, using an entropy metric.

I would personally create something like a Trie for the lookup. I said "something like" because we have up to 3 values per index that might match. So we aren't creating a decision tree but a DAG, where sometimes we have choices.
That is straightforward and will run (with backtracking) in maximum time O(k*l).
But here is the trick. Whenever we see a choice of matching states that we can go into next, we can create a merged state which tries all of them. We can create a few or a lot of these merged states. Each one will defer a choice by 1 step. And if we're careful to keep track of which merged states we've created, we can reuse the same one over and over again.
In theory we can be generating partial matches for somewhat arbitrary subsets of our arrays, which can grow exponentially in the number of arrays. In practice we are likely to wind up with only a few of these merged states. But still we can guarantee a tradeoff: more states up front runs faster later. So we optimize until we are done or have hit the limit of how much data we want to have.
Here is some proof of concept code for this in Python. It will likely build the matcher in time O(n*l) and match in time O(l). However it is only guaranteed to build the matcher in time O(n^2 * l^2) and match in time O(n * l).
import pprint

class Matcher:
    def __init__(self, arrays, optimize_limit=None):
        # These are the partial states we could be in during a match.
        self.states = [{}]
        # By state, this is what we would be trying to match.
        self.state_for = ['start']
        # By combination we could try to match for, which state it is.
        self.comb_state = {'start': 0}
        for i in range(len(arrays)):
            arr = arrays[i]
            # Set up "matched the end".
            state_index = len(self.states)
            this_state = {'matched': [i]}
            self.comb_state[(i, len(arr))] = state_index
            self.states.append(this_state)
            self.state_for.append((i, len(arr)))

            for j in reversed(range(len(arr))):
                this_for = (i, j)
                prev_state = {}
                if 0 == j:
                    prev_state = self.states[0]
                # Search values that match arr[j] within a tolerance of 1.
                matching_values = {arr[j] - 1, arr[j], arr[j] + 1}
                for v in matching_values:
                    if v in prev_state:
                        prev_state[v].append(state_index)
                    else:
                        prev_state[v] = [state_index]
                if 0 < j:
                    state_index = len(self.states)
                    self.states.append(prev_state)
                    self.state_for.append(this_for)
                    self.comb_state[this_for] = state_index

        # Theoretically optimization can take space
        # O(2**len(arrays) * len(arrays[0]))
        # We will optimize until we are done or hit a more reasonable limit.
        if optimize_limit is None:
            # Normally
            optimize_limit = len(self.states)**2

        # First we find all of the choices at the root.
        # This will be an array of arrays with format:
        # [state, key, values]
        todo = []
        for k, v in self.states[0].items():
            if 1 < len(v):
                todo.append([self.states[0], k, tuple(v)])

        while len(todo) and len(self.states) < optimize_limit:
            this_state, this_key, this_match = todo.pop(0)
            if this_key == 'matched':
                pass # We do not need to optimize this!
            elif this_match in self.comb_state:
                # Reuse the merged state; keep it wrapped in a list so that
                # match() can iterate over it.
                this_state[this_key] = [self.comb_state[this_match]]
            else:
                # Construct a new state that is all of these.
                new_state = {}
                for state_ind in this_match:
                    for k, v in self.states[state_ind].items():
                        if k in new_state:
                            new_state[k] = new_state[k] + v
                        else:
                            new_state[k] = v
                i = len(self.states)
                self.states.append(new_state)
                self.comb_state[this_match] = i
                self.state_for.append(this_match)
                this_state[this_key] = [i]
                for k, v in new_state.items():
                    if 1 < len(v):
                        todo.append([new_state, k, tuple(v)])

        #pp = pprint.PrettyPrinter()
        #pp.pprint(self.states)
        #pp.pprint(self.comb_state)
        #pp.pprint(self.state_for)

    def match(self, list1, ind=0, state=0):
        this_state = self.states[state]
        if 'matched' in this_state:
            return this_state['matched']
        elif list1[ind] in this_state:
            answer = []
            for next_state in this_state[list1[ind]]:
                answer = answer + self.match(list1, ind+1, next_state)
            return answer
        else:
            return []

foo = Matcher([[1, 2, 3], [2, 3, 4]])
print(foo.match([2, 2, 3]))
Please note that I deliberately set up a situation where there are 2 matches. It reports both of them. :-)

I came up with a further approach derived from Matt Timmermans's answer: building a simple decision tree that might place some arrays in multiple branches. It works even if the error in the array I'm searching is larger than 1.
The idea is the following: given the set of arrays As...
Pick an index and a pivot.
I fixed the pivot to a constant value that works well with my data, and tried all indices to find the best one. Trying multiple pivots might work better, but I didn't need to.
Partition As into two possibly-intersecting subsets, one for the arrays (whose index-th element is) smaller than the pivot, one for the larger arrays. Arrays very close to the pivot are added to both sets:
function partition( As, pivot, index ):
    return {
        As.filter( A => A[index] <= pivot + 1 ),
        As.filter( A => A[index] >= pivot - 1 ),
    }
Apply both previous steps to each subset recursively, stopping when a subset only contains a single element.
Here an example of a possible tree generated with this algorithm (note that A2 appears both on the left and right child of the root node):
                 {A1, A2, A3, A4}
                     pivot:15
                     index:73
                    /        \
                   /          \
           {A1, A2}            {A2, A3, A4}
            pivot:7              pivot:33
            index:54             index:0
            /    \               /     \
           /      \             /       \
         A1        A2     {A2, A3}       A4
                           pivot:5
                           index:48
                            /   \
                           /     \
                         A2       A3
The search function then uses this as a normal decision tree: it starts from the root node and recurses either to the left or the right child depending on whether its value at index currentNode.index is greater or less than currentNode.pivot. It proceeds recursively until it reaches a leaf.
Once the decision tree is built, a search takes O(n) time in the worst case, but in practice it's probably closer to O(log(n)) if we choose good indices and pivots (and if the dataset is diverse enough) and end up with a fairly balanced tree.
The space complexity can be really bad in the worst case (O(2^n)), but it's closer to O(n) with balanced trees.
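For reference, the build and search steps could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: a single fixed pivot, an index chosen greedily so that the larger of the two subsets is as small as possible, and leaves that may hold several arrays when no index separates them any further. The names build_tree and search_tree and the dict-based node layout are hypothetical. The element-by-element check at the leaf is an extra step, added so that a query with no match returns an empty list.

```python
def build_tree(arrays, ids, pivot):
    """Build a tree over the arrays whose positions are listed in ids."""
    if len(ids) <= 1:
        return {"leaf": ids}
    best = None
    for index in range(len(arrays[ids[0]])):
        # Arrays within 1 of the pivot land in both subsets.
        left = [i for i in ids if arrays[i][index] <= pivot + 1]
        right = [i for i in ids if arrays[i][index] >= pivot - 1]
        size = max(len(left), len(right))
        if best is None or size < best[0]:
            best = (size, index, left, right)
    size, index, left, right = best
    if size == len(ids):
        # No index shrinks this subset: keep it as a multi-element leaf.
        return {"leaf": ids}
    return {"index": index, "pivot": pivot,
            "left": build_tree(arrays, left, pivot),
            "right": build_tree(arrays, right, pivot)}


def search_tree(arrays, tree, s):
    """Walk down to a leaf, then verify the candidates found there."""
    while "leaf" not in tree:
        go_left = s[tree["index"]] <= tree["pivot"]
        tree = tree["left"] if go_left else tree["right"]
    return [i for i in tree["leaf"]
            if all(abs(x - y) <= 1 for x, y in zip(arrays[i], s))]


arrays = [[0, 20, 5], [20, 0, 5], [20, 20, 5], [10, 10, 10]]
tree = build_tree(arrays, list(range(len(arrays))), pivot=10)
print(search_tree(arrays, tree, [1, 19, 5]))   # [0]
print(search_tree(arrays, tree, [11, 9, 10]))  # [3]
```

Going left when the search value is at most the pivot is safe because any array within 1 of such a value is at most pivot + 1 at that index, so it was placed in the left subset; the symmetric argument holds for the right side.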

Related

What is the most efficient algorithm/data structure for finding the smallest range containing a point?

Given a data set of a few millions of price ranges, we need to find the smallest range that contains a given price.
The following rules apply:
Ranges can be fully nested (ie, 1-10 and 5-10 is valid)
Ranges cannot be partially nested (ie, 1-10 and 5-15 is invalid)
Example:
Given the following price ranges:
1-100
50-100
100-120
5-10
5-20
The result for searching price 7 should be 5-10
The result for searching price 100 should be 100-120 (smallest range containing 100).
What's the most efficient algorithm/data structure to implement this?
Searching the web, I only found solutions for searching ranges within ranges.
I've been looking at Morton count and Hilbert curve, but can't wrap my head around how to use them for this case.
Thanks.
Because you did not mention this ad hoc algorithm, I'll propose this as a simple answer to your question:
This is a python function, but it's fairly easy to understand and convert it in another language.
def min_range(ranges, value):
    # ranges = [(1, 100), (50, 100), (100, 120), (5, 10), (5, 20)]
    # value = 100
    # INIT
    import math
    best_range = None
    best_range_len = math.inf
    # LOOP THROUGH ALL RANGES
    for b, e in ranges:
        # PICK THE SMALLEST
        if b <= value <= e and e - b < best_range_len:
            best_range = (b, e)
            best_range_len = e - b
    print(f'Minimal range containing {value} = {best_range}')
I believe there are more efficient and complicated solutions (if you can do some precomputation for example) but this is the first step you must take.
EDIT: Here is a better solution, probably in O(log(n)), but it's not trivial. It is a tree where each node is an interval and has a child list of all strictly non-overlapping intervals that are contained inside it.
Preprocessing is done in O(n log(n)) time and queries are O(n) in worst case (when you can't find 2 ranges that don't overlap) and probably O(log(n)) in average.
2 classes: Tree that holds the tree and can query:
class tree:
    def __init__(self, ranges):
        # sort the ranges by lowest starting and then greatest ending
        ranges = sorted(ranges, key=lambda i: (i[0], -i[1]))
        # recursive building -> might want to optimize that in python
        self.node = node( (-float('inf'), float('inf')) , ranges)

    def __str__(self):
        return str(self.node)

    def query(self, value):
        # bisect is for binary search
        import bisect
        curr_sol = self.node.inter
        node_list = self.node.child_list
        while True:
            # which of the child ranges can include our value ?
            i = bisect.bisect_left(node_list, (value, float('inf'))) - 1
            # does it includes it ?
            if i < 0 or i == len(node_list):
                return curr_sol
            if value > node_list[i].inter[1]:
                return curr_sol
            else:
                # if it does then go deeper
                curr_sol = node_list[i].inter
                node_list = node_list[i].child_list
Node that holds the structure and information:
class node:
    def __init__(self, inter, ranges):
        # all elements in ranges will be descendant of this node !
        import bisect
        self.inter = inter
        self.child_list = []
        for i, r in enumerate(ranges):
            if len(self.child_list) == 0:
                # append a new child when list is empty
                self.child_list.append(node(r, ranges[i + 1:bisect.bisect_left(ranges, (r[1], r[1] - 1))]))
            else:
                # the current range r is included in a previous range
                # r is not a child of self but a descendant !
                if r[0] < self.child_list[-1].inter[1]:
                    continue
                # else -> this is a new child
                self.child_list.append(node(r, ranges[i + 1:bisect.bisect_left(ranges, (r[1], r[1] - 1))]))

    def __str__(self):
        # fancy
        return f'{self.inter} : [{", ".join([str(n) for n in self.child_list])}]'

    def __lt__(self, other):
        # this is '<' operator -> for bisect to compare our items
        return self.inter < other
and to test that:
ranges = [(1, 100), (50, 100), (100, 120), (5, 10), (5, 20), (50, 51)]
t = tree(ranges)
print(t)
print(t.query(10))
print(t.query(5))
print(t.query(40))
print(t.query(50))
Preprocessing that generates disjoined intervals
(I call source segments as ranges and resulting segments as intervals)
For every range border (both start and end) make a tuple: (value, start/end field, range length, id), and put them in an array/list
Sort these tuples by the first field. In case of tie make longer range left for start and right for end.
Make a stack
Make StartValue variable.
Walk through the list:
    if current tuple contains start:
        if interval is opened: //we close it
            if current value > StartValue: //interval is not empty
                make interval with //note id remains in stack
                    (start=StartValue, end = current value, id = stack.peek)
                add interval to result list
        StartValue = current value //we open new interval
        push id from current tuple onto stack
    else: //end of range
        if current value > StartValue: //interval is not empty
            make interval with //note id is removed from stack
                (start=StartValue, end = current value, id = stack.pop)
            add interval to result list
        if stack is not empty:
            StartValue = current value //we open new interval
After that we have sorted list of disjointed intervals containing start/end value and id of the source range (note that many intervals might correspond to the same source range), so we can use binary search easily.
If we add source ranges one by one in nested order (each nested range after its parent), we can see that every new range might generate at most two new intervals, so the overall number of intervals is M <= 2*N and the overall complexity is O(N log N + Q * log N), where Q is the number of queries.
Edit:
Added if stack is not empty section
Result for your example 1-100, 50-100, 100-120, 5-10, 5-20 is
1-5(0), 5-10(3), 10-20(4), 20-50(0), 50-100(1), 100-120(2)
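The preprocessing and lookup could be sketched in Python roughly as follows. This is an illustrative variant rather than a line-by-line translation of the steps above: it sorts whole ranges by (start, -end) instead of sorting individual borders, and at query time it checks the two intervals adjacent to the queried value so that a shared border such as 100 resolves to the smaller source range. The names build_intervals and smallest_range are hypothetical, and ranges are assumed to be closed, non-empty (start < end), and never partially overlapping.

```python
import bisect

INF = float("inf")


def build_intervals(ranges):
    """Return sorted disjoint (start, end, id) intervals; id indexes ranges."""
    order = sorted(range(len(ranges)),
                   key=lambda i: (ranges[i][0], -ranges[i][1]))
    intervals, stack, pos = [], [], None

    def advance(to):
        # Emit the piece [pos, to], owned by the innermost open range.
        nonlocal pos
        if stack and to > pos:
            intervals.append((pos, to, stack[-1]))
        pos = to

    for i in order:
        start = ranges[i][0]
        # Close every open range that ends at or before this start.
        while stack and ranges[stack[-1]][1] <= start:
            advance(ranges[stack[-1]][1])
            stack.pop()
        advance(start)
        stack.append(i)
    while stack:
        advance(ranges[stack[-1]][1])
        stack.pop()
    return intervals


def smallest_range(ranges, intervals, value):
    """Binary search, then compare the (at most two) intervals touching value."""
    k = bisect.bisect_right(intervals, (value, INF, INF)) - 1
    best = None
    for j in (k - 1, k):
        if 0 <= j < len(intervals):
            start, end, rid = intervals[j]
            if start <= value <= end:
                r = ranges[rid]
                if best is None or r[1] - r[0] < best[1] - best[0]:
                    best = r
    return best


ranges = [(1, 100), (50, 100), (100, 120), (5, 10), (5, 20)]
intervals = build_intervals(ranges)
print(smallest_range(ranges, intervals, 7))    # (5, 10)
print(smallest_range(ranges, intervals, 100))  # (100, 120)
```

For the example ranges, build_intervals produces the same six intervals listed above.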
Since pLOPeGG already covered the ad hoc case, I will answer the question under the premise that preprocessing is performed in order to support multiple queries efficiently.
General data structures for efficient queries on intervals are the Interval Tree and the Segment Tree.
What about an approach like this? Since we only allow full nesting and not partial nesting, this looks like a doable approach.
Split segments into (left,val) and (right,val) pairs.
Order them with respect to their vals and left/right relation.
Search the list with binary search. We get two outcomes not found and found.
If found check if it is a left or right. If it is a left go right until you find a right without finding a left. If it is a right go left until you find a left without finding a right. Pick the smallest.
If not found stop when the high-low is 1 or 0. Then compare the queried value with the value of the node you are at and then according to that search right and left to it just like before.
As an example;
We would have (l,10) (l,20) (l,30) (r,45) (r,60) (r,100). When searching for, say, 65, you land on (r,100), so you go left; you can't find an (l,x) with x>=65, so you keep going left until the lefts and rights are balanced, and the first right and the last left give your interval. The preprocessing part will take a while, but you only do it once and keep the result. A query is still O(n) in the worst case, but that worst case requires everything to be nested inside each other while you search for the outermost range.

Choose the best cluster partition based on a cost function

I've a string that I'd like to cluster:
s = 'AAABBCCCCC'
I don't know in advance how many clusters I'll get. All I have, is a cost function that can take a clustering and give it a score.
There is also a constraint on the cluster sizes: they must be in a range [a, b]
In my example, for a=3 and b=4, all possible clusterings are:
[
['AAA', 'BBC', 'CCCC'],
['AAA', 'BBCC', 'CCC'],
['AAAB', 'BCC', 'CCC'],
]
Concatenation of each clustering must give the string s
The cost function is something like this
cost(clustering) = alpha*l + beta*e + gamma*d
where:
    l = variance(cluster_lengths)
    e = mean(clusters_entropies)
    d = 1 - (nb_characters_in_b_that_are_not_in_a)/size_of_b (for b the consecutive cluster of a)
    alpha, beta, gamma are weights
This cost function gives a low cost (0) for the best case:
Where all clusters have the same size.
Content inside each cluster is the same.
Consecutive clusters don't have the same content.
Theoretically, the solution is to calculate the cost of all possible compositions for this string and choose the lowest, but it will take too much time.
Is there any clustering algorithm that can find the best clustering according to this cost function in a reasonable time?
A dynamic programming approach should work here.
Imagine, first, that cost(clustering) equals the sum of cost(cluster) over all clusters that constitute the clustering.
Then, a simple DP function is defined as follows:
F[i] = minimal cost of clustering the substring s[0:i]
and calculated in the following way:
for i = 0..length(s)-1:
    for j = a..b:
        last_cluster = s[i-j..i]
        F[i] = min(F[i], F[i - j] + cost(last_cluster))
Of course, first you have to initialize values of F to some infinite values or nulls to correctly apply min function.
To actually restore the answer, you can store additional values P[i], which would contain the lengths of the last cluster with optimal clustering of string s[0..i].
When you update F[i], you also update P[i].
Then, restoring answer is little trouble:
current_pos = length(s) - 1
while (current_pos >= 0):
    current_cluster_length = P[current_pos]
    current_cluster = s[(current_pos - current_cluster_length + 1)..current_pos]
    // grab current_cluster to the answer
    current_pos -= current_cluster_length
Note that in this approach you will get the clusters in reverse order, from the last cluster all the way to the first one.
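To make the simple additive case concrete, here is a minimal Python sketch. It is not the exact pseudocode above: to sidestep the off-by-one bookkeeping, F[i] here is the minimal cost of clustering the first i characters, s[:i], with F[0] = 0. The function name best_clustering and the toy per-cluster cost used in the usage lines (number of distinct characters minus one) are assumptions for illustration only.

```python
import math


def best_clustering(s, a, b, cost):
    """Minimal total cost of splitting s into clusters of length a..b.

    cost is any function scoring a single cluster (assumed additive).
    Returns (total cost, clusters), or None if no valid split exists.
    """
    n = len(s)
    F = [math.inf] * (n + 1)  # F[i]: minimal cost of clustering s[:i]
    P = [0] * (n + 1)         # P[i]: length of the last cluster for F[i]
    F[0] = 0.0
    for i in range(1, n + 1):
        for j in range(a, min(b, i) + 1):
            candidate = F[i - j] + cost(s[i - j:i])
            if candidate < F[i]:
                F[i], P[i] = candidate, j
    if math.isinf(F[n]):
        return None
    # Walk back through P; clusters come out last-to-first, so reverse them.
    clusters = []
    i = n
    while i > 0:
        clusters.append(s[i - P[i]:i])
        i -= P[i]
    clusters.reverse()
    return F[n], clusters


print(best_clustering("AAABBCCCCC", 3, 4, lambda c: len(set(c)) - 1))
# (1.0, ['AAA', 'BBCC', 'CCC'])
```

This runs in O(length(s) * (b - a + 1)) evaluations of cost, compared with enumerating every composition of the string.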
Let's now apply this idea to the initial problem.
What we would like is to make cost(clustering) more or less linear, so that we can compute it cluster by cluster instead of computing it for the whole clustering.
The first parameter of our DP function F will be, as before, i, the number of chars in the substring s[0:i] we have found optimal answer to.
The meaning of the F function is, as usual, the minimal cost we can achieve with the given parameters.
The parameter e = mean(clusters_entropies) of the cost function is already linear and can be computed cluster by cluster, so this is not a problem.
The parameter l = variance(cluster_lengths) is a little bit more complex.
The variance of n values is defined as Sum[(x[i] - mean)^2] / n.
mean is expected value, namely mean = Sum[x[i]] / n.
Note also that Sum[x[i]] is the sum of lengths of all clusters and in our case it is always fixed and equals to length(s).
Therefore, mean = length(s) / n.
Okay, we have more or less made our l part of cost function linear except the n parameter. We will add this parameter, namely the number of clusters in the desired clustering, as a parameter to our F function.
We will also have a parameter cur which will mean the number of clusters currently assembled in the given state.
The parameter d of the cost function also requires adding another parameter to our DP function F, namely sz, the size of the last cluster in our partition.
Overall, we have come up with a DP function F[i][n][cur][sz] that gives us the minimal cost function of partitioning string s[0:i] into n clusters of which cur are currently constructed with the size of the last cluster equal to sz. Of course, our responsibility is to make sure that a<=sz<=b.
The answer in terms of the minimal cost function will be the minimum among all possible n and a<=sz<=b values of DP function F[length(s)-1][n][n][sz].
Now notice that this time we do not even require the companion P function to store the length of the last cluster as we already included that information as the last sz parameter into our F function.
We will, however, store in P[i][n][cur][sz] the length of the next to last cluster in the optimal clustering with the specified parameters. We will use that value to restore our solution.
Thus, we will be able to restore an answer in the following way, assuming the minimum of F is achieved in the parameters n=n0 and sz=sz0:
current_pos = length(s) - 1
current_n = n0
current_cluster_size = sz0
while (current_n > 0):
    current_cluster = s[(current_pos - current_cluster_size + 1)..current_pos]
    next_cluster_size = P[current_pos][n0][current_n][current_cluster_size]
    current_n--;
    current_pos -= current_cluster_size;
    current_cluster_size = next_cluster_size
Let's now get to the computation of F.
I will omit the corner cases and range checks, but it will be enough to just initialize F with some infinite values.
// initialize for the case of one cluster
// d = 0, l = 0, only have to calculate entropy
for i=0..length(s)-1:
    for n=1..length(s):
        F[i][n][1][i+1] = cluster_entropy(s[0..i]);
        P[i][n][1][i+1] = -1; // initialize with fake value as in this case there is no previous cluster
// general case computation
for i=0..length(s)-1:
    for n=1..length(s):
        for cur=2..n:
            for sz=a..b:
                for prev_sz=a..b:
                    cur_cluster = s[i-sz+1..i]
                    prev_cluster = s[i-sz-prev_sz+1..i-sz]
                    F[i][n][cur][sz] = min(F[i][n][cur][sz],
                        F[i-sz][n][cur - 1][prev_sz]
                        + gamma*calc_d(prev_cluster, cur_cluster)
                        + beta*cluster_entropy(cur_cluster)/n
                        + alpha*(sz - length(s)/n)^2)

Hashing function to distribute over n values (with a twist)

I was wondering if there are any hashing functions to distribute input over n values. The distribution should of course be fairly uniform. But there is a twist: with small changes of n, few elements should get a new hash. Optimally it should split all k elements uniformly over n values, and if n increases to n+1, only k/n-k/(n+1) values would have to move to distribute uniformly in the new hash. Obviously a hash which simply creates uniform values and then takes them mod n would work, but that would move a lot of hashes to fill the new node. The goal here is that as few values as possible fall into a new bucket.
Suppose 2^{n-1} < N <= 2^n. Then there is a standard trick for turning a hash function H that produces (at least) n bits into one that produces a number from 0 to N-1.
Compute H(v).
Keep just the first n bits.
If that's smaller than N, stop and output it. Otherwise, start from the top with H(v) instead of v.
Some properties of this technique:
You might worry that you have to repeat the loop many times in some cases. But actually the expected number of loops is at most 2.
If you bump up N and n doesn't have to change, very few things get a new hash: only those ones that had exactly N somewhere in their chain of hashes. (Of course, identifying which elements have this property is kind of hard -- in general it may require rehashing every element!)
If you bump up N and n does have to change, about half of the elements have to be rebucketed. But this happens more and more rarely the bigger N is -- it is an amortized O(1) cost on each bump.
Edit to add an additional comment about the "have to rehash everything" requirement: One might consider modifying step 3 above to "start from the top with the first n bits of H(v)" instead. This reduces the problem with identifying which elements need to be rehashed -- since they'll be in the bucket for the hash of N -- though I'm not confident the resulting hash will have quite as good collision avoidance properties. It certainly makes the process a bit more fragile -- one would want to prove something special about the choice of H (that the bottom few bits aren't "critical" to its collision avoidance properties somehow).
Here is a simple example implementation in Python, together with a short main that shows that most strings do not move when bumping normally, and about half of strings get moved when bumping across a 2^n boundary. Forgive me for any idiosyncracies of my code -- Python is a foreign language.
import math

def ilog2(m): return int(math.ceil(math.log(m,2)))

def hash_into(obj, N):
    cur_hash = hash(obj)
    mask = pow(2, ilog2(N)) - 1
    while (cur_hash & mask) >= N:
        # seems Python uses the identity for its hash on integers, which
        # doesn't iterate well; let's use literally any other hash at all
        cur_hash = hash(str(cur_hash))
    return cur_hash & mask

def same_hash(obj, N, N2):
    return hash_into(obj, N) == hash_into(obj, N2)

def bump_stat(objs, N):
    return len([obj for obj in objs if same_hash(obj, N, N+1)])

alphabet = [chr(x) for x in range(ord('a'),ord('z')+1)]
ascending = alphabet + [c1 + c2 for c1 in alphabet for c2 in alphabet]

def main():
    print(len(ascending))
    print(bump_stat(ascending, 10))
    print(float(bump_stat(ascending, 16))/len(ascending))

main()

# printed when originally run (under Python 2):
# 702
# 639
# 0.555555555556
# Python 3 randomizes string hashes per process, so the last two
# numbers will vary from run to run there.
Well, when you add a node, you will want it to fill up, so you will actually want k/(n+1) elements to move from their old nodes to the new one.
That is easily accomplished:
Just generate a hash value for each key as you normally would. Then, to assign key k to a node in [0,N):
Let H(k) be the hash of k.
int hash = H(k);
int n;
for (n=N-1;n>0;--n) {
    if ((mix(hash,n) % (n+1))==0) {
        break;
    }
}
//put it in node n
So, when you add node 1, it steals half the items from node 0.
When you add node 2, it steals 1/3 of the items from the previous 2 nodes.
And so on...
EDIT: added the mix() function, to mix up the hash differently for every n -- otherwise you get non-uniformities when n is not prime.
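A small Python sketch of this scheme, to show the "only the new node steals keys" property. The answer does not specify mix(), so the SHA-256 based mix below is purely a stand-in, as are the names node_for and base_hash. hashlib is used instead of the built-in hash() because Python 3 randomizes string hashes per process.

```python
import hashlib


def base_hash(key):
    """Stable 64-bit hash of a string key (stand-in for H)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def mix(h, n):
    """Hypothetical mixer: a different-looking 64-bit value for every (h, n)."""
    digest = hashlib.sha256(f"{h}:{n}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def node_for(key, N):
    """Assign key to a node in [0, N): the highest node that 'steals' it."""
    h = base_hash(key)
    for n in range(N - 1, 0, -1):
        if mix(h, n) % (n + 1) == 0:
            return n
    return 0


keys = [f"key{i}" for i in range(2000)]
before = [node_for(k, 5) for k in keys]
after = [node_for(k, 6) for k in keys]
moved = sum(1 for x, y in zip(before, after) if x != y)
# Every key that moved went to the new node 5; roughly 1/6 of all keys move.
print(moved, all(y == 5 for x, y in zip(before, after) if x != y))
```

A key's decision for node n depends only on (hash, n), so adding node N cannot change which of the older nodes a key would otherwise land on; it can only pull the key into node N.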

What data structure is conducive to discrete sampling? [duplicate]

Recently I needed to do weighted random selection of elements from a list, both with and without replacement. While there are well known and good algorithms for unweighted selection, and some for weighted selection without replacement (such as modifications of the reservoir algorithm), I couldn't find any good algorithms for weighted selection with replacement. I also wanted to avoid the reservoir method, as I was selecting a significant fraction of the list, which is small enough to hold in memory.
Does anyone have any suggestions on the best approach in this situation? I have my own solutions, but I'm hoping to find something more efficient, simpler, or both.
One of the fastest ways to make many with replacement samples from an unchanging list is the alias method. The core intuition is that we can create a set of equal-sized bins for the weighted list that can be indexed very efficiently through bit operations, to avoid a binary search. It will turn out that, done correctly, we will need to only store two items from the original list per bin, and thus can represent the split with a single percentage.
Let us take the example of five equally weighted choices, (a:1, b:1, c:1, d:1, e:1)
To create the alias lookup:
Normalize the weights such that they sum to 1.0. (a:0.2 b:0.2 c:0.2 d:0.2 e:0.2) This is the probability of choosing each weight.
Find the smallest power of 2 greater than or equal to the number of variables, and create this number of partitions, |p|. Each partition represents a probability mass of 1/|p|. In this case, we create 8 partitions, each able to contain 0.125.
Take the variable with the least remaining weight, and place as much of its mass as possible in an empty partition. In this example, we see that a fills the first partition. (p1{a|null,1.0},p2,p3,p4,p5,p6,p7,p8) with (a:0.075, b:0.2 c:0.2 d:0.2 e:0.2)
If the partition is not filled, take the variable with the most weight, and fill the partition with that variable.
Repeat steps 3 and 4 until all of the weight from the original list has been assigned to a partition.
For example, if we run another iteration of 3 and 4, we see
(p1{a|null,1.0},p2{a|b,0.6},p3,p4,p5,p6,p7,p8) with (a:0, b:0.15 c:0.2 d:0.2 e:0.2) left to be assigned
At runtime:
Get a U(0,1) random number, say binary 0.001100000
bitshift it lg2(p), finding the index partition. Thus, we shift it by 3, yielding 001.1, or position 1, and thus partition 2.
If the partition is split, use the decimal portion of the shifted random number to decide the split. In this case, the value is 0.5, and 0.5 < 0.6, so return a.
Here is some code and another explanation, but unfortunately it doesn't use the bitshifting technique, nor have I actually verified it.
A simple approach that hasn't been mentioned here is one proposed in Efraimidis and Spirakis. In python you could select m items from n >= m weighted items with strictly positive weights stored in weights, returning the selected indices, with:
import heapq
import math
import random

def WeightedSelectionWithoutReplacement(weights, m):
    elt = [(math.log(random.random()) / weights[i], i) for i in range(len(weights))]
    return [x[1] for x in heapq.nlargest(m, elt)]
This is very similar in structure to the first approach proposed by Nick Johnson. Unfortunately, that approach is biased in selecting the elements (see the comments on the method). Efraimidis and Spirakis proved that their approach is equivalent to random sampling without replacement in the linked paper.
Here's what I came up with for weighted selection without replacement:
def WeightedSelectionWithoutReplacement(l, n):
    """Selects without replacement n random elements from a list of (weight, item) tuples."""
    l = sorted((random.random() * x[0], x[1]) for x in l)
    return l[-n:]
This is O(m log m) on the number of items in the list to be selected from. I'm fairly certain this will weight items correctly, though I haven't verified it in any formal sense.
Here's what I came up with for weighted selection with replacement:
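import bisect
import random

def WeightedSelectionWithReplacement(l, n):
    """Selects with replacement n random elements from a list of (weight, item) tuples."""
    cuml = []   # running totals of the weights
    items = []  # items in the same order as cuml
    total_weight = 0.0
    for weight, item in l:
        total_weight += weight
        cuml.append(total_weight)
        items.append(item)
    # Bisect on the running totals only; a float cannot be compared with a
    # (total, item) tuple.
    return [items[bisect.bisect(cuml, random.random()*total_weight)] for x in range(n)]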
This is O(m + n log m), where m is the number of items in the input list, and n is the number of items to be selected.
I'd recommend you start by looking at section 3.4.2 of Donald Knuth's Seminumerical Algorithms.
If your arrays are large, there are more efficient algorithms in chapter 3 of Principles of Random Variate Generation by John Dagpunar. If your arrays are not terribly large or you're not concerned with squeezing out as much efficiency as possible, the simpler algorithms in Knuth are probably fine.
It is possible to do Weighted Random Selection with replacement in O(1) time, after first creating an additional O(N)-sized data structure in O(N) time. The algorithm is based on the Alias Method developed by Walker and Vose, which is well described here.
The essential idea is that each bin in a histogram would be chosen with probability 1/N by a uniform RNG. So we will walk through it, and for any underpopulated bin which would receive excess hits, assign the excess to an overpopulated bin. For each bin, we store the percentage of hits which belong to it, and the partner bin for the excess. This version tracks small and large bins in place, removing the need for an additional stack. It uses the index of the partner (stored in bucket[1]) as an indicator that they have already been processed.
Here is a minimal python implementation, based on the C implementation here
import random

def prep(weights):
    data_sz = len(weights)
    factor = data_sz / float(sum(weights))
    # Each bucket is [scaled weight, partner index]; partner starts as itself.
    data = [[w * factor, i] for i, w in enumerate(weights)]
    big = 0
    while big < data_sz and data[big][0] <= 1.0:
        big += 1
    for small, bucket in enumerate(data):
        # Compare by value: "is not" on ints only works for small cached ints.
        if bucket[1] != small:
            continue  # already processed as somebody's partner
        excess = 1.0 - bucket[0]
        while excess > 0:
            if big == data_sz:
                break
            bucket[1] = big
            bucket = data[big]
            bucket[0] -= excess
            excess = 1.0 - bucket[0]
            if excess >= 0:
                big += 1
                while big < data_sz and data[big][0] <= 1:
                    big += 1
    return data

def sample(data):
    r = random.random() * len(data)
    idx = int(r)
    # The fractional part decides between the bucket and its partner.
    return data[idx][1] if r - idx > data[idx][0] else idx
Example usage:
TRIALS = 1000
weights = [20, 1.5, 9.8, 10, 15, 10, 15.5, 10, 8, .2]
samples = [0] * len(weights)
data = prep(weights)
for _ in range(int(sum(weights) * TRIALS)):
    samples[sample(data)] += 1
result = [float(s) / TRIALS for s in samples]
err = [a - b for a, b in zip(result, weights)]
print(result)
print([round(e, 5) for e in err])
print(sum([e * e for e in err]))
The following is a description of random weighted selection of an element of a
set (or multiset, if repeats are allowed), both with and without replacement in O(n) space
and O(log n) time.
It consists of implementing a binary search tree, sorted by the elements to be
selected, where each node of the tree contains:
the element itself (element)
the un-normalized weight of the element (elementweight), and
the sum of all the un-normalized weights of the left-child node and all of
its children (leftbranchweight).
the sum of all the un-normalized weights of the right-child node and all of
its children (rightbranchweight).
Then we randomly select an element from the BST by descending down the tree. A
rough description of the algorithm follows. The algorithm is given a node of
the tree. Then the values of leftbranchweight, rightbranchweight,
and elementweight of node are summed, and the weights are divided by this
sum, resulting in the values leftbranchprobability,
rightbranchprobability, and elementprobability, respectively. Then a
random number between 0 and 1 (randomnumber) is obtained.
if the number is less than elementprobability,
    remove the element from the BST as normal, updating leftbranchweight
    and rightbranchweight of all the necessary nodes, and return the
    element.
else if the number is less than (elementprobability + leftbranchprobability)
    recurse on leftchild (run the algorithm using leftchild as node)
else
    recurse on rightchild
When we finally find, using these weights, which element is to be returned, we either simply return it (with replacement) or we remove it and update relevant weights in the tree (without replacement).
DISCLAIMER: The algorithm is rough, and a treatise on the proper implementation
of a BST is not attempted here; rather, it is hoped that this answer will help
those who really need fast weighted selection without replacement (like I do).
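The descent described above can be sketched in Python as follows. This is an illustration under assumptions, not the answer's own implementation: the names `Node`, `build` and `select` are made up, weights are assumed to be non-negative integers so the arithmetic stays exact (float weights would need a guard against rounding), and "removal" is simplified to setting the node's weight to zero and fixing the branch sums of its ancestors rather than unlinking the node from the tree.

```python
import random

class Node:
    def __init__(self, element, weight):
        self.element = element
        self.weight = weight   # elementweight
        self.left = None
        self.right = None
        self.left_sum = 0      # leftbranchweight
        self.right_sum = 0     # rightbranchweight

    def total(self):
        return self.weight + self.left_sum + self.right_sum

def build(pairs):
    """Build a balanced tree from (element, weight) pairs sorted by element."""
    if not pairs:
        return None
    mid = len(pairs) // 2
    node = Node(*pairs[mid])
    node.left = build(pairs[:mid])
    node.right = build(pairs[mid + 1:])
    node.left_sum = node.left.total() if node.left else 0
    node.right_sum = node.right.total() if node.right else 0
    return node

def select(root, remove=False, rng=random):
    """Return one element with probability proportional to its weight.

    Raises ValueError if the total remaining weight is zero.
    """
    r = rng.randrange(root.total())  # integer in [0, total)
    node, path = root, []            # path holds (ancestor, went_left)
    while r >= node.weight:
        r -= node.weight
        if r < node.left_sum:
            path.append((node, True))
            node = node.left
        else:
            r -= node.left_sum
            path.append((node, False))
            node = node.right
    if remove:
        # Without replacement: zero the weight and update ancestor sums.
        for ancestor, went_left in path:
            if went_left:
                ancestor.left_sum -= node.weight
            else:
                ancestor.right_sum -= node.weight
        node.weight = 0
    return node.element

root = build([('a', 1), ('b', 2), ('c', 3)])
print([select(root, remove=True) for _ in range(3)])  # some order of a, b, c
```

Because the tree is built balanced and never restructured, each selection walks O(log n) nodes.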
This is an old question for which numpy now offers an easy solution so I thought I would mention it. Current version of numpy is version 1.2 and numpy.random.choice allows the sampling to be done with or without replacement and with given weights.
Suppose you want to sample 3 elements without replacement from the list ['white','blue','black','yellow','green'] with the probability distribution [0.1, 0.2, 0.4, 0.1, 0.2]. Using the numpy.random module it is as easy as this:
import numpy.random as rnd
sampling_size = 3
domain = ['white','blue','black','yellow','green']
probs = [.1, .2, .4, .1, .2]
sample = rnd.choice(domain, size=sampling_size, replace=False, p=probs)
# in short: rnd.choice(domain, sampling_size, False, probs)
print(sample)
# Possible output: ['white' 'black' 'blue']
Setting the replace flag to True gives you sampling with replacement.
More info here:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html#numpy.random.choice
We faced the problem of randomly selecting K validators out of N candidates once per epoch, proportionally to their stakes. But selecting without replacement gives us the following problem:
Imagine the following selection probabilities for three candidates:
0.1
0.1
0.8
After 1'000'000 selections of 2 out of 3 without replacement, the observed share of each candidate became:
0.254315
0.256755
0.488930
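Note that the original probabilities are not achievable when selecting 2 of 3 without replacement: each round fills two distinct slots, so no candidate can take more than half of all slots, and a share of 0.8 is out of reach.
But we want the initial probabilities to be the profit distribution probabilities; otherwise small candidate pools become more profitable. So we realized that random selection with replacement would help us: keep drawing with replacement until K distinct candidates have been picked (which takes K or more draws), and also store how often each validator was drawn as its weight for reward distribution: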
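// Fragment: assumes likehoods (per-candidate stakes), likehoodsSum (their sum),
// n (number of candidates) and m (number of validators to pick) are defined.
std::vector<int> validators;
std::vector<int> weights(n);
int totalWeights = 0;
for (int j = 0; validators.size() < m; j++) {
    int value = rand() % likehoodsSum;
    for (int i = 0; i < n; i++) {
        if (value < likehoods[i]) {
            if (weights[i] == 0) {
                validators.push_back(i);  // first time this candidate is drawn
            }
            weights[i]++;
            totalWeights++;
            break;
        }
        value -= likehoods[i];
    }
}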
It gives an almost original distribution of rewards on millions of samples:
0.101230
0.099113
0.799657
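The same draw loop can be sketched in Python to experiment with reward shares. This is an illustration under assumptions: the function name `draw_validators` and the parameter name `stakes` are hypothetical, and `random.choices` stands in for the manual cumulative scan.

```python
import random

def draw_validators(stakes, m, rng=random):
    """Draw with replacement until m distinct candidates have been hit.

    Returns (validators, hits): the distinct candidates in order of first
    appearance, and how many times each candidate was drawn.
    """
    if m > sum(1 for s in stakes if s > 0):
        raise ValueError("not enough candidates with a positive stake")
    validators = []
    hits = [0] * len(stakes)
    while len(validators) < m:
        i = rng.choices(range(len(stakes)), weights=stakes, k=1)[0]
        if hits[i] == 0:
            validators.append(i)  # first time this candidate is drawn
        hits[i] += 1
    return validators, hits

validators, hits = draw_validators([10, 10, 80], 2)
total = sum(hits)
# Reward share of each selected validator for this epoch.
print({v: hits[v] / total for v in validators})
```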

Finding the best pair of elements that don't exceed a certain weight?

I have a collection of objects, each of which has a weight and a value. I want to pick the pair of objects with the highest total value subject to the restriction that their combined weight does not exceed some threshold. Additionally, I am given two arrays, one containing the objects sorted by weight and one containing the objects sorted by value.
I know how to do it in O(n^2) but how can I do it in O(n)?
This is a combinatorial optimization problem, and the fact the values are sorted means you can easily try a branch and bound approach.
I think that I have a solution that works in O(n log n) time and O(n) extra space. This isn't quite the O(n) solution you wanted, but it's still better than the naive quadratic solution.
The intuition behind the algorithm is that we want to be able to efficiently determine, for any amount of weight, the maximum value we can get with a single item that uses at most that much weight. If we can do this, we have a simple algorithm for solving the problem: iterate across the array of elements sorted by value. For each element, see how much additional value we could get by pairing a single element with it (using the values we precomputed), then find which of these pairs is maximum. If we can do the preprocessing in O(n log n) time and can answer each of the above queries in O(log n) time, then the total time for the second step will be O(n log n) and we have our answer.
An important observation we need for the preprocessing step is as follows. Our goal is to build up a structure that can answer the question "which element with weight at most x has maximum value?" Let's think about how we might do this by adding one element at a time. If we have an element (value, weight) and the structure is empty, then we want to say that the maximum value we can get whenever at least "weight" units of capacity are left is "value". This means that everything in the range [weight, max_weight] should be set to value. Otherwise, suppose that the structure isn't empty when we try adding in (value, weight). In that case, we want to say that any portion of the range [weight, max_weight] whose value is less than value should be replaced by value.
The problem here is that when we do these insertions, there might be, on iteration k, O(k) different subranges that need to be updated, leading to an O(n^2) algorithm. However, we can use a very clever trick to avoid this. Suppose that we insert all of the elements into this data structure in descending order of value. In that case, when we add in (value, weight), each existing value in the data structure must be at least as high as our value. This means that if the range [weight, max_weight] intersects any existing range, that range already holds a value at least as high as ours and so we don't need to update it. If we combine this with the fact that each range we add always extends from some weight up to max_weight, the only portion of the new range that could ever be added to the data structure is the range [weight, x), where x is the lowest weight stored in the data structure so far.
To summarize, assuming that we visit the (value, weight) pairs in descending order of value, we can update our data structure as follows:
If the structure is empty, record that the range [weight, max_weight] has value "value."
Otherwise, if the lowest weight recorded in the structure is at most weight, skip this element.
Otherwise, if the lowest weight recorded so far is x, record that the range [weight, x) has value "value."
Notice that this means that we are always adding ranges at the front of the list of ranges we have encountered so far. Because of this, we can think about storing the list of ranges as a simple array, where each array element tracks the lower endpoint of some range and the value assigned to that range. For example, we might track the ranges [3, 9), [9, 12), and [12, max_weight] as the array
3, 9, 12
If we then needed to add the range [1, 3), we could do so by prepending 1 to the list:
1, 3, 9, 12
If we represent this array in reverse (actually storing the ranges from high to low instead of low to high), this step of creating the array runs in O(n) time because at each point we just do O(1) work to decide whether or not to add another element onto the end of the array.
Once we have the ranges stored like this, to determine which of the ranges a particular weight falls into, we can just use a binary search to find the largest element no greater than that weight. For example, to look up 6 in the above array we'd do a binary search to find 3.
Finally, once we have this data structure built up, we can just look at each of the objects one at a time. For each element, we see how much weight is left, use a binary search in the other structure to see what element it should be paired with to maximize the total value (taking care not to pair an element with itself), and then find the maximum attainable value.
Let's trace through an example. Given maximum allowable weight 10 and the objects
Weight | Value
------+------
2 | 3
6 | 5
4 | 7
7 | 8
Let's see what the algorithm does. First, we need to build up our auxiliary structure for the ranges. We look at the objects in descending order of value, starting with the object of weight 7 and value 8. This means that if we ever have at least seven units of weight left, we can get 8 value. Our array now looks like this:
Weight: 7
Value: 8
Next, we look at the object of weight 4 and value 7. This means that with four or more units of weight left, we can get value 7:
Weight: 7 4
Value: 8 7
Repeating this for the next item (weight six, value five) does not change the array: if we ever had six or more units of free space left, we would never choose this object; we'd always take the seven-value item of weight four instead. We can tell this since there is already an object in the table whose range includes remaining weight six.
Finally, we look at the last item (value 3, weight 2). This means that if we ever have weight two or more free, we could get 3 units of value. The final array now looks like this:
Weight: 7 4 2
Value: 8 7 3
Finally, we just look at the objects in any order to see what the best option is. When looking at the object of weight 2 and value 3, since the maximum allowed weight is 10, we need to see how much value we can get with at most 10 - 2 = 8 weight. A binary search over the array tells us that this value is 8, so one option would give us 11 value. If we look at the object of weight 6 and value 5, a binary search tells us that with four remaining weight the best we can do would be to get 7 units of value, for a total of 12 value. Repeating this on the next two entries (without pairing the weight-4 object with itself) doesn't turn up anything new, so the optimum found has value 12, which is indeed the correct answer.
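A compact way to try this out is sketched below. It is an illustration under assumptions rather than the exact structure described above: instead of the compressed list of ranges it keeps a prefix-maximum array over the items sorted by weight (which answers the same "best value with weight at most x" query), and it only pairs each item with items earlier in weight order, so an item can never be paired with itself. The function name `best_pair_value` is made up.

```python
from bisect import bisect_right

def best_pair_value(items, max_weight):
    """items: list of (weight, value). Best total value of a pair, or None."""
    by_weight = sorted(items)
    weights = [w for w, _ in by_weight]
    # prefix_best[k] = highest value among the k lightest items.
    prefix_best = [float('-inf')]
    for _, v in by_weight:
        prefix_best.append(max(prefix_best[-1], v))
    best = None
    for i, (w, v) in enumerate(by_weight):
        # Partners must fit in the remaining weight and come before item i.
        k = min(i, bisect_right(weights, max_weight - w))
        if k > 0:
            total = v + prefix_best[k]
            if best is None or total > best:
                best = total
    return best

print(best_pair_value([(2, 3), (6, 5), (4, 7), (7, 8)], 10))  # 12
```

Sorting dominates the cost, so this runs in O(n log n) time and O(n) extra space, matching the bounds stated at the start of this answer.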
Hope this helps!
Here is an O(n) time, O(1) space solution.
Let's call an object x better than an object y if and only if (x is no heavier than y) and (x is no less valuable) and (x is lighter or more valuable). Call an object x first-choice if no object is better than x. There exists an optimal solution consisting either of two first-choice objects, or a first-choice object x and an object y such that only x is better than y.
The main tool is to be able to iterate the first-choice objects from lightest to heaviest (= least valuable to most valuable) and from most valuable to least valuable (= heaviest to lightest). The iterator state is an index into the objects by weight (resp. value) and a max value (resp. min weight) so far.
Each of the following steps is O(n).
During a scan, whenever we encounter an object that is not first-choice, we know an object that's better than it. Scan once and consider these pairs of objects.
For each first-choice object from lightest to heaviest, determine the heaviest first-choice object that it can be paired with, and consider the pair. (All lighter objects are less valuable.) Since the latter object becomes lighter over time, each iteration of the loop is amortized O(1). (See also searching in a matrix whose rows and columns are sorted.)
Code for the unbelievers. Not heavily tested.
from collections import namedtuple
from operator import attrgetter

Item = namedtuple('Item', ('weight', 'value'))
sentinel = Item(float('inf'), float('-inf'))

def firstchoicefrombyweight(byweight):
    # Yields every object together with the most valuable object seen so far.
    bestsofar = sentinel
    for x in byweight:
        if x.value > bestsofar.value:
            bestsofar = x
        yield (x, bestsofar)

def firstchoicefrombyvalue(byvalue):
    # Yields only the first-choice objects, most valuable first.
    bestsofar = sentinel
    for x in byvalue:
        if x.weight < bestsofar.weight:
            bestsofar = x
            yield x

def optimize(items, maxweight):
    byweight = sorted(items, key=attrgetter('weight'))
    byvalue = sorted(items, key=attrgetter('value'), reverse=True)
    maxvalue = float('-inf')
    try:
        i = firstchoicefrombyvalue(byvalue)
        y = next(i)
        for x, z in firstchoicefrombyweight(byweight):
            # Pair a non-first-choice object with the object that is better than it.
            if z is not x and x.weight + z.weight <= maxweight:
                maxvalue = max(maxvalue, x.value + z.value)
            # Move y towards lighter first-choice objects until the pair fits.
            while x.weight + y.weight > maxweight:
                y = next(i)
            if y is x:
                break
            maxvalue = max(maxvalue, x.value + y.value)
    except StopIteration:
        pass
    return maxvalue

items = [Item(1, 1), Item(2, 2), Item(3, 5), Item(3, 7), Item(5, 8)]
for maxweight in range(3, 10):
    print(maxweight, optimize(items, maxweight))
This is similar to the knapsack problem. I will use naming from it (num - weight, val - value).
The essential part:
1. Start with a = 0 and b = n-1, assuming 0 is the index of the heaviest object and n-1 is the index of the lightest object.
2. Increase a until objects a and b satisfy the limit.
3. Compare the current solution with the best solution.
4. Decrease b by one.
5. Go to 2.
Update:
It's the knapsack problem, except there is a limit of 2 items. You basically need to decide how much space you want for the first object and how much for the other. There are n significant ways to split the available space, so the complexity is O(n). Picking the most valuable objects to fit in those spaces can be done without additional cost.

Resources