retrieve closest element from a set of elements - algorithm

I'm experimenting with an idea where I have the following subproblem:
I have a list of size m containing tuples of fixed length n.
[(e11, e12, .., e1n), (e21, e22, .., e2n), ..., (em1, em2, .., emn)]
Now, given some arbitrary tuple (t1, t2, .., tn) that does not belong to the list, I want to find the closest tuple(s) that do belong to the list.
I use the following distance function (Hamming distance, i.e. the number of positions in which the two tuples differ):
def distance(A, B):
    total = 0
    for e1, e2 in zip(A, B):
        total += e1 != e2   # count mismatching positions
    return total
One option is exhaustive search, but this is not sufficient for my problem as the lists are quite large. Another idea I have come up with is to first use k-medoids to cluster the list and retrieve K medoids (cluster centers). For querying, I can determine the closest cluster with K calls to the distance function. Then I can search for the closest tuple within that particular cluster. I think it should work, but I am not completely sure it is correct in cases where the query tuple lies on the edge of a cluster.
I was wondering whether you have a better idea for solving the problem, as my mind is completely blank at the moment; I have a strong feeling that there is a clever way to do it.
Solutions that require precomputing something are fine as long as they bring down the complexity of the query.
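For concreteness, the query step of the clustering idea might look like this (just a sketch; it assumes medoids, a list of K cluster centers, and clusters, the corresponding lists of member tuples, have already been precomputed somehow, e.g. with k-medoids):
def query(t, medoids, clusters):
    # K calls to distance() to find the closest medoid
    best_k = min(range(len(medoids)), key=lambda k: distance(medoids[k], t))
    # exhaustive search, but only inside that one cluster
    return min(clusters[best_k], key=lambda c: distance(c, t))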

You can store a hash table (dictionary/map) that maps from an element (in the tuple) to the tuples it appears in: hash: element -> list<tuple>.
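Building that table might look like this (a sketch; tuples stands for the original list from the question):
from collections import defaultdict

hash_table = defaultdict(list)        # element value -> tuples containing it
for tup in tuples:
    for element in set(tup):          # set() so each tuple is listed once per distinct element
        hash_table[element].append(tup)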
Now, when you have a new "query", you iterate over hash(element) for each element of the query and find the tuple with the maximal number of hits.
Pseudo code (written as runnable Python):
def findMax(query):
    histogram = {}      # candidate tuple -> number of hits
    for element in query:
        # hash_table is the data structure described above
        for candidate in hash_table.get(element, []):
            histogram[candidate] = histogram.get(candidate, 0) + 1   # lazy initialization to 0
    return max(histogram, key=histogram.get)   # key with highest value in histogram
An alternative that does not exactly follow the metric you desired is a k-d tree. The difference is that a k-d tree also takes into consideration the "distance" between the elements (and not only equality/inequality).
k-d trees also require the elements to be comparable.
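For reference, if the tuple entries are numeric and a Euclidean-style metric is acceptable, scipy ships a ready-made k-d tree; this is only an illustration of that alternative (tuples and query_tuple are placeholder names), not of the Hamming metric from the question:
from scipy.spatial import KDTree

tree = KDTree(tuples)                  # precompute once
dist, idx = tree.query(query_tuple)    # nearest neighbour under the Euclidean metric
closest = tuples[idx]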

If your data is big enough, you may want to create some inverted indexes over it.
Say the data consists of m vectors of n elements each.
Data:
0: 1, 2, 3, 4, 5, ...
1: 2, 3, 1, 5, 3, ...
2: 5, 3, 2, 1, 3, ...
3: 1, 2, 1, 5, 3, ...
...
m: m0, ... mn
Then you want to get n indexes like this:
Index0
1: 0, 3
2: 1
5: 2
Index1
2: 0, 3
3: 1, 2
Index2
3: 0
1: 1, 3
2: 2
...
Then you only need to search your indexes to collect the tuples that share a value with the query tuple in some position, and find the closest tuple among those candidates.
def search(query):
    candidates = set()
    for i in range(len(query)):
        value = query[i]
        # indexes[i] maps a value to the rows that hold it in position i
        candidates.update(indexes[i].get(value, []))
    # among the candidates, keep the one with the minimum distance
    best = min(candidates, key=lambda row: distance(data[row], query))
    return best
The heavy process is creating the indexes; once you have built them, the search will be really fast.
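That precomputation step could be sketched like this (assuming data is the list of tuples; build_indexes is just an illustrative name):
from collections import defaultdict

def build_indexes(data):
    n = len(data[0])
    indexes = [defaultdict(list) for _ in range(n)]
    for row, tup in enumerate(data):
        for i, value in enumerate(tup):
            indexes[i][value].append(row)   # position i, value -> rows holding that value
    return indexes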

Related

recursively split line into smaller segments

I have a line. It starts with two indexes, call them 0 and 1, at the outermost points. At any point I can create a new point that bisects two existing ones (there must not already be a point between them). However, when this happens the indexes need to increment. For example, here's a potential series of steps to achieve N=5, since there are five indexes in the result.
(graph)                          (split between)   (iteration #)
< ============================ >
0 1                                    0,1               0
0 1 2                                  1,2               1
0 1 2 3                                0,1               2
0 1 2 3 4
I have two questions:
What pseudocode could be used to find the "split between" values given the iteration number?
How could I prevent the shape from being unbalanced? Are there certain restrictions I should place on the value of N? I don't particularly care what order the splits happen in, but I do want to make sure the result is balanced.
This is an issue I've encountered when developing a video game.
I'm not sure if this is the kind of answer you are looking for, but I see this as a binary tree structure. Every tree node contains its own label and its left and right labels. The root of the tree (level 0) would be (2, 0, 1) (split 2 with 0 on the left and 1 on the right). Every node would be split into two children. The algorithm would go something like this:
At step N, the new node will sit in level floor(log2(N - 1)); pick as its parent the leftmost node with fewer than two children in the level above.
Take the node label T and the left and right labels L and R from that node.
If the node does not have a left child, add a left child node (N, L, T).
If the node already has a left child, add a right child node (N, T, R).
N <- N + 1
For example, at iteration 5 you would have something like this:
Level 0:           (2, 0, 1)
                  /         \
                 /           \
                /             \
Level 1:   (3, 0, 2)       (4, 2, 1)
              /
             /
Level 2:  (5, 0, 3)
Now, to reconstruct the current split, you would do the following:
Initialize a list S <- [0].
For every node (T, L, R) in the tree traversed in postorder:
If the node does not have a left child, append T to S.
If the node does not have a right child, append R to S.
For the previous case, you would have:
S = [0]
(5, 0, 3) -> S = [0, 5, 3]
(3, 0, 2) -> S = [0, 5, 3, 2]
(4, 2, 1) -> S = [0, 5, 3, 2, 4, 1]
(2, 0, 1) -> S = [0, 5, 3, 2, 4, 1]
So the complete split would be [0, 5, 3, 2, 4, 1]. The split would be perfectly balanced only when N = 2^k for some positive integer k. Of course, you can annotate the tree nodes with additional "distance" information if you need to keep track of something like that.
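A small Python sketch of both the construction rule and the reconstruction described above (the dictionary-based node layout and the function names are my own choices):
def build_tree(max_label):
    # nodes[label] = {'L': left label, 'R': right label, 'left': child, 'right': child}
    nodes = {2: {'L': 0, 'R': 1, 'left': None, 'right': None}}
    levels = {0: [2]}                              # level -> node labels, left to right
    for n in range(3, max_label + 1):
        level = (n - 1).bit_length() - 1           # floor(log2(n - 1)), computed exactly
        # parent: leftmost node in the level above with fewer than two children
        parent = next(p for p in levels[level - 1]
                      if nodes[p]['left'] is None or nodes[p]['right'] is None)
        P = nodes[parent]
        if P['left'] is None:
            nodes[n] = {'L': P['L'], 'R': parent, 'left': None, 'right': None}
            P['left'] = n
        else:
            nodes[n] = {'L': parent, 'R': P['R'], 'left': None, 'right': None}
            P['right'] = n
        levels.setdefault(level, []).append(n)
    return nodes

def reconstruct(nodes):
    S = [0]
    def visit(label):                              # postorder traversal
        node = nodes[label]
        if node['left'] is not None:
            visit(node['left'])
        if node['right'] is not None:
            visit(node['right'])
        if node['left'] is None:
            S.append(label)
        if node['right'] is None:
            S.append(node['R'])
    visit(2)
    return S

print(reconstruct(build_tree(5)))   # -> [0, 5, 3, 2, 4, 1], the split from the example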
I agree with jdehesa in that what you are doing does have its similarities with a binary tree. I would recommend looking in using that data structure if you can, since it is highly structured, well-defined, and many great algorithms exist for working with them.
Additionally, as mentioned in the comment section above, a linked list would also be a nice option, since you are adding in a lot of elements. A normal array (which is contiguous in memory) will require you to move many elements over and over again as you insert additional elements, which is slow. A linked list would allow you to add your element anywhere in memory, and then just update a few pointers in the linked list on both sides of where you want to insert it, and be done. No moving things around.
However, if you really just want to put together a working solution using array and aren't concerned with using other data structures, here is the math for the indexing you requested:
Each pair can be listed as (a, b), and we can quickly see b = a + 1. Thus, if you find a, you know b. To get these, you'll need two loops:
iteration := 0
i := 0
while iteration < desired_iterations
    for j = (2 ^ i) - 1; j >= 0 && iteration < desired_iterations; j--
        print j, j + 1
        iteration++
    i++
Where ^ is the exponentiation operator. What we do is find the second-to-last element in the list, (2^i)-1, and count backwards, listing off the indices. We then increment "i" to signify that we've now doubled our array size, and then repeat. If at any point we reach our desired number of iterations, we break out of both loops because we're finished.
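The same loops in Python, as a quick sketch (it just prints the "split between" pairs in order):
def splits(desired_iterations):
    iteration = 0
    i = 0
    while iteration < desired_iterations:
        j = 2 ** i - 1                 # second-to-last index of the current list
        while j >= 0 and iteration < desired_iterations:
            print(j, j + 1)            # the adjacent pair to split between
            iteration += 1
            j -= 1
        i += 1

splits(3)   # -> 0 1 / 1 2 / 0 1, matching the example above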

Dynamic Programming Implementation of Unique subset generation

I came across this problem where we need to find the total number of ways in which you can form a subset summing to n (where n is input by the user). The subset conditions are: the numbers in the subset should be distinct and should be in decreasing order.
Example: Given n = 7, the output is 4 because the possible combinations are (6,1), (5,2), (4,3), (4,2,1). Note that although (4,1,1,1) also adds up to 7, it has repeating numbers, so it is not a valid subset.
I have solved this problem using backtracking, which has exponential complexity. However, I am trying to figure out how we can solve this problem using dynamic programming. The nearest I could come up with is the series for n = 0,1,2,...,10, for which the output is 0,0,0,1,1,2,3,4,5,7,9 respectively. However, I am not able to come up with a general formula to calculate f(n). While going through the internet I came across this kind of integer sequence at https://oeis.org/search?q=0%2C0%2C0%2C1%2C1%2C2%2C3%2C4%2C5%2C7%2C9%2C11%2C14&sort=&language=&go=Search, but I am unable to understand the formula provided there. Any help would be appreciated, as would any other insight that improves on the exponential complexity.
What you call "unique subset generation" is better known as integer partitions with distinct parts. Mathworld has an entry on the Q function, which counts the number of partitions with distinct parts.
However, your function does not count trivial partitions (i.e. n -> (n)), so what you're looking for is actually Q(n) - 1, which generates the sequence that you linked in your question--the number of partitions into at least 2 distinct parts. This answer to a similar question on Mathematics contains an efficient algorithm in Java for computing the sequence up to n = 200, which easily can be adapted for larger values.
Here's a combinatorial explanation of the referenced algorithm:
Let's start with a table of all the subsets of {1, 2, 3}, grouped by their sums:
index (sum)    partitions       count
0              ()               1
1              (1)              1
2              (2)              1
3              (1+2), (3)       2
4              (1+3)            1
5              (2+3)            1
6              (1+2+3)          1
Suppose we want to construct a new table of all of subsets of {1, 2, 3, 4}. Notice that every subset of {1, 2, 3} is also a subset of {1, 2, 3, 4}, so each subset above will appear in our new table. In fact, we can divide our new table into two equally sized categories: subsets which do not include 4, and those that do. What we can do is start from the table above, copy it over, and then "extend" it with 4. Here's the table for {1, 2, 3, 4}:
index (sum)    partitions           count
0              ()                   1
1              (1)                  1
2              (2)                  1
3              (1+2), (3)           2
4              (1+3), [4]           2
5              (2+3), [1+4]         2
6              (1+2+3), [2+4]       2
7              [1+2+4], [3+4]       2
8              [1+3+4]              1
9              [2+3+4]              1
10             [1+2+3+4]            1
All the subsets that include 4 are surrounded by square brackets, and they are formed by adding 4 to exactly one of the old subsets which are surrounded by parentheses. We can repeat this process and build the table for {1,2,..,5}, {1,2,..,6}, etc.
But we don't actually need to store the actual subsets/partitions; we just need the counts for each index/sum. For example, if we have the table for {1, 2, 3} with only counts, we can build the count table for {1, 2, 3, 4} by taking each (index, count) pair from the old table and adding count to the current count for index + 4. The idea is that, say, if there are two subsets of {1, 2, 3} that sum to 3, then adding 4 to each of those two subsets will make two new subsets summing to 7.
With this in mind, here's a Python implementation of this process:
def c(n):
    counts = [1]
    for k in range(1, n + 1):
        new_counts = counts[:] + [0]*k
        for index, count in enumerate(counts):
            new_counts[index + k] += count
        counts = new_counts
    return counts
We start with the table for {}, which has just one subset--the empty set-- which sums to zero. Then we build the table for {1}, then for {1, 2}, ..., all the way up to {1,2,..,n}. Once we're done, we'll have counted every partition for n, since no partition of n can include integers larger than n.
Now, we can make 2 major optimizations to the code:
Limit the table to only include entries up to n, since that's all we're interested in. If index + k exceeds n, then we just ignore it. We can even preallocate space for the final table up front, rather than growing bigger tables on each iteration.
Instead of building a new table from scratch on each iteration, if we carefully iterate over the old table backwards, we can actually update it in-place without messing up any of the new values.
With these optimizations, you effectively have the same algorithm from the Mathematics answer that was referenced earlier, which was motivated by generating functions. It runs in O(n^2) time and takes only O(n) space. Here's what it looks like in Python:
def c(n):
    counts = [1] + [0]*n
    for k in range(1, n + 1):
        for i in reversed(range(k, n + 1)):
            counts[i] += counts[i - k]
    return counts

def Q(n):
    "-> the number of subsets of {1,..,n} that sum to n."
    return c(n)[n]

def f(n):
    "-> the number of subsets of {1,..,n} of size >= 2 that sum to n."
    return Q(n) - 1
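As a quick sanity check against the example from the question:
print(f(7))   # -> 4: (6,1), (5,2), (4,3), (4,2,1)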

Maximum sum of n intervals in a sequence

I'm doing some programming "kata", which are skill-building exercises for programming (analogous to kata in martial arts). I want to learn how to solve algorithms like these in shorter amounts of time, so I need to develop my knowledge of the patterns. Eventually I want to solve with increasingly efficient time complexities (O(n), O(n^2), etc.), but for now I'm fine with figuring out a solution of any efficiency.
The problem:
Given arr[10] = [4, 5, 0, 2, 5, 6, 4, 0, 3, 5]
Given various segment lengths, for example one 3-length segment and two 2-length segments, find the optimal positions of (or the maximum sum contained by) the segments, without the segments overlapping.
For example, solution to this array and these segments is 2, because:
{4 5} 0 2 {5 6 4} 0 {3 5}
What I have tried before posting on stackoverflow.com:
I've read through:
Algorithm to find maximum coverage of non-overlapping sequences. (I.e., the Weighted Interval Scheduling Prob.)
algorithm to find longest non-overlapping sequences
and I've watched MIT opencourseware and read about general steps for solving complex problems with dynamic programming, and completed a dynamic programming tutorial for finding Fibonacci numbers with memoization. I thought I could apply memoization to this problem, but I haven't found a way yet.
The theme of dynamic programming is to break the problem down into sub-problems which can be iterated to find the optimal solution.
What I have come up with (in an OO way) is
foreach (segment) {
    find the greatest-sum interval with the length of this segment
}
This produces incorrect results, because the segments will not always fit with this greedy approach. For example:
Given arr[7] = [0, 3, 5, 5, 5, 1, 0] and two 3-length segments,
The first segment will take 5, 5, 5, leaving no room for the second segment. Ideally I should memoize this scenario and try the algorithm again, this time avoiding 5, 5, 5 as a first pick. Is this the right path?
How can I approach this in a "dynamic programming" way?
If you place the first segment, you get two smaller sub-arrays: placing one or both of the two remaining segments into one of these sub-arrays is a sub-problem of just the same form as the original one.
So this suggests a recursion: you place the first segment, then try out the various combinations of assigning remaining segments to sub-arrays, and maximize over those combinations. Then you memoize: the sub-problems all take an array and a list of segment sizes, just like the original problem.
I'm not sure this is the best algorithm but it is the one suggested by a "direct" dynamic programming approach.
EDIT: In more detail:
The arguments to the valuation function should have two parts: one is a pair of numbers which represent the sub-array being analysed (initially [0,6] in this example) and the second is a multi-set of numbers representing the lengths of the segments to be allocated ({3,3} in this example). Then in pseudo-code you do something like this:
valuation(array_ends, the_segments):
    if sum of the_segments > array_ends[1] - array_ends[0]:
        return -infinity
    segment_length = length of chosen segment from the_segments
    remaining_segments = the_segments with chosen segment removed
    best_option = 0
    for segment_placement = array_ends[0] to array_ends[1] - segment_length:
        value1 = value of placing the chosen segment at segment_placement
        new_array1 = [array_ends[0], segment_placement]
        new_array2 = [segment_placement + segment_length, array_ends[1]]
        for each partition of remaining_segments into seg1 and seg2:
            sub_value1 = valuation(new_array1, seg1)
            sub_value2 = valuation(new_array2, seg2)
            if value1 + sub_value1 + sub_value2 > best_option:
                best_option = value1 + sub_value1 + sub_value2
    return best_option
This code (modulo off-by-one errors and typos) calculates the valuation, but it calls the valuation function more than once with the same arguments. So the idea of the memoization is to cache those results and avoid re-traversing equivalent parts of the tree. We can do this just by wrapping the valuation function:
memoized_valuation(args):
    if args in memo_dictionary:
        return memo_dictionary[args]
    else:
        result = valuation(args)
        memo_dictionary[args] = result
        return result
Of course, you need to change the recursive call now to call memoized_valuation.
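Putting the pieces together, here is a rough Python sketch of that memoized recursion (max_segment_sum is my own name; it always places the first remaining segment and brute-forces the left/right split of the rest, exactly as in the pseudo-code, using lru_cache as the memo dictionary):
from functools import lru_cache
from itertools import combinations

def max_segment_sum(arr, segments):
    prefix = [0]
    for x in arr:
        prefix.append(prefix[-1] + x)          # prefix sums: a segment's value in O(1)

    def seg_sum(start, length):
        return prefix[start + length] - prefix[start]

    @lru_cache(maxsize=None)
    def valuation(lo, hi, segs):
        # segs: sorted tuple of remaining segment lengths to place inside arr[lo:hi]
        if not segs:
            return 0
        if sum(segs) > hi - lo:
            return float('-inf')               # the segments cannot all fit here
        length, rest = segs[0], segs[1:]
        best = float('-inf')
        for start in range(lo, hi - length + 1):
            for k in range(len(rest) + 1):
                for left in set(combinations(rest, k)):
                    right = list(rest)
                    for s in left:
                        right.remove(s)
                    best = max(best,
                               seg_sum(start, length)
                               + valuation(lo, start, tuple(sorted(left)))
                               + valuation(start + length, hi, tuple(sorted(right))))
        return best

    return valuation(0, len(arr), tuple(sorted(segments)))

print(max_segment_sum([4, 5, 0, 2, 5, 6, 4, 0, 3, 5], [3, 2, 2]))
# -> 32, the total of the placement {4 5} 0 2 {5 6 4} 0 {3 5} shown in the question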

Disperse Duplicates in an Array

Source : Google Interview Question
Write a routine to ensure that identical elements in the input are maximally spread in the output.
Basically, we need to place the identical elements in such a way that the TOTAL spreading is as large as possible.
Example:
Input: {1,1,2,3,2,3}
Possible Output: {1,2,3,1,2,3}
Total dispersion = difference between positions of the 1's + 2's + 3's = (4-1) + (5-2) + (6-3) = 9.
I am not at all sure whether there is an optimal polynomial-time algorithm available for this. Also, no other detail is provided for the question other than this.
What I thought is: calculate the frequency of each element in the input, then arrange them in the output, one distinct element at a time, until all the frequencies are exhausted.
I am not sure of my approach.
Any approaches/ideas, people?
I believe this simple algorithm would work:
count the number of occurrences of each distinct element.
make a new list
add one instance of all elements that occur more than once to the list (order within each group does not matter)
add one instance of all unique elements to the list
add one instance of all elements that occur more than once to the list
add one instance of all elements that occur more than twice to the list
add one instance of all elements that occur more than thrice to the list
...
Now, this will intuitively not give a good spread:
for {1, 1, 1, 1, 2, 3, 4} ==> {1, 2, 3, 4, 1, 1, 1}
for {1, 1, 1, 2, 2, 2, 3, 4} ==> {1, 2, 3, 4, 1, 2, 1, 2}
However, I think this is the best spread you can get given the scoring function provided.
Since the dispersion score counts the sum of the distances instead of the squared sum of the distances, you can have several duplicates close together, as long as you have a large gap somewhere else to compensate.
for a sum-of-squared-distances score, the problem becomes harder.
Perhaps the interview question hinged on the candidate recognizing this weakness in the scoring function?
In Perl:
@a = (9,9,9,2,2,2,1,1,1);
then make a hash table of the counts of the different numbers in the list, like a frequency table:
map { $x{$_}++ } @a;
then repeatedly walk through all the keys found, with the keys in a known order, and add the appropriate number of individual numbers to an output list until all the keys are exhausted:
@r = ();
$g = 1;
while ( $g == 1 ) {
    $g = 0;
    for my $n (sort keys %x) {
        if ($x{$n} > 0) {    # take one instance of every value still remaining
            push @r, $n;
            $x{$n}--;
            $g = 1;
        }
    }
}
I'm sure that this could be adapted to any programming language that supports hash tables.
Python code for the algorithm suggested by Vorsprung and HugoRune:
from collections import Counter, defaultdict

def max_spread(data):
    cnt = Counter(data)
    res, num = [], list(cnt)
    while len(cnt) > 0:
        for i in num:
            if cnt[i] > 0:
                res.append(i)
                cnt[i] -= 1
                if cnt[i] == 0:
                    del cnt[i]
    return res

def calc_spread(data):
    d = defaultdict(list)
    for i, v in enumerate(data):
        d[v].append(i)                     # value -> positions it occupies
    return sum(max(x) - min(x) for x in d.values())
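A quick check against the example from the question:
data = [1, 1, 2, 3, 2, 3]
out = max_spread(data)
print(out, calc_spread(out))   # -> [1, 2, 3, 1, 2, 3] 9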
HugoRune's answer takes some advantage of the unusual scoring function, but we can actually do even better: suppose there are d distinct non-unique values; then the only thing required for a solution to be optimal is that the first d values in the output consist of these values in any order, and likewise the last d values in the output consist of these values in any (i.e. possibly a different) order. (This implies that all unique numbers appear between the first and last instance of every non-unique number.)
The relative order of the first copies of non-unique numbers doesn't matter, and likewise nor does the relative order of their last copies. Suppose the values 1 and 2 both appear multiple times in the input, and that we have built a candidate solution obeying the condition I gave in the first paragraph that has the first copy of 1 at position i and the first copy of 2 at position j > i. Now suppose we swap these two elements. Element 1 has been pushed j - i positions to the right, so its score contribution will drop by j - i. But element 2 has been pushed j - i positions to the left, so its score contribution will increase by j - i. These cancel out, leaving the total score unchanged.
Now, any permutation of elements can be achieved by swapping elements in the following way: swap the element in position 1 with the element that should be at position 1, then do the same for position 2, and so on. After the ith step, the first i elements of the permutation are correct. We know that every swap leaves the scoring function unchanged, and a permutation is just a sequence of swaps, so every permutation also leaves the scoring function unchanged. This is true in particular for the d elements at both ends of the output array.
When 3 or more copies of a number exist, only the position of the first and last copy contribute to the distance for that number. It doesn't matter where the middle ones go. I'll call the elements between the 2 blocks of d elements at either end the "central" elements. They consist of the unique elements, as well as some number of copies of all those non-unique elements that appear at least 3 times. As before, it's easy to see that any permutation of these "central" elements corresponds to a sequence of swaps, and that any such swap will leave the overall score unchanged (in fact it's even simpler than before, since swapping two central elements does not even change the score contribution of either of these elements).
This leads to a simple O(n log n) algorithm (or O(n) if you use bucket sort for the first step) to generate a solution array Y from a length-n input array X (a Python sketch of these steps follows the list):
Sort the input array X.
Use a single pass through X to count the number of distinct non-unique elements. Call this d.
Set i, j and k to 0.
While i < n:
    If i+1 < n and X[i+1] == X[i], we have a non-unique element:
        Set Y[j] = Y[n-j-1] = X[i].
        Increment i twice, and increment j once.
        While i < n and X[i] == X[i-1]:
            Set Y[d+k] = X[i].
            Increment i and k.
    Otherwise we have a unique element:
        Set Y[d+k] = X[i].
        Increment i and k.
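Here is a minimal Python sketch of those steps (the extra boundary checks and the name disperse are my own additions; the input only needs to be sortable):
def disperse(X):
    X = sorted(X)
    n = len(X)
    # d = number of distinct values that occur more than once
    d = sum(1 for i in range(n - 1)
            if X[i] == X[i + 1] and (i == 0 or X[i] != X[i - 1]))
    Y = [None] * n
    i = j = k = 0
    while i < n:
        if i + 1 < n and X[i + 1] == X[i]:
            # non-unique value: first copy to the front block, last copy to the back block
            Y[j] = Y[n - j - 1] = X[i]
            i += 2
            j += 1
            while i < n and X[i] == X[i - 1]:
                Y[d + k] = X[i]          # any further copies go into the central block
                i += 1
                k += 1
        else:
            Y[d + k] = X[i]              # unique value: central block
            i += 1
            k += 1
    return Y

print(disperse([1, 1, 2, 3, 2, 3]))   # e.g. [1, 2, 3, 3, 2, 1], total dispersion 9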

Finding size of max independent set in binary tree - why faulty "solution" doesn't work?

Here is a link to a similar question with a good answer: Java Algorithm for finding the largest set of independent nodes in a binary tree.
I came up with a different answer, but my professor says it won't work and I'd like to know why (he doesn't answer email).
The question:
Given an array A with n integers, its indexes start with 0 (i.e., A[0], A[1], …, A[n-1]). We can interpret A as a binary tree in which the two children of A[i] are A[2i+1] and A[2i+2], and the value of each element is the node weight of the tree. In this tree, we say that a set of vertices is "independent" if it does not contain any parent-child pair. The weight of an independent set is just the summation of all weights of its elements. Develop an algorithm to calculate the maximum weight of any independent set.
The answer I came up with used the following two assumptions about independent sets in a binary tree:
All nodes on the same level are independent from each other.
All nodes on alternating levels are independent from each other (there are no parent/child relations)
Warning: I came up with this during my exam, and it isn't pretty, but I just want to see if I can argue for at least partial credit.
So, why can't you just build two independent sets (one for odd levels, one for even levels)?
For each set, if any of the weights are non-negative, sum the non-negative elements (discarding the negative ones, since they won't contribute to a largest-weight set) to get that set's largest weight.
If the weights in the set are all negative (or equal to 0), sort it and return the negative number closest to 0 for the weight.
Compare the weights of the largest independent set from each of the two sets and return it as the final solution.
My professor claims it won't work, but I don't see why. Why won't it work?
Interjay has noted why your answer is incorrect. The problem can be solved with a recursive algorithm find-max-independent which, given a binary tree, considers two cases:
What is the max-independent set given that the root node is included?
What is the max-independent set given that the root node is not included?
In case 1, since the root node is included, neither of its children can be. Thus we sum the values of find-max-independent over the grandchildren of the root, add the value of the root itself, and return that.
In case 2, we sum the values of find-max-independent over the children nodes, if any (each child's subtree contributes independently to the total).
The algorithm may look something like this (in python):
def find_max_independent(A):
    N = len(A)
    def children(i):
        for n in (2*i + 1, 2*i + 2):
            if n < N:
                yield n
    def gchildren(i):
        for child in children(i):
            for gchild in children(child):
                yield gchild
    memo = [None]*N
    def rec(root):
        "finds max independent set in the subtree rooted at root; memoizes results"
        assert root < N
        if memo[root] is not None:
            return memo[root]
        # option 'root not included': sum the best independent sets of the child subtrees
        without_root = sum(rec(child) for child in children(root))
        # option 'root included': possibly pick the root,
        # plus the sum of the best values for the grandchildren
        with_root = max(0, A[root]) + sum(rec(gchild) for gchild in gchildren(root))
        val = max(with_root, without_root)
        assert val >= 0
        memo[root] = val
        return val
    return rec(0) if N > 0 else 0
Some test cases illustrated:
tests = [
    [[1, 2, 3, 4, 5, 6], 16],      #1
    [[-100, 2, 3, 4, 5, 6], 15],   #2
    [[1, 200, 3, 4, 5, 6], 206],   #3
    [[1, 2, 3, -4, 5, -6], 8],     #4
    [[], 0],
    [[-1], 0],
]
for A, expected in tests:
    actual = find_max_independent(A)
    print("test: {}, expected: {}, actual: {} ({})".format(A, expected, actual, expected == actual))
Sample output:
test: [1, 2, 3, 4, 5, 6], expected: 16, actual: 16 (True)
test: [-100, 2, 3, 4, 5, 6], expected: 15, actual: 15 (True)
test: [1, 200, 3, 4, 5, 6], expected: 206, actual: 206 (True)
test: [1, 2, 3, -4, 5, -6], expected: 8, actual: 8 (True)
test: [], expected: 0, actual: 0 (True)
test: [-1], expected: 0, actual: 0 (True)
(Tree illustrations of test cases 1-4 omitted.)
The complexity of the memoized algorithm is O(n), since rec is evaluated at most once for each node. This is a top-down dynamic programming solution using depth-first search.
(Test case illustrations courtesy of leetcode's interactive binary tree editor)
Your algorithm doesn't work because the set of nodes it returns will be either all from odd levels, or all from even levels. But the optimal solution can have nodes from both.
For example, consider a tree where all weights are 0 except for two nodes which have weight 1. One of these nodes is at level 1 and the other is at level 4. The optimal solution will contain both these nodes and have weight 2. But your algorithm will only give one of these nodes and have weight 1.
