2d Tree Nearest Neighbor Algorithm Clarification

I am trying to implement a recursive nearest neighbour algorithm for a 2d-Tree.
Recursion (and unwinding recursion) is still kind of confusing for me and the best pseudocode I have found is from this StackOverflow question:
2D KD Tree and Nearest Neighbour Search
However, the answer uses a "median" value, which I am not sure how to compute. Also, the Wikipedia article on k-d trees has nearest-neighbour pseudocode that does not use a median value.
I would like to know if it is possible to construct a recursive version of the nearest-neighbour algorithm without using a median value. If anyone can provide me with pseudocode for this, I will be grateful.

If you really want to avoid using a median, you can use the mean instead. Here is the simple approach:
Example: What is the mean of these numbers?
6, 11, 7
Add the numbers: 6 + 11 + 7 = 24
Divide by how many numbers (there are 3 numbers): 24 / 3 = 8
The Mean is 8
However, I highly recommend you go for the median, since the dimensions allow it in your case.
Example: find the Median of 12, 3 and 5
Put them in order:
3, 5, 12
The middle number is 5, so the median is 5.
Source
You do not really need to sort them. Partial sorting is enough, by using Quickselect for example.
In C++, for example, you could use nth_element() to efficiently find the median. You can see my question here, where I needed the median for general dimensions. In the 2D case, it can surely be simplified.
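To make that concrete, here is a rough Python sketch (names and structure are my own, not taken from either linked source): the tree is built by sorting the points on the current axis to pick the median, and the search recurses into the side containing the target first, then, while unwinding, checks the other side only if the splitting plane is closer than the best distance found so far.

    import math

    class Node:
        def __init__(self, point, axis, left=None, right=None):
            self.point = point      # (x, y) tuple
            self.axis = axis        # 0 = split on x, 1 = split on y
            self.left = left
            self.right = right

    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    def build(points, depth=0):
        # Build a 2d-tree by splitting on the median of the current axis.
        if not points:
            return None
        axis = depth % 2
        points = sorted(points, key=lambda p: p[axis])   # Quickselect / nth_element would avoid the full sort
        mid = len(points) // 2
        return Node(points[mid], axis,
                    build(points[:mid], depth + 1),
                    build(points[mid + 1:], depth + 1))

    def nearest(node, target, best=None):
        # Recursive nearest-neighbour search; returns the closest stored point.
        if node is None:
            return best
        if best is None or dist(target, node.point) < dist(target, best):
            best = node.point
        diff = target[node.axis] - node.point[node.axis]
        near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
        best = nearest(near, target, best)        # descend the side containing the target
        if abs(diff) < dist(target, best):        # unwinding: the other side may still hold a closer point
            best = nearest(far, target, best)
        return best

For example, nearest(build([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]), (9, 2)) returns (8, 1).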

Related

Partition matrix to minimize variance of parts

I have a matrix of real numbers and I would like to find a partition of this matrix such that the both the number of parts and the variance of the numbers in each part are minimized. Intuitively, I want as few parts as possible, but I also want all the numbers within any given part to be close together.
More formally, I suppose for the latter I would find, for each part, the variance of the numbers in that part, and then take the average of those variances over all the parts. This would be one component of the "score" for a given solution; the other component would be, for instance, the total number of elements in the matrix minus the number of parts in the partition, so that fewer parts would lead to a higher value. The final score for the solution would be a weighted average of the two components, and the best solution is the one with the highest score.
Obviously a lot of this is heuristic: I need to decide how to balance the number of parts versus the variances. But I'm stuck for even a general approach to the problem.
For instance, given the following simple matrix:
10, 11, 12, 20, 21
8, 13, 9, 22, 23
25, 23, 24, 26, 27
It would be a reasonable solution to partition into the following submatrices:
10, 11, 12 | 20, 21
 8, 13,  9 | 22, 23
-----------+--------
25, 23, 24 | 26, 27
Partitioning is only allowed by slicing vertically and horizontally.
Note that I don't need the optimal solution, I just need an approach to get a "good" solution. Also, these matrices are several hundred by several hundred, so brute forcing it is probably not a reasonable solution, unless someone can propose a good way to pare down the search space.
I think you'd be better off by starting with a simpler problem. Let's call this
Problem A: given a fixed number of vertical and/or horizontal partitions, where should they go to minimize the sum of variances (or perhaps some other measure of variation, such as the sum of ranges within each block).
I'd suggest using a dynamic programming formulation for problem A.
Once you have that under control, then you can deal with
Problem B: find the best trade-off between variation and the number of vertical and horizontal partitions.
Obviously, you can reduce the variance to 0 by putting each element into its own block. In general, problem B requires you to solve problem A for each choice of vertical and horizontal partition counts that is considered.
To use a dynamic programming approach for problem B, you would have to formulate an objective function that encapsulates the trade-off you seek. I'm not sure how feasible this is, so I'd suggest looking for different approaches.
As it stands, problem B is a 2D problem. You might find some success looking at 2D clustering algorithms. An alternative might be possible if it can be reformulated as a 1D problem: trading off variation against the number of blocks (instead of separate vertical and horizontal partition counts). Then you could use something like the Jenks natural breaks classification method to decide where to draw the line(s).
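To make the 1D direction concrete, here is a rough sketch of a Jenks-style dynamic program (my own illustration, O(n^2 * k)); it places a fixed number of breaks in a sorted sequence so as to minimise the total within-block sum of squared deviations. The 2D cut-placement problem is the part you would still have to work out.

    def jenks_breaks(values, k):
        # Partition sorted values into k contiguous classes minimising the
        # total within-class sum of squared deviations.
        xs = sorted(values)
        n = len(xs)
        assert 1 <= k <= n
        # prefix sums so each block cost is O(1)
        ps = [0.0] * (n + 1)
        ps2 = [0.0] * (n + 1)
        for i, x in enumerate(xs, 1):
            ps[i] = ps[i - 1] + x
            ps2[i] = ps2[i - 1] + x * x

        def sse(a, b):  # cost of the block xs[a:b]
            s, s2, m = ps[b] - ps[a], ps2[b] - ps2[a], b - a
            return s2 - s * s / m

        INF = float("inf")
        dp = [[INF] * (k + 1) for _ in range(n + 1)]
        cut = [[0] * (k + 1) for _ in range(n + 1)]
        dp[0][0] = 0.0
        for j in range(1, k + 1):
            for i in range(j, n + 1):
                for m in range(j - 1, i):
                    c = dp[m][j - 1] + sse(m, i)
                    if c < dp[i][j]:
                        dp[i][j], cut[i][j] = c, m
        # walk the cut table back to recover the class boundaries
        blocks, i = [], n
        for j in range(k, 0, -1):
            blocks.append((cut[i][j], i))
            i = cut[i][j]
        return [xs[a:b] for a, b in reversed(blocks)]

For instance, jenks_breaks([10, 11, 12, 20, 21], 2) gives [[10, 11, 12], [20, 21]], matching the vertical cut in the first row of the example matrix.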
Anyway, this answer clearly doesn't give you a working algorithm. But I hope that it does at least provide an approach (which is all you asked for :)).

Indexing for similarity search

I have about 100M numeric vectors (MinHash fingerprints); each vector contains 100 integers between 0 and 65536, and I'm trying to do a fast similarity search against this database of fingerprints using Jaccard similarity, i.e. given a query vector (e.g. [1, 0, 30, 9, 42, ...]), find the ratio of intersection to union of this query set against the database of 100M sets.
The requirement is to return k "nearest neighbors" of the query vector in <1 sec (not including indexing/File IO time) on a laptop. So obviously some kind of indexing is required, and the question is what would be the most efficient way to approach this.
Notes:
I thought of using SimHash, but in this case I actually need to know the size of the intersection of the sets to identify containment rather than pure similarity/resemblance, and SimHash would lose that information.
I've tried using a simple locality-sensitive hashing technique as described in ch. 3 of Jeffrey Ullman's book, by dividing each vector into 20 "bands" or snippets of length 5, converting these snippets into strings (e.g. [1, 2, 45, 2, 3] -> "124523") and using these strings as keys in a hash table, where each key contains "candidate neighbors". But the problem is that it creates too many candidates for some of these snippets, and changing the number of bands doesn't help.
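For reference, the banding scheme described above looks roughly like this (a sketch with the 20x5 layout from the question; min_hits is an extra knob I'm adding, since requiring a candidate to collide in more than one band is one way to thin out the candidate lists):

    from collections import defaultdict

    def band_keys(vec, bands=20, rows=5):
        # Split a MinHash vector into `bands` snippets of `rows` values each
        # and turn each snippet (plus its band index) into a hashable key.
        assert len(vec) == bands * rows
        return [(b, tuple(vec[b * rows:(b + 1) * rows])) for b in range(bands)]

    def build_index(fingerprints, bands=20, rows=5):
        # Map each band key to the ids of the fingerprints that produced it.
        index = defaultdict(set)
        for fid, vec in enumerate(fingerprints):
            for key in band_keys(vec, bands, rows):
                index[key].add(fid)
        return index

    def candidates(index, query, bands=20, rows=5, min_hits=1):
        # Ids that collide with the query in at least `min_hits` bands;
        # these still need an exact Jaccard/intersection check afterwards.
        hits = defaultdict(int)
        for key in band_keys(query, bands, rows):
            for fid in index.get(key, ()):
                hits[fid] += 1
        return {fid for fid, h in hits.items() if h >= min_hits}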
I might be a bit late, but I would suggest IVFADC indexing by Jegou et al.: Product Quantization for Nearest Neighbor Search
It works for L2 Distance/dot product similarity measures and is a bit complex, but it's particularly efficient in terms of both time and memory.
It is also implemented in the FAISS library for similarity search, so you could also take a look at that.
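If you go the FAISS route, an IVF+PQ index is only a few lines to set up. A sketch (the data here is random stand-in data, and nlist, m, nbits and nprobe are illustrative values you would have to tune; FAISS expects float32 vectors and ranks by L2/inner product, so you would still re-rank the returned candidates with the exact Jaccard computation):

    import numpy as np
    import faiss  # e.g. pip install faiss-cpu

    d = 100                                  # MinHash vector length
    xb = np.random.randint(0, 65536, size=(100_000, d)).astype("float32")  # stand-in data

    nlist, m, nbits = 1024, 20, 8            # coarse cells, PQ sub-vectors (m must divide d), bits per code
    quantizer = faiss.IndexFlatL2(d)         # coarse quantizer used to build the inverted lists
    index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

    index.train(xb)                          # learn the coarse centroids and PQ codebooks
    index.add(xb)
    index.nprobe = 32                        # cells visited per query (speed/recall trade-off)

    D, I = index.search(xb[:5], 10)          # distances and row ids of the 10 nearest neighbours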
One way to go about this is the following:
(1) Arrange the vectors into a tree (a radix tree).
(2) Query the tree with a fuzzy criterion; in other words, a match is when the difference in values at each node of the tree is within a threshold.
(3) From (2), generate a subtree that contains all the matching vectors.
(4) Now repeat process (2) on the subtree with a smaller threshold.
Continue until the subtree has K items. If it ends up with too few items, take the previous subtree, calculate the Jaccard distance for each of its members, and sort to eliminate the worst matches until only K items are left.
Answering my own question after 6 years: there is now a benchmark for approximate nearest neighbor search with many algorithms that solve this problem: https://github.com/erikbern/ann-benchmarks. The current winner is "Hierarchical Navigable Small World graphs": https://github.com/nmslib/hnswlib
You can use off-the-shelf similarity search services such as AWS-ES or Pinecone.io.

Algorithm for selecting n vectors out of a set while minimizing cost

assuming we have:
set U of n-dimensional vectors (vector v = < x1,x2 ... ,xn >)
constraint n-dimensional vector c = < x1...xn >
n-dimensional vector of weights w = < x1...xn >
integer S
I need an algorithm that selects S vectors from U into a set R while minimizing the function cost(R):
cost(R) = sum(abs(c-sumVectors(R))*w)
(sumVectors is a function that sums vectors element-wise, e.g. sumVectors({<1, 2>; <3, 4>}) = <4, 6>, while sum(<1, 2, 3>) returns the scalar 6)
The solution does not have to be optimal. I just need to get a best guess i can get in preset time.
Any idea where to start? (Preferably something faster/smarter than genetic algorithms)
This is an optimization problem. Since you don't need the optimal solution, you can try a stochastic optimization method, e.g. hill climbing, in which you start with a random solution (a random choice of R) and look at the set of neighboring solutions (adding or removing one of the components of the current solution) for those that are better with respect to the cost function.
To get better solution, you can also add Simulated Annealing to your hill climbing search. The idea is that in some cases, it's necessary to move to a worse solution and then arrive at a better one later. Simulated Annealing works better because it allows a move for a worse solution near the beginning of the process. The algorithm becomes less likely to allow a worse solution as the process goes on.
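In outline, the hill-climbing loop for your cost function looks like this (a toy sketch of my own; the gist linked below is a fuller version, and simulated annealing would replace the strict "accept only improvements" test with a temperature-dependent acceptance probability):

    import random

    def cost(R, U, c, w):
        # cost(R) = sum(abs(c - sumVectors(R)) * w), as defined in the question
        s = [sum(U[i][d] for i in R) for d in range(len(c))]
        return sum(abs(c[d] - s[d]) * w[d] for d in range(len(c)))

    def hill_climb(U, c, w, S, iters=10_000):
        # Keep S indices into U; repeatedly swap one chosen index for an
        # unchosen one and accept the move only if the cost drops.
        R = set(random.sample(range(len(U)), S))
        best = cost(R, U, c, w)
        for _ in range(iters):
            out = random.choice(tuple(R))
            inn = random.choice([i for i in range(len(U)) if i not in R])
            trial = (R - {out}) | {inn}
            t = cost(trial, U, c, w)
            if t < best:            # annealing variant: also accept worse moves with prob exp((best - t) / T)
                R, best = trial, t
        return sorted(R), best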
I paste some sample hill climbing python code to solve your problem here:
https://gist.github.com/921f398d61ad351ac3d6
In my sample code, R always holds a list of the index into U, and I use euclidean distance to compare the similarity between neighbors. Certainly you can use other distance functions that satisfy your own needs. Also note in the code, I am getting neighbors on the fly. If you have a large pool of vectors in U, you might want to cache the pre-computed neighbors or even consider locality sensitive hashing to avoid O(n^2) comparison. Simulated Annealing can be added onto the above code.
The result of one random run is shown below.
I use only 20 vectors in U, S=10, so that I can compare the result with an optimal solution.
The hill-climbing process stops at the 4th step, when replacing just one element with one of its k nearest neighbors no longer yields a better solution.
I also run with an exhaustive search which iterates all possible combinations. You can see that the hill-climbing result is pretty good compared with the exhaustive approach. It takes only 4 steps to get the relatively small cost (a local minimum though) which takes the exhaustive search more than 82K steps to beat it.
initial R [1, 3, 4, 5, 6, 11, 13, 14, 15, 17]
hill-climbing cost at step 1: 91784
hill-climbing cost at step 2: 89574
hill-climbing cost at step 3: 88664
hill-climbing cost at step 4: 88503
exhaustive search cost at step 1: 94165
exhaustive search cost at step 2: 93888
exhaustive search cost at step 4: 93656
exhaustive search cost at step 5: 93274
exhaustive search cost at step 10: 92318
exhaustive search cost at step 44: 92089
exhaustive search cost at step 50: 91707
exhaustive search cost at step 84: 91561
exhaustive search cost at step 99: 91329
exhaustive search cost at step 105: 90947
exhaustive search cost at step 235: 90718
exhaustive search cost at step 255: 90357
exhaustive search cost at step 8657: 90271
exhaustive search cost at step 8691: 90129
exhaustive search cost at step 8694: 90048
exhaustive search cost at step 19637: 90021
exhaustive search cost at step 19733: 89854
exhaustive search cost at step 19782: 89622
exhaustive search cost at step 19802: 89261
exhaustive search cost at step 20097: 89032
exhaustive search cost at step 20131: 88890
exhaustive search cost at step 20134: 88809
exhaustive search cost at step 32122: 88804
exhaustive search cost at step 32125: 88723
exhaustive search cost at step 32156: 88581
exhaustive search cost at step 69336: 88506
exhaustive search cost at step 82628: 88420
You're going to need to check the costs of all possible sets R and minimise. If you choose vectors in a stepwise fashion, minimising cost at each addition, you may not find the set with minimum cost. If the set U of vectors is very, very large and computation is too slow, you may be forced to use a stepwise method.
Your problem is essentially a combinatoric optimisation one. These are hard to solve, but I can offer a couple of suggestions. They're based around the idea that you can't explore all combinations, so you're constrained to exploring in the vicinity of greedily optimal solutions.
There's a very general method called beam search, a heuristic method that essentially modifies best-first search to work with limited memory (the beam width). It relies on the notion that for any given partial solution, you can calculate the score associated with adding some new member to the set, as well as the score for the current set (since you have an objective function, that's fine). The idea is that you then start with the empty set and continually select the n best next states for every state on the stack; when all are expanded, you throw away all but the n best states on the stack and repeat. This will get you n possible solutions, and you can pick the highest scoring one.
This may not work, however, as the particulars of your objective function will make this pick vectors close to the constraint immediately, and then (after some number of steps depending on the relative scales of your cost vectors and component vectors) look for small vectors to reduce the difference. If so, you could use the solution from this method to initialise a random walk/simulated annealing strategy (would allow you to randomly add or remove from the set) to look for better solutions close to the solution you obtained with the beam search.
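A skeleton of that beam search might look like the following (a sketch only; it scores partial sets with the same cost function, which is exactly what produces the bias described above, and the cost() helper from the question's definition is repeated for completeness):

    def cost(R, U, c, w):
        # cost(R) = sum(abs(c - sumVectors(R)) * w)
        s = [sum(U[i][d] for i in R) for d in range(len(c))]
        return sum(abs(c[d] - s[d]) * w[d] for d in range(len(c)))

    def beam_search(U, c, w, S, beam_width=50):
        # Grow subsets one vector at a time, keeping only the beam_width
        # cheapest partial subsets at each size.
        beam = [frozenset()]
        for _ in range(S):
            expanded = {R | {i} for R in beam for i in range(len(U)) if i not in R}
            beam = sorted(expanded, key=lambda R: cost(R, U, c, w))[:beam_width]
        best = beam[0]
        return sorted(best), cost(best, U, c, w)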

15 Puzzle Heuristic

The 15 Puzzle is a classical problem for modelling algorithms involving heuristics. Commonly used heuristics for this problem include counting the number of misplaced tiles and finding the sum of the Manhattan distances between each block and its position in the goal configuration. Note that both are admissible, i.e., they never overestimate the number of moves left, which ensures optimality for certain search algorithms such as A*.
What heuristic do you think is appropriate? A* seems to work nicely. Do you have an example, maybe in C or Java?
Heuristic
My heuristic of choice is to check whether the number of inversions in the permutation is odd or even - if it is even, then the 15-puzzle is solvable.
The number of inversions in a permutation is equal to that of its inverse permutation (Skiena 1990, p. 29; Knuth 1998).
Only if I know it can be solved does it make sense to try to solve it. The task then is to reduce inversions and - voilà - problem solved. Finding a solution should take no more than 80 moves.
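A quick sketch of that inversion count, assuming the board is given as a flat list of 16 values read row by row with 0 marking the blank (note that the full solvability test for the 4x4 board also takes the blank's row into account; this is just the counting part):

    def count_inversions(tiles):
        # Pairs (i, j) with i < j where the tile at i is larger than the tile at j;
        # the blank (0) is skipped.
        flat = [t for t in tiles if t != 0]
        return sum(1 for i in range(len(flat))
                     for j in range(i + 1, len(flat))
                     if flat[i] > flat[j])

    def inversions_even(tiles):
        return count_inversions(tiles) % 2 == 0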
Even more help
The number of permutations of n elements is n! (n factorial). Factorials in the range 1 to 16 are {1, 2, 6, 24, 120, 720, 5040, 40320, 362880, 3628800, 39916800, 479001600, 6227020800, 87178291200, 1307674368000, 20922789888000}. If you need more of them, search WolframAlpha for Range[1,20]!
If you want to learn more about it read: 15Puzzle.
Fifteen puzzle implementation in C++ using the A* algorithm: https://gist.github.com/sunloverz/7338003
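For a concrete picture of the Manhattan-distance heuristic mentioned in the question, here is a small sketch (in Python rather than C or Java, with the goal taken to be 1..15 in order and the blank last):

    def manhattan(tiles, width=4):
        # Sum of |row offset| + |column offset| between each tile and its goal cell;
        # `tiles` is the board read row by row, 0 marking the blank.
        total = 0
        for idx, t in enumerate(tiles):
            if t == 0:
                continue                    # the blank does not count
            goal = t - 1                    # index of tile t in the goal layout
            total += abs(idx // width - goal // width) + abs(idx % width - goal % width)
        return total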

Sort array in ascending order while minimizing "cost"

I'm taking comp 2210 (Data Structures) next semester and I've been doing the homework for the summer semester that is posted online. Until now, I've had no problems doing the assignments. Take a look at assignment 4 below, and see if you can give me a hint as to how to approach it. Please don't provide a complete algorithm, just an approach. Thanks!
A “costed sort” is an algorithm in which a sequence of values must be arranged in ascending order. The sort is carried out by interchanging the positions of two values, one at a time, until the sequence is in the proper order. Each interchange incurs a cost, which is calculated as the sum of the two values involved in the interchange. The total cost of the sort is the sum of the costs of the interchanges.
For example, suppose the starting sequence were {3, 2, 1}. One possible series of interchanges is
Interchange 1: {3, 1, 2} interchange cost = 3
Interchange 2: {1, 3, 2} interchange cost = 4
Interchange 3: {1, 2, 3} interchange cost = 5,
giving a total cost of 12.
You are to write a program that determines the minimal cost to arrange a specific sequence of numbers.
Edit: The professor does not allow brute forcing.
If you want to surprise your professor, you could use Simulated Annealing. Then again, if you manage that, you can probably skip a few courses :). Note that this algorithm will only give an approximate answer.
Otherwise: try a Backtracking algorithm, or Branch and Bound. These will both find the optimal answer.
What do you mean "brute forcing?" Do you mean "try all possible combinations and select the cheapest?" Just checking.
I think "branch and bound" is what you're looking for - check any source on algorithms. It is "like" brute force, except as you try a sequence of moves, as soon as that sequence of moves is less optimal than any other sequence of moves tried so far, you can abandon the sequence that got you to that point - the cost. This is one flavor of the "backtracking" mentioned above.
My preferred language for doing this would be Prolog but I'm weird.
Simulated Annealing is a PROBABILISTIC algorithm - if the solution space has local minima, then you may be trapped in one and get what you think is the right answer but isn't. There are ways around that, and the literature about it is easy to find, but I don't agree that it's the tool you want.
You could try the related genetic algorithms too if that's the way you want to go.
Have you learned trees? You could create a tree with all possible changes leading to the desired result. The trick, of course, is to avoid creating the whole tree -- particularly when a part of it is obviously not the best solution, right?
I think the appropriate approach is to think hard about what defining properties a minimal "cost" sort has. Then figure out the cost by simulating this ideal sort. The key element here is you don't have to implement a general minimal cost sorting algorithm.
For example, let's say the defining property of a minimal cost sort is that every exchange puts at least one of the exchanged elements in its sorted position (I don't know if this is true). Every exchange-based sort would love to have this property, but it's not easy (possible?) in the general case. However, you can easily create a program that takes an unsorted array, takes the sorted version (which itself can be generated by an unoptimized algorithm), and then uses this information to decide the minimum cost to achieve the sorted array from the unsorted array.
Description
I think the cheapest way to do this is to swap the cheapest misplaced item with the item that belongs in its spot. I believe this reduces cost by moving the most expensive things just once. If there are n elements out of place, then there will be at most n-1 swaps to put them in place, at a cost of (n-1) * (cost of the cheapest misplaced item) + (cost of all the other out-of-place items).
If the globally cheapest element is not misplaced, and the spread between it and the cheapest misplaced one is great enough, it can be cheaper to swap the globally cheapest one (currently in its correct place) with the cheapest misplaced one. The cost then is (n-1) * (globally cheapest) + (cost of all out-of-place items) + (cost of the cheapest out-of-place item).
Example
For [4,1,2,3], this algorithm exchanges (1,2) to produce:
[4,2,1,3]
and then swaps (3,1) to produce:
[4,2,3,1]
and then swaps (4,1) to produce:
[1,2,3,4]
Notice that each misplaced item in [2,3,4] is moved only once and is swapped with the lowest cost item.
Code
Ooops: "Please don't provide a complete algorithm, just an approach." Removed my code.
In an effort to only get you going, this may not make complete sense.
Determine all possible moves and the cost of each move, and store those somehow. Perform the least expensive move, then determine the moves that can be performed from this variation, storing those with the rest of your stored moves. Perform the least expensive of those, and so on, until the array is sorted.
I love solving things like this.
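For what it's worth, that approach amounts to a uniform-cost (Dijkstra-style) search over array states. A rough sketch of the bookkeeping, workable only for short arrays since the state space is all permutations:

    import heapq

    def min_sort_cost(seq):
        # Cheapest total cost to sort `seq`, where swapping values a and b costs a + b.
        start, goal = tuple(seq), tuple(sorted(seq))
        frontier = [(0, start)]
        best = {start: 0}
        while frontier:
            cost, state = heapq.heappop(frontier)
            if state == goal:
                return cost
            if cost > best[state]:
                continue                     # stale queue entry
            for i in range(len(state)):
                for j in range(i + 1, len(state)):
                    nxt = list(state)
                    nxt[i], nxt[j] = nxt[j], nxt[i]
                    nxt = tuple(nxt)
                    c = cost + state[i] + state[j]
                    if c < best.get(nxt, float("inf")):
                        best[nxt] = c
                        heapq.heappush(frontier, (c, nxt))

For the example above, min_sort_cost([3, 2, 1]) returns 4 (swap 3 and 1 directly), already cheaper than the three-swap sequence shown.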
This problem is also known as the Silly Sort problem in some ACM contests. Take a look at this solution using Divide & Conquer.
Try different sorting algorithms on the same input data and print the minimum.
