Algorithm help - hit test for small objects

When implementing a selection algorithm in a Processing sketch, I cycle through each object in the scene and check whether it is within a few pixels of where the mouse clicked. There are lots of objects and they are very small.
As you can imagine, once the scene fills with objects this becomes really burdensome. Are there easy ways to speed up this search? Can I easily make this search binary? The objects in my scene are points, so polygon hit-testing algorithms don't seem like the right solution.

Divide the scene into buckets, either into N x-buckets and M y-buckets, or into N*M x*y buckets. In the former case, the buckets are stored in two arrays (an x-array and a y-array); in the latter case, the buckets are stored in an array of arrays (the outer array indexes the x-coordinates, the inner arrays index the y-coordinates). In either case, the buckets store references to all of the points within the area indexed by the bucket; for example, the point (8, 12) would be in the x-bucket [5, 10] and the y-bucket [10, 15], or else it would be in the x*y bucket ([5, 10], [10, 15]).
When looking up a point, either look up the appropriate x and y buckets, or else simply look up the appropriate x*y buckets. In the former case, take intersection(union(x-buckets), union(y-buckets)). You may need to look up multiple buckets depending on the hit radius, for example if you're looking up the x-coordinate 9 with radius 2 then you'd need both the [5, 10] and [10, 15] buckets.
Using separate x and y buckets takes up less space (N + M buckets instead of N*M buckets) and makes the indexing easier (two separate arrays vs. one nested array), while the x*y buckets make for faster lookups since you won't need to take any set intersections.
The smaller your buckets, the more space the data structure will take up, but the fewer false positives you'll retrieve. Ideally, if you have sufficient memory, then the buckets will cover the same interval as the hit radius.
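For illustration, here is a minimal Java sketch of the x*y-bucket variant (the PointGrid class and its names are my own, not from the answer above; following the last paragraph, the cell size is simply set to the hit radius, so a query never has to look beyond the 3x3 neighbourhood of cells):

import java.util.ArrayList;
import java.util.List;

class PointGrid {
    final float cellSize;          // equal to the hit radius, per the note above
    final int cols, rows;
    final List<float[]>[][] cells; // each cell holds points as {x, y}

    @SuppressWarnings("unchecked")
    PointGrid(float width, float height, float hitRadius) {
        cellSize = hitRadius;
        cols = (int) Math.ceil(width / cellSize);
        rows = (int) Math.ceil(height / cellSize);
        cells = new List[cols][rows];
    }

    private int cx(float x) { return Math.min(cols - 1, (int) (x / cellSize)); }
    private int cy(float y) { return Math.min(rows - 1, (int) (y / cellSize)); }

    void insert(float x, float y) {
        if (cells[cx(x)][cy(y)] == null) cells[cx(x)][cy(y)] = new ArrayList<>();
        cells[cx(x)][cy(y)].add(new float[]{x, y});
    }

    // All points within hitRadius of (x, y). Only the surrounding cells can
    // contain such points, so the cost is independent of the scene size.
    List<float[]> query(float x, float y, float hitRadius) {
        List<float[]> hits = new ArrayList<>();
        for (int i = Math.max(0, cx(x) - 1); i <= Math.min(cols - 1, cx(x) + 1); i++)
            for (int j = Math.max(0, cy(y) - 1); j <= Math.min(rows - 1, cy(y) + 1); j++) {
                if (cells[i][j] == null) continue;
                for (float[] p : cells[i][j]) {
                    float dx = p[0] - x, dy = p[1] - y;
                    if (dx * dx + dy * dy <= hitRadius * hitRadius) hits.add(p);
                }
            }
        return hits;
    }
}

With cellSize equal to the hit radius, each click inspects at most 9 cells, so selection stays fast no matter how many points are in the scene.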

Maybe if you sort the points by one axis, let's say x, you can speed things up by returning early. I got this example from thomas.diewald at the Processing forum in this question; it may suit your need. Below is only the part that does the test (you can look at the complete code in the link above). There is an array of points, each of which has x and y fields. Take a look.
Note that he is using a label for the early return.
Arrays.sort(points);

__FIND_NEIGHBORS__:
for (int i = 0; i < num_points; i++) {
  Point pi = points[i];
  for (int j = i+1; j < num_points; j++) {
    Point pj = points[j];

    // 1. check in x
    float dx = pj.pos.x - pi.pos.x; // always positive -> points are sorted
    if (dx > max_dist) {
      continue __FIND_NEIGHBORS__; // ... no more points within max_dist.
    }

    // 2. check in y
    float dy = Math.abs(pj.pos.y - pi.pos.y); // not always positive
    if (dy > max_dist) {
      continue;
    }

    // 3. check, could also just draw the line here (Manhattan distance)
    if ((dx*dx + dy*dy) < max_dist_sq) {
      drawLine(pi, pj);
      connections++;
    }
  }
}


Algorithm for downsampling array of intervals

I have a sorted array of N intervals of different lengths. I am plotting these intervals with alternating colors blue/green.
I am trying to find a method or algorithm to "downsample" the array of intervals to produce a visually similar plot, but with fewer elements.
Ideally I could write some function to which I can pass the target number of output intervals as an argument. The output length only has to come close to the target.
input = [
  [0, 5, "blue"],
  [5, 6, "green"],
  [6, 10, "blue"],
  // ...etc
]
output = downsample(input, 25)
// [[0, 10, "blue"], ... ]
Below is a picture of what I am trying to accomplish. In this example the input has about 250 intervals, and the output about 25 intervals. The input length can vary a lot.
Update 1:
Below is my original post, which I initially deleted because there were issues with displaying the equations and I also wasn't very confident that it really made sense. But later I figured out that the optimisation problem I described can actually be solved efficiently with DP (dynamic programming).
So I did a sample C++ implementation. Here are some results:
Here is a live demo that you can play with in your browser (make sure your browser supports WebGL2, like Chrome or Firefox). It takes a bit to load the page.
Here is the C++ implementation: link
Update 2:
It turns out that the proposed solution has the following nice property: we can easily control the importance of the two parts F1 and F2 of the cost function. Simply change the cost function to F(α) = F1 + α·F2, where α >= 1.0 is a free parameter. The DP algorithm remains the same.
Here are some results for different α values, using the same number of intervals N:
Live demo (WebGL2 required)
As can be seen, higher α means it is more important to cover the original input intervals even if this means covering more of the background in-between.
Original post
Even though some good algorithms have already been proposed, I would like to propose a slightly unusual approach - interpreting the task as an optimisation problem. Although I don't know how to efficiently solve the optimisation problem (or even whether it can be solved in reasonable time at all), it might be useful to someone purely as a concept.
First, without loss of generality, let's declare the blue color to be the background. We will be painting N green intervals on top of it (N is the number provided to the downsample() function in the OP's description). The i-th interval is defined by its starting coordinate 0 <= xi < xmax and its width wi >= 0 (xmax is the maximum coordinate in the input).
Let's also define the array G(x) as the number of green cells in the interval [0, x) of the input data. This array can easily be pre-calculated. We will use it to quickly calculate the number of green cells in an arbitrary interval [x, y) - namely, G(y) - G(x).
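For concreteness, here is a tiny sketch of how G could be pre-calculated, assuming the strip has been rasterised into unit cells (the names are illustrative only):

static int[] prefixGreen(boolean[] green) {
    // G[x] = number of green cells in [0, x), built in one pass
    int[] G = new int[green.length + 1];
    for (int i = 0; i < green.length; i++)
        G[i + 1] = G[i] + (green[i] ? 1 : 0);
    return G;
}
// green cells in [x, y) are then G[y] - G[x], an O(1) query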
We can now introduce the first part of the cost function for our optimisation problem. In terms of G, the number of background cells covered by our intervals is

F1(x, w) = Σi [ wi − (G(xi + wi) − G(xi)) ]

The smaller F1 is, the better our generated intervals cover the input intervals, so we will be searching for xi, wi that minimise it. Ideally we want F1 = 0, which would mean that the intervals do not cover any of the background (which of course is not possible, because N is less than the number of input intervals).
However, this function is not enough to describe the problem, because obviously we can minimise it by taking empty intervals: F1(x, 0) = 0. Instead, we want to cover as much as possible of the input intervals. Let's introduce the second part of the cost function, which corresponds to this requirement - the number of green cells left uncovered:

F2(x, w) = G(xmax) − Σi [ G(xi + wi) − G(xi) ]

The smaller F2 is, the more of the input intervals is covered. Ideally we want F2 = 0, which would mean that we covered all of the input rectangles. However, minimising F2 competes with minimising F1.
Finally, we can state our optimisation problem: find xi, wi that minimise F = F1 + F2.
How to solve this problem? I'm not sure. Maybe use some metaheuristic approach for global optimisation, such as simulated annealing or differential evolution. These are typically easy to implement, especially for this simple cost function.
The best case would be for some kind of DP algorithm to exist that solves it efficiently, but that seems unlikely.
I would advise you to use a Haar wavelet. That is a very simple algorithm, often used to provide progressive loading of big images on websites.
Here you can see how it works with a 2D function. That is what you can use. Alas, the document is in Ukrainian, but the code is in C++, so it is readable :)
This document provides an example with a 3D object:
Pseudocode for compressing with the Haar wavelet can be found in Wavelets for Computer Graphics: A Primer Part 1.
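For reference, a minimal sketch of one level of the 1D Haar decomposition, the building block that such documents apply recursively (a generic illustration in Java, not code from the cited documents):

// One Haar step: each pair of samples becomes an average (the coarse
// signal) and a half-difference (the detail). Discarding small details
// and inverting the transform gives a coarser, visually similar signal.
// Assumes data.length is even.
static double[][] haarStep(double[] data) {
    int half = data.length / 2;
    double[] avg = new double[half], detail = new double[half];
    for (int i = 0; i < half; i++) {
        avg[i] = (data[2 * i] + data[2 * i + 1]) / 2;
        detail[i] = (data[2 * i] - data[2 * i + 1]) / 2;
    }
    return new double[][]{ avg, detail };
}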
You could do the following (a rough code sketch of the loop appears at the end of this answer):
1. Write out the points that divide the whole strip into intervals as the array [a[0], a[1], a[2], ..., a[n-1]]. In your example, the array would be [0, 5, 6, 10, ...].
2. Calculate the double-interval lengths a[2]-a[0], a[3]-a[1], a[4]-a[2], ..., a[n-1]-a[n-3] and find the least of them. Let it be a[k+2]-a[k]. If two or more lengths share the lowest value, choose one of them randomly. In your example, you would get the array [6, 5, ...] and search for the minimum value through it.
3. Swap the intervals (a[k], a[k+1]) and (a[k+1], a[k+2]). Basically, you need to assign a[k+1] = a[k] + a[k+2] - a[k+1] to keep the lengths, and then remove the points a[k] and a[k+2] from the array, because two pairs of intervals of the same color are now merged into two larger intervals. Thus, the numbers of blue and green intervals each decrease by one after this step.
4. If you're satisfied with the current number of intervals, end the process; otherwise go back to step 2.
You perform step 2 in order to decrease the "color shift", because at step 3 the left interval is moved a[k+2]-a[k+1] to the right and the right interval is moved a[k+1]-a[k] to the left. The sum of these distances, a[k+2]-a[k], can be considered a measure of the change you're introducing into the whole picture.
Main advantages of this approach:
It is simple.
It doesn't give preference to either of the two colors. You don't need to assign one of the colors to be the background and the other to be the painting color. The picture can be considered both as "green-on-blue" and "blue-on-green". This reflects the quite common use case where the two colors just describe two opposite states (like the bit 0/1, a "yes/no" answer) of some process extended in time or space.
It always keeps the balance between the colors, i.e. the total length of the intervals of each color remains the same during the reduction process. Thus the total brightness of the picture doesn't change, which is important because this total brightness can be considered an "indicator of completeness" in some cases.
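A minimal Java sketch of steps 2-4 (my own names; for simplicity it only considers interior k, so the two outer boundaries of the strip stay fixed):

import java.util.List;

// a holds the boundary points a[0..n-1]; colors alternate between them.
static void reduce(List<Integer> a, int targetBoundaryCount) {
    while (a.size() > Math.max(targetBoundaryCount, 4)) {
        // step 2: find interior k minimising the double-interval length a[k+2]-a[k]
        int best = 1;
        for (int k = 1; k + 2 <= a.size() - 2; k++)
            if (a.get(k + 2) - a.get(k) < a.get(best + 2) - a.get(best))
                best = k;
        // step 3: reflect the middle point, then drop a[k] and a[k+2];
        // the two pairs of same-colored neighbours merge
        a.set(best + 1, a.get(best) + a.get(best + 2) - a.get(best + 1));
        a.remove(best + 2); // remove the higher index first so 'best' stays valid
        a.remove(best);
    }
}

Each pass removes one interval of each color, so the loop runs until the boundary list is short enough.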
Here's another attempt at dynamic programming that's slightly different from Georgi Gerganov's, although the idea to formulate a dynamic program may have been inspired by his answer. Neither the implementation nor the concept is guaranteed to be sound, but I did include a code sketch with a visual example :)
The search space in this case is not reliant on the total unit width but rather on the number of intervals. It's O(N * n^2) time and O(N * n) space, where N and n are the target and given number of (green) intervals, respectively, because we assume that any newly chosen green interval must be bounded by two green intervals (rather than extend arbitrarily into the background).
The idea also utilises the prefix-sum trick used to calculate runs with a majority element: we add 1 when we see the target element (in this case green) and subtract 1 for others. (That algorithm is also amenable to multiple elements, with parallel prefix-sum tracking.) I'm not sure that restricting candidate intervals to sections with a majority of the target colour is always warranted, but it may be a useful heuristic depending on the desired outcome. It's also adjustable - we can easily adjust it to check for a fraction other than 1/2.
Where Georgi Gerganov's program seeks to minimise, this dynamic program seeks to maximise two ratios. Let h(i, k) represent the best sequence of green intervals up to the ith given interval, utilising k intervals, where each is allowed to stretch back to the left edge of some previous green interval. We speculate that

h(i, k) = max(r + C*r1 + h(i-l, k-1))

where, in the current candidate interval, r is the ratio of green to the length of the stretch, and r1 is the ratio of green to the total given green. r1 is multiplied by an adjustable constant C to give more weight to the volume of green covered. l is the length of the stretch.
JavaScript code (for debugging, it includes some extra variables and log lines):
function rnd(n, d=2){
  let m = Math.pow(10, d);
  return Math.round(m * n) / m;
}

function f(A, N, C){
  let ps = [[0, 0]];
  let psBG = [0];
  let totalG = 0;
  A.unshift([0, 0]);

  for (let i=1; i<A.length; i++){
    let [l, r, c] = A[i];

    if (c == 'g'){
      totalG += r - l;
      let prevI = ps[ps.length-1][1];
      let d = l - A[prevI][1];
      let prevS = ps[ps.length-1][0];
      ps.push(
        [prevS - d, i, 'l'],
        [prevS - d + r - l, i, 'r']
      );
      psBG[i] = psBG[i-1];
    } else {
      psBG[i] = psBG[i-1] + r - l;
    }
  }

  //console.log(JSON.stringify(A));
  //console.log('');
  //console.log(JSON.stringify(ps));
  //console.log('');
  //console.log(JSON.stringify(psBG));

  let m = new Array(N + 1);
  m[0] = new Array((ps.length >> 1) + 1);

  for (let i=0; i<m[0].length; i++)
    m[0][i] = [0, 0];

  // for each in N
  for (let i=1; i<=N; i++){
    m[i] = new Array((ps.length >> 1) + 1);

    for (let ii=0; ii<m[0].length; ii++)
      m[i][ii] = [0, 0];

    // for each interval
    for (let j=i; j<m[0].length; j++){
      m[i][j] = m[i][j-1];

      for (let k=j; k>i-1; k--){
        // our anchors are the right
        // side of each interval, k's are the left
        let jj = 2*j;
        let kk = 2*k - 1;

        // positive means green
        // is a majority
        if (ps[jj][0] - ps[kk][0] > 0){
          let bg = psBG[ps[jj][1]] - psBG[ps[kk][1]];
          let s = A[ps[jj][1]][1] - A[ps[kk][1]][0] - bg;
          let r = s / (bg + s);
          let r1 = C * s / totalG;
          let candidate = r + r1 + m[i-1][j-1][0];

          if (candidate > m[i][j][0]){
            m[i][j] = [
              candidate,
              ps[kk][1] + ',' + ps[jj][1],
              bg, s, r, r1, k, m[i-1][j-1][0]
            ];
          }
        }
      }
    }
  }

  /*
  for (row of m)
    console.log(JSON.stringify(
      row.map(l => l.map(x => typeof x != 'number' ? x : rnd(x)))));
  */

  let result = new Array(N);
  let j = m[0].length - 1;

  for (let i=N; i>0; i--){
    let [_, idxs, w, x, y, z, k] = m[i][j];
    let [l, r] = idxs.split(',');
    result[i-1] = [A[l][0], A[r][1], 'g'];
    j = k - 1;
  }

  return result;
}

function show(A, last){
  // compare right endpoints (the original compared a number against the
  // whole tuple, which always pushed a duplicate of the last interval)
  if (last[1] != A[A.length-1][1])
    A.push(last);

  let s = '';

  for (let i=A.length-1; i>=0; i--){
    let [l, r, c] = A[i];
    let cc = c == 'g' ? 'X' : '.';

    for (let j=r-1; j>=l; j--)
      s = cc + s;

    if (i > 0)
      for (let j=l-1; j>=A[i-1][1]; j--)
        s = '.' + s;
  }

  for (let j=A[0][0]-1; j>=0; j--)
    s = '.' + s;

  console.log(s);
  return s;
}

function g(A, N, C){
  const ts = f(A, N, C);
  //console.log(JSON.stringify(ts));
  show(A, A[A.length-1]);
  show(ts, A[A.length-1]);
}

var a = [
  [0, 5, 'b'],
  [5, 9, 'g'],
  [9, 10, 'b'],
  [10, 15, 'g'],
  [15, 40, 'b'],
  [40, 41, 'g'],
  [41, 43, 'b'],
  [43, 44, 'g'],
  [44, 45, 'b'],
  [45, 46, 'g'],
  [46, 55, 'b'],
  [55, 65, 'g'],
  [65, 100, 'b']
];

// (input, N, C)
g(a, 2, 2);
console.log('');
g(a, 3, 2);
console.log('');
g(a, 4, 2);
console.log('');
g(a, 4, 5);
I would suggest using K-means; it is an algorithm used to group data (a more detailed explanation here: https://en.wikipedia.org/wiki/K-means_clustering and here: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html).
This is a brief sketch of how the function could look; I hope it is helpful.
from sklearn.cluster import KMeans
import numpy as np

def downsample(input, cluster=25):
    # you will need to group your intervals in a numpy array as shown below;
    # for the sake of example I will take just a random array
    # (note: with this toy X of 6 points, cluster must be <= 6)
    X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
    # n_clusters will be the same as the desired output length
    kmeans = KMeans(n_clusters=cluster, random_state=0).fit(X)
    # then you can iterate through the labels that were assigned to every
    # entry of your input, in our case the intervals
    kmeans_list = [[] for _ in range(cluster)]  # list of empty lists ([None]*cluster would break append)
    for i in range(0, X.shape[0]):
        kmeans_list[kmeans.labels_[i]].append(X[i])
    # after that you will basically have a list of lists, and every inner
    # list will contain all the points that correspond to a specific label
    ret = []  # return list
    for label_list in kmeans_list:
        left = 10001000  # a big enough number to exceed anything in the input
        right = -left    # same here
        for entry in label_list:
            left = min(left, entry[0])
            right = max(right, entry[1])
        ret.append([left, right])
    return ret

O(n) algorithm for two identical points

The Problem Statement
Given n points in a 2D plane, each with an x and a y coordinate. Two points are identical if one can be obtained from the other by multiplying both coordinates by the same number. Example: (10,15) and (2,3) are identical, whereas (10,15) and (10,20) are not. Suggest an O(n) algorithm which determines whether the n input points contain two identical points or not.
The simple approach is just checking every pair of points: if there are 5 points, the first one needs 4 comparisons, the second one 3 comparisons, and so on. But that isn't an O(n) time complexity solution. I really can't think past that. Any suggestions?
One obvious (but possibly inadequate) possibility would be to reduce each point to a floating-point number representing the ratio, so (2,3) and (10,15) both become 0.66667, and (10,20) becomes 0.5.
The reason this wouldn't work is that floating-point numbers are approximate, so you'd just about need to use an approximate comparison, and put up with the fact that it would report points as identical as long as they were equal to (say) 15 decimal places.
If you don't want that, you could create a rational number class that supported comparison (e.g., reduced each ratio to lowest terms).
Either way, once you've reduced a point to a single number, you just insert each into (for one possibility) a hash table. As you insert each you check whether that ratio is already in the hash table--if it is, you have an identical point. If not, insert it normally.
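For illustration, a minimal Java sketch of the rational-number route (my own names; it assumes nonzero integer coordinates, with zeroes handled as special cases):

import java.util.HashSet;
import java.util.Set;

// Normalise each point to its ratio in lowest terms with a canonical
// sign, pack the reduced pair into a long, and hash it. Expected O(n).
static boolean hasIdenticalPair(int[][] points) {
    Set<Long> seen = new HashSet<>();
    for (int[] p : points) {
        int g = gcd(Math.abs(p[0]), Math.abs(p[1]));
        int x = p[0] / g, y = p[1] / g;
        if (y < 0) { x = -x; y = -y; }   // fix the sign so (2,-3) matches (-2,3)
        long key = ((long) x << 32) ^ (y & 0xffffffffL);
        if (!seen.add(key)) return true; // this reduced ratio was seen before
    }
    return false;
}

static int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }

Here (10, 15) and (2, 3) both reduce to (2, 3), so the second of them collides in the set and is reported.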
One way to reduce a point to a single number is to multiply the first co-ordinate of the point by product of all the second co-ordinates of the other points.
So for e.g:
(10, 20) -> 10 * 10 * 4 = 400
(5, 10) -> 5 * 20 * 4 = 400
(3, 4) -> 3 * 20 * 10 = 600
The first and second points match. For large sets of points the products would be very large and would require a BigNumber (which would be more than O(n)), but you can keep the numbers within a reasonable limit by taking a modulo after each multiplication. Then use a hash table as suggested in Jerry Coffin's answer.
You can easily compute the product of all the second co-ordinates by doing a single forward pass then a single backwards pass over the array and keeping running products:
e.g. in Java:
long m = 9223372036854775783L; // largest prime less than max long
int[][] points = {{1, 2}, {1, 3}, {1, 4}, {1, 5}, {2, 6}};
long[] mods = new long[points.length];
long prod = 1;
for(int i = 0; i < points.length; i++)
{
mods[i] = prod;
prod = (points[i][1] * prod) % m;
}
prod = 1;
for(int i = points.length - 1; i >= 0 ; i--)
{
mods[i] = (mods[i] * prod) % m;
prod = (points[i][1] * prod) % m;
}
HashSet<Long> set = new HashSet<Long>();
for(int i = 0; i < points.length; i++)
{
prod = (mods[i] * points[i][0]) % m;
if(set.contains(prod))
System.out.println("Found a match");
set.add(prod);
}
This algorithm assumes all the co-ordinates are integers != 0. Zeroes can be handled as special cases: all points with a zero first co-ordinate match each other, likewise for those with a zero second co-ordinate, and (0, 0) matches every point. As an optimization, the second and third passes through the array could be merged into a single pass.

Algorithm to find matching real values in a list

I have a complex algorithm which calculates the result of a function f(x). In the real world f(x) is a continuous function. However, due to rounding errors in the algorithm, this is not the case in the computer program. The following diagram gives an example:
Furthermore, I have a list of several thousand values Fi.
I am looking for all the x values which meet an Fi value, i.e. f(xi) = Fi.
I can solve this problem by simply iterating through the x values, as in the following pseudocode:
for i=0 to NumberOfChecks-1 do
begin
  //calculate the function result with the algorithm
  x=i*(xmax-xmin)/NumberOfChecks;
  FunctionResult=CalculateFunctionResultWithAlgorithm(x);

  //loop through the value list to see if the function result matches a value in the list
  for j=0 to NumberOfValuesInTheList-1 do
  begin
    if Abs(FunctionResult-ListValues[j])<Epsilon then
    begin
      //mark that element j of the list matches
      //and store the corresponding x value in the list
    end
  end
end
Of course it is necessary to use a high number of checks; otherwise I will miss some x values. The higher the number of checks, the more complete and accurate the result. It is acceptable for the list to be 90% or 95% complete.
The problem is that this brute-force approach takes too much time. As I mentioned before, the algorithm for f(x) is quite complex, and with a high number of checks it takes too much time.
What would be a better solution for this problem?
Another way to do this is in three steps: generate all of the results, sort them, and then merge with the sorted list of existing values.
The first step is to compute all of the results and save them along with the x value that generated them. That is:
results = list of <x, result>
for i = 0 to numberOfChecks
  //calculate the function result with the algorithm
  x=i*(xmax-xmin)/NumberOfChecks;
  FunctionResult=CalculateFunctionResultWithAlgorithm(x);
  results.Add(x, FunctionResult)
end for
Now, sort the results list by FunctionResult, and also sort the ListValues array.
You now have two sorted lists that you can move through linearly:
i = 0, j = 0;
while (i < results.length && j < ListValues.length)
{
    diff = ListValues[j] - results[i];
    if (Abs(diff) < Epsilon)
    {
        // mark this one with the x value
        // and move to the next result
        i = i + 1
    }
    else if (diff > 0)
    {
        // list value is much larger than result. Move to next result.
        i = i + 1
    }
    else
    {
        // list value is much smaller than result. Move to next list value.
        j = j + 1
    }
}
Sort the list, producing an array SortedListValues that contains the sorted ListValues and an array SortedListValueIndices that contains the index in the original array of each entry in SortedListValues. You only actually need the second of these, and you can create both of them with a single sort by sorting an array of tuples of (value, index) using value as the sort key.
Then iterate over your range 0..NumberOfChecks-1, compute the value of the function at each step, and use a binary chop to search for it in the sorted list.
Pseudo-code:
// sort as described above
SortedListValueIndices = sortIndices(ListValues);

for i=0 to NumberOfChecks-1 do
begin
  //calculate the function result with the algorithm
  x=i*(xmax-xmin)/NumberOfChecks;
  FunctionResult=CalculateFunctionResultWithAlgorithm(x);

  // do a binary chop to find the closest element in the list
  highIndex = NumberOfValuesInTheList-1;
  lowIndex = 0;
  while true do
  begin
    if Abs(FunctionResult-ListValues[SortedListValueIndices[lowIndex]])<Epsilon then
    begin
      // find all elements in the range that match, breaking out
      // of the loop as soon as one doesn't
      for j=lowIndex to NumberOfValuesInTheList-1 do
      begin
        if Abs(FunctionResult-ListValues[SortedListValueIndices[j]])>=Epsilon then
          break
        //mark that element SortedListValueIndices[j] of the list matches
        //and store the corresponding x value in the list
      end
      // break out of the binary chop loop
      break
    end

    // break out of the loop once the indices match
    if highIndex <= lowIndex then
      break

    // do the binary chop searching, adjusting the indices:
    middleIndex = (lowIndex + 1 + highIndex) / 2;
    if ListValues[SortedListValueIndices[middleIndex]] < FunctionResult then
      lowIndex = middleIndex;
    else
    begin
      highIndex = middleIndex;
      lowIndex = lowIndex + 1;
    end
  end
end
Possible complications:
The binary chop isn't taking the epsilon into account. Depending on your data this may or may not be an issue. If it is acceptable that the list is only 90 or 95% complete, this might be OK; if not, you'll need to widen the range to take it into account.
I've assumed you want to be able to match multiple x values for each FunctionResult. If that's not necessary, you can simplify the code.
Naturally this depends very much on the data, and especially on the numeric distribution of Fi. Another problem is that f(x) looks very jumpy, which rules out any "nearby values behave similarly" assumption.
But one could optimise the search.
Picture below.
Walking through f(x) at sufficient granularity, define a rough min (red line) and max (green line), using a suitable tolerance (the "air" or "gap" in between). The area between min and max is "AREA".
See where each Fi value hits AREA, and do a stacked marking ("MARKING") on the X-axis accordingly (this can cover multiple segments of X).
Where many MARKINGs pile on top of each other (a higher sum - the vertical black "sum" arrows), do dense hit tests, increasing the overall chance of getting as many hits as possible. Elsewhere do sparser tests.
Tighten this scheme (decrease the tolerance) as much as you dare.
EDIT: Fi is a bit confusing. Is it an ordered array, or does it have random order (as I assumed)?
Jim Mischel's solution would run in O(i+j) instead of the O(i*j) of the solution you currently have. But there is a (very) minor bug in his code. The correct code would be:
diff = ListValues[j] - results[i]; // no abs() here
if (abs(diff) < Epsilon)           // add abs() here
{
    // mark this one with the x value
    // and move to the next result
    i = i + 1
}
The best method will rely on the nature of your function f(x). The best solution would be if you can construct the inverse of f(x) and use it directly. Failing that:
1. As you said, f(x) is continuous, so you can start by evaluating a small number of widely spaced points, find ranges that make sense, and then refine your "assumption" for the x with f(x)=Fi. It is not bullet-proof, but it is an option. E.g. for Fi=5.7: f(1)=1.4, f(4)=4, f(16)=12.6, f(10)=10.1, f(7)=6.5, f(5)=5.1, f(6)=5.8, so you can take 5 < x < 7. (A rough sketch of this refinement follows below.)
2. Along the same lines as #1, and if f(x) is hard to calculate, you can use interpolation and then evaluate f(x) only at the values that are probable.
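For illustration, a rough Java sketch of the refinement in #1, under the strong assumption that f is monotone increasing on the bracket found by the coarse scan (the names and signature are mine):

// Bisect a bracket [lo, hi] with f(lo) <= Fi <= f(hi) down to width eps.
static double refine(java.util.function.DoubleUnaryOperator f,
                     double lo, double hi, double Fi, double eps) {
    while (hi - lo > eps) {
        double mid = 0.5 * (lo + hi);
        if (f.applyAsDouble(mid) < Fi) lo = mid; else hi = mid;
    }
    return 0.5 * (lo + hi);
}

With the sample values above, f(5)=5.1 and f(7)=6.5 bracket Fi=5.7, so refine(f, 5, 7, 5.7, 1e-6) would home in on the matching x with only a handful of extra evaluations of f.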

Get X random points in a fixed grid without repetition

I'm looking for a way of getting X points in a fixed-size grid of, let's say, M by N, where the points are not returned multiple times, all points have a similar chance of being chosen, and the number of points returned is always X.
I had the idea of looping over all the grid points and giving each point a random chance of X/(N*M) of being picked, yet I felt that this would give more priority to the first points in the grid. It also didn't meet the requirement of always returning X points.
I could also use increments with a prime number to get a kind of shuffle-without-repeat functionality, but I'd rather have it behave more randomly than that.
Essentially, you need to keep track of the points you already chose, and make use of a random number generator to get a pseudo-uniformly distributed answer. Each "choice" should be independent of the previous one.
With your first idea, you're right, the first ones would have more chance of getting picked. Consider a one-dimensional array with two elements. With the strategy you mention, the chance of getting the first one is:
P[x=0] = 1/2 = 0.5
The chance of getting the second one is the chance of NOT getting the first one (0.5), times 1/2:
P[x=1] = 1/2 * 1/2 = 0.25
You don't mention which programming language you're using, so I'll assume you have at your disposal a random number generator rand() which returns a random float in the range [0, 1), a HashMap (or similar) data structure, and a Point data structure. I'll further assume that a point in the grid can be any floating-point x,y where 0 <= x < M and 0 <= y < N. (If this is an NxM array, the same applies, but with integers, and up to (M-1, N-1).)
HashMap points = new HashMap();
Point p;
while (points.size() < X) {
    p = new Point(rand()*M, rand()*N);
    if (!points.containsKey(p)) {
        points.put(p, 1);
    }
}
Note: Two Point objects of equal x and y should be themselves considered equal and generate equal hash codes, etc.

Given a large database of over 50,000 points, how can I quickly search for desired points?

I have a database of over 50,000 points. Each point has 3 dimensions. Let's label them [i,j,k].
I wish to find the points for which no other point beats them in some dimension while being at least as good in all the others.
For example, take Object A [10 10 3], Object B [1 1 4], Object C [1 1 1] and Object D [1 1 10].
Then the desired output would be A and D (C is worse than all of them, and B beats A in dimension [k], but D beats B in dimension [k]).
I've tried some basic comparison algorithms (i.e. if-else statements), which do work when I cut down the database size. But with 50,000 points it takes more than 10 minutes to find the desired output, which of course is not a good solution.
Could somebody recommend a method or two to do this the fastest possible way?
Thanks
EDIT:
Thanks, I think I've got it.
You can do many optimizations to your code:
{
    vector<bool> isinterst(n, true);

    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            if (isinterst[i]) {
                bool worseelsewhere = false;
                for (int k = 0; k < d; k++)
                {
                    if (point[i][k] < point[j][k])
                    {
                        worseelsewhere = true;
                        break; //you can exit the loop once worseelsewhere is set to true
                    }
                }

                if (worseelsewhere == false)
                {
                    continue; //skip the rest if worseelsewhere is false
                }

                bool worse = true;
                for (int k = 0; k < d; k++)
                {
                    if (point[i][k] > point[j][k])
                    {
                        worse = false;
                        break; //you can exit the loop once worse is set to false
                    }
                }

                if (worseelsewhere && worse) {
                    isinterst[i] = false;
                    //cout << i << " Not desirable " << endl;
                }
            }
        }
    }
}
You're looking for pareto-optimal points; together they form the "pareto front", a staircase-like frontier that's easiest to see in 2 dimensions. Use an iterative algorithm to determine the pareto-optimal points of the first N points. For N=1, that's just the first point. For N=2, the next point is either dominated by the first (discard the 2nd), dominates the 1st (discard the 1st), or lies above and to the left or below and to the right (and so is also pareto-optimal).
You can speed up classification by keeping a simplified upper and lower bound for the convex hull, e.g. just single points {minX, minY, minZ} and {maxX, maxY, maxZ}. If P={x,y,z} is dominated by {minX, minY, minZ} then it is dominated by all pareto-optimal points so far and can be discarded. If P dominates {maxX, maxY, maxZ}, it also dominates all points that were pareto-optimal so far and you can discard all those.
A quick O(N log N) initial step is to first sort the collection in X order to find the point with max X, then in Y order to find the point with max Y, and finally in Z order. Finding the pareto-optimal points in this subset of up to 3 points is easy and can be hardcoded. You can then use this set as a first approximation.
A more refined solution is to then sort by X+Y, X+Z, Y+Z and X+Y+Z and find those maxima as well. Again, this produces points which are good initial candidates, because they will dominate many other points.
E.g. in your case, sorting by X and sorting by Y would both produce point A, and sorting by Z would produce point D; neither dominates the other, and you can then quickly discard B and C.
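For illustration, a minimal Java sketch of the iterative scheme (my own names; the single-point min/max bound shortcut described above is omitted for brevity):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Keep the non-dominated points seen so far; each new point is either
// discarded (dominated) or added, evicting any points it dominates.
static List<int[]> paretoFront(List<int[]> points) {
    List<int[]> front = new ArrayList<>();
    for (int[] p : points) {
        boolean dominated = false;
        Iterator<int[]> it = front.iterator();
        while (it.hasNext()) {
            int[] q = it.next();
            if (dominates(q, p)) { dominated = true; break; }
            if (dominates(p, q)) it.remove();
        }
        if (!dominated) front.add(p);
    }
    return front;
}

// a dominates b when a is >= b in every dimension and > in at least one
static boolean dominates(int[] a, int[] b) {
    boolean strict = false;
    for (int k = 0; k < a.length; k++) {
        if (a[k] < b[k]) return false;
        if (a[k] > b[k]) strict = true;
    }
    return strict;
}

On the example above this keeps exactly A and D. The worst case is still O(n^2) comparisons, but the bound pruning and the good initial candidates described above help discard most points early in practice.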
Without knowing your definition of "better" it's a bit hard to make concrete suggestions here. I note, however, that you appear to be working with spatial data. A data structure that is often used when working with spatial data is the R-tree (http://en.wikipedia.org/wiki/R-tree). It provides an efficient index for multidimensional information.
Perhaps the boost::geometry library has some tools that will assist: http://www.boost.org/doc/libs/1_53_0/libs/geometry/doc/html/geometry/introduction.html
