Best data structure for closest range

Suppose I have a set of ranges that an item can fall in, and a given item. Which data structure is fastest at finding the range closest to the item? I also care about space complexity, so take that into consideration as well.
A one-dimensional example of the use case is frequency: we have a set of frequency ranges and a single frequency, and we want to find the frequency range closest to it.
A two-dimensional example is a game: we have a ball and objects on a plane, we divide the objects into groups, and we want to find which group is closest to the ball.

Related

Efficient way to filter a number of things

There is a list of venues. Each venue has a given price attached to it, and a lat/lon. The user enters a max distance and a max price, and the app returns a list of venues that fit both criteria. The distance needs to be calculated at query time, but I can build some sort of structure ahead of time using the prices or the given lat/lons. I already know how to do this in O(n): I traverse the list of restaurants, adding each one to the result if it fits the criteria.
Is there a way to do this more efficiently? I'm thinking of building a BST keyed on price (which can be computed before runtime), cutting off the section of the BST that's over the price limit, and then iterating through everything that remains, but that's still O(n), right?
One way to think about this problem is to treat it as a multidimensional range search problem. Each venue can be thought of as a point in a three-dimensional space given by its longitude, latitude, and price. If you want to find all venues within a certain radius of a given point whose price is at most some amount, then you're searching for all points in a cylinder with the given center and radius whose upper and lower bounds are the max price and 0, respectively.
You might want to consider using a multidimensional search tree structure like a k-d tree or an R-tree to solve this problem. Once you have this structure, search it for the bounding box of the cylinder to get back a list of candidate points. Then test each candidate to see whether it's inside the cylinder. Assuming the points are somewhat evenly distributed, roughly a π / 4 fraction of them will be in the cylinder, so you won't waste too much effort.
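Here is a minimal sketch of that candidate-then-filter idea, assuming planar (x, y) coordinates and a hypothetical `venues` list of (x, y, price) tuples. It uses SciPy's cKDTree with a radius query in place of an explicit bounding-box search, so the circle test is folded into the tree lookup and the price filter completes the cylinder check. For real lat/lon data you would project to a planar coordinate system first, since the tree assumes Euclidean distance.

```python
import numpy as np
from scipy.spatial import cKDTree

# hypothetical data: (x, y, price) per venue
venues = [(0.0, 0.0, 25.0), (3.0, 4.0, 10.0), (50.0, 50.0, 5.0)]
tree = cKDTree(np.array([(x, y) for x, y, _ in venues]))

def query(center, max_dist, max_price):
    # radius search gives the spatial candidates; the price test
    # finishes the cylinder check
    idx = tree.query_ball_point(center, max_dist)
    return [venues[i] for i in idx if venues[i][2] <= max_price]

print(query((0.0, 0.0), 6.0, 20.0))  # -> [(3.0, 4.0, 10.0)]
```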

Partition of graph to locate a point

I have an area that has already been partitioned into tens of sub-areas (think of a country divided into states).
Now, given a point coordinate, what is the best algorithm to tell me which state the point is in?
Of course I can test sub-area by sub-area, but that's wasteful because on average I'd have to search through half of them, right?
Is there an algorithm for grouping several adjacent sub-areas together to facilitate the search, so as to reduce the number of tests?
I would start by eliminating all areas that cannot have the point inside them.
Let's assume that you have a 2D Cartesian coordinate system, the point is given as a 2D vector, and each area is described by a collection of its boundary points.
Then you can sort the areas by their smallest and largest x and y coordinates (four orderings in total). You can eliminate every area whose smallest x coordinate is bigger than the x coordinate of your point, and so on for the other three orderings.
After that, you can check the remaining polygons with a simple ray-casting algorithm and you should be good.
This is very efficient if you have a structure that keeps the areas sorted in all four directions, because then you can eliminate areas in logarithmic time.
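A minimal sketch of the two steps, assuming each area is a list of (x, y) boundary points; the bounding-box prefilter stands in for the sorted-structure elimination, and the ray cast is the standard even-odd test:

```python
def bbox_may_contain(point, polygon):
    # cheap elimination: reject any area whose bounding box excludes the point
    x, y = point
    xs = [px for px, _ in polygon]
    ys = [py for _, py in polygon]
    return min(xs) <= x <= max(xs) and min(ys) <= y <= max(ys)

def point_in_polygon(point, polygon):
    # even-odd ray casting: shoot a ray to the right and count how many
    # edges it crosses; an odd count means the point is inside
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the ray's y level
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def locate(point, areas):
    for name, polygon in areas.items():
        if bbox_may_contain(point, polygon) and point_in_polygon(point, polygon):
            return name
    return None

areas = {"square": [(0, 0), (4, 0), (4, 4), (0, 4)]}
print(locate((1, 1), areas))  # -> "square"
```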

Alignment algorithm dynamic programming with gap

I wrote a dynamic programming alignment algorithm. I wish to align two lists of peaks that have been extracted from two different signals. Each list of peaks is a dataset with two columns, i.e. two features: the time of the peak and the area of the peak. Since the peaks come from two different signals, the lists contain no exact matches. However, the two lists have some peaks in common (about two thirds), that is to say, peaks that are close in both time and area.
In my first DP algorithm, I rely on a distance calculation that takes both time and area into account. I iterate over the peaks in the shorter list and calculate their distances to some of the peaks in the other dataset. I fill a score matrix with these distances and then go backward to recover the optimal path (the one with minimum distance). This works perfectly IF I WANT TO ASSIGN ALL PEAKS IN THE SHORTER LIST TO PEAKS IN THE LARGER LIST. However, it does not work if gaps are allowed, that is to say, if some elements in the shorter dataset have no match in the larger dataset.
Which refinement of DP could handle this type of problem? What other algorithms are at hand for it?
Thanks!
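For illustration, here is a minimal sketch of the classic refinement (Needleman-Wunsch-style gap penalties): a gap move is added to the recurrence with a fixed cost, so a peak may stay unmatched whenever that is cheaper than any available pairing. The distance function and gap cost below are hypothetical placeholders to be tuned against typical inter-peak distances.

```python
def align_with_gaps(a, b, dist, gap_penalty):
    # a, b: lists of peaks; dist(p, q): distance between two peaks
    n, m = len(a), len(b)
    # D[i][j] = minimum cost of aligning a[:i] with b[:j]
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap_penalty
    for j in range(1, m + 1):
        D[0][j] = j * gap_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i-1][j-1] + dist(a[i-1], b[j-1]),  # match a[i-1] with b[j-1]
                D[i-1][j] + gap_penalty,             # leave a[i-1] unmatched
                D[i][j-1] + gap_penalty,             # leave b[j-1] unmatched
            )
    # backtrack to recover the matched index pairs
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if D[i][j] == D[i-1][j-1] + dist(a[i-1], b[j-1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif D[i][j] == D[i-1][j] + gap_penalty:
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))

a = [(1.0, 10.0), (2.0, 5.0), (9.9, 7.0)]   # (time, area) peaks
b = [(1.1, 10.5), (5.0, 20.0), (10.0, 7.2)]
dist = lambda p, q: abs(p[0] - q[0]) + 0.1 * abs(p[1] - q[1])
print(align_with_gaps(a, b, dist, gap_penalty=1.0))  # [(0, 0), (2, 2)]; the middle peaks stay unmatched
```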

Querying a collection of rectangles for the overlap of an input rectangle

In a multi-dimensional space, I have a collection of rectangles, all of which are aligned to the grid. (I am using the word "rectangles" loosely - in a three dimensional space, they would be rectangular prisms.)
I want to query this collection for all rectangles that overlap an input rectangle.
What is the best data structure for holding the collection of rectangles? I will be adding rectangles to and removing rectangles from the collection from time to time, but these operations will be infrequent. The operation I want to be fast is the query.
One solution is to keep the corners of the rectangles in a list, and do a linear scan over the list, finding which rectangles overlap the query rectangle and skipping over the ones that don't.
However, I want the query operation to be faster than linear.
I've looked at the R-tree data structure, but it holds a collection of points, not a collection of rectangles, and I don't see any obvious way to generalize it.
The coordinates of my rectangles are discrete, in case you find that helpful.
I am interested in the general solution, but I will also tell you the properties of my specific problem: my problem space has three dimensions, and the number of distinct values per dimension varies wildly. The first dimension has two possible values, the second has 87, and the third has 1.8 million.
You can probably use k-d trees, which according to the wiki page can also hold rectangles:
Instead of points, a k-d tree can also contain rectangles or hyperrectangles. A 2D rectangle is considered a 4D object (xlow, xhigh, ylow, yhigh). Thus range search becomes the problem of returning all rectangles intersecting the search rectangle. The tree is constructed the usual way with all the rectangles at the leaves. In an orthogonal range search, the opposite coordinate is used when comparing against the median. For example, if the current level is split along xhigh, we check the xlow coordinate of the search rectangle. If the median is less than the xlow coordinate of the search rectangle, then no rectangle in the left branch can ever intersect with the search rectangle, and so that branch can be pruned. Otherwise both branches should be traversed. See also interval tree, which is a 1-dimensional special case.
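A minimal sketch of that scheme in 2D, assuming rectangles are given as (xlow, xhigh, ylow, yhigh) tuples; it stores one rectangle per node rather than all of them at the leaves, which keeps the example short without changing the pruning rule:

```python
OPPOSITE = {0: 1, 1: 0, 2: 3, 3: 2}  # xlow<->xhigh, ylow<->yhigh

def build(rects, depth=0):
    # treat each rectangle as a 4D point and split on coordinates in rotation
    if not rects:
        return None
    axis = depth % 4
    rects = sorted(rects, key=lambda r: r[axis])
    mid = len(rects) // 2
    return {"rect": rects[mid], "axis": axis,
            "left": build(rects[:mid], depth + 1),
            "right": build(rects[mid + 1:], depth + 1)}

def intersects(a, b):
    # closed rectangles overlap iff they overlap in both x and y
    return a[0] <= b[1] and b[0] <= a[1] and a[2] <= b[3] and b[2] <= a[3]

def search(node, q, out):
    if node is None:
        return
    if intersects(node["rect"], q):
        out.append(node["rect"])
    axis, median = node["axis"], node["rect"][node["axis"]]
    if axis in (1, 3):
        # split on a "high" coordinate: the left branch only holds rectangles
        # ending at or before the median, so compare against q's "low" side
        if median >= q[OPPOSITE[axis]]:
            search(node["left"], q, out)
        search(node["right"], q, out)
    else:
        # split on a "low" coordinate: prune the right branch symmetrically
        search(node["left"], q, out)
        if median <= q[OPPOSITE[axis]]:
            search(node["right"], q, out)

tree = build([(0, 2, 0, 2), (5, 7, 5, 7), (1, 6, 1, 6)])
out = []
search(tree, (6, 8, 6, 8), out)
print(out)  # [(1, 6, 1, 6), (5, 7, 5, 7)]
```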
Let's call the original problem PN, where N is the number of dimensions.
Suppose we know the solution of P1, the 1-dimensional problem: find whether a new interval overlaps any interval in a given collection.
Once we know how to solve it, we can check whether the new rectangle overlaps the collection of rectangles in each of the x/y/z projections.
So the solution of P3 is equivalent to P1_x AND P1_y AND P1_z.
To solve P1 efficiently we can use a sorted list. Each node of the list holds a coordinate and the number of intervals open up to that coordinate.
Suppose we have the following intervals:
[1,5]
[2,9]
[3,7]
[0,2]
then the list will look as follows:
{0,1} , {1,2} , {2,2}, {3,3}, {5,2}, {7,1}, {9,0}
If we receive a new interval, say [6,7], we find the largest item in the list that is smaller than 6, namely {5,2}, and the smallest item that is greater than 7, namely {9,0}. The count 2 at coordinate 5 tells us two intervals are still open where the new interval starts, so the new interval overlaps the existing ones.
And the search in the sorted list is faster than linear :)
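A minimal sketch of that sorted list with binary search, assuming closed intervals; Python's bisect module supplies the predecessor/successor lookups:

```python
import bisect

def build(intervals):
    # boundary events: +1 when an interval opens, -1 when it closes
    events = {}
    for lo, hi in intervals:
        events[lo] = events.get(lo, 0) + 1
        events[hi] = events.get(hi, 0) - 1
    coords, open_count = [], []
    running = 0
    for c in sorted(events):
        running += events[c]
        coords.append(c)             # boundary coordinate
        open_count.append(running)   # intervals still open just after c
    return coords, open_count

def overlaps(coords, open_count, lo, hi):
    # predecessor check: if intervals are still open just before lo,
    # the new interval starts inside one of them
    i = bisect.bisect_left(coords, lo) - 1  # largest coordinate < lo
    if i >= 0 and open_count[i] > 0:
        return True
    # otherwise any boundary falling inside [lo, hi] is an overlap
    return bisect.bisect_right(coords, hi) > bisect.bisect_left(coords, lo)

coords, open_count = build([(1, 5), (2, 9), (3, 7), (0, 2)])
print(overlaps(coords, open_count, 6, 7))    # True, as in the example above
print(overlaps(coords, open_count, 10, 12))  # False
```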
You have to use some sort of partitioning technique. However, because your problem is constrained (you use only rectangles), the data structure can be simplified a little. I haven't thought this through in detail, but something like this should work ;)
Using the discrete-value constraint, you can create a secondary table-like data structure keyed by the discrete values of the second dimension (the 87 possible values). Think of these values as planes perpendicular to that dimension. For each plane, store in this table the rectangles that intersect it.
Similarly, for the third dimension you can use another table with as many equally spaced values as you need (1.8 million is too many, so you would probably want to make this at least a couple of orders of magnitude smaller), mapping each bucket between two chosen values to the rectangles that fall in it.
Given a query rectangle, you can query the first table in constant time per value to get a set of rectangles that possibly intersect the query. Then you can query the second table and intersect the two candidate sets. This should narrow down the number of actual intersection tests you have to perform.
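A rough sketch of one such table, assuming each rectangle carries an id and a (lo, hi) extent along the 87-value dimension; the other dimensions would get analogous tables, and the final candidates are the intersection of the per-dimension sets:

```python
from collections import defaultdict

# plane value -> ids of rectangles whose extent touches that plane
table = defaultdict(set)

def index_rectangle(rect_id, lo, hi):
    for v in range(lo, hi + 1):
        table[v].add(rect_id)

def candidates(lo, hi):
    ids = set()
    for v in range(lo, hi + 1):
        ids |= table[v]
    return ids  # intersect with the candidate sets from the other dimensions

index_rectangle(1, 3, 10)
index_rectangle(2, 40, 50)
print(candidates(8, 12))  # {1}
```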

How do I group objects in a set by proximity?

I have a set containing thousands of addresses. If I can get the longitude and latitude of each address, how do I split the set into groups by proximity?
Further, I may want to retry the 'clustering' according to different rules:
N groups
M addresses per group
maximum distance between any address in a group
You could try the k-means clustering algorithm.
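For instance, a minimal sketch with scikit-learn's KMeans, using made-up (lat, lon) pairs; for addresses spread over a large area you would project to planar coordinates first, since k-means assumes Euclidean distance:

```python
import numpy as np
from sklearn.cluster import KMeans

# hypothetical (lat, lon) pairs; real code would geocode the addresses first
coords = np.array([[40.71, -74.00], [40.73, -73.99],
                   [34.05, -118.24], [34.10, -118.30]])

labels = KMeans(n_clusters=2, n_init=10).fit_predict(coords)
print(labels)  # one cluster index per address, e.g. [0 0 1 1]
```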
You want vector quantization:
http://en.wikipedia.org/wiki/Vector_quantization
"It works by dividing a large set of points (vectors) into groups having approximately the same number of points closest to them. Each group is represented by its centroid point, as in k-means and some other clustering algorithms."
Here the vectors are the geographic coordinates of each address, and you can feed your algorithm other parameters depending on your constraints (proximity, group size, number of groups...).
You can start with k-means, but in my experience a Voronoi-based algorithm is more flexible. A good introduction here.
It depends a bit on the scale of the data you want to cluster. The brute-force approach is to calculate the distance between every pair of points and store it in a distance matrix. The full matrix has N^2 entries, and since the distance from A to B is the same as from B to A, you only need half of them, so the resulting set is N^2/2.
For relatively close lat/lon coordinates you can sometimes get away with treating lat/lon as an x,y grid and calculating the Cartesian distance. Since the real world is not flat, the Cartesian distance will have some error. For a more exact calculation, which you should use if your addresses are spread across the country, see this link from Mathforum.com.
If you don't have the scale to handle the entire distance matrix, you will need some algorithmic work to increase efficiency.
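For reference, a minimal sketch of the usual great-circle (haversine) formula, assuming a spherical Earth with mean radius 6371 km:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # great-circle distance between two (lat, lon) points given in degrees
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

print(haversine_km(40.71, -74.00, 34.05, -118.24))  # New York to LA, ~3940 km
```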
The "N groups" and "M addresses per group" constraints are mutually exclusive. One implies the other.
Build a matrix of distances between all addresses.
Starting with a random address, sort the matrix by ascending distance to that address
Removing addresses from the matrix as you go along, place the addresses closest to the start address into a new group until you reach your stopping criterion (group size or max distance).
Once a group is full, choose another random address and resort the matrix by distance to that address
Continue like this until all addresses are taken out of the matrix.
If the addresses were distributed evenly, each group would take a roughly circular shape around its start address. The problem comes when a start address is near an existing group. When that happens, the new group will partly wrap around the old one, and could even encircle it completely if your stopping criterion is only group size. If you use the max-distance constraint, this won't happen (assuming no other constraints).
I don't really know if this is a good way of doing it, but it's what I'd try. I'm sure a lot of optimization would be required, especially for addresses on the edges.
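A minimal sketch of those steps with the group-size stopping criterion, assuming planar (x, y) points; it recomputes distances per seed rather than pre-building the full matrix, to keep the example short, and the max-distance variant would cut the sorted list at the threshold instead of at max_size:

```python
import math, random

def greedy_groups(points, max_size):
    remaining = set(range(len(points)))
    groups = []
    while remaining:
        seed = random.choice(list(remaining))  # a random start address
        # sort the remaining addresses by distance to the seed
        by_dist = sorted(remaining,
                         key=lambda i: math.dist(points[i], points[seed]))
        group = by_dist[:max_size]   # take the closest until the group is full
        groups.append(group)
        remaining -= set(group)      # remove them from further consideration
    return groups

points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
print(greedy_groups(points, 3))  # two groups of three nearby indices, e.g. [[0, 1, 2], [3, 4, 5]]
```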
