Efficient method for finding points near other points - data-structures

I've got a geospatial search problem that I'm hoping someone can help with.
I have a set of points of class A and a set of class B, and I'd like to find all points of A within some distance of any point of B.
For example, given the below:
-----------
| B     B |
|123  4 5 |
-----------
If points of type A are the numbers above, and the distance function only allows adjacent points, then the result of my search would be all of the numbers except '4'.
Using R-trees, quadtrees, and similar structures lets me find the closest points of A to a single point, but finding the points of A near every point of B would seem to always require O(... n), where n is the size of the set of Bs in the example above.
If there are any examples of existing implementations of this type of functionality (none of the major geo-supporting data stores or indexes seem to support it as far as I can tell), or ideas of strategies that might help this work at scale, that would be great to hear about also.
Hoping someone with some experience in the domain can help. Thanks in advance.

If you want to find the closest points to a set of other points, you can use the hierarchy of a quadtree: i.e., when you run your query, descend the tree with the whole set of other points at once instead of issuing one query per point.
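For a concrete starting point, here's a minimal sketch of that batched idea using SciPy's cKDTree (the library choice is mine, not part of the answer above): query_ball_tree descends both trees' hierarchies together, so you don't pay a separate traversal for each point of B.

import numpy as np
from scipy.spatial import cKDTree

# Hypothetical data: A is large, B is small.
A = np.random.rand(100000, 2)
B = np.random.rand(50, 2)

tree_A = cKDTree(A)
tree_B = cKDTree(B)

# matches[i] lists the indices of all points of A within distance d of B[i].
d = 0.01
matches = tree_B.query_ball_tree(tree_A, r=d)

# All points of A within distance d of at least one point of B.
near_any_B = sorted({i for idxs in matches for i in idxs})
print(len(near_any_B), "points of A are near some B")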


How can this bipartite matching solution be improved?

I'm working through codefights and am attempting the busyHolidays challenge from the Instacart company challenges.
The challenge provides three arrays. Shoppers contains strings representing the start and end times of their shifts. Orders contains strings representing the start and end times of the orders, and leadTime contains integers representing the number of minutes it takes to complete the job.
The goal is to determine if the orders can be matched to shoppers such that each shopper has only one order and each order has a shopper. An order may only be matched to a shopper if the shopper can both begin and complete it within the order time.
I have a solution that passes 19/20 tests, but since I can't see the last test I have no idea what's going wrong. I originally spent a couple of days trying to learn algorithms like Edmonds's algorithm and the Hungarian algorithm, but my lack of CS background and weakness in math kind of bit me in the ass, and I couldn't wrap my head around how to actually implement those methods. So I came up with a solution that weights each node on each side of the graph according to its number of possible connections. I would appreciate it if anyone could take a look at my solution and either point out where it might be messing up, or suggest a more standard solution to the problem in a way that might be easier for someone without formal training in algorithms to understand. Thanks in advance.
I'll put the code in a gist since it's fairly lengthy.
Code: https://gist.github.com/JakeTompkins/7e1afc4722fb828f26f8f6a964774a25
Well, I don't see any reason to think that the algorithm you're writing is actually going to work, so the question of where you might be messing it up doesn't seem relevant.
You have correctly identified this as an instance of the assignment problem. More specifically, it is the "maximum bipartite matching" problem, and the Edmonds-Karp algorithm is the simplest way to solve it (https://en.wikipedia.org/wiki/Edmonds%E2%80%93Karp_algorithm).
However, this is an algorithm for finding the maximum flow in a network, which is a larger problem than simple bipartite matching, and the explanations of this algorithm are really a lot more complicated than you need. It's understandable that you had some trouble implementing this from the literature, but actually when the problem is reduced to simple (unweighted) bipartite matching, the algorithm is easy to understand:
Make an initial assignment
Try to find an improvement
Repeat until no more improvements can be found.
For bipartite matching, an "improvement" always has the same form, which is what makes this problem easy to solve. To find an improvement, you have to find a path that connects an unassigned shopper to an unassigned order, following these rules:
The path can go from any shopper to any order he/she could fulfill but does not
The path can go from any order only to the shopper that is fulfilling it in the current assignment.
You use breadth-first search to find the shortest such path, which corresponds to the improvement that changes the smallest number of existing assignments.
The path you find will necessarily have an odd number of edges, and the even-numbered edges will be existing assignments. To implement the improvement, you remove those assignments and replace them with the odd-numbered edges. There's one more odd-numbered edge than even-numbered ones, which is what makes it an improvement. It looks like this:
PREVIOUS      PATH FOUND    IMPROVED ASSIGNMENT
1             1             1
             /             /
A             A             A
 \             \
2             2             2
             /             /
B             B             B
 \             \
3             3             3
             /             /
C             C             C
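Here's a minimal sketch of that procedure in Python (the function and variable names are mine; this is not the code from the gist). It finds a maximum matching by repeatedly searching, breadth-first, for an alternating path from an unassigned shopper to an unassigned order and flipping the edges along it:

from collections import deque

def max_bipartite_matching(can_fulfill):
    """can_fulfill: dict mapping each shopper to the orders they could take.
    Returns a dict mapping order -> shopper for a maximum matching."""
    match_of_order = {}    # order -> shopper currently fulfilling it
    match_of_shopper = {}  # shopper -> order currently assigned

    def augment(start):
        # BFS for the shortest alternating path from an unassigned
        # shopper to an unassigned order.
        parent = {("s", start): None}
        queue = deque([start])
        while queue:
            s = queue.popleft()
            for o in can_fulfill[s]:
                if ("o", o) in parent:
                    continue
                parent[("o", o)] = ("s", s)
                if o not in match_of_order:
                    # Unassigned order reached: flip edges along the path.
                    node = ("o", o)
                    while node is not None:
                        shopper_node = parent[node]
                        match_of_order[node[1]] = shopper_node[1]
                        match_of_shopper[shopper_node[1]] = node[1]
                        node = parent[shopper_node]
                    return True
                # From an assigned order, the path may only continue to
                # the shopper currently fulfilling it.
                nxt = match_of_order[o]
                if ("s", nxt) not in parent:
                    parent[("s", nxt)] = ("o", o)
                    queue.append(nxt)
        return False

    for shopper in can_fulfill:
        if shopper not in match_of_shopper:
            augment(shopper)
    return match_of_order

Once you have filtered can_fulfill down to the orders each shopper can actually begin and complete in time, the challenge's answer is just whether the returned matching covers every shopper and every order.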

Any algorithm/approach to 'simulate' a cyclic graph over time?

I don't know if something like this is even possible, or where to look for something that could help address it; hence this question seeking some pointers.
Here is my situation: I have a matrix representation of a graph of activities. There are 'n' activities in the 'system', the matrix is an 'n x n' representation of them, and each entry indicates the relative impact of one activity on another:
0: no impact
1, 2, 3: low, medium, high 'positive' impact, i.e., they positively (add) contribute to the activity
-1, -2, -3: low, medium, high 'negative' impact, i.e., they negatively (subtract) contribute
(The numbers are informational and could be any numbers really; I just simplified them to 0-3.)
Now given this matrix I'll have a description of a graph. What I'd like to do is to 'simulate' the graph over time i.e., starting at time t=0 I'd like to be able to simulate the working of the 'system' over time. I'll have cycles in the graph for sure (very likely) and thus a time-step based simulation would be apt here.
I am not aware of anything that I could use to understand the effects over time for a cyclic graph. The ONLY solution I am aware of is to use System Dynamics: convert this graph into a stock/flow diagram and then simulate it to get what I want. Effectively the graph above is then a causal-loop diagram.
Issue: I'd really like to go from the matrix representation to a simulate-able system without forcing someone to understand system dynamics (basically do something in the background).
The question is: Is System Dynamics the only way to achieve what I'm looking for? How should I go about systematically converting any arbitrary matrix representation of graph into a system dynamic model?
If NOT system dynamics, then what other approaches should I look at to solve such a problem? Algorithm names with corresponding pointers for reference would be appreciated!
An example representation of a graph:
Say I have the following matrix of 3 activities:
Rows: Nodes that are 'cause' (outgoing arrows)
Columns: Nodes being 'affected' (incoming arrows)
__|  A |  B |  C |
A |  - |  3 |  2 |
B |  1 |  - | -2 |
C | -1 |  0 |  - |
If I 'start' the graph (simulation) with 10 units for A I'd like to see how the system plays out over time given the relative impacts in the matrix representation.
UPDATE: The 'simulation' would be in a series of time steps i.e., at time t=0 the node A would have the value of 10 and B would either multiply by 3 or add 3 depending on how someone would want to specify the 'impact'. The accumulated values of the nodes over time could be plotted on a graph to show the trend of how the value progresses.
It seems like you are looking for Markov chains.
Let G be a system of states.
The probability of the system transferring from one state to another is given by the matrix T.
After n transferences, the probability of the system transferring from one state to another is given by T^n.
For example, after 3 transferences, reading the row for A in T^3: given the system is in A, it has a
32.4% chance of remaining at A
31.2% chance of transferring to B
36.4% chance of transferring to C
and so on for B and C.
I would attempt to apply this to your situation for you, but I do not really understand it.
If you are to use Markov chains, you must establish a probability of the system transferring.
Note that, because this is "chance of system being at a given node", you can apply it to a population of systems.
For example: After n transferences, X.XX% of the population will be at Y.
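To make this concrete, here's a minimal sketch of simulating such a chain in Python, using the 3-activity matrix from the question. The mapping from signed impact weights to transition probabilities (shift everything non-negative, then row-normalize) is my assumption, not something from the question; the impacts are not probabilities, so treat this only as one possible reading:

import numpy as np

# The 3-activity impact matrix from the question (rows: cause, cols: affected).
M = np.array([[ 0.0, 3.0,  2.0],
              [ 1.0, 0.0, -2.0],
              [-1.0, 0.0,  0.0]])

# Crude conversion to transition probabilities: shift to non-negative,
# then normalize each row to sum to 1. This step is the assumption.
W = M - M.min()
T = W / W.sum(axis=1, keepdims=True)

# Start with all of the "mass" on A and step the distribution forward.
state = np.array([1.0, 0.0, 0.0])
for t in range(1, 11):
    state = state @ T
    print(f"t={t}: {state.round(3)}")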
There have been several attempts to do this, with various origins in the cybernetics and system dynamics literature, without much success.
Basically the problem is that your matrix, while it may contain a lot of insight and have useful applications for group process, falls short of the degree of specification that one needs to actually simulate a dynamic system.
The matrix identifies the feedback loops that exist in your system, but to relate that structure to behavior, you also need to specify phase and gain relationships around those loops (i.e., identify the stocks and the slope of each relationship, which may be nonlinear). Without doing this, there's simply no way to establish which loops are dominant drivers of the behavior.
You might be able to get some further insight out of your matrix through graph theoretic approaches to identifying and visualizing important features, but unfortunately there's no escaping the model-building process if you want to simulate.

Trilateration of a signal using Time Difference of Arrival

I am having some trouble finding or implementing an algorithm to locate a signal source. The objective of my work is to find the position of a sound emitter.
To accomplish this I am using three microphones. The technique I am using is multilateration, which is based on the time difference of arrival.
The time differences of arrival between the microphones are found using cross-correlation of the received signals.
I have already implemented the algorithm to find the time difference of arrival, but my problem is more with how multilateration works; it's unclear to me from my reference, and I couldn't find any other good reference that is free/open.
If you have some references on how I can implement a multilateration algorithm, or some other trilateration algorithm that I can use based on time difference of arrival, it would be a great help.
Thanks in advance.
The point you are looking for is the intersection of three hyperbolas. I am assuming 2D here since you only use 3 receptors. Technically, you can find a unique 3D solution but as you likely have noise, I assume that if you wanted a 3D result, you would have taken 4 microphones (or more).
The Wikipedia page does some of the computations for you. They do it in 3D; you just have to set z = 0 and solve their system of equations (7).
The system is overdetermined, so you will want to solve it in the least squares sense (this is the point in using 3 receptors actually).
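As an illustration, here's a small sketch of that least-squares approach (the receiver layout and TDOA numbers are synthetic, my own setup rather than the asker's): it converts the measured TDOAs into range differences and fits the source position to all pairwise hyperbola equations at once.

import numpy as np
from scipy.optimize import least_squares

c = 343.0                                 # speed of sound in air, m/s
mics = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # 3 receivers, 2D

# Synthetic TDOAs for pairs (1,0), (2,0), (2,1), for a source near (0.3, 0.4).
tdoa = np.array([8.93e-4, 4.98e-4, -3.95e-4])

def residuals(p):
    d = np.linalg.norm(mics - p, axis=1)
    model = np.array([d[1] - d[0], d[2] - d[0], d[2] - d[1]])
    return model - c * tdoa               # mismatch against measured ranges

sol = least_squares(residuals, x0=np.array([0.5, 0.5]))
print(sol.x)                              # ~ (0.3, 0.4)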
I can help you with multi-lateration in general.
Basically, if you want a solution in 3D you have to have at least 4 points and 4 distances from them: 2 points give you a circle of candidate solutions (the intersection of 2 spheres), 3 points give you 2 possible solutions (the intersection of 3 spheres), so in order to have one solution you need 4 spheres. So, when you have some points (4+) and the distances between them and the source (there is an easy way to transform the TDOA into a set of equations involving plain length distances rather than times), you need a way to solve the resulting set of equations. First you need a cost function (or solution-error function, as I call it), which would be something like
err(x, y, z) = sum(i=1..n) | sqrt[(x - xi)^2 + (y - yi)^2 + (z - zi)^2] - di |
where x, y, z are the coordinates of the current point in the numerical solution, and xi, yi, zi and di are the coordinates of, and distance to, the ith reference point. In order to solve this, my advice is NOT to use Newton or Gauss-Newton methods. They need the first and second derivatives of the function above, and those have finite discontinuities at some points in space; the function is not smooth, so these methods won't work. What will work is the direct-search family of function-optimization algorithms (finding minima and maxima; in our case, the minimum of the error/cost function).
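For instance, here's a minimal sketch with a direct-search optimizer, SciPy's Nelder-Mead (the reference points and distances below are made up for illustration):

import numpy as np
from scipy.optimize import minimize

# Hypothetical reference points (4+ receivers) and distances to the source.
pts = np.array([[0.0, 0.0, 0.0],
                [10.0, 0.0, 0.0],
                [0.0, 10.0, 0.0],
                [0.0, 0.0, 10.0]])
source = np.array([3.0, 4.0, 5.0])                # unknown in practice
d = np.linalg.norm(pts - source, axis=1)          # di, derived from TDOA

def err(p):
    # Sum of absolute range residuals, as in the cost function above.
    return np.abs(np.linalg.norm(pts - p, axis=1) - d).sum()

# Nelder-Mead is derivative-free, so the kinks in err() are no problem.
res = minimize(err, x0=np.zeros(3), method="Nelder-Mead")
print(res.x)                                      # ~ (3, 4, 5)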
That should help anyone wanting to find a solution for similar problem.

How to find the closest 2 points in a 100 dimensional space with 500,000 points?

I have a database with 500,000 points in a 100 dimensional space, and I want to find the closest 2 points. How do I do it?
Update: Space is Euclidean, Sorry. And thanks for all the answers. BTW this is not homework.
There's a chapter in Introduction to Algorithms devoted to finding the two closest points in two-dimensional space in O(n log n) time. You can check it out on Google Books. In fact, I suggest it to everyone, as the way they apply the divide-and-conquer technique to this problem is simple, elegant and impressive.
Although it can't be extended directly to your problem (the constant 7 would be replaced with 2^101 - 1), it should be just fine for most datasets: if you have reasonably random input, it will give you O(n log n * m) complexity, where n is the number of points and m is the number of dimensions.
edit
That's all assuming you have a Euclidean space, i.e., the length of vector v is sqrt(v0^2 + v1^2 + v2^2 + ...). If you can choose the metric, however, there could be other options to optimize the algorithm.
Use a kd tree. You're looking at a nearest neighbor problem and there are highly optimized data structures for handling this exact class of problems.
http://en.wikipedia.org/wiki/Kd-tree
P.S. Fun problem!
You could try the ANN library, but that only gives reliable results up to 20 dimensions.
Run PCA on your data to reduce the vectors from 100 dimensions to, say, 20. Then build a k-nearest-neighbor structure (a KD-tree) and get the closest 2 neighbors based on Euclidean distance.
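A rough sketch of that pipeline with scikit-learn (my choice of library; the data and parameter values are illustrative, and the PCA step makes the result approximate):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(10000, 100)      # stand-in for the 500,000 points

# Project from 100 down to 20 dimensions (lossy).
Xr = PCA(n_components=20).fit_transform(X)

# For each point, find its nearest other point; the closest pair overall
# is the pair with the smallest such distance.
dist, idx = NearestNeighbors(n_neighbors=2).fit(Xr).kneighbors(Xr)
i = dist[:, 1].argmin()             # column 0 is each point itself
print("approximate closest pair:", i, idx[i, 1], "distance", dist[i, 1])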
Generally, if the number of dimensions is very large, then you have to either use a brute-force approach (parallel + distributed/MapReduce) or a clustering-based approach.
Use the data structure known as a KD-TREE. You'll need to allocate a lot of memory, but you may discover an optimization or two along the way based on your data.
http://en.wikipedia.org/wiki/Kd-tree
My friend was working on his PhD thesis years ago when he encountered a similar problem. His work was on the order of 1M points across 10 dimensions. We built a kd-tree library to solve it. We may be able to dig up the code if you want to contact us offline.
Here's his published paper:
http://www.elec.qmul.ac.uk/people/josh/documents/ReissSelbieSandler-WIAMIS2003.pdf

Averaging a set of points on a Google Map into a smaller set

I'm displaying a small Google map on a web page using the Google Maps Static API.
I have a set of 15 co-ordinates, which I'd like to represent as points on the map.
Due to the map being fairly small (184 x 90 pixels) and the upper limit of 2000 characters on a Google Maps URL, I can't represent every point on the map.
So instead I'd like to generate a small list of co-ordinates that represents an average of the big list.
So instead of having 15 sets, I'd end up with 5 sets whose positions approximate the positions of the 15. Say there are 3 points in closer proximity to each other than to any other point on the map; those points would be collapsed into 1 point.
So I guess I'm looking for an algorithm that can do this.
Not asking anyone to spell out every step, but perhaps point me in the direction of a mathematical principle or general-purpose function for this kind of thing?
I'm sure a similar function is used in, say, graphics software, when pixellating an image.
(If I solve this I'll be sure to post my results.)
I recommend K-means clustering when you need to cluster N objects into a known number K < N of clusters, which seems to be your case. Note that one cluster may end up with a single outlier point and another with say 5 points very close to each other: that's OK, it will look closer to your original set than if you forced exactly 3 points into every cluster!-)
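For what it's worth, a minimal sketch with scikit-learn's KMeans (the library and parameter choices are mine, just to show the shape of the solution):

import numpy as np
from sklearn.cluster import KMeans

coords = np.random.rand(15, 2)       # stand-in for the 15 lat/lng pairs

km = KMeans(n_clusters=5, n_init=10).fit(coords)
print(km.cluster_centers_)           # the 5 representative points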
If you are searching for such functions/classes, have a look at MarkerClusterer and MarkerManager utility classes. MarkerClusterer closely matches the described functionality, as seen in this demo.
In general I think the area you need to search around in is "vector quantization". I've got an old book titled Vector Quantization and Signal Compression by Allen Gersho and Robert M. Gray which provides a bunch of examples.
From memory, the Lloyd iteration is a good algorithm for this sort of thing. It can take an input set and reduce it to a fixed-size set of points. Basically: uniformly or randomly distribute your points around the space; map each of your inputs to the nearest quantized point; then compute the error (e.g., sum of distances or root-mean-square); then, for each output point, set it to the center of the set that maps to it. This will move the point and possibly even change the set that maps to it. Perform this iteratively until no changes are detected from one iteration to the next.
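Here's a minimal sketch of that iteration in plain NumPy, following the steps above (the function name and defaults are mine):

import numpy as np

def lloyd(points, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start from a random subset of the inputs as the quantized points.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Map each input to its nearest quantized point.
        d = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # Move each quantized point to the center of the inputs mapped to it.
        new = np.array([points[labels == j].mean(axis=0)
                        if np.any(labels == j) else centers[j]
                        for j in range(k)])
        if np.allclose(new, centers):   # stop when nothing changes
            break
        centers = new
    return centers

This is the same iteration K-means performs, so it converges on the same kind of answer as the clustering suggestion above.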
Hope this helps.
