Constant time search - algorithm

Suppose I have a rod which I cut to pieces. Given a point on the original rod, is there a way to find out which piece it belongs to, in constant time?
For example:
|------------------|---------|---------------|
0.0                4.5       7.8532          9.123
Given a position:
                               ^
                               |
                             8.005
I would like to get the 3rd piece.
It is possible to get such an answer easily in O(log n) time with binary search, but is it possible to do it in O(1) if I pre-process the "cut" positions somehow?

If you assume the point you want to query is uniformly randomly chosen along the rod, then you can have an EXPECTED constant-time solution, without a crazy memory explosion, as follows. Break the rod up into N equally spaced pieces, where N is the number of original, irregularly spaced segments in your rod, and record for each of the N equal-sized pieces which of the original irregular segment(s) it overlaps. To answer a query, take the query point and do a simple round-off to find which equally spaced piece it lies in, use that index to look up which of your original segments intersect that piece, and then check each intersecting original segment to see whether it contains your point (you can use binary search within the piece if you want the worst case to stay logarithmic). The expected running time of this approach is constant if you assume the query point is chosen uniformly at random along the rod, and the memory is O(N) if your rod was originally cut into N irregular pieces, so there are no crazy memory requirements.
PROOF OF EXPECTED O(1) RUNNING TIME:
Count the total number of intersection pairs between your N original irregular segments and the N equally spaced pieces constructed above. This total is at most 2*(N+1): if you sort all the end-points of the regular and irregular segments, every new intersection pair can be charged to one of the end-points defining either a regular or an irregular segment. So you have a multiset of at most 2(N+1) references to your irregular segments, distributed in some fashion among the N regular pieces they intersect; the actual distribution does not matter. For a uniform query point, each regular piece is chosen with probability 1/N, so the expected number of irregular segments that have to be checked is at most 2*(N+1)/N = O(1).
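For illustration, here is a minimal Python sketch of this bucketing idea (the function names and bucket layout are mine, not from the answer; it assumes the cut positions include both rod ends):

import bisect

def build_buckets(cuts):
    # cuts: sorted cut positions including both rod ends, e.g. [0.0, 4.5, 7.8532, 9.123]
    # For each of N equal-width buckets, record the range of irregular segments overlapping it.
    n = len(cuts) - 1                                  # number of irregular segments
    lo, width = cuts[0], (cuts[-1] - cuts[0]) / n
    buckets = []
    for b in range(n):
        b_lo, b_hi = lo + b * width, lo + (b + 1) * width
        first = max(bisect.bisect_right(cuts, b_lo) - 1, 0)
        last = min(bisect.bisect_left(cuts, b_hi) - 1, n - 1)
        buckets.append((first, last))
    return buckets, lo, width

def find_segment(x, cuts, buckets, lo, width):
    b = min(int((x - lo) / width), len(buckets) - 1)   # O(1) round-off to a bucket
    first, last = buckets[b]
    for i in range(first, last + 1):                   # expected O(1) candidates
        if cuts[i] <= x <= cuts[i + 1]:
            return i                                   # 0-based segment index
    return None

cuts = [0.0, 4.5, 7.8532, 9.123]
buckets, lo, width = build_buckets(cuts)
print(find_segment(8.005, cuts, buckets, lo, width))   # -> 2, i.e. the 3rd piece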

For arbitrary cuts and precisions, not really, you have to compare the position with the various start or end points.
But, if you're only talking a small number of cuts, performance shouldn't really be an issue.
For example, even with ten segments, you only have nine comparisons, not a huge amount of computation.
Of course, you can always turn the situation into a polynomial formula (such as ax^4 + bx^3 + cx^2 + dx + e), generated using simultaneous equations, which will give you a segment number, but the highest power tends to rise with the segment count, so it's not necessarily as efficient as simple checks.

You're not going to do better than lg n with a comparison-based algorithm. Reinterpreting the 31 non-sign bits of a positive IEEE float as a 31-bit integer is an order-preserving transformation, so tries and van Emde Boas trees are both options. I would steer you first toward a three-level trie.
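For concreteness, a small Python sketch of that reinterpretation (the helper name is mine): for positive IEEE-754 floats the resulting integers sort in the same order as the floats, so they can serve as keys for a trie or van Emde Boas tree.

import struct

def float_key(x):
    # Reinterpret the bits of a positive IEEE-754 single-precision float as an
    # unsigned integer; for positive floats this mapping preserves ordering.
    return struct.unpack('>I', struct.pack('>f', x))[0]

assert float_key(4.5) < float_key(7.8532) < float_key(9.123)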

You could assign an integral number to every position and then use that as index into a lookup table, which would give you constant-time lookup. This is pretty easy if your stick is short and you don't cut it into pieces that are fractions of a millimeter long. If you can get by with such an approximation, that would be my way to go.
There is an enhanced way that generalizes this even further. In each element of a lookup table, you store the middle position and the segment IDs to the left and right of it. This makes one lookup (O(1)) plus one comparison (O(1)). The downside is that the lookup table has to be fine enough that no table element's range ever spans more than two different segments. Again, it depends on your requirements and input data whether this works or not.
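A rough Python sketch of that enhanced table (names and layout are mine; it assumes the table is fine enough that no cell contains more than one cut):

def build_table(cuts, cells):
    # Each table cell stores (boundary, left segment id, right segment id).
    lo, width = cuts[0], (cuts[-1] - cuts[0]) / cells
    table, seg = [], 0
    for c in range(cells):
        cell_lo, cell_hi = lo + c * width, lo + (c + 1) * width
        while seg + 1 < len(cuts) - 1 and cuts[seg + 1] <= cell_lo:
            seg += 1                                   # skip segments ending before this cell
        if cuts[seg + 1] < cell_hi:                    # exactly one cut falls inside the cell
            table.append((cuts[seg + 1], seg, seg + 1))
        else:                                          # the whole cell lies in one segment
            table.append((cell_hi, seg, seg))
    return table, lo, width

def lookup(x, table, lo, width):
    boundary, left, right = table[min(int((x - lo) / width), len(table) - 1)]
    return left if x < boundary else right             # one lookup plus one comparison

cuts = [0.0, 4.5, 7.8532, 9.123]
table, lo, width = build_table(cuts, cells=64)
print(lookup(8.005, table, lo, width))                 # -> 2, i.e. the 3rd piece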

Related

Nearest neighbor searches in non-metric spaces

I would like to know about nearest neighbor search algorithms when working in non-metric spaces. In particular, is there any variant of a kd-tree algorithm in this setting with provable time complexity, etc.?
Probably of more theoretical interest for you:
The PH-Tree is similar to a quadtree; however, it transforms floating-point coordinates into a non-metric system before storing them. The PH-Tree performs all queries (including kNN queries) on the non-metric data using a non-metric distance function (you can define your own distance functions on top of that).
In terms of kNN, the PH-Tree performs on par with trees like R+Trees and usually outperforms kd-trees.
The non-metric data storage appears to have little negative, possibly even positive, effect on performance, except maybe for the (almost negligible) execution time for the transformation and distance function.
The reason the data is transformed comes from an inherent constraint of the tree: the tree is a bit-wise trie, which means it can only store bit sequences (which can be seen as integer numbers). In order to store floating-point numbers in the tree, we simply use the IEEE bit representation of the floating-point number and interpret it as an integer (this works fine for positive numbers; negative numbers are a bit more complex). Crucially, this preserves the ordering, i.e. if a floating-point value f1 is larger than f2, then int(f1) is also always larger than int(f2). Trivially, this transformation allows storing floating-point numbers as integers without any loss of precision(!).
The transformation is non-metric because the leading bits (after the sign bit) of a floating-point number are the exponent bits, followed by the fraction bits. Clearly, if two numbers differ in their exponent bits, their distance grows exponentially faster (or slower for negative exponents) compared to distances caused by differences in the fraction bits.
Why did we use a bit-wise trie? If we have d dimensions, it allows an easy transformation such that we can map the n'th bit of each of the d values of a coordinate into a bit string with d bits. For example, for d=60 we get a 60-bit string. Assuming a CPU register width of 64 bits, this means we can perform many operations related to queries in constant time, i.e. many operations cost just one CPU operation, independent of whether we have 3 dimensions or 60 dimensions. It's probably hard to understand what's going on from this short text; more details on this can be found here.
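As a toy illustration of that interleaving (my own sketch, not the PH-Tree code): take the n'th bit of each of the d values and pack them into one d-bit group, from the most significant bit position downward.

def interleave(values, bits):
    # For each bit position n (from high to low), collect the n'th bit of each
    # of the d values into one d-bit group and append that group to the key.
    d, key = len(values), 0
    for n in range(bits - 1, -1, -1):
        group = 0
        for v in values:
            group = (group << 1) | ((v >> n) & 1)
        key = (key << d) | group
    return key

# d = 3 values of 3 bits each -> a 9-bit key made of three 3-bit groups
print(bin(interleave([0b101, 0b011, 0b110], bits=3)))   # -> 0b101011110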
NMSLIB provides a library for performing Nearest Neighbor Search in non-metric spaces. That GitHub page links a dozen papers to read, though not all of them apply to non-metric spaces.
Unfortunately, there are few theoretical results regarding the complexity of Nearest Neighbor Search in non-metric spaces, and there are no comprehensive empirical evaluations.
I can only see some theoretical results in Effective Proximity Retrieval by Ordering Permutations, but I am not convinced. However, I suggest you take a look.
There seem to be few people, if any, who use k-d trees for non-metric spaces; they seem to use VP-trees and the like. Densitrees are also used, as described in Near Neighbor Search in Nonmetric Spaces.
Intuitively, densitrees are a class of decorated trees that hold the points of the dataset in a way similar to the metric tree. The critical difference lies in the nature of the tree decoration: instead of having one or several real values reflecting some bounds on the triangle inequality attached to every tree node, each densitree node is associated with a particular classifier, called here a density estimator.

Generating 100 balls at random position within restricted space (has radius and no overlapping)

For example, the restricted space is 100 x 100 x 100 big, and the radius of each ball is 5. I need to generate 100 of these balls at random positions within this space, with no overlapping allowed. I came up with two approaches:
Use srand to get 100 positions, then do a check and delete balls that overlap each other (i.e. the distance between the centers of two balls is less than twice the radius), then generate another x balls (x being the number of balls deleted) and keep repeating the process until the 100 balls don't overlap.
First divide the space into 100 cubes, and place each ball within its allocated cube using srand; this way they won't overlap at all.
I feel the first way is more proper in terms of randomness but too time-consuming, while the second way is fast and easy but I'm not sure how random it really is. This model is trying to simulate the positions of molecules in the air. Maybe neither of these ways is good; please let me know if there's a better way. Thanks in advance!
Edit:
@Will suggested an option that's similar to, but much cleaner than, my original first approach: every time a new ball is added, check whether it overlaps any existing one, and regenerate it if it does. The number of checks is 1 + 2 + 3 + ... + (n-1), which is about O(n^2). I still wonder if there's a faster algorithm, though.
You can do an O((n + f) log n) algorithm, where f is the number of failed attempts. Essentially, the part that takes too long is finding which neighboring balls you overlap with. You can use a data structure called a KD-tree to efficiently store the positions of the balls. Then you can look up the nearest neighboring ball through the KD-tree, which takes O(log n) time. Determine whether they overlap, then add the ball to the space and to the KD-tree -- inserting is an O(log n) operation. In total, n balls each taking O(log n) gives O(n log n), and accounting for failed attempts gives O((n + f) log n). CGAL (the Computational Geometry Algorithms Library) provides a nice KD-tree implementation. Here is a link to CGAL and a link to KD-trees:
http://www.cgal.org/
https://en.wikipedia.org/wiki/K-d_tree
There are other structures like a K-D tree, but this would be the easiest to use for your case.
If you would like to avoid using a fancy data structure, you can compute a grid over the space and insert each accepted ball into its grid cell. Then, when checking for overlap, you only need to check the balls in the same and adjacent cells (assuming the cell size is at least one ball diameter, so a ball cannot overlap anything farther than one cell away). This does not improve the worst-case time complexity, but it is a common method in computer graphics to speed up neighbor-finding routines in practice.
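A short Python sketch of that grid method (the constants and names are mine): with a cell size of one ball diameter, a candidate only has to be tested against balls in the 3x3x3 block of cells around it.

import random, math

SPACE, R = 100.0, 5.0
CELL = 2 * R                                   # cell edge = ball diameter

def place_balls(count=100, max_tries=100_000):
    grid, balls, tries = {}, [], 0             # grid maps (i, j, k) -> centers in that cell
    while len(balls) < count and tries < max_tries:
        tries += 1
        c = tuple(random.uniform(R, SPACE - R) for _ in range(3))
        ci, cj, ck = (int(v // CELL) for v in c)
        neighbours = (grid.get((ci + di, cj + dj, ck + dk), ())
                      for di in (-1, 0, 1) for dj in (-1, 0, 1) for dk in (-1, 0, 1))
        if all(math.dist(c, o) >= 2 * R for cell in neighbours for o in cell):
            balls.append(c)
            grid.setdefault((ci, cj, ck), []).append(c)
    return balls

print(len(place_balls()))                      # 100, given enough room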
Instead of dividing the space into 100 cubes, you could divide it into 8,000 cubes of 5 x 5 x 5 and then place the balls centered in 100 of those cubes. This way the balls are still placed randomly in the space, but they can't overlap.
Edit: Also, when checking whether the balls overlap, you might want to think about using a data structure that lets you check only the balls closest to the ball you are testing. Checking all of them is wasteful, because there's no chance that balls on totally different sides of the space overlap. I'm not too familiar with octrees, but you might want to look into them if you really want to optimize your code.
The volume of your spheres is about 1/1900th the volume of your space, so if you just pick random locations and check for overlap, you won't have to regenerate many. And if you really only need 100, using a fancy algorithm like Octrees to check for collisions would be a waste.
Of course as soon as you code it up, someone will ask you to do it for 10,000 spheres instead of 100, so everything I just said will be wrong.
I like Chris's suggestion of just putting them in randomly chosen cubes. Not the most realistic perhaps, but close and much simpler.

Efficiently re-computing area under ROC when one label changes

Say that you have a list of scores with binary labels (for simplicity, assume no ties), and that we've used the labels to compute the area under the associated receiver operating characteristic (ROC) curve. For a set of n scores, this calculation is straightforward to do in O(n log n) time: you sort the list, then traverse it in sorted order, keeping a running total of the number of positively labeled examples you've seen so far. Every time you see a negative label, you add the current count of positives to a sum, and at the end you divide that sum by the product of the number of positives and the number of negatives.
Now, having done that calculation, say that someone comes along and flips exactly one label (from positive to negative or vice versa). The scores themselves do not change, so you don't need to re-sort. It's straightforward to calculate the new area under the curve (AUC) in O(n) time by re-traversing the sorted list. My question is, is it possible to compute the new AUC in something better than O(n)? I.e., do I have to re-traverse the entire sorted list to get the new AUC?
I think I can do the re-calculation in O(1) time by storing, at each position in the ranked list, counts of the number of positives and negatives above that position. But I am going to need to repeatedly recalculate the AUC as more labels get flipped, and I think that if I rely on those stored values, then updating them for the next flip will be O(n).
Yes, it is possible to update the AUC in O(log(n)) per flip. You need two sets of scores, one for positives and one for negatives, that provide the following operations:
Querying the number of items with higher (or lower) score than a given value (score of the label being flipped).
Inserting and removing the elements.
Knowing the number of positives above/below a given position lets you update the AUC efficiently, as you already mentioned. After that, you have to remove the item from the set of positives/negatives and insert it into the negatives/positives, respectively.
Balanced search trees can do both operations in O(log(n)).
Furthermore, the actual values of the scores do not matter; only the position is relevant. This leads to a very simple and efficient implementation using a binary indexed tree. See http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=binaryIndexedTrees for an explanation.
Also, you don't really need to maintain two sets. Since you already know the total number of positives and negatives above a given position, a single set is enough.
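For concreteness, here is a Python sketch of the single-structure variant (class and variable names are mine; it assumes no tied scores). A Fenwick (binary indexed) tree over the rank positions records which items are currently positive, which is enough to update the concordant-pair count, and hence the AUC, in O(log n) per flip.

class Fenwick:
    def __init__(self, n):
        self.t = [0] * (n + 1)
    def add(self, i, v):                      # i is a 0-based rank
        i += 1
        while i < len(self.t):
            self.t[i] += v
            i += i & -i
    def prefix(self, i):                      # how many positives have rank < i
        s = 0
        while i > 0:
            s += self.t[i]
            i -= i & -i
        return s

class AUCTracker:
    def __init__(self, scores, labels):       # labels are 0/1, no tied scores
        order = sorted(range(len(scores)), key=lambda i: scores[i])
        self.rank = {idx: r for r, idx in enumerate(order)}
        self.labels = list(labels)
        self.P, self.N = sum(labels), len(labels) - sum(labels)
        self.pos = Fenwick(len(scores))       # 1 at rank r if that item is positive
        for idx, lab in enumerate(labels):
            if lab:
                self.pos.add(self.rank[idx], 1)
        self.C = 0                            # concordant pairs: positive ranked above negative
        for idx, lab in enumerate(labels):
            if lab:
                r = self.rank[idx]
                self.C += r - self.pos.prefix(r)      # negatives ranked below this positive

    def auc(self):
        return self.C / (self.P * self.N)

    def flip(self, idx):                      # O(log n): flip the label of item idx
        r = self.rank[idx]
        pos_below = self.pos.prefix(r)
        negs_below = r - pos_below
        if self.labels[idx]:                  # positive -> negative
            pos_above = self.P - pos_below - 1
            self.C += pos_above - negs_below
            self.pos.add(r, -1)
            self.P, self.N = self.P - 1, self.N + 1
        else:                                 # negative -> positive
            pos_above = self.P - pos_below
            self.C += negs_below - pos_above
            self.pos.add(r, 1)
            self.P, self.N = self.P + 1, self.N - 1
        self.labels[idx] = 1 - self.labels[idx]

scores, labels = [0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]
t = AUCTracker(scores, labels)
print(t.auc())                                # 0.75
t.flip(3)
print(t.auc())                                # 0.333...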

Algorithm for nearest point

I've got a list of ~5000 points (specified as longitude/latitude pairs), and I want to find the nearest 5 of these to another point, specified by the user.
Can anyone suggest an efficient algorithm for working this out? I'm implementing this in Ruby, so if there's a suitable library then that would be good to know, but I'm still interested in the algorithm!
UPDATE: A couple of people have asked for more specific details on the problem. So here goes:
The 5000 points are mostly within the same city. There might be a few outside it, but it's safe to assume that 99% of them lie within a 75km radius, and that all of them lie within a 200km radius.
The list of points changes rarely. For the sake of argument, let's say it gets updated once per day, and we have to deal with a few thousand requests in that time.
You could accelerate the search by partitioning the 2D space with a quad-tree or a kd-tree; once you've reached a leaf node, you compare the remaining distances one by one until you find the closest match.
See also this blog post which refers to this other blog post which both discuss nearest neighbors searches with kd-trees in Ruby.
You can get a very fast upper-bound estimator on distance using Manhattan distance (scaled for latitude); this should be good enough for rejecting 99.9% of candidates if they're not close. (EDIT: you've since told us they are all close; in that case your exact metric should be distance-squared, as per Lars H's comment.)
Consider this equivalent to rejecting anything outside a spherical-rectangle bounding-box (as an approximation to a circle bounding-box).
I don't do Ruby, so here is the algorithm in pseudocode:
Let the latitude and longitude of your reference point P be (pa, po), and of the other point X be (xa, xo).
Precompute ka, the latitude scaling factor for longitudinal distances: ka = cos(pa), with pa in degrees. (Strictly, treating ka as constant is a linearized approximation in the vicinity of P.)
Then the distance estimator is: D(X,P) = |xa-pa| + ka*|xo-po| = da + ka*do
where |z| means abs(z). At worst this overestimates the true distance by a factor of √2 (when da == ka*do), hence we allow for that as follows:
Do a running search and keep Dmin, the fifth-smallest scaled-Manhattan-distance estimate.
Hence you can reject upfront all points for which D(X,P) > √2 * Dmin, since their true distance is at least √(da² + (ka*do)²) ≥ D(X,P)/√2 > Dmin; that should eliminate 99.9% of points.
Keep a list of all remaining candidate points with D(X,P) <= √2 * Dmin, and update Dmin whenever you find a new fifth-smallest D. A priority queue, or else a list of (coord, D) pairs, are good data structures.
Note that we never computed Euclidean distance, we only used float multiplication and addition.
(Consider this similar to quadtree except filtering out everything except the region that interests us, hence no need to compute accurate distances upfront or build the data structure.)
It would help if you told us the expected spread in latitudes and longitudes (degrees, minutes, or what?). If all the points are close together, the √2 factor in this estimator will be too conservative and mark every point as a candidate; a lookup-table-based distance estimator would then be preferable.
Pseudocode (written here as runnable Python; it assumes P and the list points are already defined):

import heapq, math

ka = math.cos(math.radians(P[0]))                        # latitude scaling factor at P
D = lambda X: abs(X[0] - P[0]) + ka * abs(X[1] - P[1])   # scaled-Manhattan estimate

worst5 = [-D(X) for X in points[:5]]; heapq.heapify(worst5)   # 5 smallest D so far, negated
Dmin = -worst5[0]                                        # fifth-smallest estimate so far
candidates = [(D(X), X) for X in points[:5]]
for X in points[5:]:
    d = D(X)
    if d <= math.sqrt(2) * Dmin:
        candidates.append((d, X))
        heapq.heappushpop(worst5, -d); Dmin = -worst5[0]
# after this pass, reject candidates with d > sqrt(2) * Dmin (final value of Dmin),
# then do a second pass over the survivors to find the 5 lowest exact distances
Since your list is quite short, I'd highly recommend brute force. Just compare all 5000 to the user-specified point. It'll be O(n) and you'll get paid.
Other than that, a quad-tree or kd-tree is the usual approach to spatial subdivision. But in your case, you'll end up doing a linear number of insertions into the tree and then a constant number of logarithmic lookups... a bit of a waste, when you're probably better off just doing a linear number of distance comparisons and being done with it.
Now, if you want to find the N nearest points, you're looking at sorting on the computed distances and taking the first N, but that's still O(n log n)ish.
EDIT: It's worth noting that building the spatial tree becomes worthwhile if you're going to reuse the list of points for multiple queries.
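For reference, the brute-force version really is just a few lines of Python (a sketch; it uses an equirectangular approximation, which assumes the points are spread over a city-sized area as described in the question):

import heapq, math

def nearest5(points, p):
    # points and p are (lat, lon) pairs in degrees
    ka = math.cos(math.radians(p[0]))         # shrink longitude differences at this latitude
    d2 = lambda q: (q[0] - p[0]) ** 2 + (ka * (q[1] - p[1])) ** 2
    return heapq.nsmallest(5, points, key=d2)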
Rather than pure brute force, for 5000 nodes I would calculate the x+y distance for every node instead of the straight-line distance.
Once you've sorted that list, if e.g. x+y for the 5th node is 38, you can rule out any node where either x or y distance is > 38. This way, you can rule out a lot of nodes without having to calculate the straight line distance. Then brute force calculate the straight line distance for the remaining nodes.
These algorithms are not easily explained, thus I will only give you some hints in the right direction. You should look for Voronoi Diagrams. With a Voronoi Diagram you can easily precompute a graph in O(n^2 log n) time and search the closest point in O(log n) time.
Precomputation is done with a cron job at night and searching is live. This corresponds to your specification.
Now you could save the k closest neighbors of each of your 5000 points, then start from the nearest point given by the Voronoi diagram and search for the remaining 4 points among them.
But be warned that these algorithms are not very easy to implement.
A good reference is:
de Berg et al., Computational Geometry: Algorithms and Applications (2008), Sections 7.1 and 7.2
Since you have that few points, I would recommend doing a brute-force search, to the effect of trying all points against each other, which is an O(n^2) operation; with n = 5000 that is roughly 12.5 million iterations of a suitable algorithm, and you just store the relevant results. This would have sub-100 ms execution time in C, so we are looking at a second or two at most in Ruby.
When the user picks a point, you can use your stored data to give the results in constant time.
EDIT: I re-read your question, and it seems the user provides their own query point. In that case it's faster to just do an O(n) linear search through your set each time the user provides a point.
If you need to repeat this multiple times, with different user-entered locations, but don't want to implement a quad-tree (or can't find a library implementation), then you can use a (kind of) locality-sensitive hashing approach that's fairly intuitive:
take your (x,y) pairs and create two lists, one of (x, i) and one of (y, i) where i is the index of the point
sort both lists
then, when given a point (X, Y),
binary search for X and Y in the two lists
expand outwards on both lists, looking for common indices
for common indices, calculate exact distances
stop expanding when the differences in X and Y exceed the exact distance of the most-distant of the current 5 points.
all you're doing is saying that a nearby point must have a similar x and a similar y value...
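A Python sketch of this two-list scheme (all names are mine; assumes there are at least 5 points). It expands each frontier toward whichever side is closer, computes an exact distance whenever an index has been seen on both lists, and stops once both axis gaps exceed the current fifth-best exact distance:

import bisect, heapq, math

def nearest5_two_lists(points, X, Y):
    xs = sorted((px, i) for i, (px, py) in enumerate(points))
    ys = sorted((py, i) for i, (px, py) in enumerate(points))
    seen_x, seen_y = set(), set()
    best = []                                   # max-heap (negated) of the 5 best exact distances

    def consider(i):
        d = math.dist(points[i], (X, Y))
        if len(best) < 5:
            heapq.heappush(best, (-d, i))
        elif d < -best[0][0]:
            heapq.heapreplace(best, (-d, i))

    def step(lst, lo, hi, ref, seen_here, seen_other):
        # consume the frontier element nearest to ref; return new frontier and its axis gap
        take_lo = lo >= 0 and (hi >= len(lst) or ref - lst[lo][0] <= lst[hi][0] - ref)
        if take_lo:
            v, i = lst[lo]; lo -= 1
        elif hi < len(lst):
            v, i = lst[hi]; hi += 1
        else:
            return lo, hi, math.inf             # this list is exhausted
        seen_here.add(i)
        if i in seen_other:
            consider(i)                         # index common to both lists: exact distance
        return lo, hi, abs(v - ref)

    xi, yi = bisect.bisect_left(xs, (X, -1)), bisect.bisect_left(ys, (Y, -1))
    lx, rx, ly, ry = xi - 1, xi, yi - 1, yi
    while True:
        lx, rx, dx = step(xs, lx, rx, X, seen_x, seen_y)
        ly, ry, dy = step(ys, ly, ry, Y, seen_y, seen_x)
        worst = -best[0][0] if len(best) == 5 else math.inf
        if (dx > worst and dy > worst) or (dx == math.inf and dy == math.inf):
            break
    return sorted((-nd, i) for nd, i in best)   # [(distance, index), ...], nearest first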

finding closest hamming distance

I have N < 2^n randomly generated n-bit numbers stored in a file for which lookups are expensive. Given a number Y, I have to search for a number in the file that is at most Hamming distance k from Y. Naively this calls for C(n,1) + C(n,2) + C(n,3) + ... + C(n,k) worst-case lookups, which is not feasible in my case. I tried storing the distribution of 1s and 0s at each bit position in memory and prioritizing my lookups. So I stored the probability of bit i being 0/1:
Pr(bi=0), Pr(bi=1) for all i from 0 to n-1.
But it didn't help much, since N is too large and the distribution of 1s and 0s is almost equal at every bit position. Is there a way this can be done more efficiently? For now, you can assume n = 32 and N = 2^24.
Google gives a solution to this problem for k=3, n=64, N=2^34 (much larger corpus, fewer bit flips, larger fingerprints) in this paper. The basic idea is that for small k, n/k is quite large, and hence you expect that nearby fingerprints should have relatively long common prefixes if you formed a few tables with permuted bits orders. I am not sure it will work for you, however, since your n/k is quite a bit smaller.
If by "lookup", you mean searching your entire file for a specified number, and then repeating the "lookup" for each possible match, then it should be faster to just read through the whole file once, checking each entry for the hamming distance to the specified number as you go. That way you only read through the file once instead of C(n 1) + C(n 2) + C(n 3)...+C(n,k) times.
You could use quantum computation to speed up your search process while minimizing the required number of steps. I think Grover's search algorithm would be helpful to you, as it provides a quadratic speed-up for the search problem.
Perhaps you could store it as a graph, with links from each number to the next-closest numbers in the set by Hamming distance. Then all you need to do is follow one of the links to another number to find the next closest one. Use an index to keep track of where the numbers are by file offset, so you don't have to search the graph for Y when you need to find its nearby neighbors.
You also say you have 2^24 numbers, which according to Wolfram Alpha (http://www.wolframalpha.com/input/?i=2^24+*+32+bits) is only 64 MB. Could you just put it all in RAM to make the accesses faster? Perhaps that would happen automatically with caching on your machine?
If your application can afford to do some extensive preprocessing, you could, as you generate the n-bit numbers, compute all the other numbers which are at most distance k from that number and store them in a lookup table. It'd be something like a Map from each number to the set of numbers within distance k of it. riri claims you can fit it in memory, so hash tables might work well, but otherwise you'd probably need a B+ tree for the Map. Of course, this is expensive, as you mentioned before, but if you can do it beforehand, you'd have fast lookups later, either O(1) or O(log(N) + log(2^k)).
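If you do go the precomputation route, enumerating the Hamming ball of radius k around a number is straightforward, though its size 1 + C(n,1) + ... + C(n,k) grows quickly (a sketch; the helper name is mine):

from itertools import combinations

def hamming_ball(x, n, k):
    # all n-bit numbers whose Hamming distance to x is at most k
    out = {x}
    for d in range(1, k + 1):
        for bits in combinations(range(n), d):
            y = x
            for b in bits:
                y ^= 1 << b
            out.add(y)
    return out

print(len(hamming_ball(0b1010, n=4, k=2)))    # 1 + C(4,1) + C(4,2) = 11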

Resources