I'm tutoring a student and one of her assignments is to describe an O(nlogn) algorithm for the closest pair of points in the one-dimensional case. But the restriction is she's not allowed to use a divide-and-conquer approach. I understand the two-dimensional case from a question a user posted some years ago. I'll link it in case someone wants to look at it: For 2-D case (plane) - "Closest pair of points" algorithm.
However, for the 1-D case, I can only think of a solution which involves checking each and every point on the line and comparing it to the closest point to the left and right of it. But this solution isn't O(nlogn) since checking each point will take time proportional to n and the comparisons for each point would take time proportional to 2n. I'm not sure where log(n) would come from without using a divide-and-conquer approach.
For some reason, I can't come up with a solution. Any help would be appreciated.
Hint: If the points were ordered from left to right, what would you do, and what would the complexity be? What is the complexity of ordering the points first?
It seems to me that one could:
Sort the locations into order - O(n log n)
Find the differences between the ordered locations - O(n)
Find the smallest difference - O(n)
The smallest difference defines the two closest points.
The overall result would be O(n log n).
Related
I am prepping for interview leet-code type problems and I came across the k closest problem, but given a sorted array. This problem requires finding the k closest elements by value to an input value from the array. The answer to this problem was fairly straight forward and I did not have any issues determining a linear-time algorithm to solve it.
However, working on this problem got me thinking. Is it possible to solve this problem given an unsorted array in linear time? My first thought was to use a heap and that would give an O(nlogk) time complexity solution, but I am trying to determine if its possible to come up with an O(n) solution? I was thinking about possibly using something like quickselect, but the issue is that this has an expected time of O(n), not a worst case time of O(n).
Is this even possible?
The median-of-medians algorithm makes Quickselect take O(n) time in the worst case.
It is used to select a pivot:
Divide the array into groups of 5 (O(n))
Find the median of each group (O(n))
Use Quickselect to find the median of the n/5 medians (O(n))
The resulting pivot is guaranteed to be greater and less than 30% of the elements, so it guarantees linear time Quickselect.
After selecting the pivot, of course, you have to continue on with the rest of Quickselect, which includes a recursive call like the one we made to select the pivot.
The worst case total time is T(n) = O(n) + T(0.7n) + T(n/5), which is still linear. Compared to the expected time of normal Quickselect, though, it's pretty slow, which is why we don't often use this in practice.
Your heap solution would be very welcome at an interview, I'm sure.
If you really want to get rid of the logk, which in practical applications should seldom be a problem, then yes, using Quickselect would be another option. Something like this:
Partition your array in values smaller and larger than x. <- O(n).
For the lower half, run Quickselect to find the kth largest number, then take the right-side partition which are your k largest numbers. <- O(n)
Repeat step 2 for the higher half, but for the k smallest numbers. <- O(n)
Merge your k smallest and k largest numbers and extract the k closest numbers. <- O(k)
This gives you a total time complexity of O(n), as you said.
However, a few points about your worry about expected time vs worst-case time. I understand that if an interview question explicitly insists on worst-case O(n), then this solution might not be accepted, but otherwise, this can well be considered O(n) in practice.
The key here being that for randomized quickselect and random or well-behaved input, the probability that the time complexity goes beyond O(n) decreases exponentially as the input grows. Meaning that already at largeish inputs, the probability is as small as guessing at a specific atom in the known universe. The assumption on well-behaved input concerns being somewhat random in nature and not adversarial. See this discussion on a similar (not identical) problem.
I have a set of points in the plane that I want to sort based on when they encounter an arbitrary sweepline. An alternative definition is that I want to be able to sort them based on any linear combination of the x- and y-coordinates. I want to do the sorting in linear time, but am allowed to perform precomputation on the set of points in quadratic time (but preferably O(n log(n))). Is this possible? I would love a link to a paper that discusses this problem, but I could not find it myself.
For example, if I have the points (2,2), (3,0), and (0,3) I want to be able to sort them on the value of 3x+2y, and get [(0,3), (3,0), (2,2)].
Edit: In the comments below the question a helpful commenter has shown me that the naive algorithm of enumerating all possible sweeplines will give a O(n^2 log(n)) preprocessing algorithm (thanks again!). Is it possible to have a O(n log(n)) preprocessing algorithm?
First note, that enumerating all of the sweeplines takes O(n^2 log(n)), but then you have to sort the n^2 sweeplines. Doing that naively will take time O(n^3 log(n)) and space O(n^3).
I think I can get average performance down to O(n) with O(n^2 log*(n)) time and O(n^2) space spent on preprocessing. (Here log* is the iterated logarithm and for all intents and purposes it is a constant.) But this is only average performance, not worst case.
The first thing to note is that there are n choose 2 = n*(n-1)/2 pairs of points. As we rotate 360 degrees, each pair will cross the other twice, for at most O(n^2) different orderings and O(n^2) pair crossings between them. Also note that after a pair crosses, it does not cross again for 180 degrees. Over any range of less than 180 degrees, a given pair either will cross once or won't.
Now the idea is that we'll store a random O(n) of those possible orderings and which sweepline they correspond to. Between any sweepline and the next, we'll see O(n^2 / n) = O(n) pairs of points cross. Therefore both sorts are correct to on average O(1), and every inversion between the first and the order we want is an inversion between the first and second sorts. We'll use this to find our final sort in O(n).
Let me fill in details backwards.
We have our O(n) sweeplines precalculated. In time O(log(n)) we find the two nearest. Let's assume we find the following data structures.
pos1: Lookup from point to its position in sweepline 1.
points1: Lookup from position to the point there in sweepline 1.
points2: Lookup from position to the point there in sweepline 2.
We will now try to sort in time O(n).
We initialize the following data structures:
upcoming: Priority queue of points that could be next.
is_seen: Bitmap from position to whether we've added the point to upcoming.
answer: A vector/array/whatever you language calls it that will hold the answer at the end.
max_j: The farthest point in line 2 that we have added to upcoming. Starts at -1.
And now we do the following.
for i in range(n):
while is_seen[i] == 0:
# Find another possible point
max_j++
point = points2[max_j]
upcoming.add(point with where it is encountered as priority)
is_seen[pos1[point]] = 1
# upcoming has points1[i] and every point that can come before it.
answer.append(upcoming.pop())
Waving my hands vigorously, every point is put into upcoming once, and taken out once. On average, upcoming has O(1) points in it, so all operations average out to O(1). Since there are n points, the total time is O(n).
OK, how do we set up our sweeplines? Since we only care about average performance, we cheat. We randomly choose O(n) pairs of points. Each pair of points defines a sweepline. We sort those sweeplines in O(n log(n)).
Now we have to sort O(n) sweeplines. How do we do this?
Well we can sort a fixed number of them by any method we want. Let's pick 4 evenly chosen sweeplines and do that. (We actually only need to do the calculation 2x. We pick 2 pairs of points. We pick the sweepline where the first 2 cross, then the second 2 cross, then the other 2 sweeplines are at 180 degrees from the first 2, and therefore are just reversed order.) After that, we can use the algorithm above to sort a sweepline between 2 others. And do that through bisection to smaller and smaller intervals.
Now, of course, the sweeplines will not be as close as they were above. But let's note that if we expect the points to agree to within an average O(f(n)) places between the sweepline, then the heap will have O(f(n)) elements in it, and operations on it will take O(log(f(n))) time, and so we get the intermediate sweepline in O(n log(f(n)). How long is the whole calculation?
Well, we have kind of a tree of calculations to do. Let's divide the sweeplines by what level they are, the group them. The grouping will be the top:
1 .. n/log(n)
n/log(n) .. n/log(log(n))
n/log(log(n)) .. n/log(log(log(n)))
...and so on.
In each group we have O(n / log^k(n)) sweeplines to calculate. Each sweepline takes O(n log^k(n)) time to calculate. Therefore each level takes O(n^2). The number of levels is the iterated logarithm, log*(n). So total preprocessing time is O(n^2 log*(n)).
I need an algorithm for the following problem:
I'm given a set of 2D points P = { (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) } on a plane. I need to group them in pairs in the following manner:
Find two closest points (x_a, y_a) and (x_b, y_b) in P.
Add the pair <(x_a, y_a), (x_b, y_b)> to the set of results R.
Remove <(x_a, y_a), (x_b, y_b)> from P.
If initial set P is not empty, go to the step one.
Return set of pairs R.
That naive algorithm is O(n^3), using faster algorithm for searching nearest neighbors it can be improved to O(n^2 logn). Could it be made any better?
And what if the points are not in the euclidean space?
An example (resulting groups are circled by red loops):
Put all of the points into an http://en.wikipedia.org/wiki/R-tree (time O(n log(n))) then for each point calculate the distance to its nearest neighbor. Put points and initial distances into a priority queue. Initialize an empty set of removed points, and an empty set of pairs. Then do the following pseudocode:
while priority_queue is not empty:
(distance, point) = priority_queue.get();
if point in removed_set:
continue
neighbor = rtree.find_nearest_neighbor(point)
if distance < distance_between(point, neighbor):
# The previous neighbor was removed, find the next.
priority_queue.add((distance_between(point, neighbor), point)
else:
# This is the closest pair.
found_pairs.add(point, neighbor)
removed_set.add(point)
removed_set.add(neighbor)
rtree.remove(point)
rtree.remove(neighbor)
The slowest part of this is the nearest neighbor searches. An R-tree does not guarantee that those nearest neighbor searches will be O(log(n)). But they tend to be. Furthermore you are not guaranteed that you will do O(1) neighbor searches per point. But typically you will. So average performance should be O(n log(n)). (I might be missing a log factor.)
This problem calls for a dynamic Voronoi diagram I guess.
When the Voronoi diagram of a point set is known, the nearest neighbor pair can be found in linear time.
Then deleting these two points can be done in linear or sublinear time (I didn't find precise info on that).
So globally you can expect an O(N²) solution.
If your distances are arbitrary and you can't embed your points into Euclidean space (and/or the dimension of the space would be really high), then there's basically no way around at least a quadratic time algorithm because you don't know what the closest pair is until you check all the pairs. It is easy to get very close to this, basically by sorting all pairs according to distance and then maintaining a boolean look up table indicating which points in your list have already been taken, and then going through the list of sorted pairs in order and adding a pair of points to your "nearest neighbors" if neither point in the pair is in the look up table of taken points, and then adding both points in the pair to the look up table if so. Complexity O(n^2 log n), with O(n^2) extra space.
You can find the closest pair with this divide and conquer algorithm that runs in O(nlogn) time, you may repeat this n times and you will get O(n^2 logn) which is not better than what you got.
Nevertheless, you can exploit the recursive structure of the divide and conquer algorithm. Think about this, if the pair of points you removed were on the right side of the partition, then everything will behave the same on the left side, nothing changed there, so you just have to redo the O(logn) merge steps bottom up. But consider that the first new merge step will be to merge 2 elements, the second merges 4 elements then 8, and then 16,.., n/4, n/2, n, so the total number of operations on these merge steps are O(n), so you get the second closest pair in just O(n) time. So you repeat this n/2 times by removing the previously found pair and get a total O(n^2) runtime with O(nlogn) extra space to keep track of the recursive steps, which is a little better.
But you can do even better, there is a randomized data structure that let you do updates in your point set and get an expected O(logn) query and update time. I'm not very familiar with that particular data structure but you can find it in this paper. That will make your algorithm O(nlogn) expected time, I'm not sure if there is a deterministic version with similar runtimes, but those tend to be way more cumbersome.
I was reading about interesting algorithms in my free time and I just discovered the convex hull trick algorithm, with which we can compute the maxima of several lines in the plane on a given x coordinate. I found this article:
http://wcipeg.com/wiki/Convex_hull_trick
Here the author says, that the dynamic version of this algorithm runs in logarithmic time, but there is no proof. When we insert a line, we test some of his neighbors, but I dont understand how can it be O(log N) when we can test all the N lines by such an insert. Is it correct or I missed something?
UPDATE: this question is already answered, the interesting ones are the rest below
How can we delete?
I mean... if we delete a line, we will probably need the previous ones to reset the whole hull, but that algoritm is deleting all the not necessary lines, when inserting a new one.
Is it an another approach, to solve problems like above(or similar ones, for example managing queries like insert, delete, find maximum on an x point or on a given range etc.)
Thank you in advance!
To answer your first question: "How can insertion be O(logn)?", you can indeed end up checking O(n) neighbors, but note that you only need to check an extra neighbor when you discover that you need to do a delete operation.
The point is that if you are going to insert n new lines, then you can at most do the delete operation n times. Therefore the total amount of extra work is at most O(n) in addition to the O(logn) work per line that you need to find its position in the sorted data structure.
So the total effort for inserting all n lines is O(n) + O(nlogn) = O(nlogn), or in other words, amortized O(logn) per line.
The article claims amortized(not the worst case) O(log N) time per insertion. The amortize bound is easy to prove(each line is removed at most once and each check is either the last one or leads to a deletion of one line).
The article does not say that this data structure supports deletions at all. I'm not sure if it is possible to handle them efficiently. There is an obstacle: the time complexity analysis is based on fact that if we remove a line, we will never need it in the future, which is not the case when deletions are allowed.
Insertion can be quicker than O(log n), it can be achieve in O(Log h) where h is the set of already calculated Hull points. Insertion in batch or one by one can be done in O(log h) per point. You can read my articles about that:
A Convex Hull Algorithm and its implementation in O(n log h)
Fast and improved 2D Convex Hull algorithm and its implementation in O(n log h)
First and Extremely fast Online 2D Convex Hull Algorithm in O(Log h) per point
About delete: I'm pretty sure, but it has to be proven, that it can be achieve in O(log n + log h) = O(log n) per point. A text about it is available at the end of the my third article.
The following is an interview question which I've tried hard to solve. The required bound is to be less than O(n^2). Here is the problem:
You are given with a set of points S = (x1,y1)....(xn,yn). The points
are co-ordinates on the XY plane. A point (xa,ya) is said to be
greater than point (xb,yb) if and only if xa > xb and ya > yb.
The objective is the find all pairs of points p1 = (xa,ya) and p2 = (xb,yb) from the set S such that p1 > p2.
Example:
Input S = (1,2),(2,1),(3,4)
Answer: {(3,4),(1,2)} , {(3,4),(2,1)}
I can only come up with an O(n^2) solution that involves checking each point with other. If there is a better approach, please help me.
I am not sure you can do it.
Example Case: Let the points be (1,1), (2,2) ... (n,n).
There are O(n^2) such points and outputting them itself takes O(n^2) time.
I am assuming you actually want to count such pairs.
Sort descendingly by x in O(n log n). Now we have reduced the problem to a single dimension: for each position k we need to count how many numbers before it are larger than the number at position k. This is equivalent to counting inversions, a problem that has been answered many times on this site, including by me, for example here.
The easiest way to get O(n log n) for that problem is by using the merge sort algorithm, if you want to think about it yourself before clicking that link. Other ways include using binary indexed trees (fenwick trees) or binary search trees. The fastest in practice is probably by using binary indexed trees, because they only involve bitwise operations.
If you want to print the pairs, you cannot do better than O(n^2) in the worst case. I would be interested in an output-sensitive O(num_pairs) algorithm too however.
Why don't you just sort the list of points by X, and Y as a secondary index? (O(nlogn))
Then you can just give a "lazy" indicator that shows for each point that all the points on its right are bigger than it.
If you want to find them ALL, it will take O(n^2) anyway, because there's O(n^2) pairs.
Think of a sorted list, the first one is smallest, so there's n-1 bigger points, the second one has n-2 bigger points... which adds up to about (n^2)/2 == O(n^2)