Given a list of a million coordinates in the form of longitude and latitude, just as in Google Maps, how would you print the k closest cities to a given location?
I was asked this question during an interview. The interviewer said this can be done in O(n) by using insertion sort up to k rather than sorting the whole list, which is O(n log n). I found other answers online, and most say O(n log n)... was the interviewer correct?
I think that while calculating the distances you can maintain a list of the k best elements seen so far.
Every time you have a new distance, insert it into the list if it is smaller than the largest one, and remove the largest one.
This insertion can be O(k) if you are using a sorted array, or O(log k) if you are using a binary heap.
In the worst case, you will insert n times. In total, it will be O(nk) or O(n log k). If k is treated as a constant, this is O(n).
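To make the bounded-heap variant concrete, here is a minimal C++ sketch (City, sqDist and the use of squared planar distance are illustrative placeholders, not part of the original question):

#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

struct City { double lat, lon; };

// Squared planar distance as a stand-in metric; a real implementation would
// use haversine or similar for geographic coordinates.
double sqDist(const City& a, const City& b) {
    double dx = a.lat - b.lat, dy = a.lon - b.lon;
    return dx * dx + dy * dy;
}

// Keep a max-heap of size k: each insertion is O(log k), so one pass over
// n cities costs O(n log k).
std::vector<City> kClosest(const std::vector<City>& cities, const City& target, std::size_t k) {
    using Entry = std::pair<double, City>;  // (distance, city)
    auto cmp = [](const Entry& a, const Entry& b) { return a.first < b.first; };
    std::priority_queue<Entry, std::vector<Entry>, decltype(cmp)> heap(cmp);
    for (const City& c : cities) {
        double d = sqDist(c, target);
        if (heap.size() < k) {
            heap.push({d, c});
        } else if (d < heap.top().first) {  // closer than the current worst kept city
            heap.pop();
            heap.push({d, c});
        }
    }
    std::vector<City> result;
    while (!heap.empty()) { result.push_back(heap.top().second); heap.pop(); }
    return result;
}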
This is the quickselect algorithm (https://en.wikipedia.org/wiki/Quickselect).
Basically it's quicksort with a modification: whenever you have two halves, you process only one of them:
If a half contains the k-th position - continue subdividing and sorting it
If a half is completely after the k-th position - no need to sort it, we are not interested in those elements
If a half is completely before the k-th position - no need to sort it, we need all those elements but their order doesn't matter
After finishing, you will have the closest k elements in the first k places of the array (though they are not necessarily sorted).
Since at every step you process only one half, time will be n+n/2+n/4+n/8+...=2n (ignoring constants).
For guaranteed O(n), you can always select a good pivot with e.g. median of medians (https://en.wikipedia.org/wiki/Median_of_medians).
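A sketch of that idea in C++, run over an array of precomputed distances (naive last-element pivot for brevity; a production version would pick pivots more carefully, e.g. with median of medians):

#include <utility>
#include <vector>

// Lomuto partition of dist[lo..hi]; returns the pivot's final index.
int partition(std::vector<double>& dist, int lo, int hi) {
    double pivot = dist[hi];  // naive pivot choice
    int i = lo;
    for (int j = lo; j < hi; ++j)
        if (dist[j] < pivot) std::swap(dist[i++], dist[j]);
    std::swap(dist[i], dist[hi]);
    return i;
}

// After quickselect(dist, 0, n - 1, k), the k smallest distances occupy
// dist[0..k-1] (not necessarily sorted). Average running time is O(n).
void quickselect(std::vector<double>& dist, int lo, int hi, int k) {
    while (lo < hi) {
        int p = partition(dist, lo, hi);
        if (p == k) return;       // everything left of index k is now among the k smallest
        if (p < k)  lo = p + 1;   // the k-th position lies in the right part
        else        hi = p - 1;   // the k-th position lies in the left part
    }
}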
Working on the assumption that latitude and longitude have a given number of digits, we can actually use radix sort. It seems similar to Hanqiu's answer, but I'm not sure if it's the same one. The Wikipedia description:
In computer science, radix sort is a non-comparative integer sorting algorithm that sorts data with integer keys by grouping keys by the individual digits which share the same significant position and value. A positional notation is required, but because integers can represent strings of characters (e.g., names or dates) and specially formatted floating point numbers, radix sort is not limited to integers. Radix sort dates back as far as 1887 to the work of Herman Hollerith on tabulating machines.
The article says the following about efficiency:
The topic of the efficiency of radix sort compared to other sorting algorithms is somewhat tricky and subject to quite a lot of misunderstandings. Whether radix sort is equally efficient, less efficient or more efficient than the best comparison-based algorithms depends on the details of the assumptions made. Radix sort complexity is O(wn) for n keys which are integers of word size w. Sometimes w is presented as a constant, which would make radix sort better (for sufficiently large n) than the best comparison-based sorting algorithms, which all perform Θ(n log n) comparisons to sort n keys.
In your case, w corresponds to the word size of your latitude and longitude, that is, the number of digits. In particular, this gets more efficient for lower precision (fewer digits) in your coordinates. Whether it's more efficient than n log n algorithms depends on your n and your implementation. Asymptotically, it's better than n log n.
Obviously, you'd still need to combine the two coordinates into an actual distance.
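If the keys really are fixed-width integers (e.g. distances scaled to a fixed number of decimal places), a least-significant-digit radix sort is easy to write; here is a sketch in C++ for non-negative 32-bit keys, processed one byte at a time (the names and the base are illustrative):

#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// LSD radix sort, base 256: O(w * n) for word size w (here w = 4 bytes).
void radixSort(std::vector<uint32_t>& keys) {
    std::vector<uint32_t> buffer(keys.size());
    for (int shift = 0; shift < 32; shift += 8) {
        std::array<std::size_t, 257> count{};  // counting sort on one byte
        for (uint32_t k : keys) ++count[((k >> shift) & 0xFF) + 1];
        for (int i = 0; i < 256; ++i) count[i + 1] += count[i];
        for (uint32_t k : keys) buffer[count[(k >> shift) & 0xFF]++] = k;  // stable placement
        keys.swap(buffer);
    }
}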
You could also use this algorithm with O(N) complexity, which exploits a "HashMap-like" array which would automatically sort the distances, within a given resolution.
Here's the pseudo-code in Java:
City[] cities = // your city list
Coordinate coor = // the coordinate of interest
double resolution = 0.1, capacity = 1000;
// Java does not allow generic array creation, so create a raw array and cast.
@SuppressWarnings("unchecked")
ArrayList<City>[] cityDistances = (ArrayList<City>[]) new ArrayList[(int)(capacity / resolution)];
ArrayList<City> closestCities = new ArrayList<City>();
for (City c : cities) {
    double distance = coor.getDistance(c);
    int hash = (int)(distance / resolution);  // bucket index: smaller distance, smaller index
    if (cityDistances[hash] == null) cityDistances[hash] = new ArrayList<City>();
    cityDistances[hash].add(c);
}
for (int index = 0; index < cityDistances.length && closestCities.size() < 10; index++) {
    ArrayList<City> cList = cityDistances[index];  // walk buckets from nearest to farthest
    if (cList == null) continue;
    closestCities.addAll(cList);
}
The idea is to loop through the list of cities, calculate the distance with the coordinate of interest, and then use the distance to determine where the city should be added to the "HashMap-like" array cityDistances. The smaller the distance, the closer the index will be to 0.
The smaller the resolution, the more likely it is that the list closestCities will end up with exactly 10 cities after the last loop.
This is a question I came across somewhere.
Given a list of numbers in random order, write a linear-time algorithm to find the k-th smallest number in the list. Explain why your algorithm is linear.
I have searched almost half the web, and what I gathered is that a linear-time algorithm is one whose time complexity is O(n). (I may be wrong somewhere.)
We can solve the above question with different algorithms, e.g.:
Sort the array and select the element at index k-1 [O(n log n)]
Using a min-heap [O(n + k log n)]
etc.
Now the problem is that I couldn't find any algorithm that has O(n) time complexity, i.e. that is actually linear.
What can be the solution for this problem?
This is std::nth_element
From cppreference:
Notes
The algorithm used is typically introselect although other selection algorithms with suitable average-case complexity are allowed.
Given a list of numbers
although it is not compatible with std::list, only std::vector, std::deque and std::array, as it requires RandomAccessIterator.
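A minimal usage sketch (the sample values are made up):

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v{9, 1, 8, 2, 7, 3, 6, 4, 5};
    std::size_t k = 3;  // 0-based: the 4th smallest

    // Average O(n); afterwards v[k] holds the k-th element and everything
    // before it is <= v[k], in unspecified order.
    std::nth_element(v.begin(), v.begin() + k, v.end());
    std::cout << "4th smallest: " << v[k] << '\n';  // prints 4
}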
A linear search remembering the k smallest values is O(n*k), but if k is considered constant then it's O(n) time.
However, if k is not considered constant, then using a histogram leads to O(n + m*log(m)) time and O(m) space complexity, where m is the number of possible distinct values (the range) in your input data. The algorithm is like this:
create histogram counters for each possible value and set them to zero: O(m)
process all data and count the values: O(n)
sort the histogram: O(m*log(m))
pick the k-th element from the histogram: O(1)
In case we are talking about unsigned integers from 0 to m-1, the histogram is computed like this:
int data[n] = { /* your data */ }, cnt[m], i;
for (i = 0; i < m; i++) cnt[i] = 0;      // clear the histogram counters
for (i = 0; i < n; i++) cnt[data[i]]++;  // count how many times each value occurs
However, if your input data values do not satisfy the above condition, you need to remap the range by interpolation or hashing. And if m is huge (or the values contain huge gaps), this is a no-go, as such a histogram either uses buckets (which is not usable for your problem) or needs a list of values, which is no longer linear.
So, putting all this together, your problem is solvable with linear complexity when:
n >= m*log(m)
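Putting the pieces together for the simple 0..m-1 case, here is a sketch (1-based k; in this restricted case the histogram needs no separate sort, because walking the counters in index order already visits the values in increasing order):

#include <stdexcept>
#include <vector>

// k-th smallest (1-based) of values known to lie in [0, m):
// O(n + m) time, O(m) space.
int kthSmallest(const std::vector<int>& data, int m, int k) {
    std::vector<int> cnt(m, 0);
    for (int v : data) ++cnt[v];     // build the histogram
    for (int value = 0; value < m; ++value) {
        k -= cnt[value];             // consume all copies of this value
        if (k <= 0) return value;    // the k-th smallest falls on this value
    }
    throw std::out_of_range("k is larger than the number of elements");
}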
Quicksort gives us a pretty nice O(n log n); however, I was wondering: is there a way to sort an array with unique values that is faster than quicksort?
Here are some of the fastest sorting algorithms and their runtimes:
Mergesort: O(nlogn)
Timsort: O(nlogn)
Heapsort: O(nlogn)
Radix sort: O(nk)
Counting sort: O(n + k)
Regarding sorting algorithms and techniques, @Bilal's answer is quite helpful!
A workaround for the problem could run in O(N*log(N)), but further calculations will be cheaper because the duplicated values are removed.
So the idea is to read the values and insert them into an std::set, which will automatically remove duplicated values; if the duplicates are needed, you can store each value's count while reading the input!
Sample code would be something like this:
#include <iostream>
#include <set>
using namespace std;

const int MAX_VAL = 1000000;   // adjust to the largest possible input value

int n, x;
set<int> st;                   // keeps the distinct values in sorted order
int cnt[MAX_VAL];              // cnt[x] = how many times x appeared

int main() {
    cin >> n;
    for (int i = 1; i <= n; i++) {
        cin >> x;
        cnt[x]++;
        st.insert(x);
    }
    // Rest of your code
}
Without additional assumptions, the lower bound for the worst-case time complexity of any algorithm that uses comparisons is Θ(n log n). Note that sorting a permuted sequence in fact recovers the inverse of the permutation p. This means that if you are able to sort p({1,2,...,n}), then you are able to determine which permutation was applied to your data, out of all possible permutations.
The total number of permutations is n!, and for every information bit acquired your set is partitioned into two sets representing the outcomes consistent with that bit. Therefore you can represent the search for which permutation was used as a binary tree where every node is a set of permutations, the children of a node are the partitions of the parent set, and the leaves are the outcomes of your algorithm.
If your algorithm determines exactly which permutation was used, the leaves are singletons, so you end up with n! leaves. The minimal height of a binary tree with n! leaves is log(n!), which is asymptotically n log(n). http://lcm.csa.iisc.ernet.in/dsa/node208.html is a good reference for all of this.
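For the last asymptotic step: log(n!) = log(1) + log(2) + ... + log(n) <= n*log(n), and keeping only the upper half of the terms gives log(n!) >= (n/2)*log(n/2), so log(n!) = Theta(n log n).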
I have an unsorted array of n positive numbers and a parameter k. I need to find out whether there is a pair of numbers in the array whose difference is less than k, and I need to do so in O(n) expected time and O(n) space.
I believe it requires the use of a universal hash table, but I'm not sure how. Any ideas?
This answer works even on unbounded integers and floats (making some assumptions about the niceness of the hashmap you'll be using; the Java implementation should work, for instance):
Keep a hashmap<int, float> all_divided_values. For each key y, if all_divided_values[y] exists, it will contain a value v from the array such that floor(v/k) = y.
For each value v in the original array A: if v/k is among all_divided_values's keys, output (v, all_divided_values[v/k]) (they are distant by less than k); else, store v in all_divided_values[v/k].
Once all_divided_values is filled, go through A again. For each v, test whether all_divided_values[v/k - 1] exists, and if so, output the pair (v, all_divided_values[v/k - 1]) if and only if abs(v - all_divided_values[v/k - 1]) <= k.
Inserting into a hashmap is usually (with a Java HashMap, for instance) O(1) on average, so the total time is O(n). But please note that technically this could fail to hold, for instance if your language's implementation does not use a good hashing strategy.
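A single-pass C++ sketch of the same bucketing idea (the answer describes two passes; checking the neighbouring bucket as you go is equivalent; the names, the use of double, and the <= k test carried over from the answer are all illustrative):

#include <cmath>
#include <initializer_list>
#include <optional>
#include <unordered_map>
#include <utility>
#include <vector>

// Returns a pair of values whose difference is at most k, if one exists.
// Expected O(n) time and O(n) space, assuming the hash map behaves well.
std::optional<std::pair<double, double>>
pairWithinK(const std::vector<double>& a, double k) {
    std::unordered_map<long long, double> bucket;  // floor(v/k) -> one witness value
    for (double v : a) {
        long long b = static_cast<long long>(std::floor(v / k));
        auto same = bucket.find(b);
        if (same != bucket.end())                  // same bucket: |v - u| < k
            return std::make_pair(v, same->second);
        for (long long nb : {b - 1, b + 1}) {      // neighbouring buckets: check explicitly
            auto it = bucket.find(nb);
            if (it != bucket.end() && std::abs(v - it->second) <= k)
                return std::make_pair(v, it->second);
        }
        bucket[b] = v;
    }
    return std::nullopt;
}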
Simple solution:
1- Sort the array
2- Calculate the difference between consecutive elements
a) If the difference is smaller than k return that pair
b) If no consecutive number difference yields a value smaller than k, then your array has no pair of numbers such that the difference is smaller than k.
Sorting is O(n log n), but if you only have integers of limited size, you can use counting sort, which is O(n).
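A sketch of this simple solution in C++ (after sorting, any closest pair must be adjacent, which is why step 2 only has to look at consecutive elements):

#include <algorithm>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Sort, then compare neighbours: O(n log n) overall, dominated by the sort.
std::optional<std::pair<int, int>>
pairCloserThanK(std::vector<int> a, int k) {
    std::sort(a.begin(), a.end());
    for (std::size_t i = 1; i < a.size(); ++i)
        if (a[i] - a[i - 1] < k)
            return std::make_pair(a[i - 1], a[i]);
    return std::nullopt;
}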
You can think of it this way.
The problem can be modeled like this:
Consider each element (assuming integers) and convert it to an interval (A[i]-K, A[i]+K).
Now you want to check whether any two of the intervals overlap.
The interval intersection problem without any sortedness is not solvable in O(n) (worst case). You need to sort them, and then in O(n) you can check whether they intersect.
Same goes for your logic. Sort it and find it.
So the question is like this:
Given a location X and an array of locations, I want to get an array of locations that are closest to location X, in other words sorted by closest distance.
The way I solved this is by iterating through the location array, calculating the distance between X and each particular location, storing that distance, and sorting the locations by distance using a Comparator. Done! Is there a better way to do this? Assuming that the sort is a merge sort, it should be O(n log n).
If I understand this right, you can do this pretty quickly for multiple queries - as in, given several values of X, you wouldn't have to re-sort your solution array every time. Here's how you do it:
Sort the array initially (O(n log n) - call this pre-processing)
Now, on every query X, binary search the array for X (or the closest number smaller than X). Maintain two indices, i and j: one pointing to that location and one to the next. One of these is clearly the closest number to X in the list. Pick the one with the smaller distance and put it in your solution array. Then, if i was picked, decrement i; if j was picked, increment j. Repeat this process until all the numbers are in the solution array.
This takes O(n + log n) per query, with O(n log n) preprocessing. Of course, if we were talking about just one X, this is not better at all.
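A sketch of the per-query step in C++, treating locations as one-dimensional values the way the answer implicitly does (the names are illustrative):

#include <algorithm>
#include <vector>

// `sorted` must already be sorted (the O(n log n) preprocessing).
// Returns all locations ordered by distance to x: O(log n) for the binary
// search plus O(n) for the merge-like expansion.
std::vector<double> byDistance(const std::vector<double>& sorted, double x) {
    std::vector<double> out;
    long long n = static_cast<long long>(sorted.size());
    long long j = std::lower_bound(sorted.begin(), sorted.end(), x) - sorted.begin();
    long long i = j - 1;  // i walks left of x, j walks right of (or at) x
    while (i >= 0 || j < n) {
        bool takeLeft = (j >= n) || (i >= 0 && x - sorted[i] <= sorted[j] - x);
        if (takeLeft) out.push_back(sorted[i--]);
        else          out.push_back(sorted[j++]);
    }
    return out;
}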
The problem you describe sounds like an m-nearest-neighbor search to me.
So, if I understood your question correctly, i.e. a location is a vector in a multidimensional metric space and the distance is a proper metric in this space, then it would be nice to put the array of locations into a k-d tree.
You pay the overhead of building the tree once, but then each search is O(log n).
A benefit of this: assuming you are just interested in the m < n closest locations, you don't need to evaluate all n distances every time you search for a new X.
You should try using a min-heap data structure to implement this. Just keep storing the locations in the heap with key = distance between X and that location.
You can't do asymptotically better than O(n log n) if using a comparison-based sort. If you want to talk about micro-optimization of the code, though, some ideas include...
Sort by squared distance; no reason to ever use sqrt() - sqrt() is expensive
Only compute squared distance if necessary; if |dx1| <= |dx2| and |dy1| <= |dy2|, pt1 is closer than pt2 - integer multiplication is fast, but avoiding it in many cases may be somewhat faster
A thinking-outside-the-box solution might be to use e.g. bucket sort, a linear-time sorting algorithm that might be applicable here.
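Returning to the first suggestion above (sorting by squared distance so sqrt() is never called), a small illustrative sketch; Pt and the names are placeholders:

#include <algorithm>
#include <vector>

struct Pt { double x, y; };

// Squared distance induces the same order as true distance, so sqrt() is unnecessary.
void sortByDistance(std::vector<Pt>& pts, Pt q) {
    auto sq = [q](const Pt& p) {
        double dx = p.x - q.x, dy = p.y - q.y;
        return dx * dx + dy * dy;
    };
    std::sort(pts.begin(), pts.end(),
              [&](const Pt& a, const Pt& b) { return sq(a) < sq(b); });
}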
Proof by contradiction that you can't do better:
It is well known that comparison-based sorts are the only way to sort arbitrary numbers (which may include irrational numbers), and that they can't do better than n*log(n) time.
If you go through the list in O(n) time and select the smallest number, then use that as X, and somehow come up with a list of numbers that are sorted by distance to X, then you have sorted n numbers in less than O(n*log(n)) time.
Given an unsorted integer array, and without making any assumptions on the numbers in the array: is it possible to find two numbers whose difference is minimum in O(n) time?
Edit: Difference between two numbers a, b is defined as abs(a-b)
Find the smallest and largest elements in the list. The difference smallest-largest will be the minimum.
If you're looking for the nonnegative difference, then this is of course at least as hard as checking whether the array has two equal elements. This is called the element uniqueness problem, and without any additional assumptions (like limiting the size of the integers, or allowing operations other than comparison) it requires >= n log n time. It is the 1-dimensional case of finding the closest pair of points.
I don't think you can do it in O(n). The best I can come up with off the top of my head is to sort them (which is O(n * log n)) and find the minimum difference of adjacent pairs in the sorted list (which adds another O(n)).
I think it is possible. The secret is that you don't actually have to sort the list, you just need to create a tally of which numbers exist. This may count as "making an assumption" from an algorithmic perspective, but not from a practical perspective. We know the ints are bounded by a min and a max.
So, create an array of 2-bit elements, one pair of bits for each int from INT_MIN to INT_MAX inclusive, and set all of them to 00.
Iterate through the entire list of numbers. For each number in the list, if the corresponding 2 bits are 00, set them to 01. If they're 01, set them to 10. Otherwise ignore them. This is obviously O(n).
Next, if any of the 2-bit counters is set to 10, that is your answer: the minimum distance is 0 because the list contains a repeated number. If not, scan through the tally array and find the minimum distance. Many people have already pointed out there are simple O(n) algorithms for this.
So O(n) + O(n) = O(n).
Edit: responding to comments.
Interesting points. I think you could achieve the same results without making any assumptions by finding the min/max of the list first and using a sparse array ranging from min to max to hold the data. Takes care of the INT_MIN/MAX assumption, the space complexity and the O(m) time complexity of scanning the array.
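A sketch of that min/max-range variant in C++ (it assumes at least two elements and a value range small enough to tally; one presence bit per value is enough here, since we can return 0 as soon as a repeat is seen):

#include <algorithm>
#include <vector>

// Minimum absolute difference via a presence tally over [min, max]:
// O(n + m) time and O(m) space with m = max - min, so only practical
// when the value range is modest.
long long minDifference(const std::vector<int>& a) {
    auto [mnIt, mxIt] = std::minmax_element(a.begin(), a.end());
    long long mn = *mnIt, m = static_cast<long long>(*mxIt) - mn;
    std::vector<unsigned char> seen(m + 1, 0);
    for (int v : a) {
        if (seen[v - mn]) return 0;  // repeated value: difference 0
        seen[v - mn] = 1;
    }
    long long best = m, prev = -1;
    for (long long i = 0; i <= m; ++i) {
        if (!seen[i]) continue;      // only look at values that actually occur
        if (prev >= 0) best = std::min(best, i - prev);
        prev = i;
    }
    return best;
}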
The best I can think of is to counting sort the array (possibly combining equal values) and then do the sorted comparisons -- bin sort is O(n + M) (M being the number of distinct values). This has a heavy memory requirement, however. Some form of bucket or radix sort would be intermediate in time and more efficient in space.
Sort the list with radixsort (which is O(n) for integers), then iterate and keep track of the smallest distance so far.
(I assume your integer is a fixed-bit type. If they can hold arbitrarily large mathematical integers, radixsort will be O(n log n) as well.)
It seems to be possible to sort an unbounded set of integers in O(n*sqrt(log(log(n)))) time. After sorting, it is of course trivial to find the minimal difference in linear time.
But I can't think of any algorithm to make it faster than this.
No, not without making assumptions about the numbers/ordering.
It would be possible given a sorted list though.
I think the answer is no, and the proof is similar to the proof that you cannot sort faster than n lg n: you have to compare all of the elements, i.e. create a comparison tree, which implies an Omega(n lg n) algorithm.
EDIT. OK, if you really want to argue, then the question does not say whether it should be a Turing machine or not. With quantum computers, you can do it in linear time :)