Best data structure for nearest neighbour in 1 dimension - algorithm

I have a list of values (1-dimensional) and I would like to know the best data structure / algorithm for finding the value nearest to a query value. Most of the solutions (all?) I found for questions here are for 2 or more dimensions. Can anybody suggest an approach for my case?
My instinct tells me to sort the data and use binary search somehow. By the way, there is no limit on the construction or insertion time for any tree needed, so probably someone can suggest a better tree than simply a sorted list.

If you need something faster than O(log(n)), which you can easily get with a sorted array or a binary search tree, you can use a van Emde Boas tree (it requires integer keys from a bounded universe). vEB trees give you O(log(log(M))) predecessor/successor queries, i.e. the closest element on either side, where M is the size of the key universe.

If insertion time is irrelevant, then binary search on a sorted array is the simplest way to achieve O(log N) query time. Each time an item is added, re-sort everything. For each query, perform a binary search. If a match is found, return it. Otherwise, the binary search gives you the index at which the item would have been inserted; use this index to check the two neighbouring items and determine which of them is closer to the query point.
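For illustration, here is a minimal Python sketch of that query step (the function name and the use of the standard bisect module are my own choices, not from the original answer):

import bisect

def nearest(sorted_list, query):
    # sorted_list must already be sorted in ascending order
    i = bisect.bisect_left(sorted_list, query)
    if i == 0:
        return sorted_list[0]
    if i == len(sorted_list):
        return sorted_list[-1]
    left, right = sorted_list[i - 1], sorted_list[i]
    # pick whichever neighbour of the insertion point is closer
    return left if query - left <= right - query else right

print(nearest([1, 4, 9, 16], 11))  # -> 9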
I suppose that there are solutions with O(1) time. I will try to think of one that doesn't involve too much memory usage...

As you already mentioned, the fastest and easiest way should be sorting the data and then looking for the left and right neighbour of a data point.

Sort the list and use binary search to find the element you are looking for, then compare your left and right neighbours. You can use an array, which gives O(1) access.
Something like:
int nearest(int[] list, int element) {
    Arrays.sort(list); // move this out if you run many queries against the same list
    int idx = Arrays.binarySearch(list, element);
    if (idx >= 0)
        return list[idx];                 // exact match
    int ins = -idx - 1;                   // insertion point returned by binarySearch
    // make sure you are accessing elements that exist
    if (ins == 0)
        return list[0];
    if (ins == list.length)
        return list[list.length - 1];
    return (element - list[ins - 1] <= list[ins] - element) ? list[ins - 1] : list[ins];
}
This is O(n log n) because of the sort, which is amortized if you are going to perform many lookups on the same list.
EDIT: for that you'd have to move the sorting out of this method.

Using OCaml's Set:
module S = Set.Make(struct type t = int let compare = compare end)

let nearest xs y =
  let a, yisin, b = S.split y xs in
  if yisin then y
  else
    let amax, bmin = S.max_elt a, S.min_elt b in
    if abs (amax - y) < abs (bmin - y) then amax else bmin
Incidentally, you may appreciate my nth-nearest neighbor sample from OCaml for Scientists and The F#.NET Journal article Traversing networks: nth-nearest neighbors.

Related

Best case of fractional knapsack

The worst-case running time of fractional knapsack is O(n); then what should its best case be? Is it O(1), because if the weight limit is 16 and the very first item already fills it, you can stop right away? Is that right?
True, if you assume that the input is given already sorted (by value per unit of weight)!
But as per the definition, the algorithm is expected to handle unsorted input too; see this.
If you are considering normal input that may or may not be sorted, then there are two approaches to solving the problem:
Sort the input, which cannot take less than O(n) even in the best case, and even that only if you use bubble/insertion sort - which looks like a poor choice, since both of those algorithms have O(n^2) average/worst-case performance.
Use the weighted-median approach. That will cost you O(n), since finding the weighted median takes O(n). The code for this approach is given below.
Weighted median approach for fractional knapsack:
We will work with value per unit of weight in the following code. The code first finds the middle ratio (i.e. the ratio that would sit in the middle if the items were sorted by value per unit of weight) and places it at its correct position, using the quicksort partition method. Once we have this middle element (call it mid), the following two cases need to be considered:
When the sum of the weights of all items to the right of mid is more than W, we need to search for our answer in the right side of mid.
Otherwise, take every item to the right of mid (total weight w_right, total value v_right) and search for the remaining capacity W - w_right in the left side of mid (including mid itself); the answer is v_right plus the result of that search.
Following is an implementation in Python (use floating-point numbers everywhere):
Please note that I am not providing production-level code, and there are cases in which it will fail. Think about what can cause a worst case or a failure when finding the k-th max in an array (maybe when all values are the same).
def partition(weights, values, start, end):
    # Quicksort (Lomuto) partition on the value-per-unit-weight ratio.
    x = values[end] / weights[end]
    i = start
    for j in range(start, end):
        if values[j] / weights[j] < x:
            values[i], values[j] = values[j], values[i]
            weights[i], weights[j] = weights[j], weights[i]
            i += 1
    values[i], values[end] = values[end], values[i]
    weights[i], weights[end] = weights[end], weights[i]
    return i

def _find_kth(weights, values, start, end, k):
    # Quickselect: put the k-th smallest ratio within start..end at its place.
    ind = partition(weights, values, start, end)
    if ind - start == k - 1:
        return ind
    if ind - start > k - 1:
        return _find_kth(weights, values, start, ind - 1, k)
    return _find_kth(weights, values, ind + 1, end, k - (ind - start) - 1)

def find_kth(weights, values, k):
    return _find_kth(weights, values, 0, len(weights) - 1, k)

def fractional_knapsack(weights, values, w):
    if w == 0 or len(weights) == 0:
        return 0
    if len(weights) == 1:
        # take the whole item if it fits, otherwise a fraction of it
        return values[0] if weights[0] <= w else w * (values[0] / weights[0])
    mid = find_kth(weights, values, len(weights) // 2)
    w1 = sum(weights[mid + 1:])   # total weight of the items with a better ratio than mid
    v1 = sum(values[mid + 1:])    # total value of those items
    if w1 > w:
        return fractional_knapsack(weights[mid + 1:], values[mid + 1:], w)
    return v1 + fractional_knapsack(weights[:mid + 1], values[:mid + 1], w - w1)
(Editing and rewriting the answer after discussion with #Shasha99, since I feel the answers before 2016-12-06 are a bit misleading)
Summary
O(1) best case is possible if the items are already sorted. Otherwise best case is O(n).
Discussion
If the items are not sorted, you need to find the best item (for the case where one item already fills the knapsack), and that alone will take O(n), since you have to check all of them. Therefore, best case O(n).
On the opposite end, you could have a knapsack where all the items fit. Searching for best would not be needed, but you need to put all of them in, so it's still O(n).
More analysis
Funnily enough, an O(n) worst case does not require the items to be sorted.
Apparently the idea is from http://algo2.iti.kit.edu/sanders/courses/algdat03/sol12.pdf, paired with a fast median-selection algorithm (weighted medians, or maybe median of medians?). Thanks to #Shasha99 for finding this algorithm.
Note that plain quickselect is O(n) expected but O(n^2) worst case; if you use median-of-medians, selection becomes O(n) worst case. The downside is a rather complicated algorithm.
I'd be interested in a working implementation of any algorithm. More sources to (hopefully simple) algorithms also wouldn't hurt.
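For reference, a minimal Python sketch of the median-of-medians selection mentioned above (plain value selection, not the weighted variant; all names are my own):

def select(a, k):
    # Return the k-th smallest element of a (0-indexed) in O(n) worst case.
    if len(a) <= 5:
        return sorted(a)[k]
    # Median of each group of 5, then recurse on the medians to pick the pivot.
    groups = [sorted(a[i:i + 5]) for i in range(0, len(a), 5)]
    medians = [g[len(g) // 2] for g in groups]
    pivot = select(medians, len(medians) // 2)
    lows = [x for x in a if x < pivot]
    highs = [x for x in a if x > pivot]
    pivots = len(a) - len(lows) - len(highs)
    if k < len(lows):
        return select(lows, k)
    if k < len(lows) + pivots:
        return pivot
    return select(highs, k - len(lows) - pivots)

print(select([7, 1, 9, 4, 3, 8, 2, 6, 5], 4))  # -> 5 (the median)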

What is the running time of this recursive algorithm?

Search_item(A, item, c)
    index <- c
    If (A[index] > item)
    Then
        index <- 2 * index
        Search_item(A, item, index)
    Else
        If (A[index] < item)
        Then
            index <- 2 * index + 1
            Search_item(A, item, index)
        Else
            Return index
You gave very little (no) information. This is very sad, but I will give an answer and hope you will show more effort in other questions, as I show effort in answering your question.
The algorithm you posted seems to be incomplete. I think there are some Returns missing. However:
The simplest complexity analysis is this:
You work on an array A. Let n be the number of elements in A. In every recursive call you are either multiplying the current index by two (and make another recursive call) or you return the current index.
Assuming your algorithm is correct, you can have only k recursive calls with 2^k < n. So k < log₂(n) holds.
This means your algorithm has a time-complexity of O(log n).
Since your algorithm is named a "search" algorithm, it looks like the array A is an implicit representation of a binary search tree and your algorithm is a recursive binary-tree search.
This fits the complexity I calculated.
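For completeness, a hedged Python sketch of what the completed routine might look like, with the missing returns added; it assumes the array stores a binary search tree in the implicit 1-based layout where node i's children sit at 2i and 2i+1 (an assumption, since the question doesn't say):

def search_item(a, item, index=1):
    # a is 1-based: a[0] is unused; children of node i sit at 2*i and 2*i + 1.
    if index >= len(a) or a[index] is None:
        return -1                                     # fell off the tree: not present
    if a[index] > item:
        return search_item(a, item, 2 * index)        # go to the left child
    if a[index] < item:
        return search_item(a, item, 2 * index + 1)    # go to the right child
    return index                                      # found it

# Implicit BST for the keys 1..7; a[0] is a placeholder.
tree = [None, 4, 2, 6, 1, 3, 5, 7]
print(search_item(tree, 5))   # -> 6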

Create Binary Tree from Ancestor Matrix

The question is how to create a binary tree, given its ancestor matrix. I found a cool solution at http://www.ritambhara.in/build-binary-tree-from-ancestor-matrics/. Problem is that it involves deleting rows and columns from the matrix. Now how do I do that? Can anybody suggest a pseudocode for this? Or, is there any better algo possible?
You don't have to actually delete the rows and columns. You can either flag them as deleted in some additional array, or you can make them all zeros, which I think will be effectively the same (actually, you'll still need to know that they are removed, so you don't choose them again in step 4.c - so, flagging the node as deleted should be good enough).
Here are the modifications to the pseudocode from the page:
4.b.
    used[temp] = true;
    for (i = 0 to N)
        Sum[i] -= matrix[i][temp];   (aka decrement Sum[i] if temp is a predecessor of i)
        matrix[i][temp] = 0;
4.c. Look for all rows for which Sum[i] == 0 and used[i] == false.
This reminds me of the Dancing Links technique used by Donald Knuth to implement his Algorithm X.
It's basically a structure of circular doubly linked lists. You could maintain a separate Sum array and update it as rows and columns are removed. Actually, you don't even need to maintain a separate Sum array.
Edit:
I meant -
You could use a structure made up of circular 2D linked lists.
The node structure would somewhat look like:
struct node {
    int val;
    struct node *left;
    struct node *right;
    struct node *down;
};
The top-most and left-most lists are the header lists for the vertices (the binary-tree node values).
If vertex j is an ancestor of vertex i, build a new (empty) node such that column j's current down pointer is assigned this new node and row i's current left pointer is assigned this new node. Note: the structure can easily be built by scanning each row of the ancestor matrix from left to right and inserting rows from 0 to N (assuming N is the number of vertices here).
I borrowed these images from Image1 and Image2 to give an idea of the grid. 2nd image is missing the Left-most header though.
If N is the number of vertices, there can be at worst O(N^2) entries in the ancestor matrix (in case the tree is skewed), or on average O(N log N) entries.
To search for the current root: O(N)
Assuming a dummy node to start with, linearly scan the leftmost header list and choose a node with node->down->right == node->down.
To delete this vertex's information: O(N)
Deleting its row: O(1)
node->down = node->down->down;
Deleting its column: O(N)
Go to the corresponding column header, say p:
node* q = p;
while (q->down != p) {
    q->down->left->right = q->down->right;
    q->down->right->left = q->down->left;
    q = q->down;
}
After discovering the current root you can assign it to its parent node and insert it into a queue to process the next level, as the linked page suggests.
Overall time complexity: N + (N-1) + (N-2) +.... = O(N^2).
Worst case space complexity O(N^2)
Though there is no big improvement in the asymptotic run-time over the solution you already have, I thought it was worth mentioning since this kind of structure can be particularly useful for storing sparse matrices and defining operations like multiplication on them, or if you are working with a backtracking algorithm that removes a row/column and later backtracks and adds it again, like Knuth's Algorithm X.
You don't have to update the matrix. Just decrement the values in the Sum array for any descendants of the current node, and check whether any of them reaches zero, which means the current node is that descendant's last remaining ancestor, i.e. its direct parent:
for (i = 0 to N)
    if matrix[i][temp] == 1:
        Sum[i] = Sum[i] - 1
        if Sum[i] == 0:
            add i as child of temp
            add i to queue
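To make this concrete, here is a hedged Python sketch of the Sum-array approach (the Node class, the build_tree name, and the convention that anc[i][j] == 1 means j is an ancestor of i are my own choices; transpose the matrix if yours uses the opposite convention):

from collections import deque

class Node:
    def __init__(self, val):
        self.val = val
        self.left = None
        self.right = None

def build_tree(anc):
    n = len(anc)
    nodes = [Node(i) for i in range(n)]
    remaining = [sum(row) for row in anc]   # how many ancestors of i are still unprocessed
    root = remaining.index(0)               # the root is the node with no ancestors
    queue = deque([root])
    while queue:
        parent = queue.popleft()
        for i in range(n):
            if anc[i][parent] == 1:
                remaining[i] -= 1
                if remaining[i] == 0:       # parent was i's last ancestor, so its direct parent
                    if nodes[parent].left is None:
                        nodes[parent].left = nodes[i]
                    else:
                        nodes[parent].right = nodes[i]
                    queue.append(i)
    return nodes[root]

# Node 0 is the root and an ancestor of nodes 1 and 2.
anc = [[0, 0, 0],
       [1, 0, 0],
       [1, 0, 0]]
root = build_tree(anc)
print(root.val, root.left.val, root.right.val)  # -> 0 1 2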

Sorting algorithm for sorted vectors of "moving" values

This question is related to Which sort algorithm works best on mostly sorted data?
The difference is that I have another very important restriction: the values change by small amounts after every sort.
This means that the vector stays almost sorted and the displaced values are nearly in position. After running some tests, it seems the same answer applies to my case.
Do you know other algorithms that may be better in this case?
Consider timsort or smoothsort. These are designed with mostly-sorted data in mind.
If frequent updates are the case, maintaining an index structure (e.g. a binary search tree) is perhaps a better choice than sorting the vector over and over again.
Insertion sort and bubble sort both have linear best-case complexity on input that is already sorted (which is optimal here, since the values are continuously changing and you have to look at each element of the input vector anyway), and they are stable (which seems to be a useful property given your problem description).
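For illustration, a minimal Python sketch of insertion sort (my own code, not from the original answer); the inner while loop does almost no work when each element is already at or near its final position, which is what makes it linear on nearly sorted input:

def insertion_sort(a):
    # Adaptive: runs in O(n) when a is already (or almost) sorted.
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]     # shift larger elements one slot to the right
            j -= 1
        a[j + 1] = key
    return a

print(insertion_sort([1, 2, 4, 3, 5, 7, 6]))  # -> [1, 2, 3, 4, 5, 6, 7]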
Compare all the pairs a[i] <= a[i+1]. If this is false, move the second element to a new array.
Sort the new array (mergesort, heapsort or any other O(n log n) algorithm), and merge the new and old arrays again.
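A hedged Python sketch of that suggestion (the function name is mine): keep the elements that are already in non-decreasing order, sort the displaced ones, then merge the two sorted sequences:

import heapq

def resort(a):
    # Pull out elements that break the ascending order, sort that (small)
    # list, then merge it back with the remaining (already sorted) elements.
    out_of_place, kept = [], []
    for x in a:
        if kept and x < kept[-1]:
            out_of_place.append(x)
        else:
            kept.append(x)
    out_of_place.sort()
    return list(heapq.merge(kept, out_of_place))

print(resort([1, 3, 2, 4, 6, 5, 7]))  # -> [1, 2, 3, 4, 5, 6, 7]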
How about this:
Def CheckedMergeSort(L)
    Count = 0
    S(1) = 0
    For I in 2 to |L|
        If (L(I-1) < L(I))
            Count = Count + 1
        S(I) = Count

    Def MergeSort(A, B)
        If (A != B and S(B)-S(A) != B-A)
            C = (B + A) / 2
            MergeSort(A, C)
            MergeSort(C+1, B)
            InplaceMerge(L(A..C), L(C+1..B))

    MergeSort(1, |L|)
A linear-time pre-pass over the input fills in S(i), which keeps track of how many adjacent pairs up to position i are in sorted order.
Then, by subtracting two bounds S(j) - S(i) and comparing the result to j - i, we can determine whether any subsequence L(i..j) is in sorted order.
The merge sort can then skip, in constant time, any already-sorted subranges it encounters during its recursion.
(For example, if the array is already sorted on entry, then MergeSort(1, |L|) becomes a no-op.)
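A hedged, runnable Python version of this sketch, assuming a list input and using an out-of-place merge instead of InplaceMerge (all names are mine):

def checked_merge_sort(L):
    # S[i] counts the adjacent pairs L[j-1] <= L[j] for j <= i,
    # so L[a..b] is already sorted exactly when S[b] - S[a] == b - a.
    n = len(L)
    S = [0] * n
    for i in range(1, n):
        S[i] = S[i - 1] + (1 if L[i - 1] <= L[i] else 0)

    def merge(x, y):
        out, i, j = [], 0, 0
        while i < len(x) and j < len(y):
            if x[i] <= y[j]:
                out.append(x[i]); i += 1
            else:
                out.append(y[j]); j += 1
        return out + x[i:] + y[j:]

    def merge_sort(a, b):
        if a >= b or S[b] - S[a] == b - a:
            return                      # empty, single, or already-sorted range
        c = (a + b) // 2
        merge_sort(a, c)
        merge_sort(c + 1, b)
        L[a:b + 1] = merge(L[a:c + 1], L[c + 1:b + 1])

    if n > 0:
        merge_sort(0, n - 1)
    return L

print(checked_merge_sort([1, 2, 4, 3, 5, 6]))  # -> [1, 2, 3, 4, 5, 6]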

Find median value from a growing set

I came across an interesting algorithm question in an interview. I gave my answer but not sure whether there is any better idea. So I welcome everyone to write something about his/her ideas.
You have an empty set. Now elements are put into the set one by one. We assume all the elements are integers and they are distinct (according to the definition of set, we don't consider two elements with the same value).
Every time a new element is added to the set, the set's median value is asked for. The median value is defined the same way as in math: the middle element of the sorted list. Here, specifically, when the size of the set is even, say 2*x, the median is the x-th element of the set.
An example:
Start with an empty set,
when 12 is added, the median is 12,
when 7 is added, the median is 7,
when 8 is added, the median is 8,
when 11 is added, the median is 8,
when 5 is added, the median is 8,
when 16 is added, the median is 8,
...
Notice that, first, elements are added to the set one by one and, second, we don't know in advance which elements will be added.
My answer.
Since it is a question about finding the median, sorting is needed. The easiest solution is to use a normal array and keep it sorted. When a new element comes, use binary search to find its position, O(log n), and insert it there. Since it is a plain array, shifting the rest of the elements is needed, which takes O(n). Once the element is inserted, the median can be read off immediately, in constant time.
The WORST-case time per insertion is therefore O(log n) + O(n) + O(1) = O(n).
Another solution is to use a linked list. The reason for using a linked list is to remove the need for shifting. But finding the location of the new element requires a linear search. Inserting the element then takes constant time, and afterwards we find the median by walking through half of the list, which always takes n/2 steps.
The WORST-case time is therefore O(n) + O(1) + O(n/2) = O(n).
The third solution is to use a binary search tree. Using a tree, we avoid shifting the array. But using a plain binary search tree to find the median is not very attractive, so I change the binary search tree so that the left subtree and the right subtree are always balanced against each other: at any time, either both subtrees have the same number of nodes or the right subtree has one node more than the left. In other words, it is ensured that at any time the root element is the median. Of course this requires changes to the way the tree is maintained; the technical details are similar to rotating a red-black tree.
If the tree is maintained properly, it is ensured that the WORST-case time per insertion is O(n).
So the three algorithms are all linear in the size of the set. If no sub-linear algorithm exists, the three can be considered optimal. Since they don't differ much from each other, the best is the one that is easiest to implement, which is the second one, using a linked list.
So what I really wonder is, will there be a sub-linear algorithm for this problem and if so what will it be like. Any ideas guys?
Steve.
Your complexity analysis is confusing. Let's say that n items total are added; we want to output the stream of n medians (where the ith in the stream is the median of the first i items) efficiently.
I believe this can be done in O(n*lg n) time using two priority queues (e.g. binary or Fibonacci heaps): one queue for the items below the current median (so the largest element is at the top), and the other for items above it (in this heap, the smallest is at the top). Note that in Fibonacci (and some other) heaps, insertion is O(1) amortized; it's only popping an element that's O(lg n).
This would be called an "online median selection" algorithm, although Wikipedia only talks about online min/max selection. Here's an approximate algorithm, and a lower bound on deterministic and approximate online median selection (a lower bound means no faster algorithm is possible!)
If there are a small number of possible values compared to n, you can probably break the comparison-based lower bound just like you can for sorting.
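For illustration, a minimal Python sketch of this two-heap idea (all names are mine; it uses heapq with negated values for the max-heap, and reports the lower of the two middle elements when the count is even, matching the question's definition):

import heapq

def running_medians(stream):
    # lo is a max-heap (values negated) holding the smaller half,
    # hi is a min-heap holding the larger half; len(lo) is always equal to
    # len(hi) or one larger, so the current median sits on top of lo.
    lo, hi = [], []
    for x in stream:
        if lo and x > -lo[0]:
            heapq.heappush(hi, x)
        else:
            heapq.heappush(lo, -x)
        # rebalance so the size invariant keeps holding
        if len(lo) > len(hi) + 1:
            heapq.heappush(hi, -heapq.heappop(lo))
        elif len(hi) > len(lo):
            heapq.heappush(lo, -heapq.heappop(hi))
        yield -lo[0]

print(list(running_medians([12, 7, 8, 11, 5, 16])))  # -> [12, 7, 8, 8, 8, 8]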
I received the same interview question and came up with the two-heap solution in wrang-wrang's post. As he says, the time per operation is O(log n) worst-case. The expected time is also O(log n) because you have to "pop an element" 1/4 of the time assuming random inputs.
I subsequently thought about it further and figured out how to get constant expected time; indeed, the expected number of comparisons per element becomes 2+o(1). You can see my writeup at http://denenberg.com/omf.pdf .
BTW, the solutions discussed here all require space O(n), since you must save all the elements. A completely different approach, requiring only O(log n) space, gives you an approximation to the median (not the exact median). Sorry I can't post a link (I'm limited to one link per post) but my paper has pointers.
Although wrang-wrang already answered, I wish to describe a modification of your binary search tree method that is sub-linear.
We use a binary search tree that is balanced (AVL/Red-Black/etc), but not super-balanced like you described. So adding an item is O(log n)
One modification to the tree: for every node we also store the number of nodes in its subtree. This doesn't change the complexity. (For a leaf this count would be 1, for a node with two leaf children this would be 3, etc)
We can now access the Kth smallest element in O(log n) using these counts:
def get_kth_item(subtree, k):
    left_size = 0 if subtree.left is None else subtree.left.size
    if k < left_size:
        return get_kth_item(subtree.left, k)
    elif k == left_size:
        return subtree.value
    else:  # k > left_size
        return get_kth_item(subtree.right, k - 1 - left_size)
A median is a special case of Kth smallest element (given that you know the size of the set).
So all in all this is another O(log n) solution.
We can define a min-heap and a max-heap to store the numbers. Additionally, we define a class DynamicArray for the number set, with two functions: Insert and GetMedian. The time to insert a new number is O(log n), while the time to get the median is O(1).
This solution is implemented in C++ as the following:
#include <vector>
#include <algorithm>
#include <functional>
#include <stdexcept>
using namespace std;

template<typename T> class DynamicArray
{
public:
    void Insert(T num)
    {
        if (((minHeap.size() + maxHeap.size()) & 1) == 0)
        {
            // even total count: the new element ends up in minHeap (the upper half)
            if (maxHeap.size() > 0 && num < maxHeap[0])
            {
                maxHeap.push_back(num);
                push_heap(maxHeap.begin(), maxHeap.end(), less<T>());
                num = maxHeap[0];
                pop_heap(maxHeap.begin(), maxHeap.end(), less<T>());
                maxHeap.pop_back();
            }
            minHeap.push_back(num);
            push_heap(minHeap.begin(), minHeap.end(), greater<T>());
        }
        else
        {
            // odd total count: the new element ends up in maxHeap (the lower half)
            if (minHeap.size() > 0 && minHeap[0] < num)
            {
                minHeap.push_back(num);
                push_heap(minHeap.begin(), minHeap.end(), greater<T>());
                num = minHeap[0];
                pop_heap(minHeap.begin(), minHeap.end(), greater<T>());
                minHeap.pop_back();
            }
            maxHeap.push_back(num);
            push_heap(maxHeap.begin(), maxHeap.end(), less<T>());
        }
    }

    T GetMedian()
    {
        size_t size = minHeap.size() + maxHeap.size();
        if (size == 0)
            throw runtime_error("No numbers are available");

        T median = 0;
        if ((size & 1) == 1)
            median = minHeap[0];
        else
            median = (minHeap[0] + maxHeap[0]) / 2;
        return median;
    }

private:
    vector<T> minHeap;   // larger half; smallest of them at minHeap[0]
    vector<T> maxHeap;   // smaller half; largest of them at maxHeap[0]
};
For more detailed analysis, please refer to my blog: http://codercareer.blogspot.com/2012/01/no-30-median-in-stream.html.
1) As with the previous suggestions, keep two heaps and cache their respective sizes. The left heap keeps values below the median, the right heap keeps values above the median. If you simply negate the values in the right heap the smallest value will be at the root so there is no need to create a special data structure.
2) When you add a new number, you determine the new median from the size of your two heaps, the current median, and the two roots of the L&R heaps, which just takes constant time.
3) Call a private threaded method to perform the actual insert-and-update work, but return immediately with the new median value. You only need to block until the heap roots are updated. Then the thread doing the insert just needs to maintain a lock on the grandparent node it is traversing as it walks down the tree; this will ensure that you can insert and rebalance without blocking other inserting threads working on other sub-branches.
Getting the median becomes a constant-time procedure; of course, you may now have to wait on synchronization from further adds.
Rob
A balanced tree (e.g. a red-black tree) with an augmented size field can find the median in O(lg n) time in the worst case. I think it is in Chapter 14 of the classic algorithms textbook.
To keep the explanation brief, you can efficiently augment a BST to select a key of a specified rank in O(h) by having each node store the number of nodes in its left subtree. If you can guarantee that the tree is balanced, you can reduce this to O(log(n)). Consider using an AVL which is height-balanced (or red-black tree which is roughly balanced), then you can select any key in O(log(n)). When you insert or delete a node into the AVL you can increment or decrement a variable that keeps track of the total number of nodes in the tree to determine the rank of the median which you can then select in O(log(n)).
In order to find the median in linear time you can try this (it just came to my mind). You need to store some extra values every time you add a number to your set, and you won't need sorting. Here it goes.
#include <limits.h>

#define VERY_BIG_NUMBER INT_MAX   /* sentinel meaning "median not found yet" */

typedef struct
{
    int number;
    int lesser;    /* how many stored numbers are smaller than this one */
    int greater;   /* how many stored numbers are larger than this one  */
} record;

/* numbers[] already holds `count` records; n is the value being added. */
int median(record numbers[], int count, int n)
{
    int i;
    int m = VERY_BIG_NUMBER;
    int a = 0, b = 0;

    numbers[count].number = n;
    numbers[count].lesser = 0;
    numbers[count].greater = 0;

    /* update the counters of every old record and of the new one */
    for (i = 0; i < count; i++)
    {
        if (n < numbers[i].number)
        {
            numbers[i].lesser++;
            numbers[count].greater++;
        }
        else
        {
            numbers[i].greater++;
            numbers[count].lesser++;
        }
    }

    /* the median is the record with as many lesser as greater numbers */
    for (i = 0; i <= count; i++)
        if (numbers[i].greater - numbers[i].lesser == 0)
            m = numbers[i].number;

    if (m == VERY_BIG_NUMBER)
    {
        /* even amount of numbers: average the two middle records */
        for (i = 0; i <= count; i++)
        {
            if (numbers[i].greater - numbers[i].lesser == -1)
                a = numbers[i].number;
            if (numbers[i].greater - numbers[i].lesser == 1)
                b = numbers[i].number;
        }
        m = (a + b) / 2;
    }
    return m;
}
What this does is: each time you add a number to the set, you must know how many numbers are less than it and how many are greater than it. So, if a number has the same "lesser than" and "greater than" counts, it is in the very middle of the set, without you having to sort anything. In the case that you have an even amount of numbers you have two candidates for the median, so you just return the mean of those two. BTW, this is C code, I hope this helps.
