Sorting Algorithms with constraints on Array - algorithm

I am trying to come up with an algorithm that sorts an array A in O(n log(log n)) time,
where A[0...n-1] has the property A[i] >= A[i-j] for all j >= log(n).
So far I have thought to partition A into blocks of size log n each.
Then I think that the first block will be strictly smaller than the blocks that come after it?
I think I'm missing part of it.

Tree sort would be an option here. You start at the left end of your array and feed elements into the tree. Whenever your tree has more than log(n) elements, you take the smallest element out, because you know for sure that all subsequent elements are larger, and put it back into the array at the next output position. This way the tree size is always log(n), and the cost of a tree operation is log(log(n)). In fact you only need the operations (1) insert an arbitrary element and (2) remove the smallest element, so you don't necessarily need a tree; any sort of priority queue would do for that purpose. This way both average and worst-case performance meet your requirements.
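A minimal sketch of that approach in Java, assuming the base-2 logarithm and using the library PriorityQueue as the min-heap (the class and method names are my own):

import java.util.PriorityQueue;

public class LogLocalSort {
    // Sorts an array in which no element is more than log2(n) positions
    // away from its final sorted position (which the stated property implies).
    public static void sort(int[] a) {
        int n = a.length;
        if (n == 0) return;
        int window = Math.max(1, (int) Math.ceil(Math.log(n) / Math.log(2)));
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        int write = 0;
        for (int read = 0; read < n; read++) {
            heap.add(a[read]);               // heap never exceeds log(n) + 1 elements,
            if (heap.size() > window) {      // so each operation costs O(log(log n))
                a[write++] = heap.poll();    // the minimum cannot be beaten by later elements
            }
        }
        while (!heap.isEmpty()) {
            a[write++] = heap.poll();        // drain the last log(n) elements
        }
    }
}

Each element enters and leaves the small heap exactly once, which gives the O(n log(log n)) total.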

Related

Balancing KD-Tree: Which approach is more efficient?

I'm trying to balance a set of (Million +) 3D points using a KD-tree and I have two ways of doing it.
Way 1:
Use an O(n) algorithm to find the arraysize/2-th largest element along a given axis and store it at the current node
Iterate over all the elements in the vector and for each, compare them to the element I just found and put those smaller in newArray1, and those larger in newArray2
Recurse
Way 2:
Use quicksort O(nlogn) to sort all the elements in the array along a given axis, take the element at position arraysize/2 and store it in the current node.
Then put all the elements from index 0 to arraysize/2-1 in newArray1, and those from arraysize/2 to arraysize-1 in newArray2
Recurse
Way 2 seems more "elegant" but way 1 seems faster since the median search and the iterating are both O(n) so I get O(2n) which just reduces to O(n). But then at the same time, even though way 2 is O(nlogn) time to sort, splitting up the array into 2 can be done in constant time, but does it make up for the O(nlogn) time for sorting?
What should I do? Or is there an even better way to do this that I'm not even seeing?
How about Way 3:
Use an O(n) algorithm such as QuickSelect to ensure that the element at position length/2 is the correct element, all elements before are less, and all afterwards are larger than it (without sorting them completely!) - this is probably the algorithm you used in your Way 1 step 1 anyway...
Recurse into each half (except middle element) and repeat with next axis.
Note that you actually do not need to make "node" objects. You can actually keep the tree in a large array. When searching, start at length/2 with the first axis.
I've seen this trick being used by ELKI. It uses very little memory and code, which makes the tree quite fast.
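As a rough illustration of Way 3, here is a sketch in Java (the class and method names and the double[] point representation are my own assumptions): QuickSelect places the median of each range at the middle index, and the recursion uses the array itself as the tree.

import java.util.Random;

public class InPlaceKdTree {
    private static final Random RNG = new Random();

    // points[i] = {x, y, z, ...}; dims = number of coordinates per point.
    // After build(), the node for any range [lo, hi) sits at index (lo + hi) / 2.
    public static void build(double[][] points, int dims) {
        build(points, 0, points.length, 0, dims);
    }

    private static void build(double[][] pts, int lo, int hi, int axis, int dims) {
        if (hi - lo <= 1) return;
        int mid = (lo + hi) / 2;
        quickSelect(pts, lo, hi, mid, axis);    // expected O(hi - lo)
        int next = (axis + 1) % dims;
        build(pts, lo, mid, next, dims);        // smaller along this axis
        build(pts, mid + 1, hi, next, dims);    // larger along this axis
    }

    // Rearranges pts[lo, hi) so that pts[k] holds the element of rank k along `axis`.
    private static void quickSelect(double[][] pts, int lo, int hi, int k, int axis) {
        while (hi - lo > 1) {
            int p = partition(pts, lo, hi, lo + RNG.nextInt(hi - lo), axis);
            if (p == k) return;
            if (k < p) hi = p; else lo = p + 1;
        }
    }

    // Lomuto partition around the value at pivotIndex; returns its final index.
    private static int partition(double[][] pts, int lo, int hi, int pivotIndex, int axis) {
        double pivot = pts[pivotIndex][axis];
        swap(pts, pivotIndex, hi - 1);
        int store = lo;
        for (int i = lo; i < hi - 1; i++) {
            if (pts[i][axis] < pivot) swap(pts, i, store++);
        }
        swap(pts, store, hi - 1);
        return store;
    }

    private static void swap(double[][] pts, int i, int j) {
        double[] t = pts[i]; pts[i] = pts[j]; pts[j] = t;
    }
}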
Another way:
Sort once along each of the dimensions: O(K N log N). This is performed only once, and we reuse these sorted lists throughout the recursion.
For the current dimension, find the median in O(1) time, split around the median in O(N) time, split the sorted arrays for each of the dimensions as well in O(KN) time, and recurse with the next dimension.
In that way, you perform the sorts only at the beginning, and then perform (K+1) splits/filterings for each subtree around the already-known median value. For small K, this approach should be faster than the other approaches.
Note: The additional space needed for the algorithm can be decreased by the tricks pointed out by Anony-Mousse.
Notice that if the query hyper-rectangle contains many points (all of them for example) it does not matter if the tree is balanced or not. A balanced tree is useful if the query hyper-rects are small.
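A sketch of this presorting idea in Java, assuming points are double[] coordinates and per-dimension index lists are affordable (all names here are mine): the median of the current axis is read from its presorted list in O(1), and every presorted list is split stably around it, so nothing is ever re-sorted.

import java.util.ArrayList;
import java.util.List;

public class PresortedKdTreeBuilder {
    static class Node {
        double[] point;
        Node left, right;
        Node(double[] p) { point = p; }
    }

    public static Node build(double[][] points, int dims) {
        // One O(N log N) sort per dimension, done only once.
        List<List<Integer>> byDim = new ArrayList<>();
        for (int d = 0; d < dims; d++) {
            final int dim = d;
            List<Integer> idx = new ArrayList<>();
            for (int i = 0; i < points.length; i++) idx.add(i);
            idx.sort((a, b) -> Double.compare(points[a][dim], points[b][dim]));
            byDim.add(idx);
        }
        return build(points, byDim, 0, dims);
    }

    private static Node build(double[][] pts, List<List<Integer>> byDim, int axis, int dims) {
        List<Integer> cur = byDim.get(axis);
        if (cur.isEmpty()) return null;
        int medianIdx = cur.get(cur.size() / 2);     // median along `axis` in O(1)
        Node node = new Node(pts[medianIdx]);

        // Split every presorted list stably into "left of median" / "right of median": O(K N).
        List<List<Integer>> leftLists = new ArrayList<>(), rightLists = new ArrayList<>();
        for (int d = 0; d < dims; d++) {
            List<Integer> l = new ArrayList<>(), r = new ArrayList<>();
            for (int i : byDim.get(d)) {
                if (i == medianIdx) continue;
                if (goesLeft(pts, i, medianIdx, axis)) l.add(i); else r.add(i);
            }
            leftLists.add(l);
            rightLists.add(r);
        }
        int next = (axis + 1) % dims;
        node.left = build(pts, leftLists, next, dims);
        node.right = build(pts, rightLists, next, dims);
        return node;
    }

    // Ties are broken by index so each point lands on exactly one side of the median.
    private static boolean goesLeft(double[][] pts, int i, int medianIdx, int axis) {
        double v = pts[i][axis], m = pts[medianIdx][axis];
        return v < m || (v == m && i < medianIdx);
    }
}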

Fastest method for Queue Implementation in Java

The task is to implement a queue in java with the following methods:
enqueue //add an element to queue
dequeue //remove element from queue
peekMedian //find median
peekMinimum //find minimum
peekMaximum //find maximum
size // get size
Assume that ALL METHODS ARE CALLED WITH EQUAL FREQUENCY; the task is to have the fastest implementation.
My Current Approach:
Maintain a sorted array in addition to the queue, so that enqueue and dequeue take O(logn) and peekMedian, peekMaximum, peekMinimum all take O(1) time.
Please suggest a method that will be faster, assuming all methods are called in equal frequency.
Well, you are close - but there is still something missing, since inserting into/deleting from a sorted array is O(n): with probability 1/2 the inserted element belongs in the first half of the array, so you have to shift all the following elements to the right, and there are at least n/2 of these. The total complexity of this operation is therefore O(n), both on average and in the worst case.
However, if you switch your sorted DS to a skip list / balanced BST, you are going to get O(logn) insertion/deletion and O(1) minimum/maximum/median/size (with caching).
EDIT:
You cannot get better than O(logN) for insertion (unless you slow peekMedian() down to Omega(logN)), because that would enable you to sort faster than O(NlogN):
First, note that the median moves one element to the right for every two "high" elements you insert (here, high means >= the current max).
So, by iteratively doing:
while peekMedian() != MAX:
    record peekMedian()
    insert(MAX)
    insert(MAX)
you can find the "higher" half of the sorted array.
Using the same approach with insert(MIN) you can get the lowest half of the array.
Assuming you have o(logN) insertion (little-o notation, i.e. strictly better than Theta(logN)) and O(1) peekMedian(), you get yourself a sort better than O(NlogN), but sorting is an Omega(NlogN) problem.
=><=
Thus insert() cannot be better than O(logN) while the median is still O(1).
QED
EDIT2: Modifying the median in insertions:
If the tree size before the insertion is 2n+1 (odd), then the old median is at index n+1, and the new median is at the same index (n+1). So if the element was added before the old median, the new median is the node preceding the old median; if it was added after it, do nothing, the old median is the new one as well.
If the tree size is even (2n elements), then after the insertion the median index increases (from n to n+1). So if the new element was added before the old median, do nothing; if it was added after the old median, you need to set the new median to the node following the old median.
Note: here "following" and "preceding" nodes are those that follow according to the key, and index means the "place" of the node (the smallest is 1st and the biggest is last).
I only explained how to do it for insertion, the same ideas hold for deletion.
There is a simpler and perhaps better solution. (As has been discussed, the sorted array makes enqueue and dequeue both O(n), which is not so good.)
Maintain two sorted sets in addition to the queue. The Java library provides these, e.g. SortedSet, whose standard implementations are balanced search trees. The "low set" stores the first ceil(n/2) elements in sorted order. The second "high set" has the last floor(n/2) elements.
NB: If duplicates are allowed, you'll have to use something like Google's TreeMultiset instead of regular Java sorted sets.
To enqueue, just add to the queue and to the correct set. If necessary, re-establish balance between the sets by moving one element: either the greatest element of the low set into the high set, or the least element of the high set into the low set. Dequeuing needs the same re-balance operation.
Finding the median if n is odd is just looking up the max element in the low set. If n is even, find the max element in the low set and min in the high set and average them.
With the native Java sorted set implementation (balanced tree), this will be O(log n) for all operations. It will be very easy to code. About 60 lines.
If you implement your own sifting heaps for the low and high sets, then you'll have O(1) for the find median operation while all other ops will remain O(log n).
If you go on and implement your own Fibonacci heaps for the low and high sets, then you'll have O(1) insert as well.
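A minimal sketch of the two-set queue in Java, assuming integer elements and using TreeMap<value, count> as a stand-in multiset so duplicates are allowed (all names are mine):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.TreeMap;

public class MedianQueue {
    private final Deque<Integer> queue = new ArrayDeque<>();        // arrival order
    private final TreeMap<Integer, Integer> low = new TreeMap<>();  // smallest ceil(n/2) values
    private final TreeMap<Integer, Integer> high = new TreeMap<>(); // largest floor(n/2) values
    private int lowSize = 0, highSize = 0;

    public void enqueue(int x) {
        queue.addLast(x);
        if (lowSize == 0 || x <= low.lastKey()) { add(low, x); lowSize++; }
        else { add(high, x); highSize++; }
        rebalance();
    }

    public int dequeue() {
        int x = queue.removeFirst();
        if (lowSize > 0 && low.containsKey(x)) { remove(low, x); lowSize--; }
        else { remove(high, x); highSize--; }
        rebalance();
        return x;
    }

    public int size()        { return lowSize + highSize; }
    public int peekMinimum() { return low.firstKey(); }
    public int peekMaximum() { return highSize > 0 ? high.lastKey() : low.lastKey(); }

    public double peekMedian() {
        if (size() % 2 == 1) return low.lastKey();                  // max of the low set
        return (low.lastKey() + high.firstKey()) / 2.0;             // average of the two middles
    }

    // Keep the invariant lowSize == highSize or lowSize == highSize + 1.
    private void rebalance() {
        if (lowSize > highSize + 1) {
            int x = low.lastKey(); remove(low, x); lowSize--; add(high, x); highSize++;
        } else if (highSize > lowSize) {
            int x = high.firstKey(); remove(high, x); highSize--; add(low, x); lowSize++;
        }
    }

    private static void add(TreeMap<Integer, Integer> ms, int x) { ms.merge(x, 1, Integer::sum); }

    private static void remove(TreeMap<Integer, Integer> ms, int x) {
        ms.merge(x, -1, Integer::sum);
        if (ms.get(x) == 0) ms.remove(x);
    }
}

With the library TreeMap every operation here stays within O(log n), matching the estimate above.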

Data structure to find median

This is an interview question. Design a class, which stores integers and provides two operations:
void insert(int k)
int getMedian()
I guess I can use a BST so that insert takes O(logN) and getMedian takes O(logN) (for getMedian I should additionally store the number of left/right children for each node).
Now I wonder if this is the most efficient solution and there is no better one.
You can use 2 heaps, that we will call Left and Right.
Left is a Max-Heap.
Right is a Min-Heap.
Insertion is done like this:
If the new element x is smaller than the root of Left then we insert x to Left.
Else we insert x to Right.
If after the insertion Left has more than one element more than Right, we call Extract-Max on Left and insert the result into Right.
Else, if after the insertion Right has more elements than Left, we call Extract-Min on Right and insert the result into Left.
The median is always the root of Left.
So insertion is done in O(lg n) time and getting the median is done in O(1) time.
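A compact sketch of that two-heap scheme in Java (the names are mine), using the library PriorityQueue for both heaps:

import java.util.Collections;
import java.util.PriorityQueue;

public class MedianOfStream {
    private final PriorityQueue<Integer> left =
            new PriorityQueue<>(Collections.reverseOrder());            // max-heap: lower half
    private final PriorityQueue<Integer> right = new PriorityQueue<>(); // min-heap: upper half

    public void insert(int x) {
        if (left.isEmpty() || x < left.peek()) left.add(x); else right.add(x);
        // Restore the size invariant: |left| == |right| or |left| == |right| + 1.
        if (left.size() > right.size() + 1) right.add(left.poll());
        else if (right.size() > left.size()) left.add(right.poll());
    }

    public int getMedian() {
        return left.peek();   // for an even count this is the lower median
    }
}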
See this Stack Overflow question for a solution that involves two heaps.
Would it beat an array of integers which performs a sort at insertion time, using a sorting algorithm dedicated to integers (http://en.wikipedia.org/wiki/Sorting_algorithm), if you choose a candidate that runs faster than O(log(n))? Using an array, getMedian would just be taking the element at half of the size, which would be O(1), no? It seems possible to me to do better than log(n) + log(n).
Plus, by being a little more flexible, you can improve your performance by changing your sort algorithm according to the properties of your input (is the input almost sorted or not ...).
I am pretty much an autodidact in computer science, but that is the way I would do it: simpler is better.
You could consider a self-balancing tree, too. If the tree is fully balanced, then the root node is your median. Say, the tree is one-level deeper on one end. Then, you just need to know how many nodes are there in the deeper-side to pick the correct median.

Looking for a data container with O(1) indexing and O(log(n)) insertion and deletion

I'm not sure if it's possible but it seems a little bit reasonable to me, I'm looking for a data structure which allows me to do these operations:
insert an item with O(log n)
remove an item with O(log n)
find/edit the k'th-smallest element in O(1), for arbitrary k (O(1) indexing)
Of course editing won't result in any change in the order of the elements. And what makes it somewhat possible is that I'm going to insert elements one by one in increasing order. So if, for example, I try inserting for the fifth time, I'm sure all four elements before this one are smaller than it and all the elements after it are going to be larger.
I don't know if the requested time complexities are possible for such a data container. But here is a couple of approaches, which almost achieve these complexities.
The first one is a tiered vector with O(1) insertion and indexing, but O(sqrt N) deletion. Since you expect only about 10000 elements in this container and sqrt(10000)/log(10000) is about 7, you get almost the required performance here. A tiered vector is implemented as an array of ring buffers, so deleting an element requires moving all elements following it in its ring buffer, and moving one element from each of the following ring buffers to the one preceding it; indexing in this container means indexing in the array of ring buffers and then indexing inside the ring buffer.
It is possible to create a different container, very similar to the tiered vector, having exactly the same complexities but working a little bit faster because it is more cache-friendly. Allocate an N-element array to store the values, and a sqrt(N)-element array to store index corrections (initialized with zeros). I'll show how it works on the example of a 100-element container. To delete the element with index 56, move elements 57..60 to positions 56..59, then in the array of index corrections add 1 to elements 6..9. To find the 84-th element, look up the eighth element in the array of index corrections (its value is 1), add its value to the index (84+1=85), then take the 85-th element from the main array. After about half of the elements in the main array are deleted, it is necessary to compact the whole container to regain contiguous storage; this adds only O(1) amortized complexity per operation. For real-time applications this compaction may be performed in several smaller steps.
This approach may be extended to a trie of depth M, taking O(M) time for indexing, O(M*N^(1/M)) time for deletion and O(1) time for insertion. Just allocate an N-element array to store the values, and arrays of N^((M-1)/M), N^((M-2)/M), ..., N^(1/M) elements to store index corrections. To delete element 2345, move 4 elements in the main array, increase 5 elements in the largest "corrections" array, increase 6 elements in the next one and 7 elements in the last one. To get element 5678 from this container, add to 5678 all corrections in elements 5, 56, 567 and use the result to index the main array. Choosing different values for 'M', you can balance the complexity between indexing and deletion operations. For example, for N=65000 you can choose M=4; so indexing requires only 4 memory accesses and deletion updates 4*16=64 memory locations.
I wanted to point out first that if k is really a random number, then it might be worth considering that the problem might be completely different: asking for the k-th smallest element, with k uniformly at random in the range of the available elements is basically... picking an element at random. And it can be done much differently.
Here I'm assuming you actually need to select for some specific, if arbitrary, k.
Given your strong pre-condition that your elements are inserted in order, there is a simple solution:
Since your elements are given in order, just add them one by one to an array; that is, you have some (infinite) table T and a cursor c, initially c := 1; when adding an element x, do T[c] := x and c := c+1.
When you want to access the k-th smallest element, just look at T[k].
The problem, of course, is that as you delete elements, you create gaps in the table, such that element T[k] might not be the k-th smallest, but the j-th smallest with j <= k, because some cells before k are empty.
It then is enough to keep track of the elements which you have deleted, to know how many have been deleted that are smaller than k. How do you do this in time at most O(log n)? By using a range tree or a similar type of data structure. A range tree is a structure that lets you add integers and then query for all integers in between X and Y. So, whenever you delete an item, simply add it to the range tree; and when you are looking for the k-th smallest element, make a query for all integers between 0 and k that have been deleted; say that delta have been deleted, then the k-th element would be in T[k+delta].
There are two slight catches, which require some fixing:
The range tree returns the range in time O(log n), but to count the number of elements in the range, you must walk through each element in the range and so this adds a time O(D) where D is the number of deleted items in the range; to get rid of this, you must modify the range tree structure so as to keep track, at each node, of the number of distinct elements in the subtree. Maintaining this count will only cost O(log n) which doesn't impact the overall complexity, and it's a fairly trivial modification to do.
In truth, making just one query will not work. Indeed, if you get delta deleted elements in the range 1 to k, then you need to make sure that no elements were deleted in the range k+1 to k+delta, and so on. The full algorithm would be something along the lines of the following.
KthSmallest(T, k) := {
    a = 1; b = k
    do {
        delta = deletedInRange(a, b)
        a = b + 1
        b = b + delta
    } while (delta > 0)
    return T[b]
}
The exact complexity of this operation depends on how exactly you make your deletions, but if your elements are deleted uniformly at random, then the number of iterations should be fairly small.
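Here is a small sketch of this scheme in Java, with a Fenwick (binary indexed) tree standing in for the range tree to count deletions in a prefix, and with deletions identified by their original insertion position; the names and the fixed-capacity layout are my own assumptions:

public class OrderedContainer {
    private final long[] T;        // T[1..capacity], filled left to right
    private final int[] deleted;   // Fenwick tree over deletion flags
    private int cursor = 0;        // number of elements inserted so far

    public OrderedContainer(int capacity) {
        T = new long[capacity + 1];
        deleted = new int[capacity + 1];
    }

    // O(1): relies on elements arriving in increasing order.
    public void insert(long x) { T[++cursor] = x; }

    // O(log n): mark the element that was inserted at position pos as deleted.
    public void removeAtOriginalPosition(int pos) {
        for (int i = pos; i < deleted.length; i += i & -i) deleted[i]++;
    }

    // O(log n): number of deleted positions in T[1..pos].
    private int deletedUpTo(int pos) {
        int s = 0;
        for (int i = pos; i > 0; i -= i & -i) s += deleted[i];
        return s;
    }

    // k-th smallest among the elements not yet removed (k is 1-based),
    // using the iterative correction loop described above.
    public long kthSmallest(int k) {
        int a = 1, b = k, delta;
        do {
            delta = deletedUpTo(b) - deletedUpTo(a - 1);   // deletions in [a, b]
            a = b + 1;
            b = b + delta;
        } while (delta > 0);
        return T[b];
    }
}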
There is a Treelist (implementation for Java, with source code), which is O(lg n) for all three ops (insert, delete, index).
Actually, the accepted name for this data structure seems to be "order statistic tree". (Apart from indexing, it's also defined to support indexof(element) in O(lg n).)
By the way, O(1) is not considered much of an advantage over O(lg n). Such differences tend to be overwhelmed by the constant factor in practice. (Are you going to have 1e18 items in the data structure? If we set that as an upper bound, that's just equivalent to a constant factor of 60 or so.)
Look into heaps. Insertion and removal should be O(log n) and peeking at the smallest element is O(1). Peeking at or retrieving the k-th element, however, will be O(log n) again.
EDITED: as amit stated, retrieval is more expensive than just peeking
This is probably not possible.
However, you can make certain changes in balanced binary trees to get kth element in O(log n).
Read more about it here : Wikipedia.
Indexable skip lists might be able to do (close to) what you want:
http://en.wikipedia.org/wiki/Skip_lists#Indexable_skiplist
However, there's a few caveats:
It's a probabilistic data structure. That means it's not necessarily going to be O(log N) for all operations
It's not going to be O(1) for indexing, just O(log N)
Depending on the speed of your RNG and also depending on how slow traversing pointers are, you'll likely get worse performance from this than just sticking with an array and dealing with the higher cost of removals.
Most likely, something along the lines of this is going to be the "best" you can do to achieve your goals.

Find k-th smallest element data structure

I have a problem here that requires designing a data structure that takes O(lg n) worst case for the following three operations:
a) Insertion: insert the key into the data structure only if it is not already there.
b) Deletion: delete the key if it is there!
c) Find kth smallest: find the k-th smallest key in the data structure.
I am wondering if I should use a heap but I still don't have a clear idea about it.
I can easily get the first two parts in O(lg n), even faster, but I am not sure how to deal with part c).
If anyone has any idea, please share.
Two solutions come to mind:
Use a balanced binary search tree (red-black, AVL, splay... any would do). You're already familiar with operations (1) and (2). For operation (3), just store an extra value at each node: the total number of nodes in that subtree. You can easily use this value to find the kth smallest element in O(log(n)).
For example, say your tree looks like this - the root A has 10 nodes, its left child B has 3 nodes, its right child C has 6 nodes (3 + 6 + 1 = 10). If you want to find the 8th smallest element, you know you should go to the right side and look for the (8 - 3 - 1) = 4th smallest element there.
Use a skip list. It also supports all of your operations (1), (2), (3) in O(logn) on average, but may take a bit longer to implement.
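A small sketch of the selection step from solution (1), assuming each node already stores its subtree size (node layout and names are mine; the red-black/AVL rebalancing, which must also keep the size fields up to date, is omitted):

class Node {
    int key, size = 1;     // size = number of nodes in this subtree
    Node left, right;
    Node(int key) { this.key = key; }
}

class OrderStatistics {
    static int size(Node n) { return n == null ? 0 : n.size; }

    // k is 1-based; assumes 1 <= k <= size(root).
    static int kthSmallest(Node root, int k) {
        Node n = root;
        while (true) {
            int leftSize = size(n.left);
            if (k <= leftSize) n = n.left;               // answer is in the left subtree
            else if (k == leftSize + 1) return n.key;    // this node is the k-th smallest
            else { k -= leftSize + 1; n = n.right; }     // skip left subtree and this node
        }
    }
}

In the example above, k = 8 at the root: 8 > 3 + 1, so we go right and look for the 4th smallest element there.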
Well, if your data structure keeps the elements sorted, then it's easy to find the kth lowest element.
The worst-case cost of a Binary Search Tree for search and insertion is O(N) while the average-case cost is O(lgN).
Thus, I would recommend using a Red-Black Binary Search Tree which guarantees a worst-case complexity of O(lgN) for both search and insertion.
You can read more about red-black trees here and see an implementation of a Red-Black BST in Java here.
So in terms of finding the k-th smallest element using the above Red-Black BST implementation, you just need to call the select method, passing in the value of k. The select method also guarantees worst-case O(lgN).
One solution could be to use the strategy of quicksort.
Step 1: Pick the first element as the pivot and move it to its correct place (at most n checks).
Once you reach the correct location for this element, do a check:
Step 2.1: if location > k, your element resides in the first sublist, so you are not interested in the second sublist.
Step 2.2: if location < k, your element resides in the second sublist, so you are not interested in the first sublist.
Step 2.3: if location == k, you have got the element; break the loop/recursion.
Step 3: repeat steps 1 to 2.3 on the appropriate sublist.
The expected complexity of this solution is O(n); the worst case is O(n^2).
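A short sketch of these steps in Java (the names are mine), partitioning around the first element and narrowing to the side that contains position k:

public class KthSmallest {
    // Returns the k-th smallest (1-based) element of a; the array is reordered in place.
    public static int select(int[] a, int k) {
        int lo = 0, hi = a.length - 1, target = k - 1;
        while (lo < hi) {
            int p = partition(a, lo, hi);     // pivot ends up at its final sorted index p
            if (p == target) return a[p];
            if (target < p) hi = p - 1; else lo = p + 1;
        }
        return a[lo];
    }

    // Partition using the first element as the pivot, as in step 1 above.
    private static int partition(int[] a, int lo, int hi) {
        int pivot = a[lo];
        int store = lo;
        for (int i = lo + 1; i <= hi; i++) {
            if (a[i] < pivot) {
                store++;
                int t = a[store]; a[store] = a[i]; a[i] = t;
            }
        }
        int t = a[lo]; a[lo] = a[store]; a[store] = t;
        return store;
    }
}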
Heap is not the right structure for finding the Kth smallest element of an array, simply because you would have to remove K-1 elements from the heap in order to get to the Kth element.
There is a much better approach to finding Kth smallest element, which relies on median-of-medians algorithm. Basically any partition algorithm would be good enough on average, but median-of-medians comes with the proof of worst-case O(N) time for finding the median. In general, this algorithm can be used to find any specific element, not only the median.
Here is the analysis and implementation of this algorithm in C#: Finding Kth Smallest Element in an Unsorted Array
P.S. On a related note, there are many things that you can do in place with arrays. Arrays are a wonderful data structure, and if you know how to organize their elements for a particular situation, you can get results extremely fast and without additional memory use.
Heap structure is a very good example, QuickSort algorithm as well. And here is one really funny example of using arrays efficiently (this problem comes from programming Olympics): Finding a Majority Element in an Array
