Given an array of integers, I need to find the k nearest integers for every element in the array (excluding the element itself).
Example:
k = 2
Array: [1, 2, 4, 7]
Answer: [4, 3, 5, 8] (each value in the answer is the sum of the distances from the element to its k nearest neighbours)
I've come up with the following algorithm. (There can be inaccuracies with indices, but I hope the main idea is clear).
Sort an array
Suppose we have the answer for the i-th element, i.e. a segment L of elements before i and a segment R after i, such that |L| + |R| = k.
Considering the answer for the (i+1)-th element, we can take all elements from R (except a[i+1] itself), because in a sorted array they're even closer to a[i+1]. Then I find the other k - |R| + 1 elements using two pointers that move in opposite directions: l starting from i, r starting from i + |R|.
I don't like that I'm scanning previous elements using that l pointer. I suppose in a worst case scenario this algorithm would have O(n^2) time complexity. How can I improve it?
Your algorithm only needs a small tweak, that's it; it doesn't have O(n²) behavior. As is, it's by description O(max(k × n, n log n)) because, if I'm reading you correctly, you think you're doing an O(k) scan outwards for each element that does not meaningfully benefit from the state of the window you used for the prior element. A small tweak would drop the k × n term to n, leaving it at the cost of the sorting algorithm, e.g. O(n log n) for good general-purpose sorts, or some lower big-O greater than or equal to O(n) for special-purpose sorts like counting sort. And really, even without that tweak your existing code would already behave that way, you just don't see it (the tweak is more about how you think about it; the lack of the tweak just adds an additional O(n) step, and going from n to 2n work doesn't change the big-O).
Your lower bound is O(n log n) if you use general-purpose sorting. If the inputs fall in a restricted range you could get O(n + r) computational work, with O(r) auxiliary storage, by using a special-purpose sort like counting sort, where r is the size of the range (a different quantity from the k of the problem), but we'll assume the integers can run from -inf to inf, so the O(n + r) is worse than O(n log n) general-purpose sorting. Either way, that's a lower bound: if the rest of your algorithm is O(n) (and it has to be at least that high, since it must traverse the input element by element), then your sorting algorithm determines your overall work.
But while the way you describe it makes it sound like the per-element work for window adjustment is O(k), in practice, it's actually doing n - k work to perform the window shifts for the whole sequence, with the -k factor spread over the whole sequence. Your auxiliary pointers would never scan backwards, because each time you move forward, your old value is closer to the new value than the furthest left value was by definition. So the only possibilities are:
1. The window for the prior value included the next value; replace the new value with the prior value, and then check:
a. If some number of values above the old window are closer than the values at the bottom of the old window, the window slides forward (up to k times), or
b. If the value above the old window is further than the bottom value in the window, the window doesn't shift (aside from the tweak to include the prior value in place of the next value).
2. The window for the prior value was entirely to its left: replace the bottom of the window with the prior value unconditionally (instead of replacing the new value with the prior value as in #1, since the new value wasn't in there), then follow the same rules as case #1 (either the window remains unmoved beyond the swap, or it shifts right up to k times).
So the window sliding work can be as much as k at any given step (say, if k is 2 and you're at the value 8 in [6, 7, 8, 100, 101, 102], then when you shift to 100, your preliminary window from 6-7 unconditionally becomes 7-8, then you conditionally shift twice, first to 8-101, then to 101-102 where it stops). But since the only options at any given step are:
The window doesn't move (effectively moving backwards by one relative to advancing value's position), or
The window moves up to k + 1 forward
this means that for every time the window moves forward a given number of steps, s, it must not have moved at all for s prior values (to move it backward enough that it could move forward that far), meaning that the peak per-element work is O(k), but the amortized per-element work is O(1) (and it's actually slightly less than one, because over the course of the whole traversal, you move from a window that is definitionally k elements to the right of the current value, to one k elements to the left, and those leftward "moves" are actually staying put, so to speak, so your window shifts slightly less than once per element).
Your original plan is effectively the same, you'd just end up doing an unnecessary check at each stage to see if the window could move backwards (it never could). Since that single check is fixed cost, it's O(1) per-element, imposing no multiplier on the O(n) cost for processing the whole sequence.
In short: your algorithm is already big-O optimal at O(n) for the work done post-sort. If you use a general-purpose sorting algorithm, the O(n log n) work it does dominates everything else; a special-purpose algorithm like counting sort would be the only way to bring the big-O of the whole process, including the sort, any lower, and the minimum cost would be O(n) even if you had a magical "sorts for free" tool.
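To make the post-sort pass concrete, here is a minimal Python sketch (the names are mine, not the poster's): keep a window of k+1 consecutive sorted values that always contains the current element; the window only ever slides forward, so the whole pass is O(n).

def k_nearest(vals, k):
    # Sliding-window sketch; assumes vals is already sorted and 0 < k < len(vals).
    # For index i, the k nearest values are the window vals[lo : lo + k + 1]
    # with vals[i] itself removed, and lo only ever moves to the right.
    n = len(vals)
    res = []
    lo = 0
    for i in range(n):
        # Slide right while the value just past the window is strictly closer
        # to vals[i] than the leftmost value inside the window.
        while (lo + k + 1 <= n - 1 and lo < i and
               vals[lo + k + 1] - vals[i] < vals[i] - vals[lo]):
            lo += 1
        lo = max(lo, i - k)  # keep i inside the window (only matters for ties)
        res.append(vals[lo:i] + vals[i + 1:lo + k + 1])
    return res

# k_nearest([1, 2, 4, 7], 2) -> [[2, 4], [1, 4], [1, 2], [2, 4]];
# summing |neighbour - element| per position reproduces the question's [4, 3, 5, 8].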
Related
Problem: An odd number K is given. At the input, we gradually receive a sequence of N numbers. After reading each number (except the first K-1), print the median of the last K numbers.
I solved it using two heaps (a MaxHeap and a MinHeap) and added a function remove_element(value) to both heaps. remove_element(value) iterates over the heap looking for the value to remove, removes it, and then rebalances the tree, so it works in O(K + log K).
And I solved the problem like this: iterating over the stream data, I add each new element to the heaps and delete the old one that has just fallen outside the window (the (K+1)-th most recent element). So it works in O(N*(K + log K + log K)) = O(N*K).
I'm wondering 1) if I correctly estimated the time complexity of the algorithm, 2) whether it is possible to speed this algorithm up, and 3) if not (i.e. the algorithm is optimal), how I can prove its optimality.
As for the third question: if my estimate of O(N*K) is correct, I think it can be proven based on the obvious idea that in order to write out the median, we need to check all K elements for each of the N requests.
For a better approach, make the heaps be heaps of (value, i). And now you don't need to remove stale values promptly, you can simply keep track of how many non-stale values are on each side, and throw away stale values when they pop up from the heap.
And if more than half of a heap is stale, then do garbage collection to remove stale values.
If you do this well, you should be able to solve this problem with O(k) extra memory in time O(n log(k)).
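A rough Python sketch of that lazy-deletion idea (all names and details here are mine, not the answerer's; stale entries are dropped only when they surface at a heap top rather than via an explicit garbage-collection pass, so the heaps can hold up to O(n) stale entries in the worst case):

import heapq

def sliding_window_medians(nums, k):
    # Two heaps of (value, index) with lazy deletion; assumes k is odd.
    low, high = [], []          # low: max-heap (values negated); high: min-heap
    side = {}                   # index -> which heap its entry currently sits in
    low_live = high_live = 0    # counts of in-window ("live") entries per side
    medians = []

    def prune(i):
        # Discard entries whose index has slid out of the window [i-k+1, i].
        while low and low[0][1] <= i - k:
            heapq.heappop(low)
        while high and high[0][1] <= i - k:
            heapq.heappop(high)

    for i, x in enumerate(nums):
        # The element leaving the window is deleted lazily: only fix the counts.
        if i >= k:
            if side.pop(i - k) == 'low':
                low_live -= 1
            else:
                high_live -= 1
        prune(i)
        # Insert the new element on the correct side of the current median.
        if not low or x <= -low[0][0]:
            heapq.heappush(low, (-x, i))
            side[i] = 'low'
            low_live += 1
        else:
            heapq.heappush(high, (x, i))
            side[i] = 'high'
            high_live += 1
        # Restore the invariant: low holds as many live entries as high, or one more.
        if low_live > high_live + 1:
            v, j = heapq.heappop(low)
            heapq.heappush(high, (-v, j))
            side[j] = 'high'
            low_live -= 1
            high_live += 1
        elif high_live > low_live:
            v, j = heapq.heappop(high)
            heapq.heappush(low, (-v, j))
            side[j] = 'low'
            high_live -= 1
            low_live += 1
        if i >= k - 1:
            prune(i)
            medians.append(-low[0][0])  # live top of low is the window median
    return medians

# sliding_window_medians([1, 3, -1, -3, 5, 3, 6, 7], 3) -> [1, -1, -1, 3, 5, 6]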
def quick_sort(array):
    if len(array) <= 1:
        return array
    pivot = array[-1]          # last element as the pivot
    array.pop()                # note: this mutates the caller's list
    less = []
    greater = []
    for num in array:
        if num > pivot:
            greater.append(num)
        else:
            less.append(num)
    return quick_sort(less) + [pivot] + quick_sort(greater)
What's the space complexity of this implementation of quicksort? I just picked the last element as the pivot, created an array of the elements greater and the elements lesser and moved them accordingly. Then I recursively did that for both the lesser and greater arrays. So at the end, I'd have [pivot] + [pivot] + [pivot]... all in sorted order. Now I'm kind of confused about the space complexity. I have two sub arrays for the lesser and greater and also there's the recursion call stack. What do you think?
The space complexity of your implementation of quicksort is Θ(n²) in the worst case and Θ(n) on average.
Here’s how to see this. Imagine drawing out the full recursion tree for your algorithm. At any one point in time, the algorithm is in one of those recursive calls, with space needed to store all the data from that recursive call, plus all the space for the recursive calls above it. That’s because the call stack, at any one point in time, is a path from some call back up to the root call. Therefore, the space complexity is the maximum amount of space used on any path from a leaf in the recursion tree back up to the root.
Imagine, then, that you happen to pick the absolute worst pivot possible at each step - say, you always pick the smallest or largest element. Then your recursion tree is essentially a giant linked list, where the root holds an array of length n, under that is an array of length n-1, under that is an array of length n-2, etc. until you’re down to an array of length one. The space usage then is 1+2+3+...+n, which is Θ(n²). That’s not great.
On the other hand, suppose you’re looking at a more “typical” run of quicksort, in which you generally get good pivots. In that case, you’d expect that, about half the time, you get a pivot in the middle 50% of the array. With a little math, you can show that this means that, on expectation, you’ll have about two splits before the array size drops to 75% of its previous size. That makes the depth of the recursion tree O(log n). You’ll then have about two layers with arrays of size roughly n, about two layers with arrays of size roughly .75n, about two layers of size roughly (.75)²n, etc. That makes your space usage roughly
2(n + .75n + (.75)²n + ...)
= 2n(1 + .75 + (.75)² + ...)
= Θ(n).
That last step follows because that’s the sum of a geometric series, which converges to some constant.
To improve your space usage, you’ll need to avoid creating new arrays at each level for your lesser and greater elements. Consider using an in-place partition algorithm to modify the array in place. If you’re clever, you can use that approach and end up with O(log n) total space usage.
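For instance, a hedged sketch of that in-place approach (not the original poster's code): a Lomuto-style partition, recursing into the smaller side and looping on the larger one so the call stack stays O(log n) even with bad pivots.

def quicksort_inplace(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    while lo < hi:
        # Lomuto partition of a[lo..hi] around the last element.
        pivot, i = a[hi], lo
        for j in range(lo, hi):
            if a[j] <= pivot:
                a[i], a[j] = a[j], a[i]
                i += 1
        a[i], a[hi] = a[hi], a[i]   # pivot lands at its final position i
        # Recurse into the smaller half, iterate on the larger half.
        if i - lo < hi - i:
            quicksort_inplace(a, lo, i - 1)
            lo = i + 1
        else:
            quicksort_inplace(a, i + 1, hi)
            hi = i - 1
    return a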
Hope this helps!
Given an array where the number of occurrences of each number is odd, except for one number whose number of occurrences is even, find the number with the even number of occurrences.
e.g.
1, 1, 2, 3, 1, 2, 5, 3, 3
Output should be:
2
The below are the constraints:
The numbers do not fall in a bounded range.
Do it in-place.
Required time complexity is O(N).
Array may contain negative numbers.
Array is not sorted.
With the above constraints, all my ideas failed: comparison-based sorting, counting sort, BSTs, hashing, brute force.
I am curious to know: Will XORing work here? If yes, how?
This problem has been occupying my subway rides for several days. Here are my thoughts.
If A. Webb is right and this problem comes from an interview or is some sort of academic problem, we should think about the (wrong) assumptions we are making, and maybe try to explore some simple cases.
The two extreme subproblems that come to mind are the following:
The array contains two values: one of them is repeated an even number of times, and the other is repeated an odd number of times.
The array contains n-1 different values: all values are present once, except one value that is present twice.
Maybe we should split cases by complexity of number of different values.
If we suppose that the number of different values is O(1), each array would have m different values, with m independent of n. In this case, we could loop through the original array, erasing and counting occurrences of each value. In the example it would give
1, 1, 2, 3, 1, 2, 5, 3, 3 -> First value is 1 so count and erase all 1
2, 3, 2, 5, 3, 3 -> Second value is 2, count and erase
-> Stop because 2 was found an even number of times.
This would solve the first extreme example with a complexity of O(mn), which evaluates to O(n).
There's a better option: if the number of different values is O(1), we could count value appearances inside a hash map, go through them after reading the whole array, and return the one that appears an even number of times. This would still be considered O(1) memory.
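As a concrete illustration of that counting idea (a sketch; the O(1)-memory claim only holds under the assumption that the number of distinct values is O(1)):

from collections import Counter

def even_occurrence(arr):
    # Count every value, then return the one whose count is even.
    counts = Counter(arr)
    for value, count in counts.items():
        if count % 2 == 0:
            return value
    return None  # no value occurs an even number of times

# even_occurrence([1, 1, 2, 3, 1, 2, 5, 3, 3]) -> 2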
The second extreme case would consist of finding the only repeated value inside an array.
This seems impossible in O(n), but there are special cases where we can do it: if the array has n elements and the values inside are {1, ..., n-1} plus one repeated value (or some variant, like all numbers between x and y). In this case, we sum all the values, subtract n(n-1)/2 from the sum, and retrieve the repeated value.
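That special case is a one-liner; a sketch, assuming the values really are 1, ..., n-1 plus one repeated value:

def find_repeated_1_to_n_minus_1(a):
    # sum(a) = 1 + 2 + ... + (n-1) + repeated = n(n-1)/2 + repeated
    n = len(a)
    return sum(a) - n * (n - 1) // 2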
Solving the second extreme case with arbitrary values inside the array, or the general case where m is not constant in n, in constant memory and O(n) time seems impossible to me.
Extra note: here, XORing doesn't work because the number we want appears an even number of times and others appear an odd number of times. If the problem was "give the number that appears an odd number of times, all other numbers appear an even number of times" we could XOR all the values and find the odd one at the end.
We could try to look for a method using this logic: we would need something like a function that, applied an odd number of times to a number, would yield 0, and applied an even number of times would be the identity. I don't think this is possible.
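For reference, the XOR trick for that other variant (every value appears an even number of times except one, which appears an odd number of times) is just:

from functools import reduce
from operator import xor

def odd_occurrence(arr):
    # Pairs cancel out under XOR, leaving the value with the odd count.
    return reduce(xor, arr, 0)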
Introduction
Here is a possible solution. It is rather contrived and not practical, but then, so is the problem. I would appreciate any comments if I have holes in my analysis. If this was a homework or challenge problem with an “official” solution, I’d also love to see that if the original poster is still about, given that more than a month has passed since it was asked.
First, we need to flesh out a few ill-specified details of the problem. Time complexity required is O(N), but what is N? Most commentators appear to be assuming N is the number of elements in the array. This would be okay if the numbers in the array were of fixed maximum size, in which case Michael G’s solution of radix sort would solve the problem. But, I interpret constraint #1, in absence of clarification by the original poster, as saying the maximum number of digits need not be fixed. Therefore, if n (lowercase) is the number of elements in the array, and m the average length of the elements, then the total input size to contend with is mn. A lower bound on the solution time is O(mn) because this is the read-through time of the input needed to verify a solution. So, we want a solution that is linear with respect to total input size N = nm.
For example, we might have n = m, that is sqrt(N) elements of sqrt(N) average length. A comparison sort would take O( log(N) sqrt(N) ) < O(N) operations, but this is not a victory, because the operations themselves on average take O(m) = O(sqrt(N)) time, so we are back to O( N log(N) ).
Also, a radix sort would take O(mn) = O(N) if m were the maximum length instead of average length. The maximum and average length would be on the same order if the numbers were assumed to fall in some bounded range, but if not we might have a small percentage with a large and variable number of digits and a large percentage with a small number of digits. For example, 10% of the numbers could be of length m^1.1 and 90% of length m*(1-10%*m^0.1)/90%. The average length would be m, but the maximum length m^1.1, so the radix sort would be O(m^1.1 n) > O(N).
Lest there be any concern that I have changed the problem definition too dramatically, my goal is still to describe an algorithm with time complexity linear to the number of elements, that is O(n). But, I will also need to perform operations of linear time complexity on the length of each element, so that on average over all the elements these operations will be O(m). Those operations will be multiplication and addition needed to compute hash functions on the elements and comparison. And if indeed this solution solves the problem in O(N) = O(nm), this should be optimal complexity as it takes the same time to verify an answer.
One other detail omitted from the problem definition is whether we are allowed to destroy the data as we process it. I am going to do so for the sake of simplicity, but I think with extra care it could be avoided.
Possible Solution
First, the constraint that there may be negative numbers is an empty one. With one pass through the data, we will record the minimum element, z, and the number of elements, n. On a second pass, we will add (3-z) to each element, so the smallest element is now 3. (Note that a constant number of numbers might overflow as a result, so we should do a constant number of additional passes through the data first to test these for solutions.) Once we have our solution, we simply subtract (3-z) to return it to its original form. Now we have available three special marker values 0, 1, and 2, which are not themselves elements.
Step 1
Use the median-of-medians selection algorithm to determine the 90th percentile element, p, of the array A and partition the array into two sets S and T, where S has the 10% of n elements greater than p and T has the elements less than p. This takes O(n) steps (each step taking O(m) on average), for O(N) total time. Elements matching p could be placed either into S or T, but for the sake of simplicity, run through the array once, test for p, and eliminate it by replacing it with 0. Set S originally spans indexes 0..s, where s is about 10% of n, and set T spans the remaining 90% of indexes, s+1..n.
Step 2
Now we are going to loop through i in 0..s and for each element e_i we are going to compute a hash function h(e_i) into s+1..n. We’ll use universal hashing to get uniform distribution. So, our hashing function will do multiplication and addition and take linear time on each element with respect to its length.
We’ll use a modified linear probing strategy for collisions:
1. h(e_i) is occupied by a member of T (meaning A[ h(e_i) ] < p but is not a marker 1 or 2) or is 0. This is a hash table miss. Insert e_i by swapping the elements in slots i and h(e_i).
2. h(e_i) is occupied by a member of S (meaning A[ h(e_i) ] > p) or by the markers 1 or 2. This is a hash table collision. Do linear probing until encountering either a duplicate of e_i, a member of T, or 0.
a. If a member of T (or a 0), this is again a hash table miss, so insert e_i as in (1.) by swapping it into slot i.
b. If a duplicate of e_i, this is a hash table hit. Examine the next element. If that element is 1 or 2, we’ve seen e_i more than once already; change 1s into 2s and vice versa to track its change in parity. If the next element is not 1 or 2, then we’ve only seen e_i once before. We want to store a 2 into the next element to indicate we’ve now seen e_i an even number of times. We look for the next “empty” slot, that is one occupied by a member of T (which we’ll move to slot i) or a 0, and shift the elements from h(e_i)+1 up to that slot down by one so we have room next to h(e_i) to store our parity information. Note we do not need to store e_i itself again, so we’ve used up no extra space.
So basically we have a functional hash table with nine times as many slots as elements we wish to hash. Once we start getting hits, we begin storing parity information as well, so we may end up with only 4.5 times as many slots, still a very low load factor. There are several collision strategies that could work here, but since our load factor is low, the average number of collisions should also be low, and linear probing should resolve them with suitable time complexity on average.
Step 3
Once we finished hashing elements of 0..s into s+1..n, we traverse s+1..n. If we find an element of S followed by a 2, that is our goal element and we are done. Any element e of S followed by another element of S indicates e was encountered only once and can be zeroed out. Likewise e followed by a 1 means we saw e an odd number of times, and we can zero out the e and the marker 1.
Rinse and Repeat as Desired
If we have not found our goal element, we repeat the process. Our 90th percentile partition will move the 10% of n remaining largest elements to the beginning of A and the remaining elements, including the empty 0-marker slots, to the end. We continue as before with the hashing. We have to do this at most 10 times, as we process 10% of n each time.
Concluding Analysis
Partitioning via the median-of-medians algorithm has time complexity of O(N), which we do 10 times, still O(N). Each hash operation takes O(1) on average since the hash table load is low and there are O(n) hash operations in total performed (about 10% of n for each of the 10 repetitions). Each of the n elements has a hash function computed for it, with time complexity linear in its length, so on average over all the elements O(m). Thus, the hashing operations in aggregate are O(mn) = O(N). So, if I have analyzed this properly, then on the whole this algorithm is O(N) + O(N) = O(N). (It is also O(n) if operations of addition, multiplication, comparison, and swapping are assumed to be constant time with respect to input.)
Note that this algorithm does not utilize the special nature of the problem definition that only one element has an even number of occurrences. That we did not utilize this special nature of the problem definition leaves open the possibility that a better (more clever) algorithm exists, but it would ultimately also have to be O(N).
See the following article: Sorting algorithm that runs in time O(n) and also sorts in place.
Assuming that the maximum number of digits is constant, we can sort the array in place in O(n) time.
After that it is a matter of counting each number's appearances, which will take on average about n/2 steps to find the number whose number of occurrences is even.
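A sketch of that post-sort scan (using Python's built-in sort as a stand-in for whatever in-place O(n) sort applies under the fixed-digit assumption):

def even_occurrence_after_sort(arr):
    arr.sort()  # stand-in for the in-place O(n) sort referred to above
    i, n = 0, len(arr)
    while i < n:
        # Equal values are adjacent after sorting; measure the run length.
        j = i
        while j < n and arr[j] == arr[i]:
            j += 1
        if (j - i) % 2 == 0:
            return arr[i]
        i = j
    return None

# even_occurrence_after_sort([1, 1, 2, 3, 1, 2, 5, 3, 3]) -> 2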
I'm not sure if it's possible, but it seems at least somewhat reasonable to me. I'm looking for a data structure which allows me to do these operations:
insert an item with O(log n)
remove an item with O(log n)
find/edit the k'th-smallest element in O(1), for arbitrary k (O(1) indexing)
Of course, editing won't result in any change in the order of the elements. And what makes it somewhat possible is that I'm going to insert elements one by one in increasing order. So if, for example, I insert for the fifth time, I'm sure all four elements inserted before this one are smaller than it, and all the elements inserted after it are going to be larger.
I don't know if the requested time complexities are possible for such a data container. But here are a couple of approaches which almost achieve these complexities.
The first one is a tiered vector with O(1) insertion and indexing, but O(sqrt N) deletion. Since you expect only about 10000 elements in this container and sqrt(10000)/log(10000) ≈ 7, you get almost the required performance here. A tiered vector is implemented as an array of ring buffers, so deleting an element requires moving all elements following it in its ring buffer, and moving one element from each of the following ring buffers to the one preceding it; indexing in this container means indexing in the array of ring buffers and then indexing inside the ring buffer.
It is possible to create a different container, very similar to the tiered vector, having exactly the same complexities but working a little faster because it is more cache-friendly. Allocate an N-element array to store the values, and a sqrt(N)-element array to store index corrections (initialized with zeros). I'll show how it works on the example of a 100-element container. To delete the element with index 56, move elements 57..60 to positions 56..59, then in the array of index corrections add 1 to elements 6..9. To find the 84th element, look up the eighth element in the array of index corrections (its value is 1), then add its value to the index (84+1=85), then take the 85th element from the main array. After about half of the elements in the main array have been deleted, it is necessary to compact the whole container to regain contiguous storage. This adds only O(1) amortized complexity. For real-time applications this operation may be performed in several smaller steps.
This approach may be extended to a Trie of depth M, taking O(M) time for indexing, O(M*N^(1/M)) time for deletion and O(1) time for insertion. Just allocate a N-element array to store the values, and N^((M-1)/M), N^((M-2)/M), ..., N^(1/M)-element arrays to store index corrections. To delete element 2345, move 4 elements in main array, increase 5 elements in the largest "corrections" array, increase 6 elements in the next one and 7 elements in the last one. To get element 5678 from this container, add to 5678 all corrections in elements 5, 56, 567 and use the result to index the main array. Choosing different values for 'M', you can balance the complexity between indexing and deletion operations. For example, for N=65000 you can choose M=4; so indexing requires only 4 memory accesses and deletion updates 4*16=64 memory locations.
I wanted to point out first that if k is really a random number, then it might be worth considering that the problem might be completely different: asking for the k-th smallest element, with k uniformly at random in the range of the available elements is basically... picking an element at random. And it can be done much differently.
Here I'm assuming you actually need to select for some specific, if arbitrary, k.
Given your strong pre-condition that your elements are inserted in order, there is a simple solution:
Since your elements are given in order, just add them one by one to an array; that is you have some (infinite) table T, and a cursor c, initially c := 1, when adding an element, do T[c] := x and c := c+1.
When you want to access the k-th smallest element, just look at T[k].
The problem, of course, is that as you delete elements, you create gaps in the table, such that element T[k] might not be the k-th smallest, but the j-th smallest with j <= k, because some cells before k are empty.
It is then enough to keep track of the elements which you have deleted, to know how many of the deleted ones are smaller than k. How do you do this in time at most O(log n)? By using a range tree or a similar type of data structure. A range tree is a structure that lets you add integers and then query for all integers in between X and Y. So, whenever you delete an item, simply add it to the range tree; and when you are looking for the k-th smallest element, make a query for all integers between 0 and k that have been deleted; say that delta of them have been deleted, then the k-th element would be at T[k+delta].
There are two slight catches, which require some fixing:
The range tree returns the range in time O(log n), but to count the number of elements in the range, you must walk through each element in the range and so this adds a time O(D) where D is the number of deleted items in the range; to get rid of this, you must modify the range tree structure so as to keep track, at each node, of the number of distinct elements in the subtree. Maintaining this count will only cost O(log n) which doesn't impact the overall complexity, and it's a fairly trivial modification to do.
In truth, making just one query will not work. Indeed, if you get delta deleted elements in the range 1 to k, then you need to make sure that there are no elements deleted in the range k+1 to k+delta, and so on. The full algorithm would be something along the lines of what is below.
KthSmallest(T, k) := {
    a = 1; b = k
    do {
        delta = deletedInRange(a, b)
        a = b + 1
        b = b + delta
    } while( delta > 0 )
    return T[b]
}
The exact complexity of this operation depends on how exactly you make your deletions, but if your elements are deleted uniformly at random, then the number of iterations should be fairly small.
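Here is a rough Python sketch of the whole approach, using a Fenwick (binary indexed) tree in place of the count-augmented range tree; the names, the fixed capacity, and the delete-by-insertion-index interface are my own simplifications:

class KthSmallestStore:
    def __init__(self, capacity):
        self.T = []                      # T[i-1] = i-th inserted element (insertions arrive in increasing order)
        self.fen = [0] * (capacity + 1)  # Fenwick tree over positions, counting deletions

    def insert(self, x):
        # Precondition from the question: x exceeds every element inserted so far.
        self.T.append(x)

    def delete(self, i):
        # Record the deletion of the i-th inserted element (1-based): O(log n).
        while i < len(self.fen):
            self.fen[i] += 1
            i += i & (-i)

    def _deleted_upto(self, i):
        # Number of recorded deletions among positions 1..i: O(log n).
        s = 0
        while i > 0:
            s += self.fen[i]
            i -= i & (-i)
        return s

    def kth_smallest(self, k):
        # Iteratively skip past deleted slots, as in the pseudocode above;
        # assumes at least k elements are still live.
        a, b = 1, k
        while True:
            delta = self._deleted_upto(b) - self._deleted_upto(a - 1)
            if delta == 0:
                return self.T[b - 1]
            a, b = b + 1, b + delta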
There is a Treelist (implementation for Java, with source code), which is O(lg n) for all three ops (insert, delete, index).
Actually, the accepted name for this data structure seems to be "order statistic tree". (Apart from indexing, it's also defined to support indexof(element) in O(lg n).)
By the way, O(1) is not considered much of an advantage over O(lg n). Such differences tend to be overwhelmed by the constant factor in practice. (Are you going to have 1e18 items in the data structure? If we set that as an upper bound, that's just equivalent to a constant factor of 60 or so.)
Look into heaps. Insert and removal should be O(log n) and peeking of the smallest element is O(1). Peeking or retrieval of the K'th element, however, will be O(log n) again.
EDITED: as amit stated, retrieval is more expensive than just peeking
This is probably not possible.
However, you can make certain changes in balanced binary trees to get the kth element in O(log n).
Read more about it here: Wikipedia.
Indexable skip lists might be able to do (close to) what you want:
http://en.wikipedia.org/wiki/Skip_lists#Indexable_skiplist
However, there's a few caveats:
It's a probabilistic data structure. That means it's not necessarily going to be O(log N) for all operations
It's not going to be O(1) for indexing, just O(log N)
Depending on the speed of your RNG and also depending on how slow traversing pointers are, you'll likely get worse performance from this than just sticking with an array and dealing with the higher cost of removals.
Most likely, something along the lines of this is going to be the "best" you can do to achieve your goals.
An array is given such that its elements' values increase from the 0th index up through some (k-1)th index. At index k the value is the minimum, and then it starts increasing again through the nth element. Find the minimum element.
Essentially, it's one sorted list appended to another; example: (1, 2, 3, 4, 0, 1, 2, 3).
I have tried all sorts of algorithms like building a min-heap, quickselect, or just plain traversing, but I can't get it below O(n). But there is a pattern in this array, something that suggests a binary-search kind of approach should be possible, with complexity something like O(log n), but I can't find anything.
Thoughts ??
Thanks
No. The drop can be anywhere; there is no structure to this.
Consider the extremes:
1234567890
9012345678
1234056789
1357024689
It reduces to finding the minimum element.
Do a breadth-wise binary search for a decreasing range, with a one-element overlap at the binary splits. In other words, if you had, say, 17 elements, compare elements
0,8
8,16
0,4
4,8
8,12
12,16
0,2
2,4
etc., looking for a case where the left element is greater than the right.
Once you find such a range, recurse, doing the same binary search within that range.
Repeat until you've found the decreasing adjacent pair.
The average complexity is not less than O(log n), with a worst-case of O(n). Can anyone get a tighter average-complexity estimate? It seems roughly "halfway between" O(log n) and O(n), but I don't see how to evaluate it. It also depends on any additional constraints on the ranges of values and size of increment from one member to the next.
If the increment between elements is always 1, there's an O(log n) solution.
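For that special case, a hedged sketch of one way to see it: if both runs step by exactly +1, then a[i] - i equals a[0] on the first run and something strictly smaller from the drop onwards, so the drop index can be binary-searched.

def find_min_unit_steps(a):
    # Assumes both increasing runs step by exactly +1 (this does NOT work
    # for the general problem, per the other answers).
    lo, hi = 0, len(a) - 1
    if a[hi] - hi == a[0]:        # no drop: the whole array is one run
        return a[0]
    while lo < hi:
        mid = (lo + hi) // 2
        if a[mid] - mid == a[0]:  # still on the first run
            lo = mid + 1
        else:                     # at or past the drop
            hi = mid
    return a[lo]

# find_min_unit_steps([1, 2, 3, 4, 0, 1, 2, 3]) -> 0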
It cannot be done in less than O(n).
The worst case of this kind will always keep troubling us:
an increasing list
a1, a2, a3, ..., ak, ak+1, ..., an
with just one deviation, ak < ak-1, e.g. 1,2,3,4,5,6,4,7,8,9,10.
All the other numbers hold absolutely zero information about the value of k or ak.
The simplest solution is to just look forward through the list until the next value is less than the current one, or backward to find a value that is greater than the current one. That is O(n).
Doing both concurrently would still be O(n) but the running time would probably be faster (depending on complicated processor/cache factors).
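A sketch of the forward scan, for reference:

def find_min_linear(a):
    # Walk forward until the next value drops below the current one.
    for i in range(len(a) - 1):
        if a[i + 1] < a[i]:
            return a[i + 1]
    return a[0]  # no drop found: the array is entirely increasing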
I don't think you can get it much faster algorithmically than O(n) since a lot of the divide-and-conquer search algorithms rely on having a sorted data set.