Interval tree of an array with update on array - algorithm

Given an array of size N and an array of intervals, also of size N, each a contiguous segment of the first array, I need to handle Q queries that update elements of the first array and that ask for the sum of a segment of the second array (the sum of the elements from the i-th interval to the j-th interval).
Now, the first kind of query can be handled easily. I can build a segment tree from the array and use it to calculate the sum of an interval in the first array (i.e. an element of the second array). But how can I handle the second kind of query in O(log N)? In the worst case, the element I update will be in all of the intervals in the second array.
I need an O(Q log N) or O(Q (log N)^2) solution.

Here is an O((Q + N) * sqrt(Q)) solution (it is based on the pretty standard idea of sqrt-decomposition):
1. Let's assume that the array is never updated. Then the problem becomes pretty easy: using prefix sums, it is possible to solve it with O(N) precomputation and O(1) per query (we need 2 prefix-sum arrays here: one for the original array and one for the array of interval sums); see the sketch after this answer.
2. Now let's divide the queries into blocks of size sqrt(Q). At the beginning of each block, we do the same thing as in 1., taking into account only those updates that happened before the beginning of this block. It can be done in linear time (using prefix sums twice). The total number of such recomputations is Q / sqrt(Q) = sqrt(Q) (the number of blocks). So far this gives us O((N + Q) * sqrt(Q)) time in total.
3. When we get a query of type 2, all the updates outside the current block are already accounted for, so at most sqrt(Q) updates could still affect the answer. Let's process them almost naively: iterate over all updates within the current block that happened before this query and adjust the answer. To do this, we need to know how many times a given position of the array occurs in the intervals from i to j. This part can be solved offline with a sweep-line algorithm using O(Q * sqrt(N + Q)) time and space (the additional log factor does not appear because radix sort can be used).
So we get O((N + Q) * sqrt(Q)) time and space in the worst case. It is worse than O(Q * log N), of course, but it should work fine for about 10^5 queries and array elements.
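Step 1 in isolation might look like the following minimal Java sketch (how the intervals are given - here as inclusive 0-based bounds l[] and r[] - is an assumption):

// Static case: no updates. Precompute two prefix-sum arrays so that the
// "sum of the interval sums from the i-th to the j-th interval" is O(1).
static long[] buildIntervalPrefix(long[] a, int[] l, int[] r) {
    int n = a.length;
    long[] prefA = new long[n + 1];              // prefA[i] = a[0] + ... + a[i-1]
    for (int i = 0; i < n; i++) prefA[i + 1] = prefA[i] + a[i];

    int m = l.length;
    long[] prefS = new long[m + 1];              // prefS[k] = sum of the first k interval sums
    for (int k = 0; k < m; k++) {
        long intervalSum = prefA[r[k] + 1] - prefA[l[k]];
        prefS[k + 1] = prefS[k] + intervalSum;
    }
    return prefS;
}

// Sum of the interval sums for intervals i..j (0-based, inclusive), O(1) per query.
static long query(long[] prefS, int i, int j) {
    return prefS[j + 1] - prefS[i];
}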


set with average O(1) for add/remove and worst max/min O(log n)

Can I have a set where the average add/remove operation is O(1) (this is typical for hashtable-based sets) and the worst-case max/min is less than O(n), ideally O(log n) (typical for tree-based sets)?
upd: Hmm, it seems that in the simplest case I can just rescan all N elements every time the max/min disappears, and in general that gives me O(1). But I apply my algorithm to stock trading, where changes near the min/max are much more likely, so I just don't want to rescan everything every time the max or min disappears; I need something smarter than a full rescan, which is O(n).
upd2: In my case the set contains 100-300 elements. Changes of the max/min elements are very likely, so the max/min changes often. And I need to track the max/min. I still want O(1) for add/remove.
Here's an impossibility result with bad constants for worst-case, non-amortized bounds in a deterministic model where keys can be compared and hashed but nothing else. (Yes, that's a lot of stipulations. I second the recommendation of a van Emde Boas tree.)
As is usually the case with comparison lower bounds, this is an adversary argument. The adversary's game plan is to insert many keys while selectively deleting the ones about which the algorithm has the most information. Eventually, the algorithm will be unable to handle a call to max().
The adversary decides key comparisons as follows. Associated with each key is a binary string. Each key initially has an empty string. When two keys are compared, their strings are extended minimally so that neither is a prefix of the other, and the comparison is decided according to the dictionary order. For example, with keys x, y, z, we could have:
x < y: string(x) is now 0, string(y) is now 1
x < z: string(z) is now 1
y < z: string(y) is now 10, string(z) is now 11.
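One way such an adversary could decide comparisons, as a rough Java sketch (the choice of which key receives the 0-bit on a tie is arbitrary; this particular rule reproduces the x, y, z example above):

import java.util.HashMap;
import java.util.Map;

// Adversary comparator: each key carries a binary string, initially empty.
// On a comparison the strings are extended minimally so that neither is a
// prefix of the other, and the result follows dictionary order.
class Adversary {
    private final Map<Object, String> label = new HashMap<>();

    // returns true iff the adversary declares x < y
    boolean less(Object x, Object y) {
        String sx = label.getOrDefault(x, "");
        String sy = label.getOrDefault(y, "");
        if (sx.equals(sy)) {                 // identical (possibly empty) strings:
            sx = sx + "0";                   // extend both; the first argument gets 0
            sy = sy + "1";
        } else if (sy.startsWith(sx)) {      // sx is a proper prefix of sy:
            sx = sx + (sy.charAt(sx.length()) == '0' ? "1" : "0");   // flip the next bit
        } else if (sx.startsWith(sy)) {      // sy is a proper prefix of sx:
            sy = sy + (sx.charAt(sy.length()) == '0' ? "1" : "0");
        }
        label.put(x, sx);
        label.put(y, sy);
        return sx.compareTo(sy) < 0;         // dictionary order on the extended strings
    }
}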
Let k be a worst-case upper bound on the number of key comparisons made by one operation. Each key comparison increases the total string length by at most two, so for every sequence of at most 3 * n insert/delete operations, the total string length is at most 6 * k * n. If we insert 2 * n distinct keys with interspersed deletions whenever there is a key whose string has length at least 6 * k, then we delete at most n keys on the way to a set with at least n keys where each key has a string shorter than 6 * k bits.
Extend each key's string arbitrarily to 6 * k bits. The (6 * k)-bit prefix of a key's string is the bucket to which the key belongs. Right now, the algorithm has no information about the relative order of keys within a bucket. There are 2 ** (6 * k) buckets, which we imagine laid out left to right in the increasing order dictated by the (6 * k)-bit prefixes. For n sufficiently large, there exists a bucket with a constant (depending on k) fraction of the keys and at least 2 * k times as many keys as the combined buckets to its right. Delete the latter keys, and max() requires a linear number of additional comparisons to sort out the big bucket that now holds the maximum, because at most a little more than half of the required work has been done by the deletions.
Well, you know that max/min < CONST and that the elements are all numbers. Based on this you can get O(1) insertion and O(k + n/k) find min/max (1).
Have an array of size k; each element of the array will be a hash set. At insertion, insert element x into array[floor((x - MIN) / (MAX - MIN) * k)] (special case for x = MAX). Assuming a uniform distribution of elements, each hash set then holds an expected n/k elements.
At deletion, remove the element from the relevant set in the same way.
findMax() is now done as follows: find the largest index whose set is not empty - it takes O(k) worst case - and then find the maximal element in that set, which takes O(n/k) on average.
Finding the optimal k:
We need to minimize k + n/k.
d(k + n/k)/dk = 1 - n/k^2 = 0
n = k^2
k = sqrt(n)
This gives us O(sqrt(n) + n/sqrt(n)) = O(sqrt(n)) find min/max on average, with O(1) insertion/deletion.
From time to time you might need to 'reset' the table due to extreme changes of max and min, but given a 'safe boundary' - I believe in most cases this won't be an issue.
Just make sure your MAX is something like 2*max, and MIN is 1/2*min every time you 'reset' the DS.
(1) Assuming all elements come from a known distribution. In my answer I assume a uniform distribution - so P(x) = P(y) for each x, y - but it is fairly easy to modify it for any known distribution.
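A rough Java sketch of this scheme, assuming double keys in a known range [MIN, MAX] and a fixed k (names are illustrative; the periodic 'reset' of the boundaries is left out):

import java.util.HashSet;

// Bucketed set: k hash sets covering equal slices of [min, max].
// Expected O(1) add/remove, O(k + n/k) findMax with k ~ sqrt(n).
class BucketedSet {
    private final HashSet<Double>[] buckets;
    private final double min, max;

    @SuppressWarnings("unchecked")
    BucketedSet(double min, double max, int k) {
        this.min = min;
        this.max = max;
        this.buckets = new HashSet[k];
        for (int i = 0; i < k; i++) buckets[i] = new HashSet<>();
    }

    private int index(double x) {
        int i = (int) Math.floor((x - min) / (max - min) * buckets.length);
        return Math.min(i, buckets.length - 1);     // special case x == max
    }

    void add(double x)    { buckets[index(x)].add(x); }
    void remove(double x) { buckets[index(x)].remove(x); }

    // Scan the buckets from the right for the first non-empty one (O(k)),
    // then scan that bucket for its maximum (expected O(n/k)).
    Double findMax() {
        for (int i = buckets.length - 1; i >= 0; i--) {
            Double best = null;
            for (double x : buckets[i])
                if (best == null || x > best) best = x;
            if (best != null) return best;
        }
        return null;                                // the set is empty
    }
}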

Removing items from a list - algorithm time complexity

The problem consists of two sorted lists, with no duplicates, of sizes n and m. The first list contains strings that should be deleted from the second list.
The simplest algorithm would have to do n*m operations (I believe the terminology for this is "quadratic time"?).
An improved solution would take advantage of the fact that both lists are sorted and skip, in future comparisons, strings whose index is lower than the last deleted index.
I wonder what time complexity would that be?
Are there any solutions for this problem with better time complexity?
You should look into Merge sort. This is the basic idea behind why it works efficiently.
The idea is to scan the two lists together, which takes O(n+m) time:
Keep a pointer x for the first list, say A, and another pointer y for the second list, say B. Set x = 0 and y = 0. While x < n and y < m: if A[x] < B[y], add A[x] to the new merged list and increment x; otherwise add B[y] to the new list and increment y. Once you hit x = n or y = m, take the remaining elements from B or A, respectively.
I believe the complexity would be O(n+m), because every item in each of the lists would be visited exactly once.
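Applied to the deletion problem itself, the same kind of two-pointer scan might look like this (a Java sketch; A holds the strings to delete, B is the list to filter, both assumed sorted and duplicate-free):

import java.util.ArrayList;
import java.util.List;

// Remove every string of sorted list A from sorted list B in O(n + m) comparisons.
static List<String> removeAll(List<String> a, List<String> b) {
    List<String> result = new ArrayList<>();
    int x = 0;                                       // pointer into A
    for (String s : b) {
        // advance x past all A-entries smaller than the current B-entry
        while (x < a.size() && a.get(x).compareTo(s) < 0) x++;
        // keep s only if it does not match the current A-entry
        if (x == a.size() || !a.get(x).equals(s)) result.add(s);
    }
    return result;
}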
A counting/bucket sort algorithm would work where each string in the second list is a bucket.
You go through the second list (takes m time) and create your buckets. You then go through the first list (takes n time) and increment the number of occurrences. You then have to go through each bucket again (takes m time) and return only the strings that occur once. A trie or a HashMap would work well for storing the buckets. This should be O(n + m + m). If you use a HashSet, then in the second pass, instead of incrementing a counter, you remove from the set; that should be O(n + m + (m - n)).
Might it be O(m + log(n)) if binary search is used?

Find the N-th most frequent number in the array

Find the N-th most frequent number in an array.
(There is no limit on the range of the numbers)
I think we can
(i) store the occurrence of every element using a map in C++,
(ii) build a max-heap of the occurrences (or frequencies) of the elements in linear time and then extract up to the N-th element;
each extraction takes log(n) time to heapify,
(iii) this gives us the frequency of the N-th most frequent number,
(iv) then we can do a linear search through the map to find the element having this frequency.
Time - O(N log N)
Space - O(N)
Is there any better method ?
It can be done in linear time and space. Let T be the total number of elements in the input array from which we have to find the Nth most frequent number:
Count and store the frequency of every number of the array in a map. Let M be the total number of distinct elements, so the size of the map is M. -- O(T)
Find the N-th largest frequency in the map using a selection algorithm (e.g. quickselect); see the sketch below. -- O(M)
Total time = O(T) + O(M) = O(T)
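A possible Java sketch of this approach (helper names are made up; frequencies are counted with a HashMap, then quickselect is run over the array of frequencies):

import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Return a number whose frequency is the n-th largest, in expected O(T + M) time.
static int nthMostFrequent(int[] arr, int n) {
    // 1) Count the frequency of every number: O(T)
    Map<Integer, Integer> freq = new HashMap<>();
    for (int v : arr) freq.merge(v, 1, Integer::sum);
    if (n < 1 || n > freq.size())
        throw new IllegalArgumentException("n out of range");

    // 2) Quickselect the n-th largest frequency: expected O(M)
    int[] counts = new int[freq.size()];
    int idx = 0;
    for (int c : freq.values()) counts[idx++] = c;
    int target = quickselectLargest(counts, n);

    // 3) Return any number having that frequency: O(M)
    for (Map.Entry<Integer, Integer> e : freq.entrySet())
        if (e.getValue() == target) return e.getKey();
    throw new AssertionError("unreachable");
}

// Classic randomized quickselect for the k-th largest value (1-based), expected linear time.
static int quickselectLargest(int[] a, int k) {
    Random rnd = new Random();
    int lo = 0, hi = a.length - 1, kth = a.length - k;   // k-th largest = (len - k)-th smallest
    while (lo < hi) {
        int pivot = a[lo + rnd.nextInt(hi - lo + 1)];
        int i = lo, j = hi;
        while (i <= j) {                                 // Hoare-style partition around the pivot
            while (a[i] < pivot) i++;
            while (a[j] > pivot) j--;
            if (i <= j) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; j--; }
        }
        if (kth <= j) hi = j;                            // answer lies in the left part
        else if (kth >= i) lo = i;                       // answer lies in the right part
        else break;                                      // a[kth] already equals the pivot
    }
    return a[kth];
}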
Your method is basically right. You would avoid the final hash search if you marked each vertex of the constructed heap with the number it represents. Moreover, it is possible to keep watching the N-th element of the heap as you build it, because at some point the outcome cannot change anymore and the rest of the computation can be dropped. But this would probably not make the algorithm faster in the general case, and maybe not even in special cases. So you answered your own question correctly.
It depends on whether you want the most efficient method or the easiest to write.
1) If you know that all numbers are between 0 and 1000, you just make an array of 1000 zeros (occurrence counts), loop through your array and increment the right occurrence position. Then you sort these occurrences and select the N-th value (sketched below).
2) You have a "bag" of unique items; you loop through your numbers, check whether the number is in the bag, if not you add it, and if it is there you just increment its number of occurrences. Then you pick the N-th most frequent number from it.
The bag can be a linear array, a BST, or a dictionary (hash table).
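Option 1 might be sketched like this, assuming values in 0..1000 (value indices are sorted by their counts so we still know which value owns the N-th largest count; values that never occur are carried along with count 0):

import java.util.Arrays;
import java.util.Comparator;

// Option 1: values known to lie in 0..1000. Count, sort the values by count, pick the N-th.
static int nthMostFrequentBounded(int[] arr, int n) {
    int[] count = new int[1001];
    for (int v : arr) count[v]++;                        // one counting pass over the input

    Integer[] values = new Integer[1001];
    for (int v = 0; v <= 1000; v++) values[v] = v;
    Arrays.sort(values, Comparator.comparingInt((Integer v) -> -count[v]));  // most frequent first

    return values[n - 1];                                // the N-th most frequent value (1-based)
}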
The question is "N-th most frequent", so I think you cannot avoid sorting (or a clever data structure), so the best complexity cannot be better than O(n*log(n)).
Just wrote a method in Java 8 (this is not an efficient solution):
Create a frequency map of the elements.
Sort the map entries by value in reverse (descending) order.
Skip the first (N - 1) entries, then take the next one.
private static Integer findMostNthFrequentElement(int[] inputs, int frequency) {
    return Arrays.stream(inputs).boxed()
            // frequency map: element -> number of occurrences
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
            // sort the entries by count, highest first
            .entrySet().stream().sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
            // drop the first (frequency - 1) entries and take the next one
            .skip(frequency - 1).findFirst().get().getKey();
}
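For example, findMostNthFrequentElement(new int[]{1, 1, 2, 2, 2, 3}, 2) returns 1, the value with the second-highest count; note that get() throws if the requested rank exceeds the number of distinct values.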

Data structure for storing data and calculating average

In this problem, we are interested in a data structure that supports keeping an unbounded number of vectors parallel to the Y axis.
Each node contains a location (X-axis value) and a height (Y-axis value). We can assume that no two vectors share the same location.
Please advise for an efficient data structure that supports:
init((x1,y1),(x2,y2),(x3,y3),...,(xn,yn)) - the DS will contain all n vectors, where VECTOR#i's location is xi and VECTOR#i's height is yi.
We also know that x1 < x2 < x3 < ... < xn (nothing is known about the y values) - complexity: O(n) on average
insert(x,y) - add a vector with location x and height y - complexity: O(log n) amortized on average
update(x,y) - update the height of the vector at location x to y - complexity: O(log n) worst case
average_around(x) - return the average height of the log n neighbors of x - complexity: O(1) on average
Space Complexity: O(n)
I can't provide a full answer, but it might be a hint into the right direction.
Basic ideas:
Let's assume you've calculated the average of n numbers a_1, ..., a_n; this average is avg = (a_1 + ... + a_n)/n. If we now replace a_n by b, we can recalculate the new average as avg' = (a_1 + ... + a_(n-1) + b)/n, or, simpler, avg' = (avg*n - a_n + b)/n. That means that if we exchange one element, we can recompute the average from the original average value with a few fast operations, and we don't need to re-iterate over all elements participating in the average.
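For example, if the average of the three numbers 2, 4, 6 is 4 and we replace the 6 by a 9, the new average is (4*3 - 6 + 9)/3 = 5, computed without touching the other two numbers.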
Note: I assume that you want to have log n neighbours on each side, i.e. in total we have 2 log(n) neighbours. You can simply adapt it if you want to have log(n) neighbours in total. Moreover, since log n in most cases won't be a natural number, I assume that you are talking about floor(log n), but I'll just write log n for simplicity.
The main thing I'm considering is the fact that you have to report the average around element x in O(1). Thus, I suppose you have to precompute this average somehow and store it. So, I would store the following in a node:
x value
y value
average around
Note that update(x,y) runs strictly in O(log n) if you have this structure: If you update element x to height y, you have to consider the 2log(n) neighbours whose average is affected by this change. You can recalculate each of these averages in O(1):
Let's assume, update(x,y) affects an element b, whose average is to be updated as well. Then, you simply multiply average(b) by the number of neighbours (2log(n) as stated above). Then, we subtract the old y-value of element x, and add the new (updated) y-value of x. After that, we divide by 2 log(n). This ensures that we now have the updated average for element b. This involved only some calculations and can thus be done in O(1). Since we have 2log n neighbours, update runs in O(2log n)=O(log n).
When you insert a new element e, you have to update the average of all elements affected by this new element e. This is essentially done like in the update routine. However, you have to be careful when log n (or precisely floor(log n)) changes its value. If floor(log n) stays the same (which it will, in most cases), then you can just do the analogue things described in update, however you will have to "remove" the height of one element, and "add" the height of the newly added element. In these "good" cases, run time is again strictly O(log n).
Now, when floor(log n) is changing (incrementing by 1), you have to perform an update for all elements. That is, you have to do an O(1) operation for n elements, resulting in a running time of O(n). However, it is very seldom the case that floor(log n) increments by 1 (you need to double the value of n to increment floor(log n) by 1 - assuming we are talking about log to base 2, which is not uncommon in computer science). We denote this time by c*n or simply cn.
Thus, let's consider a sequence of inserts: the first insert needs an update costing 1*c, the second one costing 2*c. The next time an expensive insert occurs is the fourth insert (4*c), then the eighth (8*c), then the sixteenth (16*c). The distance between two expensive inserts doubles each time:
insert #  1    2    3    4    5    6    7    8    9    10   11   12   13   14   15   16   17   18   ..
cost      1c   2c   1    4c   1    1    1    8c   1    1    1    1    1    1    1    16c  1    1    ..
Since no remove operation is required, we can continue our analysis without any "special cases" and consider only a sequence of inserts. You see that most inserts cost 1, while a few are expensive (1, 2, 4, 8, 16, 32, ...). So, if we have m inserts in total, we have roughly log m expensive inserts and roughly m - log m cheap inserts. For simplicity, we assume simply m cheap inserts.
Then, we can compute the cost for m inserts:
m*1 + sum(i = 0 .. log m) 2^i
Here m*1 counts the cheap operations and the sum counts the expensive ones. It can be shown that the whole thing is at most 4m (in fact you can even show better estimates quite easily, but for us this suffices).
Thus, we see that m insert operations cost at most 4m in total. Thus, a single insert operation costs at most 4m/m=4, thus is O(1) amortized.
So, there are 2 things left:
How to store all the entries?
How to initialize the data structure in O(n)?
I suggest storing all entries in a skip-list, or some tree that guarantees logarithmic search-operations (otherwise, insert and update require more than O(log n) for finding the correct position). Note that the data structure must be buildable in O(n) - which should be no big problem assuming the elements are sorted according to their x-coordinate.
To initialize the data structure in O(n), I suggest beginning at the element at index log n and computing its average the simple way (sum up the 2 log n neighbours and divide by 2 log n).
Then you move the index one position further and compute average(index) using average(index-1): average(index) = (average(index-1) * 2 log(n) - y(index-1-log(n)) + y(index+log(n))) / (2 log(n)).
That is, we follow a similar approach as in update. This means that computing the averages costs O(log n + n*1) = O(n). Thus, we can compute all the averages in O(n).
Note that you have to take some details into account which I haven't described here (e.g. border cases: element at index 1 does not have log(n) neighbours on both sides - how do you proceed with this?).
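A rough Java sketch of the precomputation and the update step on a plain array (the skip-list layer and the border cases are left out; the window here is the 2*w + 1 positions centred at the element, with w = floor(log2 n), which only changes constants compared to the 2 log n neighbours used in the text above):

// Sketch: position i keeps the running average of the window of 2*w + 1 heights centred at i.
class AverageAround {
    private final double[] y;        // heights, indexed by position
    private final double[] avg;      // precomputed window averages
    private final int w;             // window radius, floor(log2 n)

    AverageAround(double[] heights) {
        int n = heights.length;
        this.y = heights.clone();
        this.w = Math.max(1, (int) (Math.log(n) / Math.log(2)));
        this.avg = new double[n];
        int size = 2 * w + 1;

        // init in O(n): one direct sum for the first full window, then slide it to the right
        double sum = 0;
        for (int i = 0; i < size && i < n; i++) sum += y[i];
        for (int i = w; i + w < n; i++) {
            avg[i] = sum / size;
            if (i + w + 1 < n) sum += y[i + w + 1] - y[i - w];   // slide the window one step
        }
    }

    // O(log n): a change at position p affects only the 2*w + 1 windows that contain p
    void update(int p, double newY) {
        double delta = newY - y[p];
        y[p] = newY;
        int size = 2 * w + 1;
        for (int i = Math.max(w, p - w); i <= p + w && i + w < y.length; i++)
            avg[i] += delta / size;
    }

    // O(1): just read the precomputed value
    double averageAround(int p) { return avg[p]; }
}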

Interview challenge: Find the different elements in two arrays

Stage 1: Given two arrays, say A[] and B[], how could you find out whether the elements of B are in A?
Stage 2: What if the size of A[] is 10000000000000... and B[] is much smaller than that?
Stage 3: What if the size of B[] is also 10000000000.....?
My answer is as follows:
Stage 1:
double for loop - O(N^2);
sort A[], then binary search - O(NlgN)
Stage 2:
using a bit set, since the integers are 32 bits...
Stage 3: ..
Do you have any good ideas?
Hash all elements of A [iterate over the array and insert the elements into a hash set], then iterate over B and check for each element whether it is in A or not. You get an average running time of O(|A| + |B|).
You cannot get sub-linear complexity, so this solution is optimal for an average-case analysis; however, since hashing is not O(1) worst case, you might get bad worst-case performance.
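A minimal Java sketch of that approach (names are illustrative):

import java.util.HashSet;
import java.util.Set;

// Stage 1: report which elements of B are present in A, expected O(|A| + |B|).
static boolean[] containedInA(int[] a, int[] b) {
    Set<Integer> setA = new HashSet<>();
    for (int x : a) setA.add(x);                     // hash all elements of A

    boolean[] present = new boolean[b.length];
    for (int i = 0; i < b.length; i++)
        present[i] = setA.contains(b[i]);            // O(1) expected per lookup
    return present;
}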
EDIT:
If you don't have enough space to store a hash set of the elements of B, you might want to consider a probabilistic solution using Bloom filters. The problem: there might be some false positives [but never false negatives]. The probability of a correct answer increases as you allocate more space for the Bloom filter.
The other solution is, as you said, to sort, which takes O(n log n) time, and then use binary search in the sorted array for all elements of B.
For the 3rd stage you get the same complexity, O(n log n), with the same solution; it will take approximately double the time of stage 2, but it is still O(n log n).
EDIT2:
Note that instead of using a regular hash, you can sometimes use a trie [depends on your element type]. For example, for ints, store the number as if it were a string, with each digit acting as a character. With this solution you get O(|B|*num_digits + |A|*num_digits), where num_digits is the number of digits in your numbers [if they are ints]. Assuming num_digits is bounded by a constant, you get O(|A| + |B|) worst case.
Stage 1: make a hash set from A and iterate over B, checking whether the current element B[i] exists in A (the same way that #amit proposed earlier). Complexity (average) - O(length(A) + length(B)).
Stage 2: make a hash set from B, then iterate over A and, if the current element exists in B, remove it from B. If after the iteration B still has at least 1 element, then not all of B's elements exist in A; otherwise A is a complete superset of B. Complexity (average) - O(length(A) + length(B)).
Stage 3: sort both arrays in place and iterate, searching for equal numbers at the current positions i and j of A[i] and B[j] (the idea should be obvious). Complexity - O(n*log n), where n = length(A).
