Create a binary tree in O(n) - algorithm

I have a sequence of n numbers and I want to create a data structure to answer the following kind of question:
sequence = [5, 7, 4, 24, 8, 3, 12, 34]
If I ask for min(2,5), the answer is 3, because a2=7, a3=4, a4=24, a5=8 and the smallest of these values, 4, is at position 3. So min(i,j) returns the position of the minimum number between positions i and j.
I thought that a good data structure for this would be a complete binary tree with the sequence numbers stored at the leaves. But how can I build this structure in O(n)?

All you need is a Segment Tree with range minimum query. Here is a detailed explanation of it. Building takes O(n) time, because the tree has no more than 2 * n nodes, so the final time complexity is O(n).
If you need to find not only the minimum value but also its position, then in each node you store not only the minimum but also the index where it is attained. Updating such a structure is straightforward: when you recompute the minimum at a parent, check which child it came from and take that child's stored position of the minimum. For leaves, the stored positions are simply the leaves' own indices.
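A minimal sketch of this in Python (the class name, the recursive layout and the 1-based query interface are my own choices, not anything prescribed by the question):

```python
class MinPosSegmentTree:
    """Segment tree where each node stores (minimum value, position of that minimum)."""
    def __init__(self, a):
        self.n = len(a)
        self.tree = [None] * (4 * self.n)
        self._build(1, 0, self.n - 1, a)

    def _build(self, node, lo, hi, a):
        if lo == hi:
            self.tree[node] = (a[lo], lo)        # leaf: value and its own index
            return
        mid = (lo + hi) // 2
        self._build(2 * node, lo, mid, a)
        self._build(2 * node + 1, mid + 1, hi, a)
        # the parent takes the smaller child's (value, position) pair
        self.tree[node] = min(self.tree[2 * node], self.tree[2 * node + 1])

    def min_pos(self, i, j):
        """Position of the minimum of a[i..j], 1-based inclusive as in the question."""
        _, pos = self._query(1, 0, self.n - 1, i - 1, j - 1)
        return pos + 1

    def _query(self, node, lo, hi, l, r):
        if l <= lo and hi <= r:
            return self.tree[node]
        mid = (lo + hi) // 2
        if r <= mid:
            return self._query(2 * node, lo, mid, l, r)
        if l > mid:
            return self._query(2 * node + 1, mid + 1, hi, l, r)
        return min(self._query(2 * node, lo, mid, l, r),
                   self._query(2 * node + 1, mid + 1, hi, l, r))
```

For the example sequence, MinPosSegmentTree([5, 7, 4, 24, 8, 3, 12, 34]).min_pos(2, 5) returns 3. Building visits each node once, so it is O(n); each query touches O(log n) nodes.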

Related

Finding the medians of multiple subarrays in an unsorted array

Suppose you are given an unsorted array of integers S and a list of ranges in T, return a list of medians from each of the ranges.
For example, S = [3,6,1,5,0,0,1,-2], T = [[1,3],[0,5],[4,4]]. Return [5, 2, 0].
Is there a better approach than running Median of Medians on each range? Can we somehow precompute/cache the results?
Let me introduce you to an interesting data structure called Wavelet Tree:
You build it by looking at the bit-string representation of your integers and recursively bisecting them:
You first separate your integers into those with most significant bit (MSB) 0 and those with MSB 1. However you store the MSBs in their original order in a bitvector. Then for each of these subsets of integers, you ignore the MSB and recursively repeat this construction for the next-most significant bit.
If you repeat this down to the least significant bit, you get a tree structure like this (note that the indices are just there for illustration, you should store only the bitvectors):
You can easily see that the construction of this data structure takes O(n log N) time where n is the number of integers and N is their maximum value.
Wavelet trees have the nice property that they represent the original sequence as well as their sorted counterpart at the same time:
If you read the topmost bitvector, you get the MSBs of the input sequence. To reconstruct the next bit of the entries, you can alternate between looking in the bitvector in the root's left child (if the MSB is 0) or in the right child (if the MSB is 1). For the following bits, you can continue recursively.
If you read the leaf nodes from left to right, you get the sorted sequence.
To use a Wavelet tree efficiently, you need two fundamental operations on the bitvectors:
rank1(k) tells you how many 1s come before the kth position in the bitvector, rank0 does the same for 0s
select1(k) tells you the index of the kth 1 in the bitvector, select0 does the same for 0s
Note that there are bitvector representations that require only o(n) (small o) bits of additional storage to implement these operations in O(1) time.
You can utilize them as follows:
If you are looking at the first 7 in the sequence above, it has index 3. If you now want to know which index it has in the right child node, you simply call rank1(3) on the root bitvector and get 2, which is exactly the index of the first 7 in the right child
If you are at the child containing 4544 and want to know the position of the second 4 (with index 2) in the parent node containing 46754476, you call select0(2) on the parent's bitvector and get the index 5.
Now how can you implement a range median query with this? The most important realization you need to make is that finding the median of a range of size k is equivalent to selecting the element of rank ceil(k/2) (the ceil(k/2)-th smallest) in that range.
The basic idea of the algorithm is similar to Quickselect: Bisect the element range and recurse only into the range containing the element you are looking for.
Let's say we want to find the median of the range starting at the second 2 (inclusive) and ending at the 1 (exclusive).
These are 7 elements, thus the median has rank 4 (fourth-smallest element) in that range.
Now using a rank0/1 call in the root bitvector at the beginning and end of this range, we find the corresponding ranges in the children of the root:
As you can see, the left range (which contains only smaller elements) has only 3 elements, thus the element with rank 4 must be contained in the right child of the root. We can now recursively search for the element with rank 4 - 3 = 1 in that right child. By recursively descending the wavelet tree until you reach a leaf, you can thus identify the median with only two rank operations (each taking O(1) time) per level of the Wavelet tree, so the whole range median query takes O(log N) time where N is the maximum number in your input sequence.
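To make this descent concrete, here is a rough Python sketch of the same idea. It bisects on the value range instead of on explicit bit positions (equivalent when the value range is a power of two) and stores plain prefix-count arrays in place of succinct rank/select bitvectors, so it is nowhere near as compact as a real Wavelet tree, but the range-quantile (and hence median) query walks down exactly as described above; all names are my own:

```python
class WaveletTree:
    """Value-range wavelet tree: each inner node records, for every prefix of its
    subsequence, how many elements go to the left (smaller-value) child."""
    def __init__(self, seq, lo=None, hi=None):
        if lo is None:
            lo, hi = min(seq), max(seq)
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi or not seq:
            return
        mid = (lo + hi) // 2
        # prefix[i] = how many of the first i elements belong to the left child;
        # it plays the role of rank0 on this node's bitvector.
        self.prefix = [0]
        for x in seq:
            self.prefix.append(self.prefix[-1] + (1 if x <= mid else 0))
        self.left = WaveletTree([x for x in seq if x <= mid], lo, mid)
        self.right = WaveletTree([x for x in seq if x > mid], mid + 1, hi)

    def kth_smallest(self, l, r, k):
        """k-th smallest value (1-based) among seq[l:r] (half-open index range)."""
        if self.lo == self.hi:
            return self.lo
        in_left = self.prefix[r] - self.prefix[l]   # elements of [l, r) in the left child
        if k <= in_left:
            return self.left.kth_smallest(self.prefix[l], self.prefix[r], k)
        return self.right.kth_smallest(l - self.prefix[l], r - self.prefix[r], k - in_left)

def range_median(wt, l, r):
    """Median of seq[l:r], using the rank ceil(size/2) convention from the text."""
    return wt.kth_smallest(l, r, (r - l + 1) // 2)
```

For the question's S = [3, 6, 1, 5, 0, 0, 1, -2], range_median(WaveletTree(S), 1, 4) returns 5, the median of the first query range [1, 3].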
If you want to see a practical implementation of these Wavelet trees, have a look at the Succinct Data Structures Library (SDSL) which implements the aforementioned bitvectors and different WT variants.

Check for duplicate subsequences of length >= N in sequence

I have a sequence of values and I want to know if it contains any repeated subsequences of a certain minimum length. For instance:
1, 2, 3, 4, 5, 100, 99, 101, 3, 4, 5, 100, 44, 99, 101
contains the subsequence 3, 4, 5, 100 twice. It also contains the subsequence 99, 101 twice, but that subsequence is too short to care about.
Is there an efficient algorithm for checking the existence of such a subsequence? I'm not especially interested in locating the sequences (though that would be helpful for verification); I'm primarily just interested in a True/False answer, given a sequence and a minimum subsequence length.
My only approach so far is to brute-force search it: for each item in the sequence, find all the other locations where the item occurs (already O(N^2)), and then walk forward one step at a time from each location and see if the next item matches, and keep going until I find a mismatch or a matching subsequence of sufficient length.
Another thought I had, but haven't been able to develop into an actual approach, is to build a tree of all the sequences, so that each number is a node and a child of the number that preceded it, wherever that node happens to already be in the tree.
There are O(k) solutions (where k is the length of the whole sequence) for any value of N.
Solution #1: Build a suffix tree for the input sequence (using Ukkonen's algorithm). Iterate over the nodes with two or more children and check whether at least one of them has depth >= N.
Solution #2: Build a suffix automaton for the input sequence. Iterate over all the states whose right context contains at least two different strings and check whether at least one of those states is at distance >= N from the initial state of the automaton.
Solution #3: A suffix array and the longest-common-prefix technique can also be used (build the suffix array for the input sequence, compute the longest-common-prefix array, and check whether there is a pair of adjacent suffixes with a common prefix of length at least N).
These solutions have O(k) time complexity under the assumption that the alphabet size is constant (the alphabet consists of all elements of the input sequence).
If that is not the case, it is still possible to obtain O(k log k) worst-case time complexity (by storing all transitions in the tree or in the automaton in a map) or O(k) on average using a hash map.
P.S. I use the terms string and sequence interchangeably here.
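As an illustration, here is a compact Python sketch of solution #3. The suffix array is built naively by sorting suffix slices (fine for small inputs; an O(k log k) or linear construction would replace it for large ones), and the LCP array is computed with Kasai's algorithm. The function name is my own:

```python
def has_repeat_of_length(seq, N):
    """True if some contiguous run of length >= N occurs at two different positions."""
    k = len(seq)
    if N == 0:
        return True
    if N > k:
        return False
    # Naive suffix array: sort suffix start positions by the suffixes themselves.
    sa = sorted(range(k), key=lambda i: seq[i:])
    rank = [0] * k
    for pos, i in enumerate(sa):
        rank[i] = pos
    # Kasai's algorithm: lcp[p] = common prefix length of suffixes sa[p] and sa[p-1].
    lcp = [0] * k
    h = 0
    for i in range(k):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < k and j + h < k and seq[i + h] == seq[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h:
                h -= 1
        else:
            h = 0
    # A repeat of length >= N exists iff two adjacent suffixes share a prefix that long.
    return max(lcp) >= N
```

For the example sequence, has_repeat_of_length([1, 2, 3, 4, 5, 100, 99, 101, 3, 4, 5, 100, 44, 99, 101], 4) returns True (the repeated run 3, 4, 5, 100), while asking for length 5 returns False.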
If you only care about subsequences of length exactly N (for example, if you just want to check that there are no duplicates), then there is a quadratic solution: run the KMP algorithm for every subsequence.
Let's assume that there are k elements in the whole sequence.
For every subsequence of length N (O(k) of them):
Build its failure function (takes O(N))
Search for it in the remainder of the sequence (takes O(k))
So, assuming N << k, the whole algorithm is indeed O(k^2).
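A sketch of that quadratic check in Python (the function name is mine); since any repeat of length >= N contains a repeat of length exactly N, this also answers the original yes/no question:

```python
def has_duplicate_window(seq, N):
    """Quadratic check: does some window of length exactly N occur again later in seq?"""
    k = len(seq)
    if not 0 < N <= k:
        return False
    for start in range(k - N + 1):
        pattern = seq[start:start + N]
        # Failure function of the pattern: O(N).
        fail = [0] * N
        j = 0
        for i in range(1, N):
            while j and pattern[i] != pattern[j]:
                j = fail[j - 1]
            if pattern[i] == pattern[j]:
                j += 1
            fail[i] = j
        # KMP search for the pattern in the remainder of the sequence: O(k).
        j = 0
        for i in range(start + 1, k):
            while j and seq[i] != pattern[j]:
                j = fail[j - 1]
            if seq[i] == pattern[j]:
                j += 1
            if j == N:
                return True
    return False
```

For the example sequence, has_duplicate_window(..., 4) returns True because the window starting at index 2 reappears at index 8.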
Since your list is unordered, you're going to have to visit every item at least once.
What I'm thinking is that you first go through your list and create a dictionary where you store the number as a key along with all the indices it appears in your sequence. Like:
Key: Indices
1: 0
2: 1
3: 2, 8
....
Where the number 1 appears at index 0, the number 2 appears at index 1, the number 3 appears at indices 2 and 8, and so on.
With that created you can then go through the dictionary keys and start comparing it against the sequences at the other locations. This should save on some of the brute force since you don't have to revisit each number through the initial sequence each time.
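A minimal sketch of that first pass in Python (the helper name is mine):

```python
from collections import defaultdict

def index_positions(seq):
    """Map each value in seq to the list of indices where it appears."""
    positions = defaultdict(list)
    for i, value in enumerate(seq):
        positions[value].append(i)
    return positions
```

For the example sequence this gives positions[3] == [2, 8] and positions[4] == [3, 9], which are the starting points you would then extend and compare.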

Minimum interval containing all the objects

Given a set of objects, each of which is placed at several locations on a Natural number line: Find the smallest interval [a, b] containing all the objects.
Example: Consider 3 objects A, B, C
A is placed at 1, 5, 7
B is placed at 2, 4, 6
C is placed at 4, 8, 9
The smallest interval that encompasses all the three objects is [4, 5].
I can only think of an O(S^2) solution, where S is the size of the minimal range containing all the object locations, i.e. [1, 9] in the example.
Is there a better way to do this ?
PS : Note that multiple objects can be placed at the same location.
Sort all the data points in ascending order (nlogn time).
Traverse these data points from the left.
Keep track of the following:
1. For each type of object, maintain an entry of the coordinate of last object found (maybe through a hashmap for fast operation).
2. Minimum interval length found till now.
3. The coordinate of the earliest element in the list. This is to keep track of the start of current interval.
Whenever you encounter an object,
1. Update its entry in the maintained list.
2. Check whether the coordinate of the earliest element (the minimum of the stored last-seen coordinates) has changed. If so, calculate the new interval length, from that earliest coordinate to the current one, and update the minimum interval length if the new one is smaller.
You will first need to ascertain that you have encountered all types of objects before calculating the first valid minimum interval length. You can do that with a counter.
If the number of different types of elements is bounded and small, then the order of complexity is O(nlogn) where n is the total number of data points.
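A short Python sketch of this sweep, assuming the input is given as a dict from object to its list of positions (that representation and the function name are my own assumptions):

```python
def smallest_interval_last_seen(placements):
    """Sweep sorted (position, object) points, remembering each object's last position."""
    points = sorted((pos, obj) for obj, locs in placements.items() for pos in locs)
    last = {}        # object -> coordinate of its most recent occurrence so far
    best = None
    for pos, obj in points:
        last[obj] = pos
        if len(last) == len(placements):          # every object type has been seen
            start = min(last.values())            # earliest of the last-seen coordinates
            if best is None or pos - start < best[1] - best[0]:
                best = [start, pos]
    return best
```

For the example (A at 1, 5, 7; B at 2, 4, 6; C at 4, 8, 9), smallest_interval_last_seen({'A': [1, 5, 7], 'B': [2, 4, 6], 'C': [4, 8, 9]}) returns [4, 5]. The min over last.values() makes this O(n * types) after the sort, which matches the note above about the number of object types being small.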
You can do this in O(N) using 2 indices while going through the list. (Let's call them left and right.)
You start with both of them at position 1, and then increment right until [left, right] contains all the objects. You know that this is the minimum interval starting at left that has all the objects. Now increment left. Now increment right again until you have all the objects. (Note: many times you don't even have to increment.) Take the minimum over all the complete intervals and you have your answer.
This works because if you know [left, right] is the minimum interval starting at left, the interval starting at left+1 will have its right >= the last right.
This is O(N) because you add the elements of a location once, and delete them once.
You'll need to use a hash to count the unique elements.
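A Python sketch of this two-pointer sweep, again assuming a dict from object to positions; the hash mentioned above becomes the count dictionary tracking which objects are currently inside the [left, right] window (names are mine):

```python
from collections import defaultdict

def smallest_covering_interval(placements):
    """Shortest [a, b] containing at least one position of every object."""
    # Flatten to (position, object) pairs and sort by position: O(n log n).
    points = sorted((pos, obj) for obj, locs in placements.items() for pos in locs)
    need = len(placements)
    count = defaultdict(int)   # occurrences of each object inside the current window
    covered = 0                # number of distinct objects inside the current window
    best, left = None, 0
    for right, (pos_r, obj_r) in enumerate(points):
        if count[obj_r] == 0:
            covered += 1
        count[obj_r] += 1
        # Shrink from the left while the window still covers every object.
        while covered == need:
            pos_l, obj_l = points[left]
            if best is None or pos_r - pos_l < best[1] - best[0]:
                best = [pos_l, pos_r]
            count[obj_l] -= 1
            if count[obj_l] == 0:
                covered -= 1
            left += 1
    return best
```

smallest_covering_interval({'A': [1, 5, 7], 'B': [2, 4, 6], 'C': [4, 8, 9]}) returns [4, 5]. Every point enters and leaves the window exactly once, so the sweep after sorting is linear.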

Finding closest number in a range

I thought of a problem which is as follows:
We have an array A of integers of size n, and we have t test cases. In every test case we are given a number m and a range [s, e], i.e. we are given s and e, and we have to find the number closest to m among A[s]..A[e].
You may assume array indices are from 1 to n.
For example:
A = {5, 12, 9, 18, 19}
m = 13
s = 4 and e = 5
So the answer should be 18.
Constraints:
n<=10^5
t<=n
All I can think of is an O(n) solution for every test case, and I think a better solution exists.
This is a rough sketch:
Create a segment tree from the data. At each node, besides the usual data like left and right indices, you also store the numbers found in the sub-tree rooted at that node, stored in sorted order. You can achieve this when you construct the segment tree in bottom-up order. In the node just above the leaf, you store the two leaf values in sorted order. In an intermediate node, you keep the numbers in the left child, and right child, which you can merge together using standard merging. There are O(n) nodes in the tree, and keeping this data should take overall O(nlog(n)).
Once you have this tree, for every query, walk down the path till you reach the appropriate node(s) in the given range ([s, e]). As the tutorial shows, one or more different nodes would combine to form the given range. As the tree depth is O(log(n)), reaching these nodes takes O(log(n)) per query. For all the nodes which lie completely inside the range, find the closest number using binary search in the sorted array stored in those nodes, which is another O(log(n)) per node. Find the closest among all these, and that is the answer. Thus, you can answer each query in O(log^2(n)) time: O(log(n)) canonical nodes, each with an O(log(n)) binary search.
The tutorial I link to contains other data structures, such as sparse table, which are easier to implement, and should give O(sqrt(n)) per query. But I haven't thought much about this.
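A rough Python sketch of this merge-sort-tree idea (class and method names are mine; indices are 0-based internally, so the question's 1-based s and e are converted by the caller):

```python
import bisect

class MergeSortTree:
    """Segment tree whose every node keeps its segment's values in sorted order."""
    def __init__(self, a):
        self.n = len(a)
        self.tree = [None] * (4 * self.n)
        self._build(1, 0, self.n - 1, a)

    def _build(self, node, lo, hi, a):
        if lo == hi:
            self.tree[node] = [a[lo]]
            return
        mid = (lo + hi) // 2
        self._build(2 * node, lo, mid, a)
        self._build(2 * node + 1, mid + 1, hi, a)
        # Merge the two sorted child lists, as in merge sort.
        left, right = self.tree[2 * node], self.tree[2 * node + 1]
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                merged.append(right[j]); j += 1
        merged.extend(left[i:]); merged.extend(right[j:])
        self.tree[node] = merged

    def closest(self, s, e, m):
        """Value of a[s..e] (0-based, inclusive) closest to m."""
        return self._query(1, 0, self.n - 1, s, e, m)

    def _query(self, node, lo, hi, s, e, m):
        if e < lo or hi < s:
            return None
        if s <= lo and hi <= e:
            # Binary search this node's sorted list for the neighbours of m.
            vals = self.tree[node]
            i = bisect.bisect_left(vals, m)
            cands = [vals[j] for j in (i - 1, i) if 0 <= j < len(vals)]
            return min(cands, key=lambda v: abs(v - m))
        mid = (lo + hi) // 2
        found = [x for x in (self._query(2 * node, lo, mid, s, e, m),
                             self._query(2 * node + 1, mid + 1, hi, s, e, m))
                 if x is not None]
        return min(found, key=lambda v: abs(v - m))
```

For A = [5, 12, 9, 18, 19], m = 13 and the 1-based range s = 4, e = 5, MergeSortTree(A).closest(3, 4, 13) returns 18, matching the example. Building takes O(n log n) time and space, and each query costs O(log^2 n).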
Sort the array and do binary search. Complexity: O(n log n + t log n).
I'm fairly sure no faster solution exists. A slight variation of your problem is:
There is no array A, but each test case contains an unsorted array of numbers to search. (The array slice of A from s to e).
In that case, there is clearly no better way than a linear search for each test case.
Now, in what way is your original problem more specific than the variation above? The only added information is that all the slices come from the same array. I don't think that this additional constraint can be used for an algorithmic speedup.
EDIT: I stand corrected. The segment tree data structure should work.

Designing a data structure that works in O(logn) time

I'm auditing this algorithms class for work and I'm trying to do some practice problems given in class. This problem has me stumped and I just can't wrap my head around it. None of my solutions come out in O(logn) time. Can anyone help me with this problem??
Question:
Suppose that we are given a sequence of n values x1, x2, ... , xn in an arbitrary order and
seek to quickly answer repeated queries of the form: given an arbitrary pair i and j with
1 ≤ i < j ≤ n, find the smallest value in x1, ... , xj . Design a data structure that uses O(n) space and answers each query in O(log n) time.
For input a1, a2, a3, ..., an, construct a root node that contains the minimum of (a1, ..., ak) and the minimum of (ak+1, ..., an), where k = n/2.
Recursively construct the rest of the tree.
Now, if you want to find the minimum between ai and aj:
Identify the lowest common ancestor of i and j. Let it be k.
Start with i and keep moving up until you hit k. At every step, check whether the current node is a left child. If yes, then compare against its right sibling subtree's min and update the current min accordingly.
Similarly, for j, moving up, check whether it is a right child and compare against the left sibling subtree's min.
At node k, compare the values returned from each side and return the min.
People are overthinking this. Suppose that you start with the list:
47, 13, 55, 29, 56, 9, 17, 48, 69, 15
Make the following list of lists:
47, 13, 55, 29, 56, 9, 17, 48, 69, 15
13, 29, 9, 17, 15
13, 9, 15
9, 15
9
I leave the construction of these lists, correct usage, and proof that they provide an answer to the original question as exercises for the reader. (It might not be homework for you, but it could easily be for someone, and I don't like giving complete answers to homework questions.)
I think the crucial step is that you'll need to sort the data beforehand. Then you can store the data in an array/list. Then you can run a quick binary search in O(log n), picking out the first value that satisfies the condition (I'm assuming you meant between xi and xj, not x1 and xj).
edit: on second thought, ensuring that the value satisfies the condition may not be as trivial as I thought
The question was asked before in a slightly different way: What data structure using O(n) storage with O(log n) query time should I use for Range Minimum Queries?
Nevertheless, to answer quickly: the problem you're facing is a well-studied one - Range Minimum Query. A Segment Tree is a data structure that can solve the problem with O(N) space and O(log N) time requirements. You can see more details here, where there's an explanation of the structure and the complexities involved.
Trying to explain the suggested data structure:
For every pair of numbers, calculate and keep the value of the smaller one.
For every four consecutive numbers, calculate and keep the value of the smallest of the four. This is done quickly by picking the smaller of the two pair values.
For every eight consecutive numbers, calculate and keep the value of the smallest of the eight.
And so on.
Let's say we want the smallest value of x19 to x65.
We look at the following stored values:
Smallest of x32 to x63.
Smallest of x24 to x31.
Smallest of x20 to x23.
x19.
Smallest of x64 to x65.
Then we pick the smallest of these.
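A compact Python sketch of exactly this structure, stored bottom-up in a single array padded to a power of two, so the level of pairwise minima sits just above the leaves, the level of four-element minima above that, and so on (the class name is mine):

```python
class MinTree:
    """Levels of block minima: minima of pairs, of fours, of eights, ..."""
    def __init__(self, a):
        self.size = 1
        while self.size < len(a):
            self.size *= 2
        self.t = [float('inf')] * (2 * self.size)
        self.t[self.size:self.size + len(a)] = a     # leaves hold the original values
        for i in range(self.size - 1, 0, -1):
            self.t[i] = min(self.t[2 * i], self.t[2 * i + 1])

    def query(self, l, r):
        """Minimum of a[l..r] (0-based, inclusive), combining O(log n) stored blocks."""
        res = float('inf')
        l += self.size
        r += self.size + 1
        while l < r:
            if l & 1:
                res = min(res, self.t[l]); l += 1
            if r & 1:
                r -= 1; res = min(res, self.t[r])
            l //= 2
            r //= 2
        return res
```

For the list 47, 13, 55, 29, 56, 9, 17, 48, 69, 15 from the answer above, MinTree([47, 13, 55, 29, 56, 9, 17, 48, 69, 15]).query(0, 4) returns 13. The structure uses O(n) space and each query combines at most two stored blocks per level, giving O(log n) time.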

Resources