Designing a data structure that works in O(log n) time - algorithm

I'm auditing this algorithms class for work and I'm trying to do some practice problems given in class. This problem has me stumped and I just can't wrap my head around it. None of my solutions come out in O(log n) time. Can anyone help me with this problem?
Question:
Suppose that we are given a sequence of n values x1, x2, ..., xn in an arbitrary order and seek to quickly answer repeated queries of the form: given an arbitrary pair i and j with 1 ≤ i < j ≤ n, find the smallest value in x1, ..., xj. Design a data structure that uses O(n) space and answers each query in O(log n) time.

For input a1, a2, a3, ..., an, construct a root node that holds the minimum of (a1, ..., ak) and the minimum of (ak+1, ..., an), where k = n/2.
Recursively construct the rest of the tree.
Now, if you want to find the minimum between ai and aj:
Identify the lowest common ancestor of i and j; call it k.
Start at i and keep moving up until you hit k. At every step, check whether the node you came from was a left child. If it was, compare against the right sibling subtree's minimum and update the current minimum accordingly.
Similarly, walk up from j, checking whether the node you came from was a right child, and fold in the left sibling subtree's minimum.
At node k, compare the values obtained from the two walks and return the smaller one.
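Here is a minimal Python sketch of this idea (the MinTree class name, 0-based indexing, and the array-based node layout are my own choices, not the answerer's). Each node stores the minimum of its half of the range; the build uses O(n) time and space, and a query only descends into the O(log n) subtrees that straddle [i, j], which plays the same role as the walk up to the lowest common ancestor described above.

class MinTree:
    def __init__(self, values):
        self.n = len(values)
        self.tree = [0] * (4 * self.n)           # generous fixed-size array layout
        self._build(1, 0, self.n - 1, values)

    def _build(self, node, lo, hi, values):
        if lo == hi:                              # leaf: a single value
            self.tree[node] = values[lo]
            return
        mid = (lo + hi) // 2
        self._build(2 * node, lo, mid, values)
        self._build(2 * node + 1, mid + 1, hi, values)
        self.tree[node] = min(self.tree[2 * node], self.tree[2 * node + 1])

    def query(self, i, j):
        """Minimum of values[i..j], inclusive, 0-based."""
        return self._query(1, 0, self.n - 1, i, j)

    def _query(self, node, lo, hi, i, j):
        if j < lo or hi < i:                      # node's range is disjoint from [i, j]
            return float('inf')
        if i <= lo and hi <= j:                   # node's range is fully inside [i, j]
            return self.tree[node]
        mid = (lo + hi) // 2
        return min(self._query(2 * node, lo, mid, i, j),
                   self._query(2 * node + 1, mid + 1, hi, i, j))

For example, MinTree([47, 13, 55, 29, 56, 9, 17, 48, 69, 15]).query(2, 7) returns 9.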

People are overthinking this. Suppose that you start with the list:
47, 13, 55, 29, 56, 9, 17, 48, 69, 15
Make the following list of lists:
47, 13, 55, 29, 56, 9, 17, 48, 69, 15
13, 29, 9, 17, 15
13, 9, 15
9, 15
9
I leave the construction of these lists, correct usage, and proof that they provide an answer to the original question as exercises for the reader. (It might not be homework for you, but it could easily be for someone, and I don't like giving complete answers to homework questions.)

I think the crucial step is that you'll need to sort the data beforehand. Then you can store the data in an array/list. Then you can run through a quick binary search in O(log n), picking out the first value that satisfies the condition (I'm assuming you meant between xi and xj, not x1 and xj).
Edit: on second thought, ensuring that the value satisfies the condition may not be as trivial as I thought.

The question was asked before in a slightly different way: What data structure using O(n) storage with O(log n) query time should I use for Range Minimum Queries?
Nevertheless, to answer quickly: the problem you're facing is a well-studied one, the Range Minimum Query problem. A segment tree is a data structure that can solve it with O(n) space and O(log n) query time. You can see more details here, where there's an explanation of the structure and the complexities involved.

Trying to explain the suggested data structure:
For every pair of consecutive numbers (x1 and x2, x3 and x4, and so on), calculate and keep the value of the smaller one.
For every four consecutive numbers, calculate and keep the value of the smallest of the four. This is done quickly by picking the smaller of the two pair values.
For every eight consecutive numbers, calculate and keep the value of the smallest of the eight.
And so on.
Let's say we want the smallest value of x19 to x65.
We look at the following stored values:
Smallest of x32 to x63.
Smallest of x24 to x31.
Smallest of x20 to x23.
x19.
Smallest of x64 to x65.
Then we pick the smallest of these.
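Here is a small Python sketch of this scheme (0-based indices; the names build_levels and range_min are mine). Each level stores the minimum of every aligned block of 2^k consecutive elements, which is exactly the list of lists from the earlier answer; the total space is n + n/2 + n/4 + ... = O(n), and a query greedily covers [i, j] with the largest aligned blocks that fit, reading O(log n) stored values.

def build_levels(xs):
    # levels[k] holds the min of each aligned block of 2**k consecutive elements
    levels = [list(xs)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([min(prev[i:i + 2]) for i in range(0, len(prev), 2)])
    return levels

def range_min(levels, i, j):
    """Minimum of xs[i..j], inclusive, 0-based."""
    best = float('inf')
    while i <= j:
        # pick the largest aligned block that starts at i and stays within j
        k = 0
        while i % (2 ** (k + 1)) == 0 and i + 2 ** (k + 1) - 1 <= j:
            k += 1
        best = min(best, levels[k][i // (2 ** k)])
        i += 2 ** k
    return best

levels = build_levels([47, 13, 55, 29, 56, 9, 17, 48, 69, 15])
print(range_min(levels, 1, 8))  # 9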

Related

Create a binary tree in O(n)

I have a sequence of n numbers and I want to create a data structure to answer the following question:
sequence n = [5, 7, 4, 24, 8, 3, 12, 34]
I want min(2, 5); the answer is 3 because (using 0-based indices) a2 = 4, a3 = 24, a4 = 8, a5 = 3. So min(i, j) should return the minimum number between positions i and j, and ideally its position as well.
I thought that a good data structure for this would be a complete binary tree that stores the sequence values at its leaves. But how can I build this structure in O(n)?
All you need is a segment tree with range minimum queries. Here is a detailed explanation of it. Building takes O(n) time, because the tree has no more than 2 * n nodes, so the final complexity is O(n).
If you need to find not only the minimum value but also its position, then each node should store both the minimum and the index where it is attained. Updating such a structure is straightforward: when you recompute the minimum in a parent, check which child it came from and take that child's stored position. For leaves, the position is simply the leaf's own index.
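Here is a rough sketch of that in Python (0-based indices; the iterative array layout and the names build_min_pos_tree and min_pos are my own choices). Each node stores a (minimum value, position) pair, the build is bottom-up in O(n), and a range query combines O(log n) nodes.

def build_min_pos_tree(values):
    n = len(values)
    tree = [None] * (2 * n)
    for i, v in enumerate(values):
        tree[n + i] = (v, i)                 # leaves: (value, its own index)
    for node in range(n - 1, 0, -1):         # parents: the smaller child wins
        tree[node] = min(tree[2 * node], tree[2 * node + 1])
    return tree

def min_pos(tree, i, j):
    """(min value, its position) over values[i..j], inclusive, 0-based."""
    n = len(tree) // 2
    best = (float('inf'), -1)
    lo, hi = i + n, j + n + 1                # half-open [lo, hi) over the leaf layer
    while lo < hi:
        if lo & 1:
            best = min(best, tree[lo]); lo += 1
        if hi & 1:
            hi -= 1; best = min(best, tree[hi])
        lo //= 2; hi //= 2
    return best

tree = build_min_pos_tree([5, 7, 4, 24, 8, 3, 12, 34])
print(min_pos(tree, 2, 5))  # (3, 5)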

Sorting algorithm based on subset inversion

I'm looking for a sorting algorithm based on subset inversion. It's like pancake sort, only instead of taking all the pancakes on top of the spatula, you can invert any contiguous subset (segment) you want. The length of the segment doesn't matter.
Like this:
http://www.yourgenome.org/sites/default/files/illustrations/diagram/dna_mutations_inversion_yourgenome.png
So we can't simply swap numbers without inverting everything in between.
We're doing this to determine how one subspecies of fruitfly can mutate into the other. Both have the same genes but in a different order. The second subspecies' genome is 'sorted', i.e. the gene numbers are 1-25. The first subspecies genome is unsorted. Hence, we're looking for a sorting algorithm.
This is the "genome" we're looking at (though we should be able to have this work on all lists of numbers):
[23, 1, 2, 11, 24, 22, 19, 6, 10, 7, 25, 20, 5, 8, 18, 12, 13, 14, 15, 16, 17, 21, 3, 4, 9];
We're looking at two separate problems:
1) To sort a list of 25 numbers with the fewest inversions
2) To sort a list of 25 numbers with the fewest numbers moved
We also want to establish both upper and lower bounds for both.
We've already found a way to sort like this by just going from left to right, searching for the next lowest value and inverting everything in between, but we're absolutely certain we should be able to do this faster. However, we still haven't found any other methods so I'm asking for your help!
UPDATE: the method we currently use is based on the above method, but works from both ends. It looks at the next elements needed at both ends (e.g. 1 and 25 at the beginning) and then calculates which inversion would be cheapest. All values at the ends can be ignored for the rest of the algorithm because they get put into the correct place immediately. Our first method took 18/19 steps and 148 genes, and this one does it in 17 steps and 101 genes. For both optimisation tactics (the two mentioned above), this is a better method. It is, however, not cheaper in terms of code and processing.
Right now, we're working in Python because we have most experience with that, but I'd be happy with any pseudocode ideas on how we can more efficiently tackle this. If you think another language might be better suited, please let me know. Pseudocode, ideas, thoughts and actual code are all welcome!
Thanks in advance!
Regarding the first question: Do you know (and care about) which of the two strands the genes are on?
If so, you're in luck: This is called the inversion distance between signed permutations problem, and there is a linear-time algorithm for it: http://www.ncbi.nlm.nih.gov/pubmed/11694179. I haven't looked at the details.
If not, then unfortunately (as described on p. 2 of that paper) the problem is NP-hard, so it's very unlikely that any algorithm exists that is efficient (polynomial-time) in the worst case.
Regarding the second question: Assuming you mean that you want to find the minimum number of swaps needed to sort a list of numbers, you should be able to find solutions to this by searching here on SO and elsewhere. I think this is a clear and concise explanation. You can also use the optimal solution to this problem to get an upper bound for your first question: Any swap of positions i and j can be simulated using the two interval reversals (i, j) and (i+1, j-1). (This upper bound might be very bad, though, and in particular could be worse than your existing greedy algorithm.)
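As a quick check of that simulation, here is a tiny piece of toy Python (my own, 0-based indices): reversing (i, j) and then (i+1, j-1) swaps the elements at positions i and j and leaves everything in between in its original order.

def swap_by_reversals(a, i, j):
    a[i:j + 1] = a[i:j + 1][::-1]    # first reversal: positions i..j
    a[i + 1:j] = a[i + 1:j][::-1]    # second reversal: positions i+1..j-1
    return a

print(swap_by_reversals([0, 1, 2, 3, 4, 5], 1, 4))  # [0, 4, 2, 3, 1, 5]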
I think what you're looking for in the second question is the minimum number of adjacent swaps needed to sort the sequence, which is equal to the number of inversions in the sequence (pairs with a[i] > a[j] and i < j).
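For reference, a standard way to compute that inversion count in O(n log n) is a modified merge sort; the sketch below is my own code, not part of the original answer, and returns the sorted list together with the number of inversions.

def count_inversions(a):
    """Return (sorted copy of a, number of pairs i < j with a[i] > a[j])."""
    if len(a) <= 1:
        return list(a), 0
    mid = len(a) // 2
    left, inv_left = count_inversions(a[:mid])
    right, inv_right = count_inversions(a[mid:])
    merged, inversions = [], inv_left + inv_right
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            # every remaining element of `left` is greater than right[j]
            merged.append(right[j]); j += 1
            inversions += len(left) - i
    merged.extend(left[i:]); merged.extend(right[j:])
    return merged, inversions

genome = [23, 1, 2, 11, 24, 22, 19, 6, 10, 7, 25, 20, 5, 8, 18,
          12, 13, 14, 15, 16, 17, 21, 3, 4, 9]
print(count_inversions(genome)[1])  # number of adjacent swaps needed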
The first question seems quite a bit more complicated to me. One potential heuristic might be to think of the subset inversion as similar to the adjacent swap of more than one element. For example, if you've managed to get a sequence to this position,
5,6,1,2,3,4,7,8
we can "adjacent swap" indexes [0,1] with [2,3] (so inverting [0,1,2,3]),
2,1,6,5,3,4,7,8
and then [2,3] with [4,5] (inverting [2,3,4,5]),
2,1,4,3,5,6,7,8
and arrive at a sequence that now has significantly fewer element inversions, meaning fewer single adjacent swaps are needed to complete the sort.
So maybe attempting to quantify inversions (in the sense of a[i] > a[j] and i < j) of sections rather than single elements could help move in the direction of estimating or building a method for the first question.

Search in ordered list

Assume we have a list 0, 10, 30, 45, 60, 70 sorted in ascending order. Given a number X, how do I find the number in the list immediately at or below it?
I am looking for the most efficient (fastest) algorithm to do this, without, of course, having to iterate through the whole list.
Ex: [0, 10, 30, 45, 60, 70]
Given the number 34, I want to return 30.
Given the number 30, I want to return 30.
Given the number 29, I want to return 10.
And so forth.
If your list is indeed that small, the most efficient way would be to create an array of size 71, initialize it once with arr[i] = answer, and answer each query in constant time with a single lookup. The idea is that since your possible set of queries is so limited, there is no reason not to pre-calculate the answers and read them from the pre-calculated data.
If you cannot pre-process, and the array is that small, a linear scan will be the most efficient option; the overhead of a more complex algorithm is not worth it for such a small array. Any overhead of a more complex algorithm (like binary search), which adds more instructions per iteration, is nullified for small arrays.
Note that log_2(6) < 3, and this is also roughly the expected number of iterations (assuming a uniform distribution of queries) for a linear search, but linear search is so much simpler that each iteration is much faster than a binary search iteration.
Pseudocode:
prev = -infinity
for (x in arr):
    if x > X:        # X is the query value
        return prev
    prev = x
return prev
If the array is getting larger, use binary search. This algorithm is designed to find a value (or the closest value at or below it) in a sorted array, runs in O(log n) time, and needs to visit significantly fewer elements than the entire list. It will achieve much better results (in terms of time performance) compared to the naive linear scan, assuming a uniform distribution of queries.
Is the list always sorted? Fast to get written or fast in execution time?
Look at this: http://epaperpress.com/sortsearch/download/sortsearch.pdf
Implement the binary search algorithm so that, when the element is not found, you return the element at the last visited position (if it is smaller than or equal to the given number) or the element at the position just before the last visited one (if the element at the last visited position is greater than the given number).
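A concrete sketch of that idea in Python (my choice of helper is the standard bisect module): the "step back one position" case corresponds to taking the element just before the insertion point.

import bisect

def floor_element(sorted_list, x):
    """Largest element <= x, or None if every element is greater than x."""
    pos = bisect.bisect_right(sorted_list, x)   # first index whose value is > x
    return sorted_list[pos - 1] if pos else None

values = [0, 10, 30, 45, 60, 70]
print(floor_element(values, 34))  # 30
print(floor_element(values, 30))  # 30
print(floor_element(values, 29))  # 10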

Check for duplicate subsequences of length >= N in sequence

I have a sequence of values and I want to know if it contains any repeated subsequences of a certain minimum length. For instance:
1, 2, 3, 4, 5, 100, 99, 101, 3, 4, 5, 100, 44, 99, 101
Contains the subsequence 3, 4, 5, 100 twice. It also contains the subsequence 99, 101 twice, but that subsequence is too short to care about.
Is there an efficient algorithm for checking the existence of such a subsequence? I'm not especially interested in locating the sequences (though that would be helpful for verification); I'm primarily just interested in a True/False answer, given a sequence and a minimum subsequence length.
My only approach so far is to brute force search it: for each item in the sequence, find all the other locations where the item occurs (already at O(N^2)), and then walk forward one step at a time from each location and see if the next item matches, and keep going until I find a mismatch or find a matching subsequence of sufficient length.
Another thought I had, but haven't been able to develop into an actual approach, is to build a tree of all the sequences, so that each number is a node and a child of the number that preceded it, wherever that node happens to already be in the tree.
There are O(k) solutions (where k is the length of the whole sequence) for any value of N.
Solution #1: Build a suffix tree for the input sequence (using Ukkonen's algorithm). Iterate over the nodes with two or more children and check whether at least one of them has string depth >= N.
Solution #2: Build a suffix automaton for the input sequence. Iterate over all the states whose right context contains at least two different strings and check whether at least one of those states is at distance >= N from the initial state of the automaton.
Solution #3: A suffix array and the longest-common-prefix technique can also be used (build the suffix array for the input sequence, compute the longest-common-prefix array, and check whether there is a pair of adjacent suffixes with a common prefix of length at least N).
These solutions have O(k) time complexity under the assumption that the alphabet size is constant (the alphabet consists of all elements of the input sequence).
If that is not the case, it is still possible to obtain O(k log k) worst-case time complexity (by storing the transitions of the tree or the automaton in a map), or O(k) on average using a hash map.
P.S. I use the terms string and sequence interchangeably here.
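To illustrate Solution #3 concretely (this is only an illustration: the suffix array below is built by naive sorting, not by the linear-time constructions the answer refers to), it is enough to check whether two adjacent suffixes in the suffix array share a common prefix of length at least N.

def has_repeat_of_length(seq, N):
    seq = list(seq)
    k = len(seq)
    # naive suffix array: sort suffixes directly (fine for small inputs)
    order = sorted(range(k), key=lambda i: seq[i:])
    for a, b in zip(order, order[1:]):
        # longest common prefix of two adjacent suffixes
        lcp = 0
        while a + lcp < k and b + lcp < k and seq[a + lcp] == seq[b + lcp]:
            lcp += 1
            if lcp >= N:
                return True
    return False

data = [1, 2, 3, 4, 5, 100, 99, 101, 3, 4, 5, 100, 44, 99, 101]
print(has_repeat_of_length(data, 4))  # True  (3, 4, 5, 100 appears twice)
print(has_repeat_of_length(data, 5))  # False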
If you only care about subsequences of length exactly N (for example, if you just want to check that there are no duplicates), then there is a quadratic solution: use the KMP algorithm for every subsequence.
Let's assume that there are k elements in the whole sequence.
For every subsequence of length N (O(k) of them):
Build its failure function (takes O(N))
Search for it in the remainder of the sequence (takes O(k))
So, assuming N << k, the whole algorithm is indeed O(k^2).
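Here is a sketch of that quadratic approach in Python (the function names are mine): for each window of length N, build its KMP failure function and search for it in the rest of the sequence.

def kmp_failure(pattern):
    fail = [0] * len(pattern)
    j = 0
    for i in range(1, len(pattern)):
        while j and pattern[i] != pattern[j]:
            j = fail[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        fail[i] = j
    return fail

def kmp_contains(text, pattern, fail):
    j = 0
    for c in text:
        while j and c != pattern[j]:
            j = fail[j - 1]
        if c == pattern[j]:
            j += 1
        if j == len(pattern):
            return True
    return False

def has_duplicate_window(seq, N):
    for start in range(len(seq) - N + 1):
        window = seq[start:start + N]
        if kmp_contains(seq[start + 1:], window, kmp_failure(window)):
            return True
    return False

data = [1, 2, 3, 4, 5, 100, 99, 101, 3, 4, 5, 100, 44, 99, 101]
print(has_duplicate_window(data, 4))  # True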
Since your list is unordered, you're going to have to visit every item at least once.
What I'm thinking is that you first go through your list and create a dictionary where you store each number as a key along with all the indices at which it appears in your sequence. Like:
Key: Indices
1: 0
2: 1
3: 2, 8
....
Where the number 1 appears at index 0, the number 2 appears at index 1, the number 3 appears at indices 2 and 8, and so on.
With that created, you can then go through the dictionary keys and compare the subsequences that begin at the other locations for each key. This should save some of the brute force, since you don't have to re-scan the entire initial sequence for each number.
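A rough sketch of that approach (my own code and naming): build the value-to-indices dictionary, then only try to extend matches between positions that start with the same value, stopping as soon as a match of the minimum length is found.

from collections import defaultdict

def has_repeat_with_index(seq, N):
    positions = defaultdict(list)
    for idx, value in enumerate(seq):
        positions[value].append(idx)      # value -> all indices where it occurs
    for idxs in positions.values():
        for a_i, a in enumerate(idxs):
            for b in idxs[a_i + 1:]:
                # extend the match starting at positions a and b
                length = 0
                while b + length < len(seq) and seq[a + length] == seq[b + length]:
                    length += 1
                    if length >= N:
                        return True
    return False

data = [1, 2, 3, 4, 5, 100, 99, 101, 3, 4, 5, 100, 44, 99, 101]
print(has_repeat_with_index(data, 4))  # True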

Positioning an ordered sequence of intervals for maximum alignment with another sequence of fixed intervals

I have two sequences of intervals.
The first is fixed and non-overlapping, so something like:
[1..10], [12..15], [23..56], [72..89], ...
The second is not fixed, so it's just an ordered list of interval lengths:
[7, 2, 5, 26, ...]
The task at hand is to:
Place every interval from the second list at a given starting point, so that the second list becomes a list of fixed, non-overlapping intervals much like the first, while preserving its order
Find the alignment that minimizes the amount of integers that are in some interval from one of the lists but not in any interval from the other list
Very simple example:
[25..26], [58..68], [74..76], [78..86]
[10, 12]
The optimal solution is to place the interval of length 10 at [58..68] and the interval of length 12 at [74..86] which results in only the numbers 25, 26, and 77 being in one list but not the other.
The only thing I've come up with that seems mildly helpful is that if I lay down the intervals in order, I know how many 'penalties' the intervals I've already placed have incurred, so I have an upper bound for the score, which means I have an admissible heuristic and I can do A* search instead of looking at the entire tree. However, the total range of numbers spans from 0 to about 34M, so I'd like something better.
Any help would be hot!
OK, here's a half-thought-out answer. It should work in polynomial time, but I haven't bothered checking what the exponent is. It may well be possible to get a better exponent than the approach outlined here. The details are left as an exercise to the reader :-) I hope it's not too unclear.
I'll define the score of a solution as the number of integers which appear in both lists of intervals. Let f(i,m) be the highest score it's possible to get using just the first i interval lengths, subject to the condition that none of your intervals goes above m. The function f, for fixed i, is essentially a (non-strictly) increasing function from the integers to a bounded subset of the integers. Therefore:
all values of f(i,m), for m > 0, are equal, with finitely many exceptions;
all values of f(i,m), for m < 0, are equal, with finitely many exceptions.
This means it's possible to represent all values of f(i,m) using a finite data structure (still considering a fixed value of i).
Now let F(i) be the value of this data structure representing all values of f(i,m). I claim that, given F(i), it is possible to calculate F(i+1). To do this, we only need to answer the following question for all x: if I place the new interval at x, how good is the best solution I can get? But we know what this is: it's just f(i,x) plus the score we get from this interval.
So if n is the number of intervals in the second list, the score of the best solution will be F(n).
To actually find the solution, you could work backwards from this.
You know the best score you can get; say it's s_0. Then put the last interval as far left as possible, subject to the condition that it still allows you to score s_0. That is, find the smallest m such that f(n,m) = s_0, and place the interval so that it only just stays inside the bound at m.
Then let s_1 be the score you need to get from all the other intervals in order to reach a total of s_0. Place the next-to-last interval as far left as possible, subject to the condition that you can still score s_1. That is, find the smallest m such that f(n-1,m) = s_1, and place the interval so that it only just stays inside the bound at m.
And so on...
