Given a set of intervals S, what is an efficient way to find, for each interval in S, the number of intervals from S that it contains?
For example, say
S = (11,75), (14,62), (17,32), (24,48), (31,71), (34,74), (40,97), (41,58)
the expected output is
(11, 75) => 6 -> (14,62), (17,32), (24,48), (31,71), (34,74), (41,58)
(14, 62) => 3 -> (17,32), (24,48), (41,58)
(17, 32) => 0
(24, 48) => 0
(31, 71) => 1 -> (41,58)
(34, 74) => 1 -> (41,58)
(40, 97) => 1 -> (41,58)
(41, 58) => 0
Is it possible to get this mapping in O(n log n), or at least substantially less than O(n²)?
There seems to be an O(n*log(n)) way to do this. The intuition is that we need to organize the intervals so that, at each step, all intervals that the current one could contain have already been considered.
The algorithm is as follows:
Sort the intervals by end times in ascending order, and sort tied end times by their start times in descending order.
Iterate over the intervals and maintain a sorted set of all start times seen. The idea is that, when looking at the current interval, all intervals that it could contain have already been examined, and the number of intervals that the current one does contain is simply the number of elements in our built set that have a start time later than the current one.
Walking through the example, we first find that
sortedIntervals = [(17,32), (24,48), (41,58), (14,62), (31,71), (34,74), (11,75),(40,97)]
And let our sorted set of intervals (sorting now by start time) be
examinedIntervals = []
Now let's step through sortedIntervals
Consider (17,32). examinedIntervals is empty, so it doesn't contain anything.
Consider (24, 48). examinedIntervals = [(17, 32)] . Because there are no intervals that start after 24, we have that (24, 48) contains 0 intervals.
Consider (41, 58). examinedIntervals = [(17, 32), (24, 48)]. No intervals have a start time after 41, so (41, 58) contains 0 intervals
Consider (14, 62). examinedIntervals = [(17, 32), (24, 48), (41, 58)]. Now all three intervals have a start time after 14, so (14, 62) contains 3 intervals
Consider (31, 71). examinedIntervals = [(14, 62), (17, 32), (24, 48), (41, 58)]. Only one interval comes after 31, so (31, 71) contains 1 interval
Consider (34, 74). examinedIntervals = [(14, 62), (17, 32), (24, 48), (31, 71), (41, 58)]. One interval comes after 34, so (34, 74) contains 1 interval
Consider (11, 75). examinedIntervals = [(14, 62), (17, 32), (24, 48), (31, 71), (34, 74), (41, 58)], and all 6 intervals have a start time after 11, so (11, 75) contains 6 intervals.
Consider (40, 97). examinedIntervals = [(11, 75), (14, 62), (17, 32), (24, 48), (31, 71), (34, 74), (41, 58)]. Only one interval comes after 40, so (40, 97) contains 1 interval.
Summarizing, we do indeed get the correct results:
(40, 97) -> 1
(11, 75) -> 6
(34, 74) -> 1
(31, 71) -> 1
(14, 62) -> 3
(41, 58) -> 0
(24, 48) -> 0
(17, 32) -> 0
It can also be verified easily that the runtime is O(n*log(n)), assuming the use of an efficient sort and a balanced tree in the second part. The initial sort runs in that time. The second portion involves n insertions into a balanced binary tree of height O(log(n)), giving a runtime of O(n*log(n)). Because we're summing two steps that each run in O(n*log(n)), the overall runtime is O(n*log(n)).
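For concreteness, here is a minimal Python sketch of the algorithm above (the name count_contained is mine). It keeps the sorted start times in a plain list via bisect, which makes each insertion O(n) in the worst case; swap in a balanced-tree container (e.g. the third-party sortedcontainers.SortedList) to get true O(log(n)) inserts.

```python
import bisect

def count_contained(intervals):
    # Sort by end time ascending, breaking ties by start time descending.
    ordered = sorted(intervals, key=lambda iv: (iv[1], -iv[0]))
    starts = []   # sorted start times of all examined intervals
    counts = {}
    for start, end in ordered:
        # Every examined interval ends no later than `end`, so the ones
        # contained in (start, end) are exactly those starting after `start`.
        counts[(start, end)] = len(starts) - bisect.bisect_right(starts, start)
        bisect.insort(starts, start)
    return counts

S = [(11, 75), (14, 62), (17, 32), (24, 48),
     (31, 71), (34, 74), (40, 97), (41, 58)]
print(count_contained(S))  # (11, 75) -> 6, (14, 62) -> 3, ..., (41, 58) -> 0
```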
Related
Possible Interview Question: How to Find All Overlapping Intervals gives a solution for finding all the overlapping intervals. On top of that problem, imagine each interval has a weight. I am aiming to find the summed weight of the overlapping intervals when a new interval is inserted.
Condition: A newly inserted interval's end value is always larger than the previously inserted interval's end point, so the end points are already sorted.
When a new interval and its weight are inserted, the summed weight of every overlapped segment must be checked against the limit. For example, when we insert [15, 70] with weight 2, the summed weight of [15, 20] becomes 130, which should give an error since it exceeds the limit of 128; otherwise the newly inserted interval is appended to the list.
int limit = 128;
Inserted intervals in order:
order_come | start | end | weight
0 [10, 20] 32
1 [15, 25] 32
2 [5, 30] 32
3 [30, 40] 64
4 [1, 50] 16
5 [1, 60] 16
6 [15, 70] 2 <= should not be appended to the list.
Final overall summed weight view of the List after `[15, 70] 2` is inserted:
[60, 70, 2]
[50, 60, 18]
[40, 50, 34]
[30, 40, 98]
[25, 30, 66]
[20, 25, 98]
[15, 20, 130] <= exceeds the limit=128, throw an error.
[10, 15, 96]
[5, 10, 64]
[1, 5, 32]
[0, 0, 0]
Thank you for your valuable time and help.
O(log n)-time inserts are doable with an augmented binary search tree. To store
order_come | start | end | weight
0 [10, 20] 32
1 [15, 25] 32
2 [5, 30] 32
3 [30, 40] 64
4 [1, 50] 16
5 [1, 60] 16
we have a tree shaped like
          25
         /  \
        /    \
      10      50
     /  \    /  \
    5   20  40   60
   /   /   /
  1   15  30 ,
where each number represents the interval from it to its successor. Associated with each tree node are two numbers. The first we call ∆weight, defined to be the weight of the node's interval minus the weight of the node's parent's interval, if the parent exists (otherwise just the weight of the node's interval). The second we call ∆max, defined to be the maximum weight of an interval corresponding to a descendant of the node, minus the node's weight.
For the above example,
interval | tree node | total weight | ∆weight | ∆max
[1, 5)   |     1     |      32      |   -32   |   0
[5, 10)  |     5     |      64      |   -32   |   0
[10, 15) |    10     |      96      |    32   |  32
[15, 20) |    15     |     128      |    32   |   0
[20, 25) |    20     |      96      |     0   |  32
[25, 30) |    25     |      64      |    64   |  64
[30, 40) |    30     |      96      |    64   |   0
[40, 50) |    40     |      32      |    16   |  64
[50, 60) |    50     |      16      |   -48   |  80
[60, ∞)  |    60     |       0      |   -16   |   0
Binary search tree operations almost invariably require rotations. When we rotate a tree like
    p            c
   / \          / \
  c   r   =>   l   p
 / \              / \
l   g            g   r
we modify
c.∆weight += p.∆weight
g.∆weight += c.∆weight
g.∆weight -= p.∆weight
p.∆weight -= c.∆weight
p.∆max = max(0, g.∆max + g.∆weight, r.∆max + r.∆weight)
c.∆max = max(0, l.∆max + l.∆weight, p.∆max + p.∆weight).
The point of the augmentations is as follows. To find the max weight in the tree, compute r.∆max + r.∆weight, where r is the root. To increase every weight in a subtree by a given quantity ∂, increase the subtree root's ∆weight by ∂. By changing O(log n) nodes with inclusion-exclusion, we can increase the weights over a whole interval. Together, these operations allow us to evaluate an insertion in time O(log n).
To find the total weight of an interval, search for that interval as normal while also adding up the ∆weight values of that interval's ancestors. For example, to find the weight of [15, 30], we look for 15, traversing 25 (∆weight = 64), 10 (∆weight = 32), 20 (∆weight = 0), and 15 (∆weight = 32), for a total weight of 64 + 32 + 0 + 32 = 128.
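As a minimal illustration of that search (assuming a bare-bones Node class of my own, with ∆weight spelled dweight), the example tree can be built by hand from the table above and queried like this:

```python
class Node:
    def __init__(self, value, dweight, left=None, right=None):
        self.value = value      # point stored in the tree
        self.dweight = dweight  # ∆weight relative to the parent
        self.left = left
        self.right = right

def total_weight(root, key):
    # Walk the ordinary search path, summing ∆weight along the way.
    total, node = 0, root
    while node is not None:
        total += node.dweight
        if key == node.value:
            return total
        node = node.left if key < node.value else node.right
    raise KeyError(key)

# The example tree, with the ∆weight values from the table above.
tree = Node(25, 64,
            Node(10, 32,
                 Node(5, -32, Node(1, -32)),
                 Node(20, 0, Node(15, 32))),
            Node(50, -48,
                 Node(40, 16, Node(30, 64)),
                 Node(60, -16)))
print(total_weight(tree, 15))  # 128, matching the worked example
```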
To find the maximum total weight along a hypothetical interval, we do a modified search, something like the pseudocode below. Using another modified search, compute the greatest tree value less than or equal to start (call it predstart; let predstart = -∞ if all tree values are greater than start) and pass it to maxtotalweight.
maxtotalweight(root, predstart, end):
    if root is nil:
        return -∞
    if end <= root.value:
        return maxtotalweight(root.leftchild, predstart, end) + root.∆weight
    if predstart > root.value:
        return maxtotalweight(root.rightchild, predstart, end) + root.∆weight
    lmtw = maxtotalweight1a(root.leftchild, predstart)
    rmtw = maxtotalweight1b(root.rightchild, end)
    return max(lmtw, 0, rmtw) + root.∆weight

maxtotalweight1a(root, predstart):
    if root is nil:
        return -∞
    if predstart > root.value:
        return maxtotalweight1a(root.rightchild, predstart) + root.∆weight
    lmtw = maxtotalweight1a(root.leftchild, predstart)
    return max(lmtw, 0, root.rightchild.∆max + root.rightchild.∆weight) + root.∆weight

maxtotalweight1b(root, end):
    if root is nil:
        return -∞
    if end <= root.value:
        return maxtotalweight1b(root.leftchild, end) + root.∆weight
    rmtw = maxtotalweight1b(root.rightchild, end)
    return max(root.leftchild.∆max + root.leftchild.∆weight, 0, rmtw) + root.∆weight
We assume that nil has ∆weight = 0 and ∆max = -∞. Sorry for all of the missing details.
Using the terminology of the original answer, when you have the end-points
'1E 2E 3E ... (n-1)E nE'
already sorted and your (n+1)st end-point is greater than all previous end-points, you only need to find the intervals whose end-point value is greater than the (n+1)st start-point (greater or equal in the case of closed intervals).
In other words, iterate over the intervals starting from the right-most end-point and move left until you reach an interval whose end-point is less than or equal to the (n+1)st start-point, keeping a running sum of the weights. Then check whether the sum fits into the limit. The worst-case time complexity is O(n), when all previous intervals have an end-point greater than the (n+1)st start-point.
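A direct sketch of that scan (the name can_insert is mine; it assumes the intervals are kept as (start, end, weight) tuples, already sorted by end point as the question's condition guarantees, and it checks the total weight of all overlapping intervals, which is what this answer proposes):

```python
def can_insert(intervals, new_start, new_weight, limit=128):
    # Scan from the right-most end point leftward, summing the weights of
    # intervals whose end point lies beyond the new interval's start point.
    total = new_weight
    for start, end, weight in reversed(intervals):
        if end <= new_start:   # use `<` here for closed intervals
            break
        total += weight
    return total <= limit

history = [(10, 20, 32), (15, 25, 32), (5, 30, 32),
           (30, 40, 64), (1, 50, 16), (1, 60, 16)]
print(can_insert(history, 15, 2))  # False: the sum exceeds limit=128
```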
Yesterday, I got asked the following question during a technical interview.
Imagine that you are working for a news agency. At every discrete point of time t, a story breaks. Some stories are more interesting than others. This "hotness" is expressed as a natural number h, with greater numbers representing hotter news stories.
Given a stream S of n stories, your job is to find the hottest story out of the most recent k stories for every t >= k.
So far, so good: this is the moving maximum problem (also known as the sliding window maximum problem), and there is a linear-time algorithm that solves it.
Now the question gets harder. Of course, older stories are usually less hot compared to newer stories. Let the age a of the most recent story be zero, and let the age of any other story be one greater than the age of its succeeding story. The "improved hotness" of a story is then defined as max(0, min(h, k - a)).
Here's an example:
n = 11, k = 4
S indices: 0 1 2 3 4 5 6 7 8 9 10
S values: 1 3 1 7 1 3 9 3 1 3 1
mov max hot indices: 3 3 3 6 6 6 6 9
mov max hot values: 7 7 7 9 9 9 9 3
mov max imp-hot indices: 3 3 5 6 7 7 9 9
mov max imp-hot values: 4 3 3 4 3 3 3 3
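For reference, the improved-hotness rows above can be reproduced brute force, straight from the definition, in O(n*k) time:

```python
def brute_imp_hot(hots, k):
    out = []
    for t in range(k - 1, len(hots)):
        window = hots[t - k + 1 : t + 1]
        # ages run from k-1 (oldest story in the window) down to 0 (newest)
        imp = [max(0, min(h, k - a))
               for a, h in zip(range(k - 1, -1, -1), window)]
        out.append(max(imp))
    return out

print(brute_imp_hot([1, 3, 1, 7, 1, 3, 9, 3, 1, 3, 1], 4))
# [4, 3, 3, 4, 3, 3, 3, 3]
```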
I was at a complete loss with this question. I thought about adding the index to every element before computing the maximum, but that gives you the answer for when the hotness of a story decreases by one at every step, regardless of whether it reached the hotness bound or not.
Can you find an algorithm for this problem with sub-quadratic (ideally: linear) running time?
I'll sketch a linear-time solution to the original problem involving a double-ended queue (deque) and then extend it to improved hotness with no loss of asymptotic efficiency.
Original problem: keep a deque that contains the stories that are (1) newer or hotter than every other story so far and (2) still in the window. At any given time, the hottest story in the deque is at the front. New stories are pushed onto the back of the deque, after popping every story from the back until a hotter story is found. Stories are popped from the front as they age out of the window.
For example:
S indices: 0 1 2 3 4 5 6 7 8 9 10
S values: 1 3 1 7 1 3 9 3 1 3 1
deque: (front) [] (back)
push (0, 1)
deque: [(0, 1)]
pop (0, 1) because it's not hotter than (1, 3)
push (1, 3)
deque: [(1, 3)]
push (2, 1)
deque: [(1, 3), (2, 1)]
pop (2, 1) and then (1, 3) because they're not hotter than (3, 7)
push (3, 7)
deque: [(3, 7)]
push (4, 1)
deque: [(3, 7), (4, 1)]
pop (4, 1) because it's not hotter than (5, 3)
push (5, 3)
deque: [(3, 7), (5, 3)]
pop (5, 3) and then (3, 7) because they're not hotter than (6, 9)
push (6, 9)
deque: [(6, 9)]
push (7, 3)
deque: [(6, 9), (7, 3)]
push (8, 1)
deque: [(6, 9), (7, 3), (8, 1)]
pop (8, 1) and (7, 3) because they're not hotter than (9, 3)
push (9, 3)
deque: [(6, 9), (9, 3)]
push (10, 1)
pop (6, 9) because it exited the window
deque: [(9, 3), (10, 1)]
To handle the new problem, we modify how we handle aging stories. Instead of popping stories only as they slide out of the window, we pop the front story as soon as its age cap k - a falls to its hotness or below: from that point on, its improved hotness is just k - a, which decreases by one per step. When determining the top story, only the most recently popped story needs to be considered.
In Python:
import collections

Elem = collections.namedtuple('Elem', ('hot', 't'))

def winmaximphot(hots, k):
    q = collections.deque()
    oldtop = 0  # improved hotness of the most recently popped front story
    for t, hot in enumerate(hots):
        # Maintain the deque invariant: drop stories the new one beats.
        while q and q[-1].hot <= hot:
            del q[-1]
        q.append(Elem(hot, t))
        # Pop the front story once its age cap k - a reaches its hotness;
        # its improved hotness is then k - a, tracked in oldtop.
        while q and q[0].hot >= k - (t - q[0].t) > 0:
            oldtop = k - (t - q[0].t)
            del q[0]
        if t + 1 >= k:
            yield max(oldtop, q[0].hot) if q else oldtop
        oldtop = max(0, oldtop - 1)  # the popped story ages one more step

print(list(winmaximphot([1, 3, 1, 7, 1, 3, 9, 3, 1, 3, 1], 4)))
The idea is the following: each breaking news story beats all previous stories after k-h steps. That means for k == 30 and a story of hotness h == 28, this story will be hotter than all previous stories after 2 steps.
Let's keep track of the moments in time at which each story becomes the hottest. At step i, the moment when the current story beats all previous ones equals i+k-h.
So we have a sequence of objects {news_date | news_beats_all_previous_ones_date}, which is in increasing order by news_beats_all_previous_ones_date:
{i1 | i1+k-h} {i3 | i3+k-h} {i4 | i4+k-h} {i7 | i7+k-h} {i8 | i8+k-h}
At the current step we compute i9+k-h, add it to the end of this list, and remove all values which are bigger (since the sequence is increasing, this is easy).
Once the first element's news_beats_all_previous_ones_date becomes equal to the current date (i), that story becomes the answer to the sliding-window query and we remove the item from the sequence.
So you need a data structure that supports adding to the end and removing from both the beginning and the end. That is a deque. The time complexity of the solution is O(n).
I'm trying to exhaustively permute a vector of size 20, but when I tried to use perms(v), I got the error
Error using perms (line 23)
Maximum variable size allowed by the program is exceeded.
I've read from the documentation that the memory required for vectors longer than 10 is astronomical. So I'm looking for an alternative.
What I'm trying to do is the following (using a smaller-scale example, where the vector is only of size 3 instead of 20): find all vectors x of length 3 where (x_i)^2 = 1, e.g.
(1, 1, 1),
(-1, 1, 1), (1, -1, 1), (1, 1, -1),
(-1, -1, 1), (-1, 1, -1), (1, -1, -1),
(-1, -1, -1)
I was trying to iteratively create the "base vector", where the number of '-1' elements increases from 0 to 20, then use perms(v) to permute each "base vector", but I ran into the memory problem.
Is there any alternative to do this?
There are 2^20 such vectors (about a million), so you can run a loop with a counter in the range 0..2^20-1 and map the counter value's binary representation to the needed vector (a zero bit to -1 and a one bit to +1, or vice versa). The simple mapping formula is:
Vector_Element = bit * 2 - 1
Example for length 4:
i=10
binary form 1 0 1 0
+/-1 vector: 1 -1 1 -1
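A sketch of this counter-to-vector mapping in Python (the same bit trick carries over to MATLAB, e.g. via bitget or dec2bin):

```python
n = 3  # use n = 20 for the original problem: 2**20 is only ~1e6 vectors
for i in range(2 ** n):
    # Bit b of the counter chooses the sign of element b:
    # a 0 bit maps to -1 and a 1 bit maps to +1 (Vector_Element = bit * 2 - 1).
    vec = [((i >> b) & 1) * 2 - 1 for b in range(n)]
    print(vec)
```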
Data is entered as the number followed by the weight. For example, if the data structure has data entered (1, 9) and (2, 1) then it should return the number 1 90% of the time and the number 2 10% of the time.
Which data structure should be used for this implementation? Also, what would the basic code for the query function look like?
Edit: I was considering a tree which stores the cumulative sum for every subtree. Say I have (1, 4), (2, 7), (3, 1), and (4, 11).
The tree would look like:
       23
      /  \
    11    12
   /  \   / \
  4    7 1   11
I do not know if this tree should be binary. Also, does it make sense to store the weights in the tree and map them to the number or somehow use the numbers given as data input?
From the value/weight tuples (1, 4), (2, 7), (3, 1), (4, 11), make an array with cumulative weight sums
[(1, 4), (2, 11), (3, 12), (4, 23)]
and get the value with a binary search on the cumulative-weight field.
It is not clear from the question how the query should work: randomly?
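Assuming the query is indeed a random draw, a minimal sketch with a cumulative array and binary search (build and query are my names):

```python
import bisect
import random

def build(pairs):
    # pairs: list of (value, weight) -> parallel arrays of values and
    # cumulative weight sums, e.g. [4, 11, 12, 23] for the example above
    values, cum, total = [], [], 0
    for value, weight in pairs:
        total += weight
        values.append(value)
        cum.append(total)
    return values, cum

def query(values, cum):
    # Draw uniformly from [0, total) and binary-search the cumulative sums.
    r = random.random() * cum[-1]
    return values[bisect.bisect_right(cum, r)]

values, cum = build([(1, 4), (2, 7), (3, 1), (4, 11)])
print(query(values, cum))  # returns 1 with probability 4/23, 2 with 7/23, ...
```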
I have an optimisation problem where I need to optimize the lengths of a fixed number of bins over a known period of time. The bins should contain minimal overlapping items with the same tag (see definition of items and tags later).
If the problem can be solved heuristically that is fine, the exact optimum is not important.
I was wondering if anybody had any suggestions as to approaches to try, or any ideas as to what this problem would be called.
The problem
Let's say we have n items, each with two attributes: a tag and a time range.
For an example we have the following items:
(tag: time range (s))
(1: 0, 2)
(1: 4, 5)
(1: 7, 8)
(1: 9, 15)
(2: 0, 5)
(2: 7, 11)
(2: 14, 20)
(3: 4, 6)
(3: 7, 11)
(4: 5, 15)
When plotted, this would look as follows:
Let's say we have to bin this 20-second period of time into 4 groups. We could do this by having 4 groups of length 5, which would look something like this:
The number of overlapping items with the same tag would be:
Group 1: 1 (tag 1)
Group 2: 2 (tag 1 and tag 3)
Group 3: 2 (tag 2)
Group 4: 0
Total overlapping items: 5
Another grouping selection for 4 groups would then be of lengths 4, 3, 2 and 11 seconds.
The number of overlapping items with the same tag would be:
Group 1: 0
Group 2: 0
Group 3: 0
Group 4: 1 (tag 2)
Attempts to solve (brute force)
I can find the optimum solution by dividing the whole period of time into small segments (say 1 second; for the above example there would be 20 segments).
I can then find all the integer compositions of the integer 20 that use 4 components. This provides 127 different compositions, e.g.
(1, 1, 4, 14), (9, 5, 5, 1), (1, 4, 4, 11), (13, 3, 3, 1), (3, 4, 4, 9), (10, 5, 4, 1), (7, 6, 6, 1), (1, 3, 5, 11), (2, 4, 4, 10) ......
For (1, 1, 4, 14) the grouping would be 4 groups of 1, 1, 4 and 14 seconds.
I then find the composition with the best score (smallest number of overlapping tags).
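A sketch of this enumeration (the scoring function depends on the exact overlap-counting rules, so it is only stubbed out here; the names are mine):

```python
from itertools import combinations

def compositions(total, parts):
    # Every way to split `total` seconds into `parts` positive integer
    # lengths, generated by choosing parts - 1 interior cut points.
    for cuts in combinations(range(1, total), parts - 1):
        yield tuple(b - a for a, b in zip((0,) + cuts, cuts + (total,)))

def score(lengths, items):
    ...  # count overlapping same-tag items per group, as described above

# e.g. (1, 1, 4, 14) means groups of 1, 1, 4 and 14 seconds
# best = min(compositions(20, 4), key=lambda lengths: score(lengths, items))
```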
The problem with this approach is that it can only be done for relatively small inputs, as the number of compositions of an integer grows incredibly fast as the integer increases.
Therefore, if my data spans 1000 seconds and I have to use segments of 1 second, the runtime would be too long.
Attempts to solve (heuristically)
I have tried a genetic-algorithm-type approach, where chromosomes are randomly created compositions of lengths and genes are the individual lengths of each group. Due to the nature of the data, though, I am struggling to define any meaningful crossover or mutation operations.
Does anyone have any suggestions?