Moving maximum variant - algorithm

Yesterday, I got asked the following question during a technical interview.
Imagine that you are working for a news agency. At every discrete point of time t, a story breaks. Some stories are more interesting than others. This "hotness" is expressed as a natural number h, with greater numbers representing hotter news stories.
Given a stream S of n stories, your job is to find the hottest story out of the most recent k stories for every t >= k.
So far, so good: this is the moving maximum problem (also known as the sliding window maximum problem), and there is a linear-time algorithm that solves it.
Now the question gets harder. Of course, older stories are usually less hot compared to newer stories. Let the age a of the most recent story be zero, and let the age of any other story be one greater than the age of its succeeding story. The "improved hotness" of a story is then defined as max(0, min(h, k - a)).
Here's an example:
n = 11, k = 4
S indices:                 0  1  2  3  4  5  6  7  8  9 10
S values:                  1  3  1  7  1  3  9  3  1  3  1
mov max hot indices:                3  3  3  6  6  6  6  9
mov max hot values:                 7  7  7  9  9  9  9  3
mov max imp-hot indices:            3  3  5  6  7  7  9  9
mov max imp-hot values:             4  3  3  4  3  3  3  3
I was at a complete loss with this question. I thought about adding the index to every element before computing the maximum, but that gives you the answer for when the hotness of a story decreases by one at every step, regardless of whether it reached the hotness bound or not.
Can you find an algorithm for this problem with sub-quadratic (ideally: linear) running time?

I'll sketch a linear-time solution to the original problem involving a double-ended queue (deque) and then extend it to improved hotness with no loss of asymptotic efficiency.
Original problem: keep a deque of the stories that are (1) still inside the window and (2) hotter than every story that arrived after them. At any given time, the hottest story in the window is at the front of the deque. A new story is pushed onto the back of the deque, after first popping stories from the back until one hotter than the new story is found. Stories are popped from the front as they age out of the window.
For example:
S indices: 0 1 2 3 4 5 6 7 8 9 10
S values: 1 3 1 7 1 3 9 3 1 3 1
deque: (front) [] (back)
push (0, 1)
deque: [(0, 1)]
pop (0, 1) because it's not hotter than (1, 3)
push (1, 3)
deque: [(1, 3)]
push (2, 1)
deque: [(1, 3), (2, 1)]
pop (2, 1) and then (1, 3) because they're not hotter than (3, 7)
push (3, 7)
deque: [(3, 7)]
push (4, 1)
deque: [(3, 7), (4, 1)]
pop (4, 1) because it's not hotter than (5, 3)
push (5, 3)
deque: [(3, 7), (5, 3)]
pop (5, 3) and then (3, 7) because they're not hotter than (6, 9)
push (6, 9)
deque: [(6, 9)]
push (7, 3)
deque: [(6, 9), (7, 3)]
push (8, 1)
deque: [(6, 9), (7, 3), (8, 1)]
pop (8, 1) and (7, 3) because they're not hotter than (9, 3)
push (9, 3)
deque: [(6, 9), (9, 3)]
push (10, 1)
pop (6, 9) because it exited the window
deque: [(9, 3), (10, 1)]
To handle the new problem, we modify how aging stories are handled. Instead of popping a story only when it slides out of the window, we pop the front story as soon as its age cap k - a drops to its hotness or below, i.e. once its improved hotness is determined by its age rather than by h; from then on it loses one point of improved hotness per step. When determining the top story, only the most recently popped story needs to be considered, since its capped value dominates the (further decayed) values of stories popped earlier.
In Python:
import collections

Elem = collections.namedtuple('Elem', ('hot', 't'))

def winmaximphot(hots, k):
    q = collections.deque()
    oldtop = 0
    for t, hot in enumerate(hots):
        while q and q[-1].hot <= hot:
            del q[-1]
        q.append(Elem(hot, t))
        while q and q[0].hot >= k - (t - q[0].t) > 0:
            oldtop = k - (t - q[0].t)
            del q[0]
        if t + 1 >= k:
            yield max(oldtop, q[0].hot) if q else oldtop
        oldtop = max(0, oldtop - 1)

print(list(winmaximphot([1, 3, 1, 7, 1, 3, 9, 3, 1, 3, 1], 4)))

The idea is the following: each breaking story will beat all previous stories after k - h steps. That means for k == 30 and a story with hotness h == 28, this story will be hotter than all previous stories after 2 steps.
Let's keep track of the moments in time when each story becomes the hottest. At step i, the moment when the current story beats all previous ones is i + k - h.
So we maintain a sequence of objects {news_date | news_beats_all_previous_ones_date}, kept in increasing order of news_beats_all_previous_ones_date (here h1, h3, ... are the hotness values of the respective stories):
{i1 | i1+k-h1} {i3 | i3+k-h3} {i4 | i4+k-h4} {i7 | i7+k-h7} {i8 | i8+k-h8}
At the current step we compute i9 + k - h9 and add it to the end of this list, first removing all entries with a larger date (since the sequence is increasing, this is easy).
Once the first element's news_beats_all_previous_ones_date becomes equal to the current date i, that story becomes the answer to the sliding-window query and we remove it from the sequence.
So you need a data structure that supports adding at the end and removing from both the beginning and the end: a deque. The time complexity of the solution is O(n).
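A minimal sketch of the takeover-date computation described above (the function name is mine; the clamp to zero handles stories with h >= k, which are hottest immediately):

```python
def takeover_dates(hots, k):
    # The story breaking at time i with hotness h beats every earlier
    # story once k - h further steps have passed, i.e. at time i + (k - h).
    return [i + max(0, k - h) for i, h in enumerate(hots)]
```

These dates are what gets pushed onto the back of the deque; entries with larger dates are popped first so the sequence stays increasing.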

Related

Mapping an index to each pair of unique points

Given a sample size N, I want to map each pair of unique points to an index, in such a way that I can regenerate the pair from the index.
For example, let's say N=5; then the number of unique pairs is N*(N-1)/2 = 10.
The pairs are:
0: (0, 1)
1: (0, 2)
2: (0, 3)
3: (0, 4)
4: (1, 2)
5: (1, 3)
6: (1, 4)
7: (2, 3)
8: (2, 4)
9: (3, 4)
so given a specific i, let's say i=4, the mapping function should return (1, 2).
The original ordering of the pairs can be changed, if that helps.
I like to order the pairs (or, in general, ordered tuples) in what's called "colex" order, which is lexicographic order of the reversed tuple. Or, in other words, sorted by the largest element (and using the next largest element as a tie-breaker if the tuples are bigger than pairs.) This results in the ordering
0: (0, 1)
1: (0, 2)
2: (1, 2)
3: (0, 3)
4: (1, 3)
5: (2, 3)
6: (0, 4)
7: (1, 4)
8: (2, 4)
9: (3, 4)
The advantage of this ordering is that it doesn't depend on N, which is extremely helpful if you might later need to increase N without invalidating any existing index.
You can then compute (x, y) as:
n = floor((sqrt(8 * i + 1) - 1) / 2)
x = i - n * (n + 1) / 2
y = n + 1
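In Python, this computation might look as follows (the function name is mine; math.isqrt avoids floating-point rounding problems for large i):

```python
import math

def pair_from_index(i):
    # n is the largest integer with n * (n + 1) / 2 <= i,
    # i.e. n = floor((sqrt(8i + 1) - 1) / 2)
    n = (math.isqrt(8 * i + 1) - 1) // 2
    x = i - n * (n + 1) // 2
    return (x, n + 1)
```

Applying it to i = 0..9 reproduces the colex ordering listed above.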

What is the field of study of this algorithm problem?

In my app people can give a mark to other people out of ten points. At midnight, every day, I would like to run an algorithm that computes the best "match" for each person.
At the end of the day, I will have, for example:
(ID_person_who_gave_mark, ID_person_who_received_mark, mark)
(1, 2, 7.5) // 1 gave to 2 a 7.5/10
(1, 3, 9) // etc..
(1, 4, 6)
(2, 1, 5.5)
(2, 3, 4)
(2, 4, 8)
(3, 1, 3)
(3, 2, 10)
(3, 4, 9)
(4, 1, 2)
// no (4, 2, xx) because 4 didn't give any mark to 2
(4, 3, 6.5)
At the end of the algorithm, I would like every person to have the best match, that is, the best compromise to "make everyone happy".
In my example, I would say that person 1 gave a 9/10 to 3 but 3 gave a 3/10 to 1, so they definitely can't match; 1 gave a 7.5/10 to 2 and 2 gave a 5.5/10 to 1, so why not; and finally, 1 gave a 6/10 to 4 but 4 gave a 2/10 to 1, so they can't match (below 5/10 = they can't match). So for person 1 the only match would be with 2, but I have to check whether having 1 as a match is also good for 2.
2 gave a 4/10 to 3, so (2, 3) is out (below 5/10), but 2 gave an 8/10 to 4, so (2, 4) would be much better for 2 than (2, 1).
Let's look at 4: 4 didn't give any mark to 2, so they can't match. That leaves one possibility for 2: we make the match (2-1).
Now for 3: with 1 it's over (3/10); with 2 it would be great for 3 (10/10), but 2 gave 3 a 4/10, so that's out too. 3 gave a nice 9/10 to 4, so that would work. Checking 4: 4 gave a 2/10 to 1, so 1 is out (below 5/10), and gave a pretty nice 6.5 to 3. So the best match is finally between 4 and 3.
So our final matches in this example are (1-2) and (3-4).
Even when I do this intuitively, as I just did, it's complicated to find a precise procedure for computing all this information.
Can you help me "mathematise" this goal, so that I can have an algorithm that could do this calculation for, say, 50,000 people? Or at least, what is the field of study where I could find more information to solve this effectively?

How to implement a data structure which takes an integer and its weight, such that a query returns that integer weight% of the time?

Data is entered as a number followed by its weight. For example, if the data structure holds the entries (1, 9) and (2, 1), then it should return the number 1 90% of the time and the number 2 10% of the time.
Which data structure should be used for this implementation? Also, what would the basic code for the query function look like?
Edit: I was considering a tree which stores the cumulative sum for every subtree. Say I have (1, 4), (2, 7), (3, 1), and (4, 11).
The tree would look like:
        23
       /  \
     11    12
    /  \  /  \
   4   7  1  11
I do not know if this tree should be binary. Also, does it make sense to store the weights in the tree and map them to the number or somehow use the numbers given as data input?
From the value/weight tuples (1, 4), (2, 7), (3, 1), (4, 11), build an array with cumulative weight sums
[(1, 4), (2, 11), (3, 12), (4, 23)]
and look up values with a binary search on the cumulative-weight field.
It is not clear from the question how the query should work - randomly?
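A sketch of this cumulative-sum approach, assuming the query is indeed a uniform random draw (the function names build and query are mine):

```python
import bisect
import itertools
import random

def build(pairs):
    """Split (value, weight) pairs into values and running weight sums."""
    values = [value for value, _ in pairs]
    cumulative = list(itertools.accumulate(weight for _, weight in pairs))
    return values, cumulative

def query(values, cumulative):
    """Return a value with probability proportional to its weight."""
    # Draw a uniform point in [0, total weight) and binary-search for
    # the first cumulative sum strictly greater than it.
    r = random.random() * cumulative[-1]
    return values[bisect.bisect_right(cumulative, r)]
```

Building costs O(n) and each query costs O(log n); this does the same job as the tree in the question but with a flat array.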

Suggestions for optimizing length of bins for a time period

I have an optimisation problem where I need to optimise the lengths of a fixed number of bins over a known period of time. The bins should contain as few overlapping items with the same tag as possible (items and tags are defined below).
If the problem can only be solved heuristically, that is fine; the exact optimum is not important.
I was wondering if anybody had any suggestions as to approaches to try out for this, or had any ideas as to what the name of the problem would be.
The problem
Let's say we have n items, each with two attributes: a tag and a time range.
As an example, we have the following items:
(tag: time range (s))
(1: 0, 2)
(1: 4, 5)
(1: 7, 8)
(1: 9, 15)
(2: 0, 5)
(2: 7, 11)
(2: 14, 20)
(3: 4, 6)
(3: 7, 11)
(4: 5, 15)
When plotted this would look as follows:
Let's say we have to bin this 20-second period of time into 4 groups. We could do this by having 4 groups of length 5 seconds each.
It would look something like this:
The number of overlapping items with the same tag would be
Group 1: 1 (tag 1)
Group 2: 2 (tag 1 and tag 3)
Group 3: 2 (tag 2)
Group 4: 0
Total overlapping items: 5
Another grouping selection for 4 groups would then be of lengths 4, 3, 2 and 11 seconds.
The number of overlapping items with the same tag would be :
Group 1: 0
Group 2: 0
Group 3: 0
Group 4: 1 (tag 2)
Attempts to solve (brute force)
I can find the optimum solution by dividing the whole period of time into small segments (say 1 second each; for the above example there would be 20 segments).
I can then find all the integer compositions of the integer 20 that use 4 components. This gives C(19, 3) = 969 different compositions, e.g.
(1, 1, 4, 14), (9, 5, 5, 1), (1, 4, 4, 11), (13, 3, 3, 1), (3, 4, 4, 9), (10, 5, 4, 1), (7, 6, 6, 1), (1, 3, 5, 11), (2, 4, 4, 10) ......
For (1, 1, 4, 14) the grouping would be 4 groups of 1, 1, 4 and 14 seconds.
I then find the composition with the best score (smallest number of overlapping tags).
The problem with this approach is that it only works for relatively small numbers, as the number of compositions of an integer grows incredibly fast as the integer increases.
Therefore, if my data is 1000 seconds long with segments of 1 second, the running time would be far too long.
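The enumeration step of this brute force can be sketched via cut points, a standard stars-and-bars construction (the function name is mine; scoring a composition against the items is left out, since the exact overlap definition depends on the data):

```python
from itertools import combinations

def compositions(total, parts):
    # All orderings of `parts` positive integers summing to `total`,
    # generated by choosing `parts - 1` cut points in 1..total-1.
    for cuts in combinations(range(1, total), parts - 1):
        bounds = (0,) + cuts + (total,)
        yield tuple(b - a for a, b in zip(bounds, bounds[1:]))
```

For total = 20 and parts = 4 this yields C(19, 3) = 969 compositions, which is why the approach stops scaling quickly.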
Attempts to solve (heuristically)
I have tried a genetic-algorithm-style approach, where a chromosome is a randomly created composition of lengths and the genes are the individual lengths of each group. Due to the nature of the data, though, I am struggling to define any meaningful crossover/mutation operators.
Does anyone have any suggestions?

How to iterate through array combinations with constant sum efficiently?

I have an array whose length is X. Each element of the array is in the range 1 .. L. I want to iterate efficiently through all such arrays whose elements sum to L.
Correct solutions for: L = 4 and X = 2
1 3
3 1
2 2
Correct solutions for: L = 5 and X = 3
1 1 3
1 3 1
3 1 1
1 2 2
2 1 2
2 2 1
The naive implementation is (unsurprisingly) too slow for my problem (X is up to 8 in my case and L is up to 128).
Could anybody tell me how is this problem called or where to find a fast algorithm for the problem?
Thanks!
If I understand correctly, you're given two numbers 1 ≤ X ≤ L and you want to generate all sequences of positive integers of length X that sum to L.
(Note: this is similar to the integer partition problem, but not the same, because you consider 1,2,2 to be a different sequence from 2,1,2, whereas in the integer partition problem we ignore the order, so that these are considered to be the same partition.)
The sequences that you are looking for correspond to the combinations of X − 1 items out of L − 1. For, if we put the numbers 1 to L − 1 in order and pick X − 1 of them, then the lengths of the intervals between consecutive chosen numbers (with 0 added at the start and L at the end) are positive integers that sum to L.
For example, suppose that L is 16 and X is 5. Then choose 4 numbers from 1 to 15 inclusive, say 3, 7, 8 and 14.
Add 0 at the beginning and 16 at the end, giving 0, 3, 7, 8, 14, 16, and the intervals between consecutive numbers are 3, 4, 1, 6, 2,
and 3 + 4 + 1 + 6 + 2 = 16 as required.
So generate the combinations of X − 1 items out of L − 1, and for each one, convert it to a partition by finding the intervals. For example, in Python you could write:
from itertools import combinations

def partitions(n, t):
    """
    Generate the sequences of `n` positive integers that sum to `t`.
    """
    assert(1 <= n <= t)
    def intervals(c):
        last = 0
        for i in c:
            yield i - last
            last = i
        yield t - last
    for c in combinations(range(1, t), n - 1):
        yield tuple(intervals(c))

>>> list(partitions(2, 4))
[(1, 3), (2, 2), (3, 1)]
>>> list(partitions(3, 5))
[(1, 1, 3), (1, 2, 2), (1, 3, 1), (2, 1, 2), (2, 2, 1), (3, 1, 1)]
There are (L − 1)! / (X − 1)!(L − X)! combinations of X − 1 items out of L − 1, so the runtime of this algorithm (and the size of its output) is exponential in L. However, if you don't count the output, it only needs O(L) space.
With L = 128 and X = 8, there are 89,356,415,775 partitions, so it'll take a while to output them all!
(Maybe if you explain why you are computing these partitions, we might be able to suggest some way of meeting your requirements without having to actually produce them all.)
