Suggestions for optimizing length of bins for a time period - algorithm

I have an optimisation problem where I need to optimise the lengths of a fixed number of bins over a known period of time, so that the bins contain as few overlapping items with the same tag as possible (items and tags are defined below).
If the problem can be solved heuristically that is fine; the exact optimum is not important.
I was wondering if anybody had any suggestions as to approaches to try, or any ideas as to what the name of this problem would be.
The problem
Let's say we have n items, each with two attributes: a tag and a time range.
As an example, take the following items (tag: start, end of the time range in seconds):
(1: 0, 2)
(1: 4, 5)
(1: 7, 8)
(1: 9, 15)
(2: 0, 5)
(2: 7, 11)
(2: 14, 20)
(3: 4, 6)
(3: 7, 11)
(4: 5, 15)
When plotted, this looks as follows: [figure: the items drawn as intervals along the 0-20 s timeline].
Let's say we have to bin this 20-second period into 4 groups. We could do this with 4 groups of length 5 each: [figure: the same plot with group boundaries at 5, 10 and 15 s].
The number of overlapping items with the same tag would be:
Group 1: 1 (tag 1)
Group 2: 2 (tag 1 and tag 3)
Group 3: 2 (tag 2)
Group 4: 0
Total overlapping items: 5
Another grouping into 4 groups would be lengths of 4, 3, 2 and 11 seconds.
The number of overlapping items with the same tag would then be:
Group 1: 0
Group 2: 0
Group 3: 0
Group 4: 1 (tag 2)
Total overlapping items: 1
Attempts to solve (brute force)
I can find the optimum solution by dividing the whole period of time into small segments (say 1 second each; for the above example there would be 20 segments).
I can then find all the integer compositions of the integer 20 that use 4 parts.
This would provide C(19, 3) = 969 different compositions, e.g.
(1, 1, 4, 14), (9, 5, 5, 1), (1, 4, 4, 11), (13, 3, 3, 1), (3, 4, 4, 9), (10, 5, 4, 1), (7, 6, 6, 1), (1, 3, 5, 11), (2, 4, 4, 10) ......
For (1, 1, 4, 14) the grouping would be 4 groups of 1, 1, 4 and 14 seconds.
I then find the composition with the best score (smallest number of overlapping tags).
The problem with this approach is that it is only workable for relatively small numbers: the number of compositions of an integer n into k parts is C(n-1, k-1), which grows extremely fast as n increases.
Therefore, if my data is 1000 seconds long and I keep 1-second segments, the run time would be far too long.
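To make the brute force concrete, here is a minimal sketch in Python. One assumption up front: the question's exact counting is slightly ambiguous, so the score below counts, per group and tag, every item beyond the first whose time range intersects the group (it may not reproduce the totals above exactly); the helper names compositions and score are mine.

from itertools import combinations

# The example items from above, as (tag, start, end).
items = [(1, 0, 2), (1, 4, 5), (1, 7, 8), (1, 9, 15),
         (2, 0, 5), (2, 7, 11), (2, 14, 20),
         (3, 4, 6), (3, 7, 11), (4, 5, 15)]

def compositions(total, parts):
    # Choosing the parts-1 interior cut points gives each composition once.
    for cuts in combinations(range(1, total), parts - 1):
        bounds = (0,) + cuts + (total,)
        yield tuple(bounds[i + 1] - bounds[i] for i in range(parts))

def score(lengths, items):
    # Assumed objective: per group and tag, count the items beyond the
    # first whose time range intersects the group.
    total, start = 0, 0
    for length in lengths:
        end = start + length
        per_tag = {}
        for tag, s, e in items:
            if s < end and e > start:  # item intersects [start, end)
                per_tag[tag] = per_tag.get(tag, 0) + 1
        total += sum(n - 1 for n in per_tag.values())
        start = end
    return total

best = min(compositions(20, 4), key=lambda ls: score(ls, items))
print(best, score(best, items))

This is exactly the exponential search described above, so it is only feasible for small periods.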
Attempts to solve (heuristically)
I have tried a genetic-algorithm-type approach, where each chromosome is a randomly created composition of lengths and each gene is the length of an individual group. Due to the nature of the data, though, I am struggling to design any meaningful crossover or mutation operators.
Does anyone have any suggestions?

Related

Efficient algorithm for finding the right elements combinations

The problem is the following:
1) The total load is given as input.
2) The number of steps over which the load is divided is also given as input.
3) Each step can have a different discrete number of elements, which must be a multiple of 3 (i.e. 3, 6, 9, 12, 15 elements ...).
4) The allowed element values are given as input.
5) Acceptable solutions are within a certain range "EPSILON" of the total load (equal to the total load, or greater but within a certain margin, for example up to +2).
Example:
Total load: 50
Number of steps: 4
Allowed elements that can be used are: 0.5, 1, 1.5, 2.5, 3, 4
Acceptable margin: +2 (i.e. total load between 50 and 52).
Example of solutions are:
For simplicity, each step here has uniform elements, although we can have different elements in the same step as long as they are grouped into 3s (i.e. we can have 3 elements of 1 and 3 other elements of 2 in the same step, for a load of 3 x 1 + 3 x 2 = 9).
Solution 1: total of 51
Step 1: 3 Elements of 4 (so a total of 12; this step could, for example, also be 3 elements of 3 plus 3 elements of 1, i.e. 3 x 3 + 3 x 1),
Step 2: 3 Elements of 4 (total of 12),
Step 3: 9 Elements of 1.5 (total of 13.5),
Step 4: 9 Elements of 1.5 (total of 13.5),
Solution 2: total of 51
Step 1: 3 Elements of 4 (total of 12)
Step 2: 3 Elements of 4 (total of 12)
Step 3: 6 Elements of 2 (total of 12)
Step 4: 15 Elements of 1 (total of 15)
The code that I used takes the above input and generates a second piece of code whose shape depends on the number of steps.
That generated code simply loops over the steps (nested loops) and checks all possible element combinations.
Example of loops for 2 steps solution:
Code:
For NumberofElementsA = 3 To 18 Step 3
    ' 18 is the maximum number of elements per step; the loop cannot
    ' run to infinity, so a maximum per step must be defined.
    For NumberofElementsB = 3 To 18 Step 3
        For AllowedElementsA = 1 To 6
            For AllowedElementsB = AllowedElementsA To 6
                ' The 6 allowed elements in this example: [0.5, 1, 1.5, 2.5, 3, 4]
                LoadDifference = NumberofElementsA * ElementsArray(AllowedElementsA) _
                               + NumberofElementsB * ElementsArray(AllowedElementsB) _
                               - TotalLoad
                ' Multiply the number of elements (3, 6, ... 18) by the element
                ' value in each loop and subtract the total load.
                If LoadDifference >= 0 And LoadDifference <= 2 Then
                    ' Solution OK
                End If
            Next AllowedElementsB
        Next AllowedElementsA
    Next NumberofElementsB
Next NumberofElementsA
So basically the code loops over all the possible numbers of elements and all the possible element values, and checks each result.
Is there an algorithm that solves the above problem more efficiently than looping over all possible outcomes?
Since you're restricted to groups of 3, this transforms immediately to a problem with all weights tripled:
1.5, 3, 4.5, 7.5, 9, 12
Your range is a target value +2, i.e. within 1 either way of the midpoint of that range (51 ± 1).
Since you've listed no requirement on balancing step loads, this is now an instance of the target sum problem, with a little processing before and after the central solution.
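A minimal sketch of that reduction in Python (assumptions: unlimited groups per step, the per-step maximum of 18 is ignored, and no balancing between steps; values are doubled so everything is an integer):

# Tripled because elements come in groups of 3, doubled to get integers.
elements = [0.5, 1, 1.5, 2.5, 3, 4]
weights = [int(2 * 3 * e) for e in elements]   # [3, 6, 9, 15, 18, 24]
lo, hi = 2 * 50, 2 * 52                        # doubled target range

# Coin-change style reachability DP: parent[v] = last weight used to reach v.
parent = {0: None}
for v in range(1, hi + 1):
    for w in weights:
        if v - w in parent:
            parent[v] = w
            break

# Pick any achievable total in range and reconstruct its groups of 3.
target = next(v for v in range(lo, hi + 1) if v in parent)
groups = []
while target:
    groups.append(parent[target] / 6)  # back to the original element value
    target -= parent[target]
print(sum(3 * g for g in groups), groups)

The resulting groups (each one is 3 identical elements) can then be dealt across the 4 steps in any way, e.g. round-robin, since no balancing between steps is required.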

What is the field of study of this algorithm problem?

In my app, people can give other people a mark out of ten points. Every day at midnight, I would like to run an algorithm that computes the best "match" for each person.
At the end of the day I will have, for example:
(ID_person_who_gave_mark, ID_person_who_received_mark, mark)
(1, 2, 7.5) // 1 gave to 2 a 7.5/10
(1, 3, 9) // etc..
(1, 4, 6)
(2, 1, 5.5)
(2, 3, 4)
(2, 4, 8)
(3, 1, 3)
(3, 2, 10)
(3, 4, 9)
(4, 1, 2)
// no (4, 2, xx) because 4 didn't give any mark to 2
(4, 3, 6.5)
At the end of the algorithm, I would like every person to have their best match, that is, the best compromise that "makes everyone happy".
In my example: person 1 gave a 9/10 to 3, but 3 gave a 3/10 to 1, so they definitely can't match; 1 gave a 7.5/10 to 2 and 2 gave a 5.5/10 to 1, so why not; and finally, 1 gave a 6/10 to 4, but 4 gave a 2/10 to 1, so they can't match (under 5/10 = they can't match). So for person 1 the only match would be with 2, but I have to check whether 1 is also a good match for 2.
2 gave a 4/10 to 3, so (2, 3) is out (under 5/10), but 2 gave an 8/10 to 4, so (2, 4) would be much better for 2 than (2, 1).
Let's look at 4: 4 didn't give any mark to 2, so they can't match. That leaves one possibility for 2: we make the match (2-1).
Now for 3: with 1 it's out (3/10); with 2 it would be great for 3 (10/10), but 2 gave 3 a 4/10, so it's out too. 3 gave a nice 9/10 to 4, so that would work. Checking 4: 4 gave a 2/10 to 1, so it's out with 1 (under 5/10), but gave a pretty nice 6.5 to 3. So the best match is finally between 4 and 3.
So the final matches in this example are (1-2) and (3-4).
Even when I do this intuitively, as I just did, it's complicated to find an appropriate procedure to compute all of this information.
Can you help me "mathematise" such a goal, in order to have an algorithm that could do this calculation for, say, 50,000 people? Or at least, what is the field of study where I could find more information to solve this effectively?
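The hand process above can at least be mechanized. A minimal sketch in Python, under two assumptions of mine: a pair is eligible only if both marks exist and are at least 5, and eligible pairs are ranked by the sum of the two marks and taken greedily:

# Sample data from above: (giver, receiver) -> mark.
marks = {(1, 2): 7.5, (1, 3): 9, (1, 4): 6,
         (2, 1): 5.5, (2, 3): 4, (2, 4): 8,
         (3, 1): 3, (3, 2): 10, (3, 4): 9,
         (4, 1): 2, (4, 3): 6.5}
people = {p for pair in marks for p in pair}

# Collect mutually acceptable pairs, scored by the sum of both marks.
pairs = []
for a in people:
    for b in people:
        if a < b and (a, b) in marks and (b, a) in marks \
                and marks[(a, b)] >= 5 and marks[(b, a)] >= 5:
            pairs.append((marks[(a, b)] + marks[(b, a)], a, b))

# Greedily take the best remaining pair.
matched, result = set(), []
for _, a, b in sorted(pairs, reverse=True):
    if a not in matched and b not in matched:
        matched |= {a, b}
        result.append((a, b))
print(result)  # [(3, 4), (1, 2)]

For the sample data this reproduces the matches (1-2) and (3-4) derived by hand, though a greedy pass is not guaranteed to be globally optimal in general.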

Moving maximum variant

Yesterday, I got asked the following question during a technical interview.
Imagine that you are working for a news agency. At every discrete point of time t, a story breaks. Some stories are more interesting than others. This "hotness" is expressed as a natural number h, with greater numbers representing hotter news stories.
Given a stream S of n stories, your job is to find the hottest story out of the most recent k stories for every t >= k.
So far, so good: this is the moving maximum problem (also known as the sliding window maximum problem), and there is a linear-time algorithm that solves it.
Now the question gets harder. Of course, older stories are usually less hot compared to newer stories. Let the age a of the most recent story be zero, and let the age of any other story be one greater than the age of its succeeding story. The "improved hotness" of a story is then defined as max(0, min(h, k - a)).
Here's an example:
n = 11, k = 4
S indices:                 0  1  2  3  4  5  6  7  8  9 10
S values:                  1  3  1  7  1  3  9  3  1  3  1
mov max hot indices:                3  3  3  6  6  6  6  9
mov max hot values:                 7  7  7  9  9  9  9  3
mov max imp-hot indices:            3  3  5  6  7  7  9  9
mov max imp-hot values:             4  3  3  4  3  3  3  3
I was at a complete loss with this question. I thought about adding the index to every element before computing the maximum, but that gives you the answer for when the hotness of a story decreases by one at every step, regardless of whether it reached the hotness bound or not.
Can you find an algorithm for this problem with sub-quadratic (ideally: linear) running time?
I'll sketch a linear-time solution to the original problem involving a double-ended queue (deque) and then extend it to improved hotness with no loss of asymptotic efficiency.
Original problem: keep a deque containing the stories that are (1) still in the window and (2) hotter than every story that arrived after them. At any given time, the hottest story in the window is at the front of the deque. New stories are pushed onto the back of the deque after popping from the back every story that is not hotter than the new one. Stories are popped from the front as they age out of the window.
For example:
S indices: 0 1 2 3 4 5 6 7 8 9 10
S values: 1 3 1 7 1 3 9 3 1 3 1
deque: (front) [] (back)
push (0, 1)
deque: [(0, 1)]
pop (0, 1) because it's not hotter than (1, 3)
push (1, 3)
deque: [(1, 3)]
push (2, 1)
deque: [(1, 3), (2, 1)]
pop (2, 1) and then (1, 3) because they're not hotter than (3, 7)
push (3, 7)
deque: [(3, 7)]
push (4, 1)
deque: [(3, 7), (4, 1)]
pop (4, 1) because it's not hotter than (5, 3)
push (5, 3)
deque: [(3, 7), (5, 3)]
pop (5, 3) and then (3, 7) because they're not hotter than (6, 9)
push (6, 9)
deque: [(6, 9)]
push (7, 3)
deque: [(6, 9), (7, 3)]
push (8, 1)
deque: [(6, 9), (7, 3), (8, 1)]
pop (8, 1) and (7, 3) because they're not hotter than (9, 3)
push (9, 3)
deque: [(6, 9), (9, 3)]
push (10, 1)
pop (6, 9) because it exited the window
deque: [(9, 3), (10, 1)]
To handle the new problem, we modify how we handle aging stories. Instead of popping stories only as they slide out of the window, we pop the front story as soon as its age cap k - a drops to its hotness or below; from that point on, its improved hotness is exactly k - a, which decreases by one per step. When determining the top story, only the most recently popped story needs to be considered, since any story popped earlier is older and therefore has a smaller k - a.
In Python:
import collections

Elem = collections.namedtuple('Elem', ('hot', 't'))

def winmaximphot(hots, k):
    q = collections.deque()
    oldtop = 0  # improved hotness of the most recently popped front story
    for t, hot in enumerate(hots):
        # Pop from the back every story not hotter than the new one.
        while q and q[-1].hot <= hot:
            del q[-1]
        q.append(Elem(hot, t))
        # Pop the front once its age cap k - a falls to its hotness or below.
        while q and q[0].hot >= k - (t - q[0].t) > 0:
            oldtop = k - (t - q[0].t)
            del q[0]
        if t + 1 >= k:
            yield max(oldtop, q[0].hot) if q else oldtop
        oldtop = max(0, oldtop - 1)  # popped stories decay by one per step

print(list(winmaximphot([1, 3, 1, 7, 1, 3, 9, 3, 1, 3, 1], 4)))
# [4, 3, 3, 4, 3, 3, 3, 3]
The idea is the following: each breaking news story will beat all previous stories after k - h steps. That means for k == 30 and hotness h == 28, the story becomes hotter than all previous stories after 2 steps.
Let's keep the moments in time at which each story will be the hottest. At step i, the moment at which the current story beats all previous ones is i + k - h.
So we keep a sequence of objects {news_date | beats_all_previous_date}, in increasing order of beats_all_previous_date:
{i1 | i1+k-h} {i3 | i3+k-h} {i4 | i4+k-h} {i7 | i7+k-h} {i8 | i8+k-h}
At the current step we compute i9 + k - h and add it to the end of this list, first removing all entries with bigger values (since the sequence is increasing, this is easy).
Once the first element's beats_all_previous_date becomes equal to the current date i, that story becomes the answer to the sliding-window query and we remove the entry from the sequence.
So you need a data structure that supports adding at the end and removing from both the beginning and the end: a deque. The time complexity of the solution is O(n), since every story is added and removed at most once.

Exhaustively permute a vector of size 20 in Matlab

I'm trying to exhaustively permute a vector of size 20, but when I try to use perms(v), I get the error:
Error using perms (line 23)
Maximum variable size allowed by the program is exceeded.
I've read from the documentation that the memory required for vectors longer than 10 is astronomical. So I'm looking for an alternative.
What I'm trying to do is the following (using a smaller-scale example, with a vector of size 3 instead of 20): find all vectors x of length 3 where (x_i)^2 = 1, e.g.
(1, 1, 1),
(-1, 1, 1), (1, -1, 1), (1, 1, -1),
(-1, -1, 1), (-1, 1, -1), (1, -1, -1),
(-1, -1, -1)
I was trying to iteratively create "base vectors" in which the number of '-1' elements increases from 0 to 20, then use perms(v) to permute each base vector, but I ran into the memory problem.
Is there any alternative to do this?
There are 2^20 such vectors (about a million). So you can run a loop with a counter over the range 0..2^20 - 1 and map each counter value (its binary representation) to the needed vector (a zero bit to -1, a one bit to +1, or vice versa). The mapping formula is simple:
Vector_Element = bit * 2 - 1
Example for length 4:
i = 10
binary form:  1  0  1  0
+/-1 vector:  1 -1  1 -1
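The same counter idea as a short sketch (in Python for illustration; the question targets MATLAB, but the bit mapping carries over directly):

n = 20
for i in range(2 ** n):  # counter 0 .. 2^n - 1
    # Map bit b of i to -1/+1; the bit order does not matter, since all
    # 2^n patterns are produced either way.
    vec = [((i >> b) & 1) * 2 - 1 for b in range(n)]
    # ... process vec ...

Each vector is generated on the fly, so nothing close to the memory required by perms(v) is needed.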

Reordering items with multiple order criteria

Scenario:
list of photos
every photo has the following properties
id
sequence_number
main_photo_bit
the first photo has the main_photo_bit set to 1 (all others are 0)
photos are ordered by sequence_number (which is arbitrary)
the main photo does not necessarily have the lowest sequence_number (before sorting)
See the following table:
id  sequence_number  main_photo_bit
1   10               1
2   5                0
3   20               0
Now you want to change the order by changing the sequence number and main photo bit.
Requirements after sorting:
the sequence_number of the first photo is not changed
the sequence_number of the first photo is the lowest
as few changes as possible
Examples:
Example #1 (second photo goes to the first position):
id  sequence_number  main_photo_bit
2   10               1
1   15               0
3   20               0
This is what happened:
id 1: gets a new sequence_number and its main_photo_bit set to 0
id 2: takes over the old first photo's sequence_number (10) and gets main_photo_bit set to 1
id 3: nothing happens
Example #2 (third photo to first position):
id  sequence_number  main_photo_bit
3   10               1
1   20               0
2   30               0
This is what happened:
id 1: gets a new sequence_number bigger than the first photo's, and main_photo_bit set to 0
id 2: gets a new sequence_number bigger than the newly generated second sequence_number
id 3: takes over the old first photo's sequence_number and gets main_photo_bit set to 1
What is the best approach to calculate the steps needed to save the new order?
Edit:
The reason that I want as few updates as possible is that I need to sync the order to an external service, which is a quite costly operation.
I already have a working prototype of the algorithm, but it fails in some edge cases. So instead of patching it up (which might work, but would make it even more complex than it already is), I want to know if there are other (better) ways to do it.
In short, my version orders the photos (changing sequence_numbers) and swaps the main_photo_bit, but that isn't sufficient to handle every scenario.
From what I understood, a good solution would not only minimize changes (since updating is the costly operation) but also try to minimize future changes as more and more photos are reordered. I'd start by adding a temporary field, dirty, to indicate whether the row must change or not:
id  sequence_number  main_photo_bit  dirty
1   10               1               false
2   5                0               false
3   20               0               false
4   30               0               false
5   31               0               false
6   33               0               false
If there are rows whose sequence_number is smaller than the first's, they will surely have to change (either to get a higher number or to become the first). Let's mark them as dirty:
id  sequence_number  main_photo_bit  dirty
2   5                0               true
(skip this step if it's not really important that the first has the lowest sequence_number)
Now let's see the list of photos as they should be in the result (as per the question, only one photo changed places, from anywhere to anywhere). The dirty ones are shown as D in the sequence_number lists further below:
[1, 2, 3, 4, 5, 6] # Original ordering
[2, 1, 3, 4, 5, 6] # Example 1: 2nd to 1st place
[3, 1, 2, 4, 5, 6] # Example 2: 3rd to 1st place
[1, 2, 4, 3, 5, 6] # Example 3: 3rd to 4th place
[1, 3, 2, 4, 5, 6] # Example 4: 3rd to 2nd place
The first thing to do is ensure the first element has the lowest sequence_number. If the first element hasn't changed places, this already holds (any lower-numbered rows were marked dirty above); otherwise, the old first photo should be marked as dirty and have its main_photo_bit cleared, and the new first photo should take over those values.
At this point, the first element has a fixed sequence_number, and every dirty element can have its value changed at will (it will have to change anyway, so it's better to change it to a useful value). Before proceeding, we must check whether the reorder can be done by changing only the dirty rows, or whether more rows will have to be dirtied as well. This is simply a matter of determining whether the interval between each pair of clean rows is big enough to fit the dirty rows that lie between them:
[10, D, 20, 30, 31, 33] # Original ordering (the first is dirty, but fixed)
[10, D, 20, 30, 31, 33] # Example 1: 2nd to 1st place (ok: 10 < ? < 20)
[10, D, D, 30, 31, 33] # Example 2: 3rd to 1st place (ok: 10 < ? < ? < 30)
[10, D, 30, D, 31, 33] # Example 3: 3rd to 4th place (NOT OK: 30 < ? < 31)
[10, D, 30, D, D, 33] # must mark 5th as dirty too (ok: 30 < ? < ? < 33)
[10, D, D, 30, 31, 33] # Example 4: 3rd to 2nd place (ok)
Now it's just a matter of assigning new sequence_numbers to the dirty rows. A naïve solution would be to just increment the previous one, but a better approach is to set them as equally spaced as possible. This way, there are better odds that a future reorder will require fewer changes (in other words, it avoids problems like Example 3, where more rows than necessary had to be updated because some sequence_numbers were too close to each other); a sketch of this step follows the lists below:
[10, 15, 20, 30, 31, 33] # Example 1: 2nd to 1st place
[10, 16, 23, 30, 31, 33] # Example 2: 3rd to 1st place
[10, 20, 30, 31, 32, 33] # Example 3: 3rd to 4th place
[10, 16, 23, 30, 31, 33] # Example 4: 3rd to 2nd place
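A minimal sketch of the gap check plus the equally spaced assignment in Python (assumption: the final order is given as a list where clean rows keep their old sequence_number and dirty rows are None; the first row is never dirty at this point):

def renumber(seq):
    out = list(seq)
    i = 1
    while i < len(out):
        if out[i] is not None:
            i += 1
            continue
        j = i
        while j < len(out) and out[j] is None:  # run of dirty rows
            j += 1
        lo = out[i - 1]
        hi = out[j] if j < len(out) else lo + 10 * (j - i + 1)
        if hi - lo <= j - i:        # gap too small: dirty the next clean
            out[j] = None           # row as well and retry this run
            continue
        gap = (hi - lo) / (j - i + 1)  # spread the dirty rows evenly
        for k in range(i, j):
            out[k] = round(lo + gap * (k - i + 1))
        i = j
    return out

print(renumber([10, None, 20, 30, 31, 33]))    # Example 1: [10, 15, 20, 30, 31, 33]
print(renumber([10, None, 30, None, 31, 33]))  # Example 3: [10, 20, 30, 31, 32, 33]

On Example 3 this automatically dirties the row holding 31 (as above) before spreading, reproducing the result shown.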
Bonus: if you really want to push the solution to its limits, do the computation twice (once moving the photo, once keeping it fixed and moving the surrounding photos instead) and see which one results in fewer changes. Take example 3A, where instead of "3rd to 4th place" we treat it as "4th to 3rd place" (same sorting result, but different changes):
[1, 2, 4, 3, 5, 6] # Example 3A: 4th to 3rd place
[10, D, D, 20, 31, 33] # (ok: 10 < ? < ? < 20)
[10, 13, 16, 20, 31, 33] # One less change
In most cases this can be done (e.g. 2nd to 4th position == 3rd/4th to 2nd/3rd position); whether the added complexity is worth the small gain is up to you to decide.
Use a linked list instead of sequence numbers. Then you can remove a picture from anywhere in the list and reinsert it anywhere else, and you only need to change 3 rows in your database. The main photo bit becomes unnecessary, since the first photo is implicitly defined by not having any pointers to it.
id next
1 3
2 1
3
the order is: 2, 1, 3
user moves picture 3 to position 1:
id next
1
2 1
3 2
new order is: 3, 2, 1
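A tiny sketch of such a move in Python (assumption: next is stored as a dict id -> successor, with None marking the tail; the head is the id that no row points to):

def move_to_front(nxt, pic):
    head = (set(nxt) - set(nxt.values())).pop()  # no incoming pointer = head
    if pic == head:
        return nxt
    prev = next(i for i, n in nxt.items() if n == pic)
    nxt[prev] = nxt[pic]   # predecessor now skips over pic   (changed row 1)
    nxt[pic] = head        # pic now points at the old head   (changed row 2)
    return nxt

nxt = {2: 1, 1: 3, 3: None}    # order: 2, 1, 3
print(move_to_front(nxt, 3))   # {2: 1, 1: None, 3: 2} -> order: 3, 2, 1

A move to the front touches only 2 rows; a move between two photos touches at most 3 (the old predecessor, the moved photo, and the new predecessor).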
