Detecting Conflicts on a Timeline, Part 2: Isolate "True" Overlaps - algorithm

This is a continuation of my original question about a Timeline-Scheduler Algorithm for plotting overlapping time conflicts: PART 1: Detecting Conflicts on a Scheduler Timeline (Algorithm)
I was given the correct algorithm, shown below, to split up "conflicting" events on a 24-hr timeline such that each item in the conflict group occupies N% of the window.
My current problem (PART 2) is that conflicting events are treated as a group and always divided equally, but the real goal is to isolate only the "true conflicts", which do not necessarily involve the whole group.
Consider the following picture.
Here, the original algorithm from Part 1 gave a 3-way split for the events
12:30am - 1:30am
1:00am - 2:30am
2:00am - 4:00am
But this result is slightly incorrect. There are only 2 overlaps, and there should be 2 columns shown. Event #3 can be brought over to Column 1 since it doesn't conflict with Event #1. The only conflict (a max 2-way split) is that #1 conflicts with #2, and #3 also conflicts with #2. As the gray arrow shows, there should be 2 columns for this case.
Original Conflict-Detection Algorithm from Part 1:
* 1) First sort all events by StartTime
* 2) Initialize "lastMaxEndTime" to EndTime of First Event (#1)
* 3) LOOP: For each Event: look at Current Event and Next Event (n+1)
* If Next Event Exists
* if (lastMaxEndTime > NextEvent StartTime) --> CONFLICT!
* - set Overlap mode
* - push conflicting Current Event's StartTime into conflict array
* - UPDATE: lastMaxEndTime = MAX(lastMaxEndTime, NextEvent EndTime)
* else --> NO CONFLICT
* - if we are in Overlap Mode, this is the last overlap
* - push this final conflicting Current Event's StartTime into conflict array
* - draw overlaps now
* - reset Overlap Mode and clear conflict array
* - else
* - this is a normal event, draw at 100%
* - UPDATE: lastMaxEndTime = NextEvent EndTime
*
* Else (No Next Event, this is the last event)
* - if we are in Overlap Mode, this is the last overlap
* - push this final conflicting Current Event's StartTime into conflict array
* - draw overlaps now
* - reset Overlap Mode and clear conflict array
* - else
* - this is a normal event, draw at 100%
Or, a slightly different view of this pseudocode from Patrick's answer,
// first event is the current event
lastMaxEndTime = CurrentEvent EndTime

if NextEvent exists {
    // if the maximum end time considered in
    // the conflicting component currently
    // under consideration extends beyond the
    // next event's start time, then this
    // and everything that "conflicts" with it
    // is also defined to "conflict" with NextEvent
    if (lastMaxEndTime > NextEvent StartTime) { // CONFLICT!
        overlappingMode = true;
        overlappingEvents.add(currentEvent); // Add to array
        lastMaxEndTime = max(lastMaxEndTime, NextEvent EndTime)
    }
    else { // NO CONFLICT
        if (overlappingMode is TRUE) {
            // Resolve Now
            redrawOverlappingEvents(overlappingEvents);
            // Reset
            overlappingMode = false;
            EMPTY overlappingEvents;
        }
        // everything that starts earlier than me,
        // ends before I start. so start over
        lastMaxEndTime = NextEvent EndTime
    }
}
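For reference, here is a minimal JavaScript sketch of that Part 1 sweep (the function name and the numeric start/end fields are illustrative, not from my actual code); it just collects each conflict group so the group can be drawn as an N-way split:

function findConflictGroups(events) {
    var sorted = events.slice().sort(function (a, b) { return a.start - b.start; });
    if (sorted.length === 0) return [];
    var groups = [];
    var group = [sorted[0]];
    var lastMaxEndTime = sorted[0].end;
    for (var i = 1; i < sorted.length; i++) {
        if (lastMaxEndTime > sorted[i].start) {
            // CONFLICT: this event overlaps the running group
            group.push(sorted[i]);
            lastMaxEndTime = Math.max(lastMaxEndTime, sorted[i].end);
        } else {
            // NO CONFLICT: close the group and start over
            groups.push(group);
            group = [sorted[i]];
            lastMaxEndTime = sorted[i].end;
        }
    }
    groups.push(group);
    return groups; // groups of length 1 are normal events drawn at 100%
}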

You need to partition the events into "lanes", sequences of non-overlapping events. This is generally easy with a "greedy" algorithm. Consider the events in order. For each event, place that event in the first "lane" (vertical column on your chart) where there is no overlap. If the current event overlaps with all columns, then place it into a new column.
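For illustration, a minimal sketch of that greedy lane assignment in JavaScript (assuming the events are already sorted by start time and carry numeric start/end values; names are illustrative):

function assignLanes(events) {
    var lanes = []; // lanes[i] is a list of non-overlapping events
    events.forEach(function (ev) {
        // the first lane whose latest event ends at or before this event starts has room
        var lane = lanes.find(function (l) {
            return l[l.length - 1].end <= ev.start;
        });
        if (lane) {
            lane.push(ev);
        } else {
            lanes.push([ev]); // conflicts with every existing lane: open a new one
        }
    });
    return lanes;
}

// The example from the question: expect 2 lanes, with event 3 joining event 1's lane.
assignLanes([
    { start: 0.5, end: 1.5 },  // 12:30am - 1:30am
    { start: 1.0, end: 2.5 },  //  1:00am - 2:30am
    { start: 2.0, end: 4.0 }   //  2:00am - 4:00am
]);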

Prune's answer is correct. Here is a proof.
In the base case of one event, the algorithm obviously gives an optimal solution of one lane with no overlaps.
Assume the algorithm gives an optimal solution for all numbers of events up to and including k.
We must show that the algorithm gives a correct result for k + 1 events. After k of these k + 1 events, the algorithm has built a schedule with optimally many lanes and no overlaps. It must now place the (k + 1)st event in some lane. Suppose that this event fits into some lane with no overlaps. In that case, place the event there, and the number of lanes must still be optimal (adding more events cannot result in needing fewer lanes). What if the (k + 1)st event overlaps with events in every existing lane?
The only way the (k + 1)st event can overlap with events in all existing lanes is if the latest-running events of all existing lanes overlap with each other. To see why this must be true, recall that the start times are in ascending sorted order: if the latest-running events of any two existing lanes didn't overlap with each other, the (k + 1)st event couldn't overlap with whichever of the two finished earlier. But if we have a set of L + 1 events which all overlap with each other, we must have at least L + 1 lanes, one more than L, the optimal number of lanes given k events, and this is exactly what the algorithm guarantees by placing the (k + 1)st event in a new lane in this instance.
Here is an alternative idea - you could fill up the lanes backwards using iterations of optimal event scheduling run in reverse; that is, add events with the latest start time to each lane while avoiding conflicts. This will give you as many non-overlapping events as possible in the first lane. Repeat the process iteratively on new lanes until you run out of events.
(Optimal event scheduling adds events to the schedule by choosing earliest stop time first, and then eliminating remaining events whose start times occur before the stop time chosen during the round. We can imagine time flowing in reverse and using latest start time while eliminating events with stop times after the chosen start time, taking the already sorted list in reverse order. Also, this iterated application of the optimal method should really be proven optimal in its own right, if it even is, but I digress.)
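A rough sketch of that lane-filling-in-reverse idea, under the same assumptions (events sorted by start time, numeric start/end values, illustrative names):

function fillLanesBackwards(events) {
    // consider events by latest start time first
    var remaining = events.slice().sort(function (a, b) { return b.start - a.start; });
    var lanes = [];
    while (remaining.length) {
        var lane = [];
        var earliestStart = Infinity;
        remaining = remaining.filter(function (ev) {
            if (ev.end <= earliestStart) {
                lane.push(ev);             // fits in this lane; consume it
                earliestStart = ev.start;
                return false;
            }
            return true;                   // left for a later lane
        });
        lanes.push(lane.reverse());        // restore chronological order
    }
    return lanes;
}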

I tried to implement this algorithm here.
I am considering a double-array lanes[x][y] where e.g.
lanes[0] = ["event1", "event4", "event7"]
lanes[1] = ["event2"]
etc.
Algorithm:
// Organize overlapping events into lanes, where each lane holds non-overlapping events from the conflict group.
// Assumes overlappingEventIDs is sorted by start time, so an event conflicts with an
// earlier event in a lane exactly when it starts before that event ends.
var lanes = [];
for (var i = 0; i < overlappingEventIDs.length; i++) {
    var thisStartTime = getTime(overlappingEventIDs[i].startTime);
    var laneFound = false;
    for (var j = 0; j < lanes.length && !laneFound; j++) {
        var conflictInLaneFound = false;
        for (var k = 0; k < lanes[j].length; k++) {
            var testEventEndTime = getTime(lanes[j][k].endTime);
            if (thisStartTime < testEventEndTime) {
                conflictInLaneFound = true;
                break;
            }
        }
        if (!conflictInLaneFound) {
            // Found a lane for this event: Lane #j
            lanes[j].push(overlappingEventIDs[i]);
            laneFound = true;
        }
    }
    if (!laneFound) {
        // Conflicts with every existing lane, so open a new lane for this event
        lanes.push([overlappingEventIDs[i]]);
    }
}

Related

Using the first row in bin (instead of average) to calculate percentage gain

In the dc.js Nasdaq example, percentageGain is calculated as:
(p.absGain / p.avgIndex) * 100
Here avgIndex is the average of all the day-averages.
I'm more familiar with the equation:
A. (Price - Prev period's Close) / Prev period's Close * 100
I'm not sure whether this is possible (with filters set and so on), the way crossfilter/dc works. Therefore, an alternative and different equation, one that might fit crossfilter/dc better and would still be meaningful, could be:
B. absGain of group / open of first day of group * 100
B would also mean that: If only a filter is set on for example Q1, then only the absGain of Q1 is taken into account. The first day in this group is the oldest Q1 date in the oldest year. Also, charts other than "yearly" with groups like quarter, month or day of the week should be able to display the value of this equation. For example in a month chart, the value of the month "June" is calculated by taking the open of the first day in the first June. The absGain is taken from all June months. (of course working with all current filters in place)
Question: Can A and/or B be solved the crossfilter/dc way and how (example)?
Even if only B could be solved (naturally with crossfilter/dc), that would already be great. I want to use the dc.js example for other stocks that have the same underlying data structure (open, close, high, low, volume)
thanks!
I agree that Equation B is easier to define using crossfilter, so I figured out one way to do it.
Equation A could probably work but it's unclear which day's close should be used under filtering - the last day which is not in the current bin? The day before the first day in the current bin?
Equation B needs the earliest row for the current bin, and that requires maintaining the array of all rows for each bin. This is not built into crossfilter but it's a feature which we have talked about adding.
The complex reduce example does this, and we can reuse some of its code. It calculates the median/mode/min/max value from the arrays of rows which fall in each bin, using these functions to generate those arrays:
function groupArrayAdd(keyfn) {
    var bisect = d3.bisector(keyfn);
    return function(elements, item) {
        var pos = bisect.right(elements, keyfn(item));
        elements.splice(pos, 0, item);
        return elements;
    };
}

function groupArrayRemove(keyfn) {
    var bisect = d3.bisector(keyfn);
    return function(elements, item) {
        var pos = bisect.left(elements, keyfn(item));
        if(keyfn(elements[pos])===keyfn(item))
            elements.splice(pos, 1);
        return elements;
    };
}
It's somewhat inefficient to maintain all these arrays, so you might test if it has an impact on your application. JS is pretty fast so it probably doesn't matter unless you have a lot of data.
Unfortunately there is no other way to compute the minimum for a bin other than to keep an array of all the items in it. (If you tried to keep track of just the lowest item, or lowest N items, what would you do when they are removed?)
Using these arrays inside the group reduce-add function:
(p, v) => {
    ++p.count;
    p.rowsByDate = rbdAdd(p.rowsByDate, v);
    p.absGain += v.close - v.open;
    // ...
    p.percentageGain = p.rowsByDate.length ? (p.absGain / p.rowsByDate[0].open) * 100 : 0;
    return p;
},
In the reduce-remove function it's
p.rowsByDate = rbdRemove(p.rowsByDate, v);
and the same percentageGain change.
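Putting it together, a minimal sketch of the group wiring (moveMonths is the month dimension from the Nasdaq example; rbdAdd, rbdRemove and the gainByMonthGroup name are my own, so adjust to your setup):

var rbdAdd = groupArrayAdd(dc.pluck('date')),
    rbdRemove = groupArrayRemove(dc.pluck('date'));

var gainByMonthGroup = moveMonths.group().reduce(
    function (p, v) { // reduce-add
        ++p.count;
        p.rowsByDate = rbdAdd(p.rowsByDate, v);
        p.absGain += v.close - v.open;
        // equation B: absGain of the bin divided by the open of the bin's earliest row
        p.percentageGain = p.rowsByDate.length ? (p.absGain / p.rowsByDate[0].open) * 100 : 0;
        return p;
    },
    function (p, v) { // reduce-remove
        --p.count;
        p.rowsByDate = rbdRemove(p.rowsByDate, v);
        p.absGain -= v.close - v.open;
        p.percentageGain = p.rowsByDate.length ? (p.absGain / p.rowsByDate[0].open) * 100 : 0;
        return p;
    },
    function () { // reduce-initial
        return { count: 0, absGain: 0, rowsByDate: [], percentageGain: 0 };
    });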
Here is a demo in a notebook: https://jsfiddle.net/gordonwoodhull/08bzcd4y/17/
I only see slight changes in the Y positions of the bubbles; the changes are more apparent in the values printed in the tooltip.

Standard Algorithm for subdividing a grid into smaller and smaller parts

I'm running a simulation over a grid of parameters and I'd like to run it for as long as possible, but I don't know yet when the simulation will be terminated (think power cut). So what I'd like to do is specify the min and max values for each parameter and then let the loop pick the next best point on the grid, regularly saving the current result.
So given in 1d space a parameter a from 0 to 1 I'd like the loop to simulate for values 0, 1, 0.5, 0.75, 0.25, 0.875, 0.625, 0.375, 0.125, ... The exact order does not matter too much, as long as the next point always lies in between the previous ones.
So probably I could come up with some piece of code that generates this sequence, but I'm wondering if there are standard formulations for such an algorithm, especially for higher dimensional spaces?
One way to achieve this in one dimension is to maintain a binary tree, where each node keeps track of an interval, and its midpoint.
The left child of a node contains the left half of its interval, and the right child contains the right half.
Performing a breadth-first search in such a tree and keeping track of all the mid points of the traversed nodes, will yield the sequence you are after.
For several dimensions, depending on your needs, you can e.g. keep track of one such tree for each dimension, and generate your parameters in the order you like.
In practice this can be implemented using lazy initialisation and a queue to perform the BFS.
To demonstrate (but in practice, you would do it in a more memory-efficient way), I've added a simple binary tree BFS implementation in JavaScript (since it can be tried in the browser):
class Node {
    constructor(min, max) {
        this.min = min;
        this.max = max;
        this.mid = (min + max) / 2;
    }
    // children are created lazily, each covering one half of this node's interval
    get left() { return new Node(this.min, this.mid); }
    get right() { return new Node(this.mid, this.max); }
}

function getSequence(start, end, n) {
    const res = [start, end];
    const queue = [new Node(start, end)]; // BFS queue
    for (let i = 0; i < n; ++i) {
        const node = queue.shift();
        res.push(node.mid);
        queue.push(node.right, node.left);
    }
    return res;
}
getSequence(0, 1, 100);
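For more than one dimension, one possible (purely illustrative) combination is to generate a refinement sequence per axis and visit all parameter pairs drawn from the first few refinements of each axis before moving on, so coarse combinations get simulated first (simulate() is a hypothetical callback):

const xs = getSequence(0, 1, 30);
const ys = getSequence(10, 20, 30);
const visited = new Set();
for (let round = 0; round < Math.max(xs.length, ys.length); ++round) {
    for (let i = 0; i <= round && i < xs.length; ++i) {
        for (let j = 0; j <= round && j < ys.length; ++j) {
            const key = i + ',' + j;
            if (!visited.has(key)) {
                visited.add(key);
                simulate(xs[i], ys[j]); // hypothetical simulation call
            }
        }
    }
}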

Activity selection algorithm with 2 types of activities

I'm trying to solve this problem using dynamic programming and I can't figure out how to define the sub-problems and the relation between them.
It's basically the same as the regular activity selection problem, except that there are 2 types of activities (let's call them yellow and gray), so each activity has a start time, a finish time and a color.
Yellow activities have a higher priority than gray ones, so if a yellow activity overlaps with 2 gray activities, the yellow one gets into the solution and the gray activities don't.
thanks.
Split them into 2 collections. Try selecting 2 from the grays and 1 from the yellows, as you would in the regular selection problem, compare them, get the new time stamp, and remove the now-invalid activities from both collections:
A = { }, Selection
loop {
    G1 = Select(1 from Gray)
    Y1 = Select(1 from Yellow)
    if (Y1.finish < G1.finish) {
        Selection = Y1;
    }
    else {
        G2 = Select(2 from Gray)
        if (Y1.finish < G2.finish)
            Selection = Y1;
        else
            Selection = {G1, G2}
    }
    A = A U Selection;
    LowerBound = Selection.Last.Finish;
    RemoveFrom(Yellow : have starting time < LowerBound)
    RemoveFrom(Gray : have starting time < LowerBound)
}
Note:
Gray and Yellow should be sorted as in the original problem.
Select is the normal select operation you would do in the original problem, so Select(2 ...) doesn't mean "select the first 2 in the collection", but the first 2 doable activities.
For easy implementation, consider using 2 queues and some other technique to get the LowerBound; it should be O(n) time and space (minus the sorting) if implemented correctly.

Weighted Interval Scheduling: How to capture *all* maximal fits, not just a single maximal fit?

In the weighted interval scheduling problem, one has a sequence of intervals {i_1, i_2, ..., i_n} where each interval i_x represents a contiguous range (in my case, a range of non-negative integers; for example i_x = [5,9)). The usual goal is to set the weight of each interval equal to its width, and then determine the subset of non-overlapping intervals whose total weight is a maximum. An excellent solution is given at the link I just provided.
I have implemented the solution in C++, starting with the algorithm provided at the given link (which is written in Python in a GitHub repository here).
However, the current solution at the link given - and everywhere else I have seen it discussed - only provides a way to capture a single maximal fit. Of course, in some cases there can be multiple maximal fits, each with the same total (globally maximal) weight.
I have implemented a "brute force" approach to capturing all maximal fits, which I describe below.
However, before discussing the specifics of my brute-force approach, the key problem I'd like resolved is that it captures many false positives in addition to the true maximal fits. It is not necessary to delve into the specifics of my brute-force approach if you can just answer the following question:
I'd like to know what is the (or a) most efficient enhancement to the basic O(n log(n)) solution that supports capturing all maximal fits, rather than just one maximal fit (but if anyone can answer how to avoid false positives, that will also satisfy me).
I am making no progress on this, and the brute force approach I'm using starts to explode unmanageably in cases where there are in excess of thousands (perhaps less) maximal fits.
Thank you!
Details of the brute force approach I am using, only if interested or useful:
There is a single line of code in the existing source code I've linked above that is responsible for the fact that the algorithm selects a single maximal fit, rather than proceeding down a path where it could capture all maximal fits. Click here to see that line of code. Here it is:
if I[j].weight + OPT[p[j]] > OPT[j - 1]:
Notice the > (greater than sign). This line of code successfully guarantees that any interval combination with a higher total weight than any other interval combination for the given sub-problem is kept. By changing > to >=, it is possible to capture scenarios where the current interval set under consideration has an equal total weight to the highest previous total weight, which would make it possible to capture all maximal fits. I wish to capture this scenario, so in my C++ migration I used the >= and, in the case where equality holds, I proceed down both paths in the fork via a recursive function call.
Below is the C++ code for the (critical) function that captures all optimum interval sets (and weights) for each sub-problem (noting that the final solution is obtained at the last index where the sub-problem corresponds to the entire problem).
Please note that OPTs is a list of all potential solutions (maximal interval sets) (i.e., each element of OPTs is itself a single complete solution of all sub-problems consisting of a set of intervals and a corresponding weight for every sub-problem), while OPT is used to describe a single such complete solution - a potential maximal fit with all intervals used to construct it, one for each sub-problem.
For the standard solution of the weighted interval scheduling problem that I've indicated above, the solution obtained is just OPT (a single one, not a list).
The RangeElement type in the code below is simply metadata unrelated to the problem I'm discussing.
RangesVec contains the set of intervals that is the input to the problem (properly sorted by ending value). PreviousIntervalVec corresponds to compute_previous_intervals discussed at the link above.
(Note: For anybody who is looking at the Python code linked above, please note that I think I have found a bug in it related to saving intervals in the maximal set; please see here for a comment about this bug, which I've fixed in my C++ code below.)
Here is my 'brute-force' implementation that captures all maximal fits. My brute force approach also captures some false positives that need to be removed at the end, and I would be satisfied with any answer that gives a most efficient approach to exclude false positives but otherwise uses an algorithm equivalent to the one below.
void CalculateOPTs(std::vector<std::pair<INDEX_TYPE, std::vector<RangeElement const *>>> & OPT, size_t const starting_index = 0)
{
    ++forks;
    for (size_t index = starting_index; index < RangesVec.size(); ++index)
    {
        INDEX_TYPE max_weight_to_be_set_at_current_index {};
        INDEX_TYPE max_weight_previous_index {};
        INDEX_TYPE max_weight_previously_calculated_at_previous_interval {};
        INDEX_TYPE current_index_weight = RangesVec[index]->range.second - RangesVec[index]->range.first;

        if (index > 0)
        {
            max_weight_previous_index = OPT[index - 1].first;
        }

        size_t previous_interval_plus_one = PreviousIntervalVec[index];
        if (previous_interval_plus_one > 0)
        {
            max_weight_previously_calculated_at_previous_interval = OPT[previous_interval_plus_one - 1].first;
        }

        INDEX_TYPE weight_accepting_current_index = current_index_weight + max_weight_previously_calculated_at_previous_interval;
        INDEX_TYPE weight_rejecting_current_index = max_weight_previous_index;
        max_weight_to_be_set_at_current_index = std::max(weight_accepting_current_index, weight_rejecting_current_index);

        //if (false && weight_accepting_current_index == weight_rejecting_current_index)
        if (weight_accepting_current_index == weight_rejecting_current_index)
        {
            // ***************************************************************************************** //
            // Fork!
            // This is one of the two paths of the fork, accessed by calling the current function recursively.
            // There are two equal combinations of intervals with an equal weight.
            // Follow the path that *rejects* the interval at the current index.
            // ***************************************************************************************** //

            if (index == 0)
            {
                // The only way for the previous weight to equal the current weight, given that the current weight cannot be 0,
                // is if previous weight is also not 0, which cannot be the case if index == 0
                BOOST_THROW_EXCEPTION(std::exception((boost::format("Logic error: Forking a maximal fitting path at index == 0")).str().c_str()));
            }

            std::vector<std::pair<INDEX_TYPE, std::vector<RangeElement const *>>> newOPT = OPT;
            OPTs.emplace_back(newOPT);
            OPTs.back().push_back(std::make_pair(weight_rejecting_current_index, std::vector<RangeElement const *>())); // std::max returns first value if the two values are equal; so here create a fork using the second value
            OPTs.back()[index].second = OPTs.back()[index - 1].second; // The current index is being rejected, so the current set of intervals remains the same for this index as for the previous
            CalculateOPTs(OPTs.back(), index + 1);
        }

        // ***************************************************************************************** //
        // If we forked, this is the other path of the fork, which is followed after the first fork, above, exits.
        // If we didn't fork, we proceed straight through here anyways.
        // ***************************************************************************************** //

        OPT.push_back(std::make_pair(max_weight_to_be_set_at_current_index, std::vector<RangeElement const *>()));

        if (max_weight_to_be_set_at_current_index == weight_accepting_current_index)
        {
            // We are accepting the current interval as part of a maximal fitting, so track it.
            //
            // Note: this also works in the forking case that hit the previous "if" block,
            // because this code represents the alternative fork.
            //
            // We here set the intervals associated with the current index
            // equal to the intervals associated with PreviousIntervalVec[index] - 1,
            // and then append the current interval.
            //
            // If there is no preceding interval, then leave the "previous interval"'s
            // contribution empty (from the line just above where an empty vector was added),
            // and just append the current interval (as the first).
            if (previous_interval_plus_one > 0)
            {
                OPT.back().second = OPT[previous_interval_plus_one - 1].second;
            }
            OPT.back().second.push_back(RangesVec[index]); // We are accepting the current interval as part of the maximal set, so add the corresponding interval here
        }
        else
        {
            if (index == 0)
            {
                // If index is 0, we should always accept the current interval, not reject, so we shouldn't be here in that case
                BOOST_THROW_EXCEPTION(std::exception((boost::format("Logic error: Rejecting current interval at index == 0")).str().c_str()));
            }
            // We are rejecting the current interval, so set the intervals associated with this index
            // equal to the intervals associated with the previous index
            OPT.back().second = OPT[index - 1].second;
        }
    }
}
When there is an equal-weight optimal subsolution, you need to add the next interval to every subsolution; I don't see this happening. The general form would look like this:
function go(lastend) {
    for (i = 0; i < n; i++) {
        if (interval[i].start > lastend) {
            optimalsubs = go(interval[i].end)
            if (optimalsubs.cost + interval[i].cost > optimal.cost) {
                for (os in optimalsubs) {
                    os.add(interval[i])
                }
                optimal = optimalsubs
                optimal.cost = optimalsubs.cost + interval[i].cost
            }
            else if (optimalsubs.cost + interval[i].cost == optimal.cost) {
                for (os in optimalsubs) {
                    os.add(interval[i])
                }
                optimal.append(optimalsubs)
            }
        }
    }
    return optimal
}

Removing items from unevenly distributed set

I have a website where users submit questions (zero, one or multiple per day), vote on them and answer one question per day (more details here). A user can see the question only once either by submitting, voting or answering it.
I have a pool of questions that players have already seen. I need to remove 30 questions from the pool each month. I need to pick the questions to remove in such a way that I maximize the number of available questions left in the pool for the player with the least available questions.
Example with pool of 5 questions (and need to remove 3):
player A has seen questions 1, 3 and 5
player B has seen questions 1 and 4
player C has seen questions 2 and 4
I thought about removing the questions that the top player has seen, but the positions would change. Following the above example, player A has only got 2 questions left to play (2 and 4). However, if I remove 1, 3 and 5, the situation would be:
player A can play questions 2 and 4
player B can play question 2
player C cannot play anything because 1,3,5 are removed and he has already seen 2 and 4.
The score for this solution is zero, i.e. the player with least amount of available questions has zero available questions to play.
In this case it would be better to remove 1, 3 and 4, giving:
player A can play question 2
player B can play questions 2 and 5
player C can play question 5
The score for this solution is one, because the two players with least amount of available questions to play have one available question.
If the data size was small, I would be able to brute-force the solution. However, I have hundreds of players and questions, so I'm looking for some algorithm to solve this.
Let's suppose that you have a general efficient algorithm for this. Concentrate on the questions left, rather than the questions removed.
You could use such an algorithm to solve the problem - can you choose at most T questions such that every user has at least one question to answer? I think that this is http://en.wikipedia.org/wiki/Set_cover, and I think solving your problem in general allows you to solve set cover, so I think it is NP-complete.
There is at least a linear programming relaxation. Associate each question with a variable Qi in the range 0<= Qi <= 1. Choosing questions Qi such that each user has at least X questions available amounts to the constraint SUM Uij Qj >= X, which is linear in Qj and X, so you can maximise for the objective function X with the linear variables X and Qj. Unfortunately, the result need not give you integer Qj - consider for example the case when all possible pairs of questions are associated with some user and you want each user to be able to answer at least 1 question, using at most half of the questions. The optimum solution is Qi = 1/2 for all i.
(But given a linear programming relaxation you could use it as the bound in http://en.wikipedia.org/wiki/Branch_and_bound).
Alternatively you could just write down the problem and throw it at an integer linear programming package, if you have one handy.
For completeness of the thread, here is a simple greedy, approximating approach.
Place the solved questions in the previously discussed matrix form:
Q0 X
Q1 XX
Q2 X
Q3 X
Q4 XX
223
Sort by the number of questions solved:
Q0 X
Q1 XX
Q2 X
Q3 X
Q4 XX
322
Strike out a question with the most Xs among the players with the most problems solved. (This is guaranteed to decrease our measure, if anything is):
=======
Q1 XX
Q2 X
Q3 X
Q4 XX
222
Sort again:
=======
Q1 XX
Q2 X
Q3 X
Q4 XX
222
Strike again:
=======
=======
Q2 X
Q3 X
Q4 XX
211
Sort again:
=======
=======
Q2 X
Q3 X
Q4 XX
211
Strike again:
=======
=======
Q2 X
Q3 X
=======
101
It's O(n^2logn) without optimizations, so it is plenty fast for some hundreds of questions. It's also easy to implement.
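A sketch of that greedy loop in JavaScript (assuming 'seen' is an array of Sets of question ids, one per player; names are illustrative):

function greedyStrike(seen, numQuestions, strikes) {
    var removed = new Set();
    for (var s = 0; s < strikes; s++) {
        // count, per player, how many still-available questions they have already seen
        var seenCounts = seen.map(function (qs) {
            var c = 0;
            qs.forEach(function (q) { if (!removed.has(q)) c++; });
            return c;
        });
        var worst = Math.max.apply(null, seenCounts);
        var worstPlayers = seen.filter(function (qs, p) { return seenCounts[p] === worst; });
        // strike the not-yet-removed question seen by the most of those worst-off players
        var best = -1, bestCount = -1;
        for (var q = 0; q < numQuestions; q++) {
            if (removed.has(q)) continue;
            var c = 0;
            worstPlayers.forEach(function (qs) { if (qs.has(q)) c++; });
            if (c > bestCount) { bestCount = c; best = q; }
        }
        removed.add(best);
    }
    return removed; // the question ids to delete
}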
It's not optimal, as can be seen from this counterexample with 2 strikes:
Q0 X
Q1 X
Q2 XXX
Q3 XXX
Q4 XXXX
Q5 222222
Here the greedy approach is going to remove Q5 and Q2 (or Q3) instead of Q2 and Q3 which would be optimal for our measure.
I propose a bunch of optimizations based on the idea that you really want to maximize the number of unseen questions for the player with the minimum number of questions, and do not care if there is 1 player with the minimum number of questions or 10000 players with that same number of questions.
Step 1: Find the player with the minimum number of questions unseen (in your example, that would be player A). Call this player p.
Step 2: Find all players within 30 of the number of questions unseen by player p. Call this set P. P are the only players who need to be considered, as removing 30 unseen questions from any other player would still leave them with more unseen questions than player p, and thus player p would still be worse off.
Step 3: Find the intersection of all sets of problems seen by players in P. You may remove all problems within this set, hopefully dropping you down from 30 to some smaller number of problems to remove, which we will call r, with r <= 30.
Step 4: Find the union of all sets of problems seen by players in P. Call this set U. If the size of U is <= r, you are done: remove all problems in U, and then remove the remaining problems arbitrarily from your set of all problems. Player p will lose r - size of U and remain with the fewest unseen problems, but this is the best you can do.
You are now left with your original problem, but likely with vastly smaller sets.
Your problem set is U, your player set is P, and you must remove r problems.
The brute force approach takes time (size(U) choose r) * size (P). If those numbers are reasonable, you can just brute force it. This approach is to choose each set of r problems from U and evaluate it against all players in P.
Since your problem does appear to be NP-Complete, the best you can probably hope for is an approximation. The easiest way to do this is to set some max number of tries, then randomly choose and evaluate sets of problems to remove. As such, a function to perform U choose r randomly becomes necessary. This can be done in time O(r) (in fact, I answered how to do this earlier today!):
Select N random elements from a List<T> in C#
You can also put any of the heuristics suggested by other users into your choices by weighting each problem's chance to be selected, I believe the link above shows how to do that in the selected answer.
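For completeness, one O(r) way to draw such a random subset in JavaScript (Floyd's algorithm; a sketch for illustration, not the linked C# answer verbatim):

function sampleIndices(n, r) {
    var chosen = new Set();
    for (var j = n - r; j < n; j++) {
        var t = Math.floor(Math.random() * (j + 1)); // uniform in [0, j]
        chosen.add(chosen.has(t) ? j : t);           // Floyd's step keeps all r picks distinct
    }
    return Array.from(chosen);
}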
Linear programming models.
Variant 1.
Sum(Uij * Qj) - Sum(Dij * Xj) + 0 = 0 (for each i)
0 + Sum(Dij * Xj) - Score >= 0 (for each i)
Sum(Qj) = (Number of questions - 30)
Maximize(Score)
Uij is 1 if user i has not seen question j, otherwise it is 0
Dij is element of identity matrix (Dij=1 if i=j, otherwise it is 0)
Xj is auxiliary variable (one for each user)
Variant 2.
Sum(Uij * Qj) >= Score (for each i)
Sum(Qj) = (Number of questions - 30)
No objective function, just check feasibility
In this case, LP problem is simpler, but Score should be determined by binary and linear search. Set current range to [0 .. the least number of unseen questions for a user], set Score to the middle of the range, apply integer LP algorithm (with small time limit). If no solution found, set range to [begin .. Score], otherwise set it to [Score .. end] and continue binary search.
(Optionally) use binary search to determine upper bound for exact solution's Score.
Starting from the best Score, found by binary search, apply integer LP algorithm with Score, increased by 1, 2, ...
(and limiting computation time as necessary). At the end, you get either exact solution, or some good approximation.
Here is sample code in C for GNU GLPK (for variant 1):
#include <stdio.h>
#include <stdlib.h>
#include <glpk.h>

int main(void)
{
    int ind[3000];
    double val[3000];
    int row;
    int col;
    glp_prob *lp;

    // Parameters
    int users = 120;
    int questions = 10000;
    int questions2 = questions - 30;
    int time = 30; // sec.

    // Create GLPK problem
    lp = glp_create_prob();
    glp_set_prob_name(lp, "questions");
    glp_set_obj_dir(lp, GLP_MAX);

    // Configure rows
    glp_add_rows(lp, users*2 + 1);
    for (row = 1; row <= users; ++row)
    {
        glp_set_row_bnds(lp, row, GLP_FX, 0.0, 0.0);
        glp_set_row_bnds(lp, row + users, GLP_LO, 0.0, 0.0);
    }
    glp_set_row_bnds(lp, users*2 + 1, GLP_FX, questions2, questions2);

    // Configure columns
    glp_add_cols(lp, questions + users + 1);
    for (col = 1; col <= questions; ++col)
    {
        glp_set_obj_coef(lp, col, 0.0);
        glp_set_col_kind(lp, col, GLP_BV);
    }
    for (col = 1; col <= users; ++col)
    {
        glp_set_obj_coef(lp, questions + col, 0.0);
        glp_set_col_kind(lp, questions + col, GLP_IV);
        glp_set_col_bnds(lp, questions + col, GLP_FR, 0.0, 0.0);
    }
    glp_set_obj_coef(lp, questions+users+1, 1.0);
    glp_set_col_kind(lp, questions+users+1, GLP_IV);
    glp_set_col_bnds(lp, questions+users+1, GLP_FR, 0.0, 0.0);

    // Configure matrix (question columns)
    for(col = 1; col <= questions; ++col)
    {
        for (row = 1; row <= users*2; ++row)
        {
            ind[row] = row;
            val[row] = ((row <= users) && (rand() % 2))? 1.0: 0.0;
        }
        ind[users*2 + 1] = users*2 + 1;
        val[users*2 + 1] = 1.0;
        glp_set_mat_col(lp, col, users*2 + 1, ind, val);
    }

    // Configure matrix (user columns)
    for(col = 1; col <= users; ++col)
    {
        for (row = 1; row <= users*2; ++row)
        {
            ind[row] = row;
            val[row] = (row == col)? -1.0: ((row == col + users)? 1.0: 0.0);
        }
        ind[users*2 + 1] = users*2 + 1;
        val[users*2 + 1] = 0.0;
        glp_set_mat_col(lp, questions + col, users*2 + 1, ind, val);
    }

    // Configure matrix (score column)
    for (row = 1; row <= users*2; ++row)
    {
        ind[row] = row;
        val[row] = (row > users)? -1.0: 0.0;
    }
    ind[users*2 + 1] = users*2 + 1;
    val[users*2 + 1] = 0.0;
    glp_set_mat_col(lp, questions + users + 1, users*2 + 1, ind, val);

    // Solve integer GLPK problem
    glp_iocp param;
    glp_init_iocp(&param);
    param.presolve = GLP_ON;
    param.tm_lim = time * 1000;
    glp_intopt(lp, &param);
    printf("Score = %g\n", glp_mip_obj_val(lp));

    glp_delete_prob(lp);
    return 0;
}
Time limit is not working reliably in my tests. Looks like some bug in GLPK...
Sample code for variant 2 (only LP algorithm, no automatic search for Score):
#include <stdio.h>
#include <stdlib.h>
#include <glpk.h>

int main(void)
{
    int ind[3000];
    double val[3000];
    int row;
    int col;
    glp_prob *lp;

    // Parameters
    int users = 120;
    int questions = 10000;
    int questions2 = questions - 30;
    double score = 4869.0 + 7;

    // Create GLPK problem
    lp = glp_create_prob();
    glp_set_prob_name(lp, "questions");
    glp_set_obj_dir(lp, GLP_MAX);

    // Configure rows
    glp_add_rows(lp, users + 1);
    for (row = 1; row <= users; ++row)
    {
        glp_set_row_bnds(lp, row, GLP_LO, score, score);
    }
    glp_set_row_bnds(lp, users + 1, GLP_FX, questions2, questions2);

    // Configure columns
    glp_add_cols(lp, questions);
    for (col = 1; col <= questions; ++col)
    {
        glp_set_obj_coef(lp, col, 0.0);
        glp_set_col_kind(lp, col, GLP_BV);
    }

    // Configure matrix (question columns)
    for(col = 1; col <= questions; ++col)
    {
        for (row = 1; row <= users; ++row)
        {
            ind[row] = row;
            val[row] = (rand() % 2)? 1.0: 0.0;
        }
        ind[users + 1] = users + 1;
        val[users + 1] = 1.0;
        glp_set_mat_col(lp, col, users + 1, ind, val);
    }

    // Solve integer GLPK problem
    glp_iocp param;
    glp_init_iocp(&param);
    param.presolve = GLP_ON;
    glp_intopt(lp, &param);

    glp_delete_prob(lp);
    return 0;
}
It appears that variant 2 allows finding a pretty good approximation quite fast, and the approximation is better than for variant 1.
Let's say you want to delete Y questions from the pool. The simple algorithm would be to sort questions by the number of views they had, then remove the Y top-viewed questions. For your example: 1: 2, 2: 1, 3: 1, 4: 2, 5: 1. Clearly, you're better off removing questions 1 and 4. But this algorithm doesn't achieve the goal. However, it is a good starting point. To improve it, you need to make sure that every user will end up with at least X questions after the "cleaning".
In addition to the above array (which we can call "score"), you need a second one with questions and users, where a crossing will have 1 if the user has seen the question, and 0 if he didn't. Then, for every user you need to find the X questions that he hasn't seen yet with the lowest score (the lower their score the better, since the fewer people who saw the question, the more "valuable" it is for the system overall). You combine all the found X questions from every user into a third array, let's call it "safe", since we won't delete any from it.
As the last step you just delete Y top viewed questions (the ones with the highest score), which aren't in the "safe" array.
What that algorithm achieves also is that if deleting say 30 questions will make some users have less than X questions to view, it won't remove all 30. Which is, I guess, good for the system.
Edit: A good optimization for this would be to track not every user, but to have some activity benchmark to filter out people who saw only a few questions. Because if there are too many people who each saw only, say, 1 rare different question, then nothing can be deleted. Filtering these kinds of users or improving the safe-array functionality can solve it.
Feel free to ask questions if I didn't describe the idea deep enough.
Have you considered viewing this in terms of a dynamic programming solution?
I think you might be able to do it by maximizing the number of available questions left open to all players, such that no single player is left with zero open questions.
The following link provides a good overview of how to construct dynamic programming solutions to this sort of problem.
Presenting this in terms of questions still playable. I'll number the questions from 0 to 4 instead of 1 to 5, as this is more convenient in programming.
01234
-----
player A x x - player A has just 2 playable questions
player B xx x - player B has 3 playable questions
player C x x x - player C has 3 playable questions
I'll first describe what might appear to be a very naive algorithm, but at the end I'll show how it can be improved significantly.
For each of the 5 questions, you'll need to decide whether to keep it or discard it. This will require a recursive function that will have a depth of 5.
vector<bool> keep_or_discard(5); // an array to store the five decisions

void decide_one_question(int question_id) {
    // first, pretend we keep the question
    keep_or_discard[question_id] = true;
    decide_one_question(question_id + 1); // recursively consider the next question
    // then, pretend we discard this question
    keep_or_discard[question_id] = false;
    decide_one_question(question_id + 1); // recursively consider the next question
}

decide_one_question(0); // this call starts the whole recursive search
This first attempt will fall into an infinite recursive descent and run past the end of the array. The obvious first thing we need to do is to return immediately when question_id == 5 (i.e. when all questions 0 to 4 have been decided). We add this code to the beginning of decide_one_question:
void decide_one_question(int question_id) {
    {
        if(question_id == 5) {
            // no more decisions needed.
            return;
        }
    }
    // ....
Next, we know how many questions we are allowed to keep. Call this allowed_to_keep. This is 5-3 in this case, meaning we are to keep exactly two questions. You might set this as a global variable somewhere.
int allowed_to_keep; // set this to 2
Now, we must add further checks to the beginning of decide_one_question, and add another parameter:
void decide_one_question(int question_id, int questions_kept_so_far) {
    {
        if(question_id == 5) {
            // no more decisions needed.
            return;
        }
        if(questions_kept_so_far > allowed_to_keep) {
            // not allowed to keep this many, just return immediately
            return;
        }
        int questions_left_to_consider = 5 - question_id; // how many not yet considered
        if(questions_kept_so_far + questions_left_to_consider < allowed_to_keep) {
            // even if we keep all the rest, we'll fall short
            // may as well return. (This is an optional extra)
            return;
        }
    }
    keep_or_discard[question_id] = true;
    decide_one_question(question_id + 1, questions_kept_so_far + 1);
    keep_or_discard[question_id] = false;
    decide_one_question(question_id + 1, questions_kept_so_far );
}

decide_one_question(0,0);
( Notice the general pattern here: we allow the recursive function call to go one level 'too deep'. I find it easier to check for 'invalid' states at the start of the function, than to attempt to avoid making invalid function calls in the first place. )
So far, this looks quite naive. This is checking every single combination. Bear with me!
We need to start keeping track of the score, in order to remember the best (and in preparation for a later optimization). The first thing would be to write a function calculate_score. And to have a global called best_score_so_far. Our goal is to maximize it, so this should be initialized to -1 at the start of the algorithm.
int best_score_so_far; // initialize to -1 at the start

void decide_one_question(int question_id, int questions_kept_so_far) {
    {
        if(question_id == 5) {
            int score = calculate_score();
            if(score > best_score_so_far) {
                // Great!
                best_score_so_far = score;
                store_this_good_set_of_answers();
            }
            return;
        }
        // ...
Next, it would be better to keep track of how the score is changing as we recurse through the levels. Let's start off by being optimistic; let's pretend we can keep every question, calculate the score, and call it upper_bound_on_the_score. A copy of this will be passed into the function every time it calls itself recursively, and it will be updated locally every time a decision is made to discard a question.
void decide_one_question(int question_id
                       , int questions_kept_so_far
                       , int upper_bound_on_the_score) {

    ... the checks we've already detailed above

    keep_or_discard[question_id] = true;
    decide_one_question(question_id + 1
                      , questions_kept_so_far + 1
                      , upper_bound_on_the_score
                      );
    keep_or_discard[question_id] = false;
    decide_one_question(question_id + 1
                      , questions_kept_so_far
                      , calculate_the_new_upper_bound()
                      );
Notice, near the end of that last code snippet, that a new (smaller) upper bound has been calculated, based on the decision to discard question 'question_id'.
At each level in the recursion, this upper bound keeps getting smaller. Each recursive call either keeps the question (making no change to this optimistic bound), or else decides to discard one question (leading to a smaller bound in this part of the recursive search).
The optimization
Now that we know an upper bound, we can have the following check at the very start of the function, regardless of how many questions have been decided at this point:
void decide_one_question(int question_id
                       , int questions_kept_so_far
                       , upper_bound_on_the_score) {
    if(upper_bound_on_the_score < best_score_so_far) {
        // the upper bound is already too low,
        // therefore, this is a dead end.
        return;
    }
    if(question_id == 5) // .. continue with the rest of the function.
This check ensures that once a 'reasonable' solution has been found, the algorithm will quickly abandon all the 'dead end' searches. It will then (hopefully) quickly find better and better solutions, and it can then be even more aggressive in pruning dead branches. I have found that this approach works quite nicely for me in practice.
If it doesn't work, there are many avenues for further optimization. I won't try to list them all, and you could certainly try entirely different approaches. But I have found this to work on the rare occasions when I have to do some sort of search like this.
Here's an integer program. Let constant unseen(i, j) be 1 if player i has not seen question j and 0 otherwise. Let variable kept(j) be 1 if question j is to be kept and 0 otherwise. Let variable score be the objective.
maximize score # score is your objective
subject to
for all i, score <= sum_j (unseen(i, j) * kept(j)) # score is at most
# the number of questions
# available to player i
sum_j (1 - kept(j)) = 30 # remove exactly
# 30 questions
for all j, kept(j) in {0, 1} # each question is kept
# or not kept (binary)
(score has no preset bound; the optimal solution chooses score
to be the minimum over all players of the number of questions
available to that player)
If there are too many options to brute force and there are likely many solutions that are near-optimal (which sounds to be the case), consider Monte Carlo methods.
You have a clearly defined fitness function, so just make some random assignments and score the result. Rinse and repeat until you run out of time or some other criterion is met.
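A minimal sketch of that random search (assuming 'seen' is an array of Sets of question ids per player; names are illustrative):

function monteCarloRemove(seen, numQuestions, toRemove, tries) {
    var best = null, bestScore = -1;
    for (var t = 0; t < tries; t++) {
        // random candidate set of questions to remove
        var removed = new Set();
        while (removed.size < toRemove) {
            removed.add(Math.floor(Math.random() * numQuestions));
        }
        // fitness: the minimum number of playable questions left to any player
        var score = Math.min.apply(null, seen.map(function (qs) {
            var unseen = 0;
            for (var q = 0; q < numQuestions; q++) {
                if (!removed.has(q) && !qs.has(q)) unseen++;
            }
            return unseen;
        }));
        if (score > bestScore) { bestScore = score; best = removed; }
    }
    return best;
}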
The question seems easy at first, but after thinking more deeply you realize the hardness.
The simplest option would be removing the questions that have been seen by the maximum number of users. But this does not take the number of remaining questions for each user into consideration: too few questions may be left for some users after removing.
A more complex solution would be computing the number of remaining questions for each user after deleting a question. You need to compute it for every question and every user. This task may be time consuming if you have many users and questions. Then you can sum up the number of questions left for all users, and select the question with the highest sum.
I think it would be wise to limit the number of remaining questions for a user to a reasonable value. You can think "OK, this user has enough questions to view if he has more than X questions". You need this because after deleting a question, only 15 questions may be left for an active user while 500 questions may be left for a rarely visiting user. It's not fair to sum 15 and 500. You can, instead, define a threshold value of 100.
To make it easier to compute, you can consider only the users who have viewed more than X questions.
