How to optimize events scheduling with unknown future events? - algorithm

Scenario:
I need to give users opportunity to book different times for the service.
Caveat is that i dont have bookings in advance but i need to fill them as they come in.
Bookings can be represented as keyvalue pairs:
[startTime, duration]
So, for example, [9,3] would mean event starts at 9 o’clock and has duration of 3 hours.
Rules:
users come in one by one, there is never a batch of users requests
no bookings can overlap
service is available 24/7 so no need to worry about “working time”
users choose duration on their own
obviously, once user chooses&confirms his booking we cannot shuffle it anymore
we dont want gaps to be lesser than some amount of time. this one is based on probability that future users will fill in the gap. for example, if distribution of durations over users bookings is such that probability for future users filling the gap shorter than x hours is less than p then we want a rule that gap cannot be shorter than x. (for purpose of this question, we can assume x being hardcoded, here i just explain reasons)
the goal is to have service-busy-duration maximized
My thinking so far...
I keep the list of bookings made so far
I also keep track of gaps (as they are potential slots for new users booking)
When new user comes with his booking [startTime, duration] i first check for ideal case where gapLength = duration. if there is no such gaps, i find all slots (gaps) that satisfy condition gapLength - duration > minimumGapDuration and order them in descending order by that gapLength - duration value
I assign user to the first gap with maximum value of gapLength - duration since that gives me highest probability that gap remaining after this booking will also get filled in future
Questions:
Are there some problems with my approach that i am missing?
Are there some algorithms solving this particular problem?
Is there some usual approach (good starting point) which i could start with and optimize later? (i am actually trying to get enough infos to start but not making some critical mistake; optimizations can/should come later)
PS.
From research so far it sounds this might be the case for constraint programming. I would like to avoid it if possible as i have no clue about it (maybe its simple, i just dont know) but if it makes a real difference, i will go for its benefits and implement it.
I went through stackoverflow for similar problems but didnt find one with unknown future events. If there is such and this is direct duplicate, please refer to it.

Related

Maximise subset selection under constraints

I am working on a scheduling problem for a team of volunteers. I have boiled my problem down to the following algorithmic problem:
I have a matrix with ~60 rows representing volunteers and ~14 columns representing days. Each entry is an integer in the range 0 to 3 inclusive representing how free the volunteer is on that day. I want to choose exactly 4 entries from each column (4 volunteers a day) such that (in order of importance)
A 0-entry is never chosen.
The workload is as spread out as possible (first give everyone one shift, then start giving out second shifts, etc. We can expect that most volunteers will only have one shift per 2-week period, and some may even have none.)
The sum over selected entries is maximised (volunteers get days that they prefer).
I want to output a decision matrix that has a 1 whenever a volunteer is chosen for a day, and 0 otherwise. I believe this is an instance of the nurse-rostering problem, so I'm not expecting a fast solution, but I just want to make a brute force algorithm that will work in a reasonable time for my ~60 person team. I'm just really not sure how to start tackling this problem. Is it suited for backtracking, or is there some way to calculate the best placement of each volunteer based on the distribution of his/her day-scores?

What class of algorithms can be used to solve this?

EDIT: Just to make sure someone is not breaking their head on the problem... I am not looking for the best optimal algorithm. Some heuristic that makes sense is fine.
I made a previous attempt at formulating this and realized I did not do a great job at it so I removed that question. I have taken another shot at formulating my problem. Please feel free to provide any constructive criticism that can help me improve this.
Input:
N people
k announcements that I can make
Distance that my voice can be heard (say 5 meters) i.e. I may decide to announce or not depending on the number of people within these 5 meters
Goal:
Maximize the total number of people who have heard my k announcements and (optionally) minimize the time in which I can finish announcing all k announcements
Constraints:
Once a person hears my announcement, he is be removed from the total i.e. if he had heard my first announcement, I do not count him even if he hears my second announcement
I can see the same person as well as the same set of people within my proximity
Example:
Let us consider 10 people numbered from 1 to 10 and the following pattern of arrival:
Time slot 1: 1 (payoff = 1)
Time slot 2: 2 3 4 5 (payoff = 4)
Time slot 3: 5 6 7 8 (payoff = 4 if no announcement was made previously in time slot 2, 3 if an announcement was made in time slot 2)
Time slot 4: 9 10 (payoff = 2)
and I am given 2 announcements to make. Now if I were an oracle, I would choose time slots 2 and time slots 3 because then 7 people would have heard (because 5 already heard my announcement in Time slot 2, I do not consider him anymore). I am looking for an online algorithm that will help me make these decisions on whether or not to make an announcement and if so based on what factors. Does anyone have any ideas on what algorithms can be used to solve this or a simpler version of this problem?
There should be an approach relying upon a max-flow algorithm. In essence, you're trying to push the maximum amount of messages from start->end. Though it would be multidimensional, you could have a super-sink, which connects to each value of t, then have each value of t connect to the people you can reach at this time and then have a super-sink. This way, you simply have to compute a max-flow (with the added constraint of no more than k shouts, which should be solvable with a bit of dynamic programming). It's a terrifically dirty way to solve it, but it should get the job done deterministically and without the use of heuristics.
I don't know that there is really a way to solve this or an algorithm to do it the way you have formulated it.
It seems like basically you are trying to reach the maximum number of people with exactly 2 announcements. But without knowing any information about the groups of people in advance, you can't really make any kind of intelligent decision about whether or not to use your first announcement. Your second one at least has the benefit of knowing when not to be used (i.e. if the group has no new members then you can know its not worth wasting the announcement). But it still has basically the same problem.
The only real way to solve this is to use knowledge about the type of data or the desired outcome to make guesses. If you know that groups average 100 people with a standard deviation of 10, then you could just refuse to announce if less than 90 people are present. Or, if you know you need to reach at least 100 people with two announcements, you could choose never to announce to less than 50 at once. Obviously those approaches risk never announcing at all if the actual data does not meet what you would expect. But that's always going to be a risk, since you could get 1 person in the first group and then 0 in all of the rest, no matter what you do.
Or, you could try more clearly defining the problem, I have a hard time figuring out how to relate this to computers.
Lets start my trying to solve the simplest possible variant of the problem: Lets assume N people and K timeslots, but only one possible announcement. Lets also assume that each person will only ever stay for one timeslot and that each person who hasn't yet shown up has an equally probable chance of showing up at any future timeslot.
Given these simplifications, at each timeslot you look at the payoff of announcing at the current timeslot and compare to the chance of a future timeslot having a higher payoff, eg, lets assume 4 people 3 timeslots:
Timeslot 1: Person 1 shows up, so you know you could get a payoff of 1 by announcing, but then you have 3 people to show up in 2 remaining timeslots, so at least one of those timeslots is guaranteed to have 2 people, so don't announce..
So at each timeslot, you can calculate the chance that a later timeslot will have a higher payoff than the current by treating the remaining (N) people and (K) timeslots as being N independent random numbers each from 1..k, and calculate the chance of at least one value k being hit more than or equal to the current-payoff times. (Similar to the Birthday problem, but for more than 1 collision) and then you need to decide hwo much to discount based on expected variances. (bird in the hand, etc)
Generalization of this solution to the original problem is left as an exercise for the reader.

Black-box counting to 19 with only 2 bits, and only toggleable?

Some student asked this on another site, but got no answers. I had a few stabs at it, but found it pretty tricky.
Doing it with just the switches would require a 9:1 compression ratio, so I guess the trick is very much in the rules you assign to students. Perhaps every student needs a different set of rules?
I've thought about allowing many iterations where no answer comes up, by only paying attention to students in the correct sequence. I've also thought about encoding the student number as binary, and combining that with the bits from the switches, to get more bits to work with, but that's still a compression/validation problem: even if one of those bits was used for parity, you'd still have a big potential for false positives.
Presumably the problem wouldn't have been asked if there wasn't some way to do it. Maybe this is a common problem in comp-sci courses and well-known? Anyway, without further ado...
"Here is a problem I have for a computer class. It's seems kind of mathematical to me and could possibly involve the binary code. I'm not sure, all my ideas lead to dead ends.
Nineteen students are given the opportunity to win a prize by playing a game. After some time to decide on a strategy, all the students will be placed into separate soundproof isolation chambers with absolutely no way to communicate.
The game is played as follows. There are two light switches in a room that will begin in the "off" position. I will bring students into this room one at a time. Each time a student enters the room then he or she must flip one of the switches. All the students will eventually be brought into the room, but some students may be brought in more than one time.
If one person correctly tells me that everyone has been in the room, then everyone wins the prize. However, if someone incorrectly tells me that everyone has been in the room then everyone will be fed to the alligators! Note that either all the students win the prize or else everyone loses.
Your task is to determine a strategy that will be sure to allow everyone to win the prize (and not be eaten by alligators)."
This sounds like a variation of the Prisoners and the Light Switch riddle, where one prisoner is designated as a "counter" and everyone else "increments their count" only once.
Presumably the counter would turn on one switch, and if you had never been counted, you would turn off that switch; the other switch would be "garbage". Once the counter turned off the switch 18 times, he knows that all other students have been to the room.
The way the problem is worded, the organizer/teacher can ensure he never has to give out the prize: Allow each student into the room in turn - which only allows the Counter to count one other student. Then only iterate a subset of the students - say 3 of them.
Then either the Counter can count two other students then get stuck, or the Counter never goes back into the room.
This satisfies the specified conditions: Everyone goes into the room at least once, and some students go in multiple times.
To allow the students to win, you need to add the condition that there is a finite limit between any single student's visits to the switch room.

How to notice unusual news activity

Suppose you were able keep track of the news mentions of different entities, like say "Steve Jobs" and "Steve Ballmer".
What are ways that could you tell whether the amount of mentions per entity per a given time period was unusual relative to their normal degree of frequency of appearance?
I imagine that for a more popular person like Steve Jobs an increase of like 50% might be unusual (an increase of 1000 to 1500), while for a relatively unknown CEO an increase of 1000% for a given day could be possible (an increase of 2 to 200). If you didn't have a way of scaling that your unusualness index could be dominated by unheard-ofs getting their 15 minutes of fame.
update: To make it clearer, it's assumed that you are already able to get a continuous news stream and identify entities in each news item and store all of this in a relational data store.
You could use a rolling average. This is how a lot of stock trackers work. By tracking the last n data points, you could see if this change was a substantial change outside of their usual variance.
You could also try some normalization -- one very simple one would be that each category has a total number of mentions (m), a percent change from the last time period (δ), and then some normalized value (z) where z = m * δ. Lets look at the table below (m0 is the previous value of m) :
Name m m0 δ z
Steve Jobs 4950 4500 .10 495
Steve Ballmer 400 300 .33 132
Larry Ellison 50 10 4.0 400
Andy Nobody 50 40 .20 10
Here, a 400% change for unknown Larry Ellison results in a z value of 400, a 10% change for the much better known Steve Jobs is 495, and my spike of 20% is still a low 10. You could tweak this algorithm depending on what you feel are good weights, or use standard deviation or the rolling average to find if this is far away from their "expected" results.
Create a database and keep a history of stories with a time stamp. You then have a history of stories over time of each category of news item you're monitoring.
Periodically calculate the number of stories per unit of time (you choose the unit).
Test if the current value is more than X standard deviations away from the historical data.
Some data will be more volatile than others so you may need to adjust X appropriately. X=1 is a reasonable starting point
Way over simplified-
store people's names and the amount of articles created in the past 24 hours with their name involved. Compare to historical data.
Real life-
If you're trying to dynamically pick out people's names, how would you go about doing that? Searching through articles how do you grab names? Once you grab a new name, do you search for all articles for him? How do you separate out Steve Jobs from Apple from Steve Jobs the new star running back that is generating a lot of articles?
If you're looking for simplicity, create a table with 50 people's names that you actually insert. Every day at midnight, have your program run a quick google query for past 24 hours and store the number of results. There are a lot of variables in this though that we're not accounting for.
The method you use is going to depend on the distribution of the counts for each person. My hunch is that they are not going to be normally distributed, which means that some of the standard approaches to longitudinal data might not be appropriate - especially for the small-fry, unknown CEOs you mention, who will have data that are very much non-continuous.
I'm really not well-versed enough in longitudinal methods to give you a solid answer here, but here's what I'd probably do if you locked me in a room to implement this right now:
Dig up a bunch of past data. Hard to say how much you'd need, but I would basically go until it gets computationally insane or the timeline gets unrealistic (not expecting Steve Jobs references from the 1930s).
In preparation for creating a simulated "probability distribution" of sorts (I'm using terms loosely here), more recent data needs to be weighted more than past data - e.g., a thousand years from now, hearing one mention of (this) Steve Jobs might be considered a noteworthy event, so you wouldn't want to be using expected counts from today (Andy's rolling mean is using this same principle). For each count (day) in your database, create a sampling probability that decays over time. Yesterday is the most relevant datum and should be sampled frequently; 30 years ago should not.
Sample out of that dataset using the weights and with replacement (i.e., same datum can be sampled more than once). How many draws you make depends on the data, how many people you're tracking, how good your hardware is, etc. More is better.
Compare your actual count of stories for the day in question to that distribution. What percent of the simulated counts lie above your real count? That's roughly (god don't let any economists look at this) the probability of your real count or a larger one happening on that day. Now you decide what's relevant - 5% is the norm, but it's an arbitrary, stupid norm. Just browse your results for awhile and see what seems relevant to you. The end.
Here's what sucks about this method: there's no trend in it. If Steve Jobs had 15,000 a week ago, 2000 three days ago, and 300 yesterday, there's a clear downward trend. But the method outlined above can only account for that by reducing the weights for the older data; it has no way to project that trend forward. It assumes that the process is basically stationary - that there's no real change going on over time, just more and less probable events from the same random process.
Anyway, if you have the patience and willpower, check into some real statistics. You could look into multilevel models (each day is a repeated measure nested within an individual), for example. Just beware of your parametric assumptions... mention counts, especially on the small end, are not going to be normal. If they fit a parametric distribution at all, it would be in the Poisson family: the Poisson itself (good luck), the overdispersed Poisson (aka negative binomial), or the zero-inflated Poisson (quite likely for your small-fry, no chance for Steve).
Awesome question, at any rate. Lend your support to the statistics StackExchange site, and once it's up you'll be able to get a much better answer than this.

Algorithm for most recently/often contacts for auto-complete?

We have an auto-complete list that's populated when an you send an email to someone, which is all well and good until the list gets really big you need to type more and more of an address to get to the one you want, which goes against the purpose of auto-complete
I was thinking that some logic should be added so that the auto-complete results should be sorted by some function of most recently contacted or most often contacted rather than just alphabetical order.
What I want to know is if there's any known good algorithms for this kind of search, or if anyone has any suggestions.
I was thinking just a point system thing, with something like same day is 5 points, last three days is 4 points, last week is 3 points, last month is 2 points and last 6 months is 1 point. Then for most often, 25+ is 5 points, 15+ is 4, 10+ is 3, 5+ is 2, 2+ is 1. No real logic other than those numbers "feel" about right.
Other than just arbitrarily picked numbers does anyone have any input? Other numbers also welcome if you can give a reason why you think they're better than mine
Edit: This would be primarily in a business environment where recentness (yay for making up words) is often just as important as frequency. Also, past a certain point there really isn't much difference between say someone you talked to 80 times vs say 30 times.
Take a look at Self organizing lists.
A quick and dirty look:
Move to Front Heuristic:
A linked list, Such that whenever a node is selected, it is moved to the front of the list.
Frequency Heuristic:
A linked list, such that whenever a node is selected, its frequency count is incremented, and then the node is bubbled towards the front of the list, so that the most frequently accessed is at the head of the list.
It looks like the move to front implementation would best suit your needs.
EDIT: When an address is selected, add one to its frequency, and move to the front of the group of nodes with the same weight (or (weight div x) for courser groupings). I see aging as a real problem with your proposed implementation, in that it requires calculating a weight on each and every item. A self organizing list is a good way to go, but the algorithm needs a bit of tweaking to do what you want.
Further Edit:
Aging refers to the fact that weights decrease over time, which means you need to know each and every time an address was used. Which means, that you have to have the entire email history available to you when you construct your list.
The issue is that we want to perform calculations (other than search) on a node only when it is actually accessed -- This gives us our statistical good performance.
This kind of thing seems similar to what is done by firefox when hinting what is the site you are typing for.
Unfortunately I don't know exactly how firefox does it, point system seems good as well, maybe you'll need to balance your points :)
I'd go for something similar to:
NoM = Number of Mail
(NoM sent to X today) + 1/2 * (NoM sent to X during the last week)/7 + 1/3 * (NoM sent to X during the last month)/30
Contacts you did not write during the last month (it could be changed) will have 0 points. You could start sorting them for NoM sent in total (since it is on the contact list :). These will be showed after contacts with points > 0
It's just an idea, anyway it is to give different importance to the most and just mailed contacts.
If you want to get crazy, mark the most 'active' emails in one of several ways:
Last access
Frequency of use
Contacts with pending sales
Direct bosses
Etc
Then, present the active emails at the top of the list. Pay attention to which "group" your user uses most. Switch to that sorting strategy exclusively after enough data is collected.
It's a lot of work but kind of fun...
Maybe count the number of emails sent to each address. Then:
ORDER BY EmailCount DESC, LastName, FirstName
That way, your most-often-used addresses come first, even if they haven't been used in a few days.
I like the idea of a point-based system, with points for recent use, frequency of use, and potentially other factors (prefer contacts in the local domain?).
I've worked on a few systems like this, and neither "most recently used" nor "most commonly used" work very well. The "most recent" can be a real pain if you accidentally mis-type something once. Alternatively, "most used" doesn't evolve much over time, if you had a lot of contact with somebody last year, but now your job has changed, for example.
Once you have the set of measurements you want to use, you could create an interactive apoplication to test out different weights, and see which ones give you the best results for some sample data.
This paper describes a single-parameter family of cache eviction policies that includes least recently used and least frequently used policies as special cases.
The parameter, lambda, ranges from 0 to 1. When lambda is 0 it performs exactly like an LFU cache, when lambda is 1 it performs exactly like an LRU cache. In between 0 and 1 it combines both recency and frequency information in a natural way.
In spite of an answer having been chosen, I want to submit my approach for consideration, and feedback.
I would account for frequency by incrementing a counter each use, but by some larger-than-one value, like 10 (To add precision to the second point).
I would account for recency by multiplying all counters at regular intervals (say, 24 hours) by some diminisher (say, 0.9).
Each use:
UPDATE `addresslist` SET `favor` = `favor` + 10 WHERE `address` = 'foo#bar.com'
Each interval:
UPDATE `addresslist` SET `favor` = FLOOR(`favor` * 0.9)
In this way I collapse both frequency and recency to one field, avoid the need for keeping a detailed history to derive {last day, last week, last month} and keep the math (mostly) integer.
The increment and diminisher would have to be adjusted to preference, of course.

Resources