I am trying to solve question 8.2 from the book Grokking Algorithms, but I don't agree with the solution the author gave. The question from the book is:
You’re going to Europe, and you have seven days to see everything you can. You assign a point value to each item (how much you want to see it) and estimate how long it takes. How can you maximize the point total (seeing all the things you really want to see) during your stay? Come up with a greedy strategy. Will that give you the optimal solution?
The answer is also provided:
Keep picking the activity with the highest point value that you can still do in the time you have left. Stop when you can't do anything else. No, this won't give you the optimal solution.
Is it better to see the places that take minimum time or max time first? I am not convinced by the author's answer to this question. I can't see how it is best to visit the places which take longer in your schedule...
It seems you missed an aspect in the description of the challenge:
There are two metrics at play, not just one:
The time needed to visit a place
The point value a place has
These are independent metrics. The question is to maximize the sum of the second metric, while keeping the sum of the first metric within a given limit (7 days)
The answer suggests selecting, as the next place, the one with the largest point value among those places whose time still fits within the available time.
This is not saying you should select the place that takes the most time, but the one with the most points (after filtering on available time).
As the book explains, such greedy algorithms do not guarantee the optimal solution, but in practice they can come close without requiring an unacceptable running time.
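For illustration, here is a minimal Python sketch of that greedy strategy; the places, point values and durations below are invented, not from the book:

```python
# Greedy sketch: repeatedly pick the highest-point place that still fits
# into the remaining time. The data is made up for illustration.
places = [
    ("Eiffel Tower", 9, 1.5),   # (name, points, days needed)
    ("Louvre", 8, 1.0),
    ("Colosseum", 7, 2.0),
    ("Alps hike", 10, 3.5),
    ("Prado", 6, 1.0),
]

def greedy_itinerary(places, days_left=7.0):
    chosen = []
    for name, points, days in sorted(places, key=lambda p: p[1], reverse=True):
        if days <= days_left:           # highest points first, if it still fits
            chosen.append(name)
            days_left -= days
    return chosen

print(greedy_itinerary(places))
```

Because it never reconsiders a pick, it can miss a combination of cheaper places whose points add up to more, which is exactly why the answer says it won't always be optimal.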
I have a list of users which need to be sorted into committees. The users can rank committees based on their particular preference, but must choose at least one to join. When they have all made their selections, the algorithm should sort them as evenly as possible taking into account their committee preference, gender, age, time zone and country (for now). I have looked at this question and its answer would seem like a good choice, but it is unclear to me how to add the various constraints to the algorithm for it to work.
Would anyone point me in the right direction on how to do this, please?
Looking for "clustering" will get you nowhere, because this is not a clustering type of task.
Instead, this is an assignment problem.
For further information, see:
Knapsack Problem
Generalized Assignment Problem
Usually, these are NP-hard to solve. Thus, one typically chooses a greedy optimization heuristic to find a reasonably good solution faster.
Think about how to best assign one person at a time.
Then, process the data as follows:
1. Assign everybody that can only be assigned in a single way.
2. Find an unassigned person that is hard to assign; stop if everybody is assigned.
3. Assign that person in the best possible way.
4. Remove preferences that are no longer admissible, and go to 1 again (there may now be a new person with only a single choice left).
For bonus points, add a source of randomness and an overall quality measure. Then run the algorithm 10 times, and keep only the best result.
For further bonus, add a postprocessing optimization: when can you transfer one person to another group, or swap two persons, to improve the overall quality? Iterate over all persons to find such small improvements until you cannot find any.
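As a rough sketch of how such a greedy assignment could look in code - the data model here (a dict of ranked committee preferences per person and a remaining-seat count per committee) is an assumption for illustration, and it ignores the gender/age/time-zone balancing:

```python
# Greedy assignment sketch: always place the person with the fewest remaining
# admissible options first ("hardest to assign"). People with exactly one
# option left are therefore handled first, which covers step 1 implicitly.
def assign(preferences, capacity):
    # preferences: {person: [acceptable committees, best first]}
    # capacity:    {committee: remaining seats}
    assignment = {}
    prefs = {person: list(choices) for person, choices in preferences.items()}
    while prefs:
        # remove preferences that are no longer admissible
        for person in prefs:
            prefs[person] = [c for c in prefs[person] if capacity[c] > 0]
        # pick the unassigned person with the fewest remaining choices
        person = min(prefs, key=lambda p: len(prefs[p]))
        if not prefs[person]:
            raise ValueError(f"no admissible committee left for {person}")
        committee = prefs[person][0]   # "best possible way": top remaining choice
        assignment[person] = committee
        capacity[committee] -= 1
        del prefs[person]
    return assignment
```

A real quality measure (balancing gender, age, time zone, country) would replace the simple "top remaining choice" line, and randomized tie-breaks plus the swap-based postprocessing described above would sit on top of this loop.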
Recently I was looking at Reddit's algorithm for determining what makes a post a "hot" topic and which content is suitable for the reddit homepage.
The article I was reading is here:
http://amix.dk/blog/post/19588
I've noticed they use mathematical logarithms and have created some kind of mathematical function to determine the hotness/relevance of a post.
In the formulas used, where do each of the mathematical components come from and how do they know to use them?
thank you!
-- Bakz
EDIT: just to clarify, I just graduated high school and apologize if the answer to this question seems pretty obvious. thanks again!
I'll tackle the first formula, for "hotness" of posts. Formulas like this come from requirements. The designers of Reddit have thought about what they want to achieve, and designed formulas accordingly. I can't tell you exactly what requirements they had in mind, but I can look at the implementation and guess that they wanted a system along these lines:
Scores shouldn't need to be recomputed unless the number of votes change. This reduces the number of changes to the database, and makes it easier to achieve consistency if data is replicated. (So any scoring system based on scores getting lower as the article ages will be no good).
If two stories are equally old, the one with more upvotes should be higher. (So there needs to be a contribution from the votes.)
The more upvotes a story gets, the longer it should remain near the top of the ranking.
Old stories shouldn't stay at the top of the rankings for ever, even if they had lots of upvotes. Fairly soon (after a day or two), new stories need to outrank them. (So there needs to be a contribution from the date, and this must outweigh the score due to votes fairly soon, no matter how many votes something gets.)
Stories with more downvotes than upvotes should not appear in the rankings at all.
Now let's look at the formula, log₁₀ z + yt/45000, and see how it satisfies these requirements. (Following the definitions in the linked article: x is upvotes minus downvotes, y is the sign of x, z is max(x, 1), and t is the submission time measured in seconds since Reddit's epoch.)
If the number of votes does not change, then z, y and t are all unchanged. So the score is unchanged. This satisfies requirement (1).
If two stories have the same age, then they have the same value for t. But the one with more upvotes has a higher value of z, and since log is monotonic, it has a higher score. This satisfies requirement (2).
The more upvotes a story has, the higher its z, so the longer it will be until another story with higher t can outrank it. This satisfies requirement (3).
The logarithm is a function that grows more and more slowly as its argument gets larger (take a look at its graph). So a story needs more and more upvotes over time to keep up with newer stories. This satisfies requirement (4).
If the story has more downvotes than upvotes, then z = 1 and y = −1 so the score is negative. This satisfies requirement (5).
The constant 45,000 is a scale factor that brings the upvotes and the age into balance. There are 86,400 seconds in a day, so t grows by that amount each day, and t/45,000 grows by 86,400/45,000 ≈ 1.92 each day. That means one day's worth of relative newness is worth 10^1.92 ≈ 83 votes, and two days' relative newness is worth roughly 7,000 votes.
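Putting those pieces together, a rough Python rendition of the formula as described in the article (not Reddit's actual source code) looks like this:

```python
from datetime import datetime
from math import log10

# Reddit's "epoch" per the linked article: 7:46:43 on December 8, 2005.
REDDIT_EPOCH = datetime(2005, 12, 8, 7, 46, 43)

def hot(ups, downs, submitted):
    x = ups - downs                            # net score
    y = 1 if x > 0 else (-1 if x < 0 else 0)   # sign of the net score
    z = max(x, 1)                              # net score, floored at 1
    t = (submitted - REDDIT_EPOCH).total_seconds()
    return log10(z) + y * t / 45000

# Example: a story one day newer needs roughly 83 times fewer net votes to keep up.
print(hot(100, 10, datetime(2024, 6, 1)))
print(hot(8300, 10, datetime(2024, 5, 31)))
```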
They don't come from anywhere. There is no absolute truth to them, nor anything to prove. It's simply a way to quantify an attribute in whatever way seemed most sensible to the development team.
You would use log when you want something to remain a factor, but a less important one (large values still increase the result, just very slowly). By the same token, they could have chosen the cube root instead.
The formulae are simply a representation of the factors we can presume characterize something "hot", composed in such a manner that each is taken into account in an appropriate proportion (for example, we'll square the values that have huge importance, and take the log of those that matter less).
Once they came up with the formula, they probably came up with 10 or 15 different types of posts, plugged the numbers in, and saw that it made a lot of sense all round, so they stuck with it. In fact, their first few attempts probably didn't come out so well, and after a little fiddling with the numbers they arrived at that formula.
I'm looking for the best algorithm to optimise the decisions made in a simulation in order to find a fast result in a reasonable amount of time. The simulation does a number of "ticks" and occasionally needs to make a decision. Eventually a goal state is reached. (It would be possible to never reach a goal state if you make very bad decisions.)
There are many, many goal states. I want to find the goal state with the least number of ticks (a tick equates roughly to a second in real life). I basically want to decide which decisions to make to get to the goal in as few seconds as possible.
Some points about the problem domain:
Straight off the bat I can generate a series of choices that will lead to a solution. It won't be optimal.
I have a reasonable heuristic function to determine what would be a good decision
I have a reasonable function to determine the minimum possible time cost from a node to a goal.
Algorithms:
I need to process this problem for about 10 seconds and then give the best answer I can.
I believe A* would find me the optimal solution. The problem is that the decision tree will be so large that I won't be able to calculate it quickly enough.
IDA* would give me a good first few choices in 10 seconds but I need a path all the way to a goal.
At the moment I am thinking that I will start off with the known non-optimal path to a goal and then perhaps use Simulated Annealing and attempt to improve it over 10 seconds.
What would be a good algorithm to research to try to solve this sort of problem?
Have a look at limited discrepancy search (repeating with increasingly loose limits on the maximum discrepancy), or at beam search.
If you have a good heuristic, you should be able to use it to compare individual choices (for the limited discrepancy search) and to compare partial solutions (for the beam search).
See if you can place an upper bound on how good any extension of a partial solution is. Then you can prune out partial solutions that can't possibly be extended to beat the result from the heuristic, or the best result found so far in a series of iterative searches with increasing depth.
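A minimal beam-search sketch of the kind being suggested; expand, heuristic and is_goal are placeholders you would supply from your simulation:

```python
# Beam search sketch: at each depth keep only the `width` most promising
# partial paths, ranked by the caller-supplied heuristic (lower = better).
def beam_search(start, expand, heuristic, is_goal, width=50, max_depth=10000):
    beam = [(heuristic(start), [start])]
    for _ in range(max_depth):
        candidates = []
        for _, path in beam:
            node = path[-1]
            if is_goal(node):
                return path                     # first complete path found
            for child in expand(node):          # apply each possible decision
                candidates.append((heuristic(child), path + [child]))
        if not candidates:
            return None
        candidates.sort(key=lambda c: c[0])
        beam = candidates[:width]               # prune everything else
    return None
```

The pruning idea above fits in naturally: any candidate whose lower bound on remaining ticks already exceeds the best complete solution found so far can be dropped before it enters the beam.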
Let's get a few facts out.
1) The only way to know for sure which decision is the best is to test every possible decision and evaluate the outcome based on some criteria.
2) We are highly unlikely to have the time to decide to go through every possible decision, so we have to limit how far in the future we will evaluate the decision.
3) We are highly unlikely to make the best move - not just most of the time, but ever. Unless you have only a couple of decisions, chances are that every time you make a decision, there was a better one you didn't get to.
4) We can use how our previous decisions worked out to our advantage.
Putting all this together: let's say that when we have a decision, we evaluate what happens 30 ticks into the future. Thirty ticks later we can check whether what actually happened matches what we simulated 30 ticks ago. If it did, we know that decision leads to predictable outcomes and we should use that decision less. If it didn't, or if it turned out better than we hoped, we should use that decision more.
Ideally, you would use your logic in a ... simulation of your simulation ... for purposes of evaluating it. Then when you get to the 'real' simulation, you have a better chance of picking your better decisions earlier. Of course, give a higher weight to the results of your actual simulation than to those of your simulated simulation.
I am interested in writing a twenty questions algorithm similar to what akinator and, to a lesser extent, 20q.net uses. The latter seems to focus more on objects, explicitly telling you not to think of persons or places. One could say that akinator is more general, allowing you to think of literally anything, including abstractions such as "my brother".
The problem with this is that I don't know what algorithm these sites use, but from what I read they seem to be using a probabilistic approach in which questions are given a certain fitness based on how many times they have led to correct guesses. This SO question presents several techniques, but rather vaguely, and I would be interested in more details.
So, what could be an accurate and efficient algorithm for playing twenty questions?
I am interested in details regarding:
What question to ask next.
How to make the best guess at the end of the 20 questions.
How to insert a new object and a new question into the database.
How to query (1, 2) and update (3) the database efficiently.
I realize this may not be easy and I'm not asking for code or a 2000-word presentation. Just a few sentences about each operation and the underlying data structures should be enough to get me started.
Update, 10+ years later
I'm now hosting a (WIP, but functional) implementation here: https://twentyq.evobyte.org/ with the code here: https://github.com/evobyte-apps/open-20-questions. It's based on the same rough idea listed below.
Well, over three years later, I did it (although I didn't work full time on it). I hosted a crude implementation at http://twentyquestions.azurewebsites.net/ if anyone is interested (please don't teach it too much wrong stuff yet!).
It wasn't that hard, but I would say it's the non-intuitive kind of not hard that you don't immediately think of. My methods include some trivial fitness-based ranking, ideas from reinforcement learning and a round-robin method of scheduling new questions to be asked. All of this is implemented on a normalized relational database.
My basic ideas follow. If anyone is interested, I will share code as well, just contact me. I plan on making it open source eventually, but once I have done a bit more testing and reworking. So, my ideas:
an Entities table that holds the characters and objects played;
a Questions table that holds the questions, which are also submitted by users;
an EntityQuestions table holds entity-question relations. This holds the number of times each answer was given for each question in relation to each entity (well, for those entities the question was asked about, anyway). It also has a Fitness field, used for ranking questions from "more general" down to "more specific";
a GameEntities table is used for ranking the entities according to the answers given so far for each on-going game. An answer of A to a question Q pushes up all the entities for which the majority answer to question Q is A;
The first question asked is picked from those with the highest sum of fitnesses across the EntityQuestions table;
Each next question is picked from those with the highest fitness associated with the currently top entries in the GameEntities table. Questions for which the expected answer is Yes are favored even before the fitness, because these have more chances of consolidating the current top ranked entity;
If the system is quite sure of the answer even before all 20 questions have been asked, it will start asking questions not associated with its answer, so as to learn more about that entity. This is done in a round-robin fashion from the global questions pool right now. Discussion: is round-robin fine, or should it be fully random?
Premature answers are also given under certain conditions and probabilities;
Guesses are given based on the rankings in GameEntities. This allows the system to account for lies as well, because it never eliminates any possibility, just decreases its likelihood of being the answer;
After each game, the fitness and answers statistics are updated accordingly: fitness values for entity-question associations decrease if the game was lost, and increase otherwise.
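As a rough sketch of the ranking step in code (illustrative only, not the implementation linked above):

```python
# An answer A to question Q pushes up every entity whose majority recorded
# answer to Q is also A, and pushes the others down.
def update_rankings(game_entities, entity_questions, question, answer):
    # game_entities:    {entity: current game score}
    # entity_questions: {(entity, question): {"yes": count, "no": count}}
    for entity in game_entities:
        counts = entity_questions.get((entity, question), {"yes": 0, "no": 0})
        majority = "yes" if counts["yes"] >= counts["no"] else "no"
        game_entities[entity] += 1 if majority == answer else -1
```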
I can provide more details if anyone is interested. I am also open to collaborating on improving the algorithms and implementation.
This is a very interesting question. Unfortunately I don't have a full answer, let me just write down the ideas I could come up with in 10 minutes:
If you are able to halve the set of available answers on each question, you can distinguish between 2^20 ~ 1 million "objects". Your set is probably going to be larger, so it's right to assume that sometimes you have to make a guess.
You want to maximize utility. Some objects are chosen more often than others. If you want to make good guesses you have to take into consideration the weight of each object (= the probability of that object being picked) when creating the tree.
If you trust your users a little bit you can gain knowledge based on their answers. This also means that you cannot use a static tree to ask questions, because then you'll always get answers to the same questions, and you'll learn nothing new when you encounter the same object.
If a simple question is not able to divide the set into two halves, you could combine questions to get better results, e.g.: "Is the object green or blue?", "Is it green or does it have a round shape?"
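A tiny sketch of that weighted-halving idea (the data layout is assumed for illustration):

```python
# Pick the question whose weighted yes/no split over the remaining candidates
# is closest to 50/50; weight[obj] reflects how often obj is actually picked.
def best_question(questions, candidates, weight, answers):
    # answers[obj][q] is True/False: whether obj's answer to question q is "yes"
    total = sum(weight[o] for o in candidates)
    def imbalance(q):
        yes = sum(weight[o] for o in candidates if answers[o][q])
        return abs(yes / total - 0.5)
    return min(questions, key=imbalance)
```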
I am trying to write a python implementation using a naïve Bayesian network for learning, and minimizing the expected entropy after the question has been answered as the criterion for selecting a question (with an epsilon chance of selecting a random question in order to learn more about that question), following the ideas in http://lists.canonical.org/pipermail/kragen-tol/2010-March/000912.html. I have put what I got so far on github.
Preferably choose questions with low remaining entropy expectation. (For putting together something quickly, I stole from ε-greedy multi-armed bandit learning and use: With probability 1–ε: Ask the question with the lowest remaining entropy expectation. With probability ε: Ask any random question. However, this approach seems far from optimal.)
Since my approach is a Bayesian network, I obtain the probabilities of the objects and can ask for the most probable object.
A new object is added as a new column to the probabilities matrix, with a low a priori probability and with the answers to the questions filled in as given where they were given, or as guessed by the Bayes network where they were not. (I expect that this second part would work much better if I added Bayes network structure learning instead of just using naive Bayes.)
Similarly, a new question is a new row in the matrix. If it comes from user input, probably only very few answer probabilities are known, the rest needs to be guessed. (In general, if you can get objects by asking for properties, you can obtain properties by asking if given objects have them or not, and the transformation between these is essentially Bayes' theorem and breaks down to transposition in the easiest case. The guessing quality should improve again once the network has an appropriate structure.)
(This is a problem, since I calculate lots of probabilities. My goal is to do it using database-oriented sparse tensor calculations optimized for working with weighted directed acyclic graphs.)
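To make the update step concrete, here is a rough sketch of a naive-Bayes posterior over objects given the answers so far; the structure and names are illustrative, not the code in the repository:

```python
# P(object | answers) is proportional to P(object) * product of P(answer_i | object),
# then normalized over all objects.
def posterior(prior, p_yes, given_answers):
    # prior[obj]    : prior probability of each object (how often it is played)
    # p_yes[obj][q] : P(answer "yes" to question q | obj), learned from past games
    # given_answers : {question: True/False} collected in the current game
    scores = {}
    for obj, p in prior.items():
        for q, ans in given_answers.items():
            p *= p_yes[obj][q] if ans else (1.0 - p_yes[obj][q])
        scores[obj] = p
    total = sum(scores.values()) or 1.0
    return {obj: s / total for obj, s in scores.items()}
```

The most probable object under this distribution is the guess, and the expected remaining entropy of this distribution (averaged over a question's possible answers) is what drives the question selection described above.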
It would be interesting to see how good a decision tree based algorithm would serve you. The trick here is purely in the learning/sorting of the tree. I'd like to note that this is stuff I remember from AI class and student work in the AI working group and should be taken with a semi-large grain (or nugget) of salt.
To answer the questions:
You just walk the tree :)
This is a big downside of decision trees. You'd only have one guess that can be attached to the end nodes of the tree at depth 20 (or earlier, if the tree is still sparse).
There are whole books dedicated to this topic. As far as I remember from AI class, you try to minimize entropy at all times, so you want to ask questions that ideally divide the set of remaining objects into two sets of equal size. I'm afraid you'd have to look this up in AI books.
Decision trees are highly efficient during the query phase, as you literally walk the tree and follow the 'yes' or 'no' branch at each node. Update efficiency depends on the learning algorithm applied. You might be able to do this offline as in a nightly batched update or something like that.
I'm developing an application that optimally assigns shifts to nurses in a hospital. I believe this is a linear programming problem with discrete variables, and therefore probably NP-hard:
For each day, each nurse (ca. 15-20) is assigned a shift
There is a small number (ca. 6) of different shifts
There is a considerable number of constraints and optimization criteria, either concerning a day or concerning an employee, e.g.:
There must be a minimum number of people assigned to each shift every day
Some shifts overlap so that it's OK to have one less person in early shift if there's someone doing intermediate shift
Some people prefer early shift, some prefer late shift, but a minimum number of shift changes is required to still get the higher shift-work pay.
It's not allowed for one person to work late shift one day and early shift the next day (due to minimum resting time regulations)
Meeting assigned working week lengths (different for different people)
...
So basically there is a large number (about 20*30 = 600) of variables that can each take a small number of discrete values.
Currently, my plan is to use a modified Min-conflicts algorithm:
start with random assignments
have a fitness function for each person and each day
select the person or day with the worst fitness value
select at random one of the assignments for that day/person and set it to the value that results in the optimal fitness value
repeat until either a maximum number of iterations is reached or no improvement can be found for the selected day/person
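In rough Python, that plan might look something like this; the fitness function, the shift domain and the roster representation are placeholders to be filled in from the real constraint model:

```python
import random

# Min-conflicts-style sketch: roster is a dict keyed by (person, day), fitness
# scores one person's assignments (higher = better), shifts is the value domain.
def improve(roster, people, days, shifts, fitness, max_iters=10000):
    for _ in range(max_iters):
        person = min(people, key=lambda p: fitness(roster, p))   # worst fitness
        day = random.choice(days)               # one of their assignments at random
        current = score(roster, person, day, roster[person, day], fitness)
        best = max(shifts, key=lambda s: score(roster, person, day, s, fitness))
        if score(roster, person, day, best, fitness) <= current:
            break                               # no single-change improvement found
        roster[person, day] = best
    return roster

def score(roster, person, day, shift, fitness):
    old = roster[person, day]
    roster[person, day] = shift                 # try the change, score it, undo it
    value = fitness(roster, person)
    roster[person, day] = old
    return value
```

Selecting the worst day instead of the worst person, or wrapping the loop in simulated annealing, would slot into the same structure.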
Any better ideas? I am somewhat worried that it will get stuck in a local optimum. Should I use some form of simulated annealing? Or consider not only changes in one variable at a time, but specifically switches of shifts between two people (the main component in the current manual algorithm)? I want to avoid tailoring the algorithm to the current constraints since those might change.
Edit: it's not necessary to find a strictly optimal solution; the roster is currently done manually, and I'm pretty sure the result is considerably sub-optimal most of the time - it shouldn't be hard to beat that. Short-term adjustments and manual overrides will also definitely be necessary, but I don't believe this will be a problem; marking past and manual assignments as "fixed" should actually simplify the task by reducing the solution space.
This is a difficult problem to solve well. There have been many academic papers on this subject, particularly in the Operations Research field - see for example nurse rostering papers 2007-2008 or just google "nurse rostering operations research". The complexity also depends on aspects such as: how many days to solve for; what types of "requests" the nurses can make; whether the roster is "cyclic"; whether it is a long-term plan or needs to handle short-term rostering "repair" such as sickness and swaps, etc.
The algorithm you describe is a heuristic approach.
You may find you can tweak it to work well for one particular instance of the problem but as soon as "something" is changed it may not work so well (e.g. local optima, poor convergence).
However, such an approach may be adequate depending on your particular business needs - e.g. how important it is to get the optimal solution, whether the problem outline you describe is expected to stay the same, what the potential savings are (money and resources), how important the nurses' perception of the quality of their rosters is, what the budget for this work is, etc.
Umm, did you know that some ILP solvers do quite a good job? Try AIMMS, Mathematica or the GNU Linear Programming Kit! 600 variables is of course a lot more than Lenstra's theorem will solve easily, but sometimes these ILP solvers have a good handle on it, and in AIMMS you can modify the branching strategy a little. Plus, there's a really fast 100%-approximation for ILPs.
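As a tiny illustration of what an ILP formulation of this problem can look like - using the PuLP library here purely as an example front-end, not one of the solvers named above, and with made-up staffing numbers:

```python
import pulp

# Toy formulation: binary variable x[nurse][day][shift] = 1 if that assignment is made.
nurses, days, shifts = range(15), range(7), ["early", "inter", "late", "off"]
prob = pulp.LpProblem("rostering", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (nurses, days, shifts), cat="Binary")

for n in nurses:
    for d in days:
        prob += pulp.lpSum(x[n][d][s] for s in shifts) == 1     # one shift per day

for d in days:
    for s in ("early", "inter", "late"):
        prob += pulp.lpSum(x[n][d][s] for n in nurses) >= 3     # minimum staffing

for n in nurses:
    for d in days[:-1]:
        prob += x[n][d]["late"] + x[n][d + 1]["early"] <= 1     # resting-time rule

# Placeholder objective; real preferences and soft constraints become weighted costs.
prob += pulp.lpSum(x[n][d]["late"] for n in nurses for d in days)
prob.solve()
print(pulp.LpStatus[prob.status])
```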
I solved a shift assignment problem for a large manufacturing plant recently. First we tried generating purely random schedules and returning any one which passed the is_schedule_valid test - the fallback algorithm. This was, of course, slow and indeterminate.
Next we tried genetic algorithms (as you suggested), but couldn't find a good fitness function that closed on any viable solution (because the smallest change can make the entire schedule RIGHT or WRONG - no points for almost).
Finally we chose the following method (which worked great!):
Randomize the input set (i.e. jobs, shift, staff, etc.).
Create a valid tuple and add it to your tentative schedule.
If no valid tuple can be created, roll back (and increment) the last tuple added.
Pass the partial schedule to a function that tests could_schedule_be_valid, that is: could this schedule still be valid if the remaining tuples were filled in in some possible way?
If !could_schedule_be_valid, simply rollback (and increment) the tuple added in (2).
If schedule_is_complete, return schedule
Goto (2)
You incrementally build a partial schedule this way. The benefit is that some tests for a valid schedule can easily be done in Step 2 (pre-tests), and others must remain in Step 5 (post-tests).
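In outline, that loop might look like the following; the candidate generation and the could_schedule_be_valid test carry all the domain knowledge and are left as placeholders:

```python
# Incremental build with rollback, as described above. candidates_for(slot)
# yields the tuples to try for a slot; randomizing its order covers Step 1.
def build_schedule(slots, candidates_for, could_schedule_be_valid):
    schedule = []                                  # chosen (slot, tuple) pairs so far
    options = [iter(candidates_for(slots[0]))]     # one candidate iterator per depth
    while len(schedule) < len(slots):
        try:
            choice = next(options[-1])             # next tuple to try at this depth
        except StopIteration:                      # nothing left here: roll back
            options.pop()
            if not schedule:
                return None                        # every possibility exhausted
            schedule.pop()                         # "increment" the previous tuple
            continue
        schedule.append((slots[len(schedule)], choice))
        if not could_schedule_be_valid(schedule):
            schedule.pop()                         # reject, keep trying this depth
        elif len(schedule) < len(slots):
            options.append(iter(candidates_for(slots[len(schedule)])))
    return schedule                                # complete and passed every test
```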
Good luck. We wasted days trying the first two algorithms, but got the recommended algorithm generating valid schedules instantly in under 5 hours of development.
Also, we supported pre-fixing and post-fixing of assignments that the algorithm would respect. You simply don't randomize those slots in Step 1. You'll find that the solution doesn't have to be anywhere near optimal. Our solution is O(N*M) at a minimum, but it executes in PHP(!) in less than half a second for an entire manufacturing plant. The beauty is in ruling out bad schedules quickly using a good could_schedule_be_valid test.
The people that are used to doing it manually don't care if it takes an hour - they just know they don't have to do it manually any more.
Mike,
Don't know if you ever got a good answer to this, but I'm pretty sure that constraint programming is the ticket. While a GA might give you an answer, CP is designed to give you many answers or tell you if there is no feasible solution. A search on "constraint programming" and scheduling should bring up lots of info. It's a relatively new area and CP methods work well on many types of problems where traditional optimization methods bog down.
Dynamic programming à la Bellman? Kinda sounds like there's a place for it: overlapping subproblems, optimal substructure.
One thing you can do is to try to look for symmetries in the problem. E.g. can you treat all nurses as equivalent for the purposes of the problem? If so, then you only need to consider nurses in some arbitrary order -- you can avoid considering solutions such that any nurse i is scheduled before any nurse j where i > j. (You did say that individual nurses have preferred shift times, which contradicts this example, although perhaps that's a less important goal?)
I think you should use genetic algorithm because:
It is best suited for large problem instances.
It yields reduced time complexity at the price of an inaccurate answer (not necessarily the ultimate best).
You can specify constraints & preferences easily by adjusting fitness penalties for ones that are not met.
You can specify a time limit for program execution.
The quality of the solution depends on how much time you intend to spend solving the problem.
Genetic Algorithms Definition
Genetic Algorithms Tutorial
Class scheduling project with GA
Also take a look at: a similar question and another one.
Using CSP programming I made programs for automatic shift rostering, e.g.:
2-shift system - tested for 100+ nurses, 30-day time horizon, 10+ rules
3-shift system - tested for 80+ nurses, 30-day time horizon, 10+ rules
3-shift system, 4 teams - tested for a 365-day horizon, 10+ rules
and a couple of similar systems. All of them were tested on my home PC (1.8 GHz, dual-core). Execution times were always acceptable, i.e. for the third system it took around 5 min and 300 MB RAM.
The hardest part of this problem was selecting the proper solver and the proper solving strategy.
Metaheuristics did very well at the International Nurse Rostering Competition 2010.
For an implementation, see this video of continuous nurse rostering (Java).