"Least frequently used" - algorithm - algorithm

I am building an application that is supposed to extract a mission for the user from a finite mission pool. The thing is that I want:
that the user won't get the same mission twice,
that the user won't get the same missions as his friends (in the application) until some time has passed.
To summarize my problem, I need to extract the least common mission out of the pool.
Can someone please point me to known algorithms for finding the least common item (LFU)?
I also need the theoretical aspect, so if someone knows some articles or research papers about this (from known magazines like Scientific American) that would be great.

For getting the least frequently used mission, simply give every mission a counter that counts how many times it was used. Then search for the mission with the lowest counter value.
For getting the mission that was least frequently used by a group of friends, you can store for every user the missions he/she has done (and the number of times). This information is probably useful anyway. Then when a new mission needs to be chosen for a user, a (temporary) combined list of used missions and their frequencies by the users and all his friends can easily be created and sorted by frequency. This is not very expensive.
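The combined-counter idea above can be sketched in a few lines of Python. The function name `least_used_mission` and the shape of `usage_by_user` (user -> mission -> count) are assumptions made for illustration; ties are broken by pool order.

```python
from collections import Counter

def least_used_mission(missions, usage_by_user, user, friends):
    """Pick the mission used least often by `user` and `friends` combined."""
    combined = Counter()
    for person in [user, *friends]:
        # Add this person's per-mission counts to the temporary combined tally.
        combined.update(usage_by_user.get(person, {}))
    # Missions nobody in the group has done count as 0; ties keep pool order.
    return min(missions, key=lambda m: combined[m])

missions = ["m1", "m2", "m3"]
usage_by_user = {
    "alice": {"m1": 2, "m2": 1},
    "bob": {"m2": 1, "m3": 1},
}
choice = least_used_mission(missions, usage_by_user, "alice", ["bob"])
```

Building the tally costs time proportional to the history of the user and their friends, and no sort is needed if only the minimum is wanted.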

Based on your two requirements, I don't see what the "LEAST" used mission has to do with this. You said you want non-repeating missions.
OPTION 1:
What container do you use to hold all missions? Assume it's a list. When you or your friend chooses a mission, move that mission to the end of the list (swap it with the last unused mission). Now you have split your initial list into 2 sublists. The first part holds unused missions, and the second part holds used missions. Keep track of the pivot/index which separates the 2 lists.
Now every time you or your friends choose a new mission, it is chosen from the first sublist. Then move it into the second sublist and update the pivot.
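A minimal sketch of this pivot scheme, assuming a random draw from the unused part (the class name `MissionPool` and the random choice are illustrative, not part of the suggestion above):

```python
import random

class MissionPool:
    def __init__(self, missions):
        self.missions = list(missions)
        # Pivot: missions[:unused] are unused, missions[unused:] are used.
        self.unused = len(self.missions)

    def draw(self, rng=random):
        """Return a mission that has not been handed out yet."""
        if self.unused == 0:
            raise LookupError("no unused missions left")
        i = rng.randrange(self.unused)
        last = self.unused - 1
        # Swap the chosen mission to the end of the unused part, then shrink it.
        self.missions[i], self.missions[last] = self.missions[last], self.missions[i]
        self.unused = last
        return self.missions[last]

pool = MissionPool(["a", "b", "c", "d"])
drawn = [pool.draw() for _ in range(4)]
```

Each draw is O(1), and resetting the pool (for example after "some time has passed") only requires setting `unused` back to the list length.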
OPTION 2:
If you do repeat missions eventually, but first choose the ones which have been chosen the fewest times, then you can make your container a min-heap. Add a usage counter to each mission and add the missions to the heap keyed on that counter. Extract a mission, increment its usage counter, then put it back into the heap. This is a good solution, but depending on how simple your program is, you could even use a circular buffer.
It would be nice to know more about what you're building :)

I think the structure you need is a min-heap. It allows extraction of the minimum in O(Log(n)) and it allows you to increase the value of an item in O(Log(n)) too.
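As a rough sketch using Python's standard `heapq` module (the helper names `make_heap` and `next_mission` are made up for this example), each mission is stored as a `(count, name)` pair so the least-used one is always at the top:

```python
import heapq

def make_heap(missions):
    # Every mission starts with a usage count of 0.
    heap = [(0, name) for name in missions]
    heapq.heapify(heap)
    return heap

def next_mission(heap):
    """Pop the least-used mission, bump its counter, and push it back."""
    count, name = heapq.heappop(heap)
    heapq.heappush(heap, (count + 1, name))
    return name

heap = make_heap(["a", "b", "c"])
picks = [next_mission(heap) for _ in range(4)]
```

Here the "increase the value" step is done as pop-then-push, which keeps both operations at O(log n) without needing to locate an item inside the heap.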

A good start is Edmonds' blossom algorithm (for example the Blossom V implementation) for a minimum-cost perfect matching in a general graph. If you have a bipartite graph, you can look at the Floyd-Warshall algorithm to find shortest paths. Maybe you can also use a topological sort, but I don't know, because these algorithms are really hard to learn.

Related

Collision Management in a Simulation with Discrete Motion

I am building a simulation in which items (like chess pieces) move on a discrete set of positions that do not follow a sequence (like positions on a chessboard) according to a schedule.
Each position can hold only one item at any given time. The schedule could ask multiple items to move at the same time. If the destination position is occupied, the scheduled movement is cancelled.
Here is the question: if item A and item B, originally situated at position 1 and position 2 respectively, are scheduled to move simultaneously to their next positions position 2 and position 3, how do I make sure that item A gets to position 2, hopefully in an efficient design?
The reason I ask is that, naively, I would check whether position 2 is occupied before moving item A into it. If the check happens before item B is moved out of the way, item A would not move when in fact it should. Because the positions do not follow a sequence, it is not obvious which one to check first. You can imagine that things get messy if many items want to move at the same time. In the extreme case, a full chessboard of items should be allowed to move/rearrange themselves, but the naive check may not be able to facilitate that.
Is there a common practice for handling such "nonexistent collisions"? Ideas and references are all welcome.
Two researchers, Ahmed Al Rowaei and Arnold Buss, published a paper in 2010 investigating the impact that using discrete time steps has on model accuracy/fidelity when the real-world system is event-based. There was also some follow-on work in 2011 with their colleague Stephen Lieberman. A major finding was that if you use time stepped models, order of execution matters and can cause the models to deviate from real-world behaviors in significant ways. Time-stepped models generally require you to introduce tie-breaking logic which doesn't exist in the real system. Logic that is needed for the model but doesn't exist in reality is called a "modeling artifact," and can lead to increased model complexity and inaccuracies. Systematic collision resolution schemes can lead to systematic biases.
Their recommendation was to build models based on continuous time. Events are scheduled using the actual (continuous) event times, which determine the order of event execution as in the real-world system. This occasionally (but rarely) requires priority tie breaking based on event type, so that (for example) departure events occur before arrival events if both were to occur at the exact same time.
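A small sketch of such an event list, assuming events are `(time, kind, item)` tuples and that the `PRIORITY` table below is a hypothetical tie-breaking rule (departures before arrivals at equal times):

```python
import heapq

# Lower number = executed first when times are equal (hypothetical priorities).
PRIORITY = {"depart": 0, "arrive": 1}

def run(events):
    """events: iterable of (time, kind, item). Returns them in execution order."""
    # The sequence number keeps the ordering stable for fully identical keys.
    queue = [(time, PRIORITY[kind], seq, kind, item)
             for seq, (time, kind, item) in enumerate(events)]
    heapq.heapify(queue)
    order = []
    while queue:
        time, _, _, kind, item = heapq.heappop(queue)
        order.append((time, kind, item))
    return order

order = run([(2.0, "arrive", "A"), (2.0, "depart", "B"), (1.5, "arrive", "C")])
```

With continuous event times, exact ties like the one at time 2.0 should be rare, so the priority table is only a fallback.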
If you insist on sticking with time-stepped models, a different strategy is to use two or more passes at each time step. The first pass lays out the desired state transitions and identifies potential conflicts, the last pass applies the actual transitions after conflicts have been resolved. The resolution process might be do-able in the initial setup pass, or may require additional passes if it's sufficiently complex.
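The multi-pass idea can be sketched as follows. This is one possible resolution scheme, not the one from the cited papers: the function name, the dictionaries, and the "first in schedule order wins" tie-break are all assumptions. Moves into a position whose occupant stays put are cancelled repeatedly until nothing changes, so chains (A into B's old spot) and cycles survive.

```python
def resolve_moves(positions, wanted):
    """positions: item -> current position; wanted: item -> desired position.
    Returns the new item -> position mapping after one time step."""
    # Pass 1: record intents. If several items want the same destination,
    # only the first one in `wanted` order keeps its move (arbitrary tie-break).
    moves, claimed = {}, set()
    for item, dest in wanted.items():
        if dest != positions[item] and dest not in claimed:
            moves[item] = dest
            claimed.add(dest)

    # Pass 2: cancel moves into positions whose occupant is staying put.
    # Cancelling one move can block another, so repeat until stable.
    changed = True
    while changed:
        changed = False
        staying = {pos for item, pos in positions.items() if item not in moves}
        for item, dest in list(moves.items()):
            if dest in staying:
                del moves[item]
                changed = True

    # Final pass: apply all surviving moves at once.
    return {item: moves.get(item, pos) for item, pos in positions.items()}

after = resolve_moves({"A": 1, "B": 2}, {"A": 2, "B": 3})
```

For the example in the question, item A ends up at position 2 and item B at position 3, regardless of the order in which the two moves are listed.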

How to choose matchups in an ELO ratings system as matchups accumulate

I'm working on a crowdsourced app that will pit about 64 fictional strongmen/strongwomen from different franchises against one another and try and determine who the strongest is. (Think "Batman vs. Spiderman" writ large). Users will choose the winner of any given matchup between two at a time.
After researching many sorting algorithms, I found this fantastic SO post outlining the ELO rating system, which seems absolutely perfect. I've read up on the system and understand both how to award/subtract points in a matchup and how to calculate the performance rating between any two characters based on past results.
What I can't seem to find is any efficient and sensible way to determine which two characters to pit against one another at a given time. Naturally it will start off randomly, but quickly points will accumulate or degrade. We can expect a lot of disagreement but also, if I design this correctly, a large amount of user participation.
So imagine you arrive at this feature after 50,000 votes have been cast. Given that we can expect all sorts of non-transitive results under the hood, and a fair amount of deviance from the performance ratings, is there a way to calculate which matchups I most need more data on? It doesn't seem as simple as choosing two adjacent characters in a sorted list with the closest scores, or just focusing at the top of the list.
With 64 entrants (and yes, I did consider and reject a bracket!), I'm not worried about recomputing the performance ratings after every matchup. I just don't know how to choose the next one, seeing as we'll be ignorant of each voter's biases and favorite characters.
The amazing variation that you experience with multiplayer games is that different people with different ratings "queue up" at different times.
By the ELO system, ideally all players should be matched up with an available player with the closest score to them. Since, if I understand correctly, the 64 "players" in your game are always available, this combination leads to lack of variety, as optimal match ups will always be, well, optimal.
To resolve this, I suggest implementing a priority queue, based on when your "players" feel like playing again. For example, if one wants to take a long break, they may receive a low priority and be placed towards the end of the queue, meaning it will be a while before you see them again. If one wants to take a short break, maybe after about 10 matches, you'll see them in a match again.
This "desire" can be done randomly, and you can assign different characteristics to each character to skew this behaviour, such as, "winning against a higher ELO player will make it more likely that this player will play again sooner". From a game design perspective, these personalities would make the characters seem more interesting to me, making me want to stick around.
So here you have an ordered list of players who want to play. I can think of three approaches you might take for the actual matchmaking:
Peek at the first 5 players in the queue and pick the best match up
Match the first player with their best match in the next 4 players in the queue (presumably waited the longest so should be queued immediately, regardless of the fairness of the match up)
A combination of both, where if the person at the head of the list doesn't get picked, they'll increase in "entropy", which affects the ELO calculation making them more likely to get matched up
Edit
On an implementation perspective, I'd recommend using a delta list instead of an actual priority queue since players should be "promoted" as they wait.
To avoid obvious winner vs. loser situations, you can group the players in tiers.
Obviously, initially everybody will be in the same tier [0 - N1].
Then within the tier you make a rotational schedule so that every pair of parties can "match" at least once.
However, if you don't want to maintain a schedule, then always match with the party who has participated in the fewest "matches". If there are several of those, make a random pick.
This way you ensure that everybody participates in roughly the same number of "matches".
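A minimal sketch of the "fewest matches first" rule, ignoring tiers for brevity. The function name `pick_matchup` and the `match_counts` dictionary (player -> matches played) are hypothetical:

```python
import random

def pick_matchup(match_counts, rng=random):
    """Return two distinct players, favouring those with the fewest matches."""
    def pick(exclude=None):
        candidates = [p for p in match_counts if p != exclude]
        fewest = min(match_counts[p] for p in candidates)
        # Random pick among everyone tied for the fewest matches.
        return rng.choice([p for p in candidates if match_counts[p] == fewest])

    first = pick()
    second = pick(exclude=first)
    match_counts[first] += 1
    match_counts[second] += 1
    return first, second

counts = {"p1": 3, "p2": 0, "p3": 1, "p4": 3}
pair = pick_matchup(counts)
```

To combine this with tiers, the same function could be called on a dictionary restricted to the players of one tier.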

Algorithms for Minimum resource requirements

I have a question for which I have made some solutions, but I am not happy with the scalability. I'm looking for input of some different approaches / algorithms to solving it.
Problem:
Software can run on electronic controllers (ECUs) and requires
different resources to run a given feature. It may require a given
amount of storage or RAM or a digital or Analog Input or Output for
instance. If we have multiple features and multiple controller options
we want to find the combination that minimizes the hardware
requirements (cost). I'll simplify the resources to letters to
simplify the understanding.
Example 1:
Feature1(A)
ECU1(A,B,C)
First, a trivial example. Let's assume that a feature requires 1 unit of resource A, and the ECU has 1 unit each of resources A, B and C available. It is obvious that the feature will fit in the ECU with resources B & C left over.
Example 2:
Feature2(A,B)
ECU2(A|B,B,C)
In this example, Feature 2 requires resources A and B, and the ECU has 3 resources, the first of which can be A or B. In this case, you can again see that the feature will fit in the ECU, but only if you check in a certain order. If you assign F(A) to E(A|B), then F(B) to E(B), it works, but if you assign F(B) to E(A|B) then there is no resource left on the ECU for F(A), so it doesn't appear to fit. This would lead one to the observation that we should prefer non-OR'd resources first to avoid such a conflict.
An example of the above is an analog input that can also be used as a digital input.
Example 3
Feature3(A,B,C)
ECU3(A|B|C, B|C, A|C)
Now things are a little bit more complicated, but it is still quite obvious to a person that the feature will fit into the ECU.
My problems are simply scaled-up versions of these examples (i.e. multiple features per ECU, with more ECUs to choose from).
Algorithms
GA
My first approach to this was to use a genetic algorithm. For a given set of features i.e. F(A,B,C,D), and a list of currently available ECUs find which single or combination of ECUs fit the requirements.
ECUs would initially be randomly selected, and features were checked for fit and added to them. If a feature didn't fit, another ECU was added to the architecture. A population of these architectures was created and ranked by the lowest cost of housing all the features. Architectures could then be mated in successive generations, with mutations and such, to improve fitness.
This approach worked quite well, but tended to get stuck in local minima (not the cheapest option), judging by a golden example I had worked out by hand.
Combinatorial / Permutations
My next approach was to work out all of the possible permutations (the ORs from above) for an ECU to see if the features fit.
If we go back to example 2 and expand the ORs we get 2 permutations;
Feature2(A,B)
ECU2(A|B,B,C) = (A,B,C), (B,B,C)
From here it is trivial to check that the feature fits in the first permutation, but not the second.
...and for example 3 there are 12 permutations
Feature3(A,B,C)
ECU3(A|B|C, B|C, A|C) = (A,B,A), (B,B,A), (C,B,A), (A,C,A), (B,C,A), (C,C,A), (A,B,C), (B,B,C), (C,B,C), (A,C,C), (B,C,C), (C,C,C)
Again it is trivial to check that feature 3 fits in at least one of the permutations (3rd, 5th & 7th).
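For reference, the expansion and check described above can be reproduced with `itertools.product`. In this sketch each ECU pin is written as a string of its single-letter alternatives, and the function name `fitting_permutations` is made up for the example:

```python
from collections import Counter
from itertools import product

def fitting_permutations(feature, ecu):
    """feature: required resources; ecu: pins, each a string of alternatives.
    Returns the expanded ECU permutations that can host the feature."""
    need = Counter(feature)
    fits = []
    for combo in product(*ecu):  # one choice per OR'd pin
        have = Counter(combo)
        if all(have[r] >= n for r, n in need.items()):
            fits.append(combo)
    return fits

ecu3 = ["ABC", "BC", "AC"]          # ECU3(A|B|C, B|C, A|C)
all_perms = list(product(*ecu3))    # 3 * 2 * 2 permutations
fits3 = fitting_permutations("ABC", ecu3)
```

The number of permutations is the product of the number of alternatives per pin, which is why it reaches the millions for ECUs with many OR'd inputs.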
With this approach I was also able to get a solution, but I have ECUs with so many OR'd inputs that there are millions of ECU permutations, which drastically increases the run time (to minutes). I can live with this, but first wanted to see if there was a better way to skin the cat, apart from parallelizing this approach.
So that is the problem...
I have more ideas on how to approach it, but assume that there is a fancy name for such a problem or the name of the algorithm that has been around for 20+ years that I'm not familiar with and I was hoping someone could point me in that direction to either some papers or the names of relevant algorithms.
The obvious remark of simply summing the feature resource requirements and creating a new monolithic ECU is not an option. Lastly, no, this is not in any way associated with any assignment or problem given by a school or university.
Sorry for the long question, but hopefully I've sufficiently described what I am trying to do and this piques the interest of someone out there.
Sincerely, Paul.
It looks like fitting an individual feature can be solved as bipartite matching.
You build a bipartite graph:
the left side corresponds to the feature requirements
the right side corresponds to the ECU subnodes
edges connect each pair of left and right vertices that share a letter
Let me explain by example 2:
Feature2(A,B)
ECU2(A|B,B,C)
How the graph looks:
2 left vertices: L1 (A), L2 (B)
3 right vertices: R1 (A|B), R2 (B), R3 (C)
3 edges: L1-R1 (A-A|B), L2-R1 (B-A|B), L2-R2 (B-B)
Then you find a maximum matching in this unweighted bipartite graph. There are a few well-known algorithms for it:
https://en.wikipedia.org/wiki/Matching_(graph_theory)
If the maximum matching covers every feature vertex, we can use it to plug in the feature.
If the maximum matching does not cover every feature vertex, we are short of resources.
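One simple way to run this matching test is the augmenting-path method (often called Kuhn's algorithm); faster alternatives such as Hopcroft-Karp exist. The sketch below assumes each ECU pin is given as a set of the resources it can act as, and the function name `feature_fits` is hypothetical:

```python
def feature_fits(feature, ecu):
    """feature: list of required resources; ecu: list of pins, each a set of
    resources the pin can act as. True if every requirement gets its own pin."""
    match = {}  # pin index -> index of the requirement assigned to it

    def assign(req, seen):
        for pin, options in enumerate(ecu):
            if feature[req] in options and pin not in seen:
                seen.add(pin)
                # Take a free pin, or evict its current user if that one can move.
                if pin not in match or assign(match[pin], seen):
                    match[pin] = req
                    return True
        return False

    # The feature fits only if every requirement ends up matched.
    return all(assign(req, set()) for req in range(len(feature)))

ecu2 = [{"A", "B"}, {"B"}, {"C"}]            # ECU2(A|B, B, C)
fits_in_bad_order = feature_fits(["B", "A"], ecu2)
```

Requirement B is tried first and grabs the A|B pin, but the augmenting step moves it to the plain B pin when A arrives, so the result no longer depends on the checking order.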
Unfortunately, this approach works like a greedy algorithm. It does not know about upcoming features and does not tweak the solution to fit more features later. Partial optimization for simple cases can work as you described in the question, but in general it is a dead end: only an algorithm that accounts for every feature in the whole feature set can produce an effective overall solution.
You can try to add several features to one ECU simultaneously. If you want to add a new feature to a given ECU, you can rerun the matching with all already assigned features plus the candidate feature. In this case a locally optimal solution will be found for the given feature set (if it is possible to plug them all into one ECU).
I don't have enough reputation to comment, so here's what I wanted to propose for your problem:
Besides GA, there are other approaches you could try, e.g. a Bayesian approach or decision trees.
In my opinion a decision tree will suit your problem because, given some input dataset/attributes, it shows a path to each class (in your case, ECUs), which helps to select the right class/ECU. Train your system with some sample data sets so that it can decide on the right ECU for your actual data set/features.
Check Decision Trees - Machine Learning for more information. Hope it helps!

Algorithm for assigning people based on multiple criteria

I have a list of users which need to be sorted into committees. The users can rank committees based on their particular preference, but must choose at least one to join. When they have all made their selections, the algorithm should sort them as evenly as possible taking into account their committee preference, gender, age, time zone and country (for now). I have looked at this question and its answer would seem like a good choice, but it is unclear to me how to add the various constraints to the algorithm for it to work.
Would anyone point me in the right direction on how to do this, please?
Looking for "clustering" will get you nowhere, because this is not a clustering type of task.
Instead, this is an assignment problem.
For further information, see:
Knapsack Problem
Generalized Assignment Problem
Usually, these are NP-hard to solve. Thus, one will usually choose a greedy optimization heuristic to find a reasonably good solution faster.
Think about how to best assign one person at a time.
Then, process the data as follows:
1. Assign everybody who can only be assigned in a single way.
2. Find an unassigned person who is hard to assign; stop if everybody is assigned.
3. Assign that person in the best possible way.
4. Remove preferences that are no longer admissible, and go to 1 again (there may be a new person with only a single choice left).
For bonus points, add a source of randomness and an overall quality measure. Then run the algorithm 10 times, and keep only the best result.
For a further bonus, add a postprocessing optimization: when can you transfer one person to another group, or swap two persons, to improve the overall quality? Iterate over all persons to find such small improvements until you cannot find any.
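A heavily simplified sketch of the greedy loop above. It only considers ranked preferences and committee capacities (balancing gender, age, time zone and country would need a richer scoring function), and it treats "hard to assign" as "fewest remaining choices". The names `assign_committees`, `preferences` and `capacity` are assumptions:

```python
def assign_committees(preferences, capacity):
    """preferences: person -> committees in order of preference;
    capacity: committee -> number of seats.
    Returns person -> committee (None if no admissible choice is left)."""
    seats = dict(capacity)
    options = {p: list(prefs) for p, prefs in preferences.items()}
    assignment = {}
    while options:
        # Drop committees that are full; this may leave someone with one choice.
        for p in options:
            options[p] = [c for c in options[p] if seats[c] > 0]
        # The person with the fewest remaining choices is the hardest to place,
        # so people with a single choice are automatically handled first.
        person = min(options, key=lambda p: len(options[p]))
        choices = options.pop(person)
        best = choices[0] if choices else None
        assignment[person] = best
        if best is not None:
            seats[best] -= 1
    return assignment

result = assign_committees(
    {"ann": ["x", "y"], "bob": ["x"], "cy": ["y", "x"]},
    {"x": 1, "y": 2},
)
```

In this toy run, bob (one choice) is placed first, so ann loses her first preference but everybody gets a seat.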

How to check user choice algorithm

I have an algorithm that chooses a list of items that should fit the user's likings.
I'll skip the algorithm's details because of confidentiality issues...
Now, I'm trying to think of a way to check it statistically, with a group of people.
The way I'm checking it now is:
The algorithm gets the best results per user.
Shuffle the top 5 results with the lowest 5 results.
Have the person rank the results in order of preference (0 = liked best, 9 = liked least).
Compare the user's results to the algorithm's results.
I'm doing this because I figured that to show that the algorithm chooses good results, I need to put in some bad results and show that the algorithm knows they are bad results as well.
So, what I'm asking is:
Is shuffling top results with low results a good idea?
And if not, do you have an idea on how to get good statistics on how well an algorithm matches user preferences (we have users who can choose stuff)?
First ask yourself:
What am I trying to measure?
Not to rag on the other submissions here, but while mjv and Sjoerd's answers offer some plausible heuristic reasons for why what you are trying to do may not work as you expect; they are not constructive in the sense that they do not explain why your experiment is flawed, and what you can do to improve it. Before either of these issues can be addressed, what you need to do is define what you hope to measure, and only then should you go about trying to devise an experiment.
Now, I can't say for certain what would constitute a good metric for your purposes, but I can offer you some suggestions. As a starting point, you could try using a precision vs. recall graph:
http://en.wikipedia.org/wiki/Precision_and_recall
This is a standard technique for assessing the performance of ranking and classification algorithms in machine learning and information retrieval (i.e. web searching). If you have an engineering background, it could be helpful to understand that precision/recall generalizes the notion of precision/accuracy:
http://en.wikipedia.org/wiki/Accuracy_and_precision
Now let us suppose that your algorithm does something like this: it takes as input some prior data about a user, then returns a ranked list of other items that the user might like. For example, your algorithm is a web search engine and the items are pages; or you have a movie recommender and the items are movies. This sounds pretty close to what you are trying to do now, so let us continue with this analogy.
Then the precision of your algorithm's results on the first n is the fraction of your top n recommendations that the user actually liked:
precision = #(items user actually liked out of top n) / n
And the recall is the number of items that you actually got right out of the total number of items:
recall = #(items correctly marked as liked) / #(items user actually likes)
Ideally, one would want to maximize both of these quantities, but they are in a certain sense competing objectives. To illustrate this, consider a few extremal situations: For example, you could have a recommender that returns everything, which would have perfect recall, but very low precision. A second possibility is to have a recommender that returns nothing or only one sure-fire hit, which would have (in a limiting sense) perfect precision, but almost no recall.
As a result, to understand the performance of a ranking algorithm, people typically look at its precision vs. recall graph. This is just a plot of the precision vs. the recall as the number of items returned is varied:
Image taken from the following tutorial (which is worth reading):
http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html
Now to approximate a precision vs. recall curve for your algorithm, here is what you can do. First, return a large set of, say, n results as ranked by your algorithm. Next, get the user to mark which items they actually liked out of those n results. This trivially gives us enough information to compute the precision for every prefix of k ≤ n results (since we know how many liked items each prefix contains). We can also compute the recall (as restricted to this set of results) by dividing by the total number of items liked by the user in the entire set. Thus, we can plot a precision-recall curve for this data. Now there are fancier statistical techniques for estimating this using less work, but I have already written enough. For more information please check out the links in the body of my answer.
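The procedure above amounts to a few lines of Python. In this sketch `ranked` is the algorithm's ordered list of n results and `liked` is the set of items the user marked; both names, and the function name, are illustrative. Recall is restricted to the returned set, as described.

```python
def precision_recall_curve(ranked, liked):
    """ranked: items in the algorithm's order; liked: set of items the user marked.
    Returns a list of (precision, recall) pairs, one per prefix length k = 1..n."""
    total_liked = sum(1 for item in ranked if item in liked)
    points, hits = [], 0
    for k, item in enumerate(ranked, start=1):
        hits += item in liked          # True counts as 1
        precision = hits / k
        recall = hits / total_liked if total_liked else 0.0
        points.append((precision, recall))
    return points

points = precision_recall_curve(["a", "b", "c", "d"], {"a", "c"})
```

Plotting the second component against the first (for example with any plotting library) gives the precision-recall curve for one user; averaging over users gives an overall picture.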
Your method is biased. If you use the top 5 and bottom 5 results, it is very likely that the user orders them according to your algorithm. Let's say we have an algorithm which rates music, and I present the top 1 and bottom 1 to the user:
Queen
The Cheeky Girls
Of course the user will mark it exactly like your algorithm, because the difference between the top and bottom is so big. You need to make the user rate randomly selected items.
Independently of the question of mixing top and bottom guesses, an implicit drawback of the experimental process, as described, is that the data related to the user's choice can only be exploited in the context of one particular version of the algorithm:
When / if the algorithm or its parameters are ever slightly tuned, the record of past user's choices cannot be reused to validate the changes to the algorithm.
On mixing high and low results:
The main drawback of producing sets of items by mixing the algorithm's top and bottom guesses is that it may further complicate the choice of the error/distance function used to measure how well the algorithm performed. Unless the two subsets of items (topmost choices, bottom most choices) are kept separately for the purpose of computing distinct measurements, typical statistical measures of the error (say RMSE) will not be a good measurement of the effective algorithm's quality.
For example, an algorithm which frequently suggests, as low guesses, items which end up being picked as top choices by the user may have the same average error rate as an algorithm which never confuses highs with lows, but where the user tends to reorder the items more within their subset.
A second drawback is that the algorithm evaluation method may merely qualify its ability of filtering the relative like/dislike of users for items it [the algorithm] chooses rather than its ability of producing the user's actual top choices.
In other words the user's actual top choices may never be offered to him; so yeah the algorithm does a good job at guessing that user will like say Rock-and-Roll before Rap, but never guessing that in fact user prefers Classical Baroque music over all.
