Let's say we have a version of the stable marriage problem with the addition that some pairs are restricted based on certain preferences, e.g. whether or not someone lives in a different city and whether or not that person is okay with long-distance (assume people cannot move). Assuming men proposing, Gale–Shapley produces a men-optimal/women-pessimal solution. Given this, what would be the best way to represent preference lists for each party? One idea I had was that if someone is incompatible, they would be left off the preference list altogether. I suppose that means if a woman received a proposal from a man not on her list, she wouldn't be able to be engaged to him, even if he's fine with being in different cities. However, I'm not sure if this would be allowed by the algorithm (at the end of the day, I'd still like all men to be matched, but this could interfere with the algorithm's guarantee of a stable matching). Obviously if everyone lived in different locations and were unwilling, this would be impossible, but given that a lot of people live in the same few locations in my situation, I'm uncertain as to how likely this would be.
So, instead of having variable-length preference lists, I also thought I could keep constant rankings (all lists have size n), but the top half of the list has location-compatible partners (i.e., either I'm in the same location or I don't mind long-distance, regardless of what the other person wants) and the bottom half of the list has location-incompatible partners (i.e., we are in different locations and I don't want long-distance). Given that men are proposing, this means we can presume they most likely receive a location-compatible matching, but if for example there are a few more women who are picky about location, and they receive their worst valid choice, does this increase the likelihood women will be matched with a location-incompatible partner? If that's the only option then I suppose I have to accept it, but I'm wondering if there's a way to modify the algorithm such that women avoid location-incompatibility as much as possible despite Gale–Shapley being women-pessimal.
Give incompatibility a preference penalty on both sides. So if you match my desires but I'm incompatible with you, give me a preference penalty so I'll look for people compatible with me first.
As you guessed, this will not guarantee no incompatible matchings. But it will make a good faith effort to avoid them.
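The penalty idea can be sketched as follows. This is a minimal illustration, not a definitive implementation: the names `build_prefs`, `gale_shapley`, the score tables and the penalty value are all assumptions made up for the example. Each person ranks candidates by a base score, incompatible candidates are pushed down by a fixed penalty, and the resulting complete lists are fed to ordinary men-proposing Gale–Shapley, so everyone is still matched.

```python
def build_prefs(scores, compatible, penalty=1000):
    """Rank candidates by score, pushing incompatible ones down by `penalty`.

    scores[p][q]     -- how much p likes q (higher is better)
    compatible[p][q] -- True if q is location-compatible from p's point of view
    """
    prefs = {}
    for p, row in scores.items():
        def key(q):
            return row[q] - (0 if compatible[p][q] else penalty)
        prefs[p] = sorted(row, key=key, reverse=True)
    return prefs


def gale_shapley(men_prefs, women_prefs):
    """Men-proposing Gale-Shapley on complete preference lists."""
    rank = {w: {m: i for i, m in enumerate(lst)} for w, lst in women_prefs.items()}
    next_choice = {m: 0 for m in men_prefs}   # index of the next woman to propose to
    engaged_to = {}                           # woman -> man
    free = list(men_prefs)
    while free:
        m = free.pop()
        w = men_prefs[m][next_choice[m]]
        next_choice[m] += 1
        current = engaged_to.get(w)
        if current is None:
            engaged_to[w] = m
        elif rank[w][m] < rank[w][current]:   # w prefers the new proposer
            engaged_to[w] = m
            free.append(current)
        else:
            free.append(m)
    return {m: w for w, m in engaged_to.items()}


# Hypothetical data: A likes X best, but X is location-incompatible for A.
men_scores = {"A": {"X": 10, "Y": 5}, "B": {"X": 9, "Y": 8}}
men_ok = {"A": {"X": False, "Y": True}, "B": {"X": True, "Y": True}}
women_scores = {"X": {"A": 10, "B": 1}, "Y": {"A": 1, "B": 10}}
women_ok = {"X": {"A": True, "B": True}, "Y": {"A": True, "B": True}}

matching = gale_shapley(build_prefs(men_scores, men_ok),
                        build_prefs(women_scores, women_ok))
```

With a penalty larger than the whole score range, this reproduces the "compatible half first, incompatible half last" lists from the question. A smaller penalty lets a very strong preference outweigh incompatibility.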
I'm working on a crowdsourced app that will pit about 64 fictional strongmen/strongwomen from different franchises against one another and try and determine who the strongest is. (Think "Batman vs. Spiderman" writ large). Users will choose the winner of any given matchup between two at a time.
After researching many sorting algorithms, I found this fantastic SO post outlining the ELO rating system, which seems absolutely perfect. I've read up on the system and understand both how to award/subtract points in a matchup and how to calculate the performance rating between any two characters based on past results.
What I can't seem to find is any efficient and sensible way to determine which two characters to pit against one another at a given time. Naturally it will start off randomly, but quickly points will accumulate or degrade. We can expect a lot of disagreement but also, if I design this correctly, a large amount of user participation.
So imagine you arrive at this feature after 50,000 votes have been cast. Given that we can expect all sorts of non-transitive results under the hood, and a fair amount of deviance from the performance ratings, is there a way to calculate which matchups I most need more data on? It doesn't seem as simple as choosing two adjacent characters in a sorted list with the closest scores, or just focusing at the top of the list.
With 64 entrants (and yes, I did consider and reject a bracket!), I'm not worried about recomputing the performance ratings after every matchup. I just don't know how to choose the next one, seeing as we'll be ignorant of each voter's biases and favorite characters.
The amazing variation that you experience with multiplayer games is that different people with different ratings "queue up" at different times.
By the ELO system, ideally all players should be matched up with an available player with the closest score to them. Since, if I understand correctly, the 64 "players" in your game are always available, this combination leads to lack of variety, as optimal match ups will always be, well, optimal.
To resolve this, I suggest implementing a priority queue, based on when your "players" feel like playing again. For example, if one wants to take a long break, they may receive a low priority and be placed towards the end of the queue, meaning it will be a while before you see them again. If one wants to take a short break, maybe after about 10 matches, you'll see them in a match again.
This "desire" can be done randomly, and you can assign different characteristics to each character to skew this behaviour, such as, "winning against a higher ELO player will make it more likely that this player will play again sooner". From a game design perspective, these personalities would make the characters seem more interesting to me, making me want to stick around.
So here you have an ordered list of players who want to play. I can think of three approaches you might take for the actual matchmaking:
Peek at the first 5 players in the queue and pick the best match up
Match the first player with their best match in the next 4 players in the queue (presumably waited the longest so should be queued immediately, regardless of the fairness of the match up)
A combination of both, where if the person at the head of the list doesn't get picked, they'll increase in "entropy", which affects the ELO calculation making them more likely to get matched up
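As a rough sketch of the second approach (match the head of the queue with the best of the next four), assuming a plain `deque` as the queue and a dictionary of current ratings. The function name `pick_match`, the character list and the rating values are made up for illustration; how a character is re-queued after its "break" is left out.

```python
from collections import deque


def pick_match(queue, ratings, window=4):
    """Pair the head of the queue with the closest-rated of the next `window` players."""
    head = queue.popleft()
    candidates = list(queue)[:window]
    opponent = min(candidates, key=lambda p: abs(ratings[p] - ratings[head]))
    queue.remove(opponent)
    return head, opponent


# Hypothetical ratings and queue order.
ratings = {"Batman": 1500, "Spiderman": 1520, "Hulk": 1700, "Thor": 1690, "Superman": 1800}
queue = deque(["Batman", "Hulk", "Thor", "Spiderman", "Superman"])
match = pick_match(queue, ratings)
```

Here the head of the queue always plays, which favours waiting time over match quality. The first approach would instead compare every pair among the first five and might leave the head waiting.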
Edit
From an implementation perspective, I'd recommend using a delta list instead of an actual priority queue, since players should be "promoted" as they wait.
To avoid obvious winner-vs-loser situations, group the players into tiers.
Obviously, initially everybody will be in the same tier [0 - N1].
Then within the tier you make a rotational schedule so each two parties can "match" at least once.
However, if you don't want to maintain a schedule, then always match with the party who has participated in the fewest "matches". If there are several of those, pick one at random.
This way you ensure that everybody participates in roughly the same number of "matches".
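A small sketch of the "fewest matches first" rule, under the assumption that both sides of the matchup are chosen this way (the answer only says to match with the least-played party). The name `next_pair` and the counts are hypothetical.

```python
import random


def next_pair(match_counts, rng=random):
    """Pick the two parties with the fewest matches so far, breaking ties at random."""
    names = list(match_counts)
    rng.shuffle(names)                          # random order first ...
    names.sort(key=lambda n: match_counts[n])   # ... then a stable sort keeps ties random
    a, b = names[:2]
    match_counts[a] += 1
    match_counts[b] += 1
    return a, b


# Hypothetical match counts within one tier.
counts = {"A": 3, "B": 1, "C": 0, "D": 2}
pair = next_pair(counts)
```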
I'm trying to understand how random forest works in plain English instead of mathematics. Can anybody give me a really simple explanation of how this algorithm works?
As far as I understand, we feed the features and labels without telling the algorithm which feature should be classified as which label? As I used to do Naive Bayes which is based on probability we need to tell which feature should be which label. Am I completely far off?
I'd really appreciate any very simple explanation.
RandomForest uses a so-called bagging approach. The idea is based on the classic bias-variance trade-off. Suppose that we have a set of (say N) overfitted estimators that have low bias but high cross-sample variance. Low bias is good and we want to keep it; high variance is bad and we want to reduce it. RandomForest tries to achieve this through so-called bootstrapping/sub-sampling (as @Alexander mentioned, this is a combination of bootstrap sampling on both observations and features). The prediction is the average of the individual estimators, so the low-bias property is preserved. Furthermore, if the estimators were independent, the variance of this average would equal the variance of an individual estimator divided by N (that is, the standard deviation shrinks by a factor of the square root of N). Trees grown on overlapping data are correlated, so the real reduction is smaller, which is why sub-sampling the features to decorrelate the trees matters. The result has both low bias and lower variance, and this is why RandomForest often outperforms a stand-alone estimator.
Adding on to the above two answers: since you asked for a simple explanation, here is a write-up that I feel is the simplest way to explain random forests.
Credits go to Edwin Chen for the simple explanation here in layman terms for random forests. Posting the same below.
Suppose you’re very indecisive, so whenever you want to watch a movie, you ask your friend Willow if she thinks you’ll like it. In order to answer, Willow first needs to figure out what movies you like, so you give her a bunch of movies and tell her whether you liked each one or not (i.e., you give her a labeled training set). Then, when you ask her if she thinks you’ll like movie X or not, she plays a 20 questions-like game with IMDB, asking questions like “Is X a romantic movie?”, “Does Johnny Depp star in X?”, and so on. She asks more informative questions first (i.e., she maximizes the information gain of each question), and gives you a yes/no answer at the end.
Thus, Willow is a decision tree for your movie preferences.
But Willow is only human, so she doesn’t always generalize your preferences very well (i.e., she overfits). In order to get more accurate recommendations, you’d like to ask a bunch of your friends and watch movie X if most of them say they think you’ll like it. That is, instead of asking only Willow, you want to ask Woody, Apple, and Cartman as well, and they vote on whether you’ll like a movie (i.e., you build an ensemble classifier, aka a forest in this case).
Now you don’t want each of your friends to do the same thing and give you the same answer, so you first give each of them slightly different data. After all, you’re not absolutely sure of your preferences yourself – you told Willow you loved Titanic, but maybe you were just happy that day because it was your birthday, so maybe some of your friends shouldn’t use the fact that you liked Titanic in making their recommendations. Or maybe you told her you loved Cinderella, but actually you really really loved it, so some of your friends should give Cinderella more weight. So instead of giving your friends the same data you gave Willow, you give them slightly perturbed versions. You don’t change your love/hate decisions, you just say you love/hate some movies a little more or less (formally, you give each of your friends a bootstrapped version of your original training data). For example, whereas you told Willow that you liked Black Swan and Harry Potter and disliked Avatar, you tell Woody that you liked Black Swan so much you watched it twice, you disliked Avatar, and don’t mention Harry Potter at all.
By using this ensemble, you hope that while each of your friends gives somewhat idiosyncratic recommendations (Willow thinks you like vampire movies more than you do, Woody thinks you like Pixar movies, and Cartman thinks you just hate everything), the errors get canceled out in the majority. Thus, your friends now form a bagged (bootstrap aggregated) forest of your movie preferences.
There’s still one problem with your data, however. While you loved both Titanic and Inception, it wasn’t because you like movies that star Leonardo DiCaprio. Maybe you liked both movies for other reasons. Thus, you don’t want your friends to all base their recommendations on whether Leo is in a movie or not. So when each friend asks IMDB a question, only a random subset of the possible questions is allowed (i.e., when you’re building a decision tree, at each node you use some randomness in selecting the attribute to split on, say by randomly selecting an attribute or by selecting an attribute from a random subset). This means your friends aren’t allowed to ask whether Leonardo DiCaprio is in the movie whenever they want. So whereas previously you injected randomness at the data level, by perturbing your movie preferences slightly, now you’re injecting randomness at the model level, by making your friends ask different questions at different times.
And so your friends now form a random forest.
I will try to give another complementary explanation with simple words.
A random forest is a collection of random decision trees (of number n_estimators in sklearn).
What you need to understand is how to build one random decision tree.
Roughly speaking, to build a random decision tree you start from a subset of your training samples. At each node you will draw randomly a subset of features (number determined by max_features in sklearn). For each of these features you will test different thresholds and see how they split your samples according to a given criterion (generally entropy or gini, criterion parameter in sklearn). Then you will keep the feature and its threshold that best split your data and record it in the node.
When the construction of the tree ends (it can be for different reasons: maximum depth is reached (max_depth in sklearn), minimum sample number is reached (min_samples_leaf in sklearn) etc.) you look at the samples in each leaf and keep the frequency of the labels.
As a result, it is like the tree gives you a partition of your training samples according to meaningful features.
As each node is built from features taken randomly, you understand that each tree built in this way will be different. This contributes to the good compromise between bias and variance, as explained by @Jianxun Li.
Then in testing mode, a test sample will go through each tree, giving you label frequencies for each tree. The most represented label is generally the final classification result.
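The steps above can be sketched in plain Python. This is a deliberately tiny illustration, not how sklearn implements it: every tree is limited to a single node (a "stump"), so the random feature subset is drawn once per tree instead of at every node, and the helper names `gini`, `fit_stump`, `fit_forest` and `predict` as well as the toy data are assumptions. The parameter names `n_estimators` and `max_features` mirror the sklearn ones mentioned above.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of labels (0 means the list is pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def fit_stump(X, y, features):
    """Find the (feature, threshold) among `features` with the lowest weighted Gini."""
    best = None
    for f in features:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t,
                        Counter(left).most_common(1)[0][0],    # label kept in left leaf
                        Counter(right).most_common(1)[0][0])   # label kept in right leaf
    return best


def fit_forest(X, y, n_estimators=25, max_features=1, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    forest = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]        # bootstrap sample of the rows
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        feats = rng.sample(range(d), max_features)        # random subset of the features
        stump = fit_stump(Xb, yb, feats)
        if stump is None:                                 # no usable split: always predict the majority
            majority = Counter(yb).most_common(1)[0][0]
            stump = (0.0, 0, float("inf"), majority, majority)
        forest.append(stump[1:])
    return forest


def predict(forest, row):
    """Send the sample through every tree and return the most represented label."""
    votes = Counter(left if row[f] <= t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]


# Toy data: two numeric features, two labels.
X = [[1, 10], [2, 11], [3, 12], [8, 20], [9, 21], [10, 22]]
y = ["no", "no", "no", "yes", "yes", "yes"]
forest = fit_forest(X, y)
```

Calling `predict(forest, [0, 9])` and `predict(forest, [10, 22])` then returns the majority vote of the 25 stumps for each sample.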
I would like to know the difference between uncertainty and randomness in mathematical terms. I tried to find it but got confused, as some people say they are the same. Can anyone provide the logical reasoning behind that? If they are not the same, please explain why.
Don't get too hung up on it.
People use different words in different situations.
It's not so much that they have different meanings, as that their meanings are situation-dependent.
Randomness is just a fuzzy general term meaning something is random.
In statistics, uncertainty is used to mean that some property of a distribution, such as its mean, is itself unknown but can be given a distribution.
For example, suppose you want to know the average weight of all people.
You could find it out exactly if you could go around to all people, get their weight, add it all up, and divide by the number of people.
But that's too hard to do, so suppose you just pick 10 people at random and get their average weight, and pretend it's the same as the average of everybody.
That's called the sample mean, but you know it isn't accurate.
It has what is called a standard error, meaning it has uncertainty.
In fact, if you were to do that experiment many times over with different people, you would get a different sample mean every time, and those sample means would themselves form a bell-shaped distribution, the standard deviation of which would be called the standard error, representing its uncertainty.
In general, if you increased the number of people you look at by a factor of 100, you can reduce the standard error, the uncertainty, by a factor of 10.
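The factor-of-10 claim follows from the standard error of the mean being the population standard deviation divided by the square root of the sample size. A small sketch that checks this both by formula and by repeating the "weigh n random people" experiment; the mean of 70 and standard deviation of 15 are made-up values, and the function names are hypothetical.

```python
import random
import statistics


def standard_error(sigma, n):
    """Standard error of the mean of n independent draws with standard deviation sigma."""
    return sigma / n ** 0.5


def simulated_standard_error(n, trials=1000, mu=70.0, sigma=15.0, seed=0):
    """Repeat the experiment `trials` times and measure the spread of the sample means."""
    rng = random.Random(seed)
    means = [sum(rng.gauss(mu, sigma) for _ in range(n)) / n for _ in range(trials)]
    return statistics.stdev(means)


# 100 times more people should give roughly 10 times less uncertainty.
formula_ratio = standard_error(15.0, 10) / standard_error(15.0, 1000)
simulated_ratio = simulated_standard_error(10) / simulated_standard_error(1000)
```

The simulated ratio only approximates 10, because the spread of the sample means is itself estimated from a finite number of trials.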
I bet you can tell that people who take polls for a living care about this stuff very much.
EDIT for the downvoter: In case the downvote is because this doesn't look like a stackoverflow question or answer,
I've made a point of advocating the random pausing method of profiling.
Profiling in large part is perceived to be about measuring (statistically) the time that programming constructs are responsible for.
Often people are inhibited from using that method because they are afraid the results have too much uncertainty.
This post gets very specific about what that uncertainty actually is.
It shows that the bogey-man fear of uncertainty has the effect of preventing people from finding really substantial speedups in their code.
So naiveté about statistics is definitely a serious programming problem.
My view looks at a scenario using three different coloured balls:
I love some of the answers given here. My own view, based on my current research, is that these are two distinct terms. Uncertainty refers to not knowing in advance which ball could be selected when a person, for instance, is given a chance to select one ball from three different coloured balls.
This remains true when each ball has an equal chance of being selected, i.e. equal probabilities. However, things soon get complex when each ball has its own distinct probability. Chances are that the one with the highest probability will be selected. This seems especially true in algorithm development, where one would almost always select the highest probability, compromising the meaning of randomness.
Having said all of this - I believe these concepts remain confusing which has just made me realise the time I need to dedicate on clearly distinguishing between the two to make sure my current research is not confusing. My own predicament is that I need to work on stochastic vs deterministic views. Based on the current view stochastic would be more uncertain than random whereas deterministic would be more probability based i.e. knowing for certain that the highest probability would be chosen; but this seems very far from the truth.
It seems as if uncertainty holds until just before a ball is selected/touched and loses its meaning as soon as the ball is picked, which should result in its probability being revised. I personally think the terms have theoretical differences, which perhaps still allow them to be used interchangeably.
Uncertainty in math and science typically means there is a lack of facts, or the facts are unobtainable. Weather forecasting is a great example of uncertainty.
Randomness has many definitions. Commonly it's used in probability / statistics as a measure or quantification of uncertainty. So in my weather example, a 30% chance of rain is a measure of uncertainty. The more general definition (which also applies to math / science) is unpredictable, or lack of order.
There is definitely a fuzzy distinction between the two.
According to the Bayesian interpretation of probability, uncertainty and randomness are just two names for the same thing.
If an experiment is random, then it is uncertain to you. If something is uncertain to you, then it has the randomness property.
I was wondering if there's an efficient and easy way to determine waves in MQL4, just like zigzag indicator does it.
I was asked to help automate indicator, for that I need to determine 'waves', essentially max and min of a graph over some period of time (which is vague and all relative).
I don't have a clear image of how I want an indicator to work, but it would be something like that:
Find the last wave, i.e. where the direction of price last changed (neglecting the noise), and then for example reflect it with a trend line.
Is it possible to use the zigzag structure to find that point where the direction changed? (Possibly not the only one; I might need to find more than just the last point, e.g. the preceding one too. So I will want to adapt the algorithm.)
I know it's a while since you asked this question and you probably already have an answer, but if not...
I dislike Zigzag and have not found a way to do what I want to do with it, so I will answer the last part of your question with no, and believe me I tried.
The way I prefer it is to find bars that conform to the classic definition of fractals/swing points (i.e. a high with two lower highs on either side, or a low with two higher lows on either side), then try to make up for the shortcomings. E.g. Often there will be two high fractals/swings/waves in a row without an intermediate low fractal/swing/wave. So I add the best intermediate low point as a wave, or remove one of the highs (E.g. if the first one wasn't as subjectively significant). Some of the swing points that are identified are 'noisy', to use your term, and not ones that a human trader would have picked. So these need to be dealt with and so on. If you go down this route it is a long one, computers make many mistakes identifying appropriate swing points, so unfortunately not what I would call easy, but it is accurate, and how many easy indicators are there that actually make money over the long run?
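The fractal/swing-point definition above can be sketched like this. It is written in Python for illustration (not MQL4), operates on plain lists of bar highs and lows, and the names `find_swings` and `alternate` are hypothetical. The clean-up step shown is only one of the options mentioned: when two swings of the same kind occur in a row, it keeps the more extreme one instead of inserting an intermediate swing.

```python
def find_swings(highs, lows, wing=2):
    """Return (index, 'high' | 'low') for bars that are fractal swing points.

    A swing high is a bar whose high is strictly above the highs of the `wing`
    bars on each side; a swing low is the mirror image on the lows.
    """
    swings = []
    for i in range(wing, len(highs) - wing):
        neighbours = [j for j in range(i - wing, i + wing + 1) if j != i]
        if all(highs[i] > highs[j] for j in neighbours):
            swings.append((i, "high"))
        elif all(lows[i] < lows[j] for j in neighbours):
            swings.append((i, "low"))
    return swings


def alternate(swings, highs, lows):
    """Collapse runs of same-kind swings, keeping the most extreme bar of each run."""
    cleaned = []
    for i, kind in swings:
        if cleaned and cleaned[-1][1] == kind:
            j = cleaned[-1][0]
            better = highs[i] > highs[j] if kind == "high" else lows[i] < lows[j]
            if better:
                cleaned[-1] = (i, kind)
        else:
            cleaned.append((i, kind))
    return cleaned


# Made-up bar data, oldest bar first.
highs = [1, 2, 5, 3, 2, 3, 6, 4, 3, 2, 1, 2, 3]
lows = [0, 1, 4, 2, 1, 2, 5, 3, 2, 1, 0, 1, 2]
swings = alternate(find_swings(highs, lows), highs, lows)
last_wave = swings[-1]   # most recent confirmed change of direction
```

Note that a swing is only confirmed `wing` bars after it happens, so the "last wave" always lags the current price by at least that many bars.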
I'd just like someone to verify whether the following problem is NP-complete or if there is actually a better/easier solution to it than simple brute-force combination checking.
We have a sort-of resource allocation problem in our software, and I'll explain it with an example.
Let's say we need 4 people to be at work during the day-shift. This number, and the fact that it is a "day-shift" is recorded in our database.
However, we don't require just anyone to fill those spots; there are some requirements that need to be met in order to fit the bill.
Of those 4, let's say 2 of them has to be a nurse, and 1 of them has to be doctors.
One of the doctors also has to work as part of a particular team.
So we have this set of information:
Day-shift: 4
1 doctor
1 doctor, need to work in team A
1 nurse
The above is not the problem. The problem comes when we start picking people to work the day-shift and trying to figure out if the people we've picked so far can actually fill the criteria.
For instance, let's say we pick James, John, Ursula and Mary to work, where James and Ursula are doctors, John and Mary are nurses.
Ursula also works in team A.
Now, depending on the order we try to fit the bill, we might end up deducing that we have the right people, or not, unless we start trying different combinations.
For instance, if we go down the list and pick Ursula first, we could match her with the "1 doctor" criterion. Then we get to James, and we notice that since he doesn't work in team A, the other criterion, "1 doctor, need to work in team A", can't be filled with him. Since the other two people are nurses, they won't fit that criterion either.
So we backtrack and try James first, and he too can fit the first criteria, and then Ursula can fit the criteria that needs that team.
So it looks to us as though we need to try different combinations until we've either found one that works or tried them all, in which case some criteria remain unfilled even if the total number of heads working equals the total number of heads needed.
Is this the only solution, can anyone think of a better one?
Edit: Some clarification.
Comments on this question mention that with this few people we should go with brute force, and I agree; that's probably what we could do, and we might even do that, in the same vein as some sort implementations that look at the size of the data and pick a different sort algorithm with less initial overhead if the data size is small.
The problem though is that this is part of a roster planning system, in which you might have quite a few number of people involved, both as "We need X people on the day shift" as well as "We have this pool of Y people that will be doing it", as well as potential for a large "We have this list of Z criteria for those X people that will have to somehow match up with these Y people", and then you add to the fact that we will have a number of days to do the same calculation for, in real-time, as the leader adjusts the roster, and then the need for a speedy solution has come up.
Basically, the leader will see live summary information on screen that says how many people are still missing on the day-shift as a whole, how many people fit the various criteria, and how many people we actually need in addition to the ones we have. This display will have to update semi-live while the leader adjusts the roster with "What if James takes the day-shift instead of Ursula, and Ursula takes the night-shift".
But huge thanks to the people who have answered this so far. The constraint satisfaction problem sounds like the way we need to go, but we'll definitely look hard at all the links and algorithm names here.
This is why I love StackOverflow :)
What you have there is a constraint satisfaction problem. Their relationship to NP is interesting: the general problem is NP-complete, but many specific classes of instances are not, i.e. they can be solved in polynomial time.
As ebo noted in comments, your situation sounds like it can be formulated as an exact cover problem, which you can apply Knuth's Algorithm X to. If you take this tack, please let us know how it works out for you.
It does look like you have a constraint satisfaction problem.
In your case I would particularly look at constraint propagation techniques first -- you may be able to reduce the problem to a manageable size that way.
What happens if no one fits the criteria?
What you are describing is the 'Roommate Problem'; it is lightly described in this thesis.
Bear with me, I'm searching for better links.
EDIT
Here's another fairly dense thesis.
As for me, I would most likely try to find a reduction to the bipartite graph matching problem. Also, proving that a problem is NP-complete is usually much more complicated than stating that you cannot find a polynomial solution.
I am not sure your problem is NP-hard; it does not smell that way. What I would do if I were you is order the requirements for the positions so that you try to fill the most specific ones first: fewer people will be available to fill these positions, so you are less likely to have to backtrack a lot. There is no reason why you should not combine this with algorithm X, an algorithm of pure Knuth-ness.
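A sketch of that reduction, assuming every criterion can be written as a yes/no test on a single person: criteria are one side of the bipartite graph, people are the other, and an edge means "this person fits this criterion". The augmenting-path search below is the standard technique for bipartite matching (often attributed to Kuhn); the function name `can_fill` and the attribute layout are made up for the example, while the people and criteria are the ones from the question.

```python
def can_fill(slots, people):
    """Return a slot -> person assignment covering every slot, or None.

    slots  -- list of predicates; slots[i](attrs) is True if a person fits slot i
    people -- dict mapping a name to that person's attributes
    """
    names = list(people)
    owner = {}  # person name -> index of the slot they currently fill

    def try_assign(slot, seen):
        for name in names:
            if name in seen or not slots[slot](people[name]):
                continue
            seen.add(name)
            # Take a free person, or move the current holder to another slot
            # if that is possible (this is the augmenting path).
            if name not in owner or try_assign(owner[name], seen):
                owner[name] = slot
                return True
        return False

    for slot in range(len(slots)):
        if not try_assign(slot, set()):
            return None
    return {slot: name for name, slot in owner.items()}


people = {
    "Ursula": {"role": "doctor", "team": "A"},
    "James": {"role": "doctor", "team": None},
    "John": {"role": "nurse", "team": None},
    "Mary": {"role": "nurse", "team": None},
}
slots = [
    lambda p: p["role"] == "doctor",                       # 1 doctor
    lambda p: p["role"] == "doctor" and p["team"] == "A",  # 1 doctor, team A
    lambda p: p["role"] == "nurse",                        # 1 nurse
]
assignment = can_fill(slots, people)
```

Ursula is first given the plain "doctor" slot and is then moved to the team A slot when James can take over, which is the backtracking from the question done systematically. The running time is polynomial in the number of slots and people, so under the stated assumption the check does not need brute force.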
I'll leave the theory to others, since my mathematical savvy is not so great, but you may find a tool like Cassowary/Cassowary.net or NSolver useful to represent your problem declaratively as a constraint satisfaction problem and then solve the constraints.
In such tools, the simplex method combined with constraint propagation is frequently employed to deterministically reduce the solution space and then find an optimal solution given a cost function. For larger solution spaces (which don't seem to apply in the size of problem you specify), occasionally genetic algorithms are employed.
If I remember correctly, NSolver also includes in sample code a simplification of an actual Nurse-rostering problem that Dr. Chun worked on in Hong Kong. And there's a paper on the work he did.
It sounds to me like you have a couple of separable problems that would be a lot easier to solve:
-- select one doctor from team A
-- select another doctor from any team
-- select two nurses
So you have three independent problems.
A clarification though, do you have to have two doctors (one from the specified team) and two nurses, or one doctor from the specified team, two nurses, and one other that can be either doctor or nurse?
Some Questions:
Is the goal to satisfy the constraints exactly, or only approximately (but as much as possible)?
Can a person be a member of several teams?
What are all possible constraints? (For example, could we need a doctor which is a member of several teams?)
If you want to satisfy the constraints exactly, then I would order the constraints by decreasing strictness; that is, the ones which are hardest to satisfy (e.g. doctor AND team A in your example above) should be checked first!
If you want to satisfy the constraints approximately, then it's a different story... you would have to specify some kind of weighting/importance function which determines what we would rather have when we can't match exactly and have several possibilities to choose from.
If you have several or many constraints, take a look at Drools Planner (open source, java).
Brute force, branch and bound and similar techniques take too long. Deterministic algorithms such as "fill the largest shifts first" are very suboptimal. Meta-heuristics are a very good way to deal with this.
Take a specific look at the real-world nurse rostering example of Drools Planner. It's easy to add many constraints, such as "young nurses don't want to work the Saturday night" or "some nurses don't want to work too many days in a row".