Determine the "difficulty" of quiz with multiple weights? - algorithm

I'm trying to determine the "difficulty" of a quiz object.
My ultimate goal is to be able to create a "difficulty score" (DS) for any quiz. This would allow me to compare one quiz to another accurately, despite being made up of different questions/answers.
When creating my quiz object, I assign each question a "difficulty index" (DI), which is a number on a scale from 1-15.
15 = most difficult
1 = least difficult
Now a straightforward way to measure this "difficulty score" could be to add up each question's "difficulty index" and then divide by the maximum possible "difficulty index" for the quiz. (ex. 16/30 = 53.3% difficulty)
However, I also have multiple "weighting" properties associated with each question. These weights are again on a scale of 1-5.
5 = most impact
1 = least impact
The reason I have two weights instead of the more common one is so I can accommodate a scenario like the following...
If presenting the student with a very difficult question (DI=15) and the student answers "incorrect", don't have it hurt their score so much BUT if they get it "correct" have it improve their score greatly. I call these my "positive" (PW) and "negative" (NW) weights.
Quiz Example A:
Question 1: DI = 1 | PW = 3 | NW = 3
Question 2: DI = 1 | PW = 3 | NW = 3
Question 3: DI = 1 | PW = 3 | NW = 3
Question 4: DI = 15 | PW = 5 | NW = 1
Quiz Example B:
Question 1: DI = 1 | PW = 3 | NW = 3
Question 2: DI = 1 | PW = 3 | NW = 3
Question 3: DI = 1 | PW = 3 | NW = 3
Question 4: DI = 15 | PW = 1 | NW = 5
Technically the above two quizzes are very similar BUT Quiz B should be more "difficult" because the hardest question will have the greatest impact on your score if you get it wrong.
My question now becomes how can I accurately determine the "difficulty score" when considering the complex weighting system?
Any help is greatly appreciated!

The challenge of course is to determine the difficulty score for each single question.
I suggest the following model:
Hardness (H): Define a hard question as one whose chances of being answered correctly are lower. The hardest question is one where (1) the chance of answering it correctly equals that of random choice (because it is inherently very hard), and (2) it has the largest number of possible answers. We'll define such a question as H = 15. On the other end of the scale, we'll define H = 0 for a question where the chance of answering it correctly is 100% (because it is trivial; I know such a question will never appear). Now define the hardness of each question by subjective extrapolation (remember that one can always guess among the given options). For example, if an H = 15 question has 4 answers, and another question with similar inherent hardness has 2 answers, the latter would be H = 7.5. Another example: if you believe that an average student has a 62.5% chance of answering a question correctly, it would also be an H = 7.5 question (this is because an H = 15 question with 4 answers has a 25% chance of a correct answer, while H = 0 has 100%; the average is 62.5%).
Effect (E): Now, we'll measure the effect of PW and NW. For questions with a 50% chance of being answered correctly, the effect is E = 0.5*PW - 0.5*NW. For questions with a 25% chance of being answered correctly, the effect is E = 0.25*PW - 0.75*NW. For a trivial question, NW doesn't matter, so the effect is E = PW.
Difficulty (DI): The last step is to integrate the hardness and the effect - and call it difficulty. I suggest DI = H - c*E, where c is some positive constant. You may want to normalize again.
Edit: Alternatively, you may try the following formula: DI = H * (1 - c*E), where the effect magnitude is not absolute, but relative to the question's hardness.
Clarification:
The teacher needs to estimate only one parameter about each question: What is the probability that an average student would answer this question correctly. His estimation, e, will be in the range [1/k, 1], where k is the number of answers.
The hardness, H, is a linear function of e such that 1/k is mapped to 15 and 1 is mapped to 0. The function is: H = 15 * k / (k-1) * (1-e)
The effect E depends on e, PW and NW. The formula is E = e*PW - (1-e)*NW
Example based on OP comments:
Question 1:
k = 4, e = 0.25 (hardest). Therefore H = 15
PW = 1, NW = 5, e = 0.25. Therefore E = 0.25*1 - 0.75*5 = -3.5
c = 5. DI = 15 - 5*(-3.5) = 32.5
Question 2:
k = 4, e = 0.95 (very easy). Therefore H = 1
PW = 1, NW = 5, e = 0.95. Therefore E = 0.95*1 - 0.05*5 = 0.7
c = 5. DI = 1 - 5*(0.7) = -2.5
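For reference, here is a minimal Python sketch of the model above (the function names and the choice c = 5 are just taken from this worked example; everything else follows the formulas for H, E and DI):

def hardness(e, k, h_max=15):
    # Map the estimated success probability e (in [1/k, 1]) to hardness:
    # e = 1/k gives h_max, e = 1 gives 0.
    return h_max * k / (k - 1) * (1 - e)

def effect(e, pw, nw):
    # Expected impact of the question: chance of gaining PW minus chance of losing NW.
    return e * pw - (1 - e) * nw

def difficulty(e, k, pw, nw, c=5):
    # DI = H - c*E, with c a tunable positive constant.
    return hardness(e, k) - c * effect(e, pw, nw)

print(difficulty(e=0.25, k=4, pw=1, nw=5))  # 15 - 5*(-3.5) = 32.5
print(difficulty(e=0.95, k=4, pw=1, nw=5))  # 1 - 5*0.7 = -2.5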

I'd say the core of the problem is that mathematically your example quizzes A and B are identical, except that quiz A awards the student 4 gratuitous bonus points (or, equivalently, quiz B arbitrarily takes 4 points away from them). If the same students take both of them, the score distribution will be the same, except shifted by 4 points. So while the two quizzes may feel different psychologically (because, let's face it, getting extra points feels good, and losing points feels bad, even if you technically did nothing to deserve it), finding an objective way to distinguish them seems tricky.
That said, one reasonable measure of "psychological difficulty" could simply be the average score (per question) that a randomly chosen student would be expected to get from the quiz. Of course, that's not something you can reliably calculate in advance, although you could estimate it from actual quiz results after the fact.
However, if you could somehow relate your (presumably arbitrary) difficulty ratings to the fraction of students likely to answer the question correctly, then you could use that to estimate the expected average score. So, for example, we might simply assume a linear relationship between question difficulty and success rate, with difficulty 1 corresponding to a 100% expected success rate, and difficulty 15 corresponding to a 0% expected success rate. Then the expected average score S per question for the quiz could be calculated as:
S = avg(PW × X − NW × (1 − X))
where the average is taken over all questions in the quiz, PW and NW are the point weights for a correct and an incorrect answer respectively, DI is the difficulty rating for the question, and X = (15 − DI) / 14 is the estimated success rate.
Of course, we might want to also account for the fact that, even if a student doesn't know the answer to a question, they can still guess. Basically this means that the estimated success rate X should not range from 0 to 1, but from 1/N to 1, where N is the number of options for the question. So, taking that into account, we can adjust the formula for X to be:
X = (1 + (N − 1) × (15 − DI) / 14) / N
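As a rough illustration, here is a short Python sketch of S under these assumptions (the per-question tuples and the choice of 4 options per question are mine, just to make it concrete):

def expected_average_score(questions):
    # questions: list of (DI, PW, NW, N) tuples.
    total = 0.0
    for di, pw, nw, n in questions:
        x = (1 + (n - 1) * (15 - di) / 14) / n   # estimated success rate in [1/N, 1]
        total += pw * x - nw * (1 - x)           # expected points for this question
    return total / len(questions)

# Quiz B from the question, assuming 4 options per question:
quiz_b = [(1, 3, 3, 4), (1, 3, 3, 4), (1, 3, 3, 4), (15, 1, 5, 4)]
print(expected_average_score(quiz_b))  # 1.375; Quiz A (PW/NW swapped on Q4) scores higher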
One problem with this estimated average score S as a difficulty measure is that it isn't bounded in either direction, and provides no obvious scale to indicate what counts as an "easy" quiz or a "hard" one. The fundamental problem here is that you haven't specified any limits for the question weights, so there's technically nothing to stop someone from making a question with, say, a positive or negative weight of one million points.
That said, if you do impose some reasonable limits on the weights (even if they're only recommendations), then you should be able to also establish reasonable thresholds on S for a quiz to be considered e.g. easy, moderate or hard. And even if you don't, you can still at least use it to rank quizzes relative to each other by difficulty.
Ps. One way to present the expected score in a UI might be to multiply it by the number of questions in the quiz, and display the result as "par" for the quiz. That way, students could roughly judge their own performance against the difficulty of the quiz by seeing whether they scored above or below par.

Related

Is there a bug in Algorithm 4.3.1D of Knuth's TAOCP?

You can find the explanation of Algorithm 4.3.1D, as it appears in the book The Art of Computer Programming, Vol. 2 (pages 272-273) by D. Knuth, in the appendix of this question.
It appears that, in the step D.6, qhat is expected to be off by one at most.
Let's assume the base is 2^32 (i.e., we are working with unsigned 32-bit digits). Let u = [238157824, 2354839552, 2143027200, 0] and v = [3321757696, 2254962688]. The expected output of this division is 4081766756.
Both u and v are already normalized as described in D.1 (v[1] > b / 2 and u is zero-padded).
The first iteration of the loop D.3 through D.7 is a no-op because qhat = floor((0 * b + 2143027200) / 2254962688) = 0 in the first iteration.
In the second iteration of the loop, qhat = floor((2143027200 * b + 2354839552) / 2254962688) = 4081766758.
We don't need to calculate steps D.4 and D.5 to see why this is a problem. Since qhat will be decreased by one in D.6, the result of the algorithm will come out as 4081766758 - 1 = 4081766757; however, the result should be 4081766756.
Am I right to think that there is a bug in the algorithm, or is there a fallacy in my reasoning?
Appendix
There is no bug; you're ignoring the correction loop inside Step D3.
In your example, as a result of this test, the value of q̂, which was initially set to 4081766758, is decreased two times, first to 4081766757 and then to 4081766756, before proceeding to Step D4.
(Sorry I did not have the time to make a more detailed / “proper” answer.)
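To make this concrete, here is a small Python sketch of just the q̂ estimate and the correction test from step D3, run on the second iteration of the example above (the digit values are the ones from the question; this is not a full implementation of Algorithm D):

b = 2**32
v1, v0 = 2254962688, 3321757696                   # divisor digits, v1 most significant
u2, u1, u0 = 2143027200, 2354839552, 238157824    # leading digits of the current remainder

qhat, rhat = divmod(u2 * b + u1, v1)              # initial estimate from the top two digits
print(qhat)                                       # 4081766758, as computed in the question

# Correction test inside D3: decrease qhat while it is too big
while qhat >= b or qhat * v0 > rhat * b + u0:
    qhat -= 1
    rhat += v1
    if rhat >= b:                                 # Knuth repeats the test only while rhat < b
        break
print(qhat)                                       # 4081766756, so step D6 is not needed here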

Best way to distribute a given resource (eg. budget) for optimal output

I am trying to find a solution in which a given resource (e.g. budget) will be best distributed among different options, each of which yields different results for the resource provided.
Let's say I have N = 1200 and some functions. (a, b, c, d are some unknown variables)
f1(x) = a * x
f2(x) = b * x^c
f3(x) = a*x + b*x^2 + c*x^3
f4(x) = d^x
f5(x) = log x^d
...
Also, let's say there are n of these functions, each yielding different results based on its input x, where x = 0 or x >= m, with m a constant.
Although I am not able to find the exact formulas for the given functions, I am able to find their outputs. This means that I can compute:
X = f1(N1) + f2(N2) + f3(N3) + ... + fn(Nn), where (N1 + ... + Nn) = N, as many times as there are ways of distributing N into n numbers, and find the specific case where X is greatest.
How would I actually go about finding the best distribution of N with the least computation power, using whatever libraries currently available?
If you are happy with allocations constrained to be whole numbers, then there is a dynamic programming solution of cost O(Nn) - so you can increase accuracy by scaling if you want, but this will increase CPU time.
For each i=1 to n maintain an array where element j gives the maximum yield using only the first i functions giving them a total allowance of j.
For i=1 this is simply the result of f1().
For i=k+1, when working out the result for j, consider each possible way of splitting j units between f_{k+1}() and the first k functions, using the table that tells you the best return from a distribution among the first k functions - so you can calculate the table for i=k+1 from the table created for i=k.
At the end you get the best possible return for n functions and N resources. It is easier to find out what that best answer is if you maintain a set of arrays telling you the best way to distribute k units among the first i functions, for all possible values of i and k. Then you can look up the best allocation for f100(), subtract the value this allocated to f100() from N, look up the best allocation for f99() given the resulting resources, and carry on like this until you have worked out the best allocations for all f().
As an example suppose f1(x) = 2x, f2(x) = x^2 and f3(x) = 3 if x>0 and 0 otherwise. Suppose we have 3 units of resource.
The first table is just f1(x) which is 0, 2, 4, 6 for 0,1,2,3 units.
The second table is the best you can do using f1(x) and f2(x) for 0,1,2,3 units and is 0, 2, 4, 9, switching from f1 to f2 at x=2.
The third table is 0, 3, 5, 9. I can get 3 and 5 by using 1 unit for f3() and the rest for the best solution in the second table. 9 is simply the best solution in the second table - there is no better solution using 3 resources that gives any of them to f3().
So 9 is the best answer here. One way to work out how to get there is to keep the tables around and recalculate that answer. 9 comes from f3(0) + 9 from the second table, so all 3 units are available to f2() + f1(). The second table's 9 comes from f2(3), so there are no units left for f1(), and we get f1(0) + f2(3) + f3(0).
When you are working out the resources to use at stage i=k+1, you have a table from i=k that tells you exactly the result to expect from the resources you have left over after you have decided how many to use at stage i=k+1. The best distribution does not become incorrect, because at stage i=k you have worked out the result of the best distribution for every possible number of remaining resources.
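Here is a minimal Python sketch of this table-building idea, using the example functions above (integer allocations only; the back-tracking step re-derives each choice from the stored tables, as described above):

def best_allocation(funcs, n_units):
    # best[i][j] = best yield using only the first i functions with j units in total
    best = [[0] * (n_units + 1) for _ in range(len(funcs) + 1)]
    for i, f in enumerate(funcs, start=1):
        for j in range(n_units + 1):
            best[i][j] = max(f(x) + best[i - 1][j - x] for x in range(j + 1))
    # Walk back through the tables to recover how much each function received
    alloc, j = [0] * len(funcs), n_units
    for i in range(len(funcs), 0, -1):
        f = funcs[i - 1]
        x = max(range(j + 1), key=lambda x: f(x) + best[i - 1][j - x])
        alloc[i - 1] = x
        j -= x
    return best[len(funcs)][n_units], alloc

f1 = lambda x: 2 * x
f2 = lambda x: x * x
f3 = lambda x: 3 if x > 0 else 0
print(best_allocation([f1, f2, f3], 3))  # (9, [0, 3, 0]): f1(0) + f2(3) + f3(0)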

Algorithm to calculate sum of points for groups with varying member count [closed]

Let's start with an example. In Harry Potter, Hogwarts has 4 houses with students sorted into each house. The same happens on my website, and I don't know how many users are in each house. It could be 20 in one house, 50 in another, and 100 in the third and fourth.
Now, each student can earn points on the website and at the end of the year, the house with the most points will win.
But it's not fair to "only" do a sum of the points, as the house with 100 students will have a much higher chance to win, since they have more users to earn points. So I need to come up with an algorithm which is fair.
You can see an example here: https://worldofpotter.dk/points
What I do now is to sum all the points for a house, and then divide it by the number of users who have earned more than 10 points. This is still not fair, though.
Any ideas on how to make this calculation more fair?
Things we need to take into account:
* The percent of users earning points in each house
* Few users earning LOTS of points
* Many users earning FEW points (It's not bad earning few points. It still counts towards the total points of the house)
Link to MySQL dump(with users, houses and points): https://worldofpotter.dk/wop_points_example.sql
Link to CSV of points only: https://worldofpotter.dk/points.csv
I'd use something like Discounted Cumulative Gain which is used for measuring the effectiveness of search engines.
The concept is as follows:
FUNCTION evalHouseScore (0_INDEXED_SORTED_ARRAY scores):
    score = 0;
    FOR (int i = 0; i < scores.length; i++):
        score += scores[i]/log2(i+2);   // i+2 so the top ranks don't hit log2(0) or log2(1)
    END_FOR
    RETURN score;
END_FUNCTION;
This must be modified somehow, as this way of measuring focuses on the first result. As this is subjective, you should decide how you want to modify it. Below I'll post the code with some constants which you should try with different values:
FUNCTION evalHouseScore (0_INDEXED_SORTED_ARRAY scores):
    score = 0;
    FOR (int i = 0; i < scores.length; i++):
        score += scores[i]/log2(i+K);
    END_FOR
    RETURN L*score;
END_FUNCTION
Consider changing the logarithm.
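Here is a runnable Python version of the function above, as I understand it (K and L are the tunable constants already mentioned; the defaults here are only placeholders to experiment with):

import math

def eval_house_score(scores, K=2, L=1.0):
    # scores: individual point totals for one house, sorted from highest to lowest.
    # K >= 2 keeps log2(i + K) positive even for the top-ranked entries.
    return L * sum(s / math.log2(i + K) for i, s in enumerate(scores))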
Tests:
int[] g = new int[] {758,294,266,166,157,132,129,116,111,88,83,74,62,60,60,52,43,40,28,26,25,24,18,18,17,15,15,15,14,14,12,10,9,5,5,4,4,4,4,3,3,3,2,1,1,1,1,1};
int[] s = new int[] {612,324,301,273,201,182,176,139,130,121,119,114,113,113,106,86,77,76,65,62,60,58,57,54,54,42,42,40,36,35,34,29,28,23,22,19,17,16,14,14,13,11,11,9,9,8,8,7,7,7,6,4,4,3,3,3,3,2,2,2,2,2,2,2,1,1,1};
int[] h = new int[] {813,676,430,382,360,323,265,235,192,170,107,103,80,70,60,57,43,41,21,17,15,15,12,10,9,9,9,8,8,6,6,6,4,4,4,3,2,2,2,1,1,1};
int[] r = new int[] {1398,1009,443,339,242,215,210,205,177,168,164,144,144,92,85,82,71,61,58,47,44,33,21,19,18,17,12,11,11,9,8,7,7,6,5,4,3,3,3,3,2,2,2,1,1,1,1};
The output is for different offsets:
1182
1543
1847
2286
904
1231
1421
1735
813
1120
1272
1557
It sounds like some sort of constraint between the houses may need to be introduced. I might suggest finding the person that earned the most points out of all the houses and using that score as the denominator when rolling up the scores. This guarantees that the maximum value of a user's contribution is 1; then all the scores for a house can be summed and divided by the number of users to normalize the house's score. That should give you a reasonable comparison.

It does introduce issues with houses that have a low number of users who are high achievers, so you may want to consider a lower limit on the number of house members. Another technique may be to introduce handicap scores for users to balance the scales.

The algorithm will most likely flex over time based on the data you receive, and to keep it fair it will take some responsive action after the initial iteration, since players can come up with some creative ways to make scoring systems work for them. Here is some pseudo-code in PHP that you may use:
<?php
$mostPointsEarned; // Find the user that earned the most points
$houseScores = [];
foreach ($houses as $house) {
    $numberOfUsers = 0;
    $normalizedScores = [];
    foreach ($house->getUsers() as $user) {
        $normalizedScores[] = $user->getPoints() / $mostPointsEarned;
        $numberOfUsers++;
    }
    $houseScores[] = array_sum($normalizedScores) / $numberOfUsers;
}
var_dump($houseScores);
You haven't given any examples of what the preferred state should be, or of the situations against which you want to be immune (3,2,1,1 compared to 5,2, etc.).
It's also a pity you haven't provided the dataset in a form that's easy to play with.
scala> val input = Map( // as seen on 2016-09-09 14:10 UTC on https://worldofpotter.dk/points
'G' -> Seq(758,294,266,166,157,132,129,116,111,88,83,74,62,60,60,52,43,40,28,26,25,24,18,18,17,15,15,15,14,14,12,10,9,5,5,4,4,4,4,3,3,3,2,1,1,1,1,1),
'S' -> Seq(612,324,301,273,201,182,176,139,130,121,119,114,113,113,106,86,77,76,65,62,60,58,57,54,54,42,42,40,36,35,34,29,28,23,22,19,17,16,14,14,13,11,11,9,9,8,8,7,7,7,6,4,4,3,3,3,3,2,2,2,2,2,2,2,1,1,1),
'H' -> Seq(813,676,430,382,360,323,265,235,192,170,107,103,80,70,60,57,43,41,21,17,15,15,12,10,9,9,9,8,8,6,6,6,4,4,4,3,2,2,2,1,1,1),
'R' -> Seq(1398,1009,443,339,242,215,210,205,177,168,164,144,144,92,85,82,71,61,58,47,44,33,21,19,18,17,12,11,11,9,8,7,7,6,5,4,3,3,3,3,2,2,2,1,1,1,1)
) // and the results on the website were: 1. R 1951, 2. H 1859, 3. S 990, 4. G 954
Here is what I thought of:
def singleValuedScore(individualScores: Seq[Int]) = individualScores
.sortBy(-_) // sort from most to least
.zipWithIndex // add indices e.g. (best, 0), (2nd best, 1), ...
.map { case (score, index) => score * (1 + index) } // here is the 'logic'
.max
input.mapValues(singleValuedScore)
res: scala.collection.immutable.Map[Char,Int] =
Map(G -> 1044,
S -> 1590,
H -> 1968,
R -> 2018)
The overall positions would be:
Ravenclaw with 2018 aggregated points
Hufflepuff with 1968
Slytherin with 1590
Gryffindor with 1044
Which corresponds to the ordering on that web: 1. R 1951, 2. H 1859, 3. S 990, 4. G 954.
The algorithm's output is the maximal product of a user's score and that user's rank within the house.
This measure is not affected by "long-tail" of users having low score compared to the active ones.
There are no hand-set cutoffs or thresholds.
You could experiment with the rank attribution (score * index or score * Math.sqrt(index) or score / Math.log(index + 1) ...)
I take it that the fair measure is the number of points divided by the number of house members. Since you have the number of points, the exercise boils down to estimate the number of members.
We are in short supply of data here as the only hint we have on member counts is the answers on the website. This makes us vulnerable to manipulation, members can trick us into underestimating their numbers. If the suggested estimation method to "count respondents with points >10" would be known, houses would only encourage the best to do the test to hide members from our count. This is a real problem and the only thing I will do about it is to present a "manipulation indicator".
How could we then estimate member counts? Since we do not know anything other than test results, we have to infer the propensity to do the test from the actual results. And we have little other to assume than that we would have a symmetric result distribution (of the logarithm of the points) if all members tested. Now let's say the strong would-be respondents are more likely to actually test than weak would-be respondents. Then we could measure the extra dropout ratio for the weak by comparing the numbers of respondents in corresponding weak and strong test-point quantiles.
To be specific, of the 205 answers, there are 27 in the worst half of the overall weakest quartile, while 32 in the strongest half of the best quartile. So an extra 5 respondents of the very weakest have dropped out from an assumed all-testing symmetric population, and to adjust for this, we are going to estimate member count from this quantile by multiplying the number of responses in it by 32/27=about 1.2. Similarly, we have 29/26 for the next less-extreme half quartiles and 41/50 for the two mid quartiles.
So we would estimate members by simply counting the number of respondents but multiplying the number of respondents in the weak quartiles mentioned above by 1.2, 1.1 and 0.8 respectively. If however any result distribution within a house would be conspicuously skewed, which is not the case now, we would have to suspect manipulation and re-design our member count.
For the sample at hand however, these adjustments to member counts are minor, and yields the same house ranks as from just counting the respondents without adjustments.
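For reference, a rough Python sketch of that adjustment (the 1.2, 1.1 and 0.8 multipliers are the rounded ratios quoted above; in practice you would recompute them from the pooled score distribution, and tie handling at the bucket edges may shift the counts slightly):

import numpy as np

def adjusted_member_count(house_points, pooled_points):
    # house_points: one house's individual point totals; pooled_points: all houses together.
    # Respondents in the weak buckets of the pooled distribution are re-weighted by the
    # size ratio of the mirror-image strong bucket; the strong half counts with weight 1.
    logs = np.log(pooled_points)
    edges = np.quantile(logs, [0.125, 0.25, 0.5])   # weakest octile, 2nd octile, weak mid quartile
    weights = [32 / 27, 29 / 26, 41 / 50, 1.0]      # ~1.2, ~1.1, ~0.8, and 1 for the strong half
    buckets = np.digitize(np.log(house_points), edges)
    return sum(weights[b] for b in buckets)

# house rank = total points / adjusted_member_count(house scores, all respondents' scores)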
I amused myself a little with your question and some Python programming with randomly generated data. As some people mentioned in the comments, you need to define what fairness is. If, as you said, you don't know the number of people in each of the houses, you can use the number of participations of each house, thus motivating participation (it can be unfair depending on the number of people in each house, but as you said you don't have this data in the first place).
The important part of the code is the following.
import numpy as np
from numpy.random import randint  # random integer generator

# initialize random seed
np.random.seed(4)

houses = ["Gryffindor", "Slytherin", "Hufflepuff", "Ravenclaw"]
houses_points = []
# generate random data for each house
for _ in houses:
    # houses_points.append(randint(0, 100, randint(60, 100)))
    houses_points.append(randint(0, 50, randint(2, 10)))

# count participations and total points per house
houses_participations = []
houses_total_points = []
for house_id in range(len(houses)):
    houses_total_points.append(np.sum(houses_points[house_id]))
    houses_participations.append(len(houses_points[house_id]))

# sum the total number of participations
total_participations = np.sum(houses_participations)

# proposed model with weighted total participation points
houses_partic_points = []
for house_id in range(len(houses)):
    # integer division, matching the results printed below
    tmp = houses_total_points[house_id] * houses_participations[house_id] // total_participations
    houses_partic_points.append(tmp)
The results of this method are the following:
House Points per Participant
Gryffindor: [46 5 1 40]
Slytherin: [ 8 9 39 45 30 40 36 44 38]
Hufflepuff: [42 3 0 21 21 9 38 38]
Ravenclaw: [ 2 46]
House Number of Participations per House
Gryffindor: 4
Slytherin: 9
Hufflepuff: 8
Ravenclaw: 2
House Total Points
Gryffindor: 92
Slytherin: 289
Hufflepuff: 172
Ravenclaw: 48
House Points weighted by a participation factor
Gryffindor: 16
Slytherin: 113
Hufflepuff: 59
Ravenclaw: 4
You'll find the complete file with printing results here (https://gist.github.com/silgon/5be78b1ea0b55a20d90d9ec3e7c515e5).
You should enter some more rules to define the fairness.
Idea 1
You could set up the rule that anyone has to earn at least 10 points to enter the competition.
Then you can calculate the average points for each house.
Positive: Everyone needs to show some motivation.
Idea 2
Another approach would be to set the rule that from each house only the 10 best students will count for the competition.
Positive: Easy rule to calculate the points.
Negative: Students might become uninterested if they see they can't reach the top 10 places of their house.
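A tiny Python sketch of both ideas above (the 10-point threshold and the top-10 cutoff are the values from these rules):

def idea1_average(points, min_points=10):
    # Average points per qualifying member; only members with at least min_points count.
    qualifying = [p for p in points if p >= min_points]
    return sum(qualifying) / len(qualifying) if qualifying else 0

def idea2_top_n(points, n=10):
    # Sum of the n best students' points for the house.
    return sum(sorted(points, reverse=True)[:n])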
From my point of view, your problem is divided into a few points:
The best thing to do would be to reassign the players to the different Houses so that each House has the same number of players (as explained by #navid-vafaei).
You may not want to do that if you believe it may affect your game's popularity with players who end up in a House they didn't want, even though the choice of the Sorting Hat can be changed, at least in the movies or books.
In that case, you can sum the points of each student's house and divide by the number of students. You may simply exclude students with a very low score, and also students with very low activity, because students who skip school might be expelled.
The most important part of your algorithm, for me, is whether or not you give points for all valuable things:
In the Harry Potter story, the students earn points in the different subjects they take at school, according to their scores.
At the end of the year, there is a special award event. At that moment, the Headmaster gives points for valuable things which cannot be evaluated in the school subjects, such as personal qualities (bravery, for example).

random choice, specific distributions, variable applicability

Imagine I have persons 1, 2, 3 and 4, and shirt styles A, B, C, D. I want to distribute the shirt styles to the people such that 25% of them get style A, 25% get style B, 25% get style C and 25% get style D, but some of the people refuse to wear certain styles; these refusals are represented by Fs. How can I randomly match all the people with the styles they are willing to wear to get the approximate distribution?
A B C D
1 T F T T
2 T F F F
3 T T T T
4 T T T F
In this case this is easy and the 25% split can be fully achieved: just give each person a different style. However, I intend to take this problem beyond this simple situation; my solution has to be generic. The number of styles, the number of people, and the distribution are all variable. Sometimes the distribution will be impossible to achieve 100% accurately; an approximate/close/best-effort result is expected. The selection process should be random and attempt to maintain the distribution.
I'm pretty agnostic to the language here, I'm just seeking the algorithm. Though preferably it would be able to be distributed.
Finding a solution when you are hampered by the Fs is the assignment problem: https://en.wikipedia.org/wiki/Assignment_problem. One way to select an arbitrary assignment when there are many would be to set random costs where a style is acceptable to a person and then let it find the assignment with the lowest possible cost. However, it is not obvious that this will fit any natural definition of random. One (very inefficient) natural definition of random would be to select from all possible assignments at random until you get one that is acceptable to everybody. The distribution you get from this might not be the same as the one you would get by setting up random costs and then solving the resulting assignment problem.
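As a sketch of the random-cost variant, using SciPy's linear_sum_assignment (the forbidden cells simply get a prohibitively large cost; as noted above, the resulting distribution over assignments is not necessarily uniform):

import numpy as np
from scipy.optimize import linear_sum_assignment

# willing[i][j] = True if person i accepts style j (the table from the question)
willing = np.array([[True, False, True,  True],
                    [True, False, False, False],
                    [True, True,  True,  True],
                    [True, True,  True,  False]])

# Random costs where a style is acceptable, a huge cost where it is not
costs = np.where(willing, np.random.rand(*willing.shape), 1e9)
people, styles = linear_sum_assignment(costs)   # minimum-cost perfect matching
print({int(p) + 1: "ABCD"[s] for p, s in zip(people, styles)})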
You are using the term 'randomly match' which should be used with caution. The correct interpretation, I believe, is a random selection from the set of all valid solutions, so basically if we could enumerate all valid solution - we could trivially solve the problem.
You are looking for a close-enough solution, so we need to better define what a valid solution is. I suggest to define some threshold (say 1% error at most).
In your example, there are 4 groups (those assigned with shirt style A/B/C/D). Therefore, there are 2^4-1 possible person archetypes (love/hate each of A/B/C/D, with the assumption that anyone loves at least one shirt style). Each archetype population has a given population size, and each archetype can be assigned to some of the 4 groups (1 or more).
The goal is to divide the population of each archetype between the 4 groups, such that say, the size of each group is between L and H.
Let's formalize it.
Problem statement:
Denote A(0001),...,A(1111): the population size of each of the 15 archetypes
Denote G1(0001): the size of A(0001) assigned to G1, etc.
Given
L,H: constants
A(0001),...,A(1111): 15 constants
Our goal is to find all integer solutions for
G1(0001),G1(0011),G1(0101),G1(0111),G1(1001),G1(1011),G1(1101),G1(1111),
G2(0010),G2(0011),G2(0110),G2(0111),G2(1010),G2(1011),G2(1110),G2(1111),
G3(0100),G3(0101),G3(0110),G3(0111),G3(1100),G3(1101),G3(1110),G3(1111),
G4(1000),G4(1001),G4(1010),G4(1011),G4(1100),G4(1101),G4(1110),G4(1111)
subject to:
G1(0001) = A(0001)
G2(0010) = A(0010)
G2(0011) + G1(0011) = A(0011)
G3(0100) = A(0100)
G3(0101) + G1(0101) = A(0101)
G3(0110) + G2(0110) = A(0110)
G3(0111) + G2(0111) + G1(0111) = A(0111)
G4(1000) = A(1000)
G4(1001) + G1(1001) = A(1001)
G4(1010) + G2(1010) = A(1010)
G4(1011) + G2(1011) + G1(1011) = A(1011)
G4(1100) + G3(1100) = A(1100)
G4(1101) + G3(1101) + G1(1101) = A(1101)
G4(1110) + G3(1110) + G2(1110) = A(1110)
G4(1111) + G3(1111) + G2(1111) + G1(1111) = A(1111)
L < G1(0001)+G1(0011)+G1(0101)+G1(0111)+G1(1001)+G1(1011)+G1(1101)+G1(1111) < H
L < G2(0010)+G2(0011)+G2(0110)+G2(0111)+G2(1010)+G2(1011)+G2(1110)+G2(1111) < H
L < G3(0100)+G3(0101)+G3(0110)+G3(0111)+G3(1100)+G3(1101)+G3(1110)+G3(1111) < H
L < G4(1000)+G4(1001)+G4(1010)+G4(1011)+G4(1100)+G4(1101)+G4(1110)+G4(1111) < H
Now we can use an integer programming solver for the job.
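For a problem as small as the example, you can also skip the solver and enumerate: list every assignment that respects the Fs, keep the ones closest to the target distribution, and pick one uniformly at random, which matches the "random selection from the set of all valid solutions" reading above. A brute-force Python sketch (the willing table is the one from the question):

import itertools
import random
from collections import Counter

willing = {1: "ACD", 2: "A", 3: "ABCD", 4: "ABC"}
people, styles = list(willing), "ABCD"
target = {s: len(people) / len(styles) for s in styles}   # 25% each

def distribution_error(assignment):
    counts = Counter(assignment.values())
    return sum(abs(counts[s] - target[s]) for s in styles)

# All assignments that respect the F cells
valid = [dict(zip(people, combo))
         for combo in itertools.product(styles, repeat=len(people))
         if all(style in willing[p] for p, style in zip(people, combo))]

# Keep only those closest to the target distribution, then pick one at random
best = min(map(distribution_error, valid))
print(random.choice([a for a in valid if distribution_error(a) == best]))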

A simple explanation of Naive Bayes Classification [closed]

I am finding it hard to understand the process of Naive Bayes, and I was wondering if someone could explain it with a simple step-by-step process in English. I understand it uses counts of occurrences as probabilities, but I have no idea how the training data is related to the actual dataset.
Please give me an explanation of what role the training set plays. I am giving a very simple example with fruits here, such as banana.
training set---
round-red
round-orange
oblong-yellow
round-red
dataset----
round-red
round-orange
round-red
round-orange
oblong-yellow
round-red
round-orange
oblong-yellow
oblong-yellow
round-red
The accepted answer has many elements of k-NN (k-nearest neighbors), a different algorithm.
Both k-NN and NaiveBayes are classification algorithms. Conceptually, k-NN uses the idea of "nearness" to classify new entities. In k-NN 'nearness' is modeled with ideas such as Euclidean Distance or Cosine Distance. By contrast, in NaiveBayes, the concept of 'probability' is used to classify new entities.
Since the question is about Naive Bayes, here's how I'd describe the ideas and steps to someone. I'll try to do it with as few equations and in plain English as much as possible.
First, Conditional Probability & Bayes' Rule
Before someone can understand and appreciate the nuances of Naive Bayes', they need to know a couple of related concepts first, namely, the idea of Conditional Probability, and Bayes' Rule. (If you are familiar with these concepts, skip to the section titled Getting to Naive Bayes')
Conditional Probability in plain English: What is the probability that something will happen, given that something else has already happened.
Let's say that there is some Outcome O. And some Evidence E. From the way these probabilities are defined: The Probability of having both the Outcome O and Evidence E is:
(Probability of O occurring) multiplied by the (Prob of E given that O happened)
One Example to understand Conditional Probability:
Let's say we have a collection of US Senators. Senators could be Democrats or Republicans. They are also either male or female.
If we select one senator completely randomly, what is the probability that this person is a female Democrat? Conditional Probability can help us answer that.
Probability of (Democrat and Female Senator)= Prob(Senator is Democrat) multiplied by Conditional Probability of Being Female given that they are a Democrat.
P(Democrat & Female) = P(Democrat) * P(Female | Democrat)
We could compute the exact same thing, the reverse way:
P(Democrat & Female) = P(Female) * P(Democrat | Female)
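A quick numeric check of the two factorizations, with made-up counts (the 100-senator breakdown below is hypothetical, just to show both routes give the same joint probability):

# hypothetical counts of (party, sex)
counts = {("D", "F"): 16, ("D", "M"): 32, ("R", "F"): 9, ("R", "M"): 43}
total = sum(counts.values())                                   # 100 senators

p_dem = (counts["D", "F"] + counts["D", "M"]) / total          # P(Democrat) = 0.48
p_fem = (counts["D", "F"] + counts["R", "F"]) / total          # P(Female)   = 0.25
p_fem_given_dem = counts["D", "F"] / (counts["D", "F"] + counts["D", "M"])
p_dem_given_fem = counts["D", "F"] / (counts["D", "F"] + counts["R", "F"])

print(p_dem * p_fem_given_dem)   # 0.16
print(p_fem * p_dem_given_fem)   # 0.16 -- same joint probability either way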
Understanding Bayes Rule
Conceptually, this is a way to go from P(Evidence| Known Outcome) to P(Outcome|Known Evidence). Often, we know how frequently some particular evidence is observed, given a known outcome. We have to use this known fact to compute the reverse, to compute the chance of that outcome happening, given the evidence.
P(Outcome given that we know some Evidence) = P(Evidence given that we know the Outcome) times Prob(Outcome), scaled by the P(Evidence)
The classic example to understand Bayes' Rule:
Probability of Disease D given Test-positive =
P(Test is positive|Disease) * P(Disease)
_______________________________________________________________
(scaled by) P(Testing Positive, with or without the disease)
Now, all this was just preamble, to get to Naive Bayes.
Getting to Naive Bayes'
So far, we have talked only about one piece of evidence. In reality, we have to predict an outcome given multiple pieces of evidence. In that case, the math gets very complicated. To get around that complication, one approach is to 'uncouple' the multiple pieces of evidence, and to treat each piece of evidence as independent. This approach is why this is called naive Bayes.
P(Outcome|Multiple Evidence) =
P(Evidence1|Outcome) * P(Evidence2|outcome) * ... * P(EvidenceN|outcome) * P(Outcome)
scaled by P(Multiple Evidence)
Many people choose to remember this as:
P(Likelihood of Evidence) * Prior prob of outcome
P(outcome|evidence) = _________________________________________________
P(Evidence)
Notice a few things about this equation:
If the Prob(evidence|outcome) is 1, then we are just multiplying by 1.
If the Prob(some particular evidence|outcome) is 0, then the whole prob. becomes 0. If you see contradicting evidence, we can rule out that outcome.
Since we divide everything by P(Evidence), we can even get away without calculating it.
The intuition behind multiplying by the prior is so that we give high probability to more common outcomes, and low probabilities to unlikely outcomes. These are also called base rates and they are a way to scale our predicted probabilities.
How to Apply NaiveBayes to Predict an Outcome?
Just run the formula above for each possible outcome. Since we are trying to classify, each outcome is called a class and it has a class label. Our job is to look at the evidence, to consider how likely it is to be this class or that class, and assign a label to each entity.
Again, we take a very simple approach: The class that has the highest probability is declared the "winner" and that class label gets assigned to that combination of evidences.
Fruit Example
Let's try it out on an example to increase our understanding: The OP asked for a 'fruit' identification example.
Let's say that we have data on 1000 pieces of fruit. They happen to be Banana, Orange or some Other Fruit.
We know 3 characteristics about each fruit:
Whether it is Long
Whether it is Sweet and
If its color is Yellow.
This is our 'training set.' We will use this to predict the type of any new fruit we encounter.
Type Long | Not Long || Sweet | Not Sweet || Yellow |Not Yellow|Total
___________________________________________________________________
Banana | 400 | 100 || 350 | 150 || 450 | 50 | 500
Orange | 0 | 300 || 150 | 150 || 300 | 0 | 300
Other Fruit | 100 | 100 || 150 | 50 || 50 | 150 | 200
____________________________________________________________________
Total | 500 | 500 || 650 | 350 || 800 | 200 | 1000
___________________________________________________________________
We can pre-compute a lot of things about our fruit collection.
The so-called "Prior" probabilities. (If we didn't know any of the fruit attributes, this would be our guess.) These are our base rates.
P(Banana) = 0.5 (500/1000)
P(Orange) = 0.3
P(Other Fruit) = 0.2
Probability of "Evidence"
p(Long) = 0.5
P(Sweet) = 0.65
P(Yellow) = 0.8
Probability of "Likelihood"
P(Long|Banana) = 0.8
P(Long|Orange) = 0 [Oranges are never long in all the fruit we have seen.]
....
P(Yellow|Other Fruit) = 50/200 = 0.25
P(Not Yellow|Other Fruit) = 0.75
Given a Fruit, how to classify it?
Let's say that we are given the properties of an unknown fruit, and asked to classify it. We are told that the fruit is Long, Sweet and Yellow. Is it a Banana? Is it an Orange? Or Is it some Other Fruit?
We can simply run the numbers for each of the 3 outcomes, one by one. Then we choose the highest probability and 'classify' our unknown fruit as belonging to the class that had the highest probability based on our prior evidence (our 1000 fruit training set):
P(Banana|Long, Sweet and Yellow)
P(Long|Banana) * P(Sweet|Banana) * P(Yellow|Banana) * P(banana)
= _______________________________________________________________
P(Long) * P(Sweet) * P(Yellow)
= 0.8 * 0.7 * 0.9 * 0.5 / P(evidence)
= 0.252 / P(evidence)
P(Orange|Long, Sweet and Yellow) = 0
P(Other Fruit|Long, Sweet and Yellow)
P(Long|Other fruit) * P(Sweet|Other fruit) * P(Yellow|Other fruit) * P(Other Fruit)
= ____________________________________________________________________________________
P(evidence)
= (100/200 * 150/200 * 50/200 * 200/1000) / P(evidence)
= 0.01875 / P(evidence)
By an overwhelming margin (0.252 >> 0.01875), we classify this Sweet/Long/Yellow fruit as likely to be a Banana.
Why is Bayes Classifier so popular?
Look at what it eventually comes down to. Just some counting and multiplication. We can pre-compute all these terms, and so classifying becomes easy, quick and efficient.
Let z = 1 / P(evidence). Now we quickly compute the following three quantities.
P(Banana|evidence) = z * Prob(Banana) * Prob(Evidence1|Banana) * Prob(Evidence2|Banana) ...
P(Orange|Evidence) = z * Prob(Orange) * Prob(Evidence1|Orange) * Prob(Evidence2|Orange) ...
P(Other|Evidence) = z * Prob(Other) * Prob(Evidence1|Other) * Prob(Evidence2|Other) ...
Assign the class label of whichever is the highest number, and you are done.
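The same fruit computation in a few lines of Python (the counts are the ones from the table above; the common factor z = 1 / P(evidence) is dropped since it cancels when comparing classes):

# (long, sweet, yellow, total) counts per class, from the 1000-fruit table
counts = {"Banana":      (400, 350, 450, 500),
          "Orange":      (  0, 150, 300, 300),
          "Other Fruit": (100, 150,  50, 200)}
total_fruit = 1000

def score(cls):
    long_, sweet, yellow, n = counts[cls]
    return (n / total_fruit) * (long_ / n) * (sweet / n) * (yellow / n)

for cls in counts:
    print(cls, score(cls))                 # Banana 0.252, Orange 0.0, Other Fruit 0.01875
print("Winner:", max(counts, key=score))   # Banana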
Despite the name, Naive Bayes turns out to be excellent in certain applications. Text classification is one area where it really shines.
Your question, as I understand it, is divided into two parts: part one being that you need a better understanding of the Naive Bayes classifier, and part two being the confusion surrounding the training set.
In general, all machine learning algorithms need to be trained for supervised learning tasks like classification and prediction, or for unsupervised learning tasks like clustering.
During the training step, the algorithms are taught with a particular input dataset (the training set) so that later on we may test them on unknown inputs (which they have never seen before), which they may classify or predict, etc. (in the case of supervised learning), based on their learning. This is what most machine learning techniques like Neural Networks, SVMs, Bayesian methods, etc. are based upon.
So in a general machine learning project, you basically have to divide your input set into a Development Set (Training Set + Dev-Test Set) and a Test Set (or Evaluation Set). Remember, your basic objective is that your system learns and classifies new inputs which it has never seen before in either the Dev set or the test set.
The test set typically has the same format as the training set. However, it is very important that the test set be distinct from the training corpus: if we simply
reused the training set as the test set, then a model that simply memorized its input, without learning how to generalize to new examples, would receive misleadingly high scores.
In general, for an example, 70% of our data can be used as training set cases. Also remember to partition the original set into the training and test sets randomly.
Now I come to your other question about Naive Bayes.
To demonstrate the concept of Naïve Bayes Classification, consider the example given below:
As indicated, the objects can be classified as either GREEN or RED. Our task is to classify new cases as they arrive, i.e., decide to which class label they belong, based on the currently existing objects.
Since there are twice as many GREEN objects as RED, it is reasonable to believe that a new case (which hasn't been observed yet) is twice as likely to have membership GREEN rather than RED. In the Bayesian analysis, this belief is known as the prior probability. Prior probabilities are based on previous experience, in this case the percentage of GREEN and RED objects, and often used to predict outcomes before they actually happen.
Thus, we can write:
Prior Probability of GREEN: number of GREEN objects / total number of objects
Prior Probability of RED: number of RED objects / total number of objects
Since there is a total of 60 objects, 40 of which are GREEN and 20 RED, our prior probabilities for class membership are:
Prior Probability for GREEN: 40 / 60
Prior Probability for RED: 20 / 60
Having formulated our prior probability, we are now ready to classify a new object (WHITE circle in the diagram below). Since the objects are well clustered, it is reasonable to assume that the more GREEN (or RED) objects in the vicinity of X, the more likely that the new cases belong to that particular color. To measure this likelihood, we draw a circle around X which encompasses a number (to be chosen a priori) of points irrespective of their class labels. Then we calculate the number of points in the circle belonging to each class label. From this we calculate the likelihood:
From the illustration above, it is clear that Likelihood of X given GREEN is smaller than Likelihood of X given RED, since the circle encompasses 1 GREEN object and 3 RED ones. Thus:
Although the prior probabilities indicate that X may belong to GREEN (given that there are twice as many GREEN compared to RED) the likelihood indicates otherwise; that the class membership of X is RED (given that there are more RED objects in the vicinity of X than GREEN). In the Bayesian analysis, the final classification is produced by combining both sources of information, i.e., the prior and the likelihood, to form a posterior probability using the so-called Bayes' rule (named after Rev. Thomas Bayes 1702-1761).
Finally, we classify X as RED since its class membership achieves the largest posterior probability.
Naive Bayes is a supervised machine learning algorithm used to classify data sets.
It is used to predict things based on prior knowledge and independence assumptions.
They call it naive because its assumptions (it assumes that all of the features in the dataset are equally important and independent) are really optimistic and rarely true in most real-world applications.
It is a classification algorithm which makes decisions for unknown data sets. It is based on Bayes' theorem, which describes the probability of an event based on prior knowledge.
Below diagram shows how naive Bayes works
Formula to predict NB:
How do you use the Naive Bayes algorithm?
Let's take an example of how N.B. works.
Step 1: First we work out the likelihood table, which shows the probability of yes or no, as in the diagram below.
Step 2: Find the posterior probability of each class.
Problem: Find the probability that the player plays in rainy conditions.
P(Yes|Rainy) = P(Rainy|Yes) * P(Yes) / P(Rainy)
P(Rainy|Yes) = 2/9 = 0.222
P(Yes) = 9/14 = 0.64
P(Rainy) = 5/14 = 0.36
Now, P(Yes|Rainy) = 0.222*0.64/0.36 = 0.39, which is a low probability, meaning the chances of the match being played are low.
For more background, refer to these blogs.
Refer to the GitHub repository Naive-Bayes-Examples.
Ram Narasimhan explained the concept very nicely; below is an alternative explanation through a code example of Naive Bayes in action.
It uses an example problem from this book on page 351
This is the data set that we will be using
In the above dataset, if we are given the hypothesis = {"Age":'<=30', "Income":"medium", "Student":'yes', "Creadit_Rating":'fair'}, then what is the probability that he will or will not buy a computer?
The code below exactly answers that question.
Just create a file named new_dataset.csv and paste the following content.
Age,Income,Student,Creadit_Rating,Buys_Computer
<=30,high,no,fair,no
<=30,high,no,excellent,no
31-40,high,no,fair,yes
>40,medium,no,fair,yes
>40,low,yes,fair,yes
>40,low,yes,excellent,no
31-40,low,yes,excellent,yes
<=30,medium,no,fair,no
<=30,low,yes,fair,yes
>40,medium,yes,fair,yes
<=30,medium,yes,excellent,yes
31-40,medium,no,excellent,yes
31-40,high,yes,fair,yes
>40,medium,no,excellent,no
Here is the code; the comments explain everything we are doing here. [Python]
import pandas as pd
import pprint
from functools import reduce

class Classifier():
    data = None
    class_attr = None
    priori = {}
    cp = {}
    hypothesis = None

    def __init__(self, filename=None, class_attr=None):
        self.data = pd.read_csv(filename, sep=',', header=0)
        self.class_attr = class_attr

    '''
    probability(class) =    How many times it appears in column
                         __________________________________________
                                count of all class attribute
    '''
    def calculate_priori(self):
        class_values = list(set(self.data[self.class_attr]))
        class_data = list(self.data[self.class_attr])
        for i in class_values:
            self.priori[i] = class_data.count(i) / float(len(class_data))
        print("Priori Values: ", self.priori)

    '''
    Here we calculate the individual probabilities
    P(outcome|evidence) =   P(Likelihood of Evidence) x Prior prob of outcome
                          ___________________________________________
                                         P(Evidence)
    '''
    def get_cp(self, attr, attr_type, class_value):
        data_attr = list(self.data[attr])
        class_data = list(self.data[self.class_attr])
        total = 1
        for i in range(0, len(data_attr)):
            if class_data[i] == class_value and data_attr[i] == attr_type:
                total += 1
        return total / float(class_data.count(class_value))

    '''
    Here we calculate the Likelihood of Evidence and multiply all individual probabilities with the priori
    (Outcome|Multiple Evidence) = P(Evidence1|Outcome) x P(Evidence2|outcome) x ... x P(EvidenceN|outcome) x P(Outcome)
    scaled by P(Multiple Evidence)
    '''
    def calculate_conditional_probabilities(self, hypothesis):
        for i in self.priori:
            self.cp[i] = {}
            for j in hypothesis:
                self.cp[i].update({hypothesis[j]: self.get_cp(j, hypothesis[j], i)})
        print("\nCalculated Conditional Probabilities: \n")
        pprint.pprint(self.cp)

    def classify(self):
        print("Result: ")
        for i in self.cp:
            print(i, " ==> ", reduce(lambda x, y: x * y, self.cp[i].values()) * self.priori[i])

if __name__ == "__main__":
    c = Classifier(filename="new_dataset.csv", class_attr="Buys_Computer")
    c.calculate_priori()
    c.hypothesis = {"Age": '<=30', "Income": "medium", "Student": 'yes', "Creadit_Rating": 'fair'}
    c.calculate_conditional_probabilities(c.hypothesis)
    c.classify()
output:
Priori Values: {'yes': 0.6428571428571429, 'no': 0.35714285714285715}
Calculated Conditional Probabilities:
{
'no': {
'<=30': 0.8,
'fair': 0.6,
'medium': 0.6,
'yes': 0.4
},
'yes': {
'<=30': 0.3333333333333333,
'fair': 0.7777777777777778,
'medium': 0.5555555555555556,
'yes': 0.7777777777777778
}
}
Result:
yes ==> 0.0720164609053
no ==> 0.0411428571429
I try to explain the Bayes rule with an example.
What is the chance that a random person selected from the society is a smoker?
You may reply 10%, and let's assume that's right.
Now, what if I say that the random person is a man and is 15 years old?
You may say 15 or 20%, but why?
In fact, we try to update our initial guess with new pieces of evidence ( P(smoker) vs. P(smoker | evidence) ). The Bayes rule is a way to relate these two probabilities.
P(smoker | evidence) = P(smoker)* p(evidence | smoker)/P(evidence)
Each piece of evidence may increase or decrease this chance. For example, the fact that he is a man may increase the chance, provided that the percentage of men among non-smokers is lower.
In other words, being a man must be an indicator of being a smoker rather than a non-smoker. Therefore, if a piece of evidence is an indicator of something, it increases the chance.
But how do we know that this is an indicator?
For each feature, you can compare the commonness (probability) of that feature under the given conditions with its commonness alone. (P(f | x) vs. P(f)).
P(smoker | evidence) / P(smoker) = P(evidence | smoker)/P(evidence)
For example, if we know that 90% of smokers are men, that's still not enough to say whether being a man is an indicator of being a smoker or not. For example, if the probability of being a man in the society is also 90%, then knowing that someone is a man doesn't help us ((90% / 90%) = 1). But if men make up 40% of the society and 90% of the smokers, then knowing that someone is a man increases the chance of being a smoker ((90% / 40%) = 2.25), so it increases the initial guess (10%) by a factor of 2.25, resulting in 22.5%.
However, if the probability of being a man were 95% in the society, then regardless of the fact that the percentage of men among smokers is high (90%), the evidence that someone is a man would decrease the chance of his being a smoker ((90% / 95%) = 0.95).
So we have:
P(smoker | f1, f2, f3, ...) = P(smoker) * contribution of f1 * contribution of f2 * ...
                            = P(smoker)
                              * (P(being a man | smoker) / P(being a man))
                              * (P(under 20 | smoker) / P(under 20))
Note that in this formula we assumed that being a man and being under 20 are independent features, so we multiplied them; it means that knowing someone is under 20 has no effect on guessing whether they are a man or a woman. But that may not be true; for example, maybe most adolescents in a society are men...
To use this formula in a classifier
The classifier is given some features (being a man and being under 20) and must decide whether the person is a smoker or not (these are the two classes). It uses the above formula to calculate the probability of each class given the evidence (features), and it assigns the class with the highest probability to the input. To provide the required probabilities (90%, 10%, 80%, ...) it uses the training set. For example, it counts the people in the training set who are smokers and finds they make up 10% of the sample. Then, for smokers, it checks how many of them are men or women, and how many are above or under 20. In other words, it tries to build the probability distribution of the features for each class based on the training data.
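The smoker numbers above, written out as a small Python check (the 10%, 90% and 40% figures are the ones used in this example; a second feature such as "under 20" would simply multiply in another P(feature | class) / P(feature) ratio):

p_smoker = 0.10             # prior: P(smoker)
p_man_given_smoker = 0.90   # P(being a man | smoker)
p_man = 0.40                # P(being a man) in the whole society

contribution_man = p_man_given_smoker / p_man   # 2.25: being a man is an indicator
p_smoker_given_man = p_smoker * contribution_man
print(p_smoker_given_man)   # 0.225, i.e. the initial 10% guess becomes 22.5%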
