Normalize 5-star ratings to make them more uniform - algorithm

I have a system where people rate different items on a scale from 0-5. The issue is, not everyone rates the same items, and the scoring is not objective. The goal is to achieve a fair comparison between items so that an item's score is not affected too much if one of the scorers is very "lenient" or "harsh." In actuality, there may be 100 items, each one scored twice, but here is an example dataset where 4 people scored 12 items, each one being scored twice:
| Item | Score 1 (scorer) | Score 2 (scorer) |
|------|------------------|------------------|
| 1    | 5 (A)            | 4 (C)            |
| 2    | 5 (A)            | 3 (C)            |
| 3    | 4 (A)            | 2 (C)            |
| 4    | 5 (A)            | 5 (D)            |
| 5    | 3 (A)            | 0 (D)            |
| 6    | 5 (A)            | 3 (D)            |
| 7    | 3 (B)            | 1 (D)            |
| 8    | 4 (B)            | 1 (D)            |
| 9    | 4 (B)            | 2 (D)            |
| 10   | 4 (B)            | 3 (C)            |
| 11   | 4 (B)            | 3 (C)            |
| 12   | 5 (B)            | 4 (C)            |
In this table, the letter in parentheses next to each score identifies the person who gave it. We label the person who gave score 1 to items 1-6 person A, the one who gave score 1 to items 7-12 person B, the one who gave score 2 to items 1-3 and 10-12 person C, and the one who gave score 2 to items 4-9 person D.
Informally, if we assume person C was the closest to each item's objective score, we might reason as follows:
Person A generally gave higher scores than C on items 1-3, so he is "lenient."
D gave low scores to all of his items except item 4, which must then have been truly good. He gave scores generally lower than A, so perhaps his scores should be adjusted slightly upwards.
B gave higher scores than D, and a bit higher than C, so he is a bit "lenient."
Thus, we might produce adjusted scores for each item. For example, even though item 2 has a higher average score than item 9, the two are probably on par, considering A is generally lenient and D is generally harsh. The question is: how do we do this programmatically? I thought we might make a transformation function for each person which maps a raw score to an adjusted score, say A, B, C, and D. For example, we might have A(5)=3.7 because when A rates an item as 5, it is really in the 3-4 range. Then, we want to minimize
|A(x_0a)-C(x_0c)|^2 + |D(x_1d)-A(x_1a)|^2 + |B(x_2b)-D(x_2d)|^2 + |C(x_3c)-B(x_3b)|^2
where x_ip is a vector consisting of person p's ratings for items 3i+1, 3i+2, and 3i+3. We might make A, B, C, and D linear transformations, for example. How then do you optimize it? And is this the best way to eliminate the harshness or leniency of scorers without throwing away their ratings?
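One concrete way to set this up (a sketch, not from the original question: it models A, B, and D as affine maps a_p*x + b_p and pins C to the identity, so the objective cannot be minimized trivially by mapping everything to a constant) is an ordinary least-squares fit:

# Sketch: fit an affine adjustment f_p(x) = a_p*x + b_p per scorer,
# with person C fixed as the reference (f_C(x) = x), by minimizing the
# squared disagreement on items scored by two people.
from scipy.optimize import least_squares

# (item, scorer 1, raw score 1, scorer 2, raw score 2) from the table
data = [
    (1, "A", 5, "C", 4), (2, "A", 5, "C", 3), (3, "A", 4, "C", 2),
    (4, "A", 5, "D", 5), (5, "A", 3, "D", 0), (6, "A", 5, "D", 3),
    (7, "B", 3, "D", 1), (8, "B", 4, "D", 1), (9, "B", 4, "D", 2),
    (10, "B", 4, "C", 3), (11, "B", 4, "C", 3), (12, "B", 5, "C", 4),
]
free = ["A", "B", "D"]  # C is pinned to the identity map

def transform(scorer, x, params):
    if scorer == "C":
        return float(x)
    i = free.index(scorer)
    return params[2 * i] * x + params[2 * i + 1]

def residuals(params):
    return [transform(s1, r1, params) - transform(s2, r2, params)
            for _, s1, r1, s2, r2 in data]

fit = least_squares(residuals, x0=[1.0, 0.0] * len(free))  # start at identity
adjusted = {item: (transform(s1, r1, fit.x) + transform(s2, r2, fit.x)) / 2
            for item, s1, r1, s2, r2 in data}
print(adjusted)  # e.g. compare the adjusted scores of items 2 and 9

With only two scores per item such a fit is fragile, and the choice of reference scorer matters. A common, simpler alternative is to standardize each scorer's ratings (subtract that scorer's mean, divide by their standard deviation) before averaging, which removes leniency and harshness without an explicit optimization.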


How to decide the probability percentage in question

I have the following question:
In the first part, it says the probability that the selected person will be a male is 0.44, which means the number of males is 25 * 0.44 = 11. That's OK.
In the second part, the probability that the selected person will be a male who was born before 1960 is 0.28. Does that mean 0.28 out of the total number, which is 25, or out of the number of males?
I mean, should the number of males born before 1960 equal 25 * 0.28 or 11 * 0.28?
I find it easiest to think of these sorts of problems as contingency tables.
You use a matrix layout to express the distributions in terms of two or more factors or characteristics, each having two or more categories. The table can be constructed either with probabilities (proportions) or with counts, and switching back and forth is easy based on the total count in the table. Entries in the table are the intersections of the categories, corresponding to "and" in a verbal description. The numbers to the right or at the bottom of the table are called marginals, because they're found in the margins of the table, and are always the sum of the row or column entries in which they occur. The total probability (or count) in the table is found by summing across all the rows and columns. The marginal distribution of gender would be found by summing across rows, and the marginal distribution of birthdays would be found by summing across columns.
Based on this, you can inferentially determine other values as indicated by the entries in parentheses below. With one more entry, either for gender or in the marginal row for birthdays, you'd be able to fill in the whole table inferentially. (This is related to the concept of degrees of freedom - how many pieces of info can you fill in independently before the others are determined by the known constraint that the totals are fixed or that probability adds to 1.)
Probabilities

                   Birthday
                < 1960 | >= 1960 | total
              +--------+---------+--------
Gender    F   |        |         | (0.56)
          M   |  0.28  | (0.16)  |  0.44
              +--------+---------+--------
        total |    ?   |    ?    |  1.00

Counts

                   Birthday
                < 1960 | >= 1960 | total
              +--------+---------+--------
Gender    F   |        |         |  (14)
          M   |    7   |   (4)   |   11
              +--------+---------+--------
        total |    ?   |    ?    |   25
Conditional probability corresponds to limiting yourself to the subset of rows or columns specified in the condition. If you had been asked what is the probability of a birthday < 1960 given the gender is male, i.e., P{birthday < 1960 | M} in relatively standard notation, you'd be restricting your focus to just the M row, so the answer would be 7/11 = 0.28/0.44. Computationally, you take the probabilities or counts in the qualifying table entries and express them as a proportion of the probabilities or counts of the specified (given) marginal entries. This is often written in prob & stats texts as P(A|B) = P(AB)/P(B), where AB is a set shorthand for A and B (intersection).
0.44 = 11/25 of the people are male.
0.28 = 7/25 of the people are male & born before 1960.
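A quick numeric check of the tables above (an illustrative sketch; the variable names are mine):

# Fill in the known cells, then derive the rest from the fixed margins.
total = 25
p_male = 0.44             # given: P(male)
p_male_pre1960 = 0.28     # given: P(male AND born < 1960) -- out of all 25

males = round(total * p_male)                  # 11
males_pre1960 = round(total * p_male_pre1960)  # 7, i.e. 25 * 0.28
males_post1960 = males - males_pre1960         # (4), inferred

# conditional probability P(born < 1960 | male) = 7/11 = 0.28/0.44
print(males_pre1960 / males, p_male_pre1960 / p_male)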

Minimum cost within limited time for a timetable?

I have a timetable like this:
+-----------+-------------+------------+------------+------------+------------+-------+----+
| transport | trainnumber | departcity | arrivecity | departtime | arrivetime | price | id |
+-----------+-------------+------------+------------+------------+------------+-------+----+
| Q | Q00 | BJ | TJ | 13:00:00 | 15:00:00 | 10 | 1 |
| Q | Q01 | BJ | TJ | 18:00:00 | 20:00:00 | 10 | 2 |
| Q | Q02 | TJ | BJ | 16:00:00 | 18:00:00 | 10 | 3 |
| Q | Q03 | TJ | BJ | 21:00:00 | 23:00:00 | 10 | 4 |
| Q | Q04 | HA | DL | 06:00:00 | 11:00:00 | 50 | 5 |
| Q | Q05 | HA | DL | 14:00:00 | 19:00:00 | 50 | 6 |
| Q | Q06 | HA | DL | 18:00:00 | 23:00:00 | 50 | 7 |
| Q | Q07 | DL | HA | 07:00:00 | 12:00:00 | 50 | 8 |
| Q | Q08 | DL | HA | 15:00:00 | 20:00:00 | 50 | 9 |
| ... | ... | ... | ... | ... | ... | ... | ...|
+-----------+-------------+------------+------------+------------+------------+-------+----+
In this table there are 13 cities and 116 routes altogether, and the smallest unit of time is half an hour.
There are different transport types, but that doesn't matter here. As you can see, there can be multiple edges with the same departcity and arrivecity but different times and different prices. The timetable is the same every day.
Now, here is the problem.
A user wants to know how he can travel from city A to city B (A and B may be the same city), passing through zero or more cities C, D, ... (whether they must be visited in a given order depends on what the user wants, so there are really two problems), within X hours and at the least cost under these conditions.
Before this problem, I solved a simpler one:
A user wants to know how he can travel from city A to city B (A and B may be the same city), passing through zero or more cities C, D, ... (in a given order or not, as above), at the least cost under these conditions.
Here is how I solve it (taking the unordered case as an example; a direct code transcription is sketched below):
1. Sort the must-pass cities: C1, C2, C3, ..., Cn. Let C0 = A, C(n+1) = B, minCost.cost = INFINITE.
2. i = 0, j = 1, W = {}.
3. Find a least-cost path S from Ci to Cj using Dijkstra's algorithm with price as the edge weight. W = W ∪ S.
4. i = i + 1, j = j + 1.
5. If j <= n + 1, go to step 3.
6. If W.cost < minCost.cost, minCost = W.
7. If a next permutation of C1...Cn exists, rearrange C1...Cn into that permutation and go to step 2.
8. Return minCost.
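The sketch of those steps (an illustration under one assumption: `dijkstra` is a function, supplied by the caller, that returns a least-price (cost, path) pair between two cities on the price-weighted graph):

# Try every order of the must-pass cities; solve each leg with Dijkstra.
from itertools import permutations

def min_cost_route(dijkstra, A, B, must_pass):
    best_cost, best_route = float("inf"), None
    for perm in permutations(must_pass):        # step 7: next permutation
        stops = [A, *perm, B]                   # C0 = A, C(n+1) = B
        cost, route = 0, []
        for ci, cj in zip(stops, stops[1:]):    # steps 3-5: leg by leg
            leg_cost, leg_path = dijkstra(ci, cj)
            cost += leg_cost
            route += leg_path
        if cost < best_cost:                    # step 6
            best_cost, best_route = cost, route
    return best_cost, best_route                # step 8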
However, I cannot come up with an efficient solution to the first problem. Please help me, thanks.
I'd also appreciate it if anyone can solve one more problem:
A user wants to know how he can travel from city A to city B (A and B may be the same city), passing through zero or more cities C, D, ... (in a given order or not, as above), in the least time under these conditions.
It's quite a big problem, so I will just sketch a solution.
First, remodel your graph as follows. Instead of each vertex representing a city, let a vertex represent a tuple of (city, time). This is feasible as there are only 13 cities and only (time_limit - current_time) * 2 possible time slots as the smallest unit of time is half an hour. Now connect vertices according to the given timetable with prices as their weights as before. Don't forget that the user can stay at any city for any amount of time for free. All nodes with city A are start nodes, all nodes with city B are target nodes. Take the minimum value of all (B, time) vertices to get the solution with least cost. If there are multiple, take the one with the smallest time.
Now on towards forcing the user to pass through certain cities in order. If there are n cities to pass through (plus start and target city), you need n+2 copies of the same graph which act as different levels. The level represents how many cities of your list you have already passed. So you start in level 0 on vertex A. Once you get to C1 in level 0 you move to the vertex C1 in level 1 of the graph (connect the vertices by 0-weight edges). This means that when you are in level k, you have already passed cities C1 to Ck and you can get to the next level only by going through C(k+1). The vertices of city B in the last level are your target nodes.
Note: I said copies of the same graph, but that is not exactly true. You can't allow the user to reach C(k+2), ..., B in level k, that would violate the required order.
To enforce passing cities in any order, a different scheme of connecting the levels (and modifying them during runtime) is required. I'll leave this to you.
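A sketch of the time-expanded idea (my own illustration, assuming time is discretized into half-hour ticks and the timetable is given as (depart_city, arrive_city, depart_tick, arrive_tick, price) tuples): nodes are (city, tick) pairs, waiting is a free edge to the next tick, and Dijkstra on price gives the cheapest arrival within the limit.

import heapq

def min_cost(timetable, start, target, max_tick):
    # index departures by (city, tick) for O(1) lookup during the search
    departures = {}
    for dep, arr, t_dep, t_arr, price in timetable:
        departures.setdefault((dep, t_dep), []).append((arr, t_arr, price))

    best = {(start, 0): 0}
    heap = [(0, start, 0)]          # (cost so far, city, tick)
    while heap:
        cost, city, tick = heapq.heappop(heap)
        if cost > best.get((city, tick), float("inf")):
            continue                # stale heap entry
        if city == target:
            return cost             # first target popped is the cheapest
        # wait half an hour for free
        if tick < max_tick and cost < best.get((city, tick + 1), float("inf")):
            best[(city, tick + 1)] = cost
            heapq.heappush(heap, (cost, city, tick + 1))
        # take any connection departing right now
        for arr, t_arr, price in departures.get((city, tick), []):
            if t_arr <= max_tick and cost + price < best.get((arr, t_arr), float("inf")):
                best[(arr, t_arr)] = cost + price
                heapq.heappush(heap, (cost + price, arr, t_arr))
    return None                     # target unreachable within max_tick

Note that this sketch treats A == B as a zero-cost trip; the must-pass levels described above would be added as a third component of the state, (city, tick, level), which also handles the round-trip case.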

How to randomize across categories holding the mean equal?

I am looking for some conceptual input, detached from any specific platform/software, for the following problem:
Let R be a Nx2 matrix with the first column denoting the object ID and the second column the category (e.g. from 1 to 10).
ID | Category
1 | 1
2 | 1
3 | 1
4 | 2
5 | 2
6 | 3
7 | 3
8 | 3
9 | 3
. | .
. | .
Further, assume we have a matrix C which assigns a number to each category, e.g.:
Category | Number
1 | 0.5
2 | 0.2
3 | 0.9
. | .
. | .
So for each object in matrix R a number can be mapped according to matrix C (e.g. for ID=1 with category=1, the number according to matrix C is 0.5).
The goal now is to create an algorithm which randomizes the objects across a pre-specified category range while the overall average of the mapped numbers is held constant.
E.g. assume that the category range is defined as 2, meaning that each object from category 1 can either stay in category 1, randomly be shifted to category 2, or even move up to category 3. Similarly, an object from category 3 with a selected category range of 1 can either be moved down to category 2, stay at category 3, or move up to category 4. If an object is shifted to another category, it gets assigned a new number according to matrix C, which impacts the overall average of the mapped numbers.
However, all shifts have to be executed on a purely random basis, with the additional constraint that the average of the mapped numbers after the randomization equals the one from the beginning.
Any input would be greatly appreciated.
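One possible scheme (a sketch under my own assumptions: matrix C is a dict `number_of`, and the mean is preserved exactly by only accepting pairs of random shifts whose number changes cancel):

import random

def randomize(categories, number_of, cat_range, n_moves=10_000, seed=None):
    rng = random.Random(seed)
    orig = list(categories)   # shifts are limited relative to these
    cats = list(categories)   # working copy that gets randomized
    n = len(cats)
    for _ in range(n_moves):
        i, j = rng.randrange(n), rng.randrange(n)
        if i == j:
            continue
        # propose shifts within +/- cat_range of each object's original category
        ci = rng.randint(orig[i] - cat_range, orig[i] + cat_range)
        cj = rng.randint(orig[j] - cat_range, orig[j] + cat_range)
        if ci not in number_of or cj not in number_of:
            continue          # proposed category does not exist in matrix C
        delta = (number_of[ci] - number_of[cats[i]]) \
              + (number_of[cj] - number_of[cats[j]])
        if abs(delta) < 1e-12:   # changes cancel -> average preserved exactly
            cats[i], cats[j] = ci, cj
    return cats

number_of = {1: 0.5, 2: 0.2, 3: 0.9, 4: 0.5}   # hypothetical matrix C
print(randomize([1, 1, 1, 2, 2, 3, 3, 3, 3], number_of, cat_range=1, seed=1))

The accept/reject step keeps the average exactly equal after every move, at the price of rejecting many proposals; if no pair of category changes can cancel, nothing will ever move, so it is worth checking the achievable deltas beforehand.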

Faster computation to get multiples of number at different levels

Here is the scenario:
We have several items that are shipped to many stores. We want to be able to allocate a certain quantity of each item to a store based on need. Each of these stores is also associated to a specific warehouse.
The catch is that at the warehouse level, the total quantity of each item must be a multiple of a number (6 for example).
I have already calculated out the quantity needed by each store at store level, but they do not sum up to a multiple of 6 at the warehouse level.
My solution was this using Excel:
I use a SUMIFS formula to keep track of the sum allocated for each item at the warehouse level, and another MOD(6) formula that calculates the remainder relative to a multiple of 6. My actual VBA code then loops through and subtracts 1 (if MOD <= 3) or adds 1 (if MOD > 3) to the store-level units needed until MOD = 0 for all rows.
Now this works for me, but it is extremely slow even when I have just ~5000 rows.
I am looking for a faster solution, because every time I subtract from or add to the units needed, the SUMIFS and MOD formulas need to be recalculated.
EDIT: (trying to be clearer)
I have a template file that I paste my data into with the following setup:
+------+-------+-----------+----------+--------------+--------+
| Item | Store | Warehouse | StoreQty | WarehouseQty | Mod(6) |
+------+-------+-----------+----------+--------------+--------+
| 1 | 1 | 1 | 2 | 8 | 2 |
| 1 | 2 | 1 | 3 | 8 | 2 |
| 1 | 3 | 1 | 1 | 8 | 2 |
| 1 | 4 | 1 | 2 | 8 | 2 |
| 2 | 1 | 2 | 1 | 4 | 2 |
| 2 | 2 | 2 | 3 | 4 | 2 |
+------+-------+-----------+----------+--------------+--------+
Currently the WarehouseQty column is a SUMIFS formula summing up the StoreQty for each Item-Store combo that is associated with the Warehouse. So the Warehouse/WarehouseQty values are actually duplicated on every row where an Item-Store combo shows up. The WarehouseQty is the value that needs to be a multiple of 6.
Screen updating can be turned off to speed up lengthy computations like this:
Application.ScreenUpdating = False
The opposite assignment turns screen updating back on again.
Also, put the data into an array first rather than working on the cells, then write the data back after you have manipulated it - this will be much faster.
Here is an example which uses your criteria:
Option Explicit

Sub test()
    Dim q() 'variant array that will hold the range
    Dim i As Long
    q = Range("C2:C41") 'put the data into the array - *ALWAYS* 2 dimensions, even for a single column
    For i = LBound(q) To UBound(q) 'use the bounds in case it's a dynamic array - 1 To 40 would have worked here
        Select Case q(i, 1) Mod 6 'calculate the remainder
            Case 0 To 3
                q(i, 1) = q(i, 1) - (q(i, 1) Mod 6) 'round down to a multiple of 6
            Case 4 To 5
                q(i, 1) = q(i, 1) - (q(i, 1) Mod 6) + 6 'round up to a multiple of 6
        End Select
    Next i
    Range("D2:D41") = q 'drop the data back
End Sub
I'm guessing you'll find that stopping the screen refresh helps quite a chunk, so you may not need any more suggestions.
Another option would be to reduce your adjustment to a multiple of 6 to a handful of If statements, depending on the value of Mod(6).
You could also address how you sum up the number of a particular item across all stores: building a pivot table and reading the sum totals from there is a lot quicker than using SUMIFS in a macro.
Based on your modifications to the question:
You're correct that you could have huge amounts of replication doing the calculation row by row, as well as in adjusting the quantity by a single unit at a time even though the Mod(6) formula tells you exactly how many units you need to add or remove.
Could you not create a new sheet with all your possible combinations of product ID and store? You could then use SUMIFS() for each of these unique combinations and, in a final step, round up or down at the warehouse level.
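Outside Excel, the same "compute each warehouse total once, then apply the whole correction in one step" idea looks like this (a pandas sketch using the question's columns; sending the correction to the store with the largest quantity is my own assumption):

import pandas as pd

df = pd.DataFrame({
    "Item":      [1, 1, 1, 1, 2, 2],
    "Store":     [1, 2, 3, 4, 1, 2],
    "Warehouse": [1, 1, 1, 1, 2, 2],
    "StoreQty":  [2, 3, 1, 2, 1, 3],
})

MULTIPLE = 6
for (item, wh), group in df.groupby(["Item", "Warehouse"]):
    rem = group["StoreQty"].sum() % MULTIPLE
    # same rule as the question: round down if rem <= 3, else round up
    delta = -rem if rem <= MULTIPLE // 2 else MULTIPLE - rem
    # apply the whole correction at once to the largest store row
    df.loc[group["StoreQty"].idxmax(), "StoreQty"] += delta

print(df)  # every warehouse total is now a multiple of 6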

Apriori Algorithm

I've heard about the Apriori algorithm several times before but never had the time or the opportunity to dig into it. Can anyone explain to me in a simple way how this algorithm works? Also, a basic example would make it a lot easier for me to understand.
Apriori Algorithm
It is a candidate-generation-and-test approach for frequent pattern mining in datasets. There are two things you have to remember.
Apriori Pruning Principle - If any itemset is infrequent, then its supersets should not be generated/tested.
Apriori Property - A given (k+1)-itemset is a candidate (k+1)-itemset only if every one of its k-itemset subsets is frequent.
Now, here is the apriori algorithm in 4 steps.
Initially, scan the database/dataset once to get the frequent 1-itemsets.
Generate length k+1 candidate itemsets from length k frequent itemsets.
Test the candidates against the database/dataset.
Terminate when no frequent or candidate set can be generated.
Solved Example
Suppose there is a transaction database as follows, with 4 transactions listed with their transaction IDs and the items bought in them. Assume the minimum support min_sup is 2. The support of an itemset is the number of transactions in which that itemset is present/included.
Transaction DB
tid | items
-------------
10 | A,C,D
20 | B,C,E
30 | A,B,C,E
40 | B,E
Now, let's create the candidate 1-itemsets via the 1st scan of the DB. This set is simply called C_1, as follows.
itemset | sup
-------------
{A} | 2
{B} | 3
{C} | 3
{D} | 1
{E} | 3
If we test these against min_sup, we can see that {D} does not satisfy the min_sup of 2. So it will not be included in the frequent 1-itemsets, the set we simply call L_1, as follows.
itemset | sup
-------------
{A} | 2
{B} | 3
{C} | 3
{E} | 3
Now, let's scan the DB a 2nd time and generate the candidate 2-itemsets, the set we simply call C_2, as follows.
itemset | sup
-------------
{A,B} | 1
{A,C} | 2
{A,E} | 1
{B,C} | 2
{B,E} | 3
{C,E} | 2
As you can see, the itemsets {A,B} and {A,E} do not satisfy the min_sup of 2, and hence they will not be included in the frequent 2-itemsets, L_2.
itemset | sup
-------------
{A,C} | 2
{B,C} | 2
{B,E} | 3
{C,E} | 2
Now let's do a 3rd scan of the DB and get the candidate 3-itemsets, C_3, as follows.
itemset | sup
-------------
{A,B,C} | 1
{A,B,E} | 1
{A,C,E} | 1
{B,C,E} | 2
You can see that {A,B,C}, {A,B,E} and {A,C,E} do not satisfy the min_sup of 2. So they will not be included in the frequent 3-itemsets, L_3, as follows.
itemset | sup
-------------
{B,C,E} | 2
Now, finally, we can calculate the support (supp), confidence (conf) and lift (interestingness) values of the association/correlation rules that can be generated from the itemset {B,C,E}, as follows.
Rule       | supp | conf   | lift
-------------------------------------------
B -> C & E | 50%  | 66.67% | 1.33
E -> B & C | 50%  | 66.67% | 1.33
C -> E & B | 50%  | 66.67% | 0.89
B & C -> E | 50%  | 100%   | 1.33
E & B -> C | 50%  | 66.67% | 0.89
C & E -> B | 50%  | 100%   | 1.33
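A compact transcription of these scans (my own illustration using the example DB and min_sup = 2; real implementations use far smarter candidate generation and counting):

from itertools import combinations

transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
min_sup = 2

def support(itemset):
    # number of transactions that contain the whole itemset
    return sum(itemset <= t for t in transactions)

# scan 1: frequent 1-itemsets (L_1)
items = sorted({i for t in transactions for i in t})
level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_sup]
frequent = {s: support(s) for s in level}

k = 1
while level:
    k += 1
    # candidate generation: unions of frequent (k-1)-itemsets of size k ...
    candidates = {a | b for a in level for b in level if len(a | b) == k}
    # ... pruned by the Apriori property: every (k-1)-subset must be frequent
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
    # scan k: keep candidates meeting min_sup (L_k)
    level = [c for c in candidates if support(c) >= min_sup]
    frequent.update({c: support(c) for c in level})

for s in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(set(s), frequent[s])   # reproduces L_1, L_2 and L_3 above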
See Top 10 algorithms in data mining (free access) or The Top Ten Algorithms in Data Mining. The latter gives a detailed description of the algorithm, together with details on how to get optimized implementations.
Well, I would assume you've read the Wikipedia entry, but you said "a basic example would make it a lot easier for me to understand". Wikipedia has just that, so I'll assume you haven't read it and suggest that you do.
Read the Wikipedia article.
The best introduction to Apriori can be downloaded from this book:
http://www-users.cs.umn.edu/~kumar/dmbook/index.php
You can download chapter 6 for free, which explains Apriori very clearly.
Moreover, if you want to download a Java version of Apriori and other algorithms for frequent itemset mining, you can check my website:
http://www.philippe-fournier-viger.com/spmf/
