Ranking items by score and relative frequencies - algorithm

I want to rank item types by comparing the ratio of their frequency in basket 1 to their frequency in basket 2.
For example, if item type A has about 5 counts in basket 1 and 0 counts in basket 2, it should rank much higher than type B with, say, 10 items in basket 1 and 10 items in basket 2. I use the log odds ratio abs(log(freq in basket 1 / freq in basket 2)); however, this doesn't capture the fact that I should prioritize a [10, 100] split over a [1, 10] split, even though abs(log(10/100)) equals abs(log(1/10)).
I'm considering multiplying this result by the total count, e.g. (10+100)*abs(log(10/100)), but then the count seems to overwhelm the log value.
What would be a good suggestion to weigh the log values?

The standard approach to these types of tasks is to model each item type as a biased coin that assigns items to baskets: B1 with probability p and B2 with probability 1 - p. Intuitively this means that an item type has an underlying true ratio between baskets which produces a particular split of items. So a "90% A" might produce [9, 1], but also [10, 0] or even [0, 10], although the last with a pretty low probability.
Then you can look at a sample like [5,0] and [10,1] and calculate a confidence interval for the parameter p, then rank the item types by the lower bound of the interval. This way [10,2] will sort above [5,1]. Even though proportions in both samples are the same, [10,2] will have a narrower confidence interval and thus its lower bound will be higher.
The idea and some more detailed formulas are described at: http://www.evanmiller.org/how-not-to-sort-by-average-rating.html
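A minimal sketch of that lower-bound ranking in Python, using the Wilson score interval from the linked article (the function name is mine):

import math

def wilson_lower_bound(count1, total, z=1.96):
    # lower bound of the Wilson score interval for the proportion of items
    # falling in basket 1; z = 1.96 corresponds to a 95% confidence level
    if total == 0:
        return 0.0
    phat = count1 / total
    denom = 1 + z * z / total
    centre = phat + z * z / (2 * total)
    margin = z * math.sqrt((phat * (1 - phat) + z * z / (4 * total)) / total)
    return (centre - margin) / denom

# [10, 2] ranks above [5, 1] despite identical raw proportions:
print(wilson_lower_bound(10, 12))  # ~0.55
print(wilson_lower_bound(5, 6))    # ~0.44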

Related

Formal name for this optimization algorithm?

I have the following problem in one of my coding projects, which I will simplify here:
I am ordering groceries online and want very specific things in very specific quantities. I would like to order the following:
8 Apples
1 Yam
2 Soups
3 Steaks
20 Orange Juices
There are many stores equidistant from me which I will have food delivered from. Not all stores have what I need. I want to obtain what I need with the fewest number of orders. For example, ordering from Store #2 below is a wasted order, since I can complete my list in fewer orders by ordering from different stores. What is the name of the optimization algorithm that solves this?
Store #1 Supply
50 Apples
Store #2 Supply
1 Orange Juice
2 Steaks
1 Soup
Store #3 Supply
25 Soup
50 Orange Juices
Store #4 Supply
25 Steaks
10 Yams
The lowest possible number of orders is 3 in this case: 8 Apples from Store #1; 2 Soups and 20 Orange Juices from Store #3; 1 Yam and 3 Steaks from Store #4.
To me, this sounds like a restricted case of the Integer Linear Programming (ILP) problem, namely its 0-or-1 variant, where the integer variables are restricted to the set {0, 1}. This is known to be NP-hard (and the corresponding decision problem is NP-complete).
The problem is formulated as follows (following the conventions in the op. cit.):
Given the matrix A, the constraint vector b, and the weight vector c, find the vector x ∈ {0, 1}^N such that all the constraints A·x ≥ b are satisfied and the cost c·x is minimal.
I flipped the constraint inequality, but this is equivalent to changing the sign of both A and b.
The inequalities express satisfaction of your order: that you can buy at least the required amount of every item across the visited stores. Note that b has the same length as the number of rows of A, while c and x have length equal to the number of columns of A. The dot product c·x is, naturally, a scalar.
Since you are minimizing the number of trips, and each trip costs the same, c = 1 (a vector of all ones), and c·x is the total number of trips. The store inventory matrix A has a row per item and a column per store, and b is your shopping list.
Naturally, the exact best solution can be found by trying all possible 2^N values of x.
Since there is no single approach to NP-hard problems, consider the problem size and how close to the optimum you want to get. A greedy approach (where the next store to visit is the one satisfying the most items not yet covered) works well when the "inventories" are large. If you have an idea in advance of the expected minimum number of trips, you can trim the search beam at some value exceeding that number of trips by some multiplicative coefficient. This is the best approach when your search is time-constrained (I routinely do beam searches, closely related to the branch-and-cut approach mentioned in the article, in graphs that take a few GB of memory, slightly faster than the limit of 30ms per exploration step, with a beam as wide as 10,000). Simulated annealing also works, if the search landscape is not excessively rough.
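A minimal sketch of that greedy heuristic on the grocery example, in Python (the data layout and names are my own; the greedy result is not guaranteed optimal):

def greedy_store_selection(shopping_list, stores):
    # repeatedly order from the store that covers the most still-unmet demand
    remaining = dict(shopping_list)
    visits = []
    def coverage(inventory):
        return sum(min(remaining.get(item, 0), stock)
                   for item, stock in inventory.items())
    while any(q > 0 for q in remaining.values()):
        best = max(stores, key=lambda name: coverage(stores[name]))
        if coverage(stores[best]) == 0:
            raise ValueError("remaining demand cannot be satisfied")
        visits.append(best)
        for item, stock in stores[best].items():
            remaining[item] = max(0, remaining.get(item, 0) - stock)
    return visits

stores = {
    "Store #1": {"apple": 50},
    "Store #2": {"orange juice": 1, "steak": 2, "soup": 1},
    "Store #3": {"soup": 25, "orange juice": 50},
    "Store #4": {"steak": 25, "yam": 10},
}
need = {"apple": 8, "yam": 1, "soup": 2, "steak": 3, "orange juice": 20}
print(greedy_store_selection(need, stores))  # ['Store #3', 'Store #1', 'Store #4']

On this instance the greedy happens to find the optimum of three orders; in general it can overshoot.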
Also try cs.SE; it may be an even better place for questions of this type.

Fractional Knapsack : why value/weight?

In Fractional Knapsack problem,
1. prepare a third array, the value-per-weight array, by dividing the value of each item by its corresponding weight
2. sort the items in descending order according to their value per weight
What is the reason behind steps 1 and 2?
Because it is more profitable to fill all available volume with the substance having the highest value/weight ratio. So use up the entire given quantity of the item with the best value/weight ratio, then the second best, and so on.
Just think: if you fill some part of the knapsack with a less valuable item while a more valuable one is still available, swapping it for the more valuable one gains you additional money.
In fractional knapsack you can add a part/fraction of an item to your knapsack (answer) to maximize the total value of the items in your bag. As the goal is to maximize the total value in the knapsack, we should put in items whose value is high and which take as little space as possible, so that more can be added. Hence we need to calculate value per weight. The capacity of the knapsack is not unlimited, so we need value per weight to utilize the space optimally.
For example, if we have weights = {10, 20, 30} and values = {60, 100, 120} and the bag can hold 50 at most, then without dividing items we can only take the items of weight 20 and 30. So the total value will be 220.
But as per fractional knapsack we divide value by weight and get the array {6, 5, 4}. Sort it (already sorted). Now the order of items becomes i1, i2, i3:
Take all 10 kg of i1: 6*10 = 60
Take all 20 kg of i2: 5*20 = 100
Take the remaining 20 kg from i3: 4*((2/3)*30) = 80
Total value = 60 + 100 + 80 = 240
Hence, we need value per weight.
Refer to the link: https://www.geeksforgeeks.org/fractional-knapsack-problem/
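Here is that worked example as a short greedy sketch in Python (a minimal illustration, not taken from the linked article):

def fractional_knapsack(values, weights, capacity):
    # greedy: take items in descending value/weight order, splitting the last
    items = sorted(zip(values, weights), key=lambda vw: vw[0] / vw[1], reverse=True)
    total = 0.0
    for value, weight in items:
        if capacity <= 0:
            break
        take = min(weight, capacity)
        total += value * (take / weight)
        capacity -= take
    return total

print(fractional_knapsack([60, 100, 120], [10, 20, 30], 50))  # 240.0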
This is related to basic mathematics. Let's take a ratio, which has a numerator and a denominator.
In this case:
Ratio: value/weight
Numerator: value
Denominator: weight
For a fixed denominator, a bigger ratio means a bigger numerator (a direct relation); for a fixed numerator, a bigger ratio means a smaller denominator (an inverse relation).
So, when we arrange the ratios in descending order, we are sorting entries of relatively high value and low weight first, and we put them into our bag of given capacity in that order.

How do I calculate the most profit-dense combination in the most efficient way?

I have a combinations problem that's bothering me. I'd like someone to give me their thoughts and point out if I'm missing some obvious solution that I may have overlooked.
Let's say that there is a shop that buys all of its supplies from one supplier. The supplier has a list of items for sale. Each item has the following attributes:
size, cost, quantity, m, b
m and b are constants in the following equation:
sales = m * (price) + b
This line slopes downward. The equation tells me how many of that item I will be able to sell if I charge that particular price. Each item has its own m and b values.
Let's say that the shop has limited storage space, and limited funds. The shop wants to fill its warehouse with the most profit-dense items possible.
(By the way, profit density = profit/size. I'm defining profit density only with regard to the item's size. I could work with a density with regard to both size and cost, but to do that I'd have to know the cost of warehouse space, and that's not a number I currently know, so I'm just going to use size.)
The profit density of items drops the more you buy (see below.)
If I flip the line equation, I can see what price I'd have to charge to sell some given amount of the item in some given period of time.
price = (sales-b)/m
So if I buy n items and wanted to sell all of them, I'd have to charge
price = (n-b)/m
The revenue from this would be
price*n = n*(n-b)/m
The profit would be
price*n-n*cost = n*(n-b)/m - n*cost
and the profit-density would be
(n*(n-b)/m - n*cost)/(n*size)
or, equivalently
((n-b)/m - cost)/size
So let's say I have a table containing every available item, and each item's profit-density.
The question is, how many of each item do I buy in order to maximise the amount of money that the shop makes?
One possibility is to generate every possible combination of items within the bounds of cost and space, and choose the combo with the highest profitability. In a list of 1000 items, this takes too long. (I tried this and it took 17 seconds for a list of 1000. Horrible.)
Another option I tried (on paper) was to take the top two most profitable items on the list. Let's call the most profitable item A, the 2nd-most profitable item B, and the 3rd-most profitable item C. I buy as many of item A as I can until it's less profitable than item B. Then I repeat this process using B and C, for every item in the list.
It might be the case, however, that after buying item B, item A is again the most profitable item, more so than C. So this would involve hopping from the current most profitable item to the next until the resources are exhausted. I could do this, but it seems like an ugly way to do it.
I considered dynamic programming, but since the profit-densities of the items change depending on the amount you buy, I couldn't come up with a resolution for this.
I've considered multiple-linear regression, and by 'consider' I mean I've said to myself "is multi-linear regression an option?" and then done nothing with it.
My spidey-sense tells me that there's a far more obvious method staring me in the face, but I'm not seeing it. Please help me kick myself and facepalm at the same time.
If you treat this as a simple exercise in multivariate optimization, where the controllable variables are the quantities bought, then you are optimizing a quadratic function subject to a linear constraint.
If you use a Lagrange multiplier and differentiate then you get a linear equation for each quantity variable involving itself and the Lagrange multiplier as the only unknowns, and the constraint gives you a single linear equation involving all of the quantities. So write each quantity as a linear function of the Lagrange multiplier and substitute into the constraint equation to get a linear equation in the Lagrange multiplier. Solve this and then plug the Lagrange multiplier into the simpler equations to get the quantities.
This gives you a solution if you are allowed to buy fractional and negative quantities of things if required. Clearly you are not, but you might hope that nothing is very negative and you can round the non-integer quantities to get a reasonable answer. If this isn't good enough for you, you could use it as a basis for branch and bound. If you make an assumption on the value of one of the quantities and solve for the others in this way, you get an upper bound on the possible best answer - the profit predicted neglecting real world constraints on non-negativity and integer values will always be at least the profit earned if you have to comply with these constraints.
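Here is a sketch of the closed form this yields, under the assumption of a single space constraint (Python; the tuple layout and names are mine, and m < 0 makes each item's profit concave, so the stationary point is a maximum):

def lagrange_quantities(items, space):
    # items: list of (m, b, cost, size) per item; maximizes
    # sum(n*(n - b)/m - n*cost) subject to sum(n*size) == space.
    # Setting d/dn [n*(n - b)/m - n*cost] = lambda*size gives
    # n = (b + m*(cost + lambda*size)) / 2, linear in lambda;
    # substituting into the constraint yields one linear equation in lambda.
    num = 2 * space - sum(s * (b + m * c) for m, b, c, s in items)
    den = sum(m * s * s for m, b, c, s in items)
    lam = num / den  # the Lagrange multiplier
    return [(b + m * (c + lam * s)) / 2 for m, b, c, s in items]

As the answer notes, the resulting quantities may be fractional or negative; round/clamp them, or branch and bound from here.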
You can treat this as a dynamic programming exercise, to make the best use of a limited resource.
As a simple example, consider just satisfying the constraint on space and ignoring that on cost. Then you want to find the items that generate the most profit for the available space. Choose units so that expressing the space used as an integer is reasonable, and then, for i = 1 to number of items, work out, for each integer value of space up to the limit, the selection of the first i items that gives the most return for that amount of space. As usual, you can work out the answers for i+1 from the answers for i: for each value from 0 up to the limit on space just consider all possible quantities of the i+1th item up to that amount of space, and work out the combined return from using that quantity of the item and then using the remaining space according to the answers you have already worked out for the first i items. When i reaches the total number of items you will be working out the best possible return for the problem you actually want to solve.
If you have constraints for both space and cost, then the state of the dynamic program is not the single variable (space) but a pair of variables (space, cost) but you can still solve it, although with more work. Consider all possible values of (space, cost) from (0, 0) up to the actual constraints - you have a 2-dimensional table of returns to compute instead of a single set of values from 0 to max-space. But you can still work from i=1 to N, computing the highest possible return for the first i items for each limit of (space, cost) and using the answers for i to compute the answers for i+1.
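A sketch of that two-dimensional DP in Python (it assumes integer sizes and costs, and takes each item's total profit as a function of the quantity bought so the quadratic pricing above fits in; all names are mine):

def best_return(items, max_space, max_cost):
    # items: list of (size, unit_cost, profit_fn), where profit_fn(q) is the
    # total profit from stocking q units (e.g. q*(q - b)/m - q*cost above)
    prev = [[0.0] * (max_cost + 1) for _ in range(max_space + 1)]
    for size, unit_cost, profit_fn in items:
        cur = [row[:] for row in prev]  # q = 0: keep the previous answers
        for s in range(max_space + 1):
            for c in range(max_cost + 1):
                q = 1
                while q * size <= s and q * unit_cost <= c:
                    cand = prev[s - q * size][c - q * unit_cost] + profit_fn(q)
                    if cand > cur[s][c]:
                        cur[s][c] = cand
                    q += 1
        prev = cur
    return prev[max_space][max_cost]

The running time is roughly O(N * space * cost * max-quantity), so choose the discretization units with care.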

How to balance number of ratings versus the ratings themselves?

For a school project, we'll have to implement a ranking system. However, we figured that a dumb rank average would suck: something that one user ranked 5 stars would have a better average than something 188 users ranked 4 stars, and that's just stupid.
So I'm wondering if any of you have an example algorithm of "smart" ranking. It only needs to take into account the ratings given and the number of ratings.
Thanks!
You can use a method inspired by Bayesian probability. The gist of the approach is to have an initial belief about the true rating of an item, and use users' ratings to update your belief.
This approach requires two parameters:
What do you think is the true "default" rating of an item, if you have no ratings at all for the item? Call this number R, the "initial belief".
How much weight do you give to the initial belief, compared to the user ratings? Call this W, where the initial belief is "worth" W user ratings of that value.
With the parameters R and W, computing the new rating is simple: assume you have W ratings of value R along with any user ratings, and compute the average. For example, if R = 2 and W = 3, we compute the final score for various scenarios below:
100 (user) ratings of 4: (3*2 + 100*4) / (3 + 100) = 3.94
3 ratings of 5 and 1 rating of 4: (3*2 + 3*5 + 1*4) / (3 + 3 + 1) = 3.57
10 ratings of 4: (3*2 + 10*4) / (3 + 10) = 3.54
1 rating of 5: (3*2 + 1*5) / (3 + 1) = 2.75
No user ratings: (3*2 + 0) / (3 + 0) = 2
1 rating of 1: (3*2 + 1*1) / (3 + 1) = 1.75
This computation takes into consideration the number of user ratings, and the values of those ratings. As a result, the final score roughly corresponds to how happy one can expect to be about a particular item, given the data.
Choosing R
When you choose R, think about what value you would be comfortable assuming for an item with no ratings. Is the typical no-rating item actually 2.4 out of 5, if you were to instantly have everyone rate it? If so, R = 2.4 would be a reasonable choice.
You should not use the minimum value on the rating scale for this parameter, since an item rated extremely poorly by users should end up "worse" than a default item with no ratings.
If you want to pick R using data rather than just intuition, you can use the following method:
Consider all items with at least some threshold of user ratings (so you can be confident that the average user rating is reasonably accurate).
For each item, assume its "true score" is the average user rating.
Choose R to be the median of those scores.
If you want to be slightly more optimistic or pessimistic about a no-rating item, you can choose R to be a different percentile of the scores, for instance the 60th percentile (optimistic) or 40th percentile (pessimistic).
Choosing W
The choice of W should depend on how many ratings a typical item has, and how consistent ratings are. W can be higher if items naturally obtain many ratings, and W should be higher if you have less confidence in user ratings (e.g., if you have high spammer activity). Note that W does not have to be an integer, and can be less than 1.
Choosing W is a more subjective matter than choosing R. However, here are some guidelines:
If a typical item obtains C ratings, then W should not exceed C, or else the final score will depend more on R than on the actual user ratings. Instead, W should be a small fraction of C, perhaps between C/20 and C/5 (depending on how noisy or "spammy" ratings are).
If historical ratings are usually consistent (for an individual item), then W should be relatively small. On the other hand, if ratings for an item vary wildly, then W should be relatively large. You can think of this algorithm as "absorbing" W ratings that are abnormally high or low, turning those ratings into more moderate ones.
In the extreme, setting W = 0 is equivalent to using only the average of the user ratings, while setting W = infinity is equivalent to proclaiming that every item has a true rating of R, regardless of the user ratings. Clearly, neither of these extremes is appropriate.
Setting W too large can have the effect of favoring an item with many moderately-high ratings over an item with slightly fewer exceptionally-high ratings.
I appreciated the top answer at the time of posting, so here it is codified as JavaScript:
const defaultR = 2; // the initial belief R
const defaultW = 3; // the weight W; should not exceed the typical number of ratings per item, and 0 is equivalent to using only the average of user ratings
function getSortAlgoValue(ratings) {
  const sumOfRatings = ratings.reduce((sum, r) => sum + r, 0);
  return (defaultR * defaultW + sumOfRatings) / (defaultW + ratings.length);
}
Only listed as a separate answer because the formatting of the code block as a reply wasn't very readable.
Since you've stated that the machine would only be given the rankings and the number of rankings, I would argue that it may be negligent to attempt a calculated weighting method.
First, there are too many unknowns to confirm the proposition that, in enough circumstances, a larger quantity of ratings is a better indication of quality than a smaller number of ratings. One example: how long have rankings been collected? Has there been an equal collection duration (equal attention) given to the different items ranked with this same method? Others are: which markets have had access to this item, and, of course, who specifically ranked it?
Secondly, you've stated in a comment below the question that this is not for front-end use but rather "the ratings are generated by machines, for machines," as a response to my comment that "it's not necessarily only statistical. One person might consider 50 ratings enough, where that might not be enough for another. And some raters' profiles might look more reliable to one person than to another. When that's transparent, it lets the user make a more informed assessment."
Why would that be any different for machines? :)
In any case, if this is about machine-to-machine rankings, the question needs greater detail in order for us to understand how different machines might generate and use the rankings.
Can a ranking generated by a machine be flawed (so as to suggest that more rankings may somehow compensate for those "flawed" rankings)? What does that even mean - is it a machine error? Or is it because the item has no use for this particular machine, for example? There are many issues here we might first want to unpack, including whether we have access to how the machines generate the rankings; if we do, on some level we may already know the meaning this item has for a given machine, making the aggregated ranking superfluous.
What you can find on different platforms is the hiding of ratings that don't have enough votes: "This item does not have enough votes."
The problem is that you can't capture all of this in one easy ranking formula.
I would suggest hiding the ranking when there are fewer than a minimum number of votes, but internally calculating a moving average. I always prefer a moving average over a total average, as it favors recent votes over very old ones, which might have been given under totally different circumstances.
Additionally, you do not need to keep a list of all votes: you just store the calculated average, and each new vote updates this value.
newAverage = weight * newVoting + (1-weight) * oldAverage
with a weight of about 0.05, which emphasizes roughly the last 20 votes (just experiment with this weight).
Additionally I would start with these conditions:
no votes = mid-range value (1-5 stars => start with 3 stars)
the average is not shown if fewer than 10 votes were given.
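A minimal sketch of that update rule in Python (the 0.05 weight is the value suggested above):

def update_average(old_average, new_vote, weight=0.05):
    # exponential moving average: each new vote nudges the stored value,
    # so no history of individual votes needs to be kept
    return weight * new_vote + (1 - weight) * old_average

avg = 3.0  # mid-range starting value for a 1-5 scale
for vote in (5, 4, 5, 2):
    avg = update_average(avg, vote)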
A simple solution might be a plain average:
sum(votes) / number_of_votes
That way, 3 people voting 1 star and one person voting 5 stars would give an average of (1+1+1+5)/4 = 2 stars.
Simple, effective, and probably sufficient for your purposes.

How can I sort a 10 x 10 grid of 100 car images in two dimensions, by price and speed?

Here's the scenario.
I have one hundred car objects. Each car has a property for speed, and a property for price. I want to arrange images of the cars in a grid so that the fastest and most expensive car is at the top right, and the slowest and cheapest car is at the bottom left, and all other cars are in an appropriate spot in the grid.
What kind of sorting algorithm do I need to use for this, and do you have any tips?
EDIT: the results don't need to be exact - in reality I'm dealing with a much bigger grid, so it would be sufficient if the cars were clustered roughly in the right place.
Just an idea inspired by Mr Cantor:
calculate max(speed) and max(price)
normalize all speed and price data into range 0..1
for each car, calculate the "distance" to the possible maximum
based on a²+b²=c², distance could be something like
sqrt( (speed(car[i])/maxspeed)^2 + (price(car[i])/maxprice)^2 )
apply weighting as (visually) necessary
sort cars by distance
place "best" car in "best" square (upper right in your case)
walk the grid in zigzag and fill with next car in sorted list
Result (mirrored, top left is best):
1 - 2 6 - 7
/ / /
3 5 8
| /
4
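A rough sketch of that recipe in Python (the dict keys and the exact zigzag order are my own choices; this one matches the diagram above, with the "best" car at the top left):

import math

def grid_positions(width, height):
    # yield (row, col) cells along anti-diagonals, alternating direction,
    # so consecutive list entries stay adjacent in the grid
    for d in range(width + height - 1):
        cells = [(r, d - r) for r in range(d + 1) if r < height and d - r < width]
        yield from (reversed(cells) if d % 2 == 0 else cells)

def place_cars(cars, width=10, height=10):
    max_speed = max(c["speed"] for c in cars)
    max_price = max(c["price"] for c in cars)
    # distance from the origin in normalized (speed, price) space; larger = "better"
    def score(c):
        return math.hypot(c["speed"] / max_speed, c["price"] / max_price)
    ranked = sorted(cars, key=score, reverse=True)
    grid = [[None] * width for _ in range(height)]
    for car, (r, c) in zip(ranked, grid_positions(width, height)):
        grid[r][c] = car
    return grid  # mirror the columns if you want the best car at the top right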
Treat this as two problems:
1: Produce a sorted list
2: Place members of the sorted list into the grid
The sorting is just a matter of defining your rules more precisely. "Fastest and most expensive first" doesn't work: which comes first, my £100,000 Rolls-Royce with a top speed of 120, or my souped-up Mini costing £50,000 with a top speed of 180?
Having got your list, how will you fill the grid? First and last are easy, but where does number two go: along the top, or down? And where next - along rows, along columns, zig-zag? You've got to decide; after that the coding should be easy.
I guess what you want is to have cars that have "similar" characteristics to be clustered nearby, and additionally that the cost in general increases rightwards, and speed in general increases upwards.
I would try to following approach. Suppose you have N cars and you want to put them in an X * Y grid. Assume N == X * Y.
Put all the N cars in the grid at random locations.
Define a metric that calculates the total misordering in the grid; for example, count the number of car pairs C1=(x,y) and C2=(x',y') such that C1.speed > C2.speed but y < y' plus car pairs C1=(x,y) and C2=(x',y') such that C1.price > C2.price but x < x'.
Run the following algorithm:
Calculate current misordering metric M
Enumerate all pairs of cars in the grid and calculate the misordering metric M' you would obtain if you swapped them
Swap the pair of cars that reduces the metric most, if any such pair was found
If you swapped two cars, repeat from step 1
Finish
This is a standard "local search" approach to an optimization problem. What you have here is basically a simple combinatorial optimization problem. Another approach to try might be a self-organizing map (SOM) with a preseeded gradient of speed and cost in the matrix.
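Here is a minimal sketch of that local search in Python (the orientation and dict keys are my assumptions - row 0 is the bottom, so speed should increase with the row index and price with the column index; recomputing the full metric per candidate swap makes this illustrative rather than efficient):

def misordering(grid):
    # count ordering violations: a pair is bad if one car is faster but sits
    # in a lower row, or is pricier but sits in a column further left
    cars = [(r, c, car) for r, row in enumerate(grid) for c, car in enumerate(row)]
    bad = 0
    for r1, c1, a in cars:
        for r2, c2, b in cars:
            if a["speed"] > b["speed"] and r1 < r2:
                bad += 1
            if a["price"] > b["price"] and c1 < c2:
                bad += 1
    return bad

def local_search(grid):
    # repeatedly apply the single swap that reduces the metric the most
    cells = [(r, c) for r in range(len(grid)) for c in range(len(grid[0]))]
    while True:
        current = misordering(grid)
        best_delta, best_pair = 0, None
        for i in range(len(cells)):
            for j in range(i + 1, len(cells)):
                (r1, c1), (r2, c2) = cells[i], cells[j]
                grid[r1][c1], grid[r2][c2] = grid[r2][c2], grid[r1][c1]
                delta = misordering(grid) - current
                grid[r1][c1], grid[r2][c2] = grid[r2][c2], grid[r1][c1]
                if delta < best_delta:
                    best_delta, best_pair = delta, ((r1, c1), (r2, c2))
        if best_pair is None:
            return grid
        (r1, c1), (r2, c2) = best_pair
        grid[r1][c1], grid[r2][c2] = grid[r2][c2], grid[r1][c1]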
Basically, you take either speed or price as the primary key, group the cars by that primary value, and sort each group by the secondary value; the primary values themselves are also taken in ascending/descending order as needed.
Example:
c1(20,1000) c2(30,5000) c3(20, 500) c4(10, 3000) c5(35, 1000)
Let's assume Car(speed, price) is the notation in the above list, and that the primary key is speed.
1. Get the car with the minimum speed.
2. Then get all the cars with that same speed value.
3. Arrange these in ascending order of price.
4. Get the car with the next minimum speed value and repeat the process.
c4(10, 3000)
c3(20, 500)
c1(20, 1000)
c2(30, 5000)
c5(35, 1000)
If you post what language you are using, it would be helpful, as some language constructs make this easier to implement. For example, LINQ makes your life very easy in this situation:
cars.OrderBy(x => x.Speed).ThenBy(p => p.Price);
Edit:
Now that you've got the list, as for placing the car items into the grid: unless you know in advance that there will be a predetermined number of cars with these values, you can't do anything except go with some fixed grid size, as you are doing now.
One option would be to go with a nonuniform grid, if you prefer, with each row holding cars of a specific speed, but this is only applicable when you know there will be a considerable number of cars sharing the same speed value.
So each row of the grid would show cars of the same speed.
Thanks
Is the 10x10 constraint necessary? If it is, you must have ten speeds and ten prices, or else the diagram won't make very much sense. For instance, what happens if the fastest car isn't the most expensive?
I would rather recommend you make the grid size equal to
(number of distinct speeds) x (number of distinct prices),
then it would be a (rather) simple case of ordering by two axes.
If the data originates in a database, then you should order them as you fetch them from the database. This should only mean adding ORDER BY speed, price near the end of your query, but before the LIMIT part (where 'speed' and 'price' are the names of the appropriate fields).
As others have said, "fastest and most expensive" is a difficult thing to do, you ought to just pick one to sort by first. However, it would be possible to make an approximation using this algorithm:
Find the highest price and fastest speed.
Normalize all prices and speeds to e.g. a fraction out of 1. You do this by dividing each price by the highest price you found in step 1, and likewise each speed by the highest speed.
Multiply the normalized price and speed together to create one "price & speed" number.
Sort by this number.
This ensures that if car A is faster and more expensive than car B, it gets put ahead of B in the list. Cars where one value is higher but the other is lower get roughly sorted. I'd recommend storing these values in the database and sorting as you select.
Putting them in a 10x10 grid is easy. Start outputting items, and when you get to a multiple of 10, start a new row.
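A small sketch of that approach in Python (the field names are illustrative):

def rough_grid(cars, columns=10):
    # sort by the product of normalized price and speed, then fill row by row
    max_price = max(c["price"] for c in cars)
    max_speed = max(c["speed"] for c in cars)
    ordered = sorted(cars,
                     key=lambda c: (c["price"] / max_price) * (c["speed"] / max_speed),
                     reverse=True)
    return [ordered[i:i + columns] for i in range(0, len(ordered), columns)]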
Another option is to apply a score 0 .. 200% to each car, and sort by that score.
Example:
score_i = speed_percent(min_speed, max_speed, speed_i) + price_percent(min_price, max_price, price_i)
Hmmm... a kind of bubble sort could be a simple algorithm here.
Make a random 10x10 array.
Find two neighbours (horizontal or vertical) that are in "wrong order", and exchange them.
Repeat (2) until no such neighbours can be found.
Two neighbour elements are in "wrong order" when:
a) they're horizontal neighbours and left one is slower than right one,
b) they're vertical neighbours and top one is cheaper than bottom one.
But I'm not actually sure this algorithm terminates for all inputs. I'm almost sure it is very slow :-). It should be easy to implement, and after some finite number of iterations the partial result might be good enough for your purposes. You can also start by generating the array using one of the other methods mentioned here. It will also maintain your condition on the array shape.
Edit: It is too late here to prove anything, but I made some experiments in Python. It looks like a random 100x100 array can be sorted this way in a few seconds, and I always managed to get a full 2D ordering (that is: at the end there were no wrongly-ordered neighbours). Assuming the OP can precalculate this array, he can put any reasonable number of cars into it and get sensible results. Experimental code: http://pastebin.com/f2bae9a79 (you need matplotlib, and I recommend ipython too). iterchange is the sorting method there.
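A minimal sketch of that neighbour-swap pass in Python (the dict keys and the row-0-is-top orientation are my assumptions, and the iteration cap reflects the caveat that termination is unproven):

def neighbour_sort(grid, max_sweeps=10000):
    # swap wrongly-ordered neighbours per the rules above: a horizontal pair
    # is wrong if the left car is slower than the right one, and a vertical
    # pair is wrong if the top car is cheaper than the bottom one
    rows, cols = len(grid), len(grid[0])
    for _ in range(max_sweeps):
        changed = False
        for r in range(rows):
            for c in range(cols):
                if c + 1 < cols and grid[r][c]["speed"] < grid[r][c + 1]["speed"]:
                    grid[r][c], grid[r][c + 1] = grid[r][c + 1], grid[r][c]
                    changed = True
                if r + 1 < rows and grid[r][c]["price"] < grid[r + 1][c]["price"]:
                    grid[r][c], grid[r + 1][c] = grid[r + 1][c], grid[r][c]
                    changed = True
        if not changed:
            break
    return grid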
