R - Problem: to find the optimum number of non-uniform bins to show a range of data points.
I have a bunch of data points (let us assume different prices of different mobiles). I need to categorize these mobile phones into some categories (based on the price). The bin size (in this example refers to the price range) need not be uniform (there might be lots of mobiles in the low price category and few in the long tail category).
Is there an efficient algorithm to find the optimum number of bins and the number of data points (in this case mobile phones) that should go into each category?
This is not a standard formula, but I wanted to post it as it seems to work well with the data set I tested.
Find the average price of all the mobiles.
Ex: 5 mobiles with prices 10, 20, 40, 80, 200
Avg is 350/5 = 70
Subtract minimum price from average price: 70 - 10 = 60 -> name it N1
Subtract avg price from Max price: 200 - 70 = 130 -> name it N2
Find the ratio N2/N1 : 130/60: Roughly 2
This indicates that it is better to have 2 bins at the lower price range for every 1 bin at higher range.
So, for example take 2 bins below 70. Range 0 - 35(2 mobiles), 36 - 70(1 mobile)
1 bin above 70: Range 71 - 200(2 mobiles)
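As you can see, the three bins (two narrow ones at the low end, one wide one for the tail) hold 2, 1 and 2 mobiles, so the number of bins and the bin sizes are reasonably balanced.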
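A minimal Python sketch of the steps above, in case it helps to try them on other data. The function name `heuristic_bins`, starting the first bin at 0 (as in the example), and splitting the range below the average into equal-width bins are assumptions made for illustration. The heuristic does not say what to do when the data is skewed the other way, so the sketch simply falls back to one bin below the average in that case.

```python
from statistics import mean


def heuristic_bins(prices):
    """Return (edges, counts) for the average-based heuristic described above."""
    avg = mean(prices)
    n1 = avg - min(prices)   # distance from the minimum to the average
    n2 = max(prices) - avg   # distance from the average to the maximum
    if n1 == 0:              # all prices equal: a single bin is enough
        return [0, max(prices)], [len(prices)]
    low_bins = max(1, round(n2 / n1))   # bins below the average per 1 bin above
    width = avg / low_bins
    # Equal-width bins from 0 up to the average, then one bin for the tail.
    edges = [i * width for i in range(low_bins + 1)] + [max(prices)]
    counts = [0] * (len(edges) - 1)
    for p in prices:
        for i in range(len(counts)):
            if p <= edges[i + 1]:   # upper edge is inclusive
                counts[i] += 1
                break
    return edges, counts


edges, counts = heuristic_bins([10, 20, 40, 80, 200])
print(edges)   # [0.0, 35.0, 70.0, 200]
print(counts)  # [2, 1, 2]
```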
I'm in need of some kind of algorithm I can't figure out on my own sadly.
My biggest problem is that I have no good way to describe the problem... :/
I will try like this:
Imagine you have a racing game where everyone can try to be the fastest on a track or map. Every Map is worth 100 Points in total. If someone finished a map in some amount of time he gets a record in a database. If the player is the first and only player to finish this map he earns all the 100 points of this map.
Now, that's easy ;) but...
Now another player finishes the map. Let's imagine the first player finishes in 50 seconds and the 2nd player finishes in 55 seconds, so a bit slower. I now need a calculation that depends on both records in the database. Each of the two players now earns a part of the 100 points, the faster player a bit more than the slower player. If they had finished in exactly the same time they would both get 50 of the 100 points, but as the first one is slightly faster, he now earns something around 53 points and the slower player just 47.
I started to calculate this like this:
The sum of both records is 105 seconds. The faster player took 50/105 of this, so he earns 100-(50/105*100) points, and the slower player earns 100-(55/105*100) points. The key to this is that the points distributed among the players always add up to 100 in total. This works for 2 players, but it breaks at 3 and more.
For example:
Player 1 : 20 seconds
Player 2 : 20 seconds
Player 3 : 25 seconds
Calculation would be:
Player 1: 100-(20/65*100) = 69 points
Player 2: 100-(20/65*100) = 69 points
Player 3: 100-(25/65*100) = 61 points
This would no longer add up to 100 points in total.
Fair would be something around values of:
Player 1 & 2 (same time) = 35 points
Player 3 = 30 points
My problem is I can't figure out an algorithm which solves this.
And I need the same algorithm for any number of players. Can someone help with an idea? I don't need a complete finished algorithm, maybe just a hint at which step I went wrong. Maybe using the sum of all times is already a bad start.
Thx in advance :)
We can give each player points proportional to the reciprocal of their time.
One player with t seconds gets 100 × (1/t) / (1/t) = 100 points.
Of the two players, the one with 50 seconds gets 100 × (1/50) / (1/50 + 1/55) ≈ 52.4, and the one with 55 gets 100 × (1/55) / (1/50 + 1/55) ≈ 47.6.
Of the three players, the ones with 20 seconds get 100 × (1/20) / (1/20 + 1/20 + 1/25) ≈ 35.7, and the one with 25 seconds gets 100 × (1/25) / (1/20 + 1/20 + 1/25) ≈ 28.6.
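As a rough Python sketch of the reciprocal idea above (the function name `reciprocal_scores` and the `total` parameter are just illustrative choices, and times are assumed to be strictly positive):

```python
def reciprocal_scores(times, total=100.0):
    """Split `total` points in proportion to 1/time for each player."""
    weights = [1.0 / t for t in times]   # faster time -> larger weight
    weight_sum = sum(weights)
    return [total * w / weight_sum for w in weights]


print([round(s, 1) for s in reciprocal_scores([50])])          # [100.0]
print([round(s, 1) for s in reciprocal_scores([50, 55])])      # [52.4, 47.6]
print([round(s, 1) for s in reciprocal_scores([20, 20, 25])])  # [35.7, 35.7, 28.6]
```

Because every score is a share of the same sum of weights, the scores add up to 100 (up to floating point rounding) for any number of players.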
Simple observation: Let the sum of times for all players be S. A person with lower time t would have a higher value of S-t. So you can reward points proportional to S-t for each player.
Formula:
Let the times for N players (N >= 2) be a,b,c...,m,n. Total sum S = a+b+c...+m+n. Then the score for a given player would be
score = [S-(player's time)]/[(N-1)*S] * 100
You can easily see that with this formula the scores of all players always sum to 100: the numerators add up to N*S - S = (N-1)*S.
Example 1:
S = 50 + 55 = 105, N-1 = 2-1 = 1
Player 1 : 50 seconds => score = ((105-50)/[1*105])*100 = 52.38
Player 2 : 55 seconds => score = ((105-55)/[1*105])*100 = 47.62
Similarly, for your second example,
S = 20 + 20 + 25 = 65
N - 1 = 3 - 1 = 2
For Player 1, (S-t) = 65-20 = 45
Player 1's score => (45/(2*65))*100 = 34.6
Player 2 => same as Player 1
For Player 3, (S-t) = 65-25 = 40
Player 3's score => (40/(2*65))*100 = 30.8
This method needs only a single division per player, at the very end. With integer times the numerator and denominator are exact, so rounding can only creep in at that last step.
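A small Python sketch of this formula, assuming integer times. The function name `complement_scores` and the use of `fractions.Fraction` to keep the arithmetic exact are illustrative choices, not part of the answer. A lone player is treated as a special case, since N-1 would be 0.

```python
from fractions import Fraction


def complement_scores(times, total=100):
    """Score each player in proportion to S - t, where S is the sum of all times."""
    n = len(times)
    if n == 1:                      # the formula divides by N-1, so handle N=1 apart
        return [Fraction(total)]
    s = sum(times)
    return [Fraction(total * (s - t), (n - 1) * s) for t in times]


scores = complement_scores([20, 20, 25])
print([round(float(x), 1) for x in scores])  # [34.6, 34.6, 30.8]
print(sum(scores) == 100)                    # True
```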
This video covers an implementation of the min coins to make change.
https://en.wikipedia.org/wiki/Change-making_problem
The place I'm not clear on is where the interviewer goes into the details of optimization, starting from here.
https://youtu.be/HWW-jA6YjHk?t=1875
He suggests that to make the min number of coins, using denominations [25, 10, 1], we only need to use the algorithm to make change for amounts below 50 cents; for anything above that we can safely just use 25 cents. So if the amount was $100.10, we can use 25 cents till we get down to 50 cents, at which point we need to use the algorithm to compute the precise value.
This makes sense for the given list of denominations [25, 10, 1]. To get the breakpoint figure he suggests using the LCM of the denominations, which is 50 in this case.
For example
32 = 25 * 1 + 1 * 7, which is 8 coins. But with 10 cents we can do
32 = 10 * 3 + 1 * 2, which is 5 coins.
So we cannot just assume 25 cents is going to be included in the minimum number of coins calculation.
Here is my question --
Suppose we have denominations [25, 10, 5, 1]; the LCM is still 50. But there is no minimum solution for any amount over 25 cents that doesn't include the 25.
e.g.
32 = 25 * 1 + 5 * 1 + 1 * 2, which is 4 coins.
32 = 10 * 3 + 1 * 2, which is 5 coins.
So shouldn't the breakpoint be 25 cents in this case? Instead of the lcm?
Thanks for answering.
The LCM of the values provides an upper bound on the "break point": the amount below which we can no longer blithely assume that the highest-denomination coin is part of the solution. A little number theory will prove that the LCM is a boundary.
50 is the LCM of {25, 10}. For any amount >= 50, any combination including at least 5*10 can replace that element by 2*25, reducing the coin count. This argument applies to all other coins and combinations thereof. This simple demonstration does not universally apply below the LCM; there will be amounts that serve as counterexamples.
To keep the overall algorithm easy to understand and maintain, we use only the two phases: largest coin above that breakpoint, and full DP solution below -- where, for most applications, even a brute-force solution is generally efficient enough for practical purposes.
They didn't say we can't use 25 when the input is lower than the break point. They suggested that a good optimisation can be to use the highest denomination until we reduce the number to the break point (because that is guaranteed to be the least number of coins needed for that portion) and then switch to the more resource-intensive algorithm to count the rest of the needed coins.
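The two-phase idea described in these answers could be sketched in Python roughly as follows. The function names `min_coins_dp` and `min_coins_hybrid` are made up for illustration, the break point is taken to be the LCM as suggested in the video, and `math.lcm` with several arguments needs Python 3.9 or later. Only the [25, 10, 1] case was checked against the plain DP here; treat other denomination sets as untested.

```python
from math import lcm


def min_coins_dp(amount, coins):
    """Classic bottom-up DP: fewest coins that sum to `amount` (inf if impossible)."""
    best = [0] + [float("inf")] * amount
    for value in range(1, amount + 1):
        for coin in coins:
            if coin <= value and best[value - coin] + 1 < best[value]:
                best[value] = best[value - coin] + 1
    return best[amount]


def min_coins_hybrid(amount, coins):
    """Take the largest coin while amount >= LCM, then run the DP on what is left."""
    break_point = lcm(*coins)
    largest = max(coins)
    used = 0
    if amount >= break_point:
        # Number of largest coins needed to bring the amount just below the break point.
        used = (amount - break_point) // largest + 1
        amount -= used * largest
    return used + min_coins_dp(amount, coins)


print(min_coins_hybrid(32, [25, 10, 1]))      # 5
print(min_coins_hybrid(32, [25, 10, 5, 1]))   # 4
print(min_coins_hybrid(10010, [25, 10, 1]))   # 401
```

For $100.10 the DP table only ever has to cover amounts below 50 cents, instead of all 10010 entries.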
Let's assume I have 3 different baskets with a fixed capacity
And n-products which provide different value for each basket -- you can only pick whole products
Each product should be limited to a max amount (i.e. you can pick product A at most 5 times)
Every product adds 0 or more value to each basket, and the products come in all kinds of variations
Now I want a list of all possible combinations of products fitting in the baskets, ordered by accuracy (for example, basket 1 being 5% more full would be 5% less accurate)
Edit: Example
Basket A capacity 100
Basket B capacity 80
Basket C capacity 30
fake products
Product 1 (A: 5, B: 10, C: 1)
Product 2 (A: 20 B: 0, C: 0)
There might be hundreds more products
Best fit with max 5 each would be
5 times Product 1
4 times Product 2
Result
A: 105
B: 50
C: 5
Accuracy: (qty_used / max_qty) * 100 = (160 / 210) * 100 = 76.190%
Next would be another combination with less accuracy
Any pointer in the right direction is highly appreciated. Thanks
Edit:
Instead of the above method, accuracy should be expressed as an error, and the list should be in ascending order of error.
Error(Basket x) = (|max_qty(x) - qty_used(x)| / max_qty(x)) * 100
and the overall error should be the weighted average of the errors of all baskets.
Total Error = [Σ (Error(x) * max_qty(x))] / [Σ (max_qty(x))]
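To make the error definition concrete, here is a small Python sketch that applies the two formulas to the example above and then ranks combinations by brute force. The names `basket_errors`, `total_error` and `rank_combinations` are hypothetical, and trying every combination is only an assumption that works for a handful of products; with hundreds of products the number of combinations explodes and a smarter search would be needed.

```python
from itertools import product


def basket_errors(capacities, used):
    """Error(x) = |max_qty(x) - qty_used(x)| / max_qty(x) * 100 for each basket."""
    return [abs(c - u) / c * 100 for c, u in zip(capacities, used)]


def total_error(capacities, used):
    """Capacity-weighted average of the per-basket errors."""
    errors = basket_errors(capacities, used)
    return sum(e * c for e, c in zip(errors, capacities)) / sum(capacities)


def rank_combinations(capacities, products, max_each):
    """Try every quantity vector (0..max_each per product), sorted by ascending error."""
    ranked = []
    for quantities in product(range(max_each + 1), repeat=len(products)):
        used = [sum(q * p[i] for q, p in zip(quantities, products))
                for i in range(len(capacities))]
        ranked.append((total_error(capacities, used), quantities))
    return sorted(ranked)


capacities = [100, 80, 30]              # baskets A, B, C
products = [(5, 10, 1), (20, 0, 0)]     # Product 1, Product 2

print(round(total_error(capacities, [105, 50, 5]), 3))   # 28.571
print(rank_combinations(capacities, products, 5)[0][1])  # (5, 4)
```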
I have this problem and I think that I'm gonna need a mathematical solution.
I have some boxes. Of these, I only know their total weight and what is inside each one. I have to calculate each one's weight.
For example I have:
Total weight: 100
Number of boxes: 5
Number of items: 14
Stock:
Type1: 2 items
Type2: 1 item
Type3: 7 items
Type4: 4 items
Box #1:
Type1: 2 items
Type2: 1 items
Box #2:
Type4: 3 items
Box #3:
Type3: 3 items
Box #4:
Type3: 2 items
Type4: 1 item
Box #5:
Type3: 2 items
Each box can potentially have n types of items, so how can I distribute the total weight?
I cannot simply divide the total weight by the number of boxes, because then every box would get the same weight, and that is not realistic.
You have:
Four variables - the weight of each item type
One linear equation 2A + B + 7C + 4D = 100 - what you know about the total weight.
Some linear inequalities - you know that A, B, C and D are all positive.
There's an infinite number of possible solutions. For example A=B=C=2,D=20 or A=B=C=4,D=15 and everything in between.
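A quick Python check of this, using the box contents from the question. The dictionary layout and the helper name `box_weights` are just for illustration. Both sets of item weights satisfy the equation, yet they give different weights for the individual boxes, which is why the total alone cannot determine them.

```python
# Contents of each box, taken from the question.
boxes = {
    "Box #1": {"Type1": 2, "Type2": 1},
    "Box #2": {"Type4": 3},
    "Box #3": {"Type3": 3},
    "Box #4": {"Type3": 2, "Type4": 1},
    "Box #5": {"Type3": 2},
}


def box_weights(unit_weights):
    """Weight of each box for a given weight per item type."""
    return {name: sum(qty * unit_weights[item] for item, qty in contents.items())
            for name, contents in boxes.items()}


first = box_weights({"Type1": 2, "Type2": 2, "Type3": 2, "Type4": 20})
second = box_weights({"Type1": 4, "Type2": 4, "Type3": 4, "Type4": 15})

print(first, sum(first.values()))
# {'Box #1': 6, 'Box #2': 60, 'Box #3': 6, 'Box #4': 24, 'Box #5': 4} 100
print(second, sum(second.values()))
# {'Box #1': 12, 'Box #2': 45, 'Box #3': 12, 'Box #4': 23, 'Box #5': 8} 100
```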
I have some problems with defining an algorithm that will calculate a ranking number for a dentist.
Assume, we have three different dentists:
dentist number 1: has 125 patients, and out of the 125 patients the dentist has booked a time with 75 of them. 60% of them got a time.
dentist number 2: has 5 patients, and out of the 5 patients the dentist has booked a time with 4 of them. 80% of them got a time.
dentist number 3: has 25 patients, and out of the 25 patients the dentist has booked a time with 14 of them. 56% of them got a time.
If we use the formula:
patients booked time with / totalpatients * 100
it will not be the right way to calculate the ranking: the higher the percentage, the better the dentist ranks, no matter how few patients they have, and that's wrong. Doing it that way, the dentists would be ranked:
dentist number 2 would have a ranking of 1. (80% got a time).
dentist number 1 would have a ranking of 2 (60% got a time).
dentist number 3 would have a ranking of 3. (56% got a time).
But, it should be in this way:
dentist number 1 = ranking 1
dentist number 2 = ranking 2
dentist number 3 = ranking 3
I don't know how to make an algorithm that also takes the number of patients into account in the ranking calculation.
It is quite arbitrary how you define what makes a better dentist in terms of number of patients and the percentage of those that have an appointment with them.
Let's call the number of patients P, the number of those that have an appointment A, and the function determining how "good" a dentist is f. So f would be a function of P and A: f(P, A).
One component of f could indeed be what you already calculated: A/P.
Another component would have to be P, but I would think that the effect on f(P, A) of increasing P by 1 should be much higher for a low P than for a high P, so this component should not be a linear function. It would also be practical if this component had a value between 0 and 1, just like the other component.
Taking all this together, I suggest this definition of f, which will give a number between 0 and 1:
f(P,A) = 1/3 * P/(10 + P) + 2/3 * A/P
For the different dentists, this results in:
1: 1/3 * 125/135 + 2/3 * 75/125 = 0.7086419753...
2: 1/3 * 5/15 + 2/3 * 4/5 = 0.6444444444...
3: 1/3 * 25/35 + 2/3 * 14/25 = 0.6114285714...
You could play a bit with the constant factors in the formula, like increasing the term 10. Or you could change the factors 1/3 and 2/3 making sure that their sum is 1.
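If you want to experiment with those constants, a small Python sketch of the formula could look like this. The function name `dentist_score` and the parameter names `k` and `weight_volume` are made up for the example; the defaults reproduce the 10 and 1/3 used above, and the second weight is derived so the two always sum to 1.

```python
def dentist_score(patients, appointments, k=10, weight_volume=1/3):
    """f(P, A) = w * P/(k + P) + (1 - w) * A/P, a value between 0 and 1."""
    volume = patients / (k + patients)   # saturating: grows fast for small P
    rate = appointments / patients       # share of patients with an appointment
    return weight_volume * volume + (1 - weight_volume) * rate


dentists = {1: (125, 75), 2: (5, 4), 3: (25, 14)}
scores = {d: dentist_score(p, a) for d, (p, a) in dentists.items()}

for d in sorted(scores, key=scores.get, reverse=True):
    print(d, round(scores[d], 4))
# 1 0.7086
# 2 0.6444
# 3 0.6114
```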
This is just one way to do it. There are an infinity of other ways...