Consider the four percentages below, represented as float numbers:
13.626332%
47.989636%
9.596008%
28.788024%
-----------
100.000000%
I need to represent these percentages as whole numbers. If I simply use Math.round(), I end up with a total of 101%.
14 + 48 + 10 + 29 = 101
If I use parseInt(), I end up with a total of 97%.
13 + 47 + 9 + 28 = 97
What's a good algorithm to represent any number of percentages as whole numbers while still maintaining a total of 100%?
Edit: After reading some of the comments and answers, there are clearly many ways to go about solving this.
In my mind, to remain true to the numbers, the "right" result is the one that minimizes the overall error, defined by how much error rounding would introduce relative to the actual value:
value       rounded   error   decision
----------------------------------------------------
13.626332   14        2.7%    round up (14)
47.989636   48        0.0%    round up (48)
9.596008    10        4.2%    don't round up (9)
28.788024   29        0.7%    round up (29)
In case of a tie (3.33, 3.33, 3.33) an arbitrary decision can be made (e.g. 3, 4, 3).
There are many ways to do just this, provided you are not concerned about reliance on the original decimal data.
The first and perhaps most popular method would be the Largest Remainder Method, which is basically:
Rounding everything down
Getting the difference in sum and 100
Distributing the difference by adding 1 to items in decreasing order of their decimal parts
In your case, it would go like this:
13.626332%
47.989636%
9.596008%
28.788024%
If you take the integer parts, you get
13
47
9
28
which adds up to 97, and you want to add three more. Now, you look at the decimal parts, which are
.626332%
.989636%
.596008%
.788024%
and take the largest ones until the total reaches 100. So you would get:
14
48
9
29
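A minimal Python sketch of the procedure just described (the function name largest_remainder_round and its structure are my own illustration, not from a particular library):
import math

def largest_remainder_round(percents):
    floored = [math.floor(p) for p in percents]
    shortfall = 100 - sum(floored)              # how many +1s still need handing out
    # indices sorted by decimal part, largest decimal first
    by_decimal = sorted(range(len(percents)),
                        key=lambda i: percents[i] - floored[i],
                        reverse=True)
    for i in by_decimal[:shortfall]:
        floored[i] += 1
    return floored

# largest_remainder_round([13.626332, 47.989636, 9.596008, 28.788024])  ->  [14, 48, 9, 29]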
Alternatively, you can simply choose to show one decimal place instead of integer values, so the numbers would be 13.6, 48.0, 9.6 and 28.8. This greatly reduces how far the displayed total can drift from 100.
Probably the "best" way to do this (quoted since "best" is a subjective term) is to keep a running (non-integral) tally of where you are, and round that value.
Then use that along with the history to work out what value should be used. For example, using the values you gave:
Value CumulValue CumulRounded PrevBaseline Need
--------- ---------- ------------ ------------ ----
0
13.626332 13.626332 14 0 14 ( 14 - 0)
47.989636 61.615968 62 14 48 ( 62 - 14)
9.596008 71.211976 71 62 9 ( 71 - 62)
28.788024 100.000000 100 71 29 (100 - 71)
---
100
At each stage, you don't round the number itself. Instead, you round the accumulated value and work out the best integer that reaches that value from the previous baseline - that baseline is the cumulative value (rounded) of the previous row.
This works because you're not losing information at each stage but rather using the information more intelligently. The 'correct' rounded values are in the final column and you can see that they sum to 100.
You can see the difference between this and blindly rounding each value, in the third value above. While 9.596008 would normally round up to 10, the accumulated 71.211976 correctly rounds down to 71 - this means that only 9 is needed to add to the previous baseline of 62.
This also works for "problematic" sequences like three roughly-1/3 values, where one of them should be rounded up:
Value CumulValue CumulRounded PrevBaseline Need
--------- ---------- ------------ ------------ ----
0
33.333333 33.333333 33 0 33 ( 33 - 0)
33.333333 66.666666 67 33 34 ( 67 - 33)
33.333333 99.999999 100 67 33 (100 - 67)
---
100
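A minimal Python sketch of this cumulative approach (the helper name cumulative_round is illustrative):
def cumulative_round(percents):
    results = []
    cumulative = 0.0
    previous_baseline = 0
    for p in percents:
        cumulative += p
        rounded_cumulative = round(cumulative)   # note: Python rounds exact .5 ties to even
        results.append(rounded_cumulative - previous_baseline)
        previous_baseline = rounded_cumulative
    return results

# cumulative_round([13.626332, 47.989636, 9.596008, 28.788024])  ->  [14, 48, 9, 29]
# cumulative_round([33.333333, 33.333333, 33.333333])            ->  [33, 34, 33]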
Since none of the answers here seem to solve it properly, here's my semi-obfuscated version using underscorejs:
function foo(l, target) {
var off = target - _.reduce(l, function(acc, x) { return acc + Math.round(x) }, 0);
return _.chain(l).
sortBy(function(x) { return Math.round(x) - x }).
map(function(x, i) { return Math.round(x) + (off > i) - (i >= (l.length + off)) }).
value();
}
foo([13.626332, 47.989636, 9.596008, 28.788024], 100) // => [48, 29, 14, 9]
foo([16.666, 16.666, 16.666, 16.666, 16.666, 16.666], 100) // => [17, 17, 17, 17, 16, 16]
foo([33.333, 33.333, 33.333], 100) // => [34, 33, 33]
foo([33.3, 33.3, 33.3, 0.1], 100) // => [34, 33, 33, 0]
The goal of rounding is to generate the least amount of error. When you're rounding a single value, that process is simple and straightforward and most people understand it easily. When you're rounding multiple numbers at the same time, the process gets trickier - you must define how the errors are going to combine, i.e. what must be minimized.
The well-voted answer by Varun Vohra minimizes the sum of the absolute errors, and it's very simple to implement. However there are edge cases it does not handle - what should be the result of rounding 24.25, 23.25, 27.25, 25.25? One of those needs to be rounded up instead of down. You would probably just arbitrarily pick the first or last one in the list.
Perhaps it's better to use the relative error instead of the absolute error. Rounding 23.25 up to 24 changes it by 3.2% while rounding 27.25 up to 28 only changes it by 2.8%. Now there's a clear winner.
It's possible to tweak this even further. One common technique is to square each error, so that large errors count disproportionately more than small ones. I'd also use a non-linear divisor to get the relative error - it doesn't seem right that an error at 1% is 99 times more important than an error at 99%. In the code below I've used the square root.
The complete algorithm is as follows:
Sum the percentages after rounding them all down, and subtract from 100. This tells you how many of those percentages must be rounded up instead.
Generate two error scores for each percentage, one when rounded down and one when rounded up. Take the difference between the two.
Sort the error differences produced above.
For the number of percentages that need to be rounded up, take an item from the sorted list and increment the rounded down percentage by 1.
You may still have more than one combination with the same error sum, for example 33.3333333, 33.3333333, 33.3333333. This is unavoidable, and the result will be completely arbitrary. The code I give below prefers to round up the values on the left.
Putting it all together in Python looks like this.
from math import isclose, sqrt
def error_gen(actual, rounded):
divisor = sqrt(1.0 if actual < 1.0 else actual)
return abs(rounded - actual) ** 2 / divisor
def round_to_100(percents):
if not isclose(sum(percents), 100):
raise ValueError
n = len(percents)
rounded = [int(x) for x in percents]
up_count = 100 - sum(rounded)
errors = [(error_gen(percents[i], rounded[i] + 1) - error_gen(percents[i], rounded[i]), i) for i in range(n)]
rank = sorted(errors)
for i in range(up_count):
rounded[rank[i][1]] += 1
return rounded
>>> round_to_100([13.626332, 47.989636, 9.596008, 28.788024])
[14, 48, 9, 29]
>>> round_to_100([33.3333333, 33.3333333, 33.3333333])
[34, 33, 33]
>>> round_to_100([24.25, 23.25, 27.25, 25.25])
[24, 23, 28, 25]
>>> round_to_100([1.25, 2.25, 3.25, 4.25, 89.0])
[1, 2, 3, 4, 90]
As you can see with that last example, this algorithm is still capable of delivering non-intuitive results. Even though 89.0 needs no rounding whatsoever, one of the values in that list needed to be rounded up; the lowest relative error results from rounding up that large value rather than the much smaller alternatives.
This answer originally advocated going through every possible combination of round up/round down, but as pointed out in the comments a simpler method works better. The algorithm and code reflect that simplification.
I wrote a C# rounding helper; the algorithm is the same as in Varun Vohra's answer. Hope it helps.
public static List<decimal> GetPerfectRounding(List<decimal> original,
decimal forceSum, int decimals)
{
var rounded = original.Select(x => Math.Round(x, decimals)).ToList();
Debug.Assert(Math.Round(forceSum, decimals) == forceSum);
var delta = forceSum - rounded.Sum();
if (delta == 0) return rounded;
var deltaUnit = Convert.ToDecimal(Math.Pow(0.1, decimals)) * Math.Sign(delta);
List<int> applyDeltaSequence;
if (delta < 0)
{
applyDeltaSequence = original
.Zip(Enumerable.Range(0, int.MaxValue), (x, index) => new { x, index })
.OrderBy(a => original[a.index] - rounded[a.index])
.ThenByDescending(a => a.index)
.Select(a => a.index).ToList();
}
else
{
applyDeltaSequence = original
.Zip(Enumerable.Range(0, int.MaxValue), (x, index) => new { x, index })
.OrderByDescending(a => original[a.index] - rounded[a.index])
.Select(a => a.index).ToList();
}
    Enumerable.Repeat(applyDeltaSequence, int.MaxValue)
        .SelectMany(x => x)
        .Take(Convert.ToInt32(delta/deltaUnit))
        .ToList()
        .ForEach(index => rounded[index] += deltaUnit);
return rounded;
}
It passes the following unit tests:
[TestMethod]
public void TestPerfectRounding()
{
CollectionAssert.AreEqual(Utils.GetPerfectRounding(
new List<decimal> {3.333m, 3.334m, 3.333m}, 10, 2),
new List<decimal> {3.33m, 3.34m, 3.33m});
CollectionAssert.AreEqual(Utils.GetPerfectRounding(
new List<decimal> {3.33m, 3.34m, 3.33m}, 10, 1),
new List<decimal> {3.3m, 3.4m, 3.3m});
CollectionAssert.AreEqual(Utils.GetPerfectRounding(
new List<decimal> {3.333m, 3.334m, 3.333m}, 10, 1),
new List<decimal> {3.3m, 3.4m, 3.3m});
CollectionAssert.AreEqual(Utils.GetPerfectRounding(
new List<decimal> { 13.626332m, 47.989636m, 9.596008m, 28.788024m }, 100, 0),
new List<decimal> {14, 48, 9, 29});
CollectionAssert.AreEqual(Utils.GetPerfectRounding(
new List<decimal> { 16.666m, 16.666m, 16.666m, 16.666m, 16.666m, 16.666m }, 100, 0),
new List<decimal> { 17, 17, 17, 17, 16, 16 });
CollectionAssert.AreEqual(Utils.GetPerfectRounding(
new List<decimal> { 33.333m, 33.333m, 33.333m }, 100, 0),
new List<decimal> { 34, 33, 33 });
CollectionAssert.AreEqual(Utils.GetPerfectRounding(
new List<decimal> { 33.3m, 33.3m, 33.3m, 0.1m }, 100, 0),
new List<decimal> { 34, 33, 33, 0 });
}
DO NOT sum the rounded numbers. You're going to have inaccurate results. The total could be off significantly depending on the number of terms and the distribution of fractional parts.
Display the rounded numbers but sum the actual values. Depending on how you're presenting the numbers, the actual way to do that would vary. That way you get
14
48
10
29
__
100
Any way you go you're going to have discrepancy. There's no way in your example to show numbers that add up to 100 without "rounding" one value the wrong way (least error would be changing 9.596 to 9)
EDIT
You need to choose between one of the following:
Accuracy of the items
Accuracy of the sum (if you're summing rounded values)
Consistency between the rounded items and the rounded sum
Most of the time when dealing with percentages #3 is the best option because it's more obvious when the total equals 101% than when the individual items don't total to 100, and you keep the individual items accurate. "Rounding" 9.596 to 9 is inaccurate in my opinion.
To explain this I sometimes add a footnote that explains that the individual values are rounded and may not total 100% - anyone that understands rounding should be able to understand that explanation.
You could try keeping track of your error due to rounding, and then rounding against the grain if the accumulated error is greater than the fractional portion of the current number.
13.62 -> 14 (+.38)
47.98 -> 48 (+.02 (+.40 total))
9.59 -> 10 (+.41 (+.81 total))
28.78 -> 28 (round down because .81 > .78)
------------
100
Not sure if this would work in general, but it seems to work similarly if the order is reversed:
28.78 -> 29 (+.22)
9.59 -> 9 (-.37; rounded down because .59 > .22)
47.98 -> 48 (-.35)
13.62 -> 14 (+.03)
------------
100
I'm sure there are edge cases where this might break down, but any approach is going to be at least somewhat arbitrary since you're basically modifying your input data.
I'm not sure what level of accuracy you need, but what I would do is simply add 1 to the first n numbers, n being the ceiling of the total sum of the decimal parts. In this case that is 3, so I would add 1 to the first 3 items and floor the rest. Of course this is not super accurate; some numbers might be rounded up or down when they shouldn't be, but it works okay and will always result in 100%.
So [ 13.626332, 47.989636, 9.596008, 28.788024 ] would be [14, 48, 10, 28] because Math.ceil(.626332+.989636+.596008+.788024) == 3
function evenRound( arr ) {
var decimal = -~arr.map(function( a ){ return a % 1 })
.reduce(function( a,b ){ return a + b }); // Ceil of total sum of decimals
for ( var i = 0; i < decimal; ++i ) {
    arr[ i ] = ++arr[ i ]; // compensate error by adding 1 to the first n items
}
return arr.map(function( a ){ return ~~a }); // floor all other numbers
}
var nums = evenRound( [ 13.626332, 47.989636, 9.596008, 28.788024 ] );
var total = nums.reduce(function( a,b ){ return a + b }); //=> 100
You can always inform users that the numbers are rounded and may not be super-accurate...
I once wrote an unround tool, to find the minimal perturbation to a set of numbers to match a goal. It was a different problem, but one could in theory use a similar idea here. In this case, we have a set of choices.
Thus for the first element, we can either round it up to 14, or down to 13. The cost (in a binary integer programming sense) of doing so is less for the round up than the round down, because the round down requires we move that value a larger distance. Similarly, we can round each number up or down, so there are a total of 16 choices we must choose from.
13.626332
47.989636
9.596008
+ 28.788024
-----------
100.000000
I'd normally solve the general problem in MATLAB, here using bintprog, a binary integer programming tool, but there are only a few choices to be tested, so it is easy enough with simple loops to test out each of the 16 alternatives. For example, suppose we were to round this set as:
Original Rounded Absolute error
13.626 13 0.62633
47.99 48 0.01036
9.596 10 0.40399
+ 28.788 29 0.21198
---------------------------------------
100.000 100 1.25266
The total absolute error made is 1.25266. It can be reduced slightly by the following alternative rounding:
Original Rounded Absolute error
13.626 14 0.37367
47.99 48 0.01036
9.596 9 0.59601
+ 28.788 29 0.21198
---------------------------------------
100.000 100 1.19202
In fact, this will be the optimal solution in terms of the absolute error. Of course, if there were 20 terms, the search space will be of size 2^20 = 1048576. For 30 or 40 terms, that space will be of significant size. In that case, you would need to use a tool that can efficiently search the space, perhaps using a branch and bound scheme.
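For a handful of terms, the exhaustive search is easy to express directly. Below is a rough Python sketch of that brute force (my own illustration, not the MATLAB/bintprog code referred to above); it minimizes the total absolute error among combinations that sum to 100:
import itertools
import math

def best_rounding(percents):
    best = None
    for ups in itertools.product([0, 1], repeat=len(percents)):
        candidate = [math.floor(p) + up for p, up in zip(percents, ups)]
        if sum(candidate) != 100:
            continue                                   # keep only combinations that sum to 100
        error = sum(abs(c - p) for c, p in zip(candidate, percents))
        if best is None or error < best[0]:
            best = (error, candidate)
    return best[1] if best else None

# best_rounding([13.626332, 47.989636, 9.596008, 28.788024])  ->  [14, 48, 9, 29]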
I think the following will achieve what you are after
function func( orig, target ) {
var i = orig.length, j = 0, total = 0, change, newVals = [], next, factor1, factor2, len = orig.length, marginOfErrors = [];
// map original values to new array
while( i-- ) {
total += newVals[i] = Math.round( orig[i] );
}
change = total < target ? 1 : -1;
while( total !== target ) {
// Iterate through values and select the one that once changed will introduce
// the least margin of error in terms of itself. e.g. Incrementing 10 by 1
// would mean an error of 10% in relation to the value itself.
for( i = 0; i < len; i++ ) {
next = i === len - 1 ? 0 : i + 1;
factor2 = errorFactor( orig[next], newVals[next] + change );
factor1 = errorFactor( orig[i], newVals[i] + change );
if( factor1 > factor2 ) {
j = next;
}
}
newVals[j] += change;
total += change;
}
for( i = 0; i < len; i++ ) { marginOfErrors[i] = newVals[i] && Math.abs( orig[i] - newVals[i] ) / orig[i]; }
// Math.round() causes some problems as it is difficult to know at the beginning
// whether numbers should have been rounded up or down to reduce total margin of error.
// This section of code increments and decrements values by 1 to find the number
// combination with least margin of error.
for( i = 0; i < len; i++ ) {
for( j = 0; j < len; j++ ) {
if( j === i ) continue;
var roundUpFactor = errorFactor( orig[i], newVals[i] + 1) + errorFactor( orig[j], newVals[j] - 1 );
var roundDownFactor = errorFactor( orig[i], newVals[i] - 1) + errorFactor( orig[j], newVals[j] + 1 );
var sumMargin = marginOfErrors[i] + marginOfErrors[j];
if( roundUpFactor < sumMargin) {
newVals[i] = newVals[i] + 1;
newVals[j] = newVals[j] - 1;
marginOfErrors[i] = newVals[i] && Math.abs( orig[i] - newVals[i] ) / orig[i];
marginOfErrors[j] = newVals[j] && Math.abs( orig[j] - newVals[j] ) / orig[j];
}
if( roundDownFactor < sumMargin ) {
newVals[i] = newVals[i] - 1;
newVals[j] = newVals[j] + 1;
marginOfErrors[i] = newVals[i] && Math.abs( orig[i] - newVals[i] ) / orig[i];
marginOfErrors[j] = newVals[j] && Math.abs( orig[j] - newVals[j] ) / orig[j];
}
}
}
function errorFactor( oldNum, newNum ) {
return Math.abs( oldNum - newNum ) / oldNum;
}
return newVals;
}
func([16.666, 16.666, 16.666, 16.666, 16.666, 16.666], 100); // => [16, 16, 17, 17, 17, 17]
func([33.333, 33.333, 33.333], 100); // => [34, 33, 33]
func([33.3, 33.3, 33.3, 0.1], 100); // => [34, 33, 33, 0]
func([13.25, 47.25, 11.25, 28.25], 100 ); // => [13, 48, 11, 28]
func( [25.5, 25.5, 25.5, 23.5], 100 ); // => [25, 25, 26, 24]
One last thing, I ran the function using the numbers originally given in the question to compare to the desired output
func([13.626332, 47.989636, 9.596008, 28.788024], 100); // => [48, 29, 13, 10]
This was different to what the question wanted => [ 48, 29, 14, 9]. I couldn't understand this until I looked at the total margin of error
-------------------------------------------------
| original | question | % diff | mine | % diff |
-------------------------------------------------
| 13.626332 | 14 | 2.74% | 13 | 4.5% |
| 47.989636 | 48 | 0.02% | 48 | 0.02% |
| 9.596008 | 9 | 6.2% | 10 | 4.2% |
| 28.788024 | 29 | 0.7% | 29 | 0.7% |
-------------------------------------------------
| Totals | 100 | 9.66% | 100 | 9.43% |
-------------------------------------------------
Essentially, the result from my function actually introduces the least amount of error.
Fiddle here
Note: the selected answer changes the array order, which is not always desirable. Here I provide several different variations that achieve the same result while keeping the array in order.
Discussion
Given [98.88, .56, .56], how do you want to round it? You have four options:
1- Round things up and subtract what was added from the rest of the numbers, so the result becomes [98, 1, 1].
This could be a good answer, but what if we have [97.5, .5, .5, .5, .5, .5]? Then you would need to round it to [95, 1, 1, 1, 1, 1].
Do you see how it goes? If you add more zero-like numbers, you lose more value from the rest of your numbers. This can be very troublesome when you have a big array of zero-like numbers like [40, .5, .5, ..., .5]; when you round this up, you could end up with an array of ones: [1, 1, ..., 1].
So rounding up isn't a good option.
2- Round the numbers down. So [98.88, .56, .56] becomes [98, 0, 0], and you are then 2 short of 100. Ignore anything that is already 0, then add the difference to the biggest numbers, so the bigger numbers get more.
3- Same as the previous one: round the numbers down, but sort descending by the decimal parts and distribute the difference based on the decimals, so the biggest decimals get the difference.
4- Round, but carry what you added (or removed) over to the next number. Like a wave, the adjustment is pushed toward the end of the array, so [98.88, .56, .56] becomes [99, 0, 1].
None of these is ideal, so be mindful that your data is going to lose its shape.
Here I provide code for cases 2 and 3 (case No. 1 is not practical when you have a lot of zero-like numbers). It's modern JS (with TypeScript type annotations) and doesn't need any library.
2nd case
const v1 = [13.626332, 47.989636, 9.596008, 28.788024];// => [ 14, 48, 9, 29 ]
const v2 = [16.666, 16.666, 16.666, 16.666, 16.666, 16.666] // => [ 17, 17, 17, 17, 16, 16 ]
const v3 = [33.333, 33.333, 33.333] // => [ 34, 33, 33 ]
const v4 = [33.3, 33.3, 33.3, 0.1] // => [ 34, 33, 33, 0 ]
const v5 = [98.88, .56, .56] // =>[ 100, 0, 0 ]
const v6 = [97.5, .5, .5, .5, .5, .5] // => [ 100, 0, 0, 0, 0, 0 ]
const normalizePercentageByNumber = (input) => {
const rounded: number[] = input.map(x => Math.floor(x));
const afterRoundSum = rounded.reduce((pre, curr) => pre + curr, 0);
const countMutableItems = rounded.filter(x => x >=1).length;
const errorRate = 100 - afterRoundSum;
const deductPortion = Math.ceil(errorRate / countMutableItems);
const biggest = [...rounded].sort((a, b) => b - a).slice(0, Math.min(Math.abs(errorRate), countMutableItems));
const result = rounded.map(x => {
const indexOfX = biggest.indexOf(x);
if (indexOfX >= 0) {
x += deductPortion;
console.log(biggest)
biggest.splice(indexOfX, 1);
return x;
}
return x;
});
return result;
}
3rd case
const normalizePercentageByDecimal = (input: number[]) => {
const rounded= input.map((x, i) => ({number: Math.floor(x), decimal: x%1, index: i }));
const decimalSorted= [...rounded].sort((a,b)=> b.decimal-a.decimal);
const sum = rounded.reduce((pre, curr)=> pre + curr.number, 0) ;
const error= 100-sum;
for (let i = 0; i < error; i++) {
const element = decimalSorted[i];
element.number++;
}
const result= [...decimalSorted].sort((a,b)=> a.index-b.index);
return result.map(x=> x.number);
}
4th case
You just need to calculate how much extra was added to (or removed from) each number when it is rounded, and add or subtract that amount again on the next item.
const v1 = [13.626332, 47.989636, 9.596008, 28.788024];// => [14, 48, 10, 28 ]
const v2 = [16.666, 16.666, 16.666, 16.666, 16.666, 16.666] // => [17, 16, 17, 16, 17, 17]
const v3 = [33.333, 33.333, 33.333] // => [33, 34, 33]
const v4 = [33.3, 33.3, 33.3, 0.1] // => [33, 34, 33, 0]
const normalizePercentageByWave= v4.reduce((pre, curr, i, arr) => {
let number = Math.round(curr + pre.decimal);
let total = pre.total + number;
const decimal = curr - number;
if (i == arr.length - 1 && total < 100) {
const diff = 100 - total;
total += diff;
number += diff;
}
return { total, numbers: [...pre.numbers, number], decimal };
}, { total: 0, numbers: [], decimal: 0 });
If you have just two options you are good to use Math.round(). The only problematic pairs of values are the X.5 ones (e.g. 37.5 and 62.5): it will round both values up and you will end up with 101%, as you can try here:
https://jsfiddle.net/f8np1t0k/2/
Since you always need to show 100%, you simply remove one percentage point from one of them, for example from the first one
const correctedARounded = Number.isInteger(aRounded-0.5) ? a - 1 : a
Or you can favor the option with more % votes.
The error of 1% diff happens 114 times for 10k cases of divisions between pairs of 1-100 values.
My JS implementation for the well-voted answer by Varun Vohra
const set1 = [13.626332, 47.989636, 9.596008, 28.788024];
// const set2 = [24.25, 23.25, 27.25, 25.25];
const values = set1;
console.log('Total: ', values.reduce((accum, each) => accum + each));
console.log('Incorrectly Rounded: ',
values.reduce((accum, each) => accum + Math.round(each), 0));
const adjustValues = (values) => {
// 1. Separate integer and decimal part
// 2. Store both in a new array of objects sorted by decimal part descending
// 3. Add in original position to "put back" at the end
const flooredAndSortedByDecimal = values.map((value, position) => (
{
floored: Math.floor(value),
decimal: value - Number.parseInt(value),
position
}
)).sort(({decimal}, {decimal: otherDecimal}) => otherDecimal - decimal);
const roundedTotal = values.reduce((total, value) => total + Math.floor(value), 0);
let availableForDistribution = 100 - roundedTotal;
// Add 1 to each value from what's available
const adjustedValues = flooredAndSortedByDecimal.map(value => {
const { floored, ...rest } = value;
let finalPercentage = floored;
if(availableForDistribution > 0){
finalPercentage = floored + 1;
availableForDistribution--;
}
return {
finalPercentage,
...rest
}
});
// Put back and return the new values
return adjustedValues
.sort(({position}, {position: otherPosition}) => position - otherPosition)
.map(({finalPercentage}) => finalPercentage);
}
const finalPercentages = adjustValues(values);
console.log({finalPercentages})
// { finalPercentage: [14, 48, 9, 29]}
Or something like this for brevity, where you just accumulate the error...
const p = [13.626332, 47.989636, 9.596008, 28.788024];
const round = (a, e = 0) => a.map(x => (r = Math.round(x + e), e += x - r, r));
console.log(round(p));
Result: [14, 48, 9, 29]
If you are rounding, there is no good way to get the total to come out the same in all cases.
You can take the decimal parts of the N percentages you have (in the example you gave, N is 4).
Add the decimal parts. In your example, the total of the fractional parts is 3.
Ceil the 3 numbers with the highest fractions and floor the rest.
If you really must round them, there are already very good suggestions here (largest remainder, least relative error, and so on).
There is also already one good reason not to round (you'll get at least one number that "looks better" but is "wrong"), along with a way to handle that (warn your readers), and that is what I do.
Let me expand on the "wrong" number part.
Suppose you have three events/entities/... with percentages that you approximate as:
DAY 1
who | real | app
----|-------|------
A | 33.34 | 34
B | 33.33 | 33
C | 33.33 | 33
Later on the values change slightly, to
DAY 2
who | real | app
----|-------|------
A | 33.35 | 33
B | 33.36 | 34
C | 33.29 | 33
The first table has the already mentioned problem of having a "wrong" number: 33.34 is closer to 33 than to 34.
But now you have a bigger error. Comparing day 2 to day 1, the real percentage value for A increased by 0.01%, but the approximation shows a decrease of 1%.
That is a qualitative error, probably quite a bit worse than the initial quantitative error.
One could devise an approximation for the whole set, but you may have to publish data on day one, so you won't know about day two. So, unless you really, really must approximate, you'd probably better not.
Here's a simpler Python implementation of Varun Vohra's answer:
import math
import itertools
import operator

def apportion_pcts(pcts, total):
proportions = [total * (pct / 100) for pct in pcts]
apportions = [math.floor(p) for p in proportions]
remainder = total - sum(apportions)
remainders = [(i, p - math.floor(p)) for (i, p) in enumerate(proportions)]
remainders.sort(key=operator.itemgetter(1), reverse=True)
for (i, _) in itertools.cycle(remainders):
if remainder == 0:
break
else:
apportions[i] += 1
remainder -= 1
return apportions
You need math, itertools, operator.
Check whether this is valid or not; as far as my test cases go, I was able to get it working.
Let's say the number is k.
Sort the percentages in descending order.
Iterate over each percentage in descending order.
Calculate the percentage of k for the first percentage and take Math.ceil of the output.
Next, k = k - 1.
Iterate until all percentages are consumed.
I have implemented the method from Varun Vohra's answer here for both lists and dicts.
import math
import numbers
import operator
import itertools
def round_list_percentages(number_list):
"""
Takes a list where all values are numbers that add up to 100,
and rounds them off to integers while still retaining a sum of 100.
A total value sum that rounds to 100.00 with two decimals is acceptable.
This ensures that all input where the values are calculated with [fraction]/[total]
and the sum of all fractions equal the total, should pass.
"""
# Check input
if not all(isinstance(i, numbers.Number) for i in number_list):
raise ValueError('All values of the list must be a number')
# Generate a key for each value
key_generator = itertools.count()
value_dict = {next(key_generator): value for value in number_list}
return round_dictionary_percentages(value_dict).values()
def round_dictionary_percentages(dictionary):
"""
Takes a dictionary where all values are numbers that add up to 100,
and rounds them off to integers while still retaining a sum of 100.
A total value sum that rounds to 100.00 with two decimals is acceptable.
This ensures that all input where the values are calculated with [fraction]/[total]
and the sum of all fractions equal the total, should pass.
"""
# Check input
# Only allow numbers
if not all(isinstance(i, numbers.Number) for i in dictionary.values()):
raise ValueError('All values of the dictionary must be a number')
# Make sure the sum is close enough to 100
# Round value_sum to 2 decimals to avoid floating point representation errors
value_sum = round(sum(dictionary.values()), 2)
if not value_sum == 100:
raise ValueError('The sum of the values must be 100')
# Initial floored results
# Does not add up to 100, so we need to add something
result = {key: int(math.floor(value)) for key, value in dictionary.items()}
# Remainders for each key
result_remainders = {key: value % 1 for key, value in dictionary.items()}
# Keys sorted by remainder (biggest first)
sorted_keys = [key for key, value in sorted(result_remainders.items(), key=operator.itemgetter(1), reverse=True)]
# Otherwise add missing values up to 100
# One cycle is enough, since flooring removes a max value of < 1 per item,
# i.e. this loop should always break before going through the whole list
for key in sorted_keys:
if sum(result.values()) == 100:
break
result[key] += 1
# Return
return result
For those having the percentages in a pandas Series, here is my implementation of the Largest Remainder Method (as in Varun Vohra's answer), where you can even select the decimals to which you want to round.
import numpy as np
def largestRemainderMethod(pd_series, decimals=1):
    floor_series = ((10**decimals * pd_series).astype(int)).apply(np.floor)
    diff = 100 * (10**decimals) - floor_series.sum().astype(int)
series_decimals = pd_series - floor_series / (10**decimals)
series_sorted_by_decimals = series_decimals.sort_values(ascending=False)
for i in range(0, len(series_sorted_by_decimals)):
if i < diff:
series_sorted_by_decimals.iloc[[i]] = 1
else:
series_sorted_by_decimals.iloc[[i]] = 0
out_series = ((floor_series + series_sorted_by_decimals) / (10**decimals)).sort_values(ascending=False)
return out_series
Here's a Ruby gem that implements the Largest Remainder method:
https://github.com/jethroo/lare_round
To use:
a = Array.new(3){ BigDecimal('0.3334') }
# => [#<BigDecimal:887b6c8,'0.3334E0',9(18)>, #<BigDecimal:887b600,'0.3334E0',9(18)>, #<BigDecimal:887b4c0,'0.3334E0',9(18)>]
a = LareRound.round(a,2)
# => [#<BigDecimal:8867330,'0.34E0',9(36)>, #<BigDecimal:8867290,'0.33E0',9(36)>, #<BigDecimal:88671f0,'0.33E0',9(36)>]
a.reduce(:+).to_f
# => 1.0
I wrote a function in Javascript that takes an array of percentages and outputs an array with rounded percentages using the Largest Remainder Method. It doesn't use any libraries.
Input: [21.6, 46.7, 31, 0.5, 0.2]
Output: [22, 47, 31, 0, 0]
const values = [21.6, 46.7, 31, 0.5, 0.2];
console.log(roundPercentages(values));
function roundPercentages(values) {
const flooredValues = values.map(e => Math.floor(e));
const remainders = values.map(e => e - Math.floor(e));
const totalRemainder = 100 - flooredValues.reduce((a, b) => a + b);
    // Pair each remainder with its index so duplicate remainders are each counted once
    remainders
        .map((remainder, index) => ({ remainder, index }))
        // Sort from highest to lowest remainder
        .sort((a, b) => b.remainder - a.remainder)
        // Get the n largest remainder values, where n = totalRemainder
        .slice(0, totalRemainder)
        // Add 1 to the floored percentages with the highest remainders (divide the total remainder)
        .forEach(({ index }) => flooredValues[index] += 1);
return flooredValues;
}
This is a case for banker's rounding, aka 'round half-even'. It is supported by BigDecimal. Its purpose is to ensure that rounding balances out, i.e. doesn't favour either the bank or the customer.
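For illustration, round-half-even is also the default behaviour of Python's built-in round() and is available in the decimal module; a minimal sketch (the Java BigDecimal equivalent would use RoundingMode.HALF_EVEN):
# Round-half-even ("banker's rounding") illustrated in Python
from decimal import Decimal, ROUND_HALF_EVEN

print(round(0.5), round(1.5), round(2.5))   # -> 0 2 2 : ties go to the even neighbour
print(Decimal("2.345").quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN))   # -> 2.34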
An interview question:
Given a function f(x) that 1/4 times returns 0, 3/4 times returns 1.
Write a function g(x) using f(x) that 1/2 times returns 0, 1/2 times returns 1.
My implementation is:
function g(x) = {
if (f(x) == 0){ // 1/4
var s = f(x)
if( s == 1) {// 3/4 * 1/4
return s // 3/16
} else {
g(x)
}
} else { // 3/4
var k = f(x)
if( k == 0) {// 1/4 * 3/4
return k // 3/16
} else {
g(x)
}
}
}
Am I right? What's your solution?(you can use any language)
If you call f(x) twice in a row, the following outcomes are possible (assuming that
successive calls to f(x) are independent, identically distributed trials):
00 (probability 1/4 * 1/4)
01 (probability 1/4 * 3/4)
10 (probability 3/4 * 1/4)
11 (probability 3/4 * 3/4)
01 and 10 occur with equal probability. So iterate until you get one of those
cases, then return 0 or 1 appropriately:
do
a=f(x); b=f(x);
while (a == b);
return a;
It might be tempting to call f(x) only once per iteration and keep track of the two
most recent values, but that won't work. Suppose the very first roll is 1,
with probability 3/4. You'd loop until the first 0, then return 1 (with probability 3/4).
The problem with your algorithm is that it repeats itself with high probability. My code:
function g(x) = {
var s = f(x) + f(x) + f(x);
// s = 0, probability: 1/64
// s = 1, probability: 9/64
// s = 2, probability: 27/64
// s = 3, probability: 27/64
if (s == 2) return 0;
if (s == 3) return 1;
return g(x); // probability to go into recursion = 10/64, with only 1 additional f(x) calculation
}
I've measured the average number of times f(x) was calculated for your algorithm and for mine. For yours, f(x) was calculated around 5.3 times per g(x) call. With my algorithm this number is reduced to around 3.5. The same is true for the other answers so far, since they are actually the same algorithm, as you said.
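For what it's worth, a rough simulation along these lines can reproduce those averages. This is my own sketch (the names f, g_pairs and g_triples are illustrative), with g_pairs standing in for the accept-01/10 formulation that is equivalent to your algorithm:
import random

calls = 0

def f():
    global calls
    calls += 1
    return 0 if random.random() < 0.25 else 1   # returns 0 with probability 1/4

def g_pairs():
    # accept only 01 / 10 pairs (equivalent to the question's approach)
    while True:
        a, b = f(), f()
        if a != b:
            return a

def g_triples():
    # this answer: sum three calls, accept sums of 2 or 3
    while True:
        s = f() + f() + f()
        if s == 2:
            return 0
        if s == 3:
            return 1

for g in (g_pairs, g_triples):
    calls = 0
    trials = 100000
    for _ in range(trials):
        g()
    print(g.__name__, calls / trials)   # roughly 5.33 and 3.56 respectively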
P.S.: your definition doesn't mention 'random' at the moment, but probably it is assumed. See my other answer.
Your solution is correct, if somewhat inefficient and with more duplicated logic. Here is a Python implementation of the same algorithm in a cleaner form.
def g ():
while True:
a = f()
if a != f():
return a
If f() is expensive you'd want to get more sophisticated with using the match/mismatch information to try to return with fewer calls to it. Here is the most efficient possible solution.
def g ():
lower = 0.0
upper = 1.0
while True:
if 0.5 < lower:
return 1
elif upper < 0.5:
return 0
else:
middle = 0.25 * lower + 0.75 * upper
if 0 == f():
lower = middle
else:
upper = middle
This takes about 2.6 calls to f() on average.
The way that it works is this. We're trying to pick a random number from 0 to 1, but we happen to stop as soon as we know whether the number is 0 or 1. We start knowing that the number is in the interval (0, 1). 3/4 of the numbers are in the bottom 3/4 of the interval, and 1/4 are in the top 1/4 of the interval. We decide which based on a call to f(x). This means that we are now in a smaller interval.
If we wash, rinse, and repeat enough times we can determine our finite number as precisely as possible, and will have an absolutely equal probability of winding up in any region of the original interval. In particular we have an even probability of winding up bigger than or less than 0.5.
If you wanted you could repeat the idea to generate an endless stream of bits one by one. This is, in fact, provably the most efficient way of generating such a stream, and is the source of the idea of entropy in information theory.
Given a function f(x) that 1/4 times returns 0, 3/4 times returns 1
Taking this statement literally, f(x), if called four times, will always return zero once and 1 three times. This is different from saying f(x) is a probabilistic function and the 0 to 1 ratio will approach 1 to 3 (1/4 vs 3/4) over many iterations. If the first interpretation is valid, then the only valid function for f(x) that will meet the criteria regardless of where in the sequence you start from is the sequence 0111 repeating (or 1011 or 1101 or 1110, which are the same sequence from a different starting point). Given that constraint,
g()= (f() == f())
should suffice.
As already mentioned your definition is not that good regarding probability. Usually it means that not only probability is good but distribution also. Otherwise you can simply write g(x) which will return 1,0,1,0,1,0,1,0 - it will return them 50/50, but numbers won't be random.
Another cheating approach might be:
var invert = false;
function g(x) {
invert = !invert;
if (invert) return 1-f(x);
return f(x);
}
This solution will be better than all others since it calls f(x) only one time. But the results will not be very random.
A refinement of the same approach used in btilly's answer, achieving an average ~1.85 calls to f() per g() result (further refinement documented below achieves ~1.75, btilly's ~2.6, Jim Lewis's accepted answer ~5.33). Code appears lower in the answer.
Basically, I generate random integers in the range 0 to 3 with even probability: the caller can then test bit 0 for the first 50/50 value, and bit 1 for a second. Reason: the f() probabilities of 1/4 and 3/4 map onto quarters much more cleanly than halves.
Description of algorithm
btilly explained the algorithm, but I'll do so in my own way too...
The algorithm basically generates a random real number x between 0 and 1, then returns a result depending on which "result bucket" that number falls in:
result bucket result
x < 0.25 0
0.25 <= x < 0.5 1
0.5 <= x < 0.75 2
0.75 <= x 3
But, generating a random real number given only f() is difficult. We have to start with the knowledge that our x value should be in the range 0..1 - which we'll call our initial "possible x" space. We then hone in on an actual value for x:
each time we call f():
if f() returns 0 (probability 1 in 4), we consider x to be in the lower quarter of the "possible x" space, and eliminate the upper three quarters from that space
if f() returns 1 (probability 3 in 4), we consider x to be in the upper three-quarters of the "possible x" space, and eliminate the lower quarter from that space
when the "possible x" space is completely contained by a single result bucket, that means we've narrowed x down to the point where we know which result value it should map to and have no need to get a more specific value for x.
It may or may not help to consider this diagram :-):
"result bucket" cut-offs 0,.25,.5,.75,1
0=========0.25=========0.5==========0.75=========1 "possible x" 0..1
| | . . | f() chooses x < vs >= 0.25
| result 0 |------0.4375-------------+----------| "possible x" .25..1
| | result 1| . . | f() chooses x < vs >= 0.4375
| | | . ~0.58 . | "possible x" .4375..1
| | | . | . | f() chooses < vs >= ~.58
| | ||. | | . | 4 distinct "possible x" ranges
Code
int g() // return 0, 1, 2, or 3
{
if (f() == 0) return 0;
if (f() == 0) return 1;
double low = 0.25 + 0.25 * (1.0 - 0.25);
double high = 1.0;
while (true)
{
double cutoff = low + 0.25 * (high - low);
if (f() == 0)
high = cutoff;
else
low = cutoff;
if (high < 0.50) return 1;
if (low >= 0.75) return 3;
if (low >= 0.50 && high < 0.75) return 2;
}
}
If helpful, an intermediary to feed out 50/50 results one at a time:
int h()
{
static int i;
if (!i)
{
int x = g();
i = x | 4;
return x & 1;
}
else
{
int x = i & 2;
i = 0;
return x ? 1 : 0;
}
}
NOTE: This can be further tweaked by having the algorithm switch from considering an f()==0 result to hone in on the lower quarter, to having it hone in on the upper quarter instead, based on which on average resolves to a result bucket more quickly. Superficially, this seemed useful on the third call to f() when an upper-quarter result would indicate an immediate result of 3, while a lower-quarter result still spans probability point 0.5 and hence results 1 and 2. When I tried it, the results were actually worse. A more complex tuning was needed to see actual benefits, and I ended up writing a brute-force comparison of lower vs upper cutoff for second through eleventh calls to g(). The best result I found was an average of ~1.75, resulting from the 1st, 2nd, 5th and 8th calls to g() seeking low (i.e. setting low = cutoff).
Here is a solution based on the central limit theorem, originally due to a friend of mine:
/*
Given a function f(x) that 1/4 times returns 0, 3/4 times returns 1. Write a function g(x) using f(x) that 1/2 times returns 0, 1/2 times returns 1.
*/
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <cstdio>
using namespace std;
int f() {
if (rand() % 4 == 0) return 0;
return 1;
}
int main() {
srand(time(0));
int cc = 0;
for (int k = 0; k < 1000; k++) { //number of different runs
int c = 0;
int limit = 10000; //the bigger the limit, the more we will approach %50 percent
for (int i=0; i<limit; ++i) c+= f();
cc += c < limit*0.75 ? 0 : 1; // c will be 0, with probability %50
}
printf("%d\n",cc); //cc is gonna be around 500
return 0;
}
Since each return of f() represents a 3/4 chance of TRUE, with some algebra we can just properly balance the odds. What we want is another function x() which returns a balancing probability of TRUE, so that
function g() {
return f() && x();
}
returns true 50% of the time.
So let's find the probability of x (p(x)), given p(f) and our desired total probability (1/2):
p(f) * p(x) = 1/2
3/4 * p(x) = 1/2
p(x) = (1/2) / 3/4
p(x) = 2/3
So x() should return TRUE with a probability of 2/3, since 2/3 * 3/4 = 6/12 = 1/2;
Thus the following should work for g():
function g() {
return f() && (rand() < 2/3);
}
Assuming
P(f[x] == 0) = 1/4
P(f[x] == 1) = 3/4
and requiring a function g[x] with the following assumptions
P(g[x] == 0) = 1/2
P(g[x] == 1) = 1/2
I believe the following definition of g[x] is sufficient (Mathematica)
g[x_] := If[f[x] + f[x + 1] == 1, 1, 0]
or, alternatively in C
int g(int x)
{
return f(x) + f(x+1) == 1
? 1
: 0;
}
This is based on the idea that invocations of {f[x], f[x+1]} would produce the following outcomes
{
{0, 0},
{0, 1},
{1, 0},
{1, 1}
}
Summing each of the outcomes we have
{
0,
1,
1,
2
}
where a sum of 1 represents 1/2 of the possible sum outcomes, with any other sum making up the other 1/2.
Edit.
As bdk says - {0,0} is less likely than {1,1} because
1/4 * 1/4 < 3/4 * 3/4
However, I am confused myself because given the following definition for f[x] (Mathematica)
f[x_] := Mod[x, 4] > 0 /. {False -> 0, True -> 1}
or alternatively in C
int f(int x)
{
return (x % 4) > 0
? 1
: 0;
}
then the results obtained from executing f[x] and g[x] seem to have the expected distribution.
Table[f[x], {x, 0, 20}]
{0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0}
Table[g[x], {x, 0, 20}]
{1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1}
This is much like the Monty Hall paradox.
In general.
Public Class Form1
'the general case
'
'twiceThis = 2 is 1 in four chance of 0
'twiceThis = 3 is 1 in six chance of 0
'
'twiceThis = x is 1 in 2x chance of 0
Const twiceThis As Integer = 7
Const numOf As Integer = twiceThis * 2
Private Sub Button1_Click(ByVal sender As System.Object, _
ByVal e As System.EventArgs) Handles Button1.Click
Const tries As Integer = 1000
y = New List(Of Integer)
Dim ct0 As Integer = 0
Dim ct1 As Integer = 0
Debug.WriteLine("")
''show all possible values of fx
'For x As Integer = 1 To numOf
' Debug.WriteLine(fx)
'Next
'test that gx returns 50% 0's and 50% 1's
Dim stpw As New Stopwatch
stpw.Start()
For x As Integer = 1 To tries
Dim g_x As Integer = gx()
'Debug.WriteLine(g_x.ToString) 'used to verify that gx returns 0 or 1 randomly
If g_x = 0 Then ct0 += 1 Else ct1 += 1
Next
stpw.Stop()
'the results
Debug.WriteLine((ct0 / tries).ToString("p1"))
Debug.WriteLine((ct1 / tries).ToString("p1"))
Debug.WriteLine((stpw.ElapsedTicks / tries).ToString("n0"))
End Sub
Dim prng As New Random
Dim y As New List(Of Integer)
Private Function fx() As Integer
'1 in numOf chance of zero being returned
If y.Count = 0 Then
'reload y
y.Add(0) 'fx has only one zero value
Do
y.Add(1) 'the rest are ones
Loop While y.Count < numOf
End If
'return a random value
Dim idx As Integer = prng.Next(y.Count)
Dim rv As Integer = y(idx)
y.RemoveAt(idx) 'remove the value selected
Return rv
End Function
Private Function gx() As Integer
'a function g(x) using f(x) that 50% of the time returns 0
' that 50% of the time returns 1
Dim rv As Integer = 0
For x As Integer = 1 To twiceThis
fx()
Next
For x As Integer = 1 To twiceThis
rv += fx()
Next
If rv = twiceThis Then Return 1 Else Return 0
End Function
End Class
Let's say I want to check if a number n = 123 has duplicate digits. I tried:
#include <iostream>
using namespace std;
int main() {
int n = 123;
int d1 = n % 10;
int d2 = ( n / 10 ) % 10;
int d3 = ( n / 100 ) % 10;
if( d1 != d2 && d1 != d3 && d2 != d3 ) {
cout << n << " does not have duplicate digits.\n";
}
}
Is there any faster solution to this problem?
Update
Sorry for being unclear. The code above was written in C++ only for descriptive purposes. I have to solve this problem on a TI-89, with a number of 9 digits. And given the limitations of memory and speed, I'm looking for the fastest way possible.
TI-89 only has several condition keyword:
If
If ... Then
when(
For ... EndFor
While ... EndWhile
Loop ... EndLoop
Custom ... EndCustom
Thanks,
Chan
Not necessarily faster, but you should measure anyway, just in case - my optimisation mantra is "measure, don't guess".
But I believe it's clearer in intent (and simple enough to be translated to a simpler calculator language). It's also able to handle arbitrarily sized integers.
int hasDupes (unsigned int n) {
// Flag to indicate digit has been used, all zero to start.
int used[10] = {0};
// More than 10 digits must have duplicates, return true quickly.
if (n > 9999999999) return 1;
// Process each digit in number.
while (n != 0) {
// If duplicate, return true as soon as found.
if (used[n%10]) return 1;
// Otherwise, mark used, go to next digit.
used[n%10] = 1;
n /= 10;
}
// No duplicates after checking all digits, return false.
return 0;
}
If you have a limited range of possibilities, you can use the time-honoured approach of sacrificing space for time. For example, let's say you're talking about numbers between 0 and 999 inclusive (the : : markers simply indicate data I've removed to keep the size of the answer manageable):
const int hasDupes[] = {
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0 - 9
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, // 10 - 19
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, // 20 - 29
: :
0, 0, 1, 0, 0, 1, 0, 0, 0, 0, // 520 - 529
: :
0, 1, 0, 0, 0, 0, 0, 0, 1, 0, // 810 - 819
: :
0, 0, 0, 0, 0, 0, 0, 1, 0, 1, // 970 - 979
0, 0, 0, 0, 0, 0, 0, 0, 1, 1, // 980 - 989
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 990 - 999
};
and just do a table lookup of hasDupes[n]. The table itself could be generated (once) programmatically and then just inserted into your code for usage.
However, based on your edit where you state you need to handle nine-digit numbers, a billion-element array is probably not going to be possible on your calculator. I would therefore opt for the first solution.
template<class T, int radix = 10>
bool has_duplicate_digits(T n) {
int digits_mask = 0;
while (digits_mask |= (1 << (n % radix)), n /= radix)
if (digits_mask & (1 << (n % radix)))
return true;
return false;
}
Something like that should work as long as n is nonnegative and int has at least radix bits.
digits_mask is a bitset (bit 0 represents the occurrence of a 0 digit, bit 1 represents the occurrence of a 1 digit, etc.).
The bitmap is populated with the least significant digit of n, and the rest of the digits are shifted down. If there are more digits, and the new least significant digit is marked as having occurred previously, return true, otherwise repeat.
When there are no more digits, return false.
1 << x returns 1, 2, 4, 8, etc.: masks to use to test/set bits in the bitset.
a |= z is shorthand for a = a | z, which sets bits by the union of a from z.
a & z is the intersection of the bits in a and z, and is zero (false) if none are set and non-zero (true) if any are set.
I did a crash course in TI-89 basic to answer :)
Let's see if this works (I haven't an emulator, so can't check).
Test()
Prgm
{0,0,0,0,0,0,0,0,0,0}->A
Title "Request"
Request "Enter a number",B
EndDlog
Expr(B)->B
While B > 1
mod(B,10)->C
if A[C+1] = 1 goto K
1->A[C+1]
(B-C)/10->B
EndWhile
Title "Done"
Text "Numbers non repeating"
Enddlog
goto J
Lbl K
Title "Done"
Text "Numbers repeating"
Enddlog
Lbl J
EndPrgm
I'm looking for an algorithm that places tick marks on an axis, given a range to display, a width to display it in, and a function to measure a string width for a tick mark.
For example, given that I need to display between 1e-6 and 5e-6 and a width to display in pixels, the algorithm would determine that I should put tickmarks (for example) at 1e-6, 2e-6, 3e-6, 4e-6, and 5e-6. Given a smaller width, it might decide that the optimal placement is only at the even positions, i.e. 2e-6 and 4e-6 (since putting more tickmarks would cause them to overlap).
A smart algorithm would give preference to tickmarks at multiples of 10, 5, and 2. Also, a smart algorithm would be symmetric around zero.
As I didn't like any of the solutions I've found so far, I implemented my own. It's in C# but it can be easily translated into any other language.
It basically chooses, from a list of possible steps, the smallest one that displays all values without leaving any value exactly on the edge, lets you easily select which possible steps you want to use (without having to edit ugly if-else if blocks), and supports any range of values. I used a C# Tuple to return three values just for a quick and simple demonstration.
private static Tuple<decimal, decimal, decimal> GetScaleDetails(decimal min, decimal max)
{
// Minimal increment to avoid round extreme values to be on the edge of the chart
decimal epsilon = (max - min) / 1e6m;
max += epsilon;
min -= epsilon;
decimal range = max - min;
// Target number of values to be displayed on the Y axis (it may be less)
int stepCount = 20;
// First approximation
decimal roughStep = range / (stepCount - 1);
// Set best step for the range
decimal[] goodNormalizedSteps = { 1, 1.5m, 2, 2.5m, 5, 7.5m, 10 }; // keep the 10 at the end
// Or use these if you prefer: { 1, 2, 5, 10 };
// Normalize rough step to find the normalized one that fits best
decimal stepPower = (decimal)Math.Pow(10, -Math.Floor(Math.Log10((double)Math.Abs(roughStep))));
var normalizedStep = roughStep * stepPower;
var goodNormalizedStep = goodNormalizedSteps.First(n => n >= normalizedStep);
decimal step = goodNormalizedStep / stepPower;
// Determine the scale limits based on the chosen step.
decimal scaleMax = Math.Ceiling(max / step) * step;
decimal scaleMin = Math.Floor(min / step) * step;
return new Tuple<decimal, decimal, decimal>(scaleMin, scaleMax, step);
}
static void Main()
{
// Dummy code to show a usage example.
var minimumValue = data.Min();
var maximumValue = data.Max();
var results = GetScaleDetails(minimumValue, maximumValue);
chart.YAxis.MinValue = results.Item1;
chart.YAxis.MaxValue = results.Item2;
chart.YAxis.Step = results.Item3;
}
Take the longest of the segments about zero (or the whole graph, if zero is not in the range) - for example, if you have something on the range [-5, 1], take [-5,0].
Figure out approximately how long this segment will be, in ticks. This is just dividing the length by the width of a tick. So suppose the method says that we can put 11 ticks in from -5 to 0. This is our upper bound. For the shorter side, we'll just mirror the result on the longer side.
Now try to put in as many (up to 11) ticks in, such that the marker for each tick in the form i*10*10^n, i*5*10^n, i*2*10^n, where n is an integer, and i is the index of the tick. Now it's an optimization problem - we want to maximize the number of ticks we can put in, while at the same time minimizing the distance between the last tick and the end of the result. So assign a score for getting as many ticks as we can, less than our upper bound, and assign a score to getting the last tick close to n - you'll have to experiment here.
In the above example, try n = 1. We get 1 tick (at i=0). n = 2 gives us 1 tick, and we're further from the lower bound, so we know that we have to go the other way. n = 0 gives us 6 ticks, at each integer point. n = -1 gives us 12 ticks (0, -0.5, ..., -5.0). n = -2 gives us 24 ticks, and so on. The scoring algorithm will give them each a score - higher means a better method.
Do this again for the i * 5 * 10^n, and i*2*10^n, and take the one with the best score.
(as an example scoring algorithm, say that the score is the distance to the last tick times the maximum number of ticks minus the number needed. This will likely be bad, but it'll serve as a decent starting point).
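To make the idea a bit more concrete, here is a rough Python sketch of enumerating candidate steps of 1, 2, or 5 times a power of ten and scoring them. The function name choose_step, the search window, and the scoring formula are my own crude stand-ins, not the scoring suggested above:
def choose_step(segment_length, max_ticks):
    best = None
    for base in (1, 2, 5):
        for n in range(-6, 7):                       # assumed search window for powers of ten
            step = base * 10 ** n
            ticks = int(segment_length // step) + 1  # ticks at 0, step, 2*step, ...
            if ticks < 2 or ticks > max_ticks:
                continue
            leftover = segment_length - (ticks - 1) * step
            score = ticks - leftover / step          # crude: prefer many ticks, small leftover
            if best is None or score > best[0]:
                best = (score, step)
    return best[1] if best else None

# choose_step(5.0, 11)  ->  0.5   (ticks at 0, 0.5, ..., 5.0)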
Funnily enough, just over a week ago I came here looking for an answer to the same question, but went away again and decided to come up with my own algorithm. I am here to share, in case it is of any use.
I wrote the code in Python to try and bust out a solution as quickly as possible, but it can easily be ported to any other language.
The function below calculates the appropriate interval (which I have allowed to be either 10**n, 2*10**n, 4*10**n or 5*10**n) for a given range of data, and then calculates the locations at which to place the ticks (based on which numbers within the range are divisible by the interval). I have not used the modulo % operator, since it does not work properly with floating-point numbers due to floating-point arithmetic rounding errors.
Code:
import math
def get_tick_positions(data: list):
if len(data) == 0:
return []
retpoints = []
data_range = max(data) - min(data)
lower_bound = min(data) - data_range/10
upper_bound = max(data) + data_range/10
view_range = upper_bound - lower_bound
num = lower_bound
n = math.floor(math.log10(view_range) - 1)
interval = 10**n
num_ticks = 1
while num <= upper_bound:
num += interval
num_ticks += 1
if num_ticks > 10:
if interval == 10 ** n:
interval = 2 * 10 ** n
elif interval == 2 * 10 ** n:
interval = 4 * 10 ** n
elif interval == 4 * 10 ** n:
interval = 5 * 10 ** n
else:
n += 1
interval = 10 ** n
num = lower_bound
num_ticks = 1
if view_range >= 10:
copy_interval = interval
else:
if interval == 10 ** n:
copy_interval = 1
elif interval == 2 * 10 ** n:
copy_interval = 2
elif interval == 4 * 10 ** n:
copy_interval = 4
else:
copy_interval = 5
first_val = 0
prev_val = 0
times = 0
temp_log = math.log10(interval)
if math.isclose(lower_bound, 0):
first_val = 0
elif lower_bound < 0:
if upper_bound < -2*interval:
if n < 0:
copy_ub = round(upper_bound*10**(abs(temp_log) + 1))
times = copy_ub // round(interval*10**(abs(temp_log) + 1)) + 2
else:
times = upper_bound // round(interval) + 2
while first_val >= lower_bound:
prev_val = first_val
first_val = times * copy_interval
if n < 0:
first_val *= (10**n)
times -= 1
first_val = prev_val
times += 3
else:
if lower_bound > 2*interval:
if n < 0:
copy_ub = round(lower_bound*10**(abs(temp_log) + 1))
times = copy_ub // round(interval*10**(abs(temp_log) + 1)) - 2
else:
times = lower_bound // round(interval) - 2
while first_val < lower_bound:
first_val = times*copy_interval
if n < 0:
first_val *= (10**n)
times += 1
if n < 0:
retpoints.append(first_val)
else:
retpoints.append(round(first_val))
val = first_val
times = 1
while val <= upper_bound:
val = first_val + times * interval
if n < 0:
retpoints.append(val)
else:
retpoints.append(round(val))
times += 1
retpoints.pop()
return retpoints
When passing in the following three data-points to the function
points = [-0.00493, -0.0003892, -0.00003292]
... the output I get (as a list) is as follows:
[-0.005, -0.004, -0.003, -0.002, -0.001, 0.0]
When passing this:
points = [1.399, 38.23823, 8309.33, 112990.12]
... I get:
[0, 20000, 40000, 60000, 80000, 100000, 120000]
When passing this:
points = [-54, -32, -19, -17, -13, -11, -8, -4, 12, 15, 68]
... I get:
[-60, -40, -20, 0, 20, 40, 60, 80]
... which all seem to be a decent choice of positions for placing ticks.
The function is written to allow 5-10 ticks, but that could easily be changed if you so please.
Whether the list of data supplied contains ordered or unordered data it does not matter, since it is only the minimum and maximum data points within the list that matter.
This simple algorithm yields an interval that is a multiple of 1, 2, or 5 times a power of 10, and the axis range gets divided into at least 5 intervals. The code sample is in Java:
protected double calculateInterval(double range) {
double x = Math.pow(10.0, Math.floor(Math.log10(range)));
if (range / x >= 5)
return x;
else if (range / (x / 2.0) >= 5)
return x / 2.0;
else
return x / 5.0;
}
This is an alternative, for minimum 10 intervals:
protected double calculateInterval(double range) {
double x = Math.pow(10.0, Math.floor(Math.log10(range)));
if (range / (x / 2.0) >= 10)
return x / 2.0;
else if (range / (x / 5.0) >= 10)
return x / 5.0;
else
return x / 10.0;
}
I've been using the jQuery flot graph library. It's open source and does axis/tick generation quite well. I'd suggest looking at its code and pinching some ideas from there.