I am working on a project where I have to do some K-means clustering with MLlib from Spark. The problem is that my data has 744 features. I did some research and found out that PCA is what I need. The best part is that Spark has PCA implemented, so I decided to use it.
double[][] array=new double[381][744];
int contor=0;
for (Vector vectorData : parsedTrainingData.collect()) {
// Store first, then advance, so row 0 is filled and the index stays in bounds.
array[contor] = vectorData.toArray();
contor++;
}
LinkedList<Vector> rowsList = new LinkedList<>();
for (int i = 0; i < array.length; i++) {
Vector currentRow = Vectors.dense(array[i]);
rowsList.add(currentRow);
}
JavaRDD<Vector> rows = jsc.parallelize(rowsList);
// Create a RowMatrix from JavaRDD<Vector>.
RowMatrix mat = new RowMatrix(rows.rdd());
// Compute the top 3 principal components.
Tuple2<Matrix, Vector> pc = mat.computePrincipalComponentsAndExplainedVariance(*param*);
RowMatrix projected = mat.multiply(pc._1);
Vector[] collectPartitions = (Vector[]) projected.rows().collect();
System.out.println("Projected vector of principal component:");
for (Vector vector : collectPartitions) {
System.out.println("\t" + vector);
}
System.out.println("\n Explained Variance:");
System.out.println(pc._2);
double sum = 0;
for (Double val : pc._2.toArray()) {
sum += val;
}
System.out.println("\n the sum is: " + (double) sum);
About the data I want to apply PCA to: I have 744 features that represent values (total seconds of active time) collected every hour by sensors in a home, so it is something like 31 sensors * 24 h, named in the format s(sensorNumber)(hour): s10, s11, ..., s123, s20, s21, ..., s223, ..., s3123.
From what I understand, one criterion for a reduction that does not lose too much information is that the sum of the explained variance be greater than 0.9 (90%). After some tests I got these results:
*param* sum
100 0.91
150 0.97
200 0.98
250 0.99
350 1
So from what I understand, it would be safe to reduce my 744-feature vector to a 100-feature vector. My problem is that these results look too good to be true. I searched for some examples for guidance, but I am still unsure whether what I did is correct. So are these results plausible?
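The 90% criterion can be made concrete by accumulating the explained-variance ratios and stopping at the first k that reaches the threshold. The following is a minimal Python sketch, assuming the ratios are already available as a list sorted in decreasing order; the function name `smallest_k` and the sample numbers are illustrative and not taken from the data above.

```python
def smallest_k(explained_variance, threshold=0.9):
    """Return the smallest number of leading components whose
    cumulative explained variance reaches `threshold`, or None."""
    total = 0.0
    for k, ratio in enumerate(explained_variance, start=1):
        total += ratio
        if total >= threshold:
            return k
    return None

# Made-up ratios, for illustration only.
ratios = [0.5, 0.25, 0.125, 0.0625, 0.0625]
print(smallest_k(ratios, 0.9))  # 4
```

Scanning the cumulative sum this way avoids re-running PCA for each candidate value of the parameter, provided the full list of ratios has been computed once.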
I'm calculating PageRank for a set of crawled domains, using a damping factor of 0.85. As mentioned in many PageRank papers, the sum of the page ranks should converge to 1. But regardless of how many iterations I do, it seems to converge at 0.90xxx. If I lower the damping factor to 0.5, I obviously move closer to 1.
Is it bad that the page rank sum converges at 0.90, and what would this generally imply?
Yes, it is bad, since it indicates a bug in your implementation. PageRank produces a probability distribution, so the ranks must sum to 1 as a basic sanity check.
My guess is that you did not handle 'sinks': nodes that have no outgoing links.
Common ways to handle sinks are:
For a sink vi, regard all edges (vi,vj) as existing, except vi=vj.
Remove them from the graph completely (and repeat until convergence).
Link them back to all nodes that linked to them (if vi is a sink, for every edge (vj,vi), add (vi,vj) as well).
Consider the following toy example: 2 pages, A,B. A links to B, B links to nothing. The resulting matrix is:
W=
0 1
0 0
Now, using d=0.85, you get the following equations:
v = 0.85* W'v + 0.15*[1/2,1/2]
v1 = 0.85* (0*v1+0*v2) + 0.15*1/2 = 0.15*1/2 = 0.075
v2 = 0.85*(1*v1 + 0*v2) + 0.15/2 = 0.85v1 + 0.075 = 0.06375 + 0.075 = 0.13875
And the sum is not 1 (0.075 + 0.13875 = 0.21375).
However, if you handle the sinks using one of the suggested approaches (let's examine approach (1)), you will get:
W =
0 1
1 0
You will now get the set of equations:
v = 0.85* W'v + 0.15*[1/2,1/2]
v1 = 0.85* (0*v1+1*v2) + 0.15*1/2 = 0.85v2 + 0.075
v2 = 0.85*(1*v1 + 0*v2) + 0.15/2 = 0.85v1 + 0.075 (/0.85)-> 1/0.85 * v2 = v1 + 0.075/0.85
-> (add 2 equations)
1/0.85*v2 + v1 = 0.85v2 + 0.075 + v1 + 0.075/0.85
-> (approximately)
0.326*v2 = 0.163
v2 = 0.5, and therefore v1 = 0.85*0.5 + 0.075 = 0.5
As you can see, by using this method we get a probability distribution and, as expected, the page ranks of all nodes now sum to 1.
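The toy example can be checked numerically. The following is a minimal Python sketch of power iteration, assuming the graph is given as a dict from each page to the list of pages it links to; the function name `pagerank` and the `fix_sinks` flag are illustrative. With `fix_sinks=True` it applies approach (1) from the list above.

```python
def pagerank(links, d=0.85, iterations=100, fix_sinks=True):
    """Power iteration on a dict {page: [pages it links to]}."""
    nodes = sorted(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1 - d) / n for v in nodes}
        for v in nodes:
            targets = links[v]
            if not targets and fix_sinks:
                # Approach (1): a sink links to every page except itself.
                targets = [u for u in nodes if u != v]
            for u in targets:
                new_rank[u] += d * rank[v] / len(targets)
        rank = new_rank
    return rank

toy = {"A": ["B"], "B": []}
print(sum(pagerank(toy, fix_sinks=False).values()))  # about 0.21375
print(sum(pagerank(toy, fix_sinks=True).values()))   # about 1.0
```

Without sink handling, the rank that reaches B is dropped on every iteration, which is why the total settles well below 1.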
This became the algorithm:
// data structures
private HashMap<String, Double> pageRanks;
private HashMap<String, Double> oldRanks;
private HashMap<String, Integer> numberOutlinks;
private HashMap<String, HashMap<String, Integer>> inlinks;
private HashSet<String> domainsWithNoOutlinks;
private double N;
// data parsing omitted
public void startAlgorithm() {
int maxIterations = 20;
int itr = 0;
double d = 0.85;
double dp = 0;
double dpp = (1 - d) / N;
// initialize pagerank
for (String s : oldRanks.keySet()) {
oldRanks.put(s, 1.0 / N);
}
System.out.println("Starting page rank iterations..");
while (maxIterations >= itr) {
System.out.println("Iteration: " + itr);
dp = 0;
// teleport probability
for (String domain : domainsWithNoOutlinks) {
dp = dp + d * oldRanks.get(domain) / N;
}
for (String domain : oldRanks.keySet()) {
pageRanks.put(domain, dp + dpp);
for (String inlink : inlinks.get(domain).keySet()) { // for every inlink of domain
pageRanks.put(domain, pageRanks.get(domain) + inlinks.get(domain).get(inlink) * d * oldRanks.get(inlink) / numberOutlinks.get(inlink));
}
}
// update pageranks with new values
for (String domain : pageRanks.keySet()) {
oldRanks.put(domain, pageRanks.get(domain));
}
itr++;
}
}
Where this line is the important one:
pageRanks.put(domain, pageRanks.get(domain) + inlinks.get(domain).get(inlink) * d * oldRanks.get(inlink) / numberOutlinks.get(inlink));
inlinks.get(domain).get(inlink) returns how many times an inlink "liked/referenced" the current domain, and we divide that by the number of outlinks the inlinking domain has. The factor inlinks.get(domain).get(inlink) is what I had missed in my algorithm, which is why the sum didn't converge to 1.
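The same idea can be sketched compactly in Python. This is an illustration under stated assumptions rather than a port of the Java code: the graph is a dict from each domain to `{target: number_of_links}`, each link weight is normalized by the linking domain's total number of outgoing links, and the rank of domains without outlinks is spread evenly over all domains. The name `weighted_pagerank` and the sample domains are hypothetical.

```python
def weighted_pagerank(out_weights, d=0.85, iterations=50):
    """out_weights maps each domain to {target: number_of_links}.
    Domains without outlinks spread their rank evenly over all domains."""
    nodes = sorted(set(out_weights) | {t for ts in out_weights.values() for t in ts})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank currently held by domains with no outgoing links.
        dangling = sum(rank[v] for v in nodes if not out_weights.get(v))
        new_rank = {v: (1 - d) / n + d * dangling / n for v in nodes}
        for v in nodes:
            targets = out_weights.get(v, {})
            total = sum(targets.values())
            for u, w in targets.items():
                # Weight each link by its share of v's outgoing links.
                new_rank[u] += d * rank[v] * w / total
        rank = new_rank
    return rank

graph = {"a.com": {"b.com": 3, "c.com": 1}, "b.com": {"c.com": 2}, "c.com": {}}
ranks = weighted_pagerank(graph)
print(round(sum(ranks.values()), 6))  # 1.0
```

Because every unit of rank is either teleported, redistributed from a dangling domain, or passed along a weighted link, the total stays at 1 after each iteration.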
Read more: http://www.ccs.northeastern.edu/home/daikeshi/notes/PageRank.pdf
I've been working with this variation of dynamic programming to solve a knapsack problem:
KnapsackItem = Struct.new(:name, :cost, :value)
KnapsackProblem = Struct.new(:items, :max_cost)
def dynamic_programming_knapsack(problem)
num_items = problem.items.size
items = problem.items
max_cost = problem.max_cost
cost_matrix = zeros(num_items, max_cost+1)
num_items.times do |i|
(max_cost + 1).times do |j|
if(items[i].cost > j)
cost_matrix[i][j] = cost_matrix[i-1][j]
else
cost_matrix[i][j] = [cost_matrix[i-1][j], items[i].value + cost_matrix[i-1][j-items[i].cost]].max
end
end
end
cost_matrix
end
def get_used_items(problem, cost_matrix)
i = cost_matrix.size - 1
currentCost = cost_matrix[0].size - 1
marked = Array.new(cost_matrix.size, 0)
while(i >= 0 && currentCost >= 0)
if(i == 0 && cost_matrix[i][currentCost] > 0 ) || (cost_matrix[i][currentCost] != cost_matrix[i-1][currentCost])
marked[i] = 1
currentCost -= problem.items[i].cost
end
i -= 1
end
marked
end
This has worked great for the structure above where you simply provide a name, cost and value. Items can be created like the following:
items = [
KnapsackItem.new('david lee', 8000, 30) ,
KnapsackItem.new('kevin love', 12000, 50),
KnapsackItem.new('kemba walker', 7300, 10),
KnapsackItem.new('jrue holiday', 12300, 30),
KnapsackItem.new('stephen curry', 10300, 80),
KnapsackItem.new('lebron james', 5300, 90),
KnapsackItem.new('kevin durant', 2300, 30),
KnapsackItem.new('russell westbrook', 9300, 30),
KnapsackItem.new('kevin martin', 8300, 15),
KnapsackItem.new('steve nash', 4300, 15),
KnapsackItem.new('kyle lowry', 6300, 20),
KnapsackItem.new('monta ellis', 8300, 30),
KnapsackItem.new('dirk nowitzki', 7300, 25),
KnapsackItem.new('david lee', 9500, 35),
KnapsackItem.new('klay thompson', 6800, 28)
]
problem = KnapsackProblem.new(items, 65000)
Now, the problem I'm having is that I need to add a position for each of these players. The knapsack algorithm still needs to maximize value across all players, but with a new restriction: each player has a position, and each position can only be selected a certain number of times. Some positions can be selected twice, others once. Items would ideally become this:
KnapsackItem = Struct.new(:name, :position, :cost, :value)
Positions would have a restriction such as the following:
PositionLimits = Struct.new(:position, :max)
Limits would be instantiated perhaps like the following:
limits = [PositionLimits.new('PG', 2), PositionLimits.new('C', 1), PositionLimits.new('SF', 2), PositionLimits.new('PF', 2), PositionLimits.new('Util', 2)]
What makes this a little more tricky is every player can be in the Util position. If we want to disable the Util position, we will just set the 2 to 0.
Our original items array would look something like the following:
items = [
KnapsackItem.new('david lee', 'PF', 8000, 30) ,
KnapsackItem.new('kevin love', 'C', 12000, 50),
KnapsackItem.new('kemba walker', 'PG', 7300, 10),
... etc ...
]
How can position restrictions be added to the knapsack algorithm while still retaining the maximum value for the provided player pool?
There are some efficient libraries available in Ruby that could suit your task. It's clear that you are looking for constraint-based optimization, and some Ruby libraries for this are open source and therefore free to use; just include them in your project. All you need to do is generate a linear programming model (an objective function plus your constraints), and the library's optimizer will produce a solution that satisfies all your constraints, or report that no solution exists if the constraints cannot be satisfied.
Some such libraries available in Ruby are:
RGLPK
OPL
LP Solve
OPL follows an LP syntax similar to IBM CPLEX, which is widely used optimization software, so you can find good references on how to model the LP with it. Moreover, it is built on top of RGLPK.
As I understand it, the additional constraint that you are specifying is the following:
There is a set of elements, out of which at most k (k = 1 or
2) elements can be selected in the solution. There are multiple
such sets.
There are two approaches that come to my mind, neither of which is efficient enough.
Approach 1:
Divide the elements into groups of positions. So if there are 5 positions, then each element shall be assigned to one of 5 groups.
Iterate (or recur) through all the combinations by selecting 1 (or 2) element from each group and checking the total value and cost. There are ways in which you can fathom some combinations. For example, in a group if there are two elements in which one gives more value at lesser cost, then the other can be rejected from all solutions.
Approach 2:
Mixed Integer Linear Programming Approach.
Formulate the problem as follows:
Maximize summation (ViXi) {i = 1 to N}
where Vi is value and
Xi is a 1/0 variable denoting presence/absence of an element from the solution.
Subject to constraints:
summation (ciXi) <= C_MAX {total cost}
And for each group:
summation (Xj) <= 1 (or 2 depending on position)
All Xi = 0 or 1.
And then you will have to find a solver to solve the above MILP.
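Before reaching for a solver, the formulation can be sanity-checked on a tiny instance by enumerating every 0/1 assignment of the Xi. The following Python sketch does exactly that; it is exhaustive enumeration, not a MILP solver, and it is only practical for a handful of items. Like the formulation above, it ignores the Util slot. The function name `solve_by_enumeration`, the tuple layout and the budget of 20000 are assumptions made for illustration.

```python
from itertools import product

def solve_by_enumeration(items, max_cost, limits):
    """items: list of (name, position, cost, value); limits: {position: max}."""
    best_value, best_pick = 0, []
    for x in product((0, 1), repeat=len(items)):
        chosen = [item for item, xi in zip(items, x) if xi]
        if sum(cost for _, _, cost, _ in chosen) > max_cost:
            continue  # violates the total-cost constraint
        counts = {}
        for _, pos, _, _ in chosen:
            counts[pos] = counts.get(pos, 0) + 1
        if any(counts[p] > limits.get(p, 0) for p in counts):
            continue  # violates a per-group constraint
        value = sum(v for _, _, _, v in chosen)
        if value > best_value:
            best_value, best_pick = value, [name for name, _, _, _ in chosen]
    return best_value, best_pick

items = [
    ("david lee", "PF", 8000, 30),
    ("kevin love", "C", 12000, 50),
    ("kemba walker", "PG", 7300, 10),
    ("kevin durant", "C", 2300, 30),
]
print(solve_by_enumeration(items, 20000, {"PG": 2, "C": 1, "PF": 2}))
# (80, ['david lee', 'kevin love'])
```

The running time doubles with every added item, so this is useful mainly for checking a solver's output on small cases.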
This problem is similar to a constrained vehicle routing problem. You can try a heuristic like the savings algorithm from Clarke & Wright. You can also try a brute-force algorithm with fewer players.
Considering that players have five positions, your knapsack problem would be:
Knpsk(W,N,PG,C,SF,PF,Util) = max(Knpsk(W-Cost[N],N-1,...)+Value[N],Knpsk(W,N-1,PG,C,SF,PF,Util),Knpsk(W-Cost[N],N-1,PG,C,SF,PF,Util-1)+Value[N])
if(Pos[N]=="PG") then Knpsk(W-Cost[N],N-1,....) = Knpsk(W-Cost[N],N-1,PG-1,....)
if(Pos[N]=="C") then Knpsk(W-Cost[N],N-1,....) = Knpsk(W-Cost[N],N-1,PG,C-1....)
so on...
PG,C,SF,PF,Util are current position capacities
W is current knapsack capacity
N number of items available
Dynamic programming can be used as before, now with a 7-dimensional table. Because the position capacities are small in your case, this only slows the algorithm down by a constant factor (about 16 by this estimate), which is acceptable for an NP-complete problem.
The following is a dynamic programming solution in Java:
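The recurrence can also be written as a short memoized function. The following Python sketch is one possible reading of it, assuming players are given as `(cost, value, position)` tuples and that a player may fill either a slot of their own position or a Util slot; the names `best_value` and `knpsk` and the sample budget are illustrative.

```python
from functools import lru_cache

def best_value(players, budget, limits, util):
    """players: list of (cost, value, position); limits: {position: slots}."""
    positions = sorted(limits)

    @lru_cache(maxsize=None)
    def knpsk(w, n, slots, util_left):
        if n == 0:
            return 0
        cost, value, pos = players[n - 1]
        best = knpsk(w, n - 1, slots, util_left)  # skip player n
        if cost <= w:
            i = positions.index(pos)
            if slots[i] > 0:  # use a slot of the player's own position
                taken = slots[:i] + (slots[i] - 1,) + slots[i + 1:]
                best = max(best, value + knpsk(w - cost, n - 1, taken, util_left))
            if util_left > 0:  # use a Util slot instead
                best = max(best, value + knpsk(w - cost, n - 1, slots, util_left - 1))
        return best

    start = tuple(limits[p] for p in positions)
    return knpsk(budget, len(players), start, util)

players = [(8000, 30, "PF"), (12000, 50, "C"), (2300, 30, "C"), (7300, 10, "PG")]
print(best_value(players, 25000, {"PG": 2, "C": 1, "PF": 2}, util=1))  # 110
print(best_value(players, 25000, {"PG": 2, "C": 1, "PF": 2}, util=0))  # 80
```

Setting `util=0` disables the Util position, which matches the idea of setting its limit to 0 in the question.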
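import java.util.ArrayList;
import java.util.HashMap;

public class KnapsackSolver {
// Memo table: encoded subproblem -> best value
HashMap<String, Integer> CostMatrix;
// Maximum capacities for positions
int posCapacity[] = {2,1,2,2,2};
// Total positions
String[] positions = {"PG","C","SF","PF","util"};
ArrayList<player> playerSet = new ArrayList<player>();
public ArrayList<player> solutionSet;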
public int bestCost;
class player {
int value;
int cost;
int pos;
String name;
public player(int value,int cost,int pos,String name) {
this.value = value;
this.cost = cost;
this.pos = pos;
this.name = name;
}
public String toString() {
return("'"+name+"'"+", "+value+", "+cost+", "+positions[pos]);
}
}
// Used to add player to list of available players
void additem(String name,int cost,int value,String pos) {
int i;
for(i=0;i<positions.length;i++) {
if(pos.equals(positions[i]))
break;
}
playerSet.add(new player(value,cost,i,name));
}
// Converts subproblem data to string for hashing
public String encode(int Capacity,int Totalitems,int[] positions) {
String Data = Capacity+","+Totalitems;
for(int i=0;i<positions.length;i++) {
Data = Data + "," + positions[i];
}
return(Data);
}
// Check if subproblem is in hash tables
int isDone(int capacity,int players,int[] positions) {
String k = encode(capacity,players,positions);
if(CostMatrix.containsKey(k)) {
//System.out.println("Key found: "+k+" "+(Integer)CostMatrix.get(k));
return((Integer)CostMatrix.get(k));
}
return(-1);
}
// Adds the subproblem result to the hash table
void addEncode(int capacity,int players,int[] positions,int value) {
String k = encode(capacity,players,positions);
CostMatrix.put(k, value);
}
boolean checkvalid(int capacity,int players) {
return(!(capacity<1||players<0));
}
// Solve the Knapsack recursively with Hash look up
int solve(int capacity,int players,int[] posCapacity) {
// Check if sub problem is valid
if(checkvalid(capacity,players)) {
//System.out.println("Processing: "+encode(capacity,players,posCapacity));
player current = (player)playerSet.get(players);
int sum1 = 0,sum2 = 0,sum3 = 0;
int temp = isDone(capacity,players-1,posCapacity);
// Do not add player
if(temp>-1) {
sum1 = temp;
}
else sum1 = solve(capacity,players-1,posCapacity);
//check if current player can be added to knapsack
if(capacity>=current.cost) {
posCapacity[posCapacity.length-1]--;
temp = isDone(capacity-current.cost,players-1,posCapacity);
posCapacity[posCapacity.length-1]++;
// Add player to util
if(posCapacity[posCapacity.length-1]>0) {
if(temp>-1) {
sum2 = temp+current.value;
}
else {
posCapacity[posCapacity.length-1]--;
sum2 = solve(capacity-current.cost,players-1,posCapacity)+current.value;
posCapacity[posCapacity.length-1]++;
}
}
// Add player at its position
int i = current.pos;
if(posCapacity[i]>0) {
posCapacity[i]--;
temp = isDone(capacity-current.cost,players-1,posCapacity);
posCapacity[i]++;
if(temp>-1) {
sum3 = temp+current.value;
}
else {
posCapacity[i]--;
sum3 = solve(capacity-current.cost,players-1,posCapacity)+current.value;
posCapacity[i]++;
}
}
}
//System.out.println(sum1+ " "+ sum2+ " " + sum3 );
// Evaluate the maximum of all subproblem
int res = Math.max(Math.max(sum1,sum2), sum3);
//add current solution to Hash table
addEncode(capacity, players, posCapacity,res);
//System.out.println("Encoding: "+encode(capacity,players,posCapacity)+" Cost: "+res);
return(res);
}
return(0);
}
void getSolution(int capacity,int players,int[] posCapacity) {
if(players>=0) {
player curr = (player)playerSet.get(players);
int bestcost = isDone(capacity,players,posCapacity);
int sum1 = 0,sum2 = 0,sum3 = 0;
//System.out.println(encode(capacity,players-1,posCapacity)+" "+bestcost);
sum1 = isDone(capacity,players-1,posCapacity);
posCapacity[posCapacity.length-1]--;
sum2 = isDone(capacity-curr.cost,players-1,posCapacity) + curr.value;
posCapacity[posCapacity.length-1]++;
posCapacity[curr.pos]--;
sum3 = isDone(capacity-curr.cost,players-1,posCapacity) + curr.value;
posCapacity[curr.pos]++;
if(bestcost==0)
return;
// Check if player is not added
if(sum1==bestcost) {
getSolution(capacity,players-1,posCapacity);
}
// Check if player is added to util
else if(sum2==bestcost) {
solutionSet.add(curr);
//System.out.println(positions[posCapacity.length-1]+" added");
posCapacity[posCapacity.length-1]--;
getSolution(capacity-curr.cost,players-1,posCapacity);
posCapacity[posCapacity.length-1]++;
}
else {
solutionSet.add(curr);
//System.out.println(positions[curr.pos]+" added");
posCapacity[curr.pos]--;
getSolution(capacity-curr.cost,players-1,posCapacity);
posCapacity[curr.pos]++;
}
}
}
void getOptSet(int capacity) {
CostMatrix = new HashMap<String,Integer>();
bestCost = solve(capacity,playerSet.size()-1,posCapacity);
solutionSet = new ArrayList<player>();
getSolution(capacity, playerSet.size()-1, posCapacity);
}
public static void main(String[] args) {
KnapsackSolver ks = new KnapsackSolver();
ks.additem("david lee", 8000, 30, "PG");
ks.additem("kevin love", 12000, 50, "C");
ks.additem("kemba walker", 7300, 10, "SF");
ks.additem("jrue holiday", 12300, 30, "PF");
ks.additem("stephen curry", 10300, 80, "PG");
ks.additem("lebron james", 5300, 90, "PG");
ks.additem("kevin durant", 2300, 30, "C");
ks.additem("russell westbrook", 9300, 30, "SF");
ks.additem("kevin martin", 8300, 15, "PF");
ks.additem("steve nash", 4300, 15, "C");
ks.additem("kyle lowry", 6300, 20, "PG");
ks.additem("monta ellis", 8300, 30, "C");
ks.additem("dirk nowitzki", 7300, 25, "SF");
ks.additem("david lee", 9500, 35, "PF");
ks.additem("klay thompson", 6800, 28,"PG");
//System.out.println("Items added...");
// System.out.println(ks.playerSet);
int maxCost = 30000;
ks.getOptSet(maxCost);
System.out.println("Best Value: "+ks.bestCost);
System.out.println("Solution Set: "+ks.solutionSet);
}
}
Note: If more players of a certain position are added than that position's capacity allows, the extra players are added as util, because players from any position can be added to util.
Imagine you have 3 buckets, but each of them has a hole in it. I'm trying to fill a bath tub. The bath tub has a minimum level of water it needs and a maximum level of water it can contain. By the time you reach the tub with the bucket it is not clear how much water will be in the bucket, but you have a range of possible values.
Is it possible to adequately fill the tub with water?
Pretty much you have 3 ranges (min,max), is there some sum of them that will fall within a 4th range?
For example:
Bucket 1 : 5-10L
Bucket 2 : 15-25L
Bucket 3 : 10-50L
Bathtub 100-150L
Is there some guaranteed combination of 1 2 and 3 that will fill the bathtub within the requisite range? Multiples of each bucket can be used.
EDIT: Now imagine there are 50 different buckets?
If the capacity of the tub is not very large (not greater than 10^6, for example), we can solve it using dynamic programming.
Approach:
Initialization: memo[X][Y] is an array to memorize the result. X = number of buckets, Y = maximum capacity of the tub. Initialize memo[][] with -1.
Code:
bool dp(int bucketNum, int curVolume){
if(curVolume > maxCap)return false; // pruning extra branches
if(curVolume>=minCap && curVolume<=maxCap){ // base case on success
return true;
}
int &ret = memo[bucketNum][curVolume];
if(ret != -1){ // this state has been visited earlier
return ret; // reuse the memoized result
}
ret = false;
for(int i = minC[bucketNum]; i <= maxC[bucketNum]; i++){
int newVolume = curVolume + i;
for(int j = bucketNum; j <= 3; j++){
ret|=dp(j,newVolume);
if(ret == true)return ret;
}
}
return ret;
}
Warning: Code not tested
Here's a naïve recursive solution in python that works just fine (although it doesn't find an optimal solution):
def match_helper(lower, upper, units, least_difference, fail = dict()):
    # fail memoizes (lower, upper) targets already known to be unreachable
    if upper < lower + least_difference:
        return None
    if fail.get((lower,upper)):
        return None
    exact_match = [ u for u in units if u['lower'] >= lower and u['upper'] <= upper ]
    if exact_match:
        return [ exact_match[0] ]
    for unit in units:
        if unit['upper'] > upper:
            continue
        recursive_match = match_helper(lower - unit['lower'], upper - unit['upper'], units, least_difference)
        if recursive_match:
            return [unit] + recursive_match
    else:
        # no unit led to a solution: remember the failure
        fail[(lower,upper)] = 1
        return None
def match(lower, upper):
    units = [
        { 'name': 'Bucket 1', 'lower': 5, 'upper': 10 },
        { 'name': 'Bucket 2', 'lower': 15, 'upper': 25 },
        { 'name': 'Bucket 3', 'lower': 10, 'upper': 50 }
    ]
    least_difference = min([ u['upper'] - u['lower'] for u in units ])
    return match_helper(
        lower = lower,
        upper = upper,
        units = sorted(units, key = lambda u: u['upper']),
        least_difference = least_difference,
    )
result = match(100, 175)
if result:
    lower = sum([ u['lower'] for u in result ])
    upper = sum([ u['upper'] for u in result ])
    names = [ u['name'] for u in result ]
    print(lower, "-", upper)
    print(names)
else:
    print("No solution")
It prints "No solution" for 100-150, but for 100-175 it comes up with a solution of 5x bucket 1, 5x bucket 2.
Assuming you are saying that the "range" for each bucket is the amount of water that it may have when it reaches the tub, and all you care about is if they could possibly fill the tub...
Just take the "max" of each bucket and sum them. If that is in the range of what you consider the tub to be "filled" then it can.
Updated:
Given that buckets can be used multiple times, this seems to me like we're looking for solutions to a pair of inequalities.
Given buckets x, y and z we want to find a, b and c:
a*x.min + b*y.min + c*z.min >= bathtub.min
and
a*x.max + b*y.max + c*z.max <= bathtub.max
Re: http://en.wikipedia.org/wiki/Diophantine_equation
If bathtub.min and bathtub.max are both multiples of the greatest common divisor of a,b and c, then there are infinitely many solutions (i.e. we can fill the tub), otherwise there are no solutions (i.e. we can never fill the tub).
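For a small number of buckets, the two inequalities can simply be checked for every combination of counts. The following Python sketch does this by exhaustive search and returns the combination that uses the fewest buckets; it assumes each bucket is a `(min, max)` pair with a positive maximum, and the function name `find_counts` is illustrative. It is a brute-force check, not the Diophantine approach referenced above.

```python
from itertools import product

def find_counts(buckets, tub_min, tub_max):
    """Return counts (one per bucket) whose summed minimums reach tub_min
    and whose summed maximums stay within tub_max, or None."""
    # No bucket can be used more often than its maximum fits into the tub.
    limits = [tub_max // high for _, high in buckets]
    best = None
    for counts in product(*(range(limit + 1) for limit in limits)):
        low = sum(c * b[0] for c, b in zip(counts, buckets))
        high = sum(c * b[1] for c, b in zip(counts, buckets))
        if low >= tub_min and high <= tub_max:
            if best is None or sum(counts) < sum(best):
                best = counts
    return best

buckets = [(5, 10), (15, 25), (10, 50)]
print(find_counts(buckets, 100, 150))  # None
print(find_counts(buckets, 100, 175))  # (0, 7, 0)
```

The search space grows as the product of the per-bucket limits, so with 50 different buckets this approach is no longer feasible and a smarter search is needed.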
This can be solved with multiple applications of the change making problem.
Each Bucket.Min value is a currency denomination, and Bathtub.Min is the target value.
When you find a solution via a change-making algorithm, then apply one more constraint:
sum(each Bucket.Max in your solution) <= Bathtub.max
If this constraint is not met, throw out this solution and look for another. This will probably require a change to a standard change-making algorithm that allows you to try other solutions when one is found to not be suitable.
Initially, your target range is Bathtub.Range.
Each time you add an instance of a bucket to the solution, you reduce the target range for the remaining buckets.
For example, using your example buckets and tub:
Target Range = 100..150
Let's say we want to add a Bucket1 to the candidate solution. That then gives us
Target Range = 95..140
because if the rest of the buckets in the solution total < 95, then this Bucket1 might not be sufficient to fill the tub to 100, and if the rest of the buckets in the solution total > 140, then this Bucket1 might fill the tub over 150.
So, this gives you a quick way to check if a candidate solution is valid:
TargetRange = Bathtub.Range
foreach Bucket in CandidateSolution
TargetRange.Min -= Bucket.Min
TargetRange.Max -= Bucket.Max
if TargetRange.Min == 0 AND TargetRange.Max >= 0 then solution found
if TargetRange.Min < 0 or TargetRange.Max < 0 then solution is invalid
This still leaves the question - How do you come up with the set of candidate solutions?
Brute force would try all possible combinations of buckets.
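The validity check above can be sketched in Python as follows. One assumption differs from the pseudocode and is worth stating loudly: the sketch accepts a candidate when the remaining minimum is at or below 0 (instead of exactly 0), on the reading that overshooting the tub's minimum is fine as long as the maximum is never exceeded. The function name `is_valid` and the tuple layout are illustrative.

```python
def is_valid(candidate, tub_min, tub_max):
    """candidate: list of (bucket_min, bucket_max) pairs, repeats allowed."""
    target_min, target_max = tub_min, tub_max
    for bucket_min, bucket_max in candidate:
        target_min -= bucket_min
        target_max -= bucket_max
        if target_max < 0:
            return False  # the tub could already overflow
    # Assumption: reaching or passing the minimum is acceptable.
    return target_min <= 0

bucket1, bucket2 = (5, 10), (15, 25)
print(is_valid([bucket1] * 5 + [bucket2] * 5, 100, 175))  # True
print(is_valid([bucket1] * 5 + [bucket2] * 5, 100, 150))  # False
```

Returning early when the maximum goes negative is what lets a search abandon a partial candidate without enumerating its extensions.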
Here is my solution for finding the optimal solution (least number of buckets). It compares the ratio of the maximums to the ratio of the minimums, to figure out the optimal number of buckets to fill the tub.
private static void BucketProblem()
{
Range bathTub = new Range(100, 175);
List<Range> buckets = new List<Range> {new Range(5, 10), new Range(15, 25), new Range(10, 50)};
Dictionary<Range, int> result;
bool canBeFilled = SolveBuckets(bathTub, buckets, out result);
}
private static bool BucketHelper(Range tub, List<Range> buckets, Dictionary<Range, int> results)
{
Range bucket;
int startBucket = -1;
int fills = -1;
for (int i = buckets.Count - 1; i >=0 ; i--)
{
bucket = buckets[i];
double maxRatio = (double)tub.Maximum / bucket.Maximum;
double minRatio = (double)tub.Minimum / bucket.Minimum;
if (maxRatio >= minRatio)
{
startBucket = i;
if (maxRatio - minRatio > 1)
fills = (int) minRatio + 1;
else
fills = (int) maxRatio;
break;
}
}
if (startBucket < 0)
return false;
bucket = buckets[startBucket];
tub.Maximum -= bucket.Maximum * fills;
tub.Minimum -= bucket.Minimum * fills;
results.Add(bucket, fills);
return tub.Maximum == 0 || tub.Minimum <= 0 || startBucket == 0 || BucketHelper(tub, buckets.GetRange(0, startBucket), results);
}
public static bool SolveBuckets(Range tub, List<Range> buckets, out Dictionary<Range, int> results)
{
results = new Dictionary<Range, int>();
buckets = buckets.OrderBy(b => b.Minimum).ToList();
return BucketHelper(new Range(tub.Minimum, tub.Maximum), buckets, results);
}
I need a reasonably smart algorithm to come up with "nice" grid lines for a graph (chart).
For example, assume a bar chart with values of 10, 30, 72 and 60. You know:
Min value: 10
Max value: 72
Range: 62
The first question is: what do you start from? In this case, 0 would be the intuitive value but this won't hold up on other data sets so I'm guessing:
Grid min value should be either 0 or a "nice" value lower than the min value of the data in range. Alternatively, it can be specified.
Grid max value should be a "nice" value above the max value in the range. Alternatively, it can be specified (eg you might want 0 to 100 if you're showing percentages, irrespective of the actual values).
The number of grid lines (ticks) in the range should be either specified or a number within a given range (eg 3-8) such that the values are "nice" (ie round numbers) and you maximise use of the chart area. In our example, 80 would be a sensible max as that would use 90% of the chart height (72/80) whereas 100 would create more wasted space.
Anyone know of a good algorithm for this? Language is irrelevant as I'll implement it in what I need to.
I've done this with a kind of brute-force method. First, figure out the maximum number of tick marks you can fit into the space. Divide the total range of values by the number of ticks; this is the minimum spacing of the tick. Now take 10 to the power of the floor of the base-10 logarithm of that spacing to get the magnitude of the tick, and divide the spacing by this magnitude. You should end up with something in the range of 1 to 10. Simply choose the round number greater than or equal to that value and multiply it by the magnitude calculated earlier. This is your final tick spacing.
Example in Python:
import math
def BestTick(largest, mostticks):
    minimum = largest / mostticks
    magnitude = 10 ** math.floor(math.log(minimum, 10))
    residual = minimum / magnitude
    if residual > 5:
        tick = 10 * magnitude
    elif residual > 2:
        tick = 5 * magnitude
    elif residual > 1:
        tick = 2 * magnitude
    else:
        tick = magnitude
    return tick
Edit: you are free to alter the selection of "nice" intervals. One commenter appears to be dissatisfied with the selections provided, because the actual number of ticks can be up to 2.5 times less than the maximum. Here's a slight modification that defines a table for the nice intervals. In the example, I've expanded the selections so that the number of ticks won't be less than 3/5 of the maximum.
import bisect
def BestTick2(largest, mostticks):
    minimum = largest / mostticks
    magnitude = 10 ** math.floor(math.log(minimum, 10))
    residual = minimum / magnitude
    # this table must begin with 1 and end with 10
    table = [1, 1.5, 2, 3, 5, 7, 10]
    tick = table[bisect.bisect_right(table, residual)] if residual < 10 else 10
    return tick * magnitude
There are 2 pieces to the problem:
Determine the order of magnitude involved, and
Round to something convenient.
You can handle the first part by using logarithms:
range = max - min;
exponent = int(log10(range)); // Base-10 logarithm. See comment below.
magnitude = pow(10, exponent);
So, for example, if your range is from 50 - 1200, the exponent is 3 and the magnitude is 1000.
Then deal with the second part by deciding how many subdivisions you want in your grid:
value_per_division = magnitude / subdivisions;
This is a rough calculation because the exponent has been truncated to an integer. You may want to tweak the exponent calculation to handle boundary conditions better, e.g. by rounding instead of taking the int() if you end up with too many subdivisions.
I use the following algorithm. It's similar to others posted here but it's the first example in C#.
public static class AxisUtil
{
public static float CalcStepSize(float range, float targetSteps)
{
// calculate an initial guess at step size
var tempStep = range/targetSteps;
// get the magnitude of the step size
var mag = (float)Math.Floor(Math.Log10(tempStep));
var magPow = (float)Math.Pow(10, mag);
// calculate most significant digit of the new step size
var magMsd = (int)(tempStep/magPow + 0.5);
// promote the MSD to either 1, 2, or 5
if (magMsd > 5)
magMsd = 10;
else if (magMsd > 2)
magMsd = 5;
else if (magMsd > 1)
magMsd = 2;
return magMsd*magPow;
}
}
CPAN provides an implementation here (see source link)
See also Tickmark algorithm for a graph axis
FYI, with your sample data:
Maple: Min=8, Max=74, Labels=10,20,..,60,70, Ticks=10,12,14,..70,72
MATLAB: Min=10, Max=80, Labels=10,20,..,60,80
Here's another implementation in JavaScript:
var calcStepSize = function(range, targetSteps)
{
// calculate an initial guess at step size
var tempStep = range / targetSteps;
// get the magnitude of the step size
var mag = Math.floor(Math.log(tempStep) / Math.LN10);
var magPow = Math.pow(10, mag);
// calculate most significant digit of the new step size
var magMsd = Math.round(tempStep / magPow);
// promote the MSD to either 1, 2, or 5
if (magMsd > 5.0)
magMsd = 10.0;
else if (magMsd > 2.0)
magMsd = 5.0;
else if (magMsd > 1.0)
magMsd = 2.0;
return magMsd * magPow;
};
I am the author of "Algorithm for Optimal Scaling on a Chart Axis". It used to be hosted on trollop.org, but I have recently moved domains/blogging engines.
Please see my answer to a related question.
Taken from Mark above, here is a slightly more complete utility class in C# that also calculates a suitable first and last tick.
public class AxisAssists
{
public double Tick { get; private set; }
public AxisAssists(double aTick)
{
Tick = aTick;
}
public AxisAssists(double range, int mostticks)
{
var minimum = range / mostticks;
var magnitude = Math.Pow(10.0, (Math.Floor(Math.Log(minimum) / Math.Log(10))));
var residual = minimum / magnitude;
if (residual > 5)
{
Tick = 10 * magnitude;
}
else if (residual > 2)
{
Tick = 5 * magnitude;
}
else if (residual > 1)
{
Tick = 2 * magnitude;
}
else
{
Tick = magnitude;
}
}
public double GetClosestTickBelow(double v)
{
return Tick* Math.Floor(v / Tick);
}
public double GetClosestTickAbove(double v)
{
return Tick * Math.Ceiling(v / Tick);
}
}
This gives you the ability to create an instance, but if you just want to calculate the tick and throw the instance away:
double tickX = new AxisAssists(aMaxX - aMinX, 8).Tick;
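For comparison, the same idea fits into a single Python function that returns the first tick, the last tick and the tick spacing together. This is a minimal sketch that reuses the 1/2/5 selection from the answers above and assumes `data_max` is greater than `data_min`; the name `nice_axis` is illustrative.

```python
import math

def nice_axis(data_min, data_max, most_ticks):
    """Return (first_tick, last_tick, tick) covering [data_min, data_max]."""
    minimum = (data_max - data_min) / most_ticks
    magnitude = 10 ** math.floor(math.log10(minimum))
    residual = minimum / magnitude
    if residual > 5:
        tick = 10 * magnitude
    elif residual > 2:
        tick = 5 * magnitude
    elif residual > 1:
        tick = 2 * magnitude
    else:
        tick = magnitude
    # Snap the data range outward to multiples of the tick spacing.
    first = tick * math.floor(data_min / tick)
    last = tick * math.ceil(data_max / tick)
    return first, last, tick

print(nice_axis(10, 72, 8))  # (10, 80, 10)
```

With the sample data from the question (10 to 72) and at most 8 ticks, this yields an axis from 10 to 80 in steps of 10.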
I wrote an objective-c method to return a nice axis scale and nice ticks for given min- and max values of your data set:
- (NSArray*)niceAxis:(double)minValue :(double)maxValue
{
double min_ = 0, max_ = 0, min = minValue, max = maxValue, power = 0, factor = 0, tickWidth, minAxisValue = 0, maxAxisValue = 0;
NSArray *factorArray = [NSArray arrayWithObjects:@"0.0f",@"1.2f",@"2.5f",@"5.0f",@"10.0f",nil];
NSArray *scalarArray = [NSArray arrayWithObjects:@"0.2f",@"0.2f",@"0.5f",@"1.0f",@"2.0f",nil];
// calculate x-axis nice scale and ticks
// 1. min_
if (min == 0) {
min_ = 0;
}
else if (min > 0) {
min_ = MAX(0, min-(max-min)/100);
}
else {
min_ = min-(max-min)/100;
}
// 2. max_
if (max == 0) {
if (min == 0) {
max_ = 1;
}
else {
max_ = 0;
}
}
else if (max < 0) {
max_ = MIN(0, max+(max-min)/100);
}
else {
max_ = max+(max-min)/100;
}
// 3. power
power = log(max_ - min_) / log(10);
// 4. factor
factor = pow(10, power - floor(power));
// 5. nice ticks
for (NSInteger i = 0; factor > [[factorArray objectAtIndex:i]doubleValue] ; i++) {
tickWidth = [[scalarArray objectAtIndex:i]doubleValue] * pow(10, floor(power));
}
// 6. min-axisValues
minAxisValue = tickWidth * floor(min_/tickWidth);
// 7. max-axisValues
maxAxisValue = tickWidth * floor((max_/tickWidth)+1);
// 8. create NSArray to return
NSArray *niceAxisValues = [NSArray arrayWithObjects:[NSNumber numberWithDouble:minAxisValue], [NSNumber numberWithDouble:maxAxisValue],[NSNumber numberWithDouble:tickWidth], nil];
return niceAxisValues;
}
You can call the method like this:
NSArray *niceYAxisValues = [self niceAxis:-maxy :maxy];
and get your axis setup:
double minYAxisValue = [[niceYAxisValues objectAtIndex:0]doubleValue];
double maxYAxisValue = [[niceYAxisValues objectAtIndex:1]doubleValue];
double ticksYAxis = [[niceYAxisValues objectAtIndex:2]doubleValue];
Just in case you want to limit the number of axis ticks do this:
NSInteger maxNumberOfTicks = 9;
NSInteger numberOfTicks = valueXRange / ticksXAxis;
NSInteger newNumberOfTicks = floor(numberOfTicks / (1 + floor(numberOfTicks/(maxNumberOfTicks+0.5))));
double newTicksXAxis = ticksXAxis * (1 + floor(numberOfTicks/(maxNumberOfTicks+0.5)));
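To make the arithmetic in the tick-limiting snippet easier to follow, here is a small Python sketch of the same two formulas. The function name `limit_ticks` and its parameter names are hypothetical, chosen only for this illustration.

```python
import math

def limit_ticks(value_range: float, tick_width: float, max_ticks: int = 9):
    """Merge adjacent ticks until roughly `max_ticks` or fewer remain."""
    number_of_ticks = int(value_range / tick_width)
    # How many original ticks get merged into one new tick.
    merge = 1 + math.floor(number_of_ticks / (max_ticks + 0.5))
    new_number_of_ticks = number_of_ticks // merge
    new_tick_width = tick_width * merge
    return new_number_of_ticks, new_tick_width

# A range of 100 with ticks every 10 gives 10 ticks, one more than the limit of 9,
# so pairs of ticks are merged.
count, width = limit_ticks(100, 10)
```

Note that the merged width is a whole multiple of the original tick width, which is not necessarily a "nice" number itself (for example, merging three ticks of width 5 gives 15).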
The first part of the code is based on the calculation I found here, which produces a nice graph axis scale and ticks similar to Excel graphs. It works very well for all kinds of data sets.
Another idea is to have the range of the axis be the range of the values, but put the tick marks at the appropriate positions, i.e. for 7 to 22 do:
[- - - | - - - - | - - - - | - - ]
       10        15        20
As for selecting the tick spacing, I would suggest any number of the form 10^x * i / n, where i < n, and 0 < n < 10. Generate this list, and sort them, and you can find the largest number smaller than value_per_division (as in adam_liss) using a binary search.
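A minimal sketch of that suggestion, under a few assumptions of mine: the exponent x is limited to -3..3, i starts at 1 so that no spacing is zero, and exact fractions are used so that duplicates such as 1/2 and 2/4 collapse. The function names are hypothetical.

```python
import bisect
from fractions import Fraction

def candidate_spacings(min_exp: int = -3, max_exp: int = 3):
    """All values 10^x * i / n with 0 < i < n and 0 < n < 10, sorted ascending."""
    values = set()
    for x in range(min_exp, max_exp + 1):
        scale = Fraction(10) ** x
        for n in range(1, 10):
            for i in range(1, n):
                values.add(scale * Fraction(i, n))
    return sorted(values)

def pick_spacing(value_per_division: float, candidates):
    """Binary search for the largest candidate strictly below value_per_division."""
    pos = bisect.bisect_left(candidates, value_per_division)
    if pos == 0:
        raise ValueError("value_per_division is below every candidate")
    return candidates[pos - 1]

candidates = candidate_spacings()
spacing = pick_spacing(15 / 8, candidates)   # range 7..22 split into 8 divisions
print(float(spacing))
```

Because every fraction i/n is allowed, the result can be a value such as 10/6 rather than a round 1, 2 or 5; restricting n to a smaller set would give more conventional ticks.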
Using a lot of inspiration from the answers already available here, here's my implementation in C. Note that there's some extensibility built into the ndex array.
#include <math.h>

/* Returns a "nice" step so that roughly `count` steps cover 0..maxvalue. */
float findNiceDelta(float maxvalue, int count)
{
    float step = maxvalue/count,
          order = powf(10, floorf(log10(step))),
          delta = (int)(step/order + 0.5);
    static float ndex[] = {1, 1.5, 2, 2.5, 5, 10};
    static int ndexLength = sizeof(ndex)/sizeof(float);
    /* Walk ndex from the top and return the smallest entry that is >= delta. */
    for(int i = ndexLength - 2; i > 0; --i)
        if(delta > ndex[i]) return ndex[i + 1] * order;
    return delta*order;
}
In R, use
tickSize <- function(range,minCount){
logMaxTick <- log10(range/minCount)
exponent <- floor(logMaxTick)
mantissa <- 10^(logMaxTick-exponent)
af <- c(1,2,5) # allowed factors
mantissa <- af[findInterval(mantissa,af)]
return(mantissa*10^exponent)
}
where the range argument is max - min of the domain.
Here is a JavaScript function I wrote to round grid intervals (max-min)/gridLinesNumber to beautiful values. It works with any numbers; see the gist with detailed comments to find out how it works and how to call it.
var ceilAbs = function(num, to, bias) {
if (to == undefined) to = [-2, -5, -10]
if (bias == undefined) bias = 0
var numAbs = Math.abs(num) - bias
var exp = Math.floor( Math.log10(numAbs) )
if (typeof to == 'number') {
return Math.sign(num) * to * Math.ceil(numAbs/to) + bias
}
var mults = to.filter(function(value) {return value > 0})
to = to.filter(function(value) {return value < 0}).map(Math.abs)
var m = Math.abs(numAbs) * Math.pow(10, -exp)
var mRounded = Infinity
for (var i=0; i<mults.length; i++) {
var candidate = mults[i] * Math.ceil(m / mults[i])
if (candidate < mRounded)
mRounded = candidate
}
for (var i=0; i<to.length; i++) {
if (to[i] >= m && to[i] < mRounded)
mRounded = to[i]
}
return Math.sign(num) * mRounded * Math.pow(10, exp) + bias
}
Calling ceilAbs(number, [0.5]) for different numbers rounds them like this:
301573431.1193228 -> 350000000
14127.786597236991 -> 15000
-63105746.17236853 -> -65000000
-718854.2201183736 -> -750000
-700660.340487957 -> -750000
0.055717507097870114 -> 0.06
0.0008068701205775142 -> 0.00085
-8.66660070605576 -> -9
-400.09256079792976 -> -450
0.0011740548815578223 -> 0.0015
-5.3003294346854085e-8 -> -6e-8
-0.00005815960629843176 -> -0.00006
-742465964.5184875 -> -750000000
-81289225.90985894 -> -85000000
0.000901771713513881 -> 0.00095
-652726598.5496342 -> -700000000
-0.6498901364393532 -> -0.65
0.9978325804695487 -> 1
5409.4078950583935 -> 5500
26906671.095639467 -> 30000000
Check out the fiddle to experiment with the code. The code in the answer, the gist and the fiddle differ slightly; I'm using the one given in the answer.
If you are trying to get the scales to look right on VB.NET charts, the example from Adam Liss works well, but when you set the min and max scale values, make sure you pass them in from a variable of type Decimal (not Single or Double). Otherwise the tick mark values end up showing long runs of spurious decimal places.
For example, I had one chart where I set the minimum Y-axis value to 0.0001 and the maximum Y-axis value to 0.002.
If I pass these values to the chart object as singles I get tick mark values of 0.00048000001697801, 0.000860000036482233 ....
Whereas if I pass these values to the chart object as decimals I get nice tick mark values of 0.00048, 0.00086 ......
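The effect is easy to reproduce outside VB.NET. The sketch below is Python, not the chart API, so it only illustrates what happens when 0.00048 is stored in single precision and then widened, compared with a decimal type; the helper name `as_single` is my own.

```python
import struct
from decimal import Decimal

def as_single(x: float) -> float:
    """Round-trip x through IEEE 754 single precision, then widen back to double."""
    return struct.unpack("f", struct.pack("f", x))[0]

tick_single = as_single(0.00048)     # no longer exactly the double 0.00048
tick_decimal = Decimal("0.00048")    # decimal types keep the literal digits

print(repr(tick_single))
print(tick_decimal)
```

The single-precision value carries a small binary rounding error that becomes visible once it is printed with full double precision, which is why passing decimals gives clean labels.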
In Python:
import numpy as np
steps = [np.round(x) for x in np.linspace(min, max, num=num_of_steps)]
This answer can always plot 0 when asked to, handles positive and negative values as well as small and large numbers, and returns the tick interval size and how many intervals to plot. It is written in Go.
forcePlotZero changes how the max value is rounded, so that stepping back down by the interval size lands exactly on zero. Example:
if forcePlotZero == false then 237 --> 240
if forcePlotZero == true then 237 --> 300
Intervals are calculated by getting the multiple of 10/100/1000 etc for max and then subtracting till the cumulative total of these subtractions is < min
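As a rough illustration of those two ideas (not a port of the Go function, whose precision logic is more involved), the Python sketch below rounds the max up to a chosen multiple and then subtracts the interval until the minimum is covered. The helper names and the fixed steps of 10 and 100 are assumptions made for the example.

```python
import math

def round_up_to(value: float, step: float) -> float:
    """Round value up to the next multiple of step."""
    return math.ceil(value / step) * step

def walk_down(max_rounded: float, minimum: float, interval: float) -> float:
    """Subtract interval from max_rounded until the result is <= minimum."""
    v = max_rounded
    while v > minimum:
        v -= interval
    return v

max_plain = round_up_to(237, 10)     # finer rounding, zero is not guaranteed to be a tick
max_forced = round_up_to(237, 100)   # coarser rounding, ticks pass through zero

min_plain = walk_down(round_up_to(240, 10), -104, 100)
min_forced = walk_down(round_up_to(240, 100), -104, 100)
```

With the coarser rounding the ticks run 300, 200, 100, 0, -100, -200, so zero is always one of them; with the finer rounding they run 240, 140, 40, -60, -160 and skip zero.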
Here's the output from the function, including the effect of forcePlotZero:
Force to plot zero | max and min inputs | rounded max and min | intervals
forcePlotZero=false | min: -104 max: 240 | minned: -160 maxed: 240 | intervalCount: 5 intervalSize: 100
forcePlotZero=true | min: -104 max: 240 | minned: -200 maxed: 300 | intervalCount: 6 intervalSize: 100
forcePlotZero=false | min: 40 max: 1240 | minned: 0 maxed: 1300 | intervalCount: 14 intervalSize: 100
forcePlotZero=false | min: 200 max: 240 | minned: 190 maxed: 240 | intervalCount: 6 intervalSize: 10
forcePlotZero=false | min: 0.7 max: 1.12 | minned: 0.6 maxed: 1.2 | intervalCount: 7 intervalSize: 0.1
forcePlotZero=false | min: -70.5 max: -12.5 | minned: -80 maxed: -10 | intervalCount: 8 intervalSize: 10
Here's the playground link https://play.golang.org/p/1IhiX_hRQvo
func getMaxMinIntervals(max float64, min float64, forcePlotZero bool) (maxRounded float64, minRounded float64, intervalCount float64, intervalSize float64) {
//STEP 1: start off determining the maxRounded value for the axis
precision := 0.0
precisionDampener := 0.0 //adjusts to prevent 235 going to 300, instead dampens the scaling to get 240
epsilon := 0.0000001
if math.Abs(max) >= 0 && math.Abs(max) < 2 {
precision = math.Floor(-math.Log10(epsilon + math.Abs(max) - math.Floor(math.Abs(max)))) //counting number of zeros between decimal point and rightward digits
precisionDampener = 1
precision = precision + precisionDampener
} else if math.Abs(max) >= 2 && math.Abs(max) < 100 {
precision = math.Ceil(math.Log10(math.Abs(max)+1)) * -1 //else count number of digits before decimal point
precisionDampener = 1
precision = precision + precisionDampener
} else {
precision = math.Ceil(math.Log10(math.Abs(max)+1)) * -1 //else count number of digits before decimal point
precisionDampener = 2
if forcePlotZero == true {
precisionDampener = 1
}
precision = precision + precisionDampener
}
useThisFactorForIntervalCalculation := 0.0 // this is needed because intervals are calculated from the max value with a zero origin, this uses range for min - max
if max < 0 {
maxRounded = (math.Floor(math.Abs(max)*(math.Pow10(int(precision)))) / math.Pow10(int(precision)) * -1)
useThisFactorForIntervalCalculation = (math.Floor(math.Abs(max)*(math.Pow10(int(precision)))) / math.Pow10(int(precision))) + ((math.Ceil(math.Abs(min)*(math.Pow10(int(precision)))) / math.Pow10(int(precision))) * -1)
} else {
maxRounded = math.Ceil(max*(math.Pow10(int(precision)))) / math.Pow10(int(precision))
useThisFactorForIntervalCalculation = maxRounded
}
minNumberOfIntervals := 2.0
maxNumberOfIntervals := 19.0
intervalSize = 0.001
intervalCount = minNumberOfIntervals
//STEP 2: get interval size (the step size on the axis)
for {
if math.Abs(useThisFactorForIntervalCalculation)/intervalSize < minNumberOfIntervals || math.Abs(useThisFactorForIntervalCalculation)/intervalSize > maxNumberOfIntervals {
intervalSize = intervalSize * 10
} else {
break
}
}
//STEP 3: check that intervals are not too large, safety for max and min values that are close together (240, 220 etc)
for {
if max-min < intervalSize {
intervalSize = intervalSize / 10
} else {
break
}
}
//STEP 4: now we can get minRounded by adding the interval size to 0 till we get to the point where another increment would make cumulative increments > min, opposite for negative in
minRounded = 0.0
if min >= 0 {
for {
if minRounded < min {
minRounded = minRounded + intervalSize
} else {
minRounded = minRounded - intervalSize
break
}
}
} else {
minRounded = maxRounded //keep going down, decreasing by the interval size till minRounded < min
for {
if minRounded > min {
minRounded = minRounded - intervalSize
} else {
break
}
}
}
//STEP 5: get number of intervals to draw
intervalCount = (maxRounded - minRounded) / intervalSize
intervalCount = math.Ceil(intervalCount) + 1 // include the origin as an interval
//STEP 6: Check that the intervalCount isn't too high
if intervalCount-1 >= (intervalSize * 2) && intervalCount > maxNumberOfIntervals {
intervalCount = math.Ceil(intervalCount / 2)
intervalSize *= 2
}
return
}
This is in Python, for base 10.
It doesn't cover all of your questions, but I think you can build on it:
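import numpy as np

def create_ticks(lo, hi):
    # Step is the largest power of 10 that does not exceed the data range.
    s = 10 ** np.floor(np.log10(hi - lo))
    start = s * np.floor(lo / s)
    end = s * np.ceil(hi / s)
    # Build ticks by index so start is not duplicated and end is included.
    count = int(round((end - start) / s))
    return [start + i * s for i in range(count + 1)]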