Fastest approximate counting algorithm - algorithm

What's the fastest way to get an approximate count of the number of rows in an input file or stdout data stream? FYI, this is a probabilistic algorithm, and I can't find many examples online.
The data could be just one or two columns coming from an awk script or a CSV file. Let's say I want an approximate group-by on one of the columns. I would use a database GROUP BY, but there are over 6-7 billion rows. I would like the first approximate result in under 3 to 4 seconds, and then run a Bayesian update or something similar after decisions are made on the prior. Any ideas for a really rough initial group count?
If you can provide an example of the algorithm in Python or Java, that would be very helpful.

@Ben Allison's answer is a good approach if you want to count the total lines. Since you mentioned Bayes and the prior, I will add an answer in that direction to calculate the percentage of different groups. (See my comments on your question. I guess that if you have an idea of the total and you want to do a group-by, estimating the percentage of each group makes more sense.)
The recursive Bayesian update:
I will start by assuming you have only two groups, group1 and group2 (this can be extended to multiple groups; see the explanation later).
If m of the first n lines (rows) you processed are group1, we denote that event as M(m,n). You will then have seen n-m group2s, because we assume they are the only two possible groups. The conditional probability of the event M(m,n), given the fraction s of group1, is therefore given by the binomial distribution with n trials. We are trying to estimate s in a Bayesian way.
The conjugate prior for the binomial is the beta distribution. For simplicity, we choose Beta(1,1) as the prior (of course, you can pick your own values for alpha and beta), which is the uniform distribution on (0,1). Therefore, for this beta distribution, alpha=1 and beta=1.
The recursive update formulas for a binomial + beta prior are as below:
if group == 'group1':
    alpha = alpha + 1
else:
    beta = beta + 1
The posterior of s is actually also a beta distribution:
$$p(s \mid M(m,n)) = \frac{s^{m+\alpha-1}\,(1-s)^{n-m+\beta-1}}{B(m+\alpha,\; n-m+\beta)} = \mathrm{Beta}(m+\alpha,\; n-m+\beta)$$
where B is the beta function and alpha, beta in the formula are the prior parameters. To report the estimate, you can rely on the mean and variance of the Beta distribution, computed from the updated alpha and beta:
mean = alpha/(alpha+beta)
var = alpha*beta/((alpha+beta)**2 * (alpha+beta+1))
The python code: groupby.py
So a few lines of python to process your data from stdin and estimate the percentage of group1 would be something like below:
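import sys

# Beta(1,1) prior, i.e. uniform on (0,1)
alpha = 1.
beta = 1.

for line in sys.stdin:
    data = line.strip()
    if data == 'group1':
        alpha += 1.
    elif data == 'group2':
        beta += 1.
    else:
        continue  # skip lines that belong to neither group
    # posterior mean and variance of the group1 fraction after this row
    mean = alpha/(alpha+beta)
    var = alpha*beta/((alpha+beta)**2 * (alpha+beta+1))
    print('mean = %.3f, var = %.3f' % (mean, var))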
The sample data
I feed a few lines of data to the code:
group1
group1
group1
group1
group2
group2
group2
group1
group1
group1
group2
group1
group1
group1
group2
The approximate estimation result
And here is what I get as results:
mean = 0.667, var = 0.056
mean = 0.750, var = 0.037
mean = 0.800, var = 0.027
mean = 0.833, var = 0.020
mean = 0.714, var = 0.026
mean = 0.625, var = 0.026
mean = 0.556, var = 0.025
mean = 0.600, var = 0.022
mean = 0.636, var = 0.019
mean = 0.667, var = 0.017
mean = 0.615, var = 0.017
mean = 0.643, var = 0.015
mean = 0.667, var = 0.014
mean = 0.688, var = 0.013
mean = 0.647, var = 0.013
The result shows that group1 is estimated to make up 64.7% of the rows after the 15th row is processed (based on our Beta(1,1) prior). You might notice that the variance tends to shrink as more and more observations come in.
Multiple groups
Now if you have more than two groups, just change the underlying distribution from binomial to multinomial; the corresponding conjugate prior is then the Dirichlet distribution. Everything else changes analogously.
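As a rough illustration of the multi-group case, here is a minimal sketch assuming a symmetric Dirichlet prior with the same pseudo-count for every group. The function name `dirichlet_posterior` and the `prior` parameter are made up for this example and do not come from the script above.

```python
from collections import Counter

def dirichlet_posterior(labels, prior=1.0):
    """Posterior mean and variance of each group's share.

    Assumes a symmetric Dirichlet prior (pseudo-count `prior` per group)
    over the groups that actually appear in `labels`.
    """
    counts = Counter(labels)
    # Posterior Dirichlet parameter per group: prior pseudo-count + observed count
    alphas = {group: prior + n for group, n in counts.items()}
    a0 = sum(alphas.values())
    result = {}
    for group, a in alphas.items():
        mean = a / a0
        var = a * (a0 - a) / (a0 ** 2 * (a0 + 1))
        result[group] = (mean, var)
    return result

if __name__ == "__main__":
    rows = ["group1", "group1", "group2", "group3"]
    for group, (mean, var) in sorted(dirichlet_posterior(rows).items()):
        print("%s: mean = %.3f, var = %.3f" % (group, mean, var))
```

With only two groups this reduces to the Beta update above, since the Beta distribution is the two-group special case of the Dirichlet.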
Further notes
You said you would like the approximate estimate in 3-4 seconds. In this case, you just sample a portion of your data and feed the output to the above script, e.g.,
head -n100000 YOURDATA.txt | python groupby.py
That's it. Hope it helps.

If it's reasonable to assume the data are IID (so there's no bias, such as certain types of records occurring in certain parts of the stream), then just subsample and scale up the counts by the approximate size.
Take, say, the first million records (this should be processable in a couple of seconds). Its size is x units (MB, chars, whatever you care about). The full stream has size y, where y >> x. Now derive counts for whatever you care about from your sample, and simply scale them by the factor y/x to get approximate full counts. An example: you want to know roughly how many records have column 1 with value v in the full stream. The first million records have a file size of 100MB, while the total file size is 10GB. In the first million records, 150,000 of them have value v for column 1. So you assume that in the full 10GB of records, you'll see 150,000 * (10,000,000,000 / 100,000,000) = 15,000,000 with that value. Any statistics you compute on the sample can simply be scaled by the same factor to produce an estimate.
If there is bias in the data such that certain records are more or less likely to be in certain places of the file then you should select your sample records at random (or evenly spaced intervals) from the total set. This is going to ensure an unbiased, representative sample, but probably incur a much greater I/O overhead.
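The scaling step can be sketched in a few lines. This is only an illustration under the IID assumption above; the function name and its parameters (`sample_values`, `sample_bytes`, `total_bytes`) are hypothetical.

```python
from collections import Counter

def scaled_group_counts(sample_values, sample_bytes, total_bytes):
    """Scale per-value counts from a sample up to the full stream.

    sample_values: the column values seen in the sample
    sample_bytes:  size of the sample (x in the text)
    total_bytes:   size of the full stream (y in the text)
    """
    factor = total_bytes / float(sample_bytes)  # y / x
    counts = Counter(sample_values)
    return {value: count * factor for value, count in counts.items()}

if __name__ == "__main__":
    # 150 of 1,000 sampled records have value 'v'; the stream is 100x the sample
    sample = ["v"] * 150 + ["w"] * 850
    print(scaled_group_counts(sample, sample_bytes=100, total_bytes=10000))
```

The estimate is only as good as the assumption that the sample looks like the rest of the stream, which is the point of the paragraph on bias.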

Related

Calculate statistics on numbers entered by user

This was my tutorial given by the lecturer. I don't understand the question. I need guidance into the right direction.
Write an algorithm to read in a list of basketball scores (non-negative integers) one at a time from the user and output the following statistics:
Total number of games.
Total number of games scoring at least 90 points.
Percentage of games scoring at least 90 points.
The user entering a negative sentinel value indicates the end of the input. Note that the sentinel value is not used in computing the highest, lowest or average game score.
Requirements:
Write pseudo code for how you would solve each statistic
Example: total number of games
For each input score, increment
games by one
Determine the variables you will need and figure out the type of each variable
Define and initialize each variable
Determine what type of loop you are going to write
Start with statistic number one (total number of games) and get your loop to compute the total number of games. When you end your loop,
output the total number of games, and then move to problem two.
You only need to write one loop.
Write a complete algorithm for the above problem.
I've tried to understand the requirements and tried googling for an explanation in different wording, but I was unable to find one, so here is my attempt:
n = 0 // number of games
o = 0 // total number of games scoring at least 90 points
for( o = 0; o <= 90; o++ )
{
input =get user input for score
n++
o += input
}
percentage = n/o *100
output percentage
Have I correctly understood the question criteria?
EDIT Answer Attempt 1 :-
int numGames = 0; //number of games
int numTotalPoints = 0; //total number of games scoring
int userInput =0; //to Track input if negative number is enterred
double average = 0.0 //to get average of the game
double gameTo90Points =0.0; //calculate total games to reach 90 points
double percentage 0.0; //to calculate the percentage
Text.put("Input the game score");
userInput = text.getInt;
while(userInput >= 0 )
{
    numTotalPoints += userInput;
    numGames++;
    Text.put("Input the game score");
    userInput = text.getInt;
}
if(numGames = 0)
{
    Text.put("Not enough score to tabulate");
}
else
{
    average = ((double)numTotalPoints)/numGames);
    gameTo90Points = 90/average;
    percentage = (gameTo90Points/90)*100
    Text.put("Total number of games :" +numGames);
    Text.put("Total number of games scoring at least 90 points:" +gameTo90Points);
    Text.put("Percentage of games scoring at least 90 points:" +percentage);
}
As this is a task you must complete, we should not provide you with the answer to that assignment.
I will provide some comments on your current pseudo-code.
n = 0 // number of games
o = 0 // total number of games scoring at least 90 points
So far this is a good start, but it is better to use variable names that actually tell something about it (e.g. numGames, numHighScoringGames would be good candidates). Also, the assignment asks to "figure out the type of each variable". This is something you have not done yet...
for( o = 0; o <= 90; o++ )
This loop is wrong. After the loop finishes o will be a number greater than 90. But o is supposed to be a particular number of games (with a score of at least 90). This should trigger an alarm... You haven't read any input yet and you already seem to know there will be more than 90 of such games? That's not right.
The value of o should have nothing to do with whether the loop should continue or not.
input =get user input for score
Again, the data type should be determined for the variable input.
n++
This is good, but you did not take into account this part of the assignment:
The user entering a negative sentinel value indicates the end of the input.
Your code should verify if the user entered a negative sentinel value. And if so, you should not ask for more input.
o += input
The variable o is supposed to be a number of games, but now you are adding a score to it... that cannot be right. Also, you add it unconditionally... Should you not first check whether that game is "scoring at least 90 points"?
percentage = n/o *100
Here you use o as it was intended (as a number of games). But think about this... which one of the two will be greater (when not equal)? n or o? Taking that answer into account: Is your formula correct?
Secondly, could the denominator be zero? Should you protect the code from it?
output percentage
OK, but don't forget that the assignment asks for three statistics, not just one.

Why do higher learning rates in logistic regression produce NaN costs?

Summary
I am building a classifier for spam vs. ham emails using Octave and the Ling-Spam corpus; my method of classification is logistic regression.
Higher learning rates lead to NaN values being calculated for the cost, yet it does not break/decrease the performance of the classifier itself.
My Attempts
NB: My dataset is already normalised using mean normalisation.
When trying to choose my learning rate, I started with it as 0.1 and 400 iterations. This resulted in the following plot:
1 - Graph 1
When the lines completely disappear after a few iterations, it is due to a NaN value being produced; I thought this would result in broken parameter values and thus bad accuracy, but when checking the accuracy, I saw it was 95% on the test set (meaning that gradient descent was apparently still functioning). I checked different values of the learning rate and iterations to see how the graphs changed:
2 - Graph 2
The lines no longer disappeared, meaning no NaN values, BUT the accuracy was 87% which is substantially lower.
I did two more tests with more iterations and a slightly higher learning rate, and in both of them, the graphs both decreased with iterations as expected, but the accuracy was ~86-88%. No NaNs there either.
I realised that my dataset was skewed, with only 481 spam emails and 2412 ham emails. I therefore calculated the FScore for each of these different combinations, hoping to find the later ones had a higher FScore and the accuracy was due to the skew. That was not the case either - I have summed up my results in a table:
3 - Table
So there is no overfitting and the skew does not seem to be the problem; I don't know what to do now!
The only thing I can think of is that my calculations for accuracy and FScore are wrong, or that my initial debugging of the line 'disappearing' was wrong.
EDIT: This question is crucially about why the NaN values occur for those chosen learning rates. So the temporary fix I had of lowering the learning rate did not really answer my question - I always thought that higher learning rates simply diverged instead of converging, not producing NaN values.
My Code
My main.m code (bar getting the dataset from files):
numRecords = length(labels);
trainingSize = ceil(numRecords*0.6);
CVSize = trainingSize + ceil(numRecords*0.2);
featureData = normalise(data);
featureData = [ones(numRecords, 1), featureData];
numFeatures = size(featureData, 2);
featuresTrain = featureData(1:(trainingSize-1),:);
featuresCV = featureData(trainingSize:(CVSize-1),:);
featuresTest = featureData(CVSize:numRecords,:);
labelsTrain = labels(1:(trainingSize-1),:);
labelsCV = labels(trainingSize:(CVSize-1),:);
labelsTest = labels(CVSize:numRecords,:);
paramStart = zeros(numFeatures, 1);
learningRate = 0.0001;
iterations = 400;
[params] = gradDescent(featuresTrain, labelsTrain, learningRate, iterations, paramStart, featuresCV, labelsCV);
threshold = 0.5;
[accuracy, precision, recall] = predict(featuresTest, labelsTest, params, threshold);
fScore = (2*precision*recall)/(precision+recall);
My gradDescent.m code:
function [optimParams] = gradDescent(features, labels, learningRate, iterations, paramStart, featuresCV, labelsCV)
  x_axis = [];
  J_axis = [];
  J_CV = [];
  params = paramStart;
  for i=1:iterations,
    [cost, grad] = costFunction(features, labels, params);
    [cost_CV] = costFunction(featuresCV, labelsCV, params);
    params = params - (learningRate.*grad);
    x_axis = [x_axis;i];
    J_axis = [J_axis;cost];
    J_CV = [J_CV;cost_CV];
  endfor
  graphics_toolkit("gnuplot")
  plot(x_axis, J_axis, 'r', x_axis, J_CV, 'b');
  legend("Training", "Cross-Validation");
  xlabel("Iterations");
  ylabel("Cost");
  title("Cost as a function of iterations");
  optimParams = params;
endfunction
My costFunction.m code:
function [cost, grad] = costFunction(features, labels, params)
  numRecords = length(labels);
  hypothesis = sigmoid(features*params);
  cost = (-1/numRecords)*sum((labels).*log(hypothesis)+(1-labels).*log(1-hypothesis));
  grad = (1/numRecords)*(features'*(hypothesis-labels));
endfunction
My predict.m code:
function [accuracy, precision, recall] = predict(features, labels, params, threshold)
  numRecords=length(labels);
  predictions = sigmoid(features*params)>threshold;
  correct = predictions == labels;
  truePositives = sum(predictions == labels == 1);
  falsePositives = sum((predictions == 1) != labels);
  falseNegatives = sum((predictions == 0) != labels);
  precision = truePositives/(truePositives+falsePositives);
  recall = truePositives/(truePositives+falseNegatives);
  accuracy = 100*(sum(correct)/numRecords);
endfunction
Credit where it's due:
A big help here was this answer: https://stackoverflow.com/a/51896895/8959704 so this question is kind of a duplicate, but I didn't realise it, and it isn't obvious at first... I will do my best to try to explain why the solution works too, to avoid simply copying the answer.
Solution:
The issue was in fact the 0*log(0) = NaN result that occurred in my data. To fix it, in my calculation of the cost, it became:
cost = (-1/numRecords)*sum((labels).*log(hypothesis)+(1-labels).*log(1-hypothesis+eps(numRecords, 1)));
(see the question for the variables' values etc., it seems redundant to include the rest when just this line changes)
Explanation:
The eps() function is defined as follows:
Return a scalar, matrix or N-dimensional array whose elements are all
eps, the machine precision.
More precisely, eps is the relative spacing between any two adjacent
numbers in the machine’s floating point system. This number is
obviously system dependent. On machines that support IEEE floating
point arithmetic, eps is approximately 2.2204e-16 for double precision
and 1.1921e-07 for single precision.
When called with more than one argument the first two arguments are
taken as the number of rows and columns and any further arguments
specify additional matrix dimensions. The optional argument class
specifies the return type and may be either "double" or "single".
So adding eps to 1-hypothesis, which the saturated sigmoid had rounded to exactly 0, makes the argument of log() a tiny positive number instead of 0. log() then returns a finite value rather than -Inf, and the product 0*(-Inf), which is what evaluates to NaN, no longer occurs.
When testing with the learning rate as 0.1 and iterations as 2000/1000/400, the full graph was plotted and no NaN values were produced when checking.
NB: Just in case anyone was wondering, the accuracy and FScores did not change after this, so the accuracy really was that good despite the error in calculating the cost with a higher learning rate.
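To see the mechanism in isolation, here is a small Python sketch of a single record's cost term. It is only an illustration, not the Octave code above: `ieee_log` is a hypothetical helper that mimics Octave's log(0) = -Inf (Python's `math.log` raises an error instead), and the input value 40 is simply chosen large enough to saturate the sigmoid.

```python
import math
import sys

EPS = sys.float_info.epsilon  # about 2.22e-16 for IEEE doubles

def ieee_log(x):
    """log() returning -inf at 0, as Octave does (math.log would raise)."""
    return -math.inf if x == 0.0 else math.log(x)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost_term(label, z, guard=0.0):
    """Cost contribution of one record; `guard` is added inside the second log."""
    h = sigmoid(z)
    return -(label * ieee_log(h) + (1 - label) * ieee_log(1 - h + guard))

if __name__ == "__main__":
    # For z = 40 the sigmoid rounds to exactly 1.0, so 1 - h is exactly 0.0
    print(sigmoid(40.0) == 1.0)
    print(cost_term(1, 40.0))        # 0 * (-inf) inside the sum gives nan
    print(cost_term(1, 40.0, EPS))   # finite once eps is added
```

The parameters keep updating correctly because the gradient uses only hypothesis-labels and never takes a logarithm, which is consistent with the accuracy being unaffected. The mirror case (a hypothesis of exactly 0 with label 0) would hit the first log in the same way, so the same guard could be added there.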

Faster alternative to INTERSECT with 'rows' - MATLAB

I have a code written in Matlab that uses 'intersect' to find the vectors (and their indices) that intersect in two large matrices. I found that 'intersect' is the slowest line (by a large difference) in my code. Unfortunately I couldn't find a faster alternative so far.
As an example running the code below takes approx 5 seconds on my pc:
profile on
for i = 1 : 500
a = rand(10000,5);
b = rand(10000,5);
[intersectVectors, ind_a, ind_b] = intersect(a,b,'rows');
end
profile viewer
I was wondering if there is a faster way. Note that the matrices (a) and (b) have 5 columns. The number of rows doesn't necessarily have to be the same for the two matrices.
Any help would be great.
Thanks
Discussion and solution codes
You can use an approach that leverages fast matrix multiplication in MATLAB to convert those 5 columns of the input arrays into one column, by treating each column as a significant "digit" of a single number. You would thus end up with an array with only one column, and then you can use intersect or ismember without 'rows', which should speed up the code considerably.
Here are the promised implementations as function codes for easy usage -
intersectrows_fast_v1.m:
function [intersectVectors, ind_a, ind_b] = intersectrows_fast_v1(a,b)
%// Calculate equivalent one-column versions of input arrays
mult = [10^ceil(log10( 1+max( [a(:);b(:)] ))).^(size(a,2)-1:-1:0)]'; %//'
acol1 = a*mult;
bcol1 = b*mult;
%// Use intersect without 'rows' option for a good speedup
[~, ind_a, ind_b] = intersect(acol1,bcol1);
intersectVectors = a(ind_a,:);
return;
intersectrows_fast_v2.m:
function [intersectVectors, ind_a, ind_b] = intersectrows_fast_v2(a,b)
%// Calculate equivalent one-column versions of input arrays
mult = [10^ceil(log10( 1+max( [a(:);b(:)] ))).^(size(a,2)-1:-1:0)]'; %//'
acol1 = a*mult;
bcol1 = b*mult;
%// Use ismember to get indices of the common elements
[match_a,idx_b] = ismember(acol1,bcol1);
%// Now, with ismember, duplicate items are not taken care of automatically as
%// are done with intersect. So, we need to find the duplicate items and
%// remove those from the outputs of ismember
[~,a_sorted_ind] = sort(acol1);
a_rm_ind =a_sorted_ind([false;diff(sort(acol1))==0]); %//indices to be removed
match_a(a_rm_ind)=0;
intersectVectors = a(match_a,:);
ind_a = find(match_a);
ind_b = idx_b(match_a);
return;
Quick tests and conclusions
With the datasizes listed in the question, the runtimes were -
-------------------------- With original approach
Elapsed time is 3.885792 seconds.
-------------------------- With Proposed approach - Version - I
Elapsed time is 0.581123 seconds.
-------------------------- With Proposed approach - Version - II
Elapsed time is 0.963409 seconds.
The results suggest a big advantage for version I of the two proposed approaches, with a whopping speedup of around 6.7x over the original approach.
Also, please note that if you don't need any one or two of the three outputs from the original intersect with 'rows' based approach, then both the proposed approaches could be further shortened for better runtime performances!
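The "one digit per column" idea can be sketched outside MATLAB as well. The Python version below is only an illustration and assumes the rows hold non-negative integers, so that the encoding is exact; it uses `max + 1` as the base rather than a power of ten, and the function names are made up for this example. Indices are 0-based, unlike MATLAB's.

```python
def encode_rows(rows, base):
    """Collapse each row into one integer, treating columns as digits in `base`."""
    keys = []
    for row in rows:
        key = 0
        for value in row:
            key = key * base + value
        keys.append(key)
    return keys

def intersect_rows(a, b):
    """Common rows of a and b (sorted), with first-occurrence indices into each."""
    base = max(max(max(r) for r in a), max(max(r) for r in b)) + 1
    first_a, first_b = {}, {}
    for i, key in enumerate(encode_rows(a, base)):
        first_a.setdefault(key, i)   # keep the first occurrence only
    for j, key in enumerate(encode_rows(b, base)):
        first_b.setdefault(key, j)
    common = sorted(key for key in first_a if key in first_b)
    ind_a = [first_a[key] for key in common]
    ind_b = [first_b[key] for key in common]
    return [a[i] for i in ind_a], ind_a, ind_b

if __name__ == "__main__":
    a = [[1, 2], [3, 4], [1, 2], [0, 5]]
    b = [[3, 4], [9, 9], [1, 2]]
    print(intersect_rows(a, b))
```

For non-integer data such as the rand() matrices in the question, two different rows could in principle map to the same encoded value, so the result would need to be checked against the row-wise comparison.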

matlab curve fitting: restrictions on parameters

I have 5 non-parametric models, all with 5 to 8 parameters. These models are used to fit longitudinal data y(t), with t being time. Every data file is fitted by all 5 models for comparison. The models themselves cannot be altered.
For the fitting, starting values are passed to lsqcurvefit using the Levenberg-Marquardt algorithm. I've written a script for the several models and one function for the curve fitting.
When I perform the curve fitting, a lot of the parameters wander off from their starting values to extreme values. This is what I want to avoid: the parameters should stay close to their starting values and should only change within a well-defined range, or such that only curve fits within a standard deviation are included. It is important to note that these restrictions should be imposed during the curve fitting (an iterative numerical technique) and not afterwards.
The function I've written to fit models into height:
% Fit a specific model for all valid persons
try
    opts = optimoptions(@lsqcurvefit, 'Algorithm', 'levenberg-marquardt');
    [personalParams,personalRes,personalResidual] = lsqcurvefit(heightModel,initialValues,personalData(:,1),personalData(:,2),[],[],opts);
catch
    x=1;
end
The function I've written for one of my models
elseif strcmpi(model,'jpss')
% y = h_1(1-(1/(1+((t+0.75)^c_1/d_1)+((t+0.75)^c_2/d_2)+((t+0.75)^c_3/d_3)))
% heightModel = @(params,ages) params(1).*(1-1./(1+((ages+0.75).^params(2))./params(3) + ((ages+0.75).^params(4))./params(5) + ((ages+0.75).^params(6))./params(7)));
heightModel = @(params,ages) params(1).*(1-1./(1+(((ages+0.75)./params(3)).^params(2)) + (((ages+0.75)./params(5)).^params(4)) + ((ages+0.75)./params(7)).^params(6))); % Adapted 25/07
modelStrings = {'h1','c1','d1','c2','d2','c3','d3'};
% Define initial values
if strcmpi('male',gender)
initialValues = [174.8 0.6109 2.9743 3.614 9.88 22.393 13.59];
else
initialValues = [162.7 0.6546 2.43 4.011 8.579 18.394 11.846];
end
What I would like to do:
Is it possible to place restrictions on every starting value in initialValues? I think putting one fixed set of restrictions on lsqcurvefit wouldn't be a good idea, since there are different models with different starting values and different allowed ranges.
I had two things in mind:
1. Defining a range around each of the initial values, e.g. for
initialValues = [162.7 0.6546 2.43 4.011 8.579 18.394 11.846]
range a1 = [150,180], range a2 = [0.3,0.8], and so on.
2. Placing lb and ub restrictions separately on all my initial values in lsqcurvefit, depending on the model:
if heightModel = 'name model'
ub = initial value * 1.2 and lb = initial value * 0.8
Can someone give me some hints or pointers because I can't make it work.
Thanks in advance
Lucy
Could somebody help me out
You state: there are different models with different starting values and different ranges that are allowed. This is where you can use ub and lb. How to do this is outlined in the lsqcurvefit documentation:
X=LSQCURVEFIT(FUN,X0,XDATA,YDATA,LB,UB) defines a set of lower and
upper bounds on the design variables, X, so that the solution is in the
range LB <= X <= UB. Use empty matrices for LB and UB if no bounds
exist. Set LB(i) = -Inf if X(i) is unbounded below; set UB(i) = Inf if
X(i) is unbounded above.
For instance in the following example the parameters are constrained within limits during the fit. The lower bound (lb) and upper bound (ub) are set to 20% below and above the starting values, respectively.
heightModel = @(params,ages) abs(params(1).*(1-1./(1+(params(2).* (ages+params(8) )).^params(5) +(params(3).* (ages+params(8) )).^params(6) +(params(4) .*(ages+params(8) )).^params(7) )));
initialValues = [161.92 0.4173 0.1354 0.090 0.540 2.87 14.281 0.3701];
lb = 0.8*initialValues; % <-- lower bound is 20% smaller than initial par values
ub = 1.2*initialValues;
[parsout,resnorm,residual] = lsqcurvefit(heightModel,initialValues,t,ht,lb,ub);
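Note that lb = 0.8*initialValues and ub = 1.2*initialValues only bracket the starting point when every starting value is positive, as they are here. If a model has negative starting values, the two bounds swap roles. A small sketch of a sign-safe version, written in Python purely for illustration (the helper name `bounds_around` and the `fraction` parameter are hypothetical):

```python
def bounds_around(initial_values, fraction=0.2):
    """Lower/upper bounds a given fraction below/above each starting value.

    Uses the absolute value for the margin, so lower <= start <= upper
    also holds for negative starting values.
    """
    lower = [v - fraction * abs(v) for v in initial_values]
    upper = [v + fraction * abs(v) for v in initial_values]
    return lower, upper

if __name__ == "__main__":
    lb, ub = bounds_around([161.92, 0.4173, -2.5])
    print(lb)
    print(ub)
```

Because the bounds are derived from each model's own starting values, the same helper can be reused for all five models without hard-coding a range per parameter.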

Algorithm For Ranking Items

I have a list of 6500 items that I would like to trade or invest in. (Not for real money, but for a certain game.) Each item has 5 numbers that will be used to rank it among the others.
Total quantity of item traded per day: The higher this number, the better.
The Donchian Channel of the item over the last 5 days: The higher this number, the better.
The median spread of the price: The lower this number, the better.
The spread of the 20 day moving average for the item: The lower this number, the better.
The spread of the 5 day moving average for the item: The higher this number, the better.
All 5 numbers have the same 'weight'; in other words, they should all affect the final number with the same worth or value.
At the moment, I just multiply all 5 numbers for each item, but that doesn't rank the items the way I would like them to be ranked. I just want to combine all 5 numbers into a weighted number that I can use to rank all 6500 items, but I'm unsure how to do this correctly or mathematically.
Note: The total quantity of the item traded per day and the Donchian channel are numbers that are much higher than the spreads, which are more like percentages. This is probably the reason why multiplying them all together didn't work for me; the quantity traded per day and the Donchian channel had a much bigger role in the final number.
The reason people are having trouble answering this question is we have no way of comparing two different "attributes". If there were just two attributes, say quantity traded and median price spread, would (20million,50%) be worse or better than (100,1%)? Only you can decide this.
Converting everything into the same size numbers could help, this is what is known as "normalisation". A good way of doing this is the z-score which Prasad mentions. This is a statistical concept, looking at how the quantity varies. You need to make some assumptions about the statistical distributions of your numbers to use this.
Things like spreads are probably normally distributed - shaped like a normal distribution. For these, as Prasad says, take z(spread) = (spread-mean(spreads))/standardDeviation(spreads).
Things like the quantity traded might be a Power law distribution. For these you might want to take the log() before calculating the mean and sd. That is the z score is z(qty) = (log(qty)-mean(log(quantities)))/sd(log(quantities)).
Then just add up the z-score for each attribute.
To do this for each attribute you will need to have an idea of its distribution. You could guess but the best way is plot a graph and have a look. You might also want to plot graphs on log scales. See wikipedia for a long list.
You can replace each attribute-vector x (of length N = 6500) by the z-score of the vector Z(x), where
Z(x) = (x - mean(x))/sd(x).
This would transform them into the same "scale", and then you can add up the Z-scores (with equal weights) to get a final score, and rank the N=6500 items by this total score. If you can find in your problem some other attribute-vector that would be an indicator of "goodness" (say the 10-day return of the security?), then you could fit a regression model of this predicted attribute against these z-scored variables, to figure out the best non-uniform weights.
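A minimal sketch of the z-score approach, under a few assumptions that are not in the answers above: items are tuples of numbers, "lower is better" attributes are handled by flipping the sign of their z-score, and attributes listed in `log_columns` are log-transformed first (as suggested for quantities). All names here are illustrative.

```python
import math
import statistics

def z_scores(values):
    """Z-score of each value; assumes the values are not all identical."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def combined_scores(items, higher_is_better, log_columns=()):
    """Sum of equally weighted z-scores per item (higher total = better)."""
    totals = [0.0] * len(items)
    for col, higher in enumerate(higher_is_better):
        column = [item[col] for item in items]
        if col in log_columns:
            column = [math.log(v) for v in column]  # needs strictly positive values
        sign = 1.0 if higher else -1.0
        for i, z in enumerate(z_scores(column)):
            totals[i] += sign * z
    return totals

if __name__ == "__main__":
    # (quantity traded, median spread): quantity higher = better, spread lower = better
    items = [(1000, 0.05), (10, 0.20), (100, 0.10)]
    totals = combined_scores(items, [True, False], log_columns={0})
    ranking = sorted(range(len(items)), key=lambda i: -totals[i])
    print(ranking)
```

Non-uniform weights, such as those from the regression mentioned above, would simply multiply each column's z-score before it is added.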
Start each item with a score of 0. For each of the 5 numbers, sort the list by that number and add each item's ranking in that sorting to its score. Then, just sort the items by the combined score.
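The rank-sum idea can be sketched as follows. This is a minimal illustration in which rank 0 is the best position for each attribute and the lowest combined score wins; the function name and the `higher_is_better` flags are assumptions made for the example.

```python
def rank_sum_order(items, higher_is_better):
    """Return item indices ordered from best to worst by summed per-attribute rank."""
    scores = [0] * len(items)
    for col, higher in enumerate(higher_is_better):
        # Best item first: descending when higher is better, ascending otherwise
        order = sorted(range(len(items)), key=lambda i: items[i][col], reverse=higher)
        for rank, i in enumerate(order):
            scores[i] += rank
    return sorted(range(len(items)), key=lambda i: scores[i])

if __name__ == "__main__":
    # (quantity traded, median spread)
    items = [(1000, 0.05), (10, 0.20), (100, 0.10)]
    print(rank_sum_order(items, [True, False]))
```

Ranks ignore how far apart the values are, which sidesteps the scale problem entirely but also discards information about the size of the differences.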
You would usually normalize your data entries to their respective range. Since there is no fixed range for them, you'll have to use a sliding range - or, to keep it simpler, normalize them to the daily ranges.
For each day, get all entries for a given type, get the highest and the lowest of them, determine the difference between them. Let Bottom=value of the lowest, Range=difference between highest and lowest. Then you calculate for each entry (value - Bottom)/Range, which will result in something between 0.0 and 1.0. These are the numbers you can continue to work with, then.
Pseudocode (brackets replaced by indentation to make easier to read):
double maxvalues[5];
double minvalues[5];
// init arrays with any item
for(i=0; i<5; i++)
    maxvalues[i] = items[0][i];
    minvalues[i] = items[0][i];
// find minimum and maximum values
foreach (items as item)
    for(i=0; i<5; i++)
        if (minvalues[i] > item[i])
            minvalues[i] = item[i];
        if (maxvalues[i] < item[i])
            maxvalues[i] = item[i];
// now scale them - in this case, to the range of 0 to 1.
double scaledItems[sizeof(items)][5];
for(i=0; i<5; i++)
    double delta = maxvalues[i] - minvalues[i];
    for(j=sizeof(items)-1; j>=0; --j)
        // linear normalization
        scaledItems[j][i] = (items[j][i] - minvalues[i]) / delta;
Something like that. It'll be more elegant with a good library (STL, Boost, whatever you have on the implementation platform), and the normalization should be in a separate function, so you can replace it with other variations like log() as the need arises.
Total quantity of item traded per day: The higher this number, the better. (a)
The Donchian Channel of the item over the last 5 days: The higher this number, the better. (b)
The median spread of the price: The lower this number, the better. (c)
The spread of the 20 day moving average for the item: The lower this number, the better. (d)
The spread of the 5 day moving average for the item: The higher this number, the better. (e)
a + b -c -d + e = "score" (higher score = better score)
