Method to choose overall winner across multiple categories [migrated] - algorithm

I have four numeric variables. All of them are measures of soil quality: the higher the variable, the higher the quality. Their ranges differ:
Var1 from 1 to 10
Var2 from 1000 to 2000
Var3 from 150 to 300
Var4 from 0 to 5
I need to combine the four variables into a single soil quality score that will successfully rank-order the samples.
My idea is very simple: standardize all four variables, sum them up, and whatever you get is the score, which should rank-order. Do you see any problem with applying this approach? Is there any other (better) approach that you would recommend?
Thanks
Edit:
Thanks guys. A lot of discussion went into "domain expertise"... agriculture stuff... whereas I expected more stats talk. In terms of technique, I will probably use simple z-score summation plus logistic regression as an experiment. Because the vast majority of samples (90%) have poor quality, I'm going to combine the three quality categories into one and basically have a binary problem (some quality vs. no quality). That kills two birds with one stone: I increase my sample in terms of event rate, and I make use of the experts by getting them to classify my samples. The expert-classified samples will then be used to fit a logistic regression model that maximizes concordance/discordance with the experts. How does that sound to you?
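For concreteness, a minimal sketch of that plan in Python, using synthetic data in place of the real samples and a hypothetical binary expert label called "quality"; none of these names come from the original post.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
soil = pd.DataFrame({
    "Var1": rng.uniform(1, 10, n),
    "Var2": rng.uniform(1000, 2000, n),
    "Var3": rng.uniform(150, 300, n),
    "Var4": rng.uniform(0, 5, n),
    "quality": rng.integers(0, 2, n),   # stand-in for the binary expert label
})
vars_ = ["Var1", "Var2", "Var3", "Var4"]

# Simple z-score summation: standardize each variable, then add them up.
z = (soil[vars_] - soil[vars_].mean()) / soil[vars_].std()
soil["zscore_sum"] = z.sum(axis=1)

# Logistic regression fit to the expert labels; the fitted coefficients act as
# data-driven weights instead of the implicit equal weights above.
model = LogisticRegression().fit(z, soil["quality"])
soil["logreg_score"] = model.predict_proba(z)[:, 1]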

The proposed approach may give a reasonable result, but only by accident. At this distance--that is, taking the question at face value, with the meanings of the variables disguised--some problems are apparent:
It is not even evident that each variable is positively related to "quality." For example, what if a 10 for 'Var1' means the "quality" is worse than the quality when Var1 is 1? Then adding it to the sum is about as wrong a thing as one can do; it needs to be subtracted.
Standardization implies that "quality" depends on the data set itself. Thus the definition will change with different data sets or with additions and deletions to these data. This can make the "quality" into an arbitrary, transient, non-objective construct and preclude comparisons between datasets.
There is no definition of "quality". What is it supposed to mean? Ability to block migration of contaminated water? Ability to support organic processes? Ability to promote certain chemical reactions? Soils good for one of these purposes may be especially poor for others.
The problem as stated has no purpose: why does "quality" need to be ranked? What will the ranking be used for--input to more analysis, selecting the "best" soil, deciding a scientific hypothesis, developing a theory, promoting a product?
The consequences of the ranking are not apparent. If the ranking is incorrect or inferior, what will happen? Will the world be hungrier, the environment more contaminated, scientists more misled, gardeners more disappointed?
Why should a linear combination of variables be appropriate? Why shouldn't they be multiplied or exponentiated or combined as a posynomial or something even more esoteric?
Raw soil quality measures are commonly re-expressed. For example, log permeability is usually more useful than the permeability itself and log hydrogen ion activity (pH) is much more useful than the activity. What are the appropriate re-expressions of the variables for determining "quality"?
One would hope that soils science would answer most of these questions and indicate what the appropriate combination of the variables might be for any objective sense of "quality." If not, then you face a multi-attribute valuation problem. The Wikipedia article lists dozens of methods for addressing this. IMHO, most of them are inappropriate for addressing a scientific question. One of the few with a solid theory and potential applicability to empirical matters is Keeney & Raiffa's multiple attribute valuation theory (MAVT). It requires you to be able to determine, for any two specific combinations of the variables, which of the two should rank higher. A structured sequence of such comparisons reveals (a) appropriate ways to re-express the values; (b) whether or not a linear combination of the re-expressed values will produce the correct ranking; and (c) if a linear combination is possible, it will let you compute the coefficients. In short, MAVT provides algorithms for solving your problem provided you already know how to compare specific cases.

Has anyone looked at Russell G. Congalton's 'Review of Assessing the Accuracy of Classifications of Remotely Sensed Data' (1990)? It describes a technique known as the error matrix for varying matrices, and also a term he calls 'normalizing data', whereby one takes all the different vectors and 'normalizes', or rescales, them to the range 0 to 1. You basically change all vectors to equal ranges from 0 to 1.

One other thing you did not discuss is the scale of the measurements. Var1 and Var4 look like they are on a rank-order scale and the others seem not to be, so standardization may be skewing the score. You may be better off transforming all of the variables into ranks and determining a weight for each variable, since it is highly unlikely that they all deserve the same weight. Equal weighting is more of a "know nothing" default. You might want to do some correlation or regression analysis to come up with a priori weights.

I had a similar problem recently, so I thought I'd add my approach to the nice answers here. To find out which weighting of the variables leads to the best ranking, one could transform the problem into a grid-search approach:
Basically, use a combined score for the ranking, composed as follows:
Final_score = Var1 * A + Var2 * B + Var3 * C ...
Then you can compute the final score with different values for A, B, C (a simple grid search, with sklearn or plain Python, could be used) and compare the resulting ranking to an expected ranking (some ground truth is needed to determine the goodness of your ranking). The best parameters give you the weights of your individual variables, as sketched below.
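A minimal sketch of such a weight search, assuming simulated data and a made-up ground-truth ranking (real expert rankings would replace both), and using Spearman correlation as the comparison criterion:

import numpy as np
from itertools import product
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
X = rng.random((50, 4))                             # 50 samples, 4 standardized variables
true_ranking = X @ np.array([0.5, 0.2, 0.2, 0.1])   # pretend ground truth for illustration

best_weights, best_corr = None, -np.inf
grid = np.linspace(0.0, 1.0, 6)                     # candidate values for each weight
for w in product(grid, repeat=4):
    if sum(w) == 0:
        continue
    score = X @ np.array(w)
    corr, _ = spearmanr(score, true_ranking)        # agreement with the expected ranking
    if corr > best_corr:
        best_weights, best_corr = w, corr

print(best_weights, best_corr)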

Following up on Ralph Winters' answer, you might use PCA (principal component analysis) on the matrix of suitably standardized scores. This will give you a "natural" weight vector that you can use to combine future scores.
Do this also after all scores have been transformed into ranks. If the results are very similar, you have good reasons to continue with either method. If there are discrepancies, this will lead to interesting questions and a better understanding.
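A hedged scikit-learn sketch of the PCA idea, on simulated data standing in for the four soil variables; the rank-based variant mentioned above is included for comparison.

import numpy as np
from scipy.stats import rankdata
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.random((100, 4))                 # stand-in for the four soil variables

Z = StandardScaler().fit_transform(X)
pca = PCA(n_components=1).fit(Z)
weights = pca.components_[0]             # "natural" weight vector
scores = Z @ weights                     # combined quality score

# Rank-based variant: replace each column by its ranks before the PCA.
R = np.apply_along_axis(rankdata, 0, X)
weights_rank = PCA(n_components=1).fit(StandardScaler().fit_transform(R)).components_[0]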

Related

Importance of covariance

If I have calculated the coefficient of correlation, I already have an idea of the covariance. But I have seen many data scientists calculate the covariance after it. If I have the coefficient of correlation, I can say whether the data are positively or negatively correlated and how strongly, while covariance tells me the same thing without the strength. Then what is the importance of covariance if I already have the coefficient of correlation?
Please suggest; apologies if my question is of low importance.
The correlation and covariance are strictly related; indeed, rho = Cov(X,Y)/(sigma_X*sigma_Y).
However, the units of a covariance are hard to interpret. For example, if we wanted to know the covariance between wages paid to employees and the number of employees at a firm, it can be shown that by converting the wages from dollars to cents, we would increase the covariance by a factor of 100. This is odd given that the underlying relationship shouldn't be different if we are talking about dollars or cents. Another way to express this is:
Cov(a*X,Y)=a*Cov(X,Y)
The correlation is always bounded between -1 and 1 and is easier to interpret.
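A quick numerical illustration of the scaling property Cov(a*X,Y)=a*Cov(X,Y) with made-up wage data: converting dollars to cents multiplies the covariance by 100 but leaves the correlation unchanged.

import numpy as np

rng = np.random.default_rng(0)
employees = rng.integers(5, 500, size=200)
wages_dollars = 30000 + 50 * employees + rng.normal(0, 5000, size=200)
wages_cents = 100 * wages_dollars

print(np.cov(wages_dollars, employees)[0, 1])    # some covariance c
print(np.cov(wages_cents, employees)[0, 1])      # roughly 100 * c
print(np.corrcoef(wages_dollars, employees)[0, 1],
      np.corrcoef(wages_cents, employees)[0, 1])  # identical correlations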
I tend to see correlation versus covariance as an opposition between a quick, dry mathematical overview of a relation and a more raw analysis of it. Imagine yourself joining a project in a field you know approximately nothing about:
if a team member gives you a correlation coefficient for two key variables/indicators associated with the project, you will be able to extract all the information from this coefficient immediately, without knowing the samples' respective scales
if they give you a covariance, you will probably want to have a look at the data to appreciate exactly what it implies
Covariance is easily understood when the samples being compared live on a similar scale or have a similar nature, since the value you are considering does not try to compare two completely different things with an intuitively absurd compromise in nature/scale (remember that to compute the covariance you use products of two possibly very different quantities, (x-mean(x))(y-mean(y))). Correlation being standardized, issues associated with varying scales and natures in the data are simply absent from your indicator, leading to that "easier interpretation" feeling.
One should therefore realize that while correlation can make it easier to understand a mathematical relationship, it obfuscates the actual nature of the data you're playing with. Looking at both can only help you appreciate what's going on with your samples, and that's probably why you'll want to consider both. If you aren't convinced, you can also read this related stats.stackexchange question.
In case you wonder why you would want to stay close to the nature and scales of your data while trying to highlight relations between samples, a good example would be the efforts deployed in AI to extract useful features from images to feed models: you want to emphasize discriminatory descriptions of the data without filtering out other potentially interesting information through standardization. See for example this paper that uses covariance matrices to build a dictionary on images.

Uncertainty versus randomness

I would like to know the difference between uncertainty and randomness in a mathematical fashion. I tried to find it out but got confused, as some people say they are the same. Can anyone provide the logical reasoning behind this? If they are not the same, please explain why.
Don't get too hung up on it.
People use different words in different situations.
It's not so much that they have different meanings, as that their meanings are situation-dependent.
Randomness is just a fuzzy general term meaning something is random.
In statistics, uncertainty is used to mean that some property of a distribution, such as its mean, is itself unknown but can be given a distribution.
For example, suppose you want to know the average weight of all people.
You could find it out exactly if you could go around to all people, get their weight, add it all up, and divide by the number of people.
But that's too hard to do, so suppose you just pick 10 people at random and get their average weight, and pretend it's the same as the average of everybody.
That's called the sample mean, but you know it isn't accurate.
It has what is called a standard error, meaning it has uncertainty.
In fact, if you were to do that experiment many times over with different people, you would get a different sample mean every time, and those sample means would themselves form a bell-shaped distribution, the standard deviation of which would be called the standard error, representing its uncertainty.
In general, if you increased the number of people you look at by a factor of 100, you can reduce the standard error, the uncertainty, by a factor of 10.
I bet you can tell that people who take polls for a living care about this stuff very much.
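As a small illustration of that last point, a hedged simulation sketch (with an invented population of weights): the spread of repeated sample means, the standard error, shrinks by roughly a factor of 10 when the sample size grows by a factor of 100.

import numpy as np

rng = np.random.default_rng(42)
population = rng.normal(loc=70, scale=15, size=1_000_000)   # hypothetical weights (kg)

for n in (10, 1000):
    sample_means = [rng.choice(population, size=n).mean() for _ in range(2000)]
    print(n, np.std(sample_means))   # second value is roughly a tenth of the first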
EDIT for the downvoter: In case the downvote is because this doesn't look like a stackoverflow question or answer,
I've made a point of advocating the random pausing method of profiling.
Profiling in large part is perceived to be about measuring (statistically) the time that programming constructs are responsible for.
Often people are inhibited from using that method because they are afraid the results have too much uncertainty.
This post gets very specific about what that uncertainty actually is.
It shows that the bogey-man fear of uncertainty has the effect of preventing people from finding really substantial speedups in their code.
So naiveté about statistics is definitely a serious programming problem.
My view looks at a scenario using three differently coloured balls:
I love some of the answers given here. My own view, based on my current research, is that these are two distinct terms. Uncertainty refers to not knowing in advance which ball could be selected when a person, for instance, is given a chance to select one ball from three differently coloured balls.
This remains true when each ball has an equal chance of being selected, i.e. equal probabilities. However, things soon get complex when each ball has its own distinct probability. Chances are that the one with the highest probability will be selected. This seems especially true in algorithm development, which would almost always select the highest probability, compromising the meaning of randomness.
Having said all of this, I believe these concepts remain confusing, which has just made me realise how much time I need to dedicate to clearly distinguishing between the two to make sure my current research is not confused. My own predicament is that I need to work on stochastic vs. deterministic views. On the current view, stochastic would be more uncertain than random, whereas deterministic would be more probability-based, i.e. knowing for certain that the highest probability would be chosen; but this seems very far from the truth.
It seems as if uncertainty holds until just before a ball is selected/touched and loses its meaning as soon as the ball is picked, at which point its probability should be revised. I personally think the terms have theoretical differences, which perhaps still allows them to be used interchangeably.
Uncertainty in math and science typically means there are a lack of facts, or the facts are unobtainable. Weather forecasting is a great example of uncertainty.
Randomness has many definitions. Commonly it's used in probability / statistics as a measure or quantification of uncertainty. So in my weather example, a 30% chance of rain is a measure of uncertainty. The more general definition (which also applies to math / science) is unpredictable, or lack of order.
There is definitely a fuzzy distinction between the two.
According to the Bayesian interpretation of probability, uncertainty and randomness are just two names for the same thing.
If an experiment is random, then it is uncertain to you. If something is uncertain to you, then it has the randomness property.

What are the differences between genetic algorithms and evolution strategies?

I've read a couple of introductory sections of books as well as a few papers on both topics, and it looks to me that these two methods are pretty much exactly the same. That said, I haven't had the time to actually deeply research the topics yet, so I might be wrong.
What are the distinctions between genetic algorithms and evolution strategies? What makes them different, and where are they similar?
In evolution strategies, the individuals are coded as vectors of real numbers. On reproduction, parents are selected randomly, and the fittest offspring are selected and inserted into the next generation. ES individuals are self-adapting: the step size or "mutation strength" is encoded in the individual, so good parameters get to the next generation by selecting good individuals.
In genetic algorithms, the individuals are coded as integers. Selection is done by choosing parents with probability proportional to their fitness, so individuals must be evaluated before the first selection is done. Genetic operators work at the bit level (e.g. cutting a bit string into multiple pieces and interchanging them with pieces from the other parent, or flipping single bits).
That's the theory. In practice, it is sometimes hard to distinguish between the two kinds of evolutionary algorithms, and you may need to create hybrid algorithms (e.g. integer (bit-string) individuals that encode the parameters of the genetic operators). A toy sketch of both flavours is given below.
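A rough, non-authoritative sketch of both flavours on toy problems, just to make the structural difference concrete: a self-adapted step size travelling with the ES individual versus bit strings, fitness-proportional selection, and crossover in the GA. All constants are arbitrary.

import numpy as np

rng = np.random.default_rng(0)

# (1+1)-ES on a real-valued problem: minimize the squared norm of x.
def simple_es(dim=5, iters=200):
    x, sigma = rng.normal(size=dim), 1.0
    for _ in range(iters):
        sigma_child = sigma * np.exp(0.2 * rng.normal())    # mutate the strategy parameter
        child = x + sigma_child * rng.normal(size=dim)      # then the object parameters
        if np.sum(child**2) < np.sum(x**2):                 # keep the fitter individual
            x, sigma = child, sigma_child
    return x

# Minimal GA on a bit-string problem: maximize the number of ones.
def simple_ga(bits=20, pop_size=30, gens=100):
    pop = rng.integers(0, 2, size=(pop_size, bits))
    for _ in range(gens):
        fitness = pop.sum(axis=1) + 1e-9
        probs = fitness / fitness.sum()                     # fitness-proportional selection
        parents = pop[rng.choice(pop_size, size=pop_size, p=probs)]
        a, b = parents[::2], parents[1::2]
        point = rng.integers(1, bits)                       # one-point crossover
        children = np.vstack([np.hstack([a[:, :point], b[:, point:]]),
                              np.hstack([b[:, :point], a[:, point:]])])
        flip = rng.random(children.shape) < 0.01            # bit-flip mutation
        pop = np.where(flip, 1 - children, children)
    return pop[pop.sum(axis=1).argmax()]

print(simple_es())
print(simple_ga())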
Just stumbled on this thread when researching Evolution Strategies (ES).
As Paul noticed before, the encoding is not really the difference here, as this is an implementation detail of specific algorithms, although it seems more common in ES.
To answer the question, we first need to do a small step back and look at internals of an ES algorithm.
In ES there is a concept of endogenous and exogenous parameters of the evolution. Endogenous parameters are associated with individuals and therefore are evolved together with them, exogenous are provided from "outside" (e.g. set constant by the developer, or there can be a function/policy which sets their value depending on the iteration no).
The individual k consists therefore of two parts:
y(k) - a set of object parameters (e.g. a vector of real/int values) which denote the individual genotype
s(k) - a set of strategy parameters (e.g. a vector of real/int values again) which can, for example, control statistical properties of the mutation
Those two vectors are being selected, mutated, recombined together.
The main difference between GA and ES is that in a classic GA there is no distinction between types of algorithm parameters. In fact all the parameters are set from "outside", so in ES terms they are all exogenous.
There are also other, minor differences: e.g. in ES the selection policy is usually one and the same, while in GA there are multiple different approaches which can be interchanged.
You can find a more detailed explanation here (see Chapter 3): Evolution strategies. A comprehensive introduction
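As an illustration of the y(k)/s(k) split described above, a small hedged sketch of an ES individual carrying both vectors and mutating them together; the learning rate tau is a common textbook choice, not something from this thread.

import numpy as np

rng = np.random.default_rng(0)

class ESIndividual:
    def __init__(self, dim):
        self.y = rng.normal(size=dim)   # object parameters (the genotype)
        self.s = np.ones(dim)           # strategy parameters (per-coordinate step sizes)

    def mutate(self):
        tau = 1.0 / np.sqrt(2 * len(self.y))
        child = ESIndividual(len(self.y))
        child.s = self.s * np.exp(tau * rng.normal(size=len(self.s)))  # evolve strategy params
        child.y = self.y + child.s * rng.normal(size=len(self.y))      # then the object params
        return child

parent = ESIndividual(5)
offspring = parent.mutate()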
In most newer textbooks on GA, real-valued coding is introduced as an alternative to the integer one, i.e. individuals can be coded as vectors of real numbers. This is called continuous parameter GA (see e.g. Haupt & Haupt, "Practical Genetic Algorithms", J.Wiley&Sons, 1998). So this is practically identical to ES real number coding.
With respect to parent selection, there are many different strategies published for GAs. I don't know them all, but I assume selection among all individuals (not only the best) has been used for some applications.
The main difference seems to be that a genetic algorithm represents a solution using a sequence of integers, whereas an evolution strategy uses a sequence of real numbers -- reference: http://en.wikipedia.org/wiki/Evolutionary_algorithm#
As the Wikipedia source (http://en.wikipedia.org/wiki/Genetic_algorithm) and @Vaughn Cato said, the difference between the two techniques lies in the implementation: ES uses real numbers and GA uses integers.
However, in practice I think you could use integers or real numbers in the formulation of your problem and in your program; it depends on you. For instance, for protein folding you can say the set of dihedral angles forms a vector. This is a vector of real numbers, but the entries are labeled by integers, so I think you can formulate your problem and write your program based on integer arithmetic. It is just an idea.

What are good algorithms for detecting abnormality?

Background
Here is the problem:
A black box outputs a new number each day.
Those numbers have been recorded for a period of time.
Detect when a new number from the black box falls outside the pattern of numbers established over the time period.
The numbers are integers, and the time period is a year.
Question
What algorithm will identify a pattern in the numbers?
The pattern might be simple, like always ascending or always descending, or the numbers might fall within a narrow range, and so forth.
Ideas
I have some ideas, but am uncertain as to the best approach, or what solutions already exist:
Machine learning algorithms?
Neural network?
Classify normal and abnormal numbers?
Statistical analysis?
Cluster your data.
If you don't know how many modes your data will have, use something like a Gaussian Mixture Model (GMM) along with a scoring function (e.g., the Bayesian Information Criterion (BIC)) so you can automatically detect the likely number of clusters in your data. I recommend this instead of k-means if you have no idea what value k is likely to be. Once you've constructed a GMM for your data for the past year, given a new datapoint x, you can calculate the probability that it was generated by any one of the clusters (each modeled by a Gaussian in the GMM). If your new data point has low probability of being generated by any one of your clusters, it is very likely a true outlier.
If this sounds a little too involved, you will be happy to know that the entire GMM + BIC procedure for automatic cluster identification has been implemented for you in the excellent MCLUST package for R. I have used it several times to great success for such problems.
Not only will it allow you to identify outliers, you will have the ability to put a p-value on a point being an outlier if you need this capability (or want it) at some point.
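MCLUST is R-specific; for readers working in Python, here is a rough equivalent sketch with scikit-learn. The data, the component range 1-5, and the 1st-percentile threshold are all illustrative choices, not part of the original answer.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
history = rng.normal(100, 10, size=(365, 1))       # a year of daily numbers

models = [GaussianMixture(n_components=k, random_state=0).fit(history)
          for k in range(1, 6)]
best = min(models, key=lambda m: m.bic(history))   # lowest BIC wins

threshold = np.percentile(best.score_samples(history), 1)
new_point = np.array([[160.0]])
is_outlier = best.score_samples(new_point)[0] < threshold   # very low likelihood -> outlier
print(is_outlier)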
You could try line-fitting prediction using linear regression and see how it goes; it would be fairly easy to implement in your language of choice.
After you have fitted a line to your data, you could calculate the standard deviation of the points around the line.
If the novel point lies within the trend line +- the standard deviation, it should not be regarded as an abnormality (see the sketch below).
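A minimal sketch of that idea, assuming simulated daily values and an arbitrary cutoff of two standard deviations:

import numpy as np

days = np.arange(365)
values = 0.5 * days + np.random.default_rng(0).normal(0, 5, size=365)

slope, intercept = np.polyfit(days, values, deg=1)   # least-squares trend line
residuals = values - (slope * days + intercept)
sigma = residuals.std()

def is_abnormal(day, value, k=2.0):
    # Flag the point if its residual exceeds k standard deviations.
    return abs(value - (slope * day + intercept)) > k * sigma

print(is_abnormal(365, 0.5 * 365 + 30))   # far off the trend -> True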
PCA is another technique that comes to mind when dealing with this type of data.
You could also look into unsupervised learning, a family of machine learning techniques that can be used to detect differences in larger data sets.
Sounds like a fun problem! Good luck.
There is little magic in any of the techniques you mention. I believe you should first try to narrow down the typical abnormalities you may encounter; it helps keep things simple.
Then you may want to compute derived quantities relevant to those features. For instance: "I want to detect numbers changing direction abruptly" => compute u_{n+1} - u_n and expect it to have constant sign, or to fall in some range. You may want to keep this flexible and allow your code design to be extensible (the Strategy pattern may be worth looking at if you do OOP). A small sketch of this step is given below.
Then, when you have some derived quantities of interest, you do statistical analysis on them. For instance, for a derived quantity A, you assume it should have some distribution P(a, b) (uniform([a, b]), or Beta(a, b), possibly something more complex), you put prior laws on a and b, and you adjust them based on successive information. The posterior likelihood of the information provided by the last point added should then give you some insight about whether that point is normal or not. The relative entropy between the posterior and prior law at each step is a good thing to monitor too. Consult a book on Bayesian methods for more info.
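A hedged sketch of the derived-quantity step only (the Bayesian updating is left out), with a made-up history and purely empirical range limits:

import numpy as np

history = np.array([10, 12, 13, 15, 18, 20, 23])   # past black-box outputs
diffs = np.diff(history)                            # derived quantity u_{n+1} - u_n

lo, hi = diffs.min(), diffs.max()                   # empirical range of the derived quantity

def check_new_value(new_value):
    d = new_value - history[-1]
    same_sign = np.sign(d) == np.sign(diffs[-1])
    in_range = lo <= d <= hi
    return same_sign and in_range                   # False suggests an abnormality

print(check_new_value(24))   # small increase, consistent with history -> True
print(check_new_value(5))    # abrupt drop -> False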
I see little point in complex traditional machine learning machinery (perceptron layers or SVMs, to cite only those) if you want to detect outliers. These methods work great when classifying data which is known to be reasonably clean.

Which data mining algorithm would you suggest for this particular scenario?

This is not a directly programming related question, but it's about selecting the right data mining algorithm.
I want to infer the age of people from their first names, from the region they live, and if they have an internet product or not. The idea behind it is that:
there are names that are old-fashioned or popular in a particular decade (celebrities, politicians etc.) (this may not hold in the USA, but in the country of interest that's true),
young people tend to live in highly populated regions whereas old people prefer countrysides, and
Internet is used more by young people than by old people.
I am not sure if those assumptions hold, but I want to test that. So what I have is 100K observations from our customer database with
approx. 500 different names (nominal input variable with too many classes)
20 different regions (nominal input variable)
Internet Yes/No (binary input variable)
91 distinct birthyears (numerical target variable with range: 1910-1992)
Because I have so many nominal inputs, I don't think regression is a good candidate. Because the target is numerical, I don't think a decision tree is a good option either. Can anyone suggest a method that is applicable to such a scenario?
I think you could design discrete variables that reflect the split you are trying to determine. It doesn't seem like you need a regression on their exact age.
One possibility is to cluster the ages, and then treat the clusters as discrete variables. Should this not be appropriate, another possibility is to divide the ages into bins of equal distribution.
One technique that could work very well for your purposes is, instead of clustering or partitioning the ages directly, to cluster or partition the average age per name. That is to say, generate a list of all of the average ages and work with this instead. (There may be some statistical problems in the classifier if the discrete categories here are too fine-grained, though.)
However, the best case is if you have a clear notion of what age ranges you consider appropriate for 'young' and 'old'. Then use these directly; a small sketch of the average-age-per-name idea follows.
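A minimal pandas sketch of the average-age-per-name idea, with invented column names and data: compute each name's mean birth year, then cut those means into a few coarse bins that can replace the raw name as a feature.

import pandas as pd

customers = pd.DataFrame({
    "name": ["Gertrude", "Jennifer", "Gertrude", "Mary"],
    "birthyear": [1953, 1987, 1950, 1975],
})

# Average birth year per name, broadcast back onto each row.
avg_year = customers.groupby("name")["birthyear"].transform("mean")
# Three coarse bins (low birth year = old) used as a discrete feature.
customers["name_age_bin"] = pd.cut(avg_year, bins=3, labels=["old", "middle", "young"])
print(customers)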
New answer
I would try using regression, but in the manner that I specify below. I would binarize each variable (if that is the correct term). The Internet variable is already binary, but I would make it into two separate binary columns. I will illustrate with an example because I feel it will be more illuminating. For my example, I will use just three names (Gertrude, Jennifer, and Mary) and the Internet variable.
I have 4 women. Here are their data:
Gertrude, Internet, 57
Jennifer, Internet, 23
Gertrude, No Internet, 60
Mary, No Internet, 35
I would generate a matrix, A, like this (each row represents a respective woman in my list):
[[1,0,0,1,0],
[0,1,0,1,0],
[1,0,0,0,1],
[0,0,1,0,1]]
The first three columns represent the names and the latter two Internet/No Internet. Thus, the columns represent
[Gertrude, Jennifer, Mary, Internet, No Internet]
You can keep doing this with more names (500 columns for the names), and for the regions (20 columns for those). Then you will just be solving the standard linear algebra problem A*x=b where b for the above example is
b=[[57],
[23],
[60],
[35]]
You may be worried that A will now be a huge matrix, but it is a huge, extremely sparse matrix and thus can be stored very efficiently in sparse-matrix form. Each row has three 1's in it and the rest are 0. You can then solve this with a sparse matrix solver, as sketched below. You will want to do some sort of correlation test on the resulting predicted ages to see how effective it is.
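A short sketch of that setup with scipy's sparse machinery, reproducing the tiny four-women example above; a real 100K x ~520 matrix would be built the same way and would stay very sparse.

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import lsqr

A = csr_matrix(np.array([
    [1, 0, 0, 1, 0],    # Gertrude, Internet
    [0, 1, 0, 1, 0],    # Jennifer, Internet
    [1, 0, 0, 0, 1],    # Gertrude, No Internet
    [0, 0, 1, 0, 1],    # Mary, No Internet
], dtype=float))
b = np.array([57, 23, 60, 35], dtype=float)

x = lsqr(A, b)[0]                 # least-squares solution for the column weights
predicted_ages = A @ x
print(np.corrcoef(predicted_ages, b)[0, 1])   # crude check of predictive quality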
You might check out the babynamewizard. It shows the changes in name frequency over time and should help convert your names to a numeric input. Also, you should be able to use population density from census.gov data to get a numeric value associated with your regions. I would suggest an additional flag regarding the availability of DSL access - many rural areas don't have DSL coverage. No coverage = less demand for internet services.
My first inclination would be to divide your response into two groups, those very likely to have used computers in school or work and those much less likely. The exposure to computer use at an age early in their career or schooling probably has some effect on their likelihood to use a computer later in their life. Then you might consider regressions on the groups separately. This should eliminate some of the natural correlation of your inputs.
I would use a learning algorithm that accepts nominal attributes and a numeric class, like M5 (for trees or rules). Perhaps I would combine it with the bagging meta-classifier to reduce variance. The original M5 algorithm was invented by R. Quinlan, and Yong Wang later made improvements.
The algorithm is implemented in R (library RWeka).
It can also be found in the open source machine learning software Weka.
For more information see:
Ross J. Quinlan: Learning with Continuous Classes. In: 5th Australian Joint Conference on Artificial Intelligence, Singapore, 343-348, 1992.
Y. Wang, I. H. Witten: Induction of model trees for predicting continuous classes. In: Poster papers of the 9th European Conference on Machine Learning, 1997.
I think slightly differently from you: I believe that trees are excellent algorithms for dealing with nominal data, because they help you build a model that you can easily interpret and that identifies the influence of each of these nominal variables and their different values.
You can also use regression with dummy variables to represent the nominal attributes; this is also a good solution.
But you can also use other algorithms such as an SVM (SMO), with the same prior transformation of the nominal variables into binary dummy ones as in regression. A sketch of both options is given below.
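A hedged scikit-learn sketch of those two suggestions (dummy-encoded inputs fed to a tree regressor and to an SVM regressor); the column names and tiny data frame are illustrative only.

import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

df = pd.DataFrame({
    "name": ["Gertrude", "Jennifer", "Mary", "Gertrude"],
    "region": ["R1", "R2", "R1", "R3"],
    "internet": ["yes", "yes", "no", "no"],
    "birthyear": [1953, 1987, 1975, 1950],
})

# Dummy (one-hot) encoding of the nominal inputs.
X_encoded = pd.get_dummies(df[["name", "region", "internet"]])
y = df["birthyear"]

tree_model = DecisionTreeRegressor(max_depth=5).fit(X_encoded, y)   # interpretable tree
svm_model = SVR().fit(X_encoded, y)                                 # SVM regression on the same encoding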
