Which data mining algorithm would you suggest for this particular scenario?

This is not a directly programming-related question; it is about selecting the right data mining algorithm.
I want to infer the age of people from their first names, from the region they live, and if they have an internet product or not. The idea behind it is that:
there are names that are old-fashioned or that were popular in a particular decade (celebrities, politicians, etc.); this may not hold in the USA, but in the country of interest it is true,
young people tend to live in highly populated regions whereas old people prefer countrysides, and
Internet is used more by young people than by old people.
I am not sure if those assumptions hold, but I want to test that. So what I have is 100K observations from our customer database with
approx. 500 different names (nominal input variable with too many classes)
20 different regions (nominal input variable)
Internet Yes/No (binary input variable)
91 distinct birthyears (numerical target variable with range: 1910-1992)
Because I have so many nominal inputs, I don't think regression is a good candidate. Because the target is numerical, I don't think a decision tree is a good option either. Can anyone suggest a method that is applicable to such a scenario?

I think you could design discrete variables that reflect the split you are trying to determine. It doesn't seem like you need a regression on their exact age.
One possibility is to cluster the ages and then treat the clusters as discrete variables. Should this not be appropriate, another possibility is to divide the ages into equal-frequency bins.
One technique that could work very well for your purposes is, instead of clustering or partitioning the ages directly, to cluster or partition the average age per name. That is to say, generate a list of all of the average ages and work with this instead. (There may be some statistical problems in the classifier if the discrete categories here are too fine-grained, though.)
However, the best case is if you have a clear notion of what age range you consider appropriate for 'young' and 'old'. Then, use these directly.
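For concreteness, here is a minimal sketch of the average-age-per-name binning in Python with pandas; the DataFrame and the column names "name" and "birthyear" are hypothetical, not from the question.

import pandas as pd

def name_age_bins(df: pd.DataFrame, n_bins: int = 4) -> pd.Series:
    # Average birth year for each first name.
    avg_year = df.groupby("name")["birthyear"].mean()
    # Equal-frequency bins (quartiles by default) over those averages.
    return pd.qcut(avg_year, q=n_bins, labels=[f"bin_{i}" for i in range(n_bins)])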

New answer
I would try using regression, but in the manner that I specify. I would try binarizing each variable (this is usually called dummy or one-hot coding). The Internet variable is binary, but I would make it into two separate binary values. I will illustrate with an example because I feel it will be more illuminating. For my example, I will just use three names (Gertrude, Jennifer, and Mary) and the internet variable.
I have 4 women. Here are their data:
Gertrude, Internet, 57
Jennifer, Internet, 23
Gertrude, No Internet, 60
Mary, No Internet, 35
I would generate a matrix, A, like this (each row represents a respective woman in my list):
[[1,0,0,1,0],
[0,1,0,1,0],
[1,0,0,0,1],
[0,0,1,0,1]]
The first three columns represent the names and the latter two Internet/No Internet. Thus, the columns represent
[Gertrude, Jennifer, Mary, Internet, No Internet]
You can keep doing this with more names (500 columns for the names) and for the regions (20 columns for those). Then you will just be solving the standard linear least-squares problem A*x = b, where b for the above example is
b=[[57],
[23],
[60],
[35]]
You may be worried that A will now be a huge matrix, but it is a huge, extremely sparse matrix and thus can be stored very efficiently in sparse form. Each row has three 1s in it and the rest are 0. You can then just solve this with a sparse matrix solver. You will want to do some sort of correlation test on the resulting predicted ages to see how effective the model is.
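To make the recipe concrete, here is a minimal sketch of the encoding and sparse solve described above, using the four example women; SciPy's LSQR solver is one choice among many.

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import lsqr

# Columns: [Gertrude, Jennifer, Mary, Internet, No Internet]
A = sparse.csr_matrix(np.array([
    [1, 0, 0, 1, 0],   # Gertrude, Internet, 57
    [0, 1, 0, 1, 0],   # Jennifer, Internet, 23
    [1, 0, 0, 0, 1],   # Gertrude, No Internet, 60
    [0, 0, 1, 0, 1],   # Mary, No Internet, 35
], dtype=float))
b = np.array([57, 23, 60, 35], dtype=float)

x = lsqr(A, b)[0]          # least-squares solution of A*x = b
predicted = A @ x
print(np.corrcoef(predicted, b)[0, 1])   # crude check of how well the fit predicts age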

You might check out the babynamewizard. It shows the changes in name frequency over time and should help convert your names to a numeric input. Also, you should be able to use population density from census.gov data to get a numeric value associated with your regions. I would suggest an additional flag regarding the availability of DSL access - many rural areas don't have DSL coverage. No coverage = less demand for internet services.
My first inclination would be to divide your response into two groups, those very likely to have used computers in school or work and those much less likely. The exposure to computer use at an age early in their career or schooling probably has some effect on their likelihood to use a computer later in their life. Then you might consider regressions on the groups separately. This should eliminate some of the natural correlation of your inputs.

I would use a learning algorithm that accepts nominal attributes and a numeric class, like M5 (for model trees or rules). Perhaps I would combine it with the bagging meta-learner to reduce variance. The original M5 algorithm was invented by R. Quinlan, and Yong Wang later made improvements.
The algorithm is implemented in R (library RWeka).
It can also be found in the open-source machine learning software Weka.
For more information see:
Ross J. Quinlan: Learning with Continuous Classes. In: 5th Australian Joint Conference on Artificial Intelligence, Singapore, 343-348, 1992.
Y. Wang, I. H. Witten: Induction of model trees for predicting continuous classes. In: Poster papers of the 9th European Conference on Machine Learning, 1997.
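M5 itself is not available in scikit-learn; as a rough, assumption-laden analogue in Python, one could bag ordinary regression trees over one-hot-encoded nominal inputs (the column names here are hypothetical):

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import BaggingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

model = make_pipeline(
    ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"),
          ["name", "region", "internet"])]
    ),
    BaggingRegressor(n_estimators=30),   # bags regression trees by default
)
# model.fit(X, y)   # X: DataFrame of nominal inputs, y: birth year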

I think slightly differently from you: I believe that trees are excellent algorithms for dealing with nominal data, because they help you build a model that you can easily interpret and let you identify the influence of each of these nominal variables and its different values.
You can also use regression with dummy variables to represent the nominal attributes; this is also a good solution.
You can also use other algorithms such as an SVM (SMO), with the same prior transformation of the nominal variables into binary dummy ones as in regression.


Method to choose overall winner across multiple categories

I have four numeric variables. All of them are measures of soil quality. Higher the variable, higher the quality. The range for all of them is different:
Var1 from 1 to 10
Var2 from 1000 to 2000
Var3 from 150 to 300
Var4 from 0 to 5
I need to combine the four variables into a single soil quality score that will successfully rank-order the samples.
My idea is very simple: standardize all four variables, sum them up, and whatever you get is the score, which should rank-order. Do you see any problem with applying this approach? Is there any other (better) approach that you would recommend?
Thanks
Edit:
Thanks guys. A lot of discussion went into "domain expertise"... agriculture stuff... whereas I expected more stats talk. In terms of the technique that I will be using, it will probably be simple z-score summation plus logistic regression as an experiment. Because the vast majority of samples (90%) has poor quality, I'm going to combine three quality categories into one and basically have a binary problem (some quality vs. no quality). I kill two birds with one stone: I increase my sample in terms of event rate, and I make use of experts by getting them to classify my samples. The expert-classified samples will then be used to fit a logistic regression model that maximizes the level of concordance/discordance with the experts. How does that sound to you?
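For concreteness, a minimal sketch of the z-score summation being proposed (the column names Var1-Var4 are hypothetical); whether the simple sum is sensible is exactly what the answer below questions:

import pandas as pd
from scipy.stats import zscore

def quality_score(df: pd.DataFrame) -> pd.Series:
    cols = ["Var1", "Var2", "Var3", "Var4"]
    # Standardize each variable, then sum; assumes higher always means better.
    return df[cols].apply(zscore).sum(axis=1)

# df["score"] = quality_score(df)
# df["rank"] = df["score"].rank(ascending=False)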
The proposed approach may give a reasonable result, but only by accident. At this distance--that is, taking the question at face value, with the meanings of the variables disguised--some problems are apparent:
It is not even evident that each variable is positively related to "quality." For example, what if a 10 for 'Var1' means the "quality" is worse than the quality when Var1 is 1? Then adding it to the sum is about as wrong a thing as one can do; it needs to be subtracted.
Standardization implies that "quality" depends on the data set itself. Thus the definition will change with different data sets or with additions and deletions to these data. This can make the "quality" into an arbitrary, transient, non-objective construct and preclude comparisons between datasets.
There is no definition of "quality". What is it supposed to mean? Ability to block migration of contaminated water? Ability to support organic processes? Ability to promote certain chemical reactions? Soils good for one of these purposes may be especially poor for others.
The problem as stated has no purpose: why does "quality" need to be ranked? What will the ranking be used for--input to more analysis, selecting the "best" soil, deciding a scientific hypothesis, developing a theory, promoting a product?
The consequences of the ranking are not apparent. If the ranking is incorrect or inferior, what will happen? Will the world be hungrier, the environment more contaminated, scientists more misled, gardeners more disappointed?
Why should a linear combination of variables be appropriate? Why shouldn't they be multiplied or exponentiated or combined as a posynomial or something even more esoteric?
Raw soil quality measures are commonly re-expressed. For example, log permeability is usually more useful than the permeability itself and log hydrogen ion activity (pH) is much more useful than the activity. What are the appropriate re-expressions of the variables for determining "quality"?
One would hope that soils science would answer most of these questions and indicate what the appropriate combination of the variables might be for any objective sense of "quality." If not, then you face a multi-attribute valuation problem. The Wikipedia article lists dozens of methods for addressing this. IMHO, most of them are inappropriate for addressing a scientific question. One of the few with a solid theory and potential applicability to empirical matters is Keeney & Raiffa's multiple attribute valuation theory (MAVT). It requires you to be able to determine, for any two specific combinations of the variables, which of the two should rank higher. A structured sequence of such comparisons reveals (a) appropriate ways to re-express the values; (b) whether or not a linear combination of the re-expressed values will produce the correct ranking; and (c) if a linear combination is possible, it will let you compute the coefficients. In short, MAVT provides algorithms for solving your problem provided you already know how to compare specific cases.
Has anyone looked at Russell G. Congalton, 'Review of Assessing the Accuracy of Classifications of Remotely Sensed Data' (1990)? It describes a technique known as the error matrix for varying matrices, and also a term he uses called 'normalizing data', whereby one takes all the different vectors and 'normalizes' them, i.e., sets them to range from 0 to 1. You basically rescale all vectors to the same 0-to-1 range.
One other thing you did not discuss is the scale of the measurements. Var1 and Var4 look like they are on rank-order scales and the others seem not to be, so standardization may be skewing the score. You may be better off transforming all of the variables into ranks and determining a weighting for each variable, since it is highly unlikely that they have the same weight. Equal weighting is more of a "know nothing" default. You might want to do some correlation or regression analysis to come up with some a priori weights.
I had a similar problem recently and thought I would add my approach to the nice answers. To find a simple way of determining which weighting leads to the best ranking, one could transform the problem into a grid search:
Basically, use a combined score for the ranking, composed as follows:
Final_score = Var1 * A + Var2 * B + Var3 * C ...
Then you can compute the final score with different values for A, B, C (sklearn's grid search could be used) and compare the resulting ranking to an expected ranking (some ground truth is needed to determine the goodness of your ranking). The best parameters give you the weights of your individual variables.
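A minimal sketch of that grid search, assuming Spearman rank correlation against the ground-truth ranking as the comparison metric (that metric choice is an assumption, not part of the answer):

import itertools
import numpy as np
from scipy.stats import spearmanr

def best_weights(X, true_scores, grid=(0.0, 0.5, 1.0)):
    # X: (n_samples, n_vars) matrix of Var1..VarN; true_scores: ground-truth ordering.
    best, best_corr = None, -np.inf
    for w in itertools.product(grid, repeat=X.shape[1]):
        score = X @ np.array(w)                       # Final_score = sum_i Var_i * w_i
        corr = spearmanr(score, true_scores).correlation
        if corr > best_corr:
            best, best_corr = w, corr
    return best, best_corr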
Following up on Ralph Winters' answer, you might use PCA (principal component analysis) on the matrix of suitably standardized scores. This will give you a "natural" weight vector that you can use to combine future scores.
Do this also after all scores have been transformed into ranks. If the results are very similar, you have good reasons to continue with either method. If there are discrepancies, this will lead to interesting questions and a better understanding.
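A minimal sketch of the PCA weighting idea, assuming scikit-learn; note that a principal component is only defined up to sign, so the loadings may need flipping to keep "higher = better":

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_weights(X: np.ndarray) -> np.ndarray:
    Z = StandardScaler().fit_transform(X)             # standardized scores
    weights = PCA(n_components=1).fit(Z).components_[0]
    # Flip the sign so the PCA score agrees in direction with the plain z-score sum.
    if np.corrcoef(Z @ weights, Z.sum(axis=1))[0, 1] < 0:
        weights = -weights
    return weights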

Data matching Algorithm Approach

I don't really know where to start with this project, and so I'm hoping a broad question can at least point me in the right direction.
I have two data sets right now, each about 5 GB with two million observations. They are the assessed and historical data gathered for property listings of a given area over a certain amount of time. What I need to do is match properties to one another. A property may appear in the historical data more than once, since it gets sold two or three times during the period. In the historical data I have the seller info, the loan info, and the sale info. In the assessor data I have all of the characteristics that describe the property sold. So in order to build any pricing model, I need to match the two.
I have variables that are similar in each, however they are going to differ slightly (misspellings, abbreviations, etc). Does anyone have any recommendations for me about going through this? First off, what program would I want to do this in? I have experience in STATA, R and a little bit of SAS and Matlab, but I'd prefer to use the former two.
I read through this:
Data matching algorithm
where he uses .NET, and one user suggested a Levenshtein approach (where the distance between strings is calculated), so for fields like Address I could use this and weight the approximate similarity between the two strings. Then it was suggested to maybe use Soundex for the name of the seller/owner.
But I'm really lost in how to implement any of this, and before I approach anyone in my department I really need to have some sort of idea of what I'm doing!
Any help or advice would be immensely helpful.
Yes, there are several good algorithms for the string matching problem you describe, namely:
Jaro-Winkler,
Smith-Waterman,
Dice-Sørensen,
Soundex,
Damerau-Levenshtein, and
Monge-Elkan,
to name a few.
I recommend 'A Comparison of String Distance Metrics for Name-Matching Tasks' by W. W. Cohen, P. Ravikumar, and S. Fienberg for an overview of what tends to work best for what.
SoftTFIDF claims to be the best one. It is available as a Java package. There are other implementations of string matching and record linkage algorithms available in:
Java (SecondString),
Python (JellyFish),
C# (FuzzyString), and
Scala (StringMetric)
libraries.
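As a minimal sketch of how the Python route might look with the JellyFish library mentioned above (the 0.9 threshold, the field names, and the jaro_winkler_similarity function name of recent JellyFish versions are assumptions):

import jellyfish

def likely_same_property(rec_a: dict, rec_b: dict, addr_threshold: float = 0.9) -> bool:
    # Jaro-Winkler similarity on the address strings.
    addr_sim = jellyfish.jaro_winkler_similarity(
        rec_a["address"].lower(), rec_b["address"].lower()
    )
    # Soundex equality on the owner/seller name to tolerate misspellings.
    same_owner_sound = jellyfish.soundex(rec_a["owner"]) == jellyfish.soundex(rec_b["owner"])
    return addr_sim >= addr_threshold and same_owner_sound

likely_same_property(
    {"address": "123 N Main St", "owner": "Smith"},
    {"address": "123 North Main Street", "owner": "Smyth"},
)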

Human-interpretable supervised machine learning algorithm

I'm looking for a supervised machine learning algorithm that would produce transparent rules or definitions that can be easily interpreted by a human.
Most algorithms that I work with (SVMs, random forests, PLS-DA) are not very transparent. That is, you can hardly summarize the models in a table in a publication aimed at a non-computer scientist audience. What authors usually do is, for example, publish a list of variables that are important based on some criterion (for example, Gini index or mean decrease of accuracy in the case of RF), and sometimes improve this list by indicating how these variables differ between the classes in question.
What I am looking for is a relatively simple output of the style "if (any of the variables V1-V10 > median or any of the variables V11-V20 < 1st quartile) and variable V21-V30 > 3rd quartile, then class A".
Is there any such thing around?
Just to constrain my question a bit: I am working with highly multidimensional data sets (tens of thousands to hundreds of thousands of often collinear variables). So, for example, regression trees would not be a good idea (I think).
You sound like you are describing decision trees. Why would regression trees not be a good choice? Maybe not optimal, but they work, and those are the most directly interpretable models. Anything that works on continuous values works on ordinal values.
There's a tension between wanting an accurate classifier, and wanting a simple and explainable model. You could build a random decision forest model, and constrain it in several ways to make it more interpretable:
Small max depth
High minimum information gain
Prune the tree
Only train on "understandable" features
Quantize/round decision thresholds
The model won't be as good, necessarily.
You can find interesting research on understanding AI methods done by Been Kim at Google Brain.
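In the spirit of the constraints listed above, a sketch of a heavily constrained tree whose learned rules can be printed verbatim; scikit-learn is an assumption here, since the question does not name a toolkit:

from sklearn.tree import DecisionTreeClassifier, export_text

def fit_readable_tree(X, y, feature_names):
    tree = DecisionTreeClassifier(
        max_depth=3,                  # small max depth
        min_impurity_decrease=0.01,   # high minimum information gain
        min_samples_leaf=50,          # keeps leaves large, an implicit pruning
    ).fit(X, y)
    # Plain-text if/else rules, readable by a non-specialist audience.
    print(export_text(tree, feature_names=list(feature_names)))
    return tree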

Neural network with categorical variables (enum) as inputs

I'm trying to solve some machine-learning problems using neural networks, mostly with NEAT (NeuroEvolution of Augmenting Topologies).
Some of my input variables are continuous, but some of them are of a categorical nature, like:
Species: {Lion,Leopard,Tiger,Jaguar}
Branches of Trade: {Health care,Insurances,Finance,IT,Advertising}
At first I wanted to model such a variable by mapping the categories to discrete numbers, like:
{Lion:1, Leopard:2, Tiger:3, Jaguar:4}
But I'm afraid this adds some kind of arbitrary topology on the variable. A Tiger is not the sum of a Lion and a Leopard.
What approaches to this problem are usually employed?
Unfortunately there is no good solution; each leads to some kind of problem:
Your solution adds a topology, as you mentioned; it may not be that bad, as an NN can fit arbitrary functions and represent "ifs", but in many cases it will be a problem (NNs often fall into local minima).
You can encode your data in the form is_categorical_feature_i_equal_j, which won't induce any additional topology but will grow the number of features (one binary feature per category value). So instead of "species" you get features "is_lion", "is_leopard", etc., and only one of them is equal to 1 at a time.
In the case of a large amount of data compared to the possible categorical values (for example, you have 10,000 data points and only 10 possible categorical values), one can also split the problem into 10 independent ones, each trained on one particular value (so we have a "neural network for lions", a "neural network for jaguars", etc.).
The first two approaches are the "extreme" cases: one is very computationally cheap but can lead to high bias, while the second introduces much complexity but should not influence the classification process itself. The last one is rarely usable (due to the assumption of a small number of categorical values), yet it is quite reasonable in terms of machine learning.
Update
So many things have changed in 8 years. Solution 2 is definitely the most popular one, and with the growth of compute, the wide adoption of neural networks, and support for sparse inputs, the cost is now negligible.
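A minimal sketch of option 2 (one binary input per category value), so the network sees no artificial ordering between Lion, Leopard, Tiger, and Jaguar; the category list comes from the question, everything else is an assumption:

import numpy as np

SPECIES = ["Lion", "Leopard", "Tiger", "Jaguar"]

def one_hot(value: str, categories=SPECIES) -> np.ndarray:
    vec = np.zeros(len(categories))
    vec[categories.index(value)] = 1.0
    return vec

one_hot("Tiger")   # -> array([0., 0., 1., 0.])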

What are good algorithms for detecting abnormality?

Background
Here is the problem:
A black box outputs a new number each day.
Those numbers have been recorded for a period of time.
Detect when a new number from the black box falls outside the pattern of numbers established over the time period.
The numbers are integers, and the time period is a year.
Question
What algorithm will identify a pattern in the numbers?
The pattern might be simple, like always ascending or always descending, or the numbers might fall within a narrow range, and so forth.
Ideas
I have some ideas, but am uncertain as to the best approach, or what solutions already exist:
Machine learning algorithms?
Neural network?
Classify normal and abnormal numbers?
Statistical analysis?
Cluster your data.
If you don't know how many modes your data will have, use something like a Gaussian Mixture Model (GMM) along with a scoring function (e.g., the Bayesian Information Criterion (BIC)) so you can automatically detect the likely number of clusters in your data. I recommend this instead of k-means if you have no idea what value k is likely to be. Once you've constructed a GMM for your data for the past year, given a new data point x, you can calculate the probability that it was generated by any one of the clusters (modeled by a Gaussian in the GMM). If your new data point has a low probability of being generated by any one of your clusters, it is very likely a true outlier.
If this sounds a little too involved, you will be happy to know that the entire GMM + BIC procedure for automatic cluster identification has been implemented for you in the excellent MCLUST package for R. I have used it several times to great success for such problems.
Not only will it allow you to identify outliers, you will have the ability to put a p-value on a point being an outlier if you need this capability (or want it) at some point.
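The answer recommends MCLUST in R; as an assumed analogue in Python, the same GMM-plus-BIC routine might look like this with scikit-learn (the 1st-percentile outlier threshold is an arbitrary choice for illustration):

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_by_bic(history: np.ndarray, max_components: int = 5) -> GaussianMixture:
    x = history.reshape(-1, 1)
    fits = [GaussianMixture(n_components=k).fit(x) for k in range(1, max_components + 1)]
    return min(fits, key=lambda g: g.bic(x))          # lowest BIC wins

def is_outlier(gmm: GaussianMixture, history: np.ndarray, new_value: float) -> bool:
    # Flag the point if its log-density is below the 1st percentile of the history.
    threshold = np.percentile(gmm.score_samples(history.reshape(-1, 1)), 1)
    return gmm.score_samples(np.array([[new_value]]))[0] < threshold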
You could try line-fitting prediction using linear regression and see how it goes; it would be fairly easy to implement in your language of choice.
After you have fitted a line to your data, you could calculate the standard deviation of the points around the line.
If the novel point lies within the trend line plus or minus that standard deviation, it should not be regarded as an abnormality.
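A minimal sketch of that trend-line check, assuming a plain least-squares line fit with NumPy and the one-standard-deviation band described above:

import numpy as np

def outside_trend_band(values: np.ndarray, new_value: float) -> bool:
    t = np.arange(len(values))
    slope, intercept = np.polyfit(t, values, deg=1)    # fit a line to the history
    resid_std = np.std(values - (slope * t + intercept))
    predicted_next = slope * len(values) + intercept
    return abs(new_value - predicted_next) > resid_std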
PCA is another technique that comes to mind when dealing with this type of data.
You could also look into unsupervised learning; these are machine learning techniques that can be used to detect differences in larger data sets.
Sounds like a fun problem! Good luck
There is little magic in all the techniques you mention. I believe you should first try to narrow down the typical abnormalities you may encounter; it helps keep things simple.
Then, you may want to compute derived quantities relevant to those features. For instance: "I want to detect numbers abruptly changing direction" => compute u_{n+1} - u_n and expect it to have a constant sign, or to fall in some range. You may want to keep this flexible and allow your code design to be extensible (the Strategy pattern may be worth looking at if you do OOP).
Then, when you have some derived quantities of interest, you do statistical analysis on them. For instance, for a derived quantity A, you assume it should have some distribution P(a, b) (uniform([a, b]), or Beta(a, b), possibly more complex), you put prior laws on a and b, and you adjust them based on successive observations. Then, the posterior likelihood of the information provided by the last point added should give you some insight about it being normal or not. The relative entropy between the posterior and prior laws at each step is a good thing to monitor too. Consult a book on Bayesian methods for more info.
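A minimal sketch of the derived-quantity idea, checking only the constant-sign and range expectations on u_{n+1} - u_n (the Bayesian prior/posterior machinery is deliberately left out):

import numpy as np

def difference_breaks_pattern(history: np.ndarray, new_value: float) -> bool:
    diffs = np.diff(history)
    new_diff = new_value - history[-1]
    # If past differences all had the same sign, expect the new one to match.
    if np.all(diffs > 0) and new_diff <= 0:
        return True
    if np.all(diffs < 0) and new_diff >= 0:
        return True
    # Otherwise just expect the new difference to stay within the observed range.
    return not (diffs.min() <= new_diff <= diffs.max())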
I see little point in complex traditional machine learning methods (perceptron layers or SVMs, to cite only those) if you want to detect outliers. These methods work great when classifying data that is known to be reasonably clean.

Resources