I already have a 1609 x 1609 distance matrix in which each distance lies between 0 and 1. I want to cluster the 1609 items into natural groups using TwoStep cluster in SPSS, with the distance matrix as input. How do I modify the syntax to do that? Or is it not possible?
DATASET ACTIVATE dataname1.
TWOSTEP CLUSTER
/CATEGORICAL VARIABLES=ROWTYPE VARNAME
/CONTINUOUS VARIABLES=A1 TO A1609 /* A2 through A1608 are implied by TO */
/DISTANCE LIKELIHOOD
/NUMCLUSTERS AUTO 15 BIC
/HANDLENOISE 0
/MEMALLOCATE 64
/CRITERIA INITHRESHOLD(0) MXBRANCH(8) MXLEVEL(3)
/VIEWMODEL DISPLAY=YES
/SAVE VARIABLE=TSC_4920.
Thanks in advance.
From my understanding of that vaguely documented "twostep" clustering, it needs to compute the mean of points, so it cannot be used with a distance matrix. Consider using ELKI, sklearn, or R instead. As a bonus, they are open source, so you can actually check what they are doing and customize them if, for example, they don't allow a distance matrix somewhere. Being open source is a very big feature.
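For illustration, here is a minimal Python sketch (not SPSS) of what the suggested route looks like with scikit-learn: agglomerative clustering accepts a precomputed distance matrix directly. The matrix here is random stand-in data, and the parameter name assumes a recent scikit-learn (>= 1.2, where it is metric; older versions used affinity).

import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
D = rng.random((1609, 1609))        # stand-in for your 1609 x 1609 matrix
D = (D + D.T) / 2                   # a distance matrix must be symmetric
np.fill_diagonal(D, 0)              # ...with zeros on the diagonal

model = AgglomerativeClustering(
    n_clusters=15,                  # or tune this; TwoStep's AUTO has no direct analogue
    metric="precomputed",           # tell sklearn the input is already a distance matrix
    linkage="average",              # any linkage except 'ward' works with precomputed input
)
labels = model.fit_predict(D)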
The TwoStep algorithm, like all the others in SPSS Statistics, is fully documented in the algorithms manual available from the Help menu.
Related
Since K-means cannot handle categorical variables directly, is it correct to convert codes from the International Standard Industrial Classification of All Economic Activities (ISIC) into doubles so they can be clustered with K-means along with other financial and transactional data? Or should I try other techniques, such as one-hot encoding?
The biggest assumption is that ISIC codes are categorical, not numeric, variables: code "2930" refers to "Manufacture of parts and accessories for motor vehicles", not to money, kilos, or feet. There is, however, a pattern in the codes, since they are not assigned randomly and form a hierarchy; for instance, 2930 belongs to Section C, "Manufacturing", and Division 29, "Manufacture of motor vehicles, trailers and semi-trailers".
To use standard K-means, your data needs a geometric meaning. If you map the codes into geometric space linearly, you will not get a proper clustering result, because the numeric distance between codes does not reflect their actual similarity: code 2930 is numerically as close to 2931 as it is to 2929, even though the activities need not be equally related. You therefore need a nonlinear mapping from the categorical space to the geometric space in order to use standard K-means.
One solution is to use machine-learning techniques similar to word2vec (which vectorizes words), if you have enough co-occurrence data for these codes.
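To make the "nonlinear mapping" idea concrete, here is a hypothetical Python sketch that exploits the ISIC hierarchy mentioned in the question instead of treating the code as a number: each code is expanded into its hierarchical levels, which can then be one-hot encoded, so codes sharing a division or group overlap in the encoding. The prefix lengths follow ISIC Rev. 4; the digit-to-section lookup is omitted.

import pandas as pd

# Hypothetical helper: expand an ISIC code into its hierarchical levels.
def isic_levels(code: str) -> dict:
    return {
        "division": code[:2],   # e.g. "29" -> motor vehicles, trailers, ...
        "group":    code[:3],   # e.g. "293"
        "class":    code[:4],   # e.g. "2930" -> parts and accessories
    }

codes = pd.Series(["2930", "2929", "1071"])
levels = pd.DataFrame([isic_levels(c) for c in codes])
encoded = pd.get_dummies(levels)    # one indicator block per hierarchy level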
Clustering is all about distance measurement.
Discretizing a numeric variable into a categorical one is only a partial solution. As highlighted earlier, the underlying question remains: how do you measure the distance between one discretized variable and another, or between a discretized variable and a numeric one?
In the literature, there are several unsupervised algorithms for treating mixed data. Take a look at the k-prototypes algorithm and the Gower distance.
The k-prototypes algorithm is available in R in the clustMixType package; the Gower distance is provided by the daisy function in the cluster package. If you use Python, you can look at this post.
Huang, Z. (1997). Clustering large data sets with mixed numeric and categorical values. In Proceedings of the 1st Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD).
Gower, J. C. (1971). A general coefficient of similarity and some of its properties. Biometrics, 857-871.
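To show what the Gower distance actually computes, here is a minimal hand-rolled Python sketch for one numeric and one categorical feature, with toy data; real projects should prefer daisy in R or a maintained package.

import numpy as np

income = np.array([30.0, 32.0, 80.0, 85.0])   # numeric feature
sector = np.array(["C", "C", "K", "C"])       # categorical feature

n = len(income)
rng = income.max() - income.min()             # range used to normalise the numeric part
D = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        d_num = abs(income[i] - income[j]) / rng   # range-scaled absolute difference
        d_cat = float(sector[i] != sector[j])      # simple matching: 0 if equal, 1 if not
        D[i, j] = (d_num + d_cat) / 2              # Gower: average over the features

# D is a dissimilarity matrix in [0, 1] and can be fed to any algorithm
# that accepts precomputed dissimilarities (e.g. hierarchical clustering).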
K-means is designed to minimize the sum of squares.
Does minimizing the sum of squares make sense for your problem? Probably not!
While 29, 2903, and 2930 are supposedly all related, 2899 is likely not very related to 2900. Hence, a least-squares approach will produce undesired results.
The method is really designed for continuous variables of the same type and scale. One-hot encoded variables cause more problems than they solve: they are a naive hack to make the function "run", but the results are statistically questionable.
Try to figure out what the right thing to do is. It's probably not least squares here.
Source: https://machinelearningmastery.com/k-nearest-neighbors-for-machine-learning/
This page has a section with the following passage:
Best Prepare Data for KNN
Rescale Data: KNN performs much better if all of the data has the same scale. Normalizing your data to the range [0, 1] is a good idea. It may also be a good idea to standardize your data if it has a Gaussian distribution.
Address Missing Data: Missing data will mean that the distance between samples cannot be calculated. These samples could be excluded or the missing values could be imputed.
Lower Dimensionality: KNN is suited for lower dimensional data. You can try it on high dimensional data (hundreds or thousands of input variables) but be aware that it may not perform as well as other techniques. KNN can benefit from feature selection that reduces the dimensionality of the input feature space.
Can someone please explain the second point, i.e. Address Missing Data, in detail?
Missing data in this context means that some samples do not have values for all of the features.
For example:
Suppose you have a database with age and height for a group of individuals.
This would mean that for some persons either the height or the age is missing.
Now, why does this affect KNN?
Given a test sample, KNN finds the samples that are closest to it (i.e., the individuals with similar age and height), in order to make some inference about the test sample based on its nearest neighbors.
To find these neighbors you must be able to compute the distance between samples, and to compute the distance between two samples you must have all the features of both. If some features are missing, you won't be able to compute the distance, so implicitly you would be losing the samples with missing data.
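As a small illustration of the two options mentioned in the quoted passage (made-up values, scikit-learn for the imputation):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[25.0, 170.0],
              [30.0, np.nan],   # height missing: distances to this row are undefined
              [22.0, 160.0]])

# Option 1: exclude samples with missing features
X_dropped = X[~np.isnan(X).any(axis=1)]

# Option 2: impute (here with the column mean), so KNN can use every sample
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)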
I have a dataset. Each element of this set consists of numerical and categorical variables. Categorical variables are nominal and ordinal.
There is some natural structure in this dataset. Commonly, experts cluster datasets such as mine using their 'expert knowledge', but I want to automate this clustering process.
Most clustering algorithms use a distance (Euclidean, Mahalanobis, and so on) between objects to group them into clusters. But it is hard to find a reasonable metric for mixed data types; for instance, we can't find a distance between 'glass' and 'steel'. So I came to the conclusion that I have to use conditional probabilities P(feature = 'something' | Class) and some utility function that depends on them. This is reasonable for categorical variables, and it works fine with numeric variables assuming they are normally distributed.
So it became clear to me that algorithms like K-means will not produce good results.
At the moment I am trying the COBWEB algorithm, which fully matches my idea of using conditional probabilities. But I face another obstacle: the clustering results are really hard to interpret, if not impossible. I would rather get something like a set of rules that describes each cluster (e.g., if feature1 = 'a' and feature2 in [30, 60], then cluster1), like decision trees for classification.
So, my question is:
Is there any existing clustering algorithm that works with mixed data types and produces an understandable (and reasonable for humans) description of the clusters?
Additional info:
As I understand it, my task is in the field of conceptual clustering. I can't define a similarity function as was suggested (that is an ultimate goal of the whole project), because the field of study is very complicated and merciless in terms of formalization. As far as I understand, the most reasonable approach is the one used in COBWEB, but I'm not sure how to adapt it so that I get an understandable description of the clusters.
Decision Tree
As suggested, I tried to train a decision tree on the clustering output, thus getting a description of the clusters as a set of rules. Unfortunately, interpreting these rules is almost as hard as interpreting the raw clustering output. First, only the first few levels of rules from the root node make any sense; the closer to a leaf, the less sense they make. Second, the rules don't match any expert knowledge.
So I came to the conclusion that clustering is a black box, and it is not worth trying to interpret its results.
Also
I had an interesting idea: modify a 'decision tree for regression' algorithm so that, instead of calculating the intra-group variance, it calculates a category utility function and uses that as the split criterion. The result would be a decision tree with leaf clusters and cluster descriptions out of the box. But I haven't tried it, and I am not sure about its accuracy or anything else.
For most algorithms, you will need to define similarity. It doesn't need to be a proper distance function (e.g., it doesn't have to satisfy the triangle inequality).
K-means is particularly bad because it also needs to compute means. So it's better to stay away from it if you cannot compute means or are using a distance function other than Euclidean.
However, consider defining a distance function that captures your domain knowledge of similarity. It can be composed of other distance functions; say, the harmonic mean of the Euclidean distance (maybe weighted with some scaling factor) and a categorical similarity function.
Once you have a decent similarity function, a whole bunch of algorithms become available to you, e.g. DBSCAN (Wikipedia) or OPTICS (Wikipedia). ELKI may be of interest to you; they have a tutorial on writing custom distance functions.
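As a rough Python sketch of such a composed distance (a plain weighted sum rather than the harmonic mean, with arbitrary weights, eps, and toy data), fed to DBSCAN as a precomputed matrix:

import numpy as np
from sklearn.cluster import DBSCAN

numeric = np.array([[1.0, 2.0], [1.1, 2.1], [8.0, 9.0]])
material = np.array(["glass", "glass", "steel"])

def mixed_distance(i: int, j: int) -> float:
    d_num = np.linalg.norm(numeric[i] - numeric[j])   # Euclidean on the numeric part
    d_cat = float(material[i] != material[j])         # 0/1 mismatch on the categories
    return 0.5 * d_num + 0.5 * d_cat                  # weighted combination (weights arbitrary)

n = len(material)
D = np.array([[mixed_distance(i, j) for j in range(n)] for i in range(n)])
labels = DBSCAN(eps=0.5, min_samples=2, metric="precomputed").fit_predict(D)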
Interpretation is a separate issue. Unfortunately, few clustering algorithms will give you a human-readable interpretation of what they found. They may give you things such as a representative (e.g. the mean of a cluster in k-means), but little more. You could, however, train a decision tree on the clustering output and try to interpret the tree learned from the clustering; the one really nice feature of decision trees is that they are somewhat human-understandable. But just as a support vector machine will not give you an explanation, most (if not all) clustering algorithms will not either, unless you do this kind of post-processing. A nice property of this approach is that it works with any clustering algorithm, which is useful if you want to compare several.
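A minimal sketch of that post-processing step in Python with scikit-learn (synthetic data; any clustering algorithm could replace the k-means call):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.array([[0.1, 0.2], [0.2, 0.1], [5.0, 5.1], [5.2, 4.9]])
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)  # any clusterer works here

tree = DecisionTreeClassifier(max_depth=3)   # keep it shallow so the rules stay readable
tree.fit(X, labels)                          # features -> cluster id, as a classification task
print(export_text(tree, feature_names=["feat0", "feat1"]))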
There was a related publication last year. It is a bit obscure and experimental (presented at a workshop at ECML-PKDD), and requires the data set to have a quite extensive ground truth in the form of rankings. In the example, they used color similarity rankings and some labels. The key idea is to analyze each cluster and find the best explanation for it using the given ground truth(s). They were trying to use it to say, e.g., "this cluster is largely based on this particular shade of green, so it is not very interesting, but the other cluster cannot be explained very well; you need to investigate it closer - maybe the algorithm discovered something new here". But it was very experimental (workshops are for work-in-progress research). You might be able to use this by simply using your features as ground truth. It should then detect whether a cluster can be easily explained by things such as "attribute5 is approx. 0.4 with low variance". But it will not necessarily create such an explanation!
H.-P. Kriegel, E. Schubert, A. Zimek: Evaluation of Multiple Clustering Solutions. In: 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings, held in conjunction with ECML PKDD 2011. http://dme.rwth-aachen.de/en/MultiClust2011
A common approach to this type of clustering problem is to define a statistical model that captures the relevant characteristics of your data. Cluster assignments can be derived by using a mixture model (as in the Gaussian mixture model) and then finding the mixture component with the highest probability for a particular data point.
In your case, each example is a vector that has both real and categorical components. A simple approach is to model each component of the vector separately.
I generated a small example dataset where each example is a two-dimensional vector. The first dimension is a normally distributed variable and the second is a choice among five categories (see graph).
There are a number of frameworks available for running Monte Carlo inference on statistical models. BUGS is probably the most popular (http://www.mrc-bsu.cam.ac.uk/bugs/). I created this model in Stan (http://mc-stan.org/), which uses a different sampling technique than BUGS and is more efficient for many problems:
data {
  int<lower=0> N;                // number of data points
  int<lower=0> C;                // number of categories
  real x[N];                     // normally distributed component data
  int y[N];                      // categorical component data
}
parameters {
  real<lower=0,upper=1> theta;   // mixture probability
  real mu[2];                    // means for the normal component
  simplex[C] phi[2];             // categorical distributions for the categorical component
}
transformed parameters {
  real log_theta;
  real log_one_minus_theta;
  vector[C] log_phi[2];
  vector[C] alpha;
  log_theta <- log(theta);
  log_one_minus_theta <- log(1.0 - theta);
  for (c in 1:C)
    alpha[c] <- 0.5;
  for (k in 1:2)
    for (c in 1:C)
      log_phi[k,c] <- log(phi[k,c]);
}
model {
  theta ~ uniform(0,1);          // equivalently, ~ beta(1,1)
  for (k in 1:2) {
    mu[k] ~ normal(0,10);
    phi[k] ~ dirichlet(alpha);
  }
  for (n in 1:N) {
    lp__ <- lp__ + log_sum_exp(log_theta + normal_log(x[n], mu[1], 1) + log_phi[1, y[n]],
                               log_one_minus_theta + normal_log(x[n], mu[2], 1) + log_phi[2, y[n]]);
  }
}
I compiled and ran the Stan model and used the parameters from the final sample to compute the probability of each datapoint under each mixture component. I then assigned each datapoint to the mixture component (cluster) with the higher probability to recover the cluster assignments below:
Basically, the parameters for each mixture component will give you the core characteristics of each cluster if you have created a model appropriate for your dataset.
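The assignment step described above is straightforward to reproduce; here is a Python sketch with made-up stand-ins for the sampled parameters:

import numpy as np
from scipy.stats import norm

theta = 0.6                                  # mixture weight for component 1 (stand-in)
mu = np.array([-2.0, 3.0])                   # normal means, one per component (stand-ins)
phi = np.array([[0.5, 0.2, 0.1, 0.1, 0.1],   # categorical probs, component 1 (stand-ins)
                [0.1, 0.1, 0.2, 0.3, 0.3]])  # categorical probs, component 2 (stand-ins)

x = np.array([-1.8, 2.9, 3.2])               # numeric component of each data point
y = np.array([0, 3, 4])                      # categorical component (0-indexed)

# Log-probability of each point under each component, then pick the argmax.
lp1 = np.log(theta) + norm.logpdf(x, mu[0], 1) + np.log(phi[0, y])
lp2 = np.log(1 - theta) + norm.logpdf(x, mu[1], 1) + np.log(phi[1, y])
cluster = np.where(lp1 > lp2, 1, 2)          # component with the higher probability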
For heterogeneous, non-Euclidean data vectors like you describe, hierarchical clustering algorithms often work best. The conditional-probability condition you describe can be incorporated as an ordering of the attributes used to perform cluster agglomeration or division. The semantics of the resulting clusters are easy to describe.
My topic is similarity and clustering of (a bunch of) texts. In a nutshell: I want to cluster collected texts together so that they end up in meaningful clusters. My approach so far is as follows, and my problem is in the clustering. The current software is written in PHP.
1) Similarity:
I treat every document as a "bag-of-words" and convert words into vectors. I use
filtering (keep only "real" words)
tokenization (split sentences into words)
stemming (reduce words to their base form; Porter's stemmer)
pruning (cut off words with too high or too low frequency)
as methods for dimensionality reduction. After that, I'm using cosine similarity (as suggested/described on various sites on the web and here).
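(For illustration only, since the project is in PHP: in Python the pipeline up to the similarity matrix is a few lines. TfidfVectorizer covers tokenization and frequency pruning; stemming would need a separate step.)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = ["the cat sat on the mat", "a cat on a mat", "stock markets fell"]
vectors = TfidfVectorizer(min_df=1, max_df=0.9).fit_transform(texts)  # prune by frequency
S = cosine_similarity(vectors)   # S[i, j] in [0, 1]; higher means more similar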
The result then is a similarity matrix like this:
A B C D E
A 0 30 51 75 80
B X 0 21 55 70
C X X 0 25 10
D X X X 0 15
E X X X X 0
A…E are my texts and the numbers are similarities in percent; the higher, the more similar the texts are. Because sim(A,B) == sim(B,A), only half of the matrix is filled in. So the similarity of text A to text D is 75%.
I now want to generate an a priori unknown(!) number of clusters out of this matrix. The clusters should group the similar items together (up to a certain stopping criterion).
I tried a basic implementation myself, which was essentially as follows (with 60% as a fixed similarity threshold):
foreach article
    get similar entries where sim > 60
    foreach similar entry
        check if one of the entries already has a cluster number
        if no: assign a new cluster number to all similar entries
        if yes: use that number
It worked (somehow), but wasn't good at all, and the results were often monster clusters.
So I want to redo this and have already looked at all kinds of clustering algorithms, but I'm still not sure which one will work best. I think it should be an agglomerative algorithm, because every pair of texts can be seen as a cluster in the beginning. But the questions remain: what is the stopping criterion, and should the algorithm divide and/or merge existing clusters?
Sorry if some of the stuff seems basic, but I am relatively new in this field. Thanks for the help.
Since you're new to the field, have an unknown number of clusters, and are already using cosine distance, I would recommend the FLAME clustering algorithm.
It's intuitive, easy to implement, and has implementations in a large number of languages (not PHP though, largely because very few people use PHP for data science).
Not to mention, it's actually good enough to be used in research by a large number of people. If nothing else, you can get an idea of exactly which shortcomings of this clustering algorithm you want to address when moving on to another one.
Just try some. There are so many clustering algorithms out there that nobody knows all of them. Plus, it also depends a lot on your data set and the clustering structure it contains.
In the end, there may also be just this one monster cluster with respect to cosine distance and bag-of-words features.
Maybe you can transform your similarity matrix into a dissimilarity matrix, for example by transforming x into 1/x; then your problem becomes clustering a dissimilarity matrix. I think hierarchical clustering may work. These may help you: hierarchical clustering and Clustering a dissimilarity matrix.
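A minimal SciPy sketch of that idea (using 1 - x rather than 1/x so the dissimilarities stay in [0, 1]; toy matrix):

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

S = np.array([[1.00, 0.80, 0.20],
              [0.80, 1.00, 0.25],
              [0.20, 0.25, 1.00]])            # toy similarity matrix
D = 1.0 - S                                   # similarity -> dissimilarity
np.fill_diagonal(D, 0.0)

Z = linkage(squareform(D), method="average")  # hierarchical clustering on the matrix
labels = fcluster(Z, t=2, criterion="maxclust")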
I have n words and their relatedness weights, which gives me an n*n matrix. I'm going to use this for a search algorithm, but the problem is that I need to cluster the entered keywords based on their pairwise relations. So let's say the keywords are {tennis, federer, wimbledon, london, police} and we have the following data from our weight matrix:
tennis federer wimbledon london police
tennis 1 0.8 0.6 0.4 0.0
federer 0.8 1 0.65 0.4 0.02
wimbledon 0.6 0.65 1 0.08 0.09
london 0.4 0.4 0.08 1 0.71
police 0.0 0.02 0.09 0.71 1
I need an algorithm to cluster them into two clusters: {tennis, federer, wimbledon} and {london, police}. Is there any known clustering algorithm that can deal with such a thing? I did some research; it appears that K-means is the most widely used clustering algorithm, but apparently it doesn't suit this case.
I would greatly appreciate any help.
You can treat it as a network clustering problem. With a recent version of the mcl software (http://micans.org/mcl), you can do this (I've called your example fe.data):
mcxarray -data fe.data -skipr 1 -skipc 1 -write-tab fe.tab -write-data fe.mci -co 0 -tf 'gq(0)' -o fe.cor
# the above computes correlations (put in data file fe.cor) and a network (put in data file fe.mci).
# below proceeds with the network.
mcl fe.mci -I 3 -o - -use-tab fe.tab
# this outputs the clustering you expect. -I is the 'inflation parameter'. The latter affects
# cluster granularity. With the default parameter 2, everything ends up in a single cluster.
Disclaimer: I wrote mcl and a slew of associated network loading/conversion and analysis programs recently rebranded as 'mcl-edge'. They all come together in a single software package. Seeing your example made me curious whether it would be doable with mcl-edge, so I quickly tested it.
Consider DBSCAN. If it suits your needs, you might wish to take a closer look at an optimised version, TI-DBSCAN, which uses the triangle inequality to reduce the cost of spatial queries.
DBSCAN's advantages and disadvantages are discussed on Wikipedia. It splits the input data into a set of clusters whose cardinality isn't known a priori. You'd have to transform your similarity matrix into a distance matrix, for example by taking 1 - similarity as the distance.
Check this book on information retrieval:
http://nlp.stanford.edu/IR-book/html/htmledition/hierarchical-agglomerative-clustering-1.html
It explains very well what you want to do.
Your weights are higher for more similar words and lower for more different words. A clustering algorithm requires similar points/words to be spatially closer and different words to be distant. You should change the matrix M into 1 - M and then use any clustering method you want, including k-means.
If you've got a distance matrix, it seems a shame not to try http://en.wikipedia.org/wiki/Single_linkage_clustering. By hand, I think you get the following clustering:
((federer, tennis), wimbledon) (london, police)
The similarity of the link that joins the two main groups (either tennis-london or federer-london) is smaller than any of the similarities that build the two groups (london-police, tennis-federer, and federer-wimbledon). This characteristic is guaranteed by single-linkage clustering, since it binds the closest clusters together at each stage, and the two main groups are linked by the last binding found.
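A quick way to check the hand-derived result is SciPy's single-linkage implementation, using 1 - similarity as the distance (a sketch, not part of the original answer):

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

words = ["tennis", "federer", "wimbledon", "london", "police"]
S = np.array([[1.00, 0.80, 0.60, 0.40, 0.00],
              [0.80, 1.00, 0.65, 0.40, 0.02],
              [0.60, 0.65, 1.00, 0.08, 0.09],
              [0.40, 0.40, 0.08, 1.00, 0.71],
              [0.00, 0.02, 0.09, 0.71, 1.00]])
D = 1.0 - S                                   # similarity -> distance
np.fill_diagonal(D, 0.0)

Z = linkage(squareform(D), method="single")   # single-linkage on the condensed matrix
print(fcluster(Z, t=2, criterion="maxclust"))
# yields the two expected groups: {tennis, federer, wimbledon} and {london, police}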
DBSCAN (see other answers) and successors such as OPTICS are clearly an option.
While the examples are on vector data, all that these algorithms need is a distance function. If you have a similarity matrix, it can trivially be turned into a distance function.
The example data set is probably a bit too small for them to produce meaningful results. With this little data, any hierarchical clustering should be feasible and do the job for you. You then just need to decide on the best number of clusters.