Trying to understand one-class SVM - algorithm

I am trying to use one-class SVM with Python scikit-learn.
But I do not understand what the different variables X_outliers, n_error_train, n_error_test, n_error_outliers, etc. at this address are. Why is X randomly generated instead of being part of a dataset?
The scikit-learn documentation did not help me a lot, and I found very few examples on the Internet.
Can I use one-class SVM for outlier detection when I have a huge amount of data and do not know whether there are anomalies in my training set?
One-class SVM is an Unsupervised Outlier Detection (here).
One-class SVM is not an outlier-detection method, but a novelty-detection method (here).
Is this possible?

Ok, so this is not really a Python question, more of an SVM comprehension question, but eh. A typical SVM is two-class and works in two phases:
First, it learns relationships between variables and attributes. For example, you show your algorithm tomato pictures and banana pictures, telling it each time whether it's a banana or a tomato, and you tell it to count the number of red pixels in each picture. If you do it correctly, the SVM will be trained, meaning it will know that pictures with lots of red pixels are more likely to be tomatoes than bananas.
Then comes the prediction phase. You show it a picture of a tomato or a banana without telling it which one it is. Since it has been trained before, it will count the red pixels and know which one it is.
In your case of a one-class SVM, it's a bit simpler: basically, the training phase consists of showing it a bunch of variables which are all supposed to be similar. You show it a bunch of tomato pictures and tell it "these are tomatoes; everything too different from these is not a tomato".
The code you link to tests the SVM's ability to learn. You start by creating the training set X_train. Then you generate two other sets: X_test, which is similar to X_train (tomato pictures), and X_outliers, which is very different (banana pictures).
Then you show it the X_train variables and tell your SVM "this is the kind of variables we're looking for" with the line clf.fit(X_train). In my example, this is equivalent to showing it lots of tomato images so that the SVM learns what a "tomato" is.
Then you test your SVM's ability to sort new variables by showing it your two other sets (X_test and X_outliers) and asking whether it thinks they are similar to X_train or not. You ask that with the predict function, which yields, for every element in the sets, either "1", i.e. "yes, this is an element similar to X_train", or "-1", i.e. "this element is very different".
In an ideal case, the SVM should yield only "1" for X_test and only "-1" for X_outliers. But this code shows you that this is not always the case. The n_error_ variables count the mistakes that the SVM makes: misclassifying X_test elements as "not similar to X_train" and X_outliers elements as "similar to X_train". You can see that there are even errors when the SVM is asked to predict on the very set it has been trained on (n_error_train)!
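To make the moving parts concrete, here is a minimal sketch in the spirit of the linked example (the set sizes, value bounds and parameter values are illustrative, not the ones from the scikit-learn page):

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)
X_train = rng.uniform(-0.3, 0.3, size=(100, 2))   # "tomatoes" used for training
X_test = rng.uniform(-0.3, 0.3, size=(20, 2))     # new, similar "tomatoes"
X_outliers = rng.uniform(-4, 4, size=(20, 2))     # very different "bananas"

clf = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
clf.fit(X_train)                                  # "this is what we're looking for"

# predict() returns +1 for "similar to X_train" and -1 for "different"
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)

n_error_train = (y_pred_train == -1).sum()        # training points flagged as different
n_error_test = (y_pred_test == -1).sum()          # similar points flagged as different
n_error_outliers = (y_pred_outliers == 1).sum()   # outliers flagged as similar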
Why are there such errors? Welcome to machine learning. The main difficulty with SVMs is building the training set so that the SVM can learn to distinguish between classes efficiently. So you need to choose carefully how many images you show it, and what it has to look out for in the images (in my example it was the number of red pixels; in the code it is the value of the variable), but that is a different question.
In the code, the bounded but random initialization of the X sets means that, for example, during one run you could train the SVM on an X_train set with lots of values between -0.3 and 0, even though they are randomly initialized between -0.3 and 0.3 (especially if you have few elements per set, say 5, and you get [-0.2 -0.1 0 -0.1 0.1]). So when you show the SVM an element with a value of 0.2, it will have trouble associating it with X_train, because it will have learned that X_train elements are more likely to have negative values.
This is equivalent to showing your SVM only a few yellow-ish tomatoes when training it, so when you show it a really red tomato afterwards, it will have trouble classifying it as a tomato.

This one-class SVM is a classifier to determine whether entries are similar or dissimilar to entries that the classifier has been trained with.
The script generates three sets:
1. A training set.
2. A test set of entries that are similar to the training set.
3. A test set of entries that are dissimilar to the training set.
The error is the number of entries from each of the sets that have been classified wrongly: that is, entries classified as dissimilar to the training set when they were similar (sets 1 and 2), or classified as similar to the training set when they were dissimilar (set 3).
X_outliers: This is set 3.
n_error_train: The number of classification errors for the elements in the train-set (1).
n_error_test: The number of classification errors for the elements in the test-set (2).
n_error_outliers: The number of classification errors for the elements in the outlier-set (3).

This answer should be complementary to the scikit-learn description, but I agree it is a bit technical. I will elaborate on some aspects of the One-Class SVM algorithm (OCSVM) here. OCSVM is designed to solve the unsupervised anomaly detection problem.
Given unstructured (unlabelled) data, it learns a weight vector w (together with a bias term b) that defines a decision boundary in the feature space.
The decision function of all SVM-based methods (including OCSVM) is:
$$f(x) = \operatorname{sign}(w^T x + b),$$ where the sign gives the label (-1 anomalous, +1 nominal) and b is the bias term that shifts the boundary.
In the classification problem the vector w is associated with the distance (margin) between the 2 classes, but this differs in OCSVM: since there is only 1 class, the margin is maximized with respect to the origin (the original OCSVM paper demonstrates this).
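For reference, the primal problem from Schölkopf et al.'s paper is commonly written as
$$\min_{w,\,\xi,\,\rho}\ \frac{1}{2}\|w\|^2 + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - \rho \quad \text{s.t.}\quad w^T\phi(x_i) \ge \rho - \xi_i,\quad \xi_i \ge 0,$$
with decision function $f(x) = \operatorname{sign}(w^T\phi(x) - \rho)$; here $\phi$ is the kernel feature map, $\xi_i$ are slack variables, and the offset $\rho$ plays the role of the bias term above.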
As you see, it is a generic algorithm, because SVM is a family of models that can approximate any non-linear boundary, much like neural networks can. To achieve something complicated you have to construct your own kernel matrix.
To do this you need to find some convenient mathematical property (suggestions to improve the answer are welcome at this point).
But in most cases the Gaussian kernel is a kernel that has some quite nice mathematical properties and associated ML theorems, such as the law of large numbers.
The scikit-learn implementation is a wrapper around the LIBSVM implementation of SVM and offers 4 such kernels.
nu parameter: a problem-formulation parameter; it lets you tell the model how "dirty" your sample is.
More formally, it turns the problem into an outlier detection problem, where you know your data are mixed (nominal and anomalous), as opposed to the pure case, where the problem is different and is called novelty detection.
kernel parameter: one of the most important decisions. Mathematically, a kernel is a big matrix of numbers that, through multiplication, projects the data into a higher-dimensional space. A nice read demonstrating the issue is here, while the paper of Schölkopf, who created OCSVM, goes into more detail.
gamma parameter: in the case of the RBF kernel you essentially use a Gaussian projection.
Disclaimer, my interpretation: essentially, with the gamma parameter you describe how big the variance of the Normal distribution $N(\mu, \sigma)$ is.
tolerance parameter: one-class SVM searches for the margin that best separates the training data from the origin. The tolerance is the stopping criterion: how closely the quadratic optimization of the objective function must be satisfied before the solver stops. The objective function is the thing that tells the SVM what the parameters should look like in order to describe a specific margin (the space between nominal and anomalous).
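A minimal sketch of where these parameters live in the scikit-learn wrapper (the values are illustrative, not recommendations):

from sklearn.svm import OneClassSVM

ocsvm = OneClassSVM(
    nu=0.05,       # roughly: how "dirty" you expect the sample to be (an upper bound on the fraction of training errors)
    kernel="rbf",  # the Gaussian kernel; scikit-learn exposes LIBSVM's kernel choices
    gamma=0.1,     # inverse width of the Gaussian: larger gamma means a tighter, more local fit
    tol=1e-3,      # stopping tolerance of the quadratic optimization
)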

Many sklearn examples are based on randomly generated data. If you want to see an example of how OneClassSVM works on a real dataset for outlier detection, you can go through my post: https://justanoderbit.com/outlier-detection/one-class-svm/

Related

Shall I treat Industry Classification codes as double data type in K-means clustering?

Since K-means cannot handle categorical variables directly, I want to know whether it is correct to convert International Standard Industrial Classification of All Economic Activities (ISIC) codes into double data types in order to cluster them with K-means along with other financial and transactional data, or whether I should try other techniques such as one-hot encoding.
The biggest assumption is that ISIC codes are categorical, not numeric, variables, since code “2930” refers to “Manufacture of parts and accessories for motor vehicles” and not money, kilos, feet, etc. But there is a sort of pattern in such codes, since they are not assigned randomly and they have a hierarchy: for instance, 2930 belongs to Section C “Manufacturing” and Division 29 “Manufacture of motor vehicles, trailers and semi-trailers”.
As you want to use standard K-means, your data needs to have a geometric meaning. Hence, if your mapping of the codes into the geometric space is linear, you will not get a proper clustering result, because the distance between codes does not reflect their meaning: for example, code 2930 is as close to code 2931 as it is to code 2929. Therefore, you need a nonlinear mapping from the categorical space to the geometric space in order to use standard K-means clustering.
One solution is to use machine learning techniques similar to word2vec (for vectorizing words), if you have enough data on co-occurrences of these codes.
Clustering is all about distance measurement.
Discretizing numeric variables into categorical ones is a partial solution. As highlighted earlier, the underlying question is how to measure the distance between a discretized variable and other discretized or numeric variables.
In literature, there are several unsupervised algorithms for treating mixed data. Take a look at the k-prototypes algorithm and the Gower distance.
k-prototypes in R is provided by the clustMixType package. The Gower distance in R is provided by the daisy function in the cluster package. If you are using Python, you can look at this post; a short sketch using the kmodes package follows after the references below.
Huang, Z. (1997). Clustering large data sets with mixed numeric and categorical values. In Proceedings of the 1st Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD).
Gower, J. C. (1971). A general coefficient of similarity and some of its properties. Biometrics, 857-871.
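As mentioned above, here is a hedged Python sketch of k-prototypes using the third-party kmodes package (assumed installed via pip install kmodes); the data and column choices are invented purely for illustration:

import numpy as np
from kmodes.kprototypes import KPrototypes

# Toy data: one numeric column (e.g. revenue) and one categorical column (ISIC code).
X = np.array([
    [120.0, "2930"],
    [115.0, "2929"],
    [300.0, "1071"],
    [310.0, "1072"],
], dtype=object)

kproto = KPrototypes(n_clusters=2, init="Cao", random_state=0)
labels = kproto.fit_predict(X, categorical=[1])   # column 1 is the categorical one
print(labels)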
K-means is designed to minimize the sum of squares.
Does minimizing the sum of squares make sense for your problem? Probably not!
While 29, 2903 and 2930 are supposedly all related, 2899 is likely not very much related to 2900. Hence, a least-squares approach will produce undesired results.
The method is really designed for continuous variables of the same type and scale. One-hot encoded variables cause more problems than they solve - these are a naive hack to make the function "run", but the results are statistically questionable.
Try to figure out what the right thing to do is. It's probably not least squares here.

Distribution of the Training Data vs Distribution of the Test/Prediction

Does the distribution represented by the training data need to reflect the distribution of the test data and of the data that you predict on? Can I measure the quality of the training data by looking at the distribution of each feature and comparing that distribution to the data I am predicting or testing with? Ideally, the training data should be sufficiently representative of the real-world distribution.
Short answer: similar ranges would be a good idea.
Long answer: sometimes it won't be an issue (rarely) but let's examine when.
In an ideal situation, your model will capture the true phenomenon perfectly. Imagine the simplest case: the linear model y = x. If the training data are noiseless (or have tolerable noise), your linear regression will naturally land on a model approximately equal to y = x, and the model will generalize nearly perfectly even outside the training range. If your training data were {1:1, 2:2, 3:3, 4:4, 5:5, 6:6, 7:7, 8:8, 9:9, 10:10}, the test point 500 would nicely map onto the function, returning 500.
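A tiny sketch of that noiseless case (using scikit-learn just for convenience):

import numpy as np
from sklearn.linear_model import LinearRegression

X_train = np.arange(1, 11).reshape(-1, 1)   # 1..10
y_train = np.arange(1, 11)                  # y = x exactly, no noise

model = LinearRegression().fit(X_train, y_train)
print(model.predict([[500]]))               # approximately [500.], far outside the training range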
In most modeling scenarios, this will almost certainly not be the case. If the training data are ample and the model is appropriately complex (and no more), you're golden.
The trouble is that few functions (and corresponding natural phenomena), especially nonlinear ones, extend to data outside the training range so cleanly. Imagine sampling office temperature against employee comfort. If you only look at temperatures from 40 to 60 degrees, a linear function will behave brilliantly on the training data. Oddly enough, if you test on 60 to 80, the mapping will break down. Here, the issue is confidence in your claim that the data are sufficiently representative.
Now let's consider noise. Imagine that you know EXACTLY what the real-world function is: a sine wave. Better still, you are told its amplitude and phase. What you don't know is its frequency. You have a really solid sampling between 1 and 100, and the function you fit maps onto the training data really well. Now, if there is just enough noise, you might estimate the frequency incorrectly by a hair. When you test near the training range, the results aren't so bad. Outside of the training range, things start to get wonky. As you move further and further from the training range, the real function and the fitted function diverge and converge based on their relative frequencies. Sometimes the residuals are seemingly fine; sometimes they are dreadful.
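A small numeric illustration of that divergence, assuming the amplitude and phase are known and only the frequency is misestimated by a hair (the numbers are made up):

import numpy as np

freq_true, freq_fit = 1.000, 1.001            # frequency off by 0.1%

x_near = np.linspace(1, 100, 500)             # the training range
x_far = np.linspace(3000, 3200, 500)          # far outside it

print(np.abs(np.sin(freq_true * x_near) - np.sin(freq_fit * x_near)).max())  # about 0.1: residuals look fine
print(np.abs(np.sin(freq_true * x_far) - np.sin(freq_fit * x_far)).max())    # close to 2: completely out of phase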
There is an issue with your idea of examining the variable distributions: interaction between variables. Even if each variable is appropriately balanced in train and test, it is possible that the relationships between variables will differ (joint distributions). For a purely contrived example, consider you were predicting an individual's likelihood of being pregnant at any given time. In your training set, you had women aged 20 to 30 and men aged 30 to 40. In testing, you had the same percentage of men and women, but the age ranges were flipped. Independently, the variables look very nicely matched! But in your training set, you could very easily conclude, "only people under 30 get pregnant." Oddly enough, your testing set would demonstrate the exact opposite! The trouble is that your predictions are being made from a multivariate space, but the distributions you are thinking about are univariate. Considering the joint distributions of continuous variables against one another (and considering categorical variables appropriately) is, however, a good idea. Ideally, your fit model should have access to a similar range to your testing data.
Fundamentally, the question is about extrapolation from a limited training space. If the model fitted in the training space generalizes outside it, you can extrapolate; ultimately, it is usually safest to have a really well-distributed training set, to maximize the likelihood that you have captured the complexity of the underlying function.
Really interesting question! I hope the answer was somewhat insightful; I'll continue to build on it as resources come to mind! Let me know if any questions remain!
EDIT: a point made in the comments that I think should be read by future readers.
Ideally, training data should NEVER influence testing data in ANY way. That includes examining the distributions, joint distributions, etc. With sufficient data, distributions in the training data should converge on the distributions in the testing data (think of the mean and the law of large numbers). Manipulation to match distributions (like z-scoring before the train/test split) fundamentally skews performance metrics in your favor. An appropriate technique for splitting train and test data would be something like stratified k-fold cross-validation.
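A minimal sketch of the stratified k-fold splitting mentioned above, using scikit-learn (toy data with imbalanced labels):

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.RandomState(0).randn(20, 2)
y = np.array([0] * 15 + [1] * 5)              # 3:1 class imbalance

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # each fold preserves the 3:1 label proportion
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]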
Sorry for the delayed response. After going through a few months of iterating, I implemented and pushed the following solution to production and it is working quite well.
The issue here boils down to how one can reduce the training/test score variance when performing cross-validation. This is important because, if your variance is high, the confidence in picking the best model goes down. The more representative the test data is of the train data, the less variance you get in your test scores across the cross-validation folds. Stratified cross-validation tackles this issue, especially when there is significant class imbalance, by ensuring that the label class proportions are preserved across all test/train sets. However, this doesn't address the issue with the feature distribution.
In my case, I had a few features that were very strong predictors but also very skewed in their distribution. This caused significant variance in my test scores, which made it harder to pick a model with any confidence. Essentially, the solution is to ensure that the joint distribution of the label with the feature set is maintained across test/train sets. There are many ways of doing this, but a very simple approach is to take each column, bucket it by range (if continuous) or by label (if categorical) one by one, and sample from these buckets when generating the test and train sets. Note that the buckets quickly get very sparse, especially when you have a lot of categorical variables. Also, the column order in which you bucket greatly affects the sampling output. Below is a solution where I bucket the label first (same as stratified CV) and then sample on 1 other feature (the most important feature, called score_percentage, which is known upfront).
import random

import numpy as np


def train_test_folds(self, label_column="label"):
    # Returns train_test: a list of [train_indices, test_indices] pairs
    # (numpy arrays), one pair per fold. Assumes the object carries
    # self.n_folds, self.label and the feature DataFrame self.fm.
    n_folds = self.n_folds
    label_classes = np.unique(self.label)
    train_test = []
    fmpd_copy = self.fm.copy()
    fmpd_copy[label_column] = self.label
    fmpd_copy = fmpd_copy.reset_index(drop=True).reset_index()
    fmpd_copy = fmpd_copy.sort_values("score_percentage")
    for lbl in label_classes:
        fmpd_label = fmpd_copy[fmpd_copy[label_column] == lbl]
        if fmpd_label.shape[0] < n_folds:
            raise ValueError("n_folds=%d cannot be greater than the number"
                             " of rows in each class (%d)."
                             % (n_folds, fmpd_label.shape[0]))
        # Add some variance: walk through the label-specific rows (already
        # sorted by score_percentage) and shuffle within buckets of size n_folds.
        s = 0
        shuffle_array = fmpd_label["index"].values
        maxS = len(shuffle_array)
        while s < maxS:
            last = min(maxS, s + n_folds) - 1
            for i in range(s, last):
                j = random.randint(i, last)
                if i < j:
                    shuffle_array[i], shuffle_array[j] = shuffle_array[j], shuffle_array[i]
            s = s + n_folds
        fmpd_label["index"] = shuffle_array
        fmpd_label = fmpd_label.reset_index(drop=True).reset_index()
        # Assign each row of this label to a fold by its position modulo n_folds.
        fmpd_label["test_set_number"] = fmpd_label.iloc[:, 0].apply(
            lambda x: x % n_folds)
        for n in range(0, n_folds):
            test_set = fmpd_label[fmpd_label["test_set_number"] == n]["index"].values
            train_set = fmpd_label[fmpd_label["test_set_number"] != n]["index"].values
            if len(train_test) != n_folds:
                # This fold does not exist yet. Add it.
                train_test.append([train_set, test_set])
            else:
                # Append this label's rows to the existing fold.
                temp_arr = train_test[n]
                temp_arr[0] = np.append(temp_arr[0], train_set)
                temp_arr[1] = np.append(temp_arr[1], test_set)
                train_test[n] = [temp_arr[0], temp_arr[1]]
    return train_test
Over time, I realized that this whole issue falls under the umbrella of covariate shift, which is a well-studied area within machine learning. See the link below, or just search Google for covariate shift. The concept is how to detect and ensure that your prediction data has a similar distribution to your training data. This is in the feature space, but in theory you could have label drift as well.
https://www.analyticsvidhya.com/blog/2017/07/covariate-shift-the-hidden-problem-of-real-world-data-science/

Understanding Perceptrons

I just started a Machine learning class and we went over Perceptrons. For homework we are supposed to:
"Choose appropriate training and test data sets of two dimensions (plane). Use 10 data points for training and 5 for testing. " Then we are supposed to write a program that will use a perceptron algorithm and output:
a comment on whether the training data points are linearly separable
a comment on whether the test points are linearly separable
your initial choice of the weights and constants
the final solution equation (decision boundary)
the total number of weight updates that your algorithm made
the total number of iterations made over the training set
the final misclassification error, if any, on the training data and also on the test data
I have read the first chapter of my book several times and I am still having trouble fully understanding perceptrons.
I understand that you change the weights if a point is misclassified, until none are misclassified anymore. I guess what I'm having trouble understanding is:
What do I use the test data for and how does that relate to the training data?
How do I know if a point is misclassified?
How do I go about choosing test points, training points, threshold or a bias?
It's really hard for me to know how to make up one of these without my book providing good examples. As you can tell I am pretty lost, any help would be so much appreciated.
What do I use the test data for and how does that relate to the training data?
Think about a Perceptron as a young child. You want to teach the child how to distinguish apples from oranges. You show it 5 different apples (all red/yellow) and 5 oranges (of different shapes) while telling it what it sees at every turn ("this is an apple", "this is an orange"). Assuming the child has perfect memory, it will learn to understand what makes an apple an apple and an orange an orange if you show it enough examples. It will eventually start to use meta-features (like shape) without you actually telling it. This is what a Perceptron does. After you have shown it all the examples, you start again at the beginning; this is called a new epoch.
What happens when you want to test the child's knowledge? You show it something new: a green apple (not just yellow/red), a grapefruit, maybe a watermelon. Why not show the child the exact same data as during training? Because the child has perfect memory, it will only tell you what you told it. You won't see how well it generalizes from known to unseen data unless you have separate data that you never showed it during training. If the child performs horribly on the test data but scores 100% on the training data, you will know that it has learned nothing: it is simply repeating what it was told during training. You trained it too long; it only memorized your examples without understanding what makes an apple an apple, because you gave it too many details. This is called overfitting. To prevent your Perceptron from only (!) recognizing training data, you'll have to stop training at a reasonable point and find a good balance between the sizes of the training and testing sets.
How do I know if a point is misclassified?
If it's different from what it should be. Let's say an apple has class 0 and an orange has class 1 (here you should start reading into single/multi-layer Perceptrons and how neural networks of multiple Perceptrons work). The network takes your input; how it's encoded is irrelevant here, so let's say the input is the string "apple". Your training set then is {(apple1,0), (apple2,0), (apple3,0), (orange1,1), (orange2,1), ...}. Since you know the class beforehand, the network will output either 1 or 0 for the input "apple1". If it outputs 1, you compute (targetValue - actualValue) = (0 - 1) = -1; a non-zero value means the network gave a wrong output. Compare this with the delta rule and you will see that this small expression is part of the larger update equation. Whenever you get a non-zero value, you perform a weight update. If the target and actual values are the same, you always get 0, and you know that the network did not misclassify.
How do I go about choosing test points, training points, threshold or a bias?
Practically, the bias and threshold aren't "chosen" per se. The bias is trained like any other unit using a simple "trick", namely using the bias as an additional input unit with value 1; this means the actual bias value is encoded in this additional unit's weight, and the algorithm we use will make sure it learns the bias for us automatically.
Depending on your activation function, the threshold is predetermined. For a simple perceptron, classification works by comparing the weighted sum of the inputs against the threshold: output 1 if the sum exceeds it, 0 otherwise.
Since we use a binary output (between 0 and 1), it's a good start to put the threshold at 0.5, since that's exactly the middle of the range [0,1].
Now to your last question about choosing training and test points: this is quite difficult, and you do it by experience. At your stage, you start off by implementing simple logical functions like AND, OR, XOR, etc. There it's trivial: you put everything in your training set and test with the same values as your training set (since for x XOR y, etc., there are only 4 possible inputs: 00, 10, 01, 11). For complex data like images, audio, etc., you'll have to tweak your data and features until you feel the network can work with them as well as you want it to.
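To make the pieces above concrete, here is a minimal perceptron sketch for the AND function, with the bias folded in as an extra input fixed at 1 (learning rate, epoch count and starting weights are arbitrary choices):

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])                  # AND labels
X = np.hstack([X, np.ones((4, 1))])         # append the bias input of 1

w = np.zeros(3)                             # initial weights; the last entry acts as the bias
lr = 0.1
for epoch in range(20):                     # several passes (epochs) over the training set
    for xi, target in zip(X, y):
        output = 1 if np.dot(w, xi) > 0 else 0   # step activation; the bias unit lets us keep the threshold at 0
        w += lr * (target - output) * xi         # update only when target and output differ

print(w)                                    # a separating line for AND, e.g. roughly [0.2, 0.1, -0.2]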
What do I use the test data for and how does that relate to the training data?
Usually, to assess how well a particular algorithm performs, one first trains it and then uses different data to test how well it does on data it has never seen before.
How do I know if a point is misclassified?
Your training data has labels, which means that for each point in the training set, you know what class it belongs to.
How do I go about choosing test points, training points, threshold or a bias?
For simple problems, you usually take all the available data and split it around 80/20. You train on the 80% and test against the remaining 20%.
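For instance, a common way to do that split in Python (toy data):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(30).reshape(15, 2)                 # 15 toy samples, 2 features
y = np.array([0, 1] * 7 + [0])                   # 15 toy labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)         # 80% train, 20% test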

Practical guidelines on setting weights for examples in vowpal wabbit

I have a multi-class classification problem on a data set with 6 target classes. The training data has a skewed distribution of the class labels; below are the counts of each class label (1 to 6):
array([174171, 12, 29, 8285, 9996, 11128])
I am using vowpal wabbit's oaa scheme to classify, and have tried the default weight of 1.0 for each example. However, for most models this just results in the model predicting label 1 for all examples in the evaluation (as label 1 has a very large representation in the training set).
I am trying to now experiment with different weights that I can apply to the examples of each class to help boost the performance of the classifier.
Any pointers or practical tips on techniques for deciding the weight of each example would be very useful. One possible technique was to weight the examples in inverse ratio to their label frequency. Unfortunately, this seems to result in the classifier being biased greatly towards labels 2 and 3, and predicting 2 and 3 for almost everything in the evaluation.
Would the model choice play a role in deciding the weights? I am experimenting with neural networks and with logistic and hinge loss functions.
There may be better approaches, but I would start, like you did, by inverse weighting the examples based on the rarity of their labels as follows:
Sum of counts of labels = 174171 + 12 + 29 + 8285 + 9996 + 11128 = 203621 so
Label 1 appearing 174171 times (85.5% of total) would be weighted: 203621/174171 = 1.16909
Label 2 appearing 12 times (rarest) would be weighted: 203621/12 = 16968.4
and so on.
Make sure the examples in the train-set are well shuffled. This is of critical importance in online learning. Having the same label examples lumped together is a recipe for very poor online performance.
If you did shuffle well, and you get bad performance on new examples, you can reweight less aggressively, for example take the sqrt() of the inverse weights, then if that's still too aggressive, switch to log() of the inverse weights, etc.
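A hedged sketch of that inverse-frequency weighting in Python, writing the weights into vw's text format (label, then importance weight, then the features); the feature string here is just a placeholder:

import math

counts = {1: 174171, 2: 12, 3: 29, 4: 8285, 5: 9996, 6: 11128}
total = sum(counts.values())                          # 203621
weights = {lbl: total / cnt for lbl, cnt in counts.items()}
# Less aggressive alternatives, as suggested above:
# weights = {lbl: (total / cnt) ** 0.5 for lbl, cnt in counts.items()}   # sqrt of the inverse weights
# weights = {lbl: math.log(total / cnt) for lbl, cnt in counts.items()}  # log of the inverse weights

label, features = 2, "f1:0.3 f2:1.2"                  # placeholder example
print("%d %.4f | %s" % (label, weights[label], features))
# -> 2 16968.4167 | f1:0.3 f2:1.2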
Another approach is to use one of the new cost-sensitive multi-class options, e.g. --csoaa
The VW wiki on github has some examples with details on how to use these options and their training-set formats.
The loss function chosen should definitely have an effect. However, note that generally, when using multi-class or any other reduction-based option in vw, you should leave --loss_function alone and let the algorithm use its built-in default. If you try a different loss function and get better results than with the reduction's built-in loss function, this may be of interest to the developers of vw; please report it as a bug.

An understandable clusterization

I have a dataset. Each element of this set consists of numerical and categorical variables. Categorical variables are nominal and ordinal.
There is some natural structure in this dataset. Commonly, experts cluster datasets such as mine using their 'expert knowledge', but I want to automate this clustering process.
Most clustering algorithms use a distance (Euclidean, Mahalanobis and so on) between objects to group them into clusters. But it is hard to find a reasonable metric for mixed data types, i.e. we can't find a distance between 'glass' and 'steel'. So I came to the conclusion that I have to use conditional probabilities P(feature = 'something' | Class) and some utility function that depends on them. This is reasonable for categorical variables, and it works fine with numeric variables, assuming they are normally distributed.
So it became clear to me that algorithms like K-means will not produce good results.
At the moment I am trying to work with the COBWEB algorithm, which fully matches my idea of using conditional probabilities. But I have faced another obstacle: the clustering results are really hard to interpret, if not impossible. What I would like instead is something like a set of rules that describes each cluster (e.g. if feature1 = 'a' and feature2 in [30, 60], it is cluster1), like decision trees for classification.
So, my question is:
Is there any existing clustering algorithm that works with mixed data types and produces an understandable (and reasonable for humans) description of the clusters?
Additional info:
As I understand it, my task is in the field of conceptual clustering. I can't define a similarity function as was suggested (that is the ultimate goal of the whole project), because of the field of study: it is very complicated and merciless in terms of formalization. As far as I understand, the most reasonable approach is the one used in COBWEB, but I'm not sure how to adapt it so that I can get an understandable description of the clusters.
Decision Tree
As was suggested, I tried to train a decision tree on the clustering output, thus getting a description of the clusters as a set of rules. Unfortunately, interpreting these rules is almost as hard as interpreting the raw clustering output. Firstly, only the first few levels of rules from the root node make any sense; the closer to the leaves, the less sense they make. Secondly, the rules do not match any expert knowledge.
So I came to the conclusion that clustering is a black box, and it is not worth trying to interpret its results.
Also
I had an interesting idea to modify a 'decision tree for regression' algorithm in a certain way: instead of calculating the intra-group variance, calculate a category utility function and use it as the split criterion. As a result, we would get a decision tree with leaves as clusters and a description of the clusters out of the box. But I haven't tried to do so, and I am not sure about accuracy and everything else.
For most algorithms, you will need to define similarity. It doesn't need to be a proper distance function (e.g. satisfy triangle inequality).
K-means is particularly bad, because it also needs to compute means. So it's better to stay away from it if you cannot compute means, or are using a different distance function than Euclidean.
However, consider defining a distance function that captures your domain knowledge of similarity. It can be composed of other distance functions; say, you use the harmonic mean of the Euclidean distance (maybe weighted with some scaling factor) and a categorical similarity function.
Once you have a decent similarity function, a whole bunch of algorithms will become available to you. e.g. DBSCAN (Wikipedia) or OPTICS (Wikipedia). ELKI may be of interest to you, they have a Tutorial on writing custom distance functions.
Interpretation is a separate thing. Unfortunately, few clustering algorithms will give you a human-readable interpretation of what they found. They may give you things such as a representative (e.g. the mean of a cluster in k-means), but little more. But of course, you could next train a decision tree on the clustering output and try to interpret the decision tree learned from the clustering. The one really nice feature of decision trees is that they are somewhat human-understandable. But just like a support vector machine will not give you an explanation, most (if not all) clustering algorithms will not do that either, sorry, unless you do this kind of post-processing. Plus, it will actually work with any clustering algorithm, which is a nice property if you want to compare multiple algorithms.
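A minimal sketch of that post-processing idea, with entirely made-up data: cluster first, then fit a shallow decision tree on the cluster labels and read its rules as a rough description of the clusters (any clustering algorithm could stand in for DBSCAN here).

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.random.RandomState(0).randn(200, 3)               # stand-in for your feature matrix
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)    # cluster labels (-1 marks noise)

tree = DecisionTreeClassifier(max_depth=3).fit(X, labels)
print(export_text(tree, feature_names=["f0", "f1", "f2"]))  # human-readable split rules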
There was a related publication last year. It is a bit obscure and experimental (presented at a workshop at ECML-PKDD), and it requires the data set to have a quite extensive ground truth in the form of rankings. In the example, they used color-similarity rankings and some labels. The key idea is to analyze the cluster and find the best explanation using the given ground truth(s). They were trying to use it to, e.g., say "this cluster found is largely based on this particular shade of green, so it is not very interesting, but the other cluster cannot be explained very well, you need to investigate it closer - maybe the algorithm discovered something new here". But it was very experimental (workshops are for work-in-progress research). You might be able to use this by just using your features as ground truth. It should then detect whether a cluster can be easily explained by things such as "attribute5 is approx. 0.4 with low variance". But it will not necessarily create such an explanation!
H.-P. Kriegel, E. Schubert, A. Zimek
Evaluation of Multiple Clustering Solutions
In 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with ECML PKDD 2011. http://dme.rwth-aachen.de/en/MultiClust2011
A common approach to solving this type of clustering problem is to define a statistical model that captures the relevant characteristics of your data. Cluster assignments can be derived by using a mixture model (as in the Gaussian mixture model) and then finding the mixture component with the highest probability for a particular data point.
In your case, each example is a vector that has both real and categorical components. A simple approach is to model each component of the vector separately.
I generated a small example dataset where each example is a two-dimensional vector. The first dimension is a normally distributed variable and the second is a choice among five categories.
There are a number of frameworks available for running Monte Carlo inference for statistical models. BUGS is probably the most popular (http://www.mrc-bsu.cam.ac.uk/bugs/). I created this model in Stan (http://mc-stan.org/), which uses a different sampling technique than BUGS and is more efficient for many problems:
data {
  int<lower=0> N; // number of data points
  int<lower=0> C; // number of categories
  real x[N];      // normally distributed component data
  int y[N];       // categorical component data
}
parameters {
  real<lower=0,upper=1> theta; // mixture probability
  real mu[2];                  // means for the normal component
  simplex[C] phi[2];           // categorical distributions for the categorical component
}
transformed parameters {
  real log_theta;
  real log_one_minus_theta;
  vector[C] log_phi[2];
  vector[C] alpha;
  log_theta <- log(theta);
  log_one_minus_theta <- log(1.0 - theta);
  for (c in 1:C)
    alpha[c] <- .5;
  for (k in 1:2)
    for (c in 1:C)
      log_phi[k,c] <- log(phi[k,c]);
}
model {
  theta ~ uniform(0,1); // equivalently, ~ beta(1,1);
  for (k in 1:2) {
    mu[k] ~ normal(0,10);
    phi[k] ~ dirichlet(alpha);
  }
  for (n in 1:N) {
    lp__ <- lp__ + log_sum_exp(log_theta + normal_log(x[n],mu[1],1) + log_phi[1,y[n]],
                               log_one_minus_theta + normal_log(x[n],mu[2],1) + log_phi[2,y[n]]);
  }
}
I compiled and ran the Stan model, and used the parameters from the final sample to compute the probability of each data point under each mixture component. I then assigned each data point to the mixture component (cluster) with the higher probability to recover the cluster assignments.
Basically, the parameters for each mixture component will give you the core characteristics of each cluster if you have created a model appropriate for your dataset.
For heterogeneous, non-Euclidean data vectors as you describe, hierarchical clustering algorithms often work best. The conditional-probability condition you describe can be incorporated as an ordering of the attributes used to perform cluster agglomeration or division. The semantics of the resulting clusters are easy to describe.
