Confusion Matrix of Bayesian Network - matrix

I'm trying to understand bayesian network. I have a data file which has 10 attributes, I want to acquire the confusion table of this data table ,I thought I need to calculate tp,fp, fn, tn of all fields. Is it true ? if it's then what i need to do for bayesian network.
Really need some guidance, I'm lost.

The process usually goes like this:
You have some labeled data instances
which you want to use to train a
classifier, so that it can predict
the class of new unlabeled instances.
Using your classifier
of choice (neural networks, bayes
net, SVM, etc...) we build a
model with your training data
as input.
At this point, you usually would like
to evaluate the performance of the
model before deploying it. So using a
previously unused subset of the data
(test set), we compare the model
classification for these instances
against that of the actual class. A
good way to summarize these results
is by a confusion matrix which shows
how each class of instances is
predicted.
For binary classification tasks, the convention is to assign one class as positive, and the other as negative. Thus from the confusion matrix, the percentage of positive instances that are correctly classified as positive is know as the True Positive (TP) rate. The other definitions follows the same convention...

Confusion matrix is used to evaluate the performance of a classifier, any classifier.

What you are asking is a confusion matrix with more than two classes.
Here is the steps how you do:
Build a classifier for each class, where the training set consists of
the set of documents in the class (positive labels) and its
complement (negative labels).
Given the test document, apply each classifier separately.
Assign the document to the class with the maximum score, the
maximum confidence value, or the maximum probability
Here is the reference for the paper you can have more information:
Picca, Davide, Benoît Curdy, and François Bavaud.2006.Non-linear correspondence analysis in text retrieval: A kernel view. In Proc. JADT.

Related

Question about activation function of image task in Deep Learning

Let me ask you about the image task of Deep Learning (here, image identification).
DeepLearning recognizes that it can be classified into three layers: input layer, intermediate layer, and output layer.
① Input layer → Intermediate layer
② Intermediate layer → Output layer
I understand that it is normal for ① and ② to use the activation function.
I recognize as follows.
Regarding (1), the ReLU function and sigmoid function are used.
Regarding (2), the softmax function is used.
I would like to know why (1) and (2) each use a specific function by convention.
Also, are there cases where the activation function is used, and are there any results evaluated by various functions?
If anyone knows anything about the above, please let me know.
Also, if you have a reference web page or treatise, please let me know.
The choice of activation function in the hidden layer will control how well the network model learns the training dataset. The choice of activation function in the output layer will define the type of predictions the model can make.
An activation function in a neural network defines how the weighted sum of the input is transformed into an output from a node or nodes in a layer of the network. Many activation functions are nonlinear and may be referred to as the “nonlinearity” in the layer or the network design. Nonlinear activation functions are preferred as they allow the nodes to learn more complex structures in the data.
Hidden Layer
ReLU (rectified linear units) activation function, is now-a-days the most common function used for hidden layers because it is both simple to implement and effective at overcoming the limitations of other previously popular activation functions, such as Sigmoid and Tanh. Specifically, it is less susceptible to vanishing gradients that prevent deep models from being trained, although it can suffer from other problems like saturated or “dead” units.
A general problem with both the sigmoid and tanh functions is that they saturate. This means that large values snap to 1.0 and small values snap to -1 or 0 for tanh and sigmoid respectively. Further, the functions are only really sensitive to changes around their mid-point of their input, such as 0.5 for sigmoid and 0.0 for tanh.
The limited sensitivity and saturation of the function happen regardless of whether the summed activation from the node provided as input contains useful information or not. Once saturated, it becomes challenging for the learning algorithm to continue to adapt the weights to improve the performance of the model.
Because rectified linear units are nearly linear, they preserve many of the properties that make linear models easy to optimize with gradient-based methods. They also preserve many of the properties that make linear models generalize well.
Because the rectified function is linear for half of the input domain and nonlinear for the other half, it is referred to as a piecewise linear function or a hinge function. However, the function remains very close to linear, in the sense that is a piecewise linear function with two linear pieces.
Outer Layer
Common activation functions to consider for use in the output layer are: Linear, Logistic (Sigmoid) and Softmax.
The linear activation function is also called “identity” (multiplied by 1.0) or “no activation.” This is because the linear activation function does not change the weighted sum of the input in any way and instead returns the value directly.
The softmax function outputs a vector of values that sum to 1.0 that can be interpreted as probabilities of class membership. It is related to the argmax function that outputs a 0 for all options and 1 for the chosen option. Softmax is a “softer” version of argmax that allows a probability-like output of a winner-take-all function. As such, the input to the function is a vector of real values and the output is a vector of the same length with values that sum to 1.0 like probabilities.
Choose the activation function for your output layer based on the type of prediction problem that you are solving. Specifically, the type of variable that is being predicted.
For example, you may divide prediction problems into two main groups, predicting a categorical variable (classification) and predicting a numerical variable (regression).
If your problem is a regression problem, you should use a linear activation function.
Regression: One node, linear activation.
If your problem is a classification problem, then there are three main types of classification problems and each may use a different activation function.
Predicting a probability is not a regression problem; it is classification. In all cases of classification, your model will predict the probability of class membership (e.g. probability that an example belongs to each class) that you can convert to a crisp class label by rounding (for sigmoid) or argmax (for softmax).
If there are two mutually exclusive classes (binary classification), then your output layer will have one node and a sigmoid activation function should be used. If there are more than two mutually exclusive classes (multiclass classification), then your output layer will have one node per class and a softmax activation should be used. If there are two or more mutually inclusive classes (multilabel classification), then your output layer will have one node for each class and a sigmoid activation function is used.
Binary Classification: One node, sigmoid activation.
Multi-class Classification: One node per class, softmax activation.
Multi-label Classification: One node per class, sigmoid activation.
The softmax function is used as the activation function in the output layer of neural network models that predict a multinomial probability distribution. That is, softmax is used as the activation function for multi-class classification problems where class membership is required on more than two class labels.
The function can be used as an activation function for a hidden layer in a neural network, although this is less common. It may be used when the model internally needs to choose or weight multiple different inputs at a bottleneck or concatenation layer.
Reference: machinelearningmastery.com
relu and leakyrelu and tanh activation functions in the input and hidden are used for numeric function prediction. leakyrelu and tanh find signal better for equations and linear trends. I used leakyrelu and tanh for linear problems. sigmoid activation is used for classification and binary cross entropy problems. tanh worked well for binary cross entropy problem in credit loan risk.
you can use softmax when your outputting multiple labels as a probability. In this example, I use the ufo text description to output the probable shape of the ufo.
https://github.com/dnishimoto/python-deep-learning/blob/master/UFO%20.ipynb
['Egg','Cross','Sphere','Triangle','Disk','Oval','Rectangle','Teardrop']
softmax returns a probability for each output label
model=Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
model.add(Dense(len(LABELS), activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
egg shaped ufos are the most common sightings

Shall I treat Industry Classification codes as double data type in K-means clustering?

Since K-means cannot handle categorical variables directly, I want to know if it is correct to convert International Standard Industrial Classification of All Economic Activities or ISIC into double data types to cluster it using K-means along with other financial and transactional data? Or shall I try other techniques such as one hot encoding?
The biggest assumption is that ISIC codes are categorical not numeric variables since code “2930” refers to “Manufacture of parts and accessories for motor vehicles” and not money, kilos, feet, etc., but there is a sort of pattern in such codes since they are not assigned randomly and have a hierarchy for instance 2930 belongs to Section C “Manufacturing” and Division 29 “Manufacture of motor vehicles, trailers and semi-trailers”.
As you want to use standard K-Means, you need your data has a geometric meaning. Hence, if your mapping of the codes into the geometric space is linear, you will not get any proper clustering result. As the distance of the code does not project in their value. For example code 2930 is as close to code 2931 as code 2929. Therefore, you need a nonlinear mapping for the categorical space to the geometric space to using the standard k-mean clustering.
One solution is using from machine learning techniques similar to word-to-vec (for vectorizing words) if you have enough data for co-occurrences of these codes.
Clustering is all about distance measurement.
Discretizing numeric variable to categorical is a partial solution. As earlier highlighted, the underlying question is how to measure the distance for a discretized variable with other discretized variable and numeric variable?
In literature, there are several unsupervised algorithms for treating mixed data. Take a look at the k-prototypes algorithm and the Gower distance.
The k-prototypes in R is given in clustMixType package. The Gower distance in R is given in the function daisy in the cluster package. If using Python, you can look at this post
Huang, Z. (1997). Clustering large data sets with mixed numeric and categorical values. Paper presented at the Proceedings of the 1st Pacific-Asia Conference on Knowledge Discovery and Data Mining,(PAKDD).
Gower, J. C. (1971). A general coefficient of similarity and some of its properties. Biometrics, 857-871.
K-means is designed to minimize the sum of squares.
Does minimizing the sum of squares make sense for your problem? Probably not!
While 29, 2903 and 2930 are supposedly all related 2899 likely is not very much related to 2900. Hence, a least squares approach will produce undesired results.
The method is really designed for continuous variables of the same type and scale. One-hot encoded variables cause more problems than they solve - these are a naive hack to make the function "run", but the results are statistically questionable.
Try to figure out what he right thing to do is. It's probably not least squares here.

How to rotate a word2vec onto another word2vec?

I am training multiple word2vec models with Gensim. Each of the word2vec will have the same parameter and dimension, but trained with slightly different data. Then I want to compare how the change in data affected the vector representation of some words.
But every time I train a model, the vector representation of the same word is wildly different. Their similarity among other words remain similar, but the whole vector space seems to be rotated.
Is there any way I can rotate both of the word2vec representation in such way that same words occupy same position in vector space, or at least they are as close as possible.
Thanks in advance.
That the locations of words vary between runs is to be expected. There's no one 'right' place for words, just mutual arrangements that are good at the training task (predicting words from other nearby words) – and the algorithm involves random initialization, random choices during training, and (usually) multithreaded operation which can change the effective ordering of training examples, and thus final results, even if you were to try to eliminate the randomness by reliance on a deterministically-seeded pseudorandom number generator.
There's a class called TranslationMatrix in gensim that implements the learn-a-projection-between-two-spaces method, as used for machine-translation between natural languages in one of the early word2vec papers. It requires you to have some words that you specify should have equivalent vectors – an anchor/reference set – then lets other words find their positions in relation to those. There's a demo of its use in gensim's documentation notebooks:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/translation_matrix.ipynb
But, there are some other techniques you could also consider:
transform & concatenate the training corpuses instead, to both retain some words that are the same across all corpuses (such as very frequent words), but make other words of interest different per segment. For example, you might leave words like "hot" and "cold" unchanged, but replace words like "tamale" or "skiing" with subcorpus-specific versions, like "tamale(A)", "tamale(B)", "skiing(A)", "skiing(B)". Shuffle all data together for training in a single session, then check the distances/directions between "tamale(A)" and "tamale(B)" - since they were each only trained by their respective subsets of the data. (It's still important to have many 'anchor' words, shared between different sets, to force a correlation on those words, and thus a shared influence/meaning for the varying-words.)
create a model for all the data, with a single vector per word. Save that model aside. Then, re-load it, and try re-training it with just subsets of the whole data. Check how much words move, when trained on just the segments. (It might again help comparability to hold certain prominent anchor words constant. There's an experimental property in the model.trainables, with a name ending _lockf, that lets you scale the updates to each word. If you set its values to 0.0, instead of the default 1.0, for certain word slots, those words can't be further updated. So after re-loading the model, you could 'freeze' your reference words, by setting their _lockf values to 0.0, so that only other words get updated by the secondary training, and they're still bound to have coordinates that make sense with regard to the unmoving anchor words. Read the source code to better understand how _lockf works.)

Trying to understand one-class SVM

I am trying to use one-class SVM with Python scikit-learn.
But I do not understand what are the different variables X_outliers, n_error_train, n_error_test, n_error_outliers, etc. which are at this address. Why does X is randomly selected and is not a part of a dataset?
Scikit-learn "documentation" did not help me a lot. Also, I found very few examples on Internet
Can I use One-class SVM for outlier detection in a case of a hudge number of data and if I do not know if there are anomalies in my training set?
One-class SVM is an Unsupervised Outlier Detection (here)
One-class SVM is not an outlier-detection method, but a
novelty-detection method (here)
Is this possible?
Ok, so this is not really a Python question, more of a SVM comprehension question, but eh. A typical SVM is two-classed, and is an algorithm which is going to have two phases :
First, it will learn relationships between variables and attributes. For example, you show your algorithm tomato pictures and banana pictures, telling him each time if it's a banana or a tomato, and you tell him to count the number of red pixels in each picture. If you do it correctly, the SVM will be trained, meaning he will know that pictures with lots of red pixels are more likely to be tomatoes than bananas.
Then comes the predicting phase. You show him a picture of a tomato or a banana without telling him which it is. And since he has been trained before, he will count the red pixels, and know which it is.
In your case of a one-class SVM, it's a bit simpler, basically the training phase is showing him a bunch of variables which are all supposed to be similar. You show him a bunch of tomato pictures telling him "these are tomatoes, everything else too different from these are not tomatoes".
The code you link to is a code to test the SVM's capability of learning. You start by creating variables X_train. Then you generate two other sets, X_test which is similar to X_train (tomato pictures) and X_outliers which is very different. (banana pictures)
Then you show him the X_train variables and tell your SVM "this is the kind of variables we're looking for" with the line clf.fit(X_train). This is equivalent in my example to showing him lots of tomato images, and the SVN learning what a "tomato" is.
And then you test your SVM's capability to sort new variables, by showing him your two other sets (X_test and X_outliers), and asking him whether he thinks they are similar to X_train or not. You ask him that with the predict fuction, and predict will yield for every element in the sets either "1" i.e. "yes this is a similar element to X_train", or "-1", i.e. "this element is very different".
In an ideal case, the SVM should yield only "1" for X_test and only "-1" for X_outliers. But this code is to show you that this is not always the case. The variables n_error_ are here to count the mistakes that the SVM makes, misclassifying X_test elements as "not similar to X_train and X_outliers elements as "similar to X_train". You can see that there are even errors when the SVM is asked to predict on the very set that is has been trained on ! (n_error_train)
Why are there such errors ? Welcome to machine learning. The main difficulty of SVMs is setting the training set such that it enables the SVM to learn efficiently to distinguish between classes. So you need to set carefully the number of images you show him, (and what he has to look out for in the images (in my example, it was the number of red pixels, in the code, it is the value of the variable), but that is a different question).
In the code, the bounded but random initialization of the X sets means that for example you could during on run train the SVM on an X_train set with lots of values between -0.3 and 0 even though they are randomly initialized between -0.3 and 0.3 (espcecially if you have few elements per set, say for example 5, and you get [-0.2 -0.1 0 -0.1 0.1]). And so, when you show the SVM an element with a value of 0.2, then he will have trouble associating it to X_train, because it will have learned that X_train elements are more likely to have negative values.
This is equivalent to show your SVM a few yellow-ish tomatoes when you train him, so when you show him a really red tomato afterwards, it will have trouble clasifying it as a tomato.
This one-class SVM is a classifier to determine whether entries are similar or dissimilar to entries that the classifier has been trained with.
The script generates three sets:
A training set.
A test-set of entries that are similar to the training a set.
A test-set of entries that are dissimilar to the training set.
The error is the number of entries from each of the sets, that have been classified wrongly. That is; That have been classified as dissimilar to the training set when they were similar (for set 1 and 2), or that have been classifier as similar to the training set when they were dissimilar (set 3).
X_outliers: This is set 3.
n_error_train: The number of classification errors for the elements in the train-set (1).
n_error_test: The number of classification errors for the elements in the test-set (2).
n_error_outliers: The number of classification errors for the elements in the outlier-set (3).
This answer should be complementary to scikit-description but I agree that is a bit technical. I will elaborate some aspects of the One Class SVM algorithm (OCSVM) here. OCSVM is designed to solve the unsupervised anomaly detection problem.
Given unstructured (unlabelled) data it will find a n-dimensional space a matrix W^T with d columns (T stands for transpose).
The objective function of all SVM based methods (and OCSVM) is:
$$f(x) = sign(wT x + b)$$, where sign means sign (-1 anomalous 1 nominal) shifted by a bias term b.
In the classification problem the matrix W is associated with the distance(margin) between 2 classes but this differs in OCSVM since there is only 1 class and it maximizes from the origin (original paper of OCSVM demonstrates this ) .
As you see it is a generic algorithm because SVM is a family of models that can approximate any non linear boundary such as neural networks. To achieve something complicated you have to construct your own kernel matrix.
To do this you need to find some convenient mathematical property (suggestions to improve the answer are welcome at this point).
But in the most cases Gaussian kernel is a kernel that has some quite nice mathematical properties and associated ML theorems such as the Large
of large numbers.
The scikit implementation provides a wrapper to LIBSVM implementation for SVM and has 4 such kernels.
-nu parameter is a problem formulation parameter it allows to say to the model here is how dirty my sample is.
More formally it makes the problem a outlier detection problem where you know your data is mixed (nominal and anomalous) instead of pure where the problem is different and it is called novelty detection.
kernel parameter: One of the most important decisions. Mathematically kernel is a big matrix of numbers where by multiplying you achieve to project data in a higher dimensions. A nice read demonstrating the issue is here while the paper of Scholkopf who created OCSVMK goes into more detail.
gamma
In the case of robust kernel you essentially use a gaussian projection.
Disclaimer my interpretation: Essentially with gamma parameter you describe how big the variance of the Normal distribution $N(\mu, \sigma)$ is.
-tolerance
One class svm search the margin tha separates better among training data and the origin. The tolerance refers to the stopping criterion or how small should the tolerance for satisfaction of the quadratic optimization of the
objective function. The objective function the thing that tells SVM what the parameters should like to describe a specific margin - the space between nominal and anomalous) seen in Figure~().
Many Sklearn examples are usually based on randomly generated data. If you want to see an example of how OneClassSVM works on a real dataset for outlier detection, you can go through my post: https://justanoderbit.com/outlier-detection/one-class-svm/

An understandable clusterization

I have a dataset. Each element of this set consists of numerical and categorical variables. Categorical variables are nominal and ordinal.
There is some natural structure in this dataset. Commonly, experts clusterize datasets such as mine using their 'expert knowledge', but I want to automate this process of clusterization.
Most algorithms for clusterization use distance (Euclidean, Mahalanobdis and so on) between objects to group them in clusters. But it is hard to find some reasonable metrics for mixed data types, i.e. we can't find a distance between 'glass' and 'steel'. So I came to the conclusion that I have to use conditional probabilities P(feature = 'something' | Class) and some utility function that depends on them. It is reasonable for categorical variables, and it works fine with numeric variables assuming they are distributed normally.
So it became clear to me that algorithms like K-means will not produce good results.
At this time I try to work with COBWEB algorithm, that fully matches my ideas of using conditional probabilities. But I faced another obsacles: results of clusterization are really hard to interpret, if not impossible. As a result I wanted to get something like a set of rules that describes each cluster (e.g. if feature1 = 'a' and feature2 in [30, 60], it is cluster1), like descision trees for classification.
So, my question is:
Is there any existing clusterization algorithm that works with mixed data type and produces an understandable (and reasonable for humans) description of clusters.
Additional info:
As I understand my task is in the field of conceptual clustering. I can't define a similarity function as it was suggested (it as an ultimate goal of the whoal project), because of the field of study - it is very complicated and mercyless in terms of formalization. As far as I understand the most reasonable approach is the one used in COBWEB, but I'm not sure how to adapt it, so I can get an undestandable description of clusters.
Decision Tree
As it was suggested, I tried to train a decision tree on the clustering output, thus getting a description of clusters as a set of rules. But unfortunately interpretation of this rules is almost as hard as with the raw clustering output. First of only a few first levels of rules from the root node do make any sense: closer to the leaf - less sense we have. Secondly, these rules doesn't match any expert knowledge.
So, I came to the conclusion that clustering is a black-box, and it worth not trying to interpret its results.
Also
I had an interesting idea to modify a 'decision tree for regression' algorithm in a certain way: istead of calculating an intra-group variance calcualte a category utility function and use it as a split criterion. As a result we should have a decision tree with leafs-clusters and clusters description out of the box. But I haven't tried to do so, and I am not sure about accuracy and everything else.
For most algorithms, you will need to define similarity. It doesn't need to be a proper distance function (e.g. satisfy triangle inequality).
K-means is particularly bad, because it also needs to compute means. So it's better to stay away from it if you cannot compute means, or are using a different distance function than Euclidean.
However, consider defining a distance function that captures your domain knowledge of similarity. It can be composed of other distance functions, say you use the harmonic mean of the Euclidean distance (maybe weighted with some scaling factor) and a categorial similarity function.
Once you have a decent similarity function, a whole bunch of algorithms will become available to you. e.g. DBSCAN (Wikipedia) or OPTICS (Wikipedia). ELKI may be of interest to you, they have a Tutorial on writing custom distance functions.
Interpretation is a separate thing. Unfortunately, few clustering algorithms will give you a human-readable interpretation of what they found. They may give you things such as a representative (e.g. the mean of a cluster in k-means), but little more. But of course you could next train a decision tree on the clustering output and try to interpret the decision tree learned from the clustering. Because the one really nice feature about decision trees, is that they are somewhat human understandable. But just like a Support Vector Machine will not give you an explanation, most (if not all) clustering algorithms will not do that either, sorry, unless you do this kind of post-processing. Plus, it will actually work with any clustering algorithm, which is a nice property if you want to compare multiple algorithms.
There was a related publication last year. It is a bit obscure and experimental (on a workshop at ECML-PKDD), and requires the data set to have a quite extensive ground truth in form of rankings. In the example, they used color similarity rankings and some labels. The key idea is to analyze the cluster and find the best explanation using the given ground truth(s). They were trying to use it to e.g. say "this cluster found is largely based on this particular shade of green, so it is not very interesting, but the other cluster cannot be explained very well, you need to investigate it closer - maybe the algorithm discovered something new here". But it was very experimental (Workshops are for work-in-progress type of research). You might be able to use this, by just using your features as ground truth. It should then detect if a cluster can be easily explained by things such as "attribute5 is approx. 0.4 with low variance". But it will not forcibly create such an explanation!
H.-P. Kriegel, E. Schubert, A. Zimek
Evaluation of Multiple Clustering Solutions
In 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with ECML PKDD 2011. http://dme.rwth-aachen.de/en/MultiClust2011
A common approach to solve this type of clustering problem is to define a statistical model that captures relevant characteristics of your data. Cluster assignments can be derived by using a mixture model (as in the Gaussian Mixture Model) then finding the mixture component with the highest probability for a particular data point.
In your case, each example is a vector has both real and categorical components. A simple approach is to model each component of the vector separately.
I generated a small example dataset where each example is a vector of two dimensions. The first dimension is a normally distributed variable and the second is a choice of five categories (see graph):
There are a number of frameworks that are available to run monte carlo inference for statistical models. BUGS is probably the most popular (http://www.mrc-bsu.cam.ac.uk/bugs/). I created this model in Stan (http://mc-stan.org/), which uses a different sampling technique than BUGs and is more efficient for many problems:
data {
int<lower=0> N; //number of data points
int<lower=0> C; //number of categories
real x[N]; // normally distributed component data
int y[N]; // categorical component data
}
parameters {
real<lower=0,upper=1> theta; // mixture probability
real mu[2]; // means for the normal component
simplex[C] phi[2]; // categorical distributions for the categorical component
}
transformed parameters {
real log_theta;
real log_one_minus_theta;
vector[C] log_phi[2];
vector[C] alpha;
log_theta <- log(theta);
log_one_minus_theta <- log(1.0 - theta);
for( c in 1:C)
alpha[c] <- .5;
for( k in 1:2)
for( c in 1:C)
log_phi[k,c] <- log(phi[k,c]);
}
model {
theta ~ uniform(0,1); // equivalently, ~ beta(1,1);
for (k in 1:2){
mu[k] ~ normal(0,10);
phi[k] ~ dirichlet(alpha);
}
for (n in 1:N) {
lp__ <- lp__ + log_sum_exp(log_theta + normal_log(x[n],mu[1],1) + log_phi[1,y[n]],
log_one_minus_theta + normal_log(x[n],mu[2],1) + log_phi[2,y[n]]);
}
}
I compiled and ran the Stan model and used the parameters from the final sample to compute the probability of each datapoint under each mixture component. I then assigned each datapoint to the mixture component (cluster) with higher probability to recover the cluster assignments below:
Basically, the parameters for each mixture component will give you the core characteristics of each cluster if you have created a model appropriate for your dataset.
For heterogenous, non-Euclidean data vectors as you describe, hierarchical clustering algorithms often work best. The conditional probability condition you describe can be incorporated as an ordering of attributes used to perform cluster agglomeration or division. The semantics of the resulting clusters are easy to describe.

Resources