What does it mean for a probability P(V1|V2)=1 in a Bayesian network probability query?

I'm using the junction tree algorithm to calculate conditional probabilities like P(V1|V2) and got many values of 1. Can I interpret this as "if I observe V2, then I'm 100% sure that V1 occurs"? Is it possible for this to happen, or could there be something wrong with the calculation itself?
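It can legitimately happen whenever V1 is (nearly) deterministic given the evidence. A minimal sketch, assuming a recent pgmpy (class names vary slightly across versions) and a made-up two-node network in which V1 simply copies V2; pgmpy's BeliefPropagation runs on a junction tree:

from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import BeliefPropagation

# Hypothetical network: V2 -> V1, where V1 is a deterministic copy of V2.
model = BayesianNetwork([("V2", "V1")])
cpd_v2 = TabularCPD(variable="V2", variable_card=2, values=[[0.5], [0.5]])
cpd_v1 = TabularCPD(variable="V1", variable_card=2,
                    values=[[1.0, 0.0],   # P(V1=0 | V2=0), P(V1=0 | V2=1)
                            [0.0, 1.0]],  # P(V1=1 | V2=0), P(V1=1 | V2=1)
                    evidence=["V2"], evidence_card=[2])
model.add_cpds(cpd_v2, cpd_v1)
model.check_model()

# Junction-tree inference gives P(V1=1 | V2=1) = 1, which is perfectly valid here.
bp = BeliefPropagation(model)
print(bp.query(variables=["V1"], evidence={"V2": 1}))

If your CPDs are not deterministic and you still get many exact 1s, that would be worth double-checking.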

Related

KMeans evaluation metric not converging. Is this normal behavior or no?

I'm working on a problem that necessitates running KMeans separately on ~125 different datasets. Therefore, I'm looking to mathematically calculate the 'optimal' K for each respective dataset. However, the evaluation metric continues decreasing with higher K values.
For a sample dataset, there are 50K rows and 8 columns. Using sklearn's calinski-harabaz score, I'm iterating through different K values to find the optimum / minimum score. However, my code reached k=5,600 and the calinski-harabaz score was still decreasing!
Something weird seems to be happening. Does the metric not work well? Could my data be flawed (see my question about normalizing rows after PCA)? Is there another/better way to mathematically converge on the 'optimal' K? Or should I force myself to manually pick a constant K across all datasets?
Any additional perspectives would be helpful. Thanks!
I don't know anything about the calinski-harabaz score, but some scoring metrics will be monotone increasing or decreasing with respect to increasing K. For instance, the mean squared error for linear regression will always decrease each time a new feature is added to the model, which is why other scores that add a penalty for the number of features have been developed.
There is a very good answer here that covers CH scores well. A simple method that generally works well for these monotone scoring metrics is to plot K vs the score and choose the K where the score is no longer improving 'much'. This is very subjective but can still give good results.
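A minimal sketch of that plot-and-inspect approach, assuming X is one of your datasets as a NumPy array (note that in current scikit-learn the function is spelled calinski_harabasz_score and higher is better):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

def ch_curve(X, k_values):
    # Calinski-Harabasz score for each candidate K.
    scores = []
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores.append(calinski_harabasz_score(X, labels))
    return scores

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))        # synthetic stand-in for one of the ~125 datasets
k_values = list(range(2, 21))
scores = ch_curve(X, k_values)

plt.plot(k_values, scores, marker="o")
plt.xlabel("K")
plt.ylabel("Calinski-Harabasz score")
plt.show()                            # pick the K where the curve stops improving 'much'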
SUMMARY
The metric decreases with each increase of K; this strongly suggests that there is no natural clustering in the data set.
DISCUSSION
CH scores depend on the ratio between intra- and inter-cluster densities. For a relatively smooth distribution of points, each increase in K will give you clusters that are slightly more dense, with slightly lower density between them. Try a lattice of points: vary the radius and do the computations by hand; you'll see how that works. At the extreme end, K = n: each point is its own cluster, with infinite density, and 0 density between clusters.
OTHER METRICS
Perhaps the simplest metric is sum-of-squares, which is already part of the clustering computations. Sum the squares of distances from the centroid, divide by n-1 (n=cluster population), and then add/average those over all clusters.
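A small NumPy sketch of that metric, assuming X is the data matrix and labels holds the cluster assignment for each row (both names are just for illustration):

import numpy as np

def average_within_cluster_ss(X, labels):
    # Per cluster: sum of squared distances to the centroid, divided by n - 1,
    # then averaged over all clusters.
    values = []
    for c in np.unique(labels):
        members = X[labels == c]
        n = len(members)
        if n < 2:
            continue                      # a singleton cluster is skipped
        centroid = members.mean(axis=0)
        values.append(np.sum((members - centroid) ** 2) / (n - 1))
    return float(np.mean(values))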
I'm looking for a particular paper that discusses metrics for this very problem; if I can find the reference, I'll update this answer.
N.B. With any metric you choose (as with CH), a failure to find a local minimum suggests that the data really don't have a natural clustering.
WHAT TO DO NEXT?
Render your data in some form you can visualize. If you see a natural clustering, look at the characteristics; how is it that you can see it, but the algebra (metrics) cannot? Formulate a metric that highlights the differences you perceive.
I know, this is an effort similar to the problem you're trying to automate. Welcome to research. :-)
The problem with my question is that the 'best' Calinski-Harabasz score is the maximum, whereas my question assumed the 'best' was the minimum. It is computed from the ratio of between-cluster dispersion to within-cluster dispersion; you want to maximize the former (the numerator) and minimize the latter (the denominator). As it turned out, for this dataset the 'best' CH score was with 2 clusters (the minimum available for comparison). I also ran with K=1, and this produced good results as well. As Prune suggested, there appears to be no natural grouping within the dataset.
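To make that ratio concrete, here is a sketch that computes the score straight from its definition and checks it against sklearn (X and labels are placeholders as before):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

def ch_manual(X, labels):
    # CH = (between-cluster dispersion / (k - 1)) / (within-cluster dispersion / (n - k));
    # higher is better.
    n, k = len(X), len(np.unique(labels))
    overall_mean = X.mean(axis=0)
    ssb = sum(len(X[labels == c]) * np.sum((X[labels == c].mean(axis=0) - overall_mean) ** 2)
              for c in np.unique(labels))
    ssw = sum(np.sum((X[labels == c] - X[labels == c].mean(axis=0)) ** 2)
              for c in np.unique(labels))
    return (ssb / (k - 1)) / (ssw / (n - k))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(ch_manual(X, labels), calinski_harabasz_score(X, labels))   # the two should agree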

What is a bad, decent, good, and excellent F1-measure range?

I understand the F1-measure is the harmonic mean of precision and recall. But what values define how good or bad an F1-measure is? I can't seem to find any references (Google or academic) answering my question.
Consider sklearn.dummy.DummyClassifier(strategy='uniform'), a classifier that makes random guesses (i.e. a bad classifier). We can view DummyClassifier as a benchmark to beat; now let's look at its f1-score.
In a binary classification problem with a balanced dataset (6198 samples in total, 3099 labelled 0 and 3099 labelled 1), the f1-score is 0.5 for both classes, and the weighted average is 0.5.
In a second example, using DummyClassifier(strategy='constant'), i.e. guessing the same label every time (label 1 in this case), the average of the f1-scores is 0.33, while the f1 for label 0 is 0.00.
I consider these to be bad f1-scores, given the balanced dataset.
P.S. The summaries were generated using sklearn.metrics.classification_report.
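A minimal sketch reproducing that kind of baseline check, assuming a balanced binary dataset (the synthetic features and labels below are just for illustration):

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
X = rng.normal(size=(6198, 5))           # features are irrelevant to a dummy classifier
y = np.array([0] * 3099 + [1] * 3099)    # balanced binary labels

for strategy, kwargs in [("uniform", {}), ("constant", {"constant": 1})]:
    clf = DummyClassifier(strategy=strategy, random_state=0, **kwargs)
    clf.fit(X, y)
    print(f"--- strategy={strategy} ---")
    print(classification_report(y, clf.predict(X), zero_division=0))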
You did not find any reference for an f1-measure range because there is no such range. The F1 measure is a combined metric of precision and recall.
Let's say you have two algorithms, and one has higher precision but lower recall. From this observation alone you cannot tell which algorithm is better, unless your goal is simply to maximize precision.
So, given this ambiguity about how to choose between the two (one with higher recall, the other with higher precision), we use the f1-measure to select the superior one.
The f1-measure is a relative measure, which is why there is no absolute range that defines how good your algorithm is.

Dynamic Time Warping Similarity Percentage

I want the DTW algorithm to output a similarity percentage between 2 arrays of values. I want 0% to mean there is no similarity between the signals, and 100% to mean the 2 signals are identical. The modification I thought of is that 100% would correspond to the shortest possible path from the start to the end of the matrix, while 0% would correspond to the path that goes all the way to the right and then all the way up (length m+n). The value I want to map onto this interval is the length of the path generated by the DTW algorithm. Would this produce accurate results, or is there a better way?
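A rough NumPy sketch of the mapping described above, purely to make the idea concrete (the follow-up below explains why it did not work out in practice):

import numpy as np

def dtw_path_length(a, b):
    # Classic DTW on two 1-D sequences; returns the number of matrix cells on the optimal path.
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    i, j, length = n, m, 1               # backtrack to count the path cells
    while (i, j) != (1, 1):
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
        length += 1
    return length

def similarity_percent(a, b):
    # Shortest possible path (max(m, n) cells) -> 100 %, longest (m + n - 1 cells) -> 0 %.
    n, m = len(a), len(b)
    shortest, longest = max(n, m), n + m - 1
    if longest == shortest:              # degenerate case, e.g. single-element inputs
        return 100.0
    return 100.0 * (longest - dtw_path_length(a, b)) / (longest - shortest)

print(similarity_percent([1, 2, 3, 4], [1, 2, 3, 4]))   # identical signals -> 100.0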
After carefully studying the algorithm, it turned out that the minimum-cost path always lies on the diagonal of the matrix. This means that the modification I suggested above for calculating the similarity percentage will not work.

URL path similarity/string similarity algorithm

My problem is that I need to compare URL paths and deduce if they are similar. Below I provide example data to process:
# GROUP 1
/robots.txt
# GROUP 2
/bot.html
# GROUP 3
/phpMyAdmin-2.5.6-rc1/scripts/setup.php
/phpMyAdmin-2.5.6-rc2/scripts/setup.php
/phpMyAdmin-2.5.6/scripts/setup.php
/phpMyAdmin-2.5.7-pl1/scripts/setup.php
/phpMyAdmin-2.5.7/scripts/setup.php
/phpMyAdmin-2.6.0-alpha/scripts/setup.php
/phpMyAdmin-2.6.0-alpha2/scripts/setup.php
# GROUP 4
//phpMyAdmin/
I tried Levenshtein distance for the comparison, but it is not accurate enough for me. I do not need a 100% accurate algorithm, but I think 90% and above is a must.
I think that I need some sort of classifier, but the problem is that each batch of new data can contain paths that should be assigned to a new, previously unknown class.
Could you please point me in the right direction?
Thanks
Levenshtein distance is the best option, but it needs tuning. You have to use a weighted edit distance and possibly split the path into tokens: words and numbers. So, for example, version tokens like "2.5.6-rc2" and "2.5.6" can be treated as a difference of weight 0, while name tokens like phpMyAdmin and javaMyAdmin give a difference of weight 1.
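A rough sketch of that token-based weighted edit distance; the tokenizer regex and the notion of what counts as a 'version' token are my own simplifications for illustration:

import re

VERSION_RE = re.compile(r"^\d+(\.\d+)*(-\w+)?$")

def tokenize(path):
    # Split a URL path into name tokens and version-like tokens.
    return re.findall(r"[A-Za-z]+|\d+(?:\.\d+)*(?:-\w+)?", path)

def substitution_cost(a, b):
    # Two version-like tokens are interchangeable (weight 0); anything else weighs 1.
    if a == b:
        return 0
    if VERSION_RE.match(a) and VERSION_RE.match(b):
        return 0
    return 1

def weighted_token_distance(p1, p2):
    # Levenshtein distance over tokens, using the weighted substitution cost.
    s, t = tokenize(p1), tokenize(p2)
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        cur = [i]
        for j, b in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                            # deletion
                           cur[j - 1] + 1,                         # insertion
                           prev[j - 1] + substitution_cost(a, b))) # substitution
        prev = cur
    return prev[-1]

# Same tool, different version -> distance 0; unrelated paths still get a positive distance.
print(weighted_token_distance("/phpMyAdmin-2.5.6-rc1/scripts/setup.php",
                              "/phpMyAdmin-2.5.6/scripts/setup.php"))
print(weighted_token_distance("/robots.txt", "/bot.html"))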
While checking jakub.gieryluk's suggestion I accidentally found a solution that satisfies me: the "Hobohm clustering algorithm, originally devised to reduce redundancy of biological sequence data sets."
Tests of the Perl library implemented by Bruno Vecchi gave me really good results. The only problem is that I need a Python implementation, but I believe I can either find one on the Internet or reimplement the code myself.
The next thing is that I have not checked the active-learning ability of this algorithm yet ;)
I know it's not the exact answer to your question, but are you familiar with the k-means algorithm?
I guess even the Levenshtein distance can work here; the difficulty, however, is how to compute centroids with that approach.
Perhaps you can divide the input set into disjoint subsets, then for each URL in a subset compute the distance to all the other URLs in the same subset; the URL with the lowest sum of distances should be the centroid (of course, this depends on how big the input set is; for huge sets it might not be a good idea to do so). A sketch of this idea follows after this answer.
The good thing about k-means is that you can start with an absolutely random division and then iteratively make it better.
The bad thing about k-means is that you have to specify k before you start. However, during the run (perhaps once the situation has stabilized after the first couple of iterations), you can measure the intra-similarity of each set, and if it is low, you can split that set into two subsets and go on with the same algorithm.
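A minimal sketch of that medoid-as-centroid idea; a plain character-level Levenshtein is included so the snippet stays self-contained, but any edit distance (e.g. the weighted one above) can be plugged in:

def levenshtein(a, b):
    # Plain character-level Levenshtein distance (stand-in for any edit distance).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def medoid(urls, dist=levenshtein):
    # The 'centroid' of a group is the URL with the lowest total distance to the others.
    return min(urls, key=lambda u: sum(dist(u, v) for v in urls))

group = ["/phpMyAdmin-2.5.6-rc1/scripts/setup.php",
         "/phpMyAdmin-2.5.6/scripts/setup.php",
         "/phpMyAdmin-2.6.0-alpha/scripts/setup.php"]
print(medoid(group))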

RANSAC variation: does an inlier membership probability distribution make sense?

I'm using RANSAC to fit a geometric model to a point cloud with outliers. I know, because of the generation process of the point cloud, that 99.9% of the inlier distances to my model are distributed following a Gaussian probability density function with known μ and σ, in the interval [−3σ, 3σ].
The first question is whether you think it is reasonable to evaluate the total number of inliers for a given model by adding up each point's inlier membership probability instead of adding 1 for each inlier. That is, traditional RANSAC assumes that everything inside an interval delimited by a threshold is an inlier; I would like to know whether I can bend that, giving some inliers more weight than others according to a probability distribution.
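A small sketch of that soft scoring, assuming residuals holds the point-to-model distances; normalizing the Gaussian weight so that a residual at μ counts as 1 is my own choice, not something from the question:

import numpy as np

def hard_inlier_count(residuals, mu, sigma):
    # Traditional RANSAC consensus: every point inside the band counts as 1.
    residuals = np.asarray(residuals, dtype=float)
    return int(np.sum(np.abs(residuals - mu) <= 3.0 * sigma))

def soft_inlier_score(residuals, mu, sigma):
    # Weighted consensus: each in-band point contributes its Gaussian membership weight.
    residuals = np.asarray(residuals, dtype=float)
    in_band = np.abs(residuals - mu) <= 3.0 * sigma
    weights = np.exp(-0.5 * ((residuals - mu) / sigma) ** 2)   # peaks at 1 when r == mu
    return float(np.sum(weights[in_band]))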
In case this is reasonable, the second question is: how do you think it affects the number of samples N?
1 − (1 − (1 − e)^s)^N = p
where e is the probability that a point is an outlier, s is the number of points used in a sample, N is the number of samples (RANSAC iterations), and p is the desired probability that we draw at least one good sample.
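For reference, solving that equation for N gives the usual iteration count; a quick sketch (the e, s, p values are only an example):

import math

def ransac_iterations(e, s, p):
    # Smallest N with 1 - (1 - (1 - e)**s)**N >= p.
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - (1.0 - e) ** s))

print(ransac_iterations(e=0.3, s=4, p=0.99))   # -> 17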
If none of that is reasonable, how do you suggest I introduce my prior information about the inlier distribution?
Thanks in advance,
Federico
