Data mining: clustering in Non-Euclidean Spaces - cluster-computing

Consider the space of strings with edit distance as the distance
measure. Give an example of a set of strings such that if we choose the clustroid
by minimizing the sum of the distances to the other points we get one point
as the clustroid, but if we choose the clustroid by minimizing the maximum
distance to the other points, another point becomes the clustroid.
I'm stuck on this exercise. Can anyone help me out?

Try this set of strings:
badger
badger
badger
badger
badger
banana
nanana
Which string has the minimum sum of distances, and which the smallest maximum distance?

Based on Anony-Mousse's answer, I think I would give this:
backer
bacher
bacfer
bacper
bacjer
banono
kanono
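To check either answer numerically, you can compute all pairwise edit distances and compare the two criteria. A minimal sketch for the "badger" set (the helper is a standard dynamic-programming Levenshtein implementation, not anything from the thread; swap in the other list to test it too):

    def edit_distance(a, b):
        # Standard dynamic-programming Levenshtein distance with a rolling row.
        dp = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, cb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                         dp[j - 1] + 1,      # insertion
                                         prev + (ca != cb))  # substitution
        return dp[-1]

    strings = ["badger"] * 5 + ["banana", "nanana"]

    # Sum of distances and maximum distance for every candidate clustroid.
    sums = {s: sum(edit_distance(s, t) for t in strings) for s in set(strings)}
    maxs = {s: max(edit_distance(s, t) for t in strings) for s in set(strings)}

    print(min(sums, key=sums.get), sums)  # "badger" minimises the sum
    print(min(maxs, key=maxs.get), maxs)  # "banana" minimises the maximum distance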

Related

How to do text corrections with a suffix array?

We used suffix array to implement search by keywords, for example consider a phrase:
white bathroom tile
we insert suffixes:
1) white bathroom tile
2) bathroom tile
3) tile
and now the phrase "white bathroom tile" can be found if a user types in words: "white", "bathroom" or "tile".
However, now there's a problem, a person can type in "tyle" and nothing will be found.
So, I wanted to ask how to implement some sort of fast fuzzy search for this. Basically I want this algorithm to correct the user and still find "tile".
I considered applying Levenshtein distance, but my attempt failed. The idea was that we could find the group of words that start with "t", compute the Levenshtein distance for each one of them, and then return the results where the Levenshtein distance was minimal.
This failed because the user can type in "iile" instead of "tile", and then nothing works: my algorithm applies Levenshtein distance only to the words in the "i" group.
What is a good way to solve this?
You can use an edit distance algorithm to find the list of words which have the minimum edit distance to the searched word.
For example, for the words "tyle" and "ile" the edit distance to the searched word "tile" will be 1. For the word "iile", the edit distance between "tile" and "iile" will be 1 as well.
Update
If traversing all words in the suffix array and calculating the edit distance is slow (it is; the edit distance computation is quadratic, O(n^2), in the word length), I would suggest building a prefix tree (trie) with all the suffixes of the sentence. Then during lookup, for example for the word "tyle", try to traverse the prefix tree in this way:
If there is a node in the prefix tree for the current character, traverse the node
If there is no node for the current character, recursively traverse all child nodes and skip this character.
During lookup, count the number of characters you skipped. The fewer characters you skip, the better a candidate the word is.
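A rough sketch of that lookup-with-skips idea (the trie layout, the skip budget, and the "$" end-of-suffix marker are my own assumptions, not from the answer):

    def build_trie(suffixes):
        # Plain dict-of-dicts trie; "$" marks where a stored suffix ends.
        root = {}
        for s in suffixes:
            node = root
            for ch in s:
                node = node.setdefault(ch, {})
            node["$"] = s
        return root

    def fuzzy_lookup(node, query, i=0, skips=0, max_skips=2, results=None):
        # Collect stored suffixes reachable while skipping at most max_skips
        # query characters; fewer skips means a better candidate.
        if results is None:
            results = []
        if i == len(query):
            if "$" in node:
                results.append((skips, node["$"]))
            return results
        ch = query[i]
        if ch in node:                      # a node exists for this character
            fuzzy_lookup(node[ch], query, i + 1, skips, max_skips, results)
        elif skips < max_skips:             # no node: skip the character
            for key, child in node.items():
                if key != "$":
                    fuzzy_lookup(child, query, i + 1, skips + 1, max_skips, results)
        return results

    trie = build_trie(["white bathroom tile", "bathroom tile", "tile"])
    print(fuzzy_lookup(trie, "tyle"))   # [(1, 'tile')]
    print(fuzzy_lookup(trie, "iile"))   # [(1, 'tile')] -- the case that broke the "t"-group approach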
Found this interesting article about a data structure called BK-tree and related algorithms. So, I'm considering using a BK-tree.
Also this article talks about even more powerful methods.
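For reference, a hedged sketch of a BK-tree as usually described (the class layout and the small memoised distance helper are mine, not taken from the article):

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def lev(a, b):
        # Tiny memoised Levenshtein distance, only to keep the sketch self-contained.
        if not a or not b:
            return len(a) + len(b)
        return min(lev(a[1:], b) + 1,
                   lev(a, b[1:]) + 1,
                   lev(a[1:], b[1:]) + (a[0] != b[0]))

    class BKTree:
        def __init__(self, distance):
            self.distance = distance
            self.root = None               # each node is (word, {edge_distance: child})

        def add(self, word):
            if self.root is None:
                self.root = (word, {})
                return
            node = self.root
            while True:
                d = self.distance(word, node[0])
                if d in node[1]:
                    node = node[1][d]      # descend along the edge labelled d
                else:
                    node[1][d] = (word, {})
                    return

        def search(self, word, tol):
            # Return (distance, word) pairs within edit distance tol of word.
            results, stack = [], [self.root] if self.root else []
            while stack:
                node_word, children = stack.pop()
                d = self.distance(word, node_word)
                if d <= tol:
                    results.append((d, node_word))
                # Triangle inequality: only edges labelled in [d - tol, d + tol]
                # can lead to further matches.
                stack.extend(child for label, child in children.items()
                             if d - tol <= label <= d + tol)
            return sorted(results)

    tree = BKTree(lev)
    for w in ["tile", "tiles", "time", "mile", "bathroom", "white"]:
        tree.add(w)
    print(tree.search("tyle", 1))          # [(1, 'tile')]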
Levenshtein distance works better for single words. In addition you could use cosine similarity, a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them,
and for sentence or paragraph similarity you could use a TF-IDF measure.
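For instance, a minimal sketch of the TF-IDF + cosine-similarity suggestion using scikit-learn (an assumed dependency; any TF-IDF implementation would behave the same way):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["white bathroom tile",
            "blue kitchen tile",
            "white bathroom sink"]

    vectors = TfidfVectorizer().fit_transform(docs)
    print(cosine_similarity(vectors))   # pairwise similarity matrix, 1.0 on the diagonal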

Normalization of a multi-dimensional space, what algorithm is this?

I'm not a trained statistician so I apologize for the incorrect usage of some words. I'm just trying to get some good results from the Weka Nearest Neighbor algorithms. I'll use some redundancy in my explanation as a means to try to get the concept across:
Is there a way to normalize a multi-dimensional space so that the distances between any two instances are always proportional to the effect on the dependent variable?
In other words I have a statistical data set and I want to use a "nearest neighbor" algorithm to find instances that are most similar to a specified test instance. Unfortunately my initial results are useless, because two attributes that are very close in value but only weakly correlated with the dependent variable will incorrectly bias the distance calculation.
For example, let's say you're trying to find the nearest neighbor of a given car based on a database of cars: make, model, year, color, engine size, number of doors. We know intuitively that the make, model, and year have a bigger effect on price than the number of doors. So a car that matches only in color and door count should not rank as a nearer neighbor than a car with a different color and door count but the same make/model/year. What algorithm(s) can be used to appropriately set the weights of each independent variable in the nearest-neighbor distance calculation so that the distance will be statistically proportional (correlated, whatever) to the dependent variable?
Application: This can be used for a more accurate "show me products similar to this other product" on shopping websites. Back to the car example, this would have cars of same make and model bubbling up to the top, with year used as a tie-breaker, and then within cars of the same year, it might sort the ones with the same number of cylinders (4 or 6) ahead of the ones with the same number of doors (2 or 4). I'm looking for an algorithmic way to derive something similar to the weights that I know intuitively (make >> model >> year >> engine >> doors) and actually assign numerical values to them to be used in the nearest-neighbor search for similar cars.
A more specific example:
Data set:
Blue,Honda,6-cylinder
Green,Toyota,4-cylinder
Blue,BMW,4-cylinder
now find cars similar to:
Blue,Honda,4-cylinder
in this limited example, it would match the Green,Toyota,4-cylinder ahead of the Blue,Honda,6-cylinder because the two brands are statistically almost interchangeable and cylinder count is a stronger determinant of price than color. The BMW would match lower because that brand tends to double the price, i.e. it places the item at a larger distance.
Final note: the prices are available during training of the algorithm, but not during calculation.
Possibly you should look at Solr/Lucene for this. Solr provides similarity search based on field value frequency and it already has MoreLikeThis functionality for finding similar items.
Maybe nearest neighbor is not a good algorithm for this case? As you are dealing with discrete values it can become quite hard to define reasonable distances. I think a C4.5-like algorithm may better suit the application you describe. At each step the algorithm optimizes the information entropy, so you always select the feature that gives you the most information.
Found something on the IEEE website. The algorithm is called DKNDAW ("dynamic k-nearest-neighbor with distance and attribute weighted"). I couldn't locate the actual paper (it probably needs a paid subscription). This looks very promising, assuming that the attribute weights are computed by the algorithm itself.
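Since the paper itself isn't accessible, here is only a rough sketch of the general attribute-weighting idea (not the DKNDAW algorithm; the weighting scheme, column names, and toy prices are all made up): weight each categorical attribute by how much of the price variance it explains during training, then plug those weights into a weighted matching distance.

    import numpy as np

    cols = ["color", "make", "cylinders"]
    rows = [("Blue",  "Honda",  "6-cyl", 21000),
            ("Green", "Toyota", "4-cyl", 19000),
            ("Blue",  "BMW",    "4-cyl", 41000),
            ("Red",   "Honda",  "4-cyl", 18500)]
    X = [r[:3] for r in rows]
    price = np.array([r[3] for r in rows], dtype=float)

    def attribute_weight(values, target):
        # Between-group variance of the target when grouping by this attribute:
        # a crude measure of how much price variation the attribute explains.
        means = [target[[v == g for v in values]].mean() for g in set(values)]
        return float(np.var(means))

    weights = [attribute_weight([x[i] for x in X], price) for i in range(len(cols))]

    def weighted_distance(a, b):
        # Weighted matching distance over categorical attributes.
        return sum(w * (ai != bi) for w, ai, bi in zip(weights, a, b))

    query = ("Blue", "Honda", "4-cyl")
    print(dict(zip(cols, np.round(weights, 1))))                 # "make" gets the largest weight on this toy data
    print(sorted(X, key=lambda x: weighted_distance(query, x)))  # neighbours ranked by weighted distance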

Find all k-nearest neighbors

Problem:
I have N (~100m) strings, each D (e.g. 100) characters long, over a small alphabet (e.g. 4 possible characters). I would like to find the k nearest neighbors for every one of those N points (k ~ 0.1D), with nearness defined via Hamming distance. The solution doesn't have to be exact, but the closer the better.
Thoughts about the problem
I have a bad feeling this is a non-trivial problem. I have read many papers and algorithms; however, most of them give poor results in high dimensions and only work well when the dimension is less than about 5. For example, this paper suggests an efficient algorithm, but its constant grows exponentially with the dimension.
Currently, I am investigating how I can reduce the dimensionality in a way that preserves Hamming distance, or at least allows it to be computed.
Another option is locality-sensitive hashing: points that are close to each other under the chosen metric are mapped to the same bucket with high probability. Any help? Which option would you prefer?
One of the previously asked questions has some good discussions, so you can refer to that,
Nearest neighbors in high-dimensional data?
Other than this, you can also look at,
http://web.cs.swarthmore.edu/~adanner/cs97/s08/papers/dahl_wootters.pdf
A few papers which analyze different approaches:
http://www.jmlr.org/papers/volume11/radovanovic10a/radovanovic10a.pdf
https://www.cse.ust.hk/~yike/sigmod09-lsb.pdf
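A hedged sketch of the locality-sensitive hashing option for Hamming distance ("bit sampling"; the table count, sample size, and toy data are arbitrary choices):

    import random
    from collections import defaultdict

    def build_lsh_tables(strings, D, num_tables=20, positions_per_table=12, seed=0):
        # Each table samples a few positions; strings that agree on all sampled
        # positions fall into the same bucket, so close strings collide often.
        rng = random.Random(seed)
        tables = []
        for _ in range(num_tables):
            pos = sorted(rng.sample(range(D), positions_per_table))
            buckets = defaultdict(list)
            for idx, s in enumerate(strings):
                buckets[tuple(s[p] for p in pos)].append(idx)
            tables.append((pos, dict(buckets)))
        return tables

    def candidate_neighbours(query, tables):
        # Union of all indices that collide with the query in at least one table.
        out = set()
        for pos, buckets in tables:
            out.update(buckets.get(tuple(query[p] for p in pos), []))
        return out

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    # Toy data: strings of length D over a 4-letter alphabet.
    D, alphabet, rng = 100, "ACGT", random.Random(1)
    strings = ["".join(rng.choice(alphabet) for _ in range(D)) for _ in range(1000)]
    tables = build_lsh_tables(strings, D)
    q, k = strings[0], 10
    # Score the candidates exactly by Hamming distance and keep the k best.
    print(sorted((hamming(q, strings[i]), i) for i in candidate_neighbours(q, tables))[:k])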

Better distance metrics besides Levenshtein for ordered word sets and subsequent clustering

I am trying to solve a problem that involves comparing large numbers of word sets, each of which contains a large, ordered number of words drawn from a vocabulary of around 600+ words (very high dimensionality!), for similarity, and then clustering them into distinct groupings. The solution needs to be as unsupervised as possible.
The data looks like
[Apple, Banana, Orange...]
[Apple, Banana, Grape...]
[Jelly, Anise, Orange...]
[Strawberry, Banana, Orange...]
...etc
The order of the words in each set matters ([Apple, Banana, Orange] is distinct from [Apple, Orange, Banana]).
The approach I have been using so far has been to use Levenshtein distance (limited by a distance threshold) as a metric, calculated in a Python script with each word treated as a single token, to generate a similarity matrix from the distances, and then to throw that matrix into k-Medoids in KNIME for the groupings.
My questions are:
Is Levenshtein the most appropriate distance metric to use for this problem?
Is mean/medoid prototype clustering the best way to go about the groupings?
I haven't yet put much thought into validating the choice for 'k' in the clustering. Would evaluating an SSE curve of the clustering be the best way to go about this?
Are there any flaws in my methodology?
As an extension to the solution in the future, given training data, would anyone happen to have any ideas for assigning probabilities to cluster assignments? For example, set 1 has an 80% chance of being in cluster 1, etc.
I hope my questions don't seem too silly or the answers painfully obvious, I'm relatively new to data mining.
Thanks!
Yes, Levenshtein is a very suitable way to do this. But if the sequences vary in size much, you might be better off normalising these distances by dividing by the sum of the sequence lengths -- otherwise you will find that observed distances tend to increase for pairs of long sequences whose "average distance" (in the sense of the average distance between corresponding k-length substrings, for some small k) is constant.
Example: The pair ([Apple, Banana], [Carrot, Banana]) could be said to have the same "average" distance as ([Apple, Banana, Widget, Xylophone], [Carrot, Banana, Yam, Xylophone]) since every 2nd item matches in both, but the latter pair's raw Levenshtein distance will be twice as great.
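A minimal sketch of that normalisation, treating each word as a single token (the function names are mine):

    def token_levenshtein(a, b):
        # Standard Levenshtein DP, operating on lists of words instead of characters.
        dp = list(range(len(b) + 1))
        for i, ta in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, tb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ta != tb))
        return dp[-1]

    def normalised_distance(a, b):
        # Divide by the sum of the sequence lengths, as suggested above.
        return token_levenshtein(a, b) / (len(a) + len(b))

    short = (["Apple", "Banana"], ["Carrot", "Banana"])
    long_ = (["Apple", "Banana", "Widget", "Xylophone"],
             ["Carrot", "Banana", "Yam", "Xylophone"])
    print(token_levenshtein(*short), token_levenshtein(*long_))      # 1 2   (raw distance doubles)
    print(normalised_distance(*short), normalised_distance(*long_))  # 0.25 0.25 (normalised: equal)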
Also bear in mind that Levenshtein does not make special allowances for "block moves": if you take a string, and move one of its substrings sufficiently far away, then the resulting pair (of original and modified strings) will have the same Levenshtein score as if the 2nd string had completely different elements at the position where the substring was moved to. If you want to take this into account, consider using a compression-based distance instead. (Although I say there that it's useful for computing distances without respect to order, it does of course favour ordered similarity to disordered similarity.)
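For the compression-based option, one common formulation is the normalised compression distance; a hedged sketch with zlib (the repetition in the toy data just gives the compressor something to work with):

    import zlib

    def ncd(x: bytes, y: bytes) -> float:
        # Normalised compression distance: approximates information distance
        # with a real compressor; lower means more similar.
        cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
        cxy = len(zlib.compress(x + y))
        return (cxy - min(cx, cy)) / max(cx, cy)

    a = " ".join(["Apple", "Banana", "Widget", "Xylophone"] * 5).encode()
    b = " ".join(["Carrot", "Banana", "Yam", "Xylophone"] * 5).encode()
    print(ncd(a, a), ncd(a, b))   # a sequence scores closer to itself than to the other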
Check out SimMetrics on SourceForge, a platform supporting a variety of metrics that you can use to evaluate which works best for a task.
For a commercially supported version, check out K-Similarity from K-Now.co.uk.

Algorithm for measuring distance between disordered sequences

The Levenshtein distance gives us a way to calculate the distance between two similar strings in terms of disordered individual characters:
quick brown fox
quikc brown fax
The Levenshtein distance = 3.
What is a similar algorithm for the distance between two strings with similar subsequences?
For example, in
quickbrownfox
brownquickfox
the Levenshtein distance is 10, but this takes no account of the fact that the strings have two similar subsequences, which makes them more "similar" than completely disordered words like
quickbrownfox
qburiocwknfox
and yet this completely disordered version has a Levenshtein distance of eight.
What distance measures exist which take the length of subsequences into account, without assuming that the subsequences can be easily broken into distinct words?
I think you can try shingles (n-grams), or some combination of them with Levenshtein distance.
One simple metric would be to take all O(n^2) substrings of each string and see how many overlap. There are some simple variations to this approach where you only look at substrings up to a certain length.
This would be similar to the BLEU score commonly used to evaluate machine translations. In the case of BLEU, they are comparing two sentences: they take all the unigrams, bigrams, trigrams, and 4-grams of words from each sentence. They calculate a version of precision and recall for each, and essentially use an average of those scores.
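A hedged sketch of that substring/n-gram overlap idea (the Jaccard combination and the maximum n-gram length are my own choices, not BLEU itself):

    def ngrams(s, max_n=4):
        # All character n-grams of s up to length max_n.
        return {s[i:i + n] for n in range(1, max_n + 1) for i in range(len(s) - n + 1)}

    def ngram_similarity(a, b, max_n=4):
        # Jaccard similarity of the two n-gram sets: 1.0 means identical sets.
        A, B = ngrams(a, max_n), ngrams(b, max_n)
        return len(A & B) / len(A | B)

    print(ngram_similarity("quickbrownfox", "brownquickfox"))  # high: the moved blocks still share n-grams
    print(ngram_similarity("quickbrownfox", "qburiocwknfox"))  # lower: the scrambled version shares far fewer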
Initial stab: use a diff algorithm and take the count of differences as your distance.
I have the impression that this is an NP-complete problem.
At least, I cannot see how we can avoid an exhaustive search. Moreover, I cannot even see how we can verify a given solution in polynomial time.
Well, the problem you're referring to falls under context-sensitive grammars.
You basically define a grammar, the English grammar in this case, and then find the distance between the grammar and a mismatch. You'll need to parse your input first.
