Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I want to do fuzzy matching of millions of records from multiple files. I identified two algorithms for that: Jaro-Winkler and Levenshtein edit distance.
I was not able to understand the difference between the two. It seems Levenshtein gives the number of edits between two strings, and Jaro-Winkler provides a normalized score between 0.0 and 1.0.
My questions:
What are the fundamental differences between the two algorithms?
What is the performance difference between the two algorithms?
Levenshtein counts the number of edits (insertions, deletions, or substitutions) needed to convert one string to the other. Damerau-Levenshtein is a modified version that also considers transpositions as single edits. Although the output is the integer number of edits, this can be normalized to give a similarity value by the formula
1 - (edit distance / length of the longer of the two strings)
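As a rough illustration of the formula above, here is a minimal Python sketch of the plain Levenshtein distance and its normalized similarity. The function names `levenshtein` and `similarity` are just illustrative choices, and treating two empty strings as identical is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Count single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the row we store as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """1 - (edit distance / length of the longer string)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
print(similarity("kitten", "sitting"))
```

Each call costs time proportional to the product of the two string lengths, which is part of why comparing every pair in a large data set gets expensive.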
The Jaro algorithm measures the characters two strings have in common, where a character counts as common only if its positions in the two strings are no more than about half the length of the longer string apart, with consideration for transpositions. Winkler modified this algorithm to support the idea that differences near the start of the string are more significant than differences near the end of the string. Jaro and Jaro-Winkler are suited for comparing smaller strings like words and names.
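To make that description more concrete, here is a minimal sketch of Jaro and Jaro-Winkler in Python. It follows the commonly cited formulation (match window of half the longer length minus one, prefix scale 0.1, prefix capped at 4 characters); those constants are assumptions taken from the usual definition, not from this answer.

```python
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    score = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or prefix == max_prefix:
            break
        prefix += 1
    # Boost the score for strings that share a common prefix.
    return score + prefix * prefix_scale * (1 - score)


print(round(jaro("MARTHA", "MARHTA"), 4))          # 0.9444
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```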
Deciding which to use is not just a matter of performance. It's important to pick a method that is suited to the nature of the strings you are comparing. In general though, both of the algorithms you mentioned can be expensive, because each string must be compared to every other string, and with millions of strings in your data set, that is a tremendous number of comparisons. That is much more expensive than something like computing a phonetic encoding for each string, and then simply grouping strings sharing identical encodings.
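For illustration, grouping by a phonetic code might look like the sketch below. It uses a simplified Soundex as the encoding, which is my choice for the example; the answer does not name a specific phonetic algorithm, and `group_by_code` is a hypothetical helper.

```python
from collections import defaultdict

# Letter-to-digit table of classic Soundex.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _letter in _letters:
        _CODES[_letter] = _digit


def soundex(name: str) -> str:
    """Simplified Soundex: first letter plus three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    last_digit = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != last_digit:
            encoded += digit
        if ch not in "HW":
            last_digit = digit  # vowels break a run of equal codes, H and W do not
    return (encoded + "000")[:4]


def group_by_code(names):
    """One pass over the data instead of comparing every pair."""
    groups = defaultdict(list)
    for name in names:
        groups[soundex(name)].append(name)
    return dict(groups)


print(group_by_code(["Robert", "Rupert", "Smith", "Smyth"]))
# {'R163': ['Robert', 'Rupert'], 'S530': ['Smith', 'Smyth']}
```

A common follow-up, if you still want Jaro-Winkler or Levenshtein scores, is to run the expensive comparison only within each group.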
There is a wealth of detailed information on these algorithms and other fuzzy string matching algorithms on the internet. This one will give you a start:
A Comparison of Personal Name Matching: Techniques and Practical Issues
According to that paper, the four Jaro and Levenshtein algorithms I've mentioned rank as follows, from fastest to slowest:
Jaro
Jaro-Winkler
Levenshtein
Damerau-Levenshtein
with the slowest taking 2 to 3 times as long as the fastest. Of course these times are dependent on the lengths of the strings and the implementations, and there are ways to optimize these algorithms that may not have been used.
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
Is it possible to show that a task is done in the minimum number of commands or lines of code in a language? It is obvious that if you can do a task in one command, that is the shortest way to do it, but this only holds for tasks like addition. If I create, say, an algorithm for sorting, how would I know whether a faster way to carry out the task exists?
First off, the minimum number of lines of code does not necessarily mean the minimum number of commands (i.e. processor commands). As the former is not really significant in an algorithmic sense, I am assuming that you are trying to find out the latter.
On that note, there are a variety of techniques for proving the minimum number of steps (not commands) needed to do some complex tasks. The minimal number of steps necessary to achieve a task does not directly correspond to the minimum number of commands, but it should be relatively straightforward to adapt these techniques to find the minimum number of commands needed to solve the problem. Note that these techniques do not necessarily yield a lower bound for every complex task; whether a lower bound can be found depends on the specific task.
Incidentally, (comparison-based) sorting, which you mentioned in your question, is one of the tasks for which such a proof method exists, namely decision trees. You can find a more detailed description of the method in many sources, but in short it finds the least number of comparisons that must be made in order to sort an array. It is a well-known technique at the heart of proving that comparison-based sorting algorithms have a time complexity lower bound of N log N.
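The decision-tree argument boils down to this: a comparison sort must distinguish all N! orderings, and a binary tree with N! leaves has height at least log2(N!). A small Python sketch of that bound follows (the function name is hypothetical):

```python
import math


def min_comparisons_lower_bound(n: int) -> int:
    """Worst-case comparisons any comparison sort needs: ceil(log2(n!))."""
    # For an integer x >= 1, (x - 1).bit_length() equals ceil(log2(x)) exactly.
    return (math.factorial(n) - 1).bit_length()


for n in (3, 4, 5, 10, 100):
    print(n, min_comparisons_lower_bound(n), round(n * math.log2(n)))
```

The second and third columns grow at the same rate, which is the N log N bound in concrete numbers.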
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 5 years ago.
I would like to compute some sort of similarity score for search queries entered on Google.
This means, among other things, that the order of the words does not necessarily matter. For example:
"adidas shoes blue" and "blue shoes adidas"
should be considered the exact same sequence, which I believe is not the case in many of the traditional distance algorithms.
The example above could be solved with cosine similarity I guess, but what if I have:
"adiddas shoes blue"
I would like the algorithm to yield a very small distance from the original "adidas shoes blue".
Does such an algorithm exist?
Use the Soft Cosine Similarity and set the similarity measure between terms to the Levenshtein distance. The Soft Cosine Similarity generalizes the traditional Cosine Similarity measure by taking into account the edit distance between pairs of terms. In other words, the Soft Cosine Similarity measure compensates for the fact that the different dimensions of the vector space are not really orthogonal.
Note that you have to normalize the Levenshtein distance so that identical terms have a similarity of 1 (that is, if the distance between two terms is 0, their similarity must be 1).
More details can be found in the paper suggesting the soft similarity measure.
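Here is a minimal sketch of the idea in Python, assuming whitespace tokenization, raw term counts as vector weights, and `1 - distance / longer length` as the term similarity. All three are choices made for the example, not requirements of the measure.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def term_similarity(s: str, t: str) -> float:
    """Normalized so that identical terms score exactly 1."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


def soft_cosine(query1: str, query2: str) -> float:
    a, b = Counter(query1.split()), Counter(query2.split())

    def bilinear(x, y):
        # sum over all term pairs of sim(s, t) * weight(s) * weight(t)
        return sum(term_similarity(s, t) * ws * wt
                   for s, ws in x.items() for t, wt in y.items())

    denom = math.sqrt(bilinear(a, a) * bilinear(b, b))
    return bilinear(a, b) / denom if denom else 0.0


print(soft_cosine("adidas shoes blue", "blue shoes adidas"))
print(soft_cosine("adiddas shoes blue", "adidas shoes blue"))
```

Word order does not matter because the queries are reduced to bags of terms, and the misspelled "adiddas" still contributes because its similarity to "adidas" is close to 1.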
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I have n (about 10^5) points on a hypersphere of dimension m (between 10^4 and 10^6).
I am going to make a bunch of queries of the form "given a point p, find the closest of the n points to p". I'll make about n of these queries.
(Not sure if the hypersphere fact helps at all.)
The simple naive algorithm to solve this is, for each query, to compare p to all other n points. Doing this n times ends up with a runtime of O(n^2 m), which is far too big for me to be able to compute.
Is there a more efficient algorithm I can use? If I could get it to O(nm) with some log factors that'd be great.
Probably not. Having many dimensions makes efficient indexing extremely hard. That is why people look for opportunities to reduce the number of dimensions to something manageable.
See https://en.wikipedia.org/wiki/Curse_of_dimensionality and https://en.wikipedia.org/wiki/Dimensionality_reduction for more.
Divide your space up into hypercubes -- call these cells -- with edge size chosen so that on average you'll have one point per cube. You'll want a map from hypercells to the set of points they contain.
Then, given a point, check its hypercell for other points. If it is empty, look at the adjacent hypercells (I'd recommend a literal hypercube of hypercells for simplicity rather than some approximation to a hypersphere built out of hypercells). Check that for other points. Keep repeating until you get a point. Assuming your points are randomly distributed, odds are high that you'll find a second point within 1-2 expansions.
Once you find a point, check all hypercells that could possibly contain a closer point. This is necessary because the point you found may sit in a far corner of its cell, while a closer point lies just outside the hypercube of hypercells you have inspected so far.
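The cell-based search described above can be sketched in Python as follows. This is only an illustration under assumptions: the names are hypothetical, the demo is 2-dimensional, and the ring enumeration visits (2r+1)^m offsets, so it is only workable when the number of dimensions m is small.

```python
import math
from collections import defaultdict
from itertools import product


def build_grid(points, cell):
    """Map each hypercell (tuple of integer coordinates) to its points."""
    grid = defaultdict(list)
    for p in points:
        grid[tuple(math.floor(x / cell) for x in p)].append(p)
    return grid


def cells_in_ring(center, r):
    """Cells whose largest coordinate offset from `center` is exactly r."""
    for offset in product(range(-r, r + 1), repeat=len(center)):
        if max(abs(o) for o in offset) == r:
            yield tuple(c + o for c, o in zip(center, offset))


def nearest(grid, cell, query, max_rings=50):
    center = tuple(math.floor(x / cell) for x in query)
    best, best_d = None, math.inf
    for r in range(max_rings + 1):
        for key in cells_in_ring(center, r):
            for p in grid.get(key, ()):
                d = math.dist(p, query)
                if d < best_d:
                    best, best_d = p, d
        # Every point outside the rings seen so far is at least r * cell away,
        # so once the best distance is within that bound nothing can beat it.
        if best is not None and best_d <= r * cell:
            break
    return best


points = [(0.1, 0.1), (0.9, 0.9), (5.2, 5.1), (-3.0, 2.0)]
grid = build_grid(points, cell=1.0)
print(nearest(grid, 1.0, (4.0, 4.0)))  # (5.2, 5.1)
```

The stopping test is the "check all hypercells that could possibly contain a closer point" step: the search keeps expanding until the best distance found is no larger than the distance to the unexplored region.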
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I am going through a list of algorithms that I found and trying to implement them for learning purposes. Right now I am coding k-means and am confused about the following:
How do you know how many clusters there are in the original data set?
Is there any particular rule I have to follow in choosing the initial cluster centroids, besides all centroids having to be different? For example, does the algorithm converge if I choose cluster centroids that are different but close together?
Any advice would be appreciated
Thanks
With k-means you are minimizing a sum of squared distances. One approach is to try all plausible values of k. As k increases the sum of squared distances should decrease, but if you plot the result you may see that the sum of squared distances decreases quite sharply up to some value of k, and then much more slowly after that. The last value that gave you a sharp decrease is then the most plausible value of k.
k-means isn't guaranteed to find the best possible answer each run, and it is sensitive to the starting values you give it. One way to reduce problems from this is to start it many times, with different starting values, and pick the best answer. It looks a bit odd if the sum of squared distances found for a larger k is actually larger than the one found for a smaller k. One way to avoid this is to use the best answer found for k clusters as the basis (with slight modifications) for one of the starting points for k+1 clusters.
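As a rough sketch of both suggestions (try several k, restart several times and keep the best run), here is a small pure-Python k-means. The function names, the fixed seed, and the toy data are all assumptions made for the example.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, rng, iterations=100):
    """One run of Lloyd's algorithm; returns (centroids, sum of squared distances)."""
    centroids = rng.sample(points, k)  # k distinct data points as the start
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster))
            if cluster else centroids[i]  # keep an empty cluster's old centroid
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    sse = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, sse


def best_of(points, k, restarts=10, seed=0):
    """Run k-means several times and keep the run with the smallest SSE."""
    rng = random.Random(seed)
    return min((kmeans(points, k, rng) for _ in range(restarts)),
               key=lambda run: run[1])


data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
for k in range(1, 5):
    print(k, best_of(data, k)[1])
```

On this toy data the sum of squared distances drops sharply from k=1 to k=2 and only slightly afterwards, which is the "elbow" that points to k=2.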
In standard K-Means the value of K is chosen by you, sometimes based on the problem itself (when you know how many classes exist or how many classes you want to exist), other times as a "more or less" random value. Typically the first iteration consists of randomly selecting K points from the dataset to serve as centroids. In the following iterations the centroids are adjusted.
Once you have worked through the K-Means algorithm, I suggest you also look at k-means++, an improved version that chooses the initial centroids more carefully, avoiding the sometimes poor clusterings found by the standard k-means algorithm. Note that it still requires you to choose K yourself.
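For reference, the k-means++ seeding step can be sketched like this in Python: the first centroid is picked uniformly at random, and each further one is picked with probability proportional to its squared distance from the nearest centroid chosen so far. The function name is hypothetical, and the sketch assumes the data contains at least k distinct points.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, rng):
    """Pick k starting centroids that tend to be spread out over the data."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Weight of each point: squared distance to its nearest chosen centroid.
        # Already chosen points get weight 0 and cannot be picked again.
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
print(kmeans_pp_init(data, 2, random.Random(0)))
```

The returned points would then replace the purely random starting centroids of the standard algorithm; the rest of k-means stays the same.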
If you need more specific details on implementation of some machine learning algorithm, please let me know.
Closed. This question is off-topic. It is not currently accepting answers.
Closed 9 years ago.
I have two text files which I'd like to compare. What I did is:
I've split both of them into sentences.
I've measured the Levenshtein distance between each of the sentences from one file and each of the sentences from the second file.
I'd like to calculate the average similarity between those two text files, but I have trouble producing any meaningful value. Obviously the arithmetic mean (sum of all the [normalized] distances divided by the number of comparisons) is a bad idea.
How to interpret such results?
edit:
Distance values are normalized.
The Levenshtein distance has a maximum value, namely the length of the longer of the two input strings. It cannot get worse than that. So a normalized similarity index (0=bad, 1=match) for two strings a and b can be calculated as 1 - distance(a,b)/max(a.length, b.length).
Take one sentence from File A. You said you'd compare this to each sentence of File B. I guess you are looking for a sentence out of B which has the smallest distance (i.e. the highest similarity index).
Simply calculate the average of all those best-match similarity indexes (one per sentence of File A, each taken from its minimum-distance partner in File B). This should give you a rough estimate of the similarity of the two texts.
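A minimal Python sketch of that averaging, assuming the sentences are already split into two lists (the function names and the handling of empty input are my own choices):

```python
def levenshtein(a: str, b: str) -> int:
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def file_similarity(sentences_a, sentences_b) -> float:
    """Average, over sentences of A, of the best similarity found in B."""
    if not sentences_a or not sentences_b:
        return 0.0  # assumption: nothing to compare means no similarity
    best = [max(similarity(a, b) for b in sentences_b) for a in sentences_a]
    return sum(best) / len(best)


a = ["The cat sat on the mat.", "It was raining."]
b = ["It was raining!", "The cat sat on a mat."]
print(file_similarity(a, b))
```

Note that the score is not symmetric: comparing A to B can give a different value than comparing B to A, so you may want to compute both directions and average them.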
But what makes you think that two texts which are similar might have their sentences shuffled? My personal opinion is that you should also introduce stop word lists, synonyms and all that.
Nevertheless: Please also check trigram matching which might be another good approach to what you are looking for.
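One simple way to do trigram matching is sketched below in Python. It compares the sets of character trigrams with the Jaccard index, which is only one of several common formulations; lowercasing and the absence of padding are assumptions of this example.

```python
def trigrams(text: str) -> set:
    """Set of all 3-character substrings of the lowercased text."""
    text = text.lower()
    if len(text) < 3:
        return {text} if text else set()
    return {text[i:i + 3] for i in range(len(text) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard index of the two trigram sets (1.0 = same set of trigrams)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


print(trigram_similarity("hello world", "hello word"))  # about 0.7
```

Because it works on sets, this is cheaper per comparison than an edit distance and less sensitive to where in the text a change occurs.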