I would like to compute some sort of similarity score for search queries searched on google.
This means, among other things, that the order of the words does not necessarily matter. For example:
"adidas shoes blue" and "blue shoes adidas"
should be considered the exact same query, which I believe is not the case with many of the traditional distance algorithms.
The example above could be solved with cosine similarity I guess, but what if I have:
"adiddas shoes blue"
I would like the algorithm to yield a distance very similar to that of the original "adidas shoes blue".
Does such an algorithm exist?
Use the Soft Cosine Similarity and set the similarity measure between terms to the Levenshtein distance. The Soft Cosine Similarity generalizes the traditional Cosine Similarity measure by taking into account the edit distance between pairs of terms. In other words, the Soft Cosine Similarity measure compensates for the fact that the different dimensions of the vector space are not really orthogonal.
Note that you have to normalize the Levenshtein distance in such a way that identical terms have a similarity of 1 (that is, if the distance between two terms is 0, their similarity has to be 1).
More details can be found in the paper suggesting the soft similarity measure.
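For illustration, here is a minimal, self-contained Python sketch of the idea, assuming word-count vectors and a Levenshtein-based term similarity normalized so identical terms score 1 (the function names are mine, not a fixed API; libraries such as Gensim also offer soft cosine implementations):

    from collections import Counter

    def levenshtein(a, b):
        # Classic dynamic-programming edit distance.
        if len(a) < len(b):
            a, b = b, a
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                # deletion
                               cur[j - 1] + 1,             # insertion
                               prev[j - 1] + (ca != cb)))  # substitution
            prev = cur
        return prev[-1]

    def term_similarity(t1, t2):
        # Normalized so that identical terms have similarity 1.
        return 1.0 - levenshtein(t1, t2) / max(len(t1), len(t2))

    def soft_cosine(q1, q2):
        c1, c2 = Counter(q1.lower().split()), Counter(q2.lower().split())
        vocab = set(c1) | set(c2)
        def soft_dot(x, y):
            return sum(term_similarity(ti, tj) * x[ti] * y[tj]
                       for ti in vocab for tj in vocab)
        return soft_dot(c1, c2) / (soft_dot(c1, c1) ** 0.5 * soft_dot(c2, c2) ** 0.5)

    print(soft_cosine("adidas shoes blue", "blue shoes adidas"))   # 1.0
    print(soft_cosine("adiddas shoes blue", "adidas shoes blue"))  # close to 1.0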
I have the latitude and longitude of N societies and the order count for each of them. I also have the latitude and longitude of a warehouse from which the trucks will be deployed and sent to these societies (like Amazon deliveries). A truck can deliver a maximum of 350 orders (every society's order count is below 350), so there is no need to consider societies with an order count above 350 (we would generally send two trucks there, or a bigger truck). Now I need to determine a pattern in which the trucks should be deployed so that a minimum number of trips occur.
Assuming that the distance 'X' between any two societies (or between the warehouse and a society) computed by this script is accurate, how do we solve this? I first thought we could solve it as a subset-sum problem, maybe? It seems like DP on graphs to me, a travelling salesman problem with an unlimited number of salesmen.
There are no restrictions on the number of trucks.
This is a typical travelling salesman problem (TSP), which is known to be NP-complete. That means that if you are looking for the optimal solution you essentially have to test a huge number of combinations, and as you know, 350! is tremendous.
Nevertheless, as Henry suggests, you can look for a good solution which is not necessarily the best one. Many algorithms, called "heuristics", let you find a good solution very efficiently. Have a look here for some examples: https://en.wikipedia.org/wiki/Travelling_salesman_problem
The simplest heuristic may be a greedy solution: always take the closest unvisited society as the next stop.
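A rough Python sketch of that greedy idea, extended with the 350-order truck capacity (the data layout and the haversine helper are my assumptions, not taken from the question's script; it assumes every society's order count fits in one truck):

    from math import radians, sin, cos, asin, sqrt

    def haversine(p, q):
        # Great-circle distance in kilometres between two (lat, lon) points.
        lat1, lon1, lat2, lon2 = map(radians, (p[0], p[1], q[0], q[1]))
        a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371 * asin(sqrt(a))

    def plan_trips(warehouse, societies, capacity=350):
        # societies: list of ((lat, lon), order_count) with order_count <= capacity.
        unvisited = list(societies)
        trips = []
        while unvisited:
            load, pos, trip = 0, warehouse, []
            while True:
                # Closest unvisited society that still fits in the truck.
                candidates = [s for s in unvisited if load + s[1] <= capacity]
                if not candidates:
                    break
                nxt = min(candidates, key=lambda s: haversine(pos, s[0]))
                trip.append(nxt)
                unvisited.remove(nxt)
                load += nxt[1]
                pos = nxt[0]
            trips.append(trip)
        return trips

This will not be optimal, but it gives a feasible plan quickly and is a baseline you can improve on with the heuristics linked above.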
I am finding it hard to understand the process of linear discriminant analysis (LDA), and I was wondering if someone could explain it with a simple step-by-step process in plain English. I understand LDA is closely related to principal component analysis (PCA), but I have no idea how it produces all the probabilities with such great precision, or how the training data relates to the actual dataset. I have looked at a few documents and didn't get much out of them; they only made it more confusing and complicated.
PCA (Principal Component Analysis) is unsupervised, which is to say it does not use class-label information. Therefore, discriminative information is not necessarily preserved. PCA:
Minimizes the projection error.
Maximizes the variance of projected points.
Example: Reducing the number of features of a face (Face detection).
LDA (Linear Discriminant Analysis): A PCA that takes class-labels into consideration, hence, it's supervised.
Maximizes distance between classes.
Minimizes distance within classes.
Example: Separating faces into male and female clusters (Face recognition).
With regard to the step by step process, you can easily find an implementation in Google.
Regarding the classification:
Project input x into PCA subspace U, and calculate its projection a
Project a into LDA subspace V
Find the class with the closest center
In simple words, project the input x and then check which class center it is closest to.
Image from K. Etemad, R. Chellapa, "Discriminant analysis for recognition of human faces," J. Opt. Soc. Am. A, Vol. 14, No. 8, August 1997.
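If it helps, here is a rough sketch of those steps with scikit-learn on placeholder data (note that scikit-learn's LDA classifies with a probabilistic rule rather than a literal nearest-center lookup, but the idea is the same):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # Placeholder training data: 200 samples, 50 features, 2 classes.
    X = np.random.rand(200, 50)
    y = np.random.randint(0, 2, size=200)

    pca = PCA(n_components=20).fit(X)             # PCA subspace U
    a = pca.transform(X)                          # projections a of the training data
    lda = LinearDiscriminantAnalysis().fit(a, y)  # LDA subspace V with class centers

    x_new = np.random.rand(1, 50)                 # an unseen input x
    a_new = pca.transform(x_new)                  # 1. project x into the PCA subspace
    print(lda.predict(a_new))                     # 2-3. project into LDA space, pick the class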
I want to do fuzzy matching of millions of records from multiple files. I identified two algorithms for that: Jaro-Winkler and Levenshtein edit distance.
I was not able to understand what the difference is between the two. It seems Levenshtein gives the number of edits between two strings, while Jaro-Winkler provides a normalized score between 0.0 and 1.0.
My questions:
What are the fundamental differences between the two algorithms?
What is the performance difference between the two algorithms?
Levenshtein counts the number of edits (insertions, deletions, or substitutions) needed to convert one string to the other. Damerau-Levenshtein is a modified version that also considers transpositions as single edits. Although the output is the integer number of edits, this can be normalized to give a similarity value by the formula
1 - (edit distance / length of the larger of the two strings)
The Jaro algorithm is a measure of characters in common, being no more than half the length of the longer string in distance, with consideration for transpositions. Winkler modified this algorithm to support the idea that differences near the start of the string are more significant than differences near the end of the string. Jaro and Jaro-Winkler are suited for comparing smaller strings like words and names.
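To make the difference in output concrete, here is a small sketch using the jellyfish library (an assumption on my part; exact function names can differ slightly between versions):

    import jellyfish

    def normalized_levenshtein(s1, s2):
        # 1.0 for identical strings, 0.0 for completely different ones.
        if not s1 and not s2:
            return 1.0
        return 1.0 - jellyfish.levenshtein_distance(s1, s2) / max(len(s1), len(s2))

    print(normalized_levenshtein("martha", "marhta"))             # edit-based similarity
    print(jellyfish.jaro_winkler_similarity("martha", "marhta"))  # rewards the common prefix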
Deciding which to use is not just a matter of performance. It's important to pick a method that is suited to the nature of the strings you are comparing. In general though, both of the algorithms you mentioned can be expensive, because each string must be compared to every other string, and with millions of strings in your data set, that is a tremendous number of comparisons. That is much more expensive than something like computing a phonetic encoding for each string, and then simply grouping strings sharing identical encodings.
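A sketch of that cheaper blocking approach: group records by a phonetic key and only run the expensive pairwise comparison within each group (the choice of Soundex via jellyfish is again an assumption):

    from collections import defaultdict
    import jellyfish

    def block_by_phonetic_key(names):
        groups = defaultdict(list)
        for name in names:
            groups[jellyfish.soundex(name)].append(name)
        return groups

    names = ["Smith", "Smyth", "Johnson", "Jonson", "Baker"]
    for key, group in block_by_phonetic_key(names).items():
        print(key, group)  # only names within the same block need fuzzy matching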
There is a wealth of detailed information on these algorithms and other fuzzy string matching algorithms on the internet. This one will give you a start:
A Comparison of Personal Name Matching: Techniques and Practical Issues
According to that paper, the speeds of the four Jaro and Levenshtein algorithms I've mentioned are, from fastest to slowest:
Jaro
Jaro-Winkler
Levenshtein
Damerau-Levenshtein
with the slowest taking 2 to 3 times as long as the fastest. Of course these times are dependent on the lengths of the strings and the implementations, and there are ways to optimize these algorithms that may not have been used.
I am trying to understand basic chess algorithms. I have not read the literature in depth yet but after some cogitating here is my attempt:
1) Assign weight values to the pieces (e.g. a bishop is more valuable than a pawn)
2) Define heuristic function that attaches a value to a particular move
3) Build a minimax tree of all possible moves and prune it via alpha-beta pruning.
4) Traverse the tree to find the best move for each player
Is this the core "big picture" idea of chess algorithms? Can someone point me to resources that go more in depth regarding chess algorithms?
Following is an overview of chess engine development.
1. Create a board representation.
In an object-oriented language, this will be an object that will represent a chess board in memory. The options at this stage are:
Bitboards
0x88
8x8
Bitboards are the recommended representation for many reasons.
2. Create an evaluation function.
This simply takes a board and the side to evaluate as arguments and returns a score. The method signature will look something like:
int Evaluate(Board boardPosition, int sideToEvaluateFor);
This is where you use the weights assigned to each piece. This is also where you would use any heuristics if you so desire. A simple evaluation function would add the weights of sideToEvaluateFor's pieces and subtract the weights of the opposite side's pieces. Such an evaluation function is, of course, too naive for a real chess engine.
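A hedged sketch of that naive material-count evaluation (the board representation, a list of (piece, side) pairs, and the weights are illustrative assumptions, not taken from any particular engine):

    PIECE_WEIGHTS = {"P": 100, "N": 320, "B": 330, "R": 500, "Q": 900, "K": 20000}

    def evaluate(board, side_to_evaluate_for):
        # board: iterable of (piece_letter, side) pairs for every piece on the board.
        score = 0
        for piece, side in board:
            value = PIECE_WEIGHTS[piece]
            score += value if side == side_to_evaluate_for else -value
        return score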
3. Create a search function.
This will be, like you said, something on the lines of a MiniMax search with Alpha-Beta pruning. Some of the popular search algorithms are:
NegaMax
NegaScout
MTD(f)
The basic idea is to try all the different variations to a certain maximum depth and choose the move recommended by the variation that results in the highest score. The score of each variation is the value returned by the evaluation function for the board position at the maximum depth.
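For illustration, a minimal NegaMax search with alpha-beta pruning might look like the sketch below. The helpers generate_moves, make_move, undo_move and evaluate are hypothetical and would come from the board-representation and evaluation steps above; sides are assumed to be encoded as +1 and -1.

    INFINITY = 10 ** 9

    def negamax(board, depth, alpha, beta, side):
        if depth == 0:
            return evaluate(board, side)
        best = -INFINITY
        for move in generate_moves(board, side):
            make_move(board, move)
            score = -negamax(board, depth - 1, -beta, -alpha, -side)
            undo_move(board, move)
            best = max(best, score)
            alpha = max(alpha, score)
            if alpha >= beta:  # beta cut-off: the opponent will avoid this line anyway
                break
        return best

    def best_move(board, depth, side):
        def score_of(move):
            make_move(board, move)
            score = -negamax(board, depth - 1, -INFINITY, INFINITY, -side)
            undo_move(board, move)
            return score
        return max(generate_moves(board, side), key=score_of)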
For an example of a chess engine in C#, have a look at https://github.com/bytefire/shutranj, which I put together recently. A better open-source engine to look at is Stockfish (https://github.com/mcostalba/Stockfish), which is written in C++.
I have two text files which I'd like to compare. What I did is:
I've split both of them into sentences.
I've measured the Levenshtein distance between each of the sentences from one file and each of the sentences from the second file.
I'd like to calculate the average similarity between those two text files; however, I'm having trouble coming up with any meaningful value. Obviously the arithmetic mean (the sum of all the [normalized] distances divided by the number of comparisons) is a bad idea.
How to interpret such results?
edit:
Distance values are normalized.
The Levenshtein distance has a maximum value, namely the length of the longer of the two input strings; it cannot get worse than that. So a normalized similarity index (0 = bad, 1 = match) for two strings a and b can be calculated as 1 - distance(a, b) / max(a.length, b.length).
Take one sentence from File A. You said you'd compare this to each sentence of File B. I guess you are looking for a sentence out of B which has the smallest distance (i.e. the highest similarity index).
Simply calculate the average of all those best-match similarity indexes. This should give you a rough estimate of the similarity of the two texts.
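In Python that averaging scheme might look like the sketch below, assuming a similarity(a, b) helper that implements the normalized index from above (1 - distance(a, b) / max(len(a), len(b))):

    def text_similarity(sentences_a, sentences_b, similarity):
        # For each sentence of file A, keep only its best match in file B...
        best_matches = [max(similarity(sa, sb) for sb in sentences_b)
                        for sa in sentences_a]
        # ...then average those best-match similarities.
        return sum(best_matches) / len(best_matches)

Note that this is asymmetric (A matched against B); averaging the result with text_similarity(sentences_b, sentences_a, similarity) would make it symmetric.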
But what makes you think that two texts which are similar might have their sentences shuffled? My personal opinion is that you should also introduce stop word lists, synonyms and all that.
Nevertheless, please also check trigram matching, which might be another good approach for what you are looking for.