Is there an algorithm that tells the semantic similarity of two phrases

input: phrase 1, phrase 2
output: semantic similarity value (between 0 and 1), or the probability these two phrases are talking about the same thing

You might want to check out this paper:
Sentence similarity based on semantic nets and corpus statistics (PDF)
I've implemented the algorithm described. Our context was very general (effectively any two English sentences), and we found the approach too slow and the results, while promising, not good enough (or likely to become so without considerable extra effort).
You don't give a lot of context so I can't necessarily recommend this but reading the paper could be useful for you in understanding how to tackle the problem.
Regards,
Matt.

There's a short and a long answer to this.
The short answer:
Use the WordNet::Similarity Perl package. If Perl is not your language of choice, check the WordNet project page at Princeton, or google for a wrapper library.
The long answer:
Determining word similarity is a complicated issue, and research is still very hot in this area. To compute similarity, you need an appropriate representation of the meaning of a word. But what would be a representation of the meaning of, say, 'chair'? In fact, what is the exact meaning of 'chair'? If you think long and hard about this, it will twist your mind, you will go slightly mad, and finally take up a research career in Philosophy or Computational Linguistics to find the truth™. Both philosophers and linguists have tried to come up with an answer for literally thousands of years, and there's no end in sight.
So, if you're interested in exploring this problem a little more in-depth, I highly recommend reading Chapter 20.7 in Speech and Language Processing by Jurafsky and Martin, some of which is available through Google Books. It gives a very good overview of the state-of-the-art of distributional methods, which use word co-occurrence statistics to define a measure for word similarity. You are not likely to find libraries implementing these, however.

For anyone just coming at this, I would suggest taking a look at SEMILAR - http://www.semanticsimilarity.org/ . They implement a lot of the modern research methods for calculating word and sentence similarity. It is written in Java.
SEMILAR API comes with various similarity methods based on Wordnet, Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), BLEU, Meteor, Pointwise Mutual Information (PMI), Dependency based methods, optimized methods based on Quadratic Assignment, etc. And the similarity methods work in different granularities - word to word, sentence to sentence, or bigger texts.

You might want to check into the WordNet project at Princeton University. One possible approach would be to first run each phrase through a stop-word list (to remove "common" words such as "a", "to", "the", etc.). Then, for each of the remaining words in each phrase, you could compute the semantic "similarity" between it and each of the words in the other phrase, using a distance measure based on WordNet. The distance measure could be something like: the number of arcs you have to pass through in WordNet to get from word1 to word2.
Sorry this is pretty high-level. I've obviously never tried this. Just a quick thought.
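A minimal sketch of the arc-counting idea, assuming a tiny hand-made hypernym graph in place of the real WordNet. The graph contents, the stop-word list, and the function names (`arc_distance`, `phrase_similarity`) are all hypothetical; a real implementation would query WordNet itself. Each word is matched with its closest word in the other phrase, and the distances are turned into a score between 0 and 1.

```python
from collections import deque

# Hypothetical toy graph standing in for WordNet: word -> its "is-a" parent.
TOY_HYPERNYMS = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
    "chair": "furniture",
    "furniture": "artifact",
}

STOP_WORDS = {"a", "to", "the"}


def build_graph(parent_of):
    """Turn child -> parent links into an undirected adjacency map."""
    graph = {}
    for child, parent in parent_of.items():
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def arc_distance(graph, word1, word2):
    """Number of arcs on the shortest path, or None if no path exists."""
    if word1 not in graph or word2 not in graph:
        return None
    seen = {word1}
    queue = deque([(word1, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == word2:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def word_similarity(graph, word1, word2):
    """Map an arc count to (0, 1]; unconnected or unknown words score 0."""
    if word1 == word2:
        return 1.0
    dist = arc_distance(graph, word1, word2)
    return 0.0 if dist is None else 1.0 / (1.0 + dist)


def phrase_similarity(graph, phrase1, phrase2):
    """Average, over the words of phrase1, of the best match in phrase2."""
    words1 = [w for w in phrase1.lower().split() if w not in STOP_WORDS]
    words2 = [w for w in phrase2.lower().split() if w not in STOP_WORDS]
    if not words1 or not words2:
        return 0.0
    best = [max(word_similarity(graph, a, b) for b in words2) for a in words1]
    return sum(best) / len(best)


graph = build_graph(TOY_HYPERNYMS)
print(arc_distance(graph, "dog", "wolf"))
print(phrase_similarity(graph, "the dog", "a wolf"))
```

Note that this score is not symmetric (it averages over the first phrase only); averaging both directions is one way to fix that.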

I would look into latent semantic indexing for this. I believe you can create something similar to a vector space search index but with semantically related terms being closer together i.e. having a smaller angle between them. If I learn more I will post here.

Sorry to dig up a 6 year old question, but as I just came across this post today, I'll throw in an answer in case anyone else is looking for something similar.
cortical.io has developed a process for calculating the semantic similarity of two expressions and they have a demo of it up on their website. They offer a free API providing access to the functionality, so you can use it in your own application without having to implement the algorithm yourself.

One simple solution is to use the dot product of character n-gram vectors. This is robust to ordering changes (which many edit distance metrics are not) and captures many issues around stemming. It also sidesteps the AI-complete problem of full semantic understanding.
To compute the n-gram vector, just pick a value of n (say, 3), and hash every 3-character sequence in the phrase into a vector. Normalize the vector to unit length, then take the dot product of different vectors to detect similarity.
This approach has been described in
J. Mitchell and M. Lapata, “Composition in Distributional Models of Semantics,” Cognitive Science, vol. 34, no. 8, pp. 1388–1429, Nov. 2010, DOI 10.1111/j.1551-6709.2010.01106.x
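A minimal sketch of the character n-gram approach described above. The vector size (1024), the use of `zlib.crc32` as the hash, and the function names are assumptions made for illustration; any stable hash and any reasonable dimension would do.

```python
import math
import zlib


def ngram_vector(phrase, n=3, dim=1024):
    """Hash every character n-gram into a fixed-size, unit-length vector."""
    text = " ".join(phrase.lower().split())  # normalise case and whitespace
    vec = [0.0] * dim
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return vec  # phrase shorter than n characters
    return [x / norm for x in vec]


def similarity(phrase1, phrase2, n=3, dim=1024):
    """Dot product of the two unit vectors, i.e. their cosine similarity."""
    v1 = ngram_vector(phrase1, n, dim)
    v2 = ngram_vector(phrase2, n, dim)
    return sum(a * b for a, b in zip(v1, v2))


print(similarity("dog bites man", "man bites dog"))
print(similarity("semantic similarity", "purple elephant"))
```

Because the vectors are non-negative and unit length, the result always falls between 0 and 1, which matches the output range asked for in the question. It measures surface overlap, though, not meaning: synonyms with no shared characters score near 0.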

I would have a look at statistical techniques that take into consideration the probability of each word appearing within a sentence. This will allow you to give less importance to popular words such as 'and', 'or', 'the' and give more importance to words that appear less regularly, and that are therefore a better discriminating factor. For example, if you have two sentences:
1) The smith-waterman algorithm gives you a similarity measure between two strings.
2) We have reviewed the smith-waterman algorithm and we found it to be good enough for our project.
The fact that the two sentences share the words "smith-waterman" and "algorithm" (which are not as common as 'and', 'or', etc.) will allow you to say that the two sentences might indeed be talking about the same topic.
Summarizing, I would suggest you have a look at:
1) String similarity measures;
2) Statistical methods;
Hope this helps.

Try SimService, which provides a service for computing top-n similar words and phrase similarity.

This requires that your algorithm actually knows what you're talking about. It can be done in some rudimentary form by just comparing words and looking for synonyms etc., but any sort of accurate result would require some form of intelligence.

Take a look at http://mkusner.github.io/publications/WMD.pdf . This paper describes an algorithm called Word Mover's Distance that tries to uncover semantic similarity. It relies on the similarity scores as dictated by word2vec. Integrating this with GoogleNews-vectors-negative300 yields desirable results.

Related

Methods to identify duplicate questions on Twitter?

As stated in the title, I'm simply looking for algorithms or solutions one might use to take in the twitter firehose (or a portion of it) and
a) identify questions in general
b) for a question, identify questions that could be the same, with some degree of confidence
Thanks!
(A)
I would try to identify questions using machine learning and the Bag of Words model.
Create a labeled set of tweets, and label each of them with a binary flag: question or not question.
Extract the features from the training set. The features are traditionally words, but at least whenever I tried it, using bi-grams significantly improved the results. (3-grams were not helpful for my cases).
Build a classifier from the data. I usually found that SVM gives better performance than other classifiers, but you can use others as well, such as Naive Bayes or KNN (but you will probably need a feature selection algorithm for these).
Now you can use your classifier to classify a tweet. (1)
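The steps above can be sketched as follows. This is an illustration only: it swaps in a small hand-written Naive Bayes (one of the alternatives mentioned) instead of an SVM so that it needs no external library, and the training tweets, the `features` helper and the `NaiveBayes` class are all made up for the example.

```python
import math
from collections import Counter


def features(text):
    """Bag of words plus bi-grams; '?' is kept as its own token."""
    tokens = text.lower().replace("?", " ?").split()
    bigrams = [a + " " + b for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def __init__(self):
        self.feature_counts = {}     # label -> Counter of feature counts
        self.doc_counts = Counter()  # label -> number of training tweets
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.doc_counts[label] += 1
            counts = self.feature_counts.setdefault(label, Counter())
            for feat in features(text):
                counts[feat] += 1
                self.vocab.add(feat)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.feature_counts.items():
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for feat in features(text):
                if feat in self.vocab:  # ignore features never seen in training
                    score += math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labeled set: True = question, False = not a question.
TRAIN = [
    ("how do i reset my password?", True),
    ("what is the best pizza place?", True),
    ("does anyone know how to fix this?", True),
    ("where can i buy tickets?", True),
    ("i love this new song", False),
    ("just landed in paris", False),
    ("great game last night", False),
    ("this pizza is the best", False),
]

model = NaiveBayes().fit([t for t, _ in TRAIN], [y for _, y in TRAIN])
print(model.predict("how do i cook rice?"))
print(model.predict("i love this game"))
```

With a real labeled set you would hold out part of the data (or cross-validate) before trusting the predictions.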
(B)
This issue is referred to in the world of Information Retrieval as "duplicate detection" or "near-duplicate detection".
You can at least find questions which are very similar to each other using Semantic Interpretation, as described by Markovitch and Gabrilovich in their wonderful article Wikipedia-based Semantic Interpretation for Natural Language Processing. At the very least, it will help you identify if two questions are discussing the same issues (even though not identical).
The idea goes like this:
Use Wikipedia to build, for each term t, a vector that represents its semantics: the entry vector_t[i] is the tf-idf score of the term i as it co-appeared with the term t. The idea is described in detail in the article. Reading the first 3-4 pages is enough to understand it; there is no need to read it all. (2)
For each tweet, construct a vector which is a function of the vectors of its terms. Compare between two vectors - and you can identify if two questions are discussing the same issues.
EDIT:
On second thought, the BoW model is not a good fit here, since it ignores the position of terms. However, I believe that if you add NLP processing for extracting features (for example, for each term, also denote whether it is pre-subject or post-subject, as determined by NLP processing), combining it with machine learning will yield pretty good results.
(1) For evaluation of your classifier, you can use cross-validation, and check the expected accuracy.
(2) I know Evgeny Gabrilovich published the algorithm they implemented as an open source project; you just need to look for it.

Algorithm to compare similarity of ideas (as strings)

Consider an arbitrary text box that records the answer to the question, what do you want to do before you die?
Using a collection of response strings (max length 240), I'd like to somehow sort and group them and count them by idea (which may be just string similarity as described in this question).
Is there another or better way to do something like this?
Is this any different than string similarity?
Is this the right question to be asking?
The idea here is to have people write in a text box over and over again, and for me to provide a number that describes, generally speaking, that 802 people wrote approximately the same thing.
It is much more difficult than string similarity. This is what you need to do at a minimum:
Perform some text formatting/cleaning tasks, like removing punctuation characters and common "stop words".
Construct a corpus (a collection of words with their usage statistics) from the terms that occur in the answers.
Calculate a weight for every term.
Construct a document vector from every answer (each term corresponds to a dimension in a very high-dimensional Euclidean space).
Run a clustering algorithm on the document vectors.
Read a good statistical natural language processing book, or search Google for good introductions/tutorials (likely terms: statistical nlp, text categorization, clustering). You can probably find some libraries (Weka or NLTK come to mind) depending on the language of your choice, but you need to understand the concepts to use the library anyway.
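The steps above can be sketched as follows. Everything here is an assumption chosen to keep the example short: the stop-word list, the smoothed IDF weighting, the similarity threshold of 0.3, and especially the clustering step, which is a naive single-pass grouping against each cluster's first member rather than a proper algorithm such as k-means.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "an", "the", "to", "i", "in", "of", "and", "my", "want"}


def tokenize(answer):
    """Step 1: lowercase, drop punctuation and stop words."""
    words = re.findall(r"[a-z']+", answer.lower())
    return [w for w in words if w not in STOP_WORDS]


def idf_weights(token_lists):
    """Steps 2-3: document frequencies, then a smoothed IDF weight per term."""
    n_docs = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    return {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}


def vectorize(tokens, idf):
    """Step 4: sparse, unit-length document vector (term -> weight)."""
    tf = Counter(tokens)
    vec = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def cosine(u, v):
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())


def cluster(answers, threshold=0.3):
    """Step 5: put each answer in the first cluster whose seed is close enough."""
    token_lists = [tokenize(a) for a in answers]
    idf = idf_weights(token_lists)
    vectors = [vectorize(tokens, idf) for tokens in token_lists]
    clusters = []  # each cluster is a list of indices; the first is the seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return [[answers[i] for i in members] for members in clusters]


answers = [
    "Travel the world",
    "I want to travel around the world!",
    "Learn to play the guitar",
    "learn guitar",
    "Skydiving",
]
for group in cluster(answers):
    print(len(group), group)
```

The group sizes are the "N people wrote approximately the same thing" counts. Answers that share no words ("see the planet" vs. "travel the world") will not be grouped by this sketch; that is where LSA-style methods come in.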
Latent Semantic Analysis (LSA) might interest you. Here is a nice introduction.
Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.
[...]
What you want is very much an open problem in NLP. @Ali's answer describes the idea at a high level, but the part "Construct a document vector for every answer" is the really hard one. There are a few obvious ways of building a document vector from the vectors of the words it contains. Addition, multiplication and averaging are fast, but they effectively ignore the syntax. "Man bites dog" and "Dog bites man" will have the same representation, but clearly not the same meaning. Google "compositional distributional semantics"; as far as I know, there are people at the Universities of Texas, Trento, Oxford, Sussex and at Google working in the area.
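To make the word-order problem concrete, here is a tiny sketch of composition by addition. The three-dimensional word vectors are invented for the example (real ones would come from a distributional model), and the values are chosen to be exactly representable as floats so the comparison is exact.

```python
# Hypothetical word vectors; real ones would have hundreds of dimensions.
WORD_VECTORS = {
    "man": (1.0, 0.0, 0.5),
    "bites": (0.0, 1.0, 0.25),
    "dog": (0.5, 0.25, 1.0),
}


def compose_by_addition(sentence):
    """Document vector = element-wise sum of the word vectors."""
    vectors = [WORD_VECTORS[word] for word in sentence.lower().split()]
    return tuple(sum(column) for column in zip(*vectors))


a = compose_by_addition("Man bites dog")
b = compose_by_addition("Dog bites man")
print(a, b, a == b)
```

Both sentences map to the same vector, because addition is commutative. Averaging and element-wise multiplication have the same blind spot.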

Using Sentiment Analysis to Detect Contradictory Arguments?

I don't have much background in sentiment analysis or natural language processing at all, but I have been reading a bit about it in my spare time. I would like to conduct an experiment to analyze forum threads/comments on sites such as reddit, digg, blogs, etc. I'm particularly interested in doing something like counting the number of for, against, and neutral comments for threads of heated religious and political debates. Here's what I am thinking.
1) Find a thread in which the original poster has defined a touchy political or religious topic.
2) For each comment categorize it as supporting the original poster or otherwise taking a contradicting or neutral stance.
3) Compare various mediums with the numbers of for or against arguments to determine what platforms are good "debate platforms" (i.e. balanced argument counts).
One big problem that I'm anticipating is that heated topics will invoke strong reactions from both supporting and contradicting parties so a simple happy/sad sentiment analysis won't cut it. I'm just sort of interested in this project for my own curiosities, so if anyone knows of similar research or utilities to conduct this experiment I'd be interested to hear more.
Can someone recommend a good sentiment analysis, word dictionary, training set, etc. for this task?
IMHO this is not possible without running into semantics. Consider the sentence:
Unlike many others, I am not against the abolishment of capital punishment.
Your AI may need to recognise idiomatic subphrases like "not against", or other "not ..." snippets. This is not impossible ;-)
An additional problem is that "not" is more or less a stopword; its rank will probably be in the top 100, causing a low entropy (though it has a high "semantic" value in every sentence where it is used). Also note that omitting "the abolishment of" will cause the "polarity" of the sentence to flip as well.
You can try to use the bag of words model [or even better: use n-grams as the tokens in the bag].
The approach is basically:
Classify a set of examples, and let your algorithm extract the relevant words from the classified examples.
When a new comment is given, extract the relevant words, and use k-nearest neighbors to decide whether the new comment is pro/against/neutral.
Also, you might want to have a look at Apache Mahout.

Determine the difficulty of an english word

I am working on a word-based game. My word database contains around 10,000 English words (sorted alphabetically). I am planning to have 5 difficulty levels in the game. Level 1 shows the easiest words and Level 5 shows the most difficult words, relatively speaking.
I need to divide the 10,000-word list into 5 levels, from the easiest words to the most difficult ones. I am looking for a program to do this for me.
Can someone tell me if there is an algorithm or a method to quantitatively measure the difficulty of an english word?
I have some thoughts revolving around using the "word length" and "word frequency" as factors, and come up with a formula or something that accomplishes this.
Get a large corpus of texts (e.g. from the Gutenberg archives), do a straight frequency analysis, and eyeball the results. If they don't look satisfying, weight each text with its Flesch-Kincaid score and run the analysis again - words that show up frequently, but in "difficult" texts will get a score boost, which is what you want.
If all you have is 10000 words, though, it will probably be quicker to just do the frequency sorting as a first pass and then tweak the results by hand.
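A minimal sketch of the frequency-sorting first pass, assuming you already have the corpus as one big string. The function names and the equal-sized bucketing into 5 levels are choices made for this example; the Flesch-Kincaid weighting is left out.

```python
import re
from collections import Counter


def word_frequencies(corpus_text):
    """Straight frequency count over a corpus (e.g. concatenated Gutenberg texts)."""
    return Counter(re.findall(r"[a-z]+", corpus_text.lower()))


def assign_levels(words, frequencies, levels=5):
    """Most frequent words get level 1, rarest get the highest level."""
    # Sort by descending frequency; ties are broken alphabetically.
    ranked = sorted(words, key=lambda w: (-frequencies.get(w, 0), w))
    bucket_size = -(-len(ranked) // levels)  # ceiling division
    return {word: i // bucket_size + 1 for i, word in enumerate(ranked)}


freqs = word_frequencies("the cat sat on the mat the cat ran")
print(assign_levels(["the", "cat", "mat", "ran", "zebra"], freqs))
```

Words that never occur in the corpus land in the hardest bucket, which is usually what you want, but it is also where hand-tweaking will be most needed.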
I'm not understanding how frequency is being used... if you were to scan a newspaper, I'm sure you would see the word "thoroughly" mentioned much more frequently than the word "bop" or "moo" but that doesn't mean it's an easier word; on the contrary 'thoroughly' is one of the most disgustingly absurd spelling anomalies that gives grade school children nightmares...
Try explaining to a sane human being learning english as a second language the subtle difference between slaughter and laughter.
I agree that frequency of use is the most likely metric; there are studies supporting a high correlation between word frequency and difficulty (correct responses on tests, etc.). Check out the English Lexicon Project at http://elexicon.wustl.edu/ for some 70k(?) frequency-rated words.
Crowd-source the answer.
Create an online 'game' that lists 10 words at random.
Get the player to drag and drop them into easiest - hardest, and tick to indicate if the player has ever heard of the word.
Apply a ranking algorithm (e.g. Elo) to the result of each experiment.
Repeat.
It might even be fun to play, you could get a language proficiency score at the end.
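One way the ranking step could look, as a sketch: each easiest-to-hardest ordering submitted by a player is treated as a set of pairwise "matches" in which the harder word beats the easier one, and standard Elo updates are applied. The starting rating of 1500, the K-factor of 16 and the function names are arbitrary choices for the example.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A 'beats' B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_from_ranking(ratings, easiest_to_hardest, k=16.0, default=1500.0):
    """Update difficulty ratings in place from one player's ordering."""
    for word in easiest_to_hardest:
        ratings.setdefault(word, default)
    for i, easier in enumerate(easiest_to_hardest):
        for harder in easiest_to_hardest[i + 1:]:
            # The harder word "won" this pairwise comparison.
            delta = k * (1.0 - expected_score(ratings[harder], ratings[easier]))
            ratings[harder] += delta
            ratings[easier] -= delta
    return ratings


ratings = {}
update_from_ranking(ratings, ["cat", "thorough", "sesquipedalian"])
update_from_ranking(ratings, ["cat", "sesquipedalian", "thorough"])
for word, rating in sorted(ratings.items(), key=lambda item: item[1]):
    print(word, round(rating))
```

A higher rating means "judged harder more often"; after enough rounds the ratings can be cut into the five levels.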
Difficulty is a pretty amorphous concept. If you have no clear idea of what you want, perhaps you could take a look at the Porter Stemming Algorithm (see for example the original paper). That contains a more advanced idea of 'length' by defining words as being of the form [C](VC){m}[V]; C means a block of consonants and V a block of vowels, and this definition says a word is an optional C followed by m VC blocks and finally an optional V. The m value is this advanced 'length'.
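A short sketch of how the m value could be computed. It follows the consonant/vowel definition from Porter's paper as I understand it (a, e, i, o, u are vowels, and y is a vowel only when it follows a consonant); the function names are my own.

```python
def is_consonant(word, i):
    """Porter's rule: a, e, i, o, u are vowels; y is a vowel after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def porter_measure(word):
    """Return m in the form [C](VC){m}[V]."""
    word = word.lower()
    pattern = []
    for i in range(len(word)):
        kind = "c" if is_consonant(word, i) else "v"
        # Collapse runs of consonants or vowels into a single block.
        if not pattern or pattern[-1] != kind:
            pattern.append(kind)
    # Every "vc" in the collapsed pattern is one VC block.
    return "".join(pattern).count("vc")


for w in ["tree", "trouble", "troubles", "private"]:
    print(w, porter_measure(w))
```

For the four sample words this gives 0, 1, 2 and 2, which agrees with the examples listed in Porter's paper.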
Depending on the type of game, the definition of "difficult" will change. If your game involves typing quickly (ztype-style...), "difficult" will have a different meaning than in a game where you need to define a word's meaning.
That said, Scrabble has a way to measure how "difficult" a word is, which is also quite easy algorithmically.
Also you may look into defining "difficult" in terms of your game. You could beta test your game and classify words according to how "difficult" players find them in the context of your own game.
There are several factors that relate to word difficulty, including age at acquisition, imageability, concreteness, abstractness, syllables, and frequency (spoken and written). There are also psycholinguistic databases that will search for words by at least some of these factors (just do a search for "psycholinguistic database").
Word frequency is an obvious choice (of course not perfect). You can download Google n-grams V2 here, which is licensed under the Creative Commons Attribution 3.0 Unported License.
Format: ngram TAB year TAB match_count TAB page_count TAB volume_count NEWLINE
Example:
Corpus used (from Lin, Yuri, et al. "Syntactic annotations for the google books ngram corpus." Proceedings of the ACL 2012 system demonstrations. Association for Computational Linguistics, 2012.):
Word length is a good indicator. For word frequency, you would need data, as an algorithm obviously cannot determine it by itself.
You could also use some sort of scoring like Scrabble does: each letter has a value, and the final value would be the sum of the values.
It would be, IMO, easier to find frequency data about each letter in your language.
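The Scrabble-style scoring mentioned above is only a few lines. The letter values below are those of the standard English-language tile set; for another language, or for values derived from your own letter-frequency data, you would swap the table. The function name is made up.

```python
# Standard English Scrabble tile values.
LETTER_VALUES = {
    **dict.fromkeys("aeilnorstu", 1),
    **dict.fromkeys("dg", 2),
    **dict.fromkeys("bcmp", 3),
    **dict.fromkeys("fhvwy", 4),
    "k": 5,
    **dict.fromkeys("jx", 8),
    **dict.fromkeys("qz", 10),
}


def scrabble_score(word):
    """Sum of the letter values; characters outside a-z count as 0."""
    return sum(LETTER_VALUES.get(ch, 0) for ch in word.lower())


print(scrabble_score("cat"), scrabble_score("quiz"))
```

Since the score grows with word length and with rare letters, it folds both of those factors into one number, but it knows nothing about how common the word itself is.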
In his article on spell correction Peter Norvig uses a dictionary to count the number of occurrences of each word (and thus determine their frequency).
You could use this as a stepping stone :)
Also, frequency should probably influence the difficulty more than length... you would have to beta-test the game for that.
In addition to metrics such as Flesch-Kincaid, you could try an approach based on the Dale-Chall readability formula, using lists of words that are familiar to readers of a particular level of ability.
Implementations of many of the readability formulae contain code for estimating the number of syllables in a word, which may also be useful.
I would guess that the grade at which the word is introduced into a normal student's vocabulary is a measure of difficulty. Next would be how many standard rule violations it has, meaning words whose spellings or pronunciations seem to violate the normal set of rules. Finally, the meaning can be a tough concept: for example, try explaining "abstract" to someone who's never heard the word.
Without claiming to know anything about their algorithm, there is an API that returns a 1-10 scale word difficulty: TwinWord API
I have never used it, myself, though.

Is there an algorithm that extracts meaningful tags of english text

I would like to extract a reduced collection of "meaningful" tags (10 max) out of an english text of any size.
http://tagcrowd.com/ is quite interesting but the algorithm seems very basic (just word counting)
Is there any other existing algorithm to do this?
There are existing web services for this. Three examples:
Yahoo's Term Extraction API
Topicalizer
OpenCalais
When you subtract the human element (tagging), all that is left is frequency. "Ignore common English words" is the next best filter, since it deals with exclusion instead of inclusion. I tested a few sites, and it is very accurate. There really is no other way to derive "meaning", which is why the Semantic Web gets so much attention these days. It is a way to imply meaning with HTML... of course, that has a human element to it as well.
Basically, this is a text categorization problem/document classification problem. If you have access to a number of already tagged documents, you could analyze which (content) words trigger which tags, and then use this information for tagging new documents.
If you don't want to use a machine-learning approach and you still have a document collection, then you can use metrics like tf.idf to filter out interesting words.
Going one step further, you can use WordNet to find synonyms and replace a word with its synonym if the frequency of the synonym is higher.
Manning & Schütze contains a lot more introduction on text categorization.
In text classification, this problem is known as dimensionality reduction. There are many useful algorithms in the literature on this subject.
You want to do the semantic analysis of a text.
Word frequency analysis is one of the easiest ways to do semantic analysis. Unfortunately (and obviously) it is the least accurate one. It can be improved by using special dictionaries (such as for synonyms or forms of a word), "stop-lists" with common words, other texts (to find those "common" words and exclude them)...
As for other algorithms they could be based on:
Syntax analysis (like trying to find the main subject and/or verb in a sentence)
Format analysis (analyzing headers, bold text, italic... where applicable)
Reference analysis (if the text is on the Internet, for example, then a reference can describe it in several words... used by some search engines)
BUT... you should understand that these algorithms are merely heuristics for semantic analysis, not strict algorithms for achieving the goal.
The problem of semantic analysis is one of the main problems in Artificial Intelligence/Machine Learning studies since the first computers appeared.
Perhaps "Term Frequency - Inverse Document Frequency" TF-IDF would be useful...
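A minimal sketch of TF-IDF used for tagging, assuming you have a reference collection of other documents to compute document frequencies from. The smoothing in the IDF term, the alphabetical tie-breaking and the function names are choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def top_tags(document, collection, max_tags=10):
    """Rank the document's words by TF-IDF against a reference collection."""
    doc_tokens = tokenize(document)
    tf = Counter(doc_tokens)
    n_docs = len(collection)
    df = Counter()
    for other in collection:
        df.update(set(tokenize(other)))  # count each word once per document
    scores = {
        term: (count / len(doc_tokens)) * math.log((1 + n_docs) / (1 + df[term]))
        for term, count in tf.items()
    }
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:max_tags]


collection = [
    "the cat sat on the mat",
    "the dog chased the ball",
    "the bird sang in the tree",
]
print(top_tags("the cat chased the cat", collection, max_tags=2))
```

Words that appear in every reference document (like "the" here) get a score of zero, so they drop to the bottom without needing an explicit stop list.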
You can use this in two steps:
1 - Try topic modeling algorithms:
Latent Dirichlet Allocation
Latent word Embeddings
2 - After that you can select the most representative word of every topic as a tag

Resources