Is there an algorithm that extracts meaningful tags from English text?

I would like to extract a reduced collection of "meaningful" tags (10 max) out of an English text of any size.
http://tagcrowd.com/ is quite interesting, but its algorithm seems very basic (just word counting).
Is there any other existing algorithm to do this?

There are existing web services for this. Three examples:
Yahoo's Term Extraction API
Topicalizer
OpenCalais

When you subtract the human element (tagging), all that is left is frequency. "Ignore common English words" is the next best filter, since it deals with exclusion instead of inclusion. I tested a few sites that take this approach, and it is very accurate. There really is no other way to derive "meaning", which is why the Semantic Web gets so much attention these days. It is a way to imply meaning with HTML... of course, that has a human element to it as well.

Basically, this is a text categorization (document classification) problem. If you have access to a number of already tagged documents, you could analyze which (content) words trigger which tags, and then use this information for tagging new documents.
If you don't want to use a machine-learning approach and you still have a document collection, then you can use metrics like tf-idf to filter out interesting words.
Going one step further, you can use WordNet to find synonyms and replace words with their synonym if the frequency of the synonym is higher.
Manning & Schütze offers a much more thorough introduction to text categorization.
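As a rough illustration of the tf-idf idea (not a full solution), here is a minimal sketch using scikit-learn; the toy documents are placeholders for your own collection:

from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder collection; the last entry stands in for the text you want to tag.
documents = [
    "the cat sat on the mat while the dog slept",
    "stock markets fell sharply after the announcement",
    "the committee announced new rules for stock trading and market oversight",
]

vectorizer = TfidfVectorizer(stop_words='english')
tfidf = vectorizer.fit_transform(documents)

# Take the 10 highest-scoring terms of the last document as its candidate tags.
scores = tfidf.toarray()[-1]
vocab = vectorizer.get_feature_names_out()
tags = [vocab[i] for i in scores.argsort()[::-1][:10] if scores[i] > 0]
print(tags)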

In text classification, this problem is known as dimensionality reduction. There are many useful algorithms in the literature on this subject.

You want to do the semantic analysis of a text.
Word frequency analysis is one of the easiest ways to do semantic analysis. Unfortunately (and obviously) it is also the least accurate one. It can be improved by using special dictionaries (e.g. for synonyms or inflected forms of a word), "stop-lists" of common words, other texts (to find those "common" words and exclude them), and so on.
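A minimal sketch of that frequency-plus-stop-list baseline (the stop-list here is just a tiny sample):

import re
from collections import Counter

STOP_WORDS = {'the', 'a', 'an', 'and', 'or', 'of', 'to', 'in', 'is', 'it', 'that', 'this'}

def top_tags(text, k=10):
    # Lowercase, tokenize, drop stop words, and keep the k most frequent terms.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(k)]

print(top_tags("The quick brown fox jumps over the lazy dog, and the dog sleeps."))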
As for other algorithms they could be based on:
Syntax analysis (like trying to find the main subject and/or verb in a sentence)
Format analysis (analyzing headers, bold text, italic... where applicable)
Reference analysis (if the text is in Internet, for example, then a reference can describe it in several words... used by some search engines)
BUT... you should understand that these algorithms are merely heuristics for semantic analysis, not exact algorithms that are guaranteed to achieve the goal.
The problem of semantic analysis has been one of the main problems in Artificial Intelligence/Machine Learning research since the first computers appeared.

Perhaps "Term Frequency - Inverse Document Frequency" TF-IDF would be useful...

You can do this in two steps:
1 - Try topic modeling algorithms:
Latent Dirichlet Allocation
Latent word embeddings
2 - After that, select the most representative word of every topic as a tag
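A minimal LDA sketch with scikit-learn, assuming a small placeholder corpus; with real data you would tune the number of topics and take the top word (or words) of each topic as tags:

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "stocks and bonds are traded on the market",
    "the market rallied as stocks rose",
    "the team won the match in extra time",
    "fans cheered as the team scored a late goal",
]

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Step 2: the most probable word of each topic becomes a candidate tag.
vocab = vectorizer.get_feature_names_out()
tags = [vocab[topic.argmax()] for topic in lda.components_]
print(tags)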

Related

Finding similar images via tags

tl;dr
Fundamentally I'm looking for reasonable ways to implement a similarity rank among tag groups, where a tag group is 2 to 9 tags. Similar to ranking the similarity of 2 to 9 word sentences where the vocabulary is 200,000 words, except word order doesn't matter.
I have a collection of tagged images and I want to implement a couple of search functions:
Similar images
Similar but different images
where the similarity is based solely on the tags.
Finding identically tagged images isn't that hard, but after that I'm at a bit of a loss as to the best way to proceed. We have hundreds of thousands of tags and no metadata on them, so we don't know that "Outlook" is related to "Microsoft" or "Windows" or "Email", and therefore cannot appreciate the difference in relatedness of an image tagged "Microsoft,Excel,Bar Graph" to an image tagged "Excel,Spreadsheet" versus one tagged "Visio,Bar Graph".
For "Similar images" we'd want to match "Microsoft,Excel,Bar Graph" to "Visio,Bar Graph" while for "Similar but different images" we'd want to match "Microsoft,Excel,Bar Graph" to "Excel,Spreadsheet".
My best guess at the moment is to treat the tags like text and throw them into Solr. On the other hand, maybe a different kind of database, like Neo4j, would be the way to go.
Any suggestions on how to take a few steps forward? I'm not expecting a full solution, but suggestions for a general approach would be appreciated.
Extra Credit:
To make things more difficult, when tags are assigned to images, they are designated as "primary" or "secondary" and of course we want to take that into account.
Update
Let's repeat the problem.
The input data consists of sets of tags = set of strings (and pointers to the associated image resources)
The strings are just character sequences, there is no additional semantic information available
However there is a weighting of strings into 'primary' (higher weight) and 'secondary' (lower weight)
This means that the search has to rely solely on some similarity measure of sets (and strings).
Examples for such measures are:
Jaccard Similarity
Dice Similarity
Tverski Similarity
Cosine Similarity
This paper from 2010: A weighted tag similarity measure based on a collaborative weight model applies several of those (and others) to the tag problem and shows how to include weighting. That should be helpful, IMHO.
Another (simpler) application can be seen in this paper from 2013: Using of Jaccard Coefficient for Keywords Similarity.
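As a concrete starting point, here is a minimal sketch of a weighted Jaccard similarity over tag sets; mapping "primary" to weight 2.0 and "secondary" to weight 1.0 is my assumption, not something fixed by the question or the papers:

def weighted_jaccard(tags_a, tags_b):
    # Each argument maps tag -> weight, e.g. 2.0 for "primary" and 1.0 for "secondary".
    keys = set(tags_a) | set(tags_b)
    intersection = sum(min(tags_a.get(k, 0.0), tags_b.get(k, 0.0)) for k in keys)
    union = sum(max(tags_a.get(k, 0.0), tags_b.get(k, 0.0)) for k in keys)
    return intersection / union if union else 0.0

a = {'Microsoft': 2.0, 'Excel': 2.0, 'Bar Graph': 1.0}
b = {'Excel': 2.0, 'Spreadsheet': 1.0}
print(weighted_jaccard(a, b))  # 2.0 / 6.0 = 0.33...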
About the examples from the question
For "Similar images" we'd want to match "Microsoft, Excel, Bar Graph" to "Visio, Bar Graph"
They would have some similarity, because one tag ("Bar Graph") is common to both sets of tags.
while for "Similar but different images" we'd want to match "Microsoft, Excel, Bar Graph" to "Excel, Spreadsheet".
Again, one tag in common ("Excel"). But how should the system know that "Visio" is more similar to the set "Microsoft, Excel, Bar Graph" than "Spreadsheet" is?
That would require semantic information. I don't see how to solve this otherwise.
Old Part
I did not find much that would help with your chosen approach (you restricted it quite a bit), except for the discussion of various metrics in the 2009 paper below.
But I would like to record the steps of my brief search on this topic here, because it puts your problem into context.
Where others go
The research community seems to go in these directions:
using the additional information about the users who provided the tags (folksonomies, social tagging)
making use of semantic meta data (ontologies, semantic similarity)
making use of visual image content (content-based visual information retrieval)
Folksonomies, Social Tagging
See this paper from 2009: Evaluating Similarity Measures for Emergent Semantics of Social Tagging.
Instead of the traditional approach of defining similarity by comparing the graphical data of the images
I = { (x, y, colour) }
by some measure (content-based image retrieval, query by image content, content-based visual information retrieval), these authors harvest semantics from the tags, as you intend.
Their basic model consists of user-assigned tags for a resource; they compare tuples of a so-called folksonomy
F = { (user, resource, tag) }
which can be reduced to your case of (resource, tag) tuples by different ways of aggregating over the users, resulting in different similarity measures.
Semantic Similarity
The use of semantic similarity measures, e.g. Jiang-Conrath, is interesting, but alas, you have no semantic metadata (e.g. an ontology) for your tags, which leaves you stuck with the similarity of the string representations of the words.
This paper from 2008, The Use of Ontologies for Improving Image Retrieval and Annotation, again favours the use of ontologies, but I think it gives a nice discussion of the various approaches.
Folksonomies are social tagging systems that rely on the idea of the wisdom of the crowd; one representative example is Flickr.com. This approach avoids the very time-consuming work of manual annotation, but the inconsistency in tag use can make it difficult to search through the entire collection of data.
Combination with Content-Based Visual Information Retrieval
Both papers above cite this paper Augmenting Navigation for Collaborative Tagging with Emergent Semantics from 2006.
However, using tags alone for searching and browsing databases clearly has its limitations. First, people make mistakes while tagging, such as spelling mistakes, or accidental tagging with the wrong tag. Second, there is no solution to cope with homonymy, i.e. to distinguish different meanings of a word. Third, synonymy or different languages can only be handled by tagging data explicitly with all terms.
These authors combine social tagging with the initially mentioned content-based image retrieval.
Yet another link: collaborative tagging.

Algorithm to compare similarity of ideas (as strings)

Consider an arbitrary text box that records the answer to the question, "What do you want to do before you die?"
Using a collection of response strings (max length 240), I'd like to somehow sort and group them and count them by idea (which may be just string similarity as described in this question).
Is there another or better way to do something like this?
Is this any different than string similarity?
Is this the right question to be asking?
The idea here is to have people write in a text box over and over again, and for me to provide a number that says, generally speaking, that 802 people wrote approximately the same thing.
It is much more difficult than string similarity. This is what you need to do at a minimum:
Perform some text formatting/cleaning tasks like removing punctuation characters and common "stop words"
Construct a corpus (a collection of words with their usage statistics) from the terms that occur in the answers.
Calculate a weight for every term.
Construct a document vector from every answer (each term corresponds to a dimension in a very high-dimensional Euclidean space)
Run a clustering algorithm on the document vectors.
Read a good statistical natural language processing book, or search Google for good introductions/tutorials (likely terms: statistical NLP, text categorization, clustering). You can probably find some libraries (Weka or NLTK come to mind) depending on the language of your choice, but you need to understand the concepts to use a library anyway.
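A compressed sketch of those steps with scikit-learn (the sample answers are placeholders); tf-idf covers the cleaning and weighting, and k-means does the clustering, so the cluster sizes give you the counts per "idea":

from collections import Counter
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

answers = [
    "travel the world",
    "see the whole world",
    "travel around the world with my family",
    "write a novel",
    "publish a book",
]

vectorizer = TfidfVectorizer(stop_words='english')          # steps 1-4
X = vectorizer.fit_transform(answers)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)  # step 5
print(Counter(labels))   # how many answers fall into each "idea" cluster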
Latent Semantic Analysis (LSA) might interest you. Here is a nice introduction.
Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.
[...]
What you want is very much an open problem in NLP. @Ali's answer describes the idea at a high level, but the part "Construct a document vector from every answer" is the really hard one. There are a few obvious ways of building a document vector from the vectors of the words it contains. Addition, multiplication and averaging are fast, but they effectively ignore the syntax. "Man bites dog" and "Dog bites man" will have the same representation, but clearly not the same meaning. Google "compositional distributional semantics"; as far as I know, there are people at the Universities of Texas, Trento, Oxford and Sussex, and at Google, working in this area.
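A toy illustration of why averaging ignores word order; the random vectors below merely stand in for real pretrained word embeddings (word2vec, GloVe, etc.):

import numpy as np

rng = np.random.default_rng(0)
# Toy stand-ins for real pretrained word embeddings.
vocab = {w: rng.normal(size=50) for w in ['man', 'dog', 'bites']}

def sentence_vector(sentence):
    # Additive/averaging composition: order-insensitive by construction.
    vectors = [vocab[w] for w in sentence.lower().split() if w in vocab]
    return np.mean(vectors, axis=0)

v1 = sentence_vector("Man bites dog")
v2 = sentence_vector("Dog bites man")
print(np.allclose(v1, v2))  # True: both sentences get the identical representation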

How do I approximate "Did you mean?" without using Google?

I am aware of the duplicates of this question:
How does the Google “Did you mean?” Algorithm work?
How do you implement a “Did you mean”?
... and many others.
These questions are interested in how the algorithm actually works. My question is more like: Let's assume Google did not exist or maybe this feature did not exist and we don't have user input. How does one go about implementing an approximate version of this algorithm?
Why is this interesting?
Ok. Try typing "qualfy" into Google and it tells you:
Did you mean: qualify
Fair enough. It uses Statistical Machine Learning on data collected from billions of users to do this. But now try typing this: "Trytoreconnectyou" into Google and it tells you:
Did you mean: Try To Reconnect You
Now this is the more interesting part. How does Google determine this? Does it have a dictionary handy and guess the most probable words, again using user input? And how does it differentiate between a misspelled word and a sentence?
Now considering that most programmers do not have access to input from billions of users, I am looking for the best approximate way to implement this algorithm and what resources are available (datasets, libraries etc.). Any suggestions?
Assuming you have a dictionary of words (all the words that appear in the dictionary in the worst case, all the phrases that appear in the data in your system in the best case) and that you know the relative frequency of the various words, you should be able to reasonably guess at what the user meant via some combination of the similarity of the word and the number of hits for the similar word. The weights obviously require a bit of trial and error, but generally the user will be more interested in a popular result that is a bit linguistically further away from the string they entered than in a valid word that is linguistically closer but only has one or two hits in your system.
The second case should be a bit more straightforward. You find all the valid words that begin the string ("T" is invalid, "Tr" is invalid, "Try" is a word, "Tryt" is not a word, etc.) and for each valid word, you repeat the algorithm for the remaining string. This should be pretty quick assuming your dictionary is indexed. If you find a result where you are able to decompose the long string into a set of valid words with no remaining characters, that's what you recommend. Of course, if you're Google, you probably modify the algorithm to look for substrings that are reasonably close typos to actual words and you have some logic to handle cases where a string can be read multiple ways with a loose enough spellcheck (possibly using the number of results to break the tie).
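For the second case, here is a minimal sketch of that dictionary-based segmentation (the tiny word set is just for the example; in practice you would use a full dictionary plus word frequencies to break ties):

def segment(text, dictionary, memo=None):
    # Split a run-together string into dictionary words, trying longer prefixes first.
    # Returns a list of words, or None if no full segmentation exists.
    if memo is None:
        memo = {}
    if text == '':
        return []
    if text in memo:
        return memo[text]
    for i in range(len(text), 0, -1):
        prefix = text[:i].lower()
        if prefix in dictionary:
            rest = segment(text[i:], dictionary, memo)
            if rest is not None:
                memo[text] = [prefix] + rest
                return memo[text]
    memo[text] = None
    return None

words = {'try', 'to', 'reconnect', 'connect', 'you'}
print(segment('Trytoreconnectyou', words))  # ['try', 'to', 'reconnect', 'you']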
From the horse's mouth: How to Write a Spelling Corrector
The interesting thing here is how you don't need a bunch of query logs to approximate the algorithm. You can use a corpus of mostly-correct text (like a bunch of books from Project Gutenberg).
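The core of that corrector fits in a few lines; this condensed paraphrase assumes a plain-text corpus file (the name 'corpus.txt' is a placeholder), such as a concatenation of Project Gutenberg books:

import re
from collections import Counter

WORDS = Counter(re.findall(r'[a-z]+', open('corpus.txt').read().lower()))

def edits1(word):
    # All strings one delete, transpose, replace or insert away from word.
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    # Prefer a known word; otherwise the most frequent known candidate one edit away.
    candidates = ({word} & WORDS.keys()) or (edits1(word) & WORDS.keys()) or {word}
    return max(candidates, key=WORDS.get)

print(correct('qualfy'))  # 'qualify', provided the corpus contains it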
I think this can be done using a spellchecker along with N-grams.
For Trytoreconnectyou, we first check with all 1-grams (all dictionary words) and find a closest match that's pretty terrible. So we try 2-grams (which can be built by removing spaces from phrases of length 2), and then 3-grams and so on. When we try a 4-gram, we find that there is a phrase that is at 0 distance from our search term. Since we can't do better than that, we return that answer as the suggestion.
I know this is very inefficient, but Peter Norvig's post here suggests clearly that Google uses spell correctors to generate its suggestions. Since Google has massive parallelization capabilities, they can accomplish this task very quickly.
An impressive tutorial on how it works can be found here: http://alias-i.com/lingpipe-3.9.3/demos/tutorial/querySpellChecker/read-me.html.
In a few words, it is a trade-off between query modification (at the character or word level) and increased coverage of the search documents. For example, "aple" leads to 2 million documents but "apple" leads to 60 million, and the modification is only one character, therefore it is obvious that you meant "apple".
Datasets/tools that might be useful:
WordNet
Corpora such as the ukWaC corpus
You can use WordNet as a simple dictionary of terms, and you can boost that with frequent terms extracted from a corpus.
You can use the Peter Norvig link mentioned before as a first attempt, but with a large dictionary, this won't be a good solution.
Instead, I suggest you use something like locality sensitive hashing (LSH). This is commonly used to detect duplicate documents, but it will work just as well for spelling correction. You will need a list of terms and strings of terms extracted from your data that you think people may search for - you'll have to choose a cut-off length for the strings. Alternatively if you have some data of what people actually search for, you could use that. For each string of terms you generate a vector (probably character bigrams or trigrams would do the trick) and store it in LSH.
Given any query, you can use an approximate nearest neighbour search on the LSH described by Charikar to find the closest neighbour out of your set of possible matches.
Note: links removed as I'm a new user - sorry.
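Here is a minimal sketch of the random-hyperplane (Charikar) LSH idea over character n-gram vectors; real systems use several hash tables/bands rather than a single exact-signature bucket, so treat this as illustrative only:

import numpy as np
from collections import defaultdict

def char_ngram_vector(term, dim=2**16, n=3):
    # Hash character trigrams of the padded term into a fixed-size count vector.
    padded = ' ' + term.lower() + ' '
    v = np.zeros(dim)
    for i in range(len(padded) - n + 1):
        v[hash(padded[i:i + n]) % dim] += 1
    return v

class SimHashIndex:
    def __init__(self, dim=2**16, bits=32, seed=0):
        self.planes = np.random.default_rng(seed).normal(size=(bits, dim))
        self.buckets = defaultdict(list)

    def _signature(self, term):
        # Sign pattern of projections onto random hyperplanes (Charikar's simhash).
        v = char_ngram_vector(term, dim=self.planes.shape[1])
        return tuple(bool(b) for b in (self.planes @ v) > 0)

    def add(self, term):
        self.buckets[self._signature(term)].append(term)

    def candidates(self, query):
        # Terms whose signature matches exactly; rank/verify them with a direct measure.
        return self.buckets[self._signature(query)]

index = SimHashIndex()
for t in ('apple', 'apples', 'banana'):
    index.add(t)
# With a single table this lookup can miss near matches; multiple tables/bands fix that.
print(index.candidates('aple'))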
@Legend - Consider using one of the variations of the Soundex algorithm. It has some known flaws, but it works decently well in most applications that need to approximate misspelled words.
Edit (2011-03-16):
I suddenly remembered another Soundex-like algorithm that I had run across a couple of years ago. In this Dr. Dobb's article, Lawrence Philips discusses improvements to his Metaphone algorithm, dubbed Double Metaphone.
You can find a Python implementation of this algorithm here, and more implementations on the same site here.
Again, these algorithms won't be the same as what Google uses, but for English language words they should get you very close. You can also check out the wikipedia page for Phonetic Algorithms for a list of other similar algorithms.
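For a quick experiment, phonetic codes are one line each with the third-party jellyfish package (Double Metaphone implementations live in other packages, but jellyfish's original Metaphone is enough to show the idea):

# pip install jellyfish
import jellyfish

for word in ('qualify', 'qualfy', 'quallify'):
    # Sound-alike misspellings collapse to the same (or nearly the same) code.
    print(word, jellyfish.soundex(word), jellyfish.metaphone(word))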
Take a look at this: How does the Google "Did you mean?" Algorithm work?

Algorithm for computing the relevance of a keyword to a short text (50 - 100 words)

I want to compute the relevance of a keyword to a short description text. What would be the best approach in terms of efficiency and ease of implementation? I am using C++.
Simple solution: Count the occurrences of the word in the text.
To do a good job, though, is a hard problem that companies like Google have been working on for years. If possible, you might want to take a look at using their technology.
To expand, try the following:
Use a dictionary (e.g. WordNet) to replace all synonyms with a common word
Detect similar words using Levenshtein distance
That's still only going to get you so far. You'll need to perform some natural language processing to truly understand what the description is about to distinguish between multiple texts containing the keyword the same number of times.
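A small sketch of the two simple ideas above (plain counting plus a Levenshtein tolerance for near-misses); it is written in Python for brevity, but the same logic ports directly to C++:

def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

def relevance(keyword, text, max_distance=1):
    # Fraction of words in the text that are the keyword or a near-miss of it.
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if levenshtein(keyword.lower(), w) <= max_distance)
    return hits / len(words)

print(relevance('color', 'The colour palette uses one primary color'))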
Refer to these previous Stack Overflow questions:
What are Useful Ranking Algorithms for Documents without Links (e.g. PDF, MS Documents, etc…)?
Algorithm for generating a 'top list' using word frequency.

Is there an algorithm that tells the semantic similarity of two phrases

input: phrase 1, phrase 2
output: semantic similarity value (between 0 and 1), or the probability these two phrases are talking about the same thing
You might want to check out this paper:
Sentence similarity based on semantic nets and corpus statistics (PDF)
I've implemented the algorithm described. Our context was very general (effectively any two English sentences), and we found the approach too slow and the results, while promising, not good enough (or likely to become so without considerable extra effort).
You don't give a lot of context so I can't necessarily recommend this but reading the paper could be useful for you in understanding how to tackle the problem.
Regards,
Matt.
There's a short and a long answer to this.
The short answer:
Use the WordNet::Similarity Perl package. If Perl is not your language of choice, check the WordNet project page at Princeton, or google for a wrapper library.
The long answer:
Determining word similarity is a complicated issue, and research is still very hot in this area. To compute similarity, you need an appropriate representation of the meaning of a word. But what would be a representation of the meaning of, say, 'chair'? In fact, what is the exact meaning of 'chair'? If you think long and hard about this, it will twist your mind, you will go slightly mad, and finally take up a research career in Philosophy or Computational Linguistics to find the truth™. Both philosophers and linguists have tried to come up with an answer for literally thousands of years, and there's no end in sight.
So, if you're interested in exploring this problem a little more in-depth, I highly recommend reading Chapter 20.7 in Speech and Language Processing by Jurafsky and Martin, some of which is available through Google Books. It gives a very good overview of the state-of-the-art of distributional methods, which use word co-occurrence statistics to define a measure for word similarity. You are not likely to find libraries implementing these, however.
For anyone just coming to this, I would suggest taking a look at SEMILAR - http://www.semanticsimilarity.org/ . It implements a lot of the modern research methods for calculating word and sentence similarity, and it is written in Java.
The SEMILAR API comes with various similarity methods based on WordNet, Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), BLEU, Meteor, Pointwise Mutual Information (PMI), dependency-based methods, optimized methods based on Quadratic Assignment, etc. The similarity methods work at different granularities - word to word, sentence to sentence, or bigger texts.
You might want to check into the WordNet project at Princeton University. One possible approach to this would be to first run each phrase through a stop-word list (to remove "common" words such as "a", "to", "the", etc.). Then, for each of the remaining words in each phrase, you could compute the semantic "similarity" between each of the words in the other phrase using a distance measure based on WordNet. The distance measure could be something like: the number of arcs you have to pass through in WordNet to get from word1 to word2.
Sorry this is pretty high-level. I've obviously never tried this. Just a quick thought.
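A minimal sketch of that WordNet-distance idea using NLTK (requires nltk plus its WordNet data, fetched via nltk.download('wordnet')); path_similarity is roughly the inverse of the arc-count distance described above:

from nltk.corpus import wordnet as wn

STOP_WORDS = {'a', 'an', 'the', 'to', 'of', 'in', 'and', 'or'}

def word_similarity(w1, w2):
    # Best path similarity over all sense pairs; 0.0 if the words share no path.
    best = 0.0
    for s1 in wn.synsets(w1):
        for s2 in wn.synsets(w2):
            sim = s1.path_similarity(s2)
            if sim is not None and sim > best:
                best = sim
    return best

def phrase_similarity(p1, p2):
    t1 = [w for w in p1.lower().split() if w not in STOP_WORDS]
    t2 = [w for w in p2.lower().split() if w not in STOP_WORDS]
    if not t1 or not t2:
        return 0.0
    # For each word of the first phrase, take its best match in the second, then average.
    return sum(max(word_similarity(a, b) for b in t2) for a in t1) / len(t1)

print(phrase_similarity("the cat sat on the mat", "a kitten rested on a rug"))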
I would look into latent semantic indexing for this. I believe you can create something similar to a vector space search index but with semantically related terms being closer together i.e. having a smaller angle between them. If I learn more I will post here.
Sorry to dig up a 6 year old question, but as I just came across this post today, I'll throw in an answer in case anyone else is looking for something similar.
cortical.io has developed a process for calculating the semantic similarity of two expressions and they have a demo of it up on their website. They offer a free API providing access to the functionality, so you can use it in your own application without having to implement the algorithm yourself.
One simple solution is to use the dot product of character n-gram vectors. This is robust to ordering changes (which many edit distance metrics are not) and captures many issues around stemming. It also sidesteps the AI-complete problem of full semantic understanding.
To compute the n-gram vector, just pick a value of n (say, 3) and hash every 3-character sequence in the phrase into a vector. Normalize the vector to unit length, then take the dot product of different vectors to detect similarity.
This approach has been described in
J. Mitchell and M. Lapata, "Composition in Distributional Models of Semantics," Cognitive Science, vol. 34, no. 8, pp. 1388–1429, Nov. 2010, DOI 10.1111/j.1551-6709.2010.01106.x
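A minimal sketch of the character-trigram approach (using cosine, i.e. the dot product of unit-normalized vectors):

import math
from collections import Counter

def trigram_vector(phrase, n=3):
    padded = ' ' + phrase.lower() + ' '
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(v1, v2):
    # Dot product of the two count vectors divided by their lengths.
    dot = sum(count * v2[gram] for gram, count in v1.items() if gram in v2)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

print(cosine(trigram_vector("semantic similarity"), trigram_vector("similar semantics")))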
I would have a look at statistical techniques that take into consideration the probability of each word appearing within a sentence. This will allow you to give less importance to popular words such as 'and', 'or', 'the', and give more importance to words that appear less regularly and are therefore a better discriminating factor. For example, if you have two sentences:
1) The smith-waterman algorithm gives you a similarity measure between two strings.
2) We have reviewed the smith-waterman algorithm and we found it to be good enough for our project.
The fact that the two sentences share the words "smith-waterman" and "algorithm" (which are not as common as 'and', 'or', etc.) will allow you to say that the two sentences might indeed be talking about the same topic.
Summarizing, I would suggest you have a look at:
1) String similarity measures;
2) Statistic methods;
Hope this helps.
Try SimService, which provides a service for computing top-n similar words and phrase similarity.
This requires that your algorithm actually knows what you're talking about. It can be done in some rudimentary form by just comparing words, looking for synonyms and so on, but any sort of accurate result would require some form of intelligence.
Take a look at http://mkusner.github.io/publications/WMD.pdf. This paper describes an algorithm called Word Mover's Distance that tries to uncover semantic similarity. It relies on the similarity scores given by word2vec. Integrating this with GoogleNews-vectors-negative300 yields desirable results.
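With gensim this is only a few lines; the sketch below assumes you have downloaded the GoogleNews-vectors-negative300.bin.gz file separately and have the POT package (or pyemd for older gensim versions) installed, which wmdistance needs:

from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)

s1 = "Obama speaks to the media in Illinois".lower().split()
s2 = "The president greets the press in Chicago".lower().split()

# Lower Word Mover's Distance means the two sentences are semantically closer.
print(vectors.wmdistance(s1, s2))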
