Clustering algorithms for strings

I have to implement a module in which I need to group sentences (strings) with similar meaning into different clusters. I read about k-means, EM clustering, etc., but the problem I am facing is that these algorithms are explained with vector points on a graph. I don't see how they can be applied to sentences (strings) with similar meaning. Please suggest some appropriate approaches.
For example, let's consider a classroom scenario:
1) Teacher has ample knowledge.
2) Students understand what teacher teaches.
3) Teacher is sometimes punctual in class.
4) Teacher is audible in class.
Let's say we have these 4 sentences. Looking at them, we can say that sentences 1 and 2 have similar meaning, while sentences 3 and 4 are related neither to each other nor to the first two. That is how I need to classify the sentences. How can it be done?

First of all, you should make yourself familiar with the bag-of-words concept.
The basic idea is to map each sentence onto the number of occurrences of each word, e.g., the sentences hello world and hello tanay would get mapped onto

    hello  world  tanay
      1      1      0
      1      0      1
This allows you to use one of the standard approaches.
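For instance, a minimal sketch with scikit-learn's CountVectorizer (assuming scikit-learn is available; note the columns come out in alphabetical vocabulary order):

    from sklearn.feature_extraction.text import CountVectorizer

    sentences = ["hello world", "hello tanay"]
    vectorizer = CountVectorizer()
    # Rows are sentences, columns are vocabulary words, cells are occurrence counts
    bow = vectorizer.fit_transform(sentences)
    print(vectorizer.get_feature_names_out())  # ['hello' 'tanay' 'world']
    print(bow.toarray())
    # [[1 0 1]
    #  [1 1 0]]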
Also worthwhile would be having a look at TF-IDF (term frequency-inverse document frequency): it reweights the words in a bag-of-words representation by their importance for distinguishing the documents (or sentences, in your case).
Secondly, you should look at LDA, which was made specifically for clustering words into concepts. Note, however, that it models documents as mixtures of only a few (a fixed number of) concepts.
Most promising to me sounds a combination of these approaches: generate bags of words, reweight them using TF-IDF, run LDA, augment the reweighted bags of words with the LDA concepts, and then use a standard clustering algorithm.
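A rough sketch of that combined pipeline with scikit-learn (the four example sentences, the number of topics, and the number of clusters are all placeholders):

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.cluster import KMeans

    sentences = [
        "Teacher has ample knowledge",
        "Students understand what teacher teaches",
        "Teacher is sometimes punctual in class",
        "Teacher is audible in class",
    ]

    counts = CountVectorizer().fit_transform(sentences)   # bags of words
    tfidf = TfidfTransformer().fit_transform(counts)      # reweighted by TF-IDF
    topics = LatentDirichletAllocation(
        n_components=2, random_state=0).fit_transform(counts)

    # Augment the reweighted bags of words with the LDA topic proportions
    features = np.hstack([tfidf.toarray(), topics])
    print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features))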

Clustering cannot do this.
Because it looks for structure in the data, but you want to cluster by the abstract human concept of meaning, which is hard to capture using statistics...
So you first need to solve the really hard task of making the computer understand language reliably. And not on a "best match" basis, but well enough to quantify similarities.
There are actually some attempts in this direction, usually involving massive data and deep learning. They can do this on some toy examples such as "Paris - France + USA = ?" - sometimes. Google for IBM Watson and Google word2vec.
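For example, with pretrained vectors loaded through gensim (the Google News vectors file is one publicly available option; any word2vec-format embeddings work the same way):

    from gensim.models import KeyedVectors

    # Pretrained vectors in word2vec binary format (a multi-gigabyte download)
    kv = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)

    # "Paris - France + USA = ?" as vector arithmetic
    print(kv.most_similar(positive=["Paris", "USA"], negative=["France"], topn=3))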
Good luck. You will need high-performance GPUs and exabytes of training data.

Related

What matching algorithm could I use?

I need some help because I don't know what algorithm I could use for the following (I use Python):
Steve is 25 and he buys everyday orange juice
Maria is 23 and she likes to buy smoothies
Steve & Maria tastes are pretty much the same.
Juan is 16 and he only drinks sodas
Juan tastes are not the same as Steve and Maria.
====================================================
I would like to use a matching algorithm that will detect the users who have the same drink preference and a close age. To continue with the example, Steve and Maria would be matched together, but not Juan. Which one should I use?
I agree with @klutt that your task is pretty vague. There are two approaches that come to mind, but not knowing more details about your problem does limit the details I can provide in my answer. I am interpreting the question as if you are taking in raw text and might want to process more sentences that have very similar semantic and syntactic structure.
An algorithmic approach:
Assuming that your word choices are static in their semantic meaning (Maria is 23 ... Steve is 25), we can parse each sentence, identify tokens like is or and or same, and essentially perform lexical analysis on the text... from here, you could continue thinking about how you would go about matching and so forth... but this is rather complicated...
Neural Network approach:
If you are taking in raw text in the form of sentences, the problem is not straightforward to solve with a top-down algorithmic approach.
You could take an approach with neural networks that trains a model to solve your problem, but then again what you seem to be asking is quite complex since there are multiple "facts" within each sentence that are not semantically related. For example, your second sentence identifies that Maria is 23 but at the end of that sentence there is a comparison between Steve and Maria. And your first sentence only identifies Steve as 25.
Even if you chunk raw text into sentences, you would have to have a very fine tuned neural network architecture and a lot of training data to get remotely close to your goal.
Now, both of those solutions are very complex... but if you wanted to create an application that collects this data (via a form or prompt) and puts it into a structured format (like a JSON or XML object) to organize and store the data in memory (perhaps writing out to a database or file for persistent storage), that might be a good route to go down.
This can serve as a good lesson in how to think about data as well. It is one thing if you have a pool of thousands of sentences, just raw data that you need to organize for quantitative purposes (classic qualitative -> quantitative problems). It is another thing if you are going to be collecting this data yourself; once you have a program that collects and organizes names, ages, and drink preferences into suitable data structures, then we can talk about matching algorithms.
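For illustration, a minimal sketch of that route (the field names, the drink categories, and the age threshold are all made up for the example):

    import json

    people = [
        {"name": "Steve", "age": 25, "drink": "orange juice", "category": "fruit"},
        {"name": "Maria", "age": 23, "drink": "smoothies", "category": "fruit"},
        {"name": "Juan", "age": 16, "drink": "sodas", "category": "soda"},
    ]

    def matches(a, b, max_age_gap=5):
        # Same drink category and an age gap within an arbitrary threshold
        return (a["category"] == b["category"]
                and abs(a["age"] - b["age"]) <= max_age_gap)

    pairs = [(a["name"], b["name"])
             for i, a in enumerate(people)
             for b in people[i + 1:]
             if matches(a, b)]
    print(pairs)  # [('Steve', 'Maria')]

    # Persist the structured data for later use
    with open("people.json", "w") as f:
        json.dump(people, f, indent=2)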
I will also add here that if you do have structured data, Collaborative filtering (mentioned by Shridhar) is a great starting place.
Collaborative filtering best suits your needs.
In the newer, narrower sense, collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on an issue, A is more likely to have B's opinion on a different issue than that of a randomly chosen person. For example, a collaborative filtering recommendation system for television tastes could make predictions about which television show a user should like given a partial list of that user's tastes (likes or dislikes).[3] Note that these predictions are specific to the user, but use information gleaned from many users. This differs from the simpler approach of giving an average (non-specific) score for each item of interest, for example based on its number of votes.
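As a toy illustration of that idea (the ratings matrix is invented), user-user similarity can be computed from preference vectors, and the most similar user's opinions can then be borrowed:

    import numpy as np

    # Rows: users (Steve, Maria, Juan); columns: items; values: preference ratings
    ratings = np.array([
        [5.0, 4.0, 0.0],
        [4.0, 5.0, 0.0],
        [0.0, 1.0, 5.0],
    ])

    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    # Similarity of every user to Steve (row 0): Maria scores far higher than Juan
    print([round(cosine(ratings[0], row), 2) for row in ratings])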

What machine learning algorithms should be used to identify given words?

I'm new to machine learning and I want to identify given words using an algorithm.
As an example,
construct the triangle ABC such that AB=7cm, BAC=60 and AC=5.5cm
construct the square that is 7cm long each side.
In this example I need to identify the words triangle and square.
It seems like you want the algorithm to be intelligent rather than just identifying a couple of words, so you should go for natural language processing (NLP). With NLP you can identify the nouns naming different geometrical objects. If you also want to gather the details, I mean to list that AB=7, BAC=60 and AC=5.5, then learn about recurrent neural networks (RNNs). RNN models can remember what was said at the beginning of a sentence and identify which details belong to it.
Ex: Anna lives in Paris and she's fluent in French.
An RNN can identify the word French from the earlier mention of Paris.
So just by using a plain machine learning algorithm it is not possible to identify those words and gather the details from such a sentence.
You can read this article for further understanding.
Understanding LSTM Networks
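For just spotting the shape nouns, a minimal NLTK sketch (assuming the tokenizer and tagger data have been downloaded, and with an obviously incomplete shape lexicon) could look like this:

    import nltk

    # One-time setup: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
    SHAPES = {"triangle", "square", "circle", "rectangle"}  # illustrative lexicon

    def find_shapes(sentence):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        # Keep nouns (NN*) that appear in the shape lexicon
        return [w for w, tag in tagged if tag.startswith("NN") and w.lower() in SHAPES]

    print(find_shapes("construct the triangle ABC such that AB=7cm, BAC=60 and AC=5.5cm"))
    print(find_shapes("construct the square that is 7cm long each side"))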

Given a list of words, how to develop an algorithmic way to semantically group them?

I am working with the Google Places API, which defines a list of 97 different location types. I want to reduce the list to a smaller number of groups, as many of the types are groupable. For example, atm and bank into financial; temple, church, mosque, synagogue into worship; school, university into education; subway_station, train_station, transit_station, gas_station into transportation.
But it should also not overgeneralize; for example, grouping pet_store, city_hall, courthouse, restaurant into something like buildings.
I tried quite a few methods. First, I downloaded synonyms of each of the 97 words in the list from multiple dictionaries. Then I computed the similarity between two words as the fraction of unique synonyms they share in common (Jaccard similarity).
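In code, that similarity is just the intersection over the union of the two synonym sets; a sketch with toy synonym sets (not the real dictionary output):

    def jaccard(synonyms_a, synonyms_b):
        a, b = set(synonyms_a), set(synonyms_b)
        return len(a & b) / len(a | b)

    print(jaccard({"bank", "atm", "finance"}, {"bank", "finance", "money"}))  # 0.5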
But after that, how do I group the words into clusters? Using traditional clustering methods (k-means, k-medoid, hierarchical clustering, and FCM), I am not getting any good clustering; I identified several misclassifications by scanning the results manually.
I even tried the word2vec model trained on Google News data (where each word is expressed as a vector of 300 features), and I do not get good clusters based on that either.
You are probably looking for something related to vector space dimensionality reduction. In these techniques, you'll need a corpus of text that uses the locations as words in the text. Dimensionality reduction will then group the terms together. You can do some reading on Latent Dirichlet Allocation and Latent Semantic Indexing. A good reference is "Introduction to Information Retrieval" by Manning et al., chapter 18. Note that this book is from 2009, so a lot of advances are not captured. As you noted, there has been a lot of work such as word2vec. Another good reference is "Speech and Language Processing" by Jurafsky and Martin, chapter 16.
You need much more data.
No algorithm ever, without additional data, will relate ATM and bank to financial. Because that requires knowledge of these terms.
Jaccard similarity doesn't have access to such knowledge; it can only work on the words themselves. And then "river bank" and "bank branch" are very similar.
So don't expect the algorithm to work magic. You need the magic to be in the data...

Yahoo! LDA Implementation Questions

All,
I have been running Y!LDA (https://github.com/shravanmn/Yahoo_LDA) on a set of documents and the results look great (or at least what I would expect). Now I want to use the resulting topics to perform a reverse query against the corpus. Does anyone know if the 3 human-readable text files that are generated after the learntopics executable is run are the final output of this library? If so, is that what I need to parse to perform my queries? I am stuck with a little shoulder shrugging at this point...
Thanks,
Adam
If LDA is working the way I think it is (I use a Java implementation, so details may vary), then what you get out are the following three things:
P(word,concept) -- The probability of getting a word given a concept. So, when LDA finishes figuring out what concepts exist within the corpus, this P(w,c) will tell you (in theory) which words map to which concepts.
A very naive method of determining concepts would be to load this file into a matrix, combine these probabilities over all possible concepts for a test document in some way (adding, multiplying, root-mean-square), and rank-order the concepts.
Do note that the above method does not account for the various biases introduced by weakly represented or dominating topics in LDA. To accommodate that you need more complicated algorithms (Gibbs sampling, for instance), but this will get you some results.
P(concept,document) -- If you are attempting to find the intrinsic concepts in the documents in the corpus, you would look here. You can use the documents as examples of documents that have a particular concept distribution, and compare your documents to the LDA corpus documents... There are uses for this, but it may not be as useful as the P(w,c).
Something else, probably relating to the weights of words, documents, or concepts. This could be as simple as a set of concept examples with beta weights (for the concepts), or some other variables that are output from LDA. These may or may not be important depending on what you are doing. (If you are attempting to add a document to the LDA space, having the alpha or beta values is very important.)
To answer your 'reverse lookup' question, to determine the concepts of the test document, use P(w,c) for each word w in the test document.
To determine which document is the most like the test document, determine the above concepts, then compare them to the concepts for each document found in P(c,d) (using each concept as a dimension in vector-space and then determining a cosine between the two documents tends to work alright).
To determine the similarity between two documents, same thing as above, just determine the cosine between the two concept-vectors.
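A minimal sketch of that comparison step (the concept vectors below are placeholders for whatever you derive from P(w,c) and P(c,d)):

    import numpy as np

    def cosine(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    # Concept distributions for a test document and two corpus documents (made up)
    test_doc = np.array([0.7, 0.2, 0.1])
    doc_a = np.array([0.6, 0.3, 0.1])
    doc_b = np.array([0.1, 0.1, 0.8])

    print(cosine(test_doc, doc_a))  # high: similar concept mix
    print(cosine(test_doc, doc_b))  # low: different concept mix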
Hope that helps.

Algorithm to determine how positive or negative a statement/text is

I need an algorithm to determine if a sentence, paragraph or article is negative or positive in tone... or better yet, how negative or positive.
For instance:
Jason is the worst SO user I have ever witnessed (-10)
Jason is an SO user (0)
Jason is the best SO user I have ever seen (+10)
Jason is the best at sucking with SO (-10)
While, okay at SO, Jason is the worst at doing bad (+10)
Not easy, huh? :)
I don't expect somebody to explain this algorithm to me, but I assume there is already much work on something like this in academia somewhere. If you can point me to some articles or research, I would love it.
Thanks.
There is a subfield of natural language processing called sentiment analysis that deals specifically with this problem domain. There is a fair amount of commercial work in the area, because consumer products are so heavily reviewed in online user forums (UGC, or user-generated content). There is also a prototype platform for text analytics called GATE from the University of Sheffield, and a Python project called NLTK. Both are considered flexible, but not very high performance. One or the other might be good for working out your own ideas.
In my company we have a product that does this, and it performs well. I did most of the work on it. I can give a brief idea:
You need to split the paragraph into sentences and then split each sentence into smaller sub-sentences, splitting on commas, hyphens, semicolons, colons, 'and', 'or', etc.
In some cases each sub-sentence will exhibit a totally separate sentiment.
Some sentences, even if split, will have to be joined back together.
E.g.: The product is amazing, excellent and fantastic.
We have developed a comprehensive set of rules on the types of sentences that need to be split and those that shouldn't be (based on the POS tags of the words).
On the first level, you can use a bag-of-words approach, meaning: keep a list of positive and negative words/phrases and check every sub-sentence against it. While doing this, also look at negation words like 'not', 'no', etc., which change the polarity of the sentence.
Even then, if you can't find the sentiment, you can go for a Naive Bayes approach. This approach is not very accurate (about 60%), but if you apply it only to the sentences that fail to pass the first set of rules, you can easily get to 80-85% accuracy.
The important part is the positive/negative word list and the way you split things up. If you want, you can go a level higher by implementing an HMM (Hidden Markov Model) or CRF (Conditional Random Fields). But I am not a pro in NLP, and someone else may fill you in on that part.
For the curious, we implemented all of this in Python with NLTK and the Reverend Bayes module.
It is pretty simple and handles most sentences. You may, however, face problems when trying to tag content from the web, since most people don't write proper sentences on the web. Also, handling sarcasm is very hard.
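A stripped-down sketch of that first level (the word lists and the splitting rule are toy stand-ins for the comprehensive rules described above):

    import re

    POSITIVE = {"amazing", "excellent", "fantastic", "good", "best"}
    NEGATIVE = {"bad", "worst", "terrible", "sucking"}
    NEGATIONS = {"not", "no", "never"}

    def score(sentence):
        total = 0
        # Naive sub-sentence split on commas, hyphens, semicolons and colons
        for part in re.split(r"[,;:-]", sentence.lower()):
            words = re.findall(r"[a-z]+", part)
            polarity = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
            if any(w in NEGATIONS for w in words):
                polarity = -polarity  # negation flips the sub-sentence polarity
            total += polarity
        return total

    print(score("The product is amazing, excellent and fantastic."))  # 3
    print(score("The product is not good."))                          # -1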
This falls under the umbrella of Natural Language Processing, and so reading about that is probably a good place to start.
If you don't want to get into a very complicated problem, you can just create lists of "positive" and "negative" words (and weight them if you want) and do word counts on sections of text. Obviously this isn't a "smart" solution, but it gets you some information with very little work, whereas doing serious NLP would be very time-consuming.
One of your examples would potentially be marked positive when it was in fact negative using this approach ("Jason is the best at sucking with SO"), unless you happen to weight "sucking" more than "best"... But this is also a small text sample; if you're looking at paragraphs or more of text, weighting becomes more reliable, unless you have someone purposefully trying to fool your algorithm.
As pointed out, this comes under sentiment analysis in natural language processing. AFAIK, GATE doesn't have any component that does sentiment analysis.
In my experience, I have implemented an algorithm which is an adaptation of the one in the paper 'Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis' by Theresa Wilson, Janyce Wiebe, and Paul Hoffmann as a GATE plugin, and it gives reasonably good results. It could help you if you want to bootstrap your implementation.
Depending on your application, you could do it via a Bayesian filtering algorithm (which is often used in spam filters).
One way to do it would be to have two filters: one for positive documents and another for negative documents. You would seed the positive filter with positive documents (by whatever criteria you use) and the negative filter with negative documents. The trick is finding these documents; maybe you could set it up so that your users effectively rate documents.
The positive filter (once seeded) would look for positive words; maybe it would end up with words like love, peace, etc. The negative filter would be seeded appropriately as well.
Once your filters are set up, you run the test text through them to come up with positive and negative scores. Based on these scores and some weighting, you can come up with your numeric score.
Bayesian Filters, though simple, are surprisingly effective.
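A compact sketch of the two-filter idea using scikit-learn's multinomial Naive Bayes (the seed documents stand in for whatever rated data you collect):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Seed documents: texts your users have effectively rated
    docs = ["love this, peace and joy", "wonderful and great",
            "hate this, awful mess", "terrible and broken"]
    labels = ["pos", "pos", "neg", "neg"]

    vectorizer = CountVectorizer()
    clf = MultinomialNB().fit(vectorizer.fit_transform(docs), labels)

    # The class probabilities serve as the positive/negative scores to weight
    test = vectorizer.transform(["what a wonderful, great experience"])
    print(dict(zip(clf.classes_, clf.predict_proba(test)[0])))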
You can do it like this:
Jason is the worst SO user I have ever witnessed (-10)
worst (-), the rest is (+). So that would be (-) + (+) = (-)
Jason is an SO user (0)
( ) + ( ) = ( )
Jason is the best SO user I have ever seen (+10)
best (+), the rest is ( ). So that would be (+) + ( ) = (+)
Jason is the best at sucking with SO (-10)
best (+), sucking (-). So (+) + (-) = (-)
While, okay at SO, Jason is the worst at doing bad (+10)
worst (-), doing bad (-). So (-) + (-) = (+)
There are many machine learning approaches for this kind of sentiment analysis. I used most of the machine learning algorithms that are already implemented; in my case I used the Weka classification algorithms:
SVM
Naive Bayes
J48
All you have to do is train the model for your context, add feature vectors, and do rule-based tuning. In my case I got about 61% accuracy. So we moved to Stanford CoreNLP (they trained their model on movie reviews), used their training set, and added our own training set; we achieved 80-90% accuracy.
This is an old question, but I happened upon it while looking for a tool that could analyze article tone, and found Watson Tone Analyzer by IBM. It allows 1,000 API calls per month for free.
It's all about context, I think. If you're looking for the people who are best at sucking with SO, sucking the best can be a positive thing. For determining what is bad or good, and by how much, I recommend looking into fuzzy logic.
It's a bit like being tall. Someone who's 1.95 m can be considered tall. If you place that person in a group of people all over 2.10 m, he looks short.
Maybe essay grading software could be used to estimate tone? WIRED article.
Possible reference. (I couldn't read it.)
This report compares writing skill to the Flesch-Kincaid Grade Level needed to read it!
Page 4 of e-rater says that they look at misspellings and such. (Maybe bad posts are misspelled too!)
Slashdot article.
You could also use an email filter of some sort for negativity instead of spam-ness.
How about sarcasm:
Jason is the best SO user I have ever seen, NOT
Jason is the best SO user I have ever seen, right
Ah, I remember one Java library for this called LingPipe (commercial license) that we evaluated. It worked fine for the example corpus available at the site, but for real data it sucks pretty badly.
Most of the sentiment analysis tools are lexicon based, and none of them is perfect. Also, sentiment analysis can be framed as ternary or binary sentiment classification. Moreover, it is a domain-specific task: tools that work well on news datasets may not do a good job on informal, unstructured tweets.
I would suggest using several tools and having an aggregation or vote-based mechanism to decide the intensity of the sentiment. The best survey of sentiment analysis tools that I have come across is SentiBench. You will find it helpful.
# Perl example using the CPAN module Algorithm::NaiveBayes
use Algorithm::NaiveBayes;

my $nb = Algorithm::NaiveBayes->new;

# Each instance is a bag of weighted attributes with one or more labels
$nb->add_instance(
    attributes => { foo => 1, bar => 1, baz => 3 },
    label      => 'sports',
);
$nb->add_instance(
    attributes => { foo => 2, blurp => 1 },
    label      => [ 'sports', 'finance' ],
);

# ... repeat for several more instances, then:
$nb->train;

# Find results for unseen instances
my $result = $nb->predict(
    attributes => { bar => 3, blurp => 2 },
);
