I see that pyLDAvis visualizes each word's saliency under each topic.
But is there a way to extract each word's saliency under each topic, or to calculate it directly from a Gensim LDA model?
Ultimately, I want a pandas DataFrame in which each row represents a word, each column represents a topic, and each value is the word's saliency under the corresponding topic.
Many thanks in advance.
Gensim's LDA implementation does not have out-of-the-box support for this particular 'saliency' calculation from Chuang et al. (2012).
Still, I suspect the model's .get_term_topics() and/or .get_topic_terms() methods provide the proper supporting data for implementing that calculation. In particular, one or the other of those methods might provide the p( w | t ) term, but a deeper read of the paper would be required to know for sure. (I suspect the P(t) term might require a separate survey of the training data.)
From the class docs:
https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_term_topics
Returns: The relevant topics represented as pairs of their ID and their assigned probability, sorted by relevance to the given word.
https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_topic_terms
Returns: Word ID - probability pairs for the most relevant words generated by the topic.
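For example (a rough usage sketch only; `lda` and `dictionary` stand for your trained LdaModel and the gensim Dictionary used to build its corpus, and 'economy' is just a placeholder token):

    # Topics most relevant to a given word, as (topic_id, probability) pairs
    word_id = dictionary.token2id['economy']
    print(lda.get_term_topics(word_id, minimum_probability=0.0))

    # Top words for a given topic, as (word_id, probability) pairs
    print(lda.get_topic_terms(topicid=0, topn=10))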
I hadn't come across this particular 'saliency' calculation before, but if it is popular among LDA users, or of potential general use, and you figure out how to calculate it, it'd likely be a welcome contribution to the Gensim project - especially if it can be a simple extra convenience method on LdaModel.
Adding to gojomo's reply: Yes, there is no direct way of getting the list of most salient words as proposed by Chuang et al. (2012). But there is a library named TMToolkit that offers a way of extracting this. It provides a method called word_saliency that can give you what you are looking for. The catch is that this method expects you to provide the following items:
topic_word_distribution
doc_topic_distribution
doc_lengths
If you are using Gensim LDA, then providing doc_topic_distribution becomes a significant challenge, as Gensim does not provide this out of the box. In that case, you can use the _extract_data method that is part of the pyLDAvis library. As this method is designed specifically for Gensim, you should already have all the parameters it requires. It yields a dictionary containing topic_word_distribution, doc_topic_distribution, and doc_lengths. However, you might want to sort the output of TMToolkit.
A word of caution about TMToolkit: it is notorious for downgrading helpful packages such as numpy and pandas, so it is highly recommended to install it in a virtual environment.
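If you would rather avoid the extra dependency altogether, here is a minimal sketch of the Chuang et al. saliency computed directly from a Gensim model with numpy and pandas. The variable names (lda, dictionary, corpus) are assumptions, and the uniform p(t) is a simplification; a more faithful p(t) would come from the document-topic distributions weighted by document lengths.

    import numpy as np
    import pandas as pd

    # Rough sketch of saliency(w) = p(w) * sum_t p(t|w) * log(p(t|w) / p(t))
    # from Chuang et al. (2012). `lda`, `dictionary` and `corpus` are the
    # trained LdaModel, its Dictionary and the bag-of-words corpus.
    phi = lda.get_topics()              # (num_topics, vocab_size), rows are p(w|t)
    num_topics, vocab_size = phi.shape

    # Empirical p(w) from the training corpus
    counts = np.zeros(vocab_size)
    for doc in corpus:
        for word_id, cnt in doc:
            counts[word_id] += cnt
    p_w = counts / counts.sum()

    # p(t): approximated as uniform here; a better estimate would average the
    # per-document topic distributions weighted by document lengths.
    p_t = np.full(num_topics, 1.0 / num_topics)

    # p(t|w) via Bayes' rule: p(t|w) proportional to p(w|t) * p(t)
    p_t_given_w = phi * p_t[:, None]
    p_t_given_w /= p_t_given_w.sum(axis=0, keepdims=True)

    eps = 1e-12                         # avoid log(0)
    distinctiveness = (p_t_given_w * np.log((p_t_given_w + eps) / p_t[:, None])).sum(axis=0)
    saliency = p_w * distinctiveness

    df = pd.DataFrame({'word': [dictionary[i] for i in range(vocab_size)],
                       'saliency': saliency}).sort_values('saliency', ascending=False)

Note that saliency as defined in that paper is a per-word quantity (the distinctiveness term already sums over topics), so this gives one value per word rather than one value per word-topic pair.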
How are keyword clouds constructed?
I know there are a lot of NLP methods, but I'm not sure how they solve the following problem:
You can have several items that each have a list of keywords relating to them.
(In my own program, these items are articles where I can use NLP methods to detect proper nouns, people, places, and possibly subjects. This will be a very long list given a sufficiently sized article, but I will assume that I can winnow it down somehow by comparing articles. How to do this properly is what I am confused about.)
Each item can have a list of keywords, but how are the keywords picked so that they aren't overly specific or overly general across items?
For example, trivially, "the" could be a keyword that appears in a lot of items, while "supercalifragilistic" might appear in only one.
I suppose I could create a heuristic that keeps a word as a keyword only if it appears in at most n% of the items, with n small enough to return a nice sublist (say 5% of 1000 articles is 50, which seems reasonable). However, the issue I have with this approach is that two entirely different collections of items will most likely differ in how interrelated their items are, and I'm throwing away that information.
This is very unsatisfying.
I feel that given the popularity of keyword clouds there must have been a solution created already. I don't want to use a library however as I want to understand and manipulate the assumptions in the math.
If anyone has any ideas please let me know.
Thanks!
EDIT:
freenode/programming/guardianx has suggested https://en.wikipedia.org/wiki/Tf%E2%80%93idf
tf-idf is OK, by the way, but the issue is that the weighting needs to be determined a priori. Given that two distinct collections of documents will have a different inherent similarity between documents, assuming an a priori weighting does not feel correct.
freenode/programming/anon suggested https://en.wikipedia.org/wiki/Word2vec
I'm not sure I want something that uses a neural net (a little complicated for this problem?), but still considering.
Tf-idf is still a pretty standard method for extracting keywords. You can try a demo of a tf-idf-based keyword extractor (which has the idf vector, as you say determined a priori, estimated from Wikipedia). A popular alternative is the TextRank algorithm, based on PageRank, which has an off-the-shelf implementation in Gensim.
If you decide for your own implementation, note that all algorithms typically need plenty of tuning and text preprocessing to work correctly.
The minimum you need to do is remove stopwords that you know can never be a keyword (prepositions, articles, pronouns, etc.). If you want something fancier, you can use, for instance, spaCy to keep only the desired parts of speech (nouns, verbs, adjectives). You can also include frequent multiword expressions (Gensim has a good function for automatic collocation detection) and named entities (spaCy can do that too). You can get better results if you run coreference resolution and substitute pronouns with what they refer to... There are endless options for improvement.
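For illustration, here is a minimal tf-idf keyword sketch using scikit-learn (my own choice, not the demo linked above); the idf here is estimated from your own corpus rather than from Wikipedia:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["first article text ...", "second article text ..."]   # your items go here
    vec = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
    tfidf = vec.fit_transform(docs)
    terms = np.array(vec.get_feature_names_out())

    for i in range(tfidf.shape[0]):
        row = tfidf[i].toarray().ravel()
        top = terms[row.argsort()[::-1][:5]]    # five highest-scoring terms per item
        print(f"item {i}: {', '.join(top)}")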
I am working with the Google Places API, and it provides a list of 97 different location types. I want to reduce the list to a smaller number of categories, as many of the types are groupable. For example, atm and bank into financial; temple, church, mosque, synagogue into worship; school, university into education; subway_station, train_station, transit_station, gas_station into transportation.
At the same time, it should not overgeneralize; for example, pet_store, city_hall, courthouse and restaurant should not be lumped into something like buildings.
I tried quite a few methods to do this. First, I downloaded synonyms of each of the 97 words in the list from multiple dictionaries. Then, I computed the similarity between every pair of words based on the fraction of unique synonyms they share (Jaccard similarity).
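To illustrate, here is a rough sketch of that calculation (the synonym sets below are made up):

    # Rough sketch of the Jaccard calculation; the synonym sets here are made up.
    def jaccard(word_a, word_b, synonyms):
        a, b = synonyms[word_a], synonyms[word_b]
        return len(a & b) / len(a | b) if (a | b) else 0.0

    synonyms = {
        'atm':  {'cash machine', 'cashpoint', 'dispenser'},
        'bank': {'cashpoint', 'depository', 'treasury'},
    }
    print(jaccard('atm', 'bank', synonyms))     # 1 shared out of 5 unique -> 0.2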
But after that, how do I group the words into clusters? Using traditional clustering methods (k-means, k-medoid, hierarchical clustering, and FCM), I am not getting any good clusters (I identified several misclassifications by scanning the results manually).
I even tried the word2vec model trained on Google News data (where each word is expressed as a vector of 300 features), and I do not get good clusters from that either.
You are probably looking for something related to vector space dimensionality reduction. In these techniques, you'll need a corpus of text that uses the locations as words in the text. Dimensionality reduction will then group the terms together. You can do some reading on Latent Dirichlet Allocation and Latent semantic indexing. A good reference is "Introduction to Information Retrieval" by Manning et al., chapter 18. Note that this book is from 2009, so a lot of advances are not captured. As you noted, there has been a lot of work such as word2vec. Another good reference is "Speech and Language Processing" by Jurafsky and Martin, chapter 16.
You need much more data.
No algorithm, without additional data, will ever relate atm and bank to financial, because that requires knowledge of what these terms mean.
Jaccard similarity doesn't have access to such knowledge; it can only work on the words themselves. And by that measure, "river bank" and "bank branch" are very similar.
So don't expect magic to happen by the algorithm. You need the magic to be in the data...
I have been trying some frameworks and algorithms, and I can't find one that does what I want, which is to classify a column of data based on its values.
I tried to use a Bayes algorithm, but it isn't very precise, because I can't expect the exact values being searched for to be in the training set; I can only expect the pattern to be in the training data.
I don't have a background in Machine Learning / AI, but I was looking for a working example before really going deeper into the implementation.
I built a smaller ARFF file to exemplify, and also tried lots of Weka classification algorithms, but none of them gave me good results.
@relation recommend
@attribute class {name,email,taxid,phone}
@attribute text String
@data
name,'Erik Kolh'
name,'Eric Candid'
name,'Allan Pavinan'
name,'Jubaru Guttenberg'
name,'Barabara Bere'
name,'Chuck Azul'
email,'erik#gmail.com'
email,'steven#spielberg.com'
email,'dogs#cats.com'
taxid,'123611216'
taxid,'123545413'
taxid,'562321677'
taxid,'671312678'
taxid,'123123216'
phone,'438-597-7427'
phone,'478-711-7678'
phone,'321-651-5468'
My expectation is to train on a huge dataset like the one above and get recommendations based on the pattern, e.g.:
joao#bing.com -> email
Joao Vitor -> name
400-123-5519 -> phone
Can you please suggest any algorithms, examples or ideas to research?
I couldn't find a good fit, maybe it's just lack of vocabulary.
Thank you!
What you are trying to do is called named entity recognition (NER). Weka is most likely not a real help here. The library Mallet (http://mallet.cs.umass.edu) might be a good fit. I would recommend a Conditional Random Field (CRF) based approach.
If you would like to stay with Weka, you need to change your feature space. Then Naive Bayes will do OK on data like your sample.
E.g., add features for the following (a small sketch follows this list):
whether the value contains only letters
whether it is alphanumeric
whether it is numeric data
the number of digits it contains
whether it starts with a capital letter
... (just be creative)
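Here is a rough sketch of such a feature space, using scikit-learn rather than Weka just for brevity; the particular features are only suggestions along the lines above, not a definitive set:

    import re
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.naive_bayes import BernoulliNB
    from sklearn.pipeline import make_pipeline

    def features(text):
        """Turn a raw value into a small feature dict (illustrative choices only)."""
        return {
            'only_letters': text.replace(' ', '').isalpha(),
            'alphanumeric': text.replace(' ', '').isalnum(),
            'only_digits_dashes': bool(re.fullmatch(r'[\d\-\s]+', text)),
            'digit_count': sum(c.isdigit() for c in text),
            'starts_capitalized': text[:1].isupper(),
            'has_at_or_hash': '@' in text or '#' in text,
            'has_dash': '-' in text,
            'token_count': len(text.split()),
        }

    # Tiny end-to-end example on a few values from the sample data
    X = ['Erik Kolh', 'erik#gmail.com', '123611216', '438-597-7427']
    y = ['name', 'email', 'taxid', 'phone']

    model = make_pipeline(DictVectorizer(), BernoulliNB())
    model.fit([features(x) for x in X], y)
    print(model.predict([features('Joao Vitor'), features('400-123-5519')]))
    # -> something like ['name' 'phone']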
All,
I have been running Y!LDA (https://github.com/shravanmn/Yahoo_LDA) on a set of documents and the results look great (or at least what I would expect). Now I want to use the resulting topics to perform a reverse query against the corpus. Does anyone know if the 3 human-readable text files that are generated after the learntopics executable is run are the final output of this library? If so, are those what I need to parse to perform my queries? I am stuck with a little shoulder shrugging at this point...
Thanks,
Adam
If LDA works the way I think it does (I use a Java implementation, so details may vary), then what you get out are the following three things:
P(word,concept) -- The probability of getting a word given a concept. So, when LDA finishes figuring out what concepts exist within the corpus, this P(w,c) will tell you (in theory) which words map to which concepts.
A very naive method of determining concepts would be to load this file into a matrix and combine all these probabilities for all possible concepts for a test document in some method (add, multiply, Root-mean-squared) and rank order the concepts.
Do note that the above method does not recognize the various biases introduced by weakly represented topics or dominating topics in LDA. To accommodate that, you need more complicated algorithms (Gibbs sampling, for instance), but this will get you some results.
P(concept,document) -- If you are attempting to find the intrinsic concepts in the documents in the corpus, you would look here. You can use the documents as examples of documents that have a particular concept distribution, and compare your documents to the LDA corpus documents... There are uses for this, but it may not be as useful as the P(w,c).
Something else probably relating to the weights of words, documents, or concepts. This could be as simple as a set of concept examples with beta weights (for the concepts), or some other variables that are output from LDA. These may or may not be important depending on what you are doing. (If you are attempting to add a document to the LDA space, having the alpha or beta values -- very important.)
To answer your 'reverse lookup' question, to determine the concepts of the test document, use P(w,c) for each word w in the test document.
To determine which document is the most like the test document, determine the above concepts, then compare them to the concepts for each document found in P(c,d) (using each concept as a dimension in vector-space and then determining a cosine between the two documents tends to work alright).
To determine the similarity between two documents, same thing as above, just determine the cosine between the two concept-vectors.
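As a rough sketch of that reverse lookup (p_wc, p_cd and word_ids are hypothetical arrays parsed from the output files, not anything Y!LDA gives you directly):

    import numpy as np

    # `p_wc` is a word-by-concept matrix parsed from the P(w,c) output,
    # `p_cd` a concept-by-document matrix from P(c,d), and `word_ids` the
    # ids of the words appearing in the test document.
    def concept_vector(p_wc, word_ids):
        # naive combination: just sum the concept probabilities of each word
        return p_wc[word_ids].sum(axis=0)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def most_similar_document(p_wc, p_cd, word_ids):
        test_vec = concept_vector(p_wc, word_ids)
        scores = [cosine(test_vec, p_cd[:, d]) for d in range(p_cd.shape[1])]
        return int(np.argmax(scores))   # index of the closest corpus document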
Hope that helps.
Here's the problem. I have a bunch of large text files with paragraphs and paragraphs of written matter. Each para contains references to a few people (names), and documents a few topics (places, objects).
How do I data mine this pile to assemble some categorised library? In general, I need two things:
I don't know what I'm looking for, so I need a program to get the most used words/multiple words ("Jacob Smith" or "bluewater inn" or "arrow").
Then knowing the keywords, I need a program to help me search for related paras, then sort and refine results (manually by hand).
Your question is a tiny bit open-ended :)
Chances are, you will find modules for whatever analysis you want to do in the UIMA framework:
Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at.
UIMA is made up of many things:
UIMA enables applications to be decomposed into components, for example "language identification" => "language specific segmentation" => "sentence boundary detection" => "entity detection (person/place names etc.)". Each component implements interfaces defined by the framework and provides self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages.
You may also find Open Calais a useful API for text analysis; depending on how big your heap of documents is, it may be more or less appropriate.
If you want it quick and dirty -- create an inverted index that stores all locations of words (basically a big map of words to all file ids in which they occur, paragraphs in those files, lines in the paragraphs, etc). Also index tuples so that given a fileid and paragraph you can look up all the neighbors. This will do what you describe, but it takes quite a bit of tweaking to get it to pull up meaningful correlations (some keywords to start you off on your search: information retrieval, TF-IDF, Pearson correlation coefficient).
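A very rough sketch of such an index (the tokenization and paragraph splitting are deliberately naive):

    from collections import defaultdict

    def build_index(files):
        """files: dict mapping file_id -> full text of that file."""
        index = defaultdict(list)           # word -> [(file_id, para_no, line_no), ...]
        for file_id, text in files.items():
            for para_no, para in enumerate(text.split('\n\n')):
                for line_no, line in enumerate(para.splitlines()):
                    for word in line.lower().split():
                        index[word.strip('.,;:!?"()')].append((file_id, para_no, line_no))
        return index

    index = build_index({'doc1': 'Jacob Smith stayed at the Bluewater Inn.\n\nHe left an arrow.'})
    print(index['bluewater'])               # -> [('doc1', 0, 0)]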
Looks like you're trying to create an index?
I think Learning Perl has information on finding the frequency of words in a text file, so that's not a particularly hard problem.
But do you really want to know that "the" or "a" is the most common word?
If you're looking for some kind of topical index, the words you actually care about are probably down the list a bit, intermixed with more words you don't care about.
You could start by getting rid of "stop words" at the front of the list to filter your results a bit, but nothing would beat associating keywords that actually reflect the topic of the paragraphs, and that requires context.
Anyway, I could be off base, but there you go. ;)
The problem with what you ask is that you don't know what you're looking for. If you had some sort of weighted list of terms that you cared about, then you'd be in good shape.
Semantically, the problem is twofold:
Generally, the most-used words are the least relevant. Even if you use a stop-words file, a lot of chaff remains.
Generally, the least-used words are the most relevant. For example, "bluewater inn" is probably infrequent.
Let's suppose that you had something that did what you ask, and produced a clean list of all the keywords that appear in your texts. There would be thousands of such keywords. Finding "bluewater inn" in a list of 1000s of terms is actually harder than finding it in the paragraph (assuming you don't know what you're looking for) because you can skim the texts and you'll find the paragraph that contains "bluewater inn" because of its context, but you can't find it in a list because the list has no context.
Why don't you talk more about your application and process, and then perhaps we can help you better?
I think what you want to do is called "entity extraction". This Wikipedia article has a good overview and a list of apps, including open source ones. I used to work on one of the commercial tools in the list, but not in a programming capacity, so I can't help you there.
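As one concrete example (spaCy is my own suggestion here, not necessarily one of the tools in that list, and this assumes the small English model has been installed):

    import spacy

    # Install the model first with: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Jacob Smith met us at the Bluewater Inn in Boston.")
    for ent in doc.ents:
        print(ent.text, ent.label_)         # e.g. 'Jacob Smith' PERSON, 'Boston' GPE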
Ned Batchelder gave a great talk at DevDays Boston about Python.
He presented a spell-corrector written in Python that does pretty much exactly what you want.
You can find the slides and source code here:
http://nedbatchelder.com/text/devdays.html
I recommend that you have a look at R. In particular, look at the tm package. Here are some relevant links:
A paper about the package in the Journal of Statistical Software: http://www.jstatsoft.org/v25/i05/paper. The paper includes a nice example analysis of postings from 2006 to the R-devel mailing list (https://stat.ethz.ch/pipermail/r-devel/).
Package homepage: http://cran.r-project.org/web/packages/tm/index.html
Look at the introductory vignette: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
More generally, there are a large number of text mining packages listed in the Natural Language Processing task view on CRAN.