How to get the probability of topic words in MALLET

I am using LDA in MALLET to explore my data. Running it is not a problem; I just need the probability of the top words for each topic (say, the top 20 words).
I use this command:
bin\mallet train-topics --input tutorial.mallet --num-topics 40 --optimize-interval 20 --output-state topic-state_doc_40t.gz --output-topic-keys tutorial_keys_doc_40t.txt --output-doc-topics tutorial_composition_doc_40t.txt
I do not know which option would output the word probabilities.

You should be able to use the --topic-word-weights-file FILENAME option.
The format for the output file is
topic [tab] word [tab] weight
where weight is proportional to the probability of the word in the topic. Divide by the sum of the weights for a topic to get the normalized probability.
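For example, a minimal Python sketch that normalizes those weights into per-topic probabilities (assuming the file is tab-separated exactly as described above; the filename is a placeholder):
from collections import defaultdict

# topic -> {word: weight}, read from the --topic-word-weights-file output
weights = defaultdict(dict)
with open("topic_word_weights.txt", encoding="utf-8") as f:
    for line in f:
        topic, word, weight = line.rstrip("\n").split("\t")
        weights[topic][word] = float(weight)

# divide by the per-topic sum so the weights become probabilities
for topic, words in weights.items():
    total = sum(words.values())
    probs = {w: v / total for w, v in words.items()}
    top_20 = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:20]
    print(topic, top_20)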

Late answer, but who knows, it might help someone else.
MALLET 2.0.8 has a new feature to output a very interesting diagnostics file containing a bunch of metrics for each topic and its top words. Word probability is one of them.
Simply add --diagnostics-file FILENAME to your train-topics command.
The number of words reported for each topic is the same as the number defined by "--num-top-words".
Here is the link to the detailed documentation: http://mallet.cs.umass.edu/diagnostics.php. If you don't want to re-train your topic model, you can still output the diagnostics file by using your "state" file. Everything is described at the link.
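For example, re-using the command from the question (the diagnostics filename here is just a placeholder):
bin\mallet train-topics --input tutorial.mallet --num-topics 40 --optimize-interval 20 --output-state topic-state_doc_40t.gz --output-topic-keys tutorial_keys_doc_40t.txt --output-doc-topics tutorial_composition_doc_40t.txt --diagnostics-file diagnostics_doc_40t.xml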

Related

Representing a string numerically with different properties than a hashcode

Is there a function, similar to a hashcode, where a string or set of bits is passed in and converted to a number, but where strings that are more similar to one another result in numbers closer to one another?
i.e.
f("abcdefg") - f("abcdef") < f("lorem ipsum dolor") - f("abcde")
The algorithm doesn't have to be perfect, I'm just trying to turn some descriptions into a numerical representation as one more input for an ML experiment. I know this string data has value to the algorithm I'm just trying to come up with some simple ways to turn it into a number.
What I understand from your post is very similar to a topic I am interested in. There is a great tool for the task you are asking about.
The tool I am referring to is word2vec. It gives a vectorization of each word in the string. It was developed by Google. In this model, each word is given a vector based on the vocabulary and its nearby words (the next and previous words). Go through a word2vec tutorial from Google or YouTube and you will get a clear idea of it.
The tool is powerful enough that you can do unexpected things with it. A classic example:
King - Man + Woman = Queen
This tool is mainly used for semantic analysis.
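A minimal sketch of the idea using the gensim library (the vectors file name is a placeholder; you would either download pre-trained vectors or train on your own corpus):
import numpy as np
from gensim.models import KeyedVectors

# load pre-trained word vectors (path and format are placeholders)
vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# similar words end up with nearby vectors; analogies work via vector arithmetic
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# a crude way to turn a whole description into a single numeric vector:
# average the vectors of the words it contains
def describe(text):
    words = [w for w in text.lower().split() if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)
The averaged vector can then be fed to your ML experiment as one more input feature.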

Predicting Missing Word in Sentence

How can I predict a word that's missing from a sentence?
I've seen many papers on predicting the next word in a sentence using an n-grams language model with frequency distributions from a set of training data. But instead I want to predict a missing word that's not necessarily at the end of the sentence. For example:
I took my ___ for a walk.
I can't seem to find any algorithms that take advantage of the words after the blank; I guess I could ignore them, but they must add some value. And of course, a bi/trigram model doesn't work for predicting the first two words.
What algorithm/pattern should I use? Or is there no advantage to using the words after the blank?
Tensorflow has a tutorial to do this: https://www.tensorflow.org/versions/r0.9/tutorials/word2vec/index.html
Incidentally, it does a bit more and generates word embeddings, but to get there it trains a model to predict the (next/missing) word. The tutorial only uses the previous words, but you can apply the same ideas and add the words that follow.
It also has a number of suggestions on how to improve precision (e.g., skip-grams).
Somewhere near the bottom of the tutorial there are links to working source code.
The only thing to worry about is having sufficient training data.
So, when I've worked with bigrams/trigrams, an example query generally looked something like "Predict the missing word in 'Would you ____'". I'd then go through my training data and gather all the sets of three words matching that pattern, and count the things in the blanks. So, if my training data looked like:
would you not do that
would you kindly pull that lever
would you kindly push that button
could you kindly pull that lever
I would get two counts for "kindly" and one for "not", and I'd predict "kindly". All you have to do for your problem is consider the blank in a different place: "____ you kindly" would get two counts for "would" and one for "could", so you'd predict "would". As far as the computer is concerned, there's nothing special about the word order - you can describe whatever pattern you want, from your training data. Does that make sense?
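A minimal Python sketch of that counting idea (the training sentences are the toy examples from above):
from collections import Counter

sentences = [
    "would you not do that",
    "would you kindly pull that lever",
    "would you kindly push that button",
    "could you kindly pull that lever",
]

def predict_blank(left, right, corpus):
    """Count the words that fill the blank between the given left and right context."""
    counts = Counter()
    for s in corpus:
        tokens = s.split()
        for i, tok in enumerate(tokens):
            before = tokens[max(0, i - len(left)):i]
            after = tokens[i + 1:i + 1 + len(right)]
            if before == left and after == right:
                counts[tok] += 1
    return counts

print(predict_blank(["would", "you"], [], sentences))   # Counter({'kindly': 2, 'not': 1})
print(predict_blank([], ["you", "kindly"], sentences))  # Counter({'would': 2, 'could': 1})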

Categorizing Words and Category Values

We were set an algorithm problem in class today, as an "if you figure out a solution you don't have to do this subject" challenge. So of course, we all thought we would give it a go.
Basically, we were provided a DB of 100 words and 10 categories. There is no given mapping between the words and the categories, so it's basically just a list of 100 words and 10 categories.
We have to "place" the words into the correct category - that is, we have to "figure out" how to put the words into the correct category. Thus, we must "understand" the word, and then put it in the most appropriate category algorithmically.
e.g. one of the words is "fishing" and one of the categories is "sport", so "fishing" would go into that category. There is some overlap between words and categories such that some words could go into more than one category.
If we figure it out, we have to increase the sample size and the person with the "best" matching % wins.
Does anyone have ANY idea how to start something like this? Or any resources? Preferably in C#.
Even a keyword DB or something might be helpful. Anyone know of any free ones?
First of all, you need sample text to analyze in order to get the relationships between words.
A categorization with latent semantic analysis is described in Latent Semantic Analysis approaches to categorization.
A different approach would be naive Bayes text categorization. Sample texts with assigned categories are needed. In a learning step the program learns the different categories and the likelihood that a word occurs in a text assigned to a category; see Bayesian spam filtering. I don't know how well that works with single words.
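A rough sketch of the naive Bayes approach in Python with scikit-learn (the labeled sample texts here are made-up placeholders):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy labeled sample text; each snippet is tagged with one of the categories
texts = ["he went fishing on the lake", "the team won the football match",
         "she cooked pasta for dinner", "grill the fish with lemon and herbs"]
labels = ["sport", "sport", "cooking", "cooking"]

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(texts), labels)

# classify single words by treating each one as a tiny document
print(clf.predict(vec.transform(["fishing", "pasta"])))  # ['sport' 'cooking'] with this toy data
As the answer notes, single words carry little evidence on their own, so the quality depends heavily on the sample text.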
Really poor answer (demonstrates no "understanding") - but as a crazy stab you could hit google (through code) for (for example) "+Fishing +Sport", "+Fishing +Cooking" etc (i.e. cross join each word and category) - and let the google fight win! i.e. the combination with the most "hits" gets chosen...
For example (results first):
weather: fish
sport: ball
weather: hat
fashion: trousers
weather: snowball
weather: tornado
With code (TODO: add threading ;-p):
using System;
using System.Diagnostics;
using System.Globalization;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

static void Main() {
    string[] words = { "fish", "ball", "hat", "trousers", "snowball", "tornado" };
    string[] categories = { "sport", "fashion", "weather" };
    using (WebClient client = new WebClient()) {
        foreach (string word in words) {
            var bestCategory = categories.OrderByDescending(
                cat => Rank(client, word, cat)).First();
            Console.WriteLine("{0}: {1}", bestCategory, word);
        }
    }
}

static int Rank(WebClient client, string word, string category) {
    string s = client.DownloadString("http://www.google.com/search?q=%2B" +
        Uri.EscapeDataString(word) + "+%2B" +
        Uri.EscapeDataString(category));
    // scrape the "of about <b>N</b>" result count from the page
    var match = Regex.Match(s, @"of about \<b\>([0-9,]+)\</b\>");
    int rank = match.Success ? int.Parse(match.Groups[1].Value, NumberStyles.Any) : 0;
    Debug.WriteLine(string.Format("\t{0} / {1} : {2}", word, category, rank));
    return rank;
}
Maybe you are all making this too hard.
Obviously, you need an external reference of some sort to rank the probability that X is in category Y. Is it possible that he's testing your "out of the box" thinking and that YOU could be the external reference? That is, the algorithm is a simple matter of running through each category and each word and asking YOU (or whoever sits at the terminal) whether word X is in the displayed category Y. There are a few simple variations on this theme but they all involve blowing past the Gordian knot by simply cutting it.
Or not...depends on the teacher.
So it seems you have a couple of options here, but for the most part I think if you want accurate data you are going to need some outside help. Two options that I can think of would be to make use of a dictionary search, or crowdsourcing.
In regards to a dictionary search, you could just go through the database, query it and parse the results to see if one of the category names is displayed on the page. For example, if you search "red" you will find "color" on the page and likewise, searching for "fishing" returns "sport" on the page.
Another, slightly more outside the box option would be to make use of crowd sourcing, consider the following:
Start by more or less randomly assigning name-value pairs.
Output the results.
Load the results up on Amazon Mechanical Turk (AMT) to get feedback from humans on how well the pairs work.
Input the results of the AMT evaluation back into the system along with the random assignments.
If everything was approved, then we are done.
Otherwise, retain the correct hits and process them to see if any pattern can be established, generate a new set of name-value pairs.
Return to step 3.
Granted, this would entail some financial outlay, but it might also be one of the simplest and most accurate versions of the data you are going to get fairly easily.
You could write a custom algorithm to work specifically on that data; for instance, words ending in 'ing' are verbs (present participles) and could be sports.
Create a set of categorization rules like the one above and see how high an accuracy you get.
EDIT:
Steal the wikipedia database (it's free anyway) and get the list of articles under each of your ten categories. Count the occurrences of each of your 100 words in all the articles under each category, and the category with the highest 'keyword density' of that word (e.g. fishing) wins.
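A minimal Python sketch of that keyword-density idea (how you fetch the Wikipedia article text is up to you; the texts below are placeholders):
# category_articles maps each category to the concatenated text of its articles
category_articles = {
    "sport": "fishing football tennis fishing match stadium ...",
    "cooking": "recipe fish pasta oven grill ...",
}

def best_category(word):
    scores = {}
    for cat, text in category_articles.items():
        tokens = text.lower().split()
        # keyword density: occurrences of the word relative to article length
        scores[cat] = tokens.count(word.lower()) / max(len(tokens), 1)
    return max(scores, key=scores.get)

print(best_category("fishing"))  # "sport" with this toy data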
This sounds like you could use some sort of Bayesian classification as it is used in spam filtering. But this would still require "external data" in the form of some sort of text base that provides context.
Without that, the problem is impossible to solve. It's not an algorithm problem, it's an AI problem. But even AI (and natural intelligence as well, for that matter) needs some sort of input to learn from.
I suspect that the professor is giving you an impossible problem to make you understand at what different levels you can think about a problem.
The key question here is: who decides what a "correct" classification is? What is this decision based on? How could this decision be reproduced programmatically, and what input data would it need?
I am assuming that the problem allows using external data, because otherwise I cannot conceive of a way to deduce the meaning from words algorithmically.
Maybe something could be done with a thesaurus database, and looking for minimal distances between 'word' words and 'category' words?
Fire this teacher.
The only solution to this problem is to already have the solution to the problem. I.e. you need a table of keywords and categories to build your code that puts keywords into categories.
Unless, as you suggest, you add a system which "understands" English. This is the person sitting in front of the computer, or an expert system.
If you're building an expert system and don't even know it, the teacher is not good at giving problems.
Google is forbidden, but they have an almost perfect solution - Google Sets.
Because you need to understand the semantics of the words, you need external data sources. You could try using WordNet. Or you could maybe try using Wikipedia - find the page for every word (or maybe only for the categories) and look for other words appearing on the page or on linked pages.
Yeah, I'd go for the WordNet approach.
Check this tutorial on WordNet-based semantic similarity measurement. You can query WordNet online at princeton.edu (google it), so it should be relatively easy to code a solution for your problem.
Hope this helps,
X.
Interesting problem. What you're looking at is word classification. While you can learn and use traditional information retrieval methods like LSA and categorization built on top of them, I'm not sure if that is your intent (if it is, then by all means do so! :)
Since you say you can use external data, I would suggest using WordNet and its links between words. For instance, using WordNet:
S: (n) fishing, sportfishing (the act of someone who fishes as a diversion)
    direct hypernym: S: (n) outdoor sport, field sport (a sport that is played outdoors)
        direct hypernym: S: (n) sport, athletics (an active diversion requiring physical exertion and competition)
What we see here is a chain of relationships between words. The term fishing relates to outdoor sport, which relates to sport.
Now, if you get the drift, it is possible to use this relationship to compute a probability of classifying "fishing" as "sport" - say, based on the length of the word chain, the number of occurrences, etc. (It should be trivial to find resources on how to construct similarity measures using WordNet. When the prof says not to use Google, I assume he means programmatically and not as a means to find information to read up on!)
As for C# with wordnet - how about http://opensource.ebswift.com/WordNet.Net/
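A rough sketch of the same idea in Python with NLTK's WordNet interface (rather than the C# library linked above); path_similarity simply scores how close two senses are in the hypernym hierarchy:
from nltk.corpus import wordnet as wn  # may need nltk.download('wordnet') first

def similarity(word, category):
    """Best path-based similarity between any noun senses of the two words (0 if no path)."""
    best = 0.0
    for s1 in wn.synsets(word, pos=wn.NOUN):
        for s2 in wn.synsets(category, pos=wn.NOUN):
            best = max(best, s1.path_similarity(s2) or 0.0)
    return best

categories = ["sport", "fashion", "weather"]
print(max(categories, key=lambda c: similarity("fishing", c)))  # expected: "sport"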
My first thought would be to leverage external data. Write a program that google-searches each word, and takes the 'category' that appears first/highest in the search results :)
That might be considered cheating, though.
Well, you can't use Google, but you CAN use Yahoo, Ask, Bing, Ding, Dong, Kong...
I would do a few passes. First, query the 100 words against 2-3 search engines, grab the first y resulting articles (y being a threshold to experiment with; 5 is a good start I think) and scan the text. In particular, search for the 10 categories. If a category appears more than x times (x again being some threshold you need to experiment with), it's a match.
Based on that x threshold (i.e. how many times a category appears in the text) and how many of the top y pages it appears in, you can assign a weight to a word-category pair.
For better accuracy you can then do another pass with those non-Google search engines using the word-category pair (with an AND relationship) and apply the number of resulting pages to the weight of that pair. Then simply assume the word-category pair with the highest weight is the right one (assuming you even have more than one option). You can also assign a word to multiple categories if the weights are close enough (a z threshold, maybe).
Based on that you can introduce any number of words and any number of categories. And you'll win your challenge.
I also think this method is good for evaluating the weight of potential AdWords in advertising, but that's another topic...
Good luck
Harel
Use WordNet (either online or as a download), and count the number of relationships you have to follow between each word and each category.
Use an existing categorized large data set such as RCV1 to train your system of choice. You could do worse than to start reading existing research and benchmarks.
Apart from Google there exist other "encyclopedic" datasets you can build on, some of them hosted as public data sets on Amazon Web Services, such as a complete snapshot of the English-language Wikipedia.
Be creative. There is other data out there besides Google.
My attempt would be to use the toolset of CRM114 to provide a way to analyze a big corpus of text. Then you can utilize the matchings from it to give a guess.
My naive approach:
Create a huge text file like this (read the article for inspiration)
For every word, scan the text and, whenever you match that word, count the "categories" that appear within N positions (a maximum radius) to the left and right of it.
The word is likely to belong in the category with the greatest count.
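A minimal Python sketch of that windowed counting (the corpus string is a placeholder for whatever big text you collected):
from collections import Counter

def categorize(word, categories, text, n=10):
    """Whenever `word` appears, count category names within n tokens on either side."""
    tokens = text.lower().split()
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == word:
            window = tokens[max(0, i - n):i] + tokens[i + 1:i + 1 + n]
            for cat in categories:
                counts[cat] += window.count(cat)
    return counts.most_common(1)[0][0] if counts else None

corpus = "he enjoys fishing which is a relaxing sport he practices on weekends"
print(categorize("fishing", ["sport", "fashion", "weather"], corpus))  # "sport"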
Scrape delicious.com and search for each word, looking at collective tag counts, etc.
Not much more I can say about that, but delicious is old, huge, incredibly-heavily tagged and contains a wealth of current relevant semantic information to draw from. It would be very easy to build a semantics database this way, using your word list as a basis from scraping.
The knowledge is in the tags.
Since you don't have to attend the subject if you solve this 'riddle', I don't think it's supposed to be easy.
Nevertheless, I would do something like this (described in a very simplistic way):
Build up a neural network to which you feed some input (an e-book, or several e-books)
=> no Google needed
This network classifies words (neural networks are great for 'fuzzy' classification). You may be able to tell which word belongs to which category simply from the co-occurrences in the text ('fishing' is likely to be mentioned near 'sports').
After some training, the neural network should "link" the words to the categories for you.
You might be able to put the WordNet database to use, create some metric to determine how closely linked two words (the word and the category) are, and then choose the best category to put the word in.
You could implement a learning algorithm to do this using a Monte Carlo method and human feedback. Have the system randomly categorize words, then ask you to vote them as "match" or "not match". If it matches, the word is categorized and can be eliminated. If not, the system excludes it from that category in future iterations, since it knows it doesn't belong there. This will get very accurate results.
This will work for the 100-word problem fairly easily. For the larger problem, you could combine this with educated guessing to make the process work faster. Here, as many people above have mentioned, you will need external sources. The Google method would probably work the best, since Google has already done a ton of work on it, but barring that you could, for example, pull data from your Facebook account using the Facebook APIs and try to figure out which words are statistically more likely to appear with previously categorized words.
Either way, though, this cannot be done without some kind of external input that at some point came from a human. Unless you want to be cheeky and, for example, define the categories by some serialized value contained in the ASCII text of the name :P

What is search.twitter.com's "trending topics" algorithm?

What algorithm does Twitter use to determine the 10 topics that you can see at search.twitter.com? I would like to implement that algorithm, and I would also like to show the 50 most popular topics (instead of 10). Can you describe the most efficient algorithm?
Thanks!
(Twitter's API can be found at http://apiwiki.twitter.com/REST%20API%20Documentation)
Also, I would like to be able to implement the algorithm by searching through the public timeline: http://twitter.com/statuses/public_timeline.rss
Twitter's trending algorithm is not just volume of keywords. That's part of it, but there's also a decay factor so that "Justin Bieber" isn't top trending forever.
This post on Quora backs this up: http://www.quora.com/Trending-Topics-Twitter/What-is-the-basis-of-Twitters-current-Trending-Topics-algorithm?q=trending+algorithm
Decay is typically done by using the relative age of the post in the algorithm, giving more weight to newer topics/posts/etc.
see also http://www.quora.com/What-tools-algorithms-or-data-structures-would-you-use-to-build-a-Trending-Topics-algorithm-for-a-high-velocity-stream?q=trending+algorithm
So what Twitter probably does is count the number of mentions of a particular term, minus stop words (stop words like: do, me, you, I, not, on, etc.).
So "the cat is out of the bag" and "my dog ate my cat" would mean that cat, dog and bag are the terms it extracted (the rest are all stop words).
And it then counts 'cat' as 2 references, so 'cat' would be a trending topic in this case.
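Twitter's real algorithm isn't public, but here is a minimal Python sketch of the idea described above: raw term counts minus stop words, with an exponential time decay so newer mentions weigh more (the stop-word list and half-life value are made up):
import time
from collections import Counter

STOP_WORDS = {"the", "is", "out", "of", "my", "do", "me", "you", "i", "not", "on", "a", "ate"}
HALF_LIFE = 3600.0  # seconds; tune to taste

def trending(tweets, top_n=10, now=None):
    """tweets: iterable of (timestamp, text); returns the top_n decayed term scores."""
    now = now or time.time()
    scores = Counter()
    for ts, text in tweets:
        weight = 0.5 ** ((now - ts) / HALF_LIFE)  # older tweets contribute less
        for term in set(text.lower().split()):
            if term not in STOP_WORDS:
                scores[term] += weight
    return scores.most_common(top_n)

now = time.time()
print(trending([(now - 60, "the cat is out of the bag"),
                (now - 30, "my dog ate my cat")], top_n=3, now=now))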

How do I compare phrases for similarity?

When entering a question, stackoverflow presents you with a list of questions that it thinks likely to cover the same topic. I have seen similar features on other sites or in other programs, too (Help file systems, for example), but I've never programmed something like this myself. Now I'm curious to know what sort of algorithm one would use for that.
The first approach that comes to my mind is splitting the phrase into words and looking for phrases containing these words. Before you do that, you probably want to throw away insignificant words (like 'the', 'a', 'does', etc.), and then you will want to rank the results.
Hey, wait - let's do that for web pages, and then we can have a ... watchamacallit ... - a "search engine", and then we can sell ads, and then ...
No, seriously, what are the common ways to solve this problem?
One approach is the so-called bag-of-words model.
As you guessed, first you count how many times words appear in the text (usually called a document in NLP lingo). Then you throw out the so-called stop words, such as "the", "a", "or" and so on.
You're left with words and word counts. Do this for a while and you get a comprehensive set of words that appear in your documents. You can then create an index for these words:
"aardvark" is 1, "apple" is 2, ..., "z-index" is 70092.
Now you can take your word bags and turn them into vectors. For example, if your document contains two references for aardvarks and nothing else, it would look like this:
[2 0 0 ... 70k zeroes ... 0].
After this you can compute the "angle" between two vectors via the dot product (i.e. cosine similarity). The smaller the angle, the closer the documents are.
This is a simple version and there are other, more advanced techniques. May the Wikipedia be with you.
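A minimal Python sketch of that bag-of-words / cosine-similarity idea (the stop-word set is a made-up placeholder):
import math
from collections import Counter

STOP_WORDS = {"the", "a", "or", "and", "does", "is", "for"}

def bag(text):
    # word counts with stop words removed
    return Counter(w for w in text.lower().split() if w not in STOP_WORDS)

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(cosine(bag("how do I compare phrases for similarity"),
             bag("comparing two phrases for similarity")))  # closer to 1 means more similar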
@Hanno, you should try the Levenshtein distance algorithm. Given an input string s and a list of strings t, iterate over each string u in t and return the one with the minimum Levenshtein distance.
http://en.wikipedia.org/wiki/Levenshtein_distance
See a Java implementation example in http://www.javalobby.org/java/forums/t15908.html
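For reference, a compact Python sketch of the same idea (a standard dynamic-programming edit distance, not tied to the Java example linked above):
def levenshtein(s, t):
    """Classic dynamic-programming edit distance between strings s and t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def closest(s, candidates):
    return min(candidates, key=lambda u: levenshtein(s, u))

print(closest("similarty", ["similarity", "complexity", "probability"]))  # "similarity"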
To augment the bag-of-words idea:
There are a few ways you can also pay some attention to n-grams, strings of two or more words kept in order. You might want to do this because a search for "space complexity" is much more than a search for things with "space" AND "complexity" in them, since the meaning of this phrase is more than the sum of its parts; that is, if you get a result that talks about the complexity of outer space and the universe, this is probably not what the search for "space complexity" really meant.
A key idea from natural language processing here is that of mutual information, which allows you (algorithmically) to judge whether or not a phrase is really a specific phrase (such as "space complexity") or just words which are coincidentally adjacent. Mathematically, the main idea is to ask, probabilistically, if these words appear next to each other more often than you would guess by their frequencies alone. If you see a phrase with a high mutual information score in your search query (or while indexing), you can get better results by trying to keep these words in sequence.
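To illustrate the mutual-information idea, here is a small Python sketch that scores adjacent word pairs by pointwise mutual information, log(P(x,y) / (P(x) P(y))) (the sample sentence is made up):
import math
from collections import Counter

def pmi_scores(tokens):
    """Pointwise mutual information for each adjacent word pair in tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    return {
        (x, y): math.log((c / (n - 1)) / ((unigrams[x] / n) * (unigrams[y] / n)))
        for (x, y), c in bigrams.items()
    }

tokens = "the space complexity of the algorithm and the space complexity of the search".split()
scores = pmi_scores(tokens)
print(scores[("space", "complexity")])  # higher: these words stick together
print(scores[("of", "the")])            # lower: frequent words that happen to be adjacent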
From my (rather small) experience developing full-text search engines: I would look up questions which contain some words from the query (in your case, the query is your question).
Sure, noise words should be ignored, and we might want to check the query for 'strong' words like 'ASP.Net' to narrow down the search scope.
Inverted indexes (http://en.wikipedia.org/wiki/Index_(search_engine)#Inverted_indices) are commonly used to find questions containing the words we are interested in.
After finding questions with words from the query, we might want to calculate the distance between the words of interest within each question, so that a question containing the text 'phrases similarity' ranks higher than one containing 'discussing similarity, you hear the following phrases...'.
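As a tiny illustration, a Python sketch of an inverted index used to pull candidate questions sharing words with the query (the sample questions are placeholders):
from collections import defaultdict

def build_index(documents):
    """Map each word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in set(text.lower().split()):
            index[word].add(doc_id)
    return index

questions = ["how do I compare phrases for similarity",
             "predicting the missing word in a sentence",
             "phrase similarity with cosine distance"]
index = build_index(questions)

query = "compare phrase similarity"
candidates = set().union(*(index.get(w, set()) for w in query.lower().split()))
print(candidates)  # {0, 2} with this toy data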
Here is a bag-of-words solution with TfidfVectorizer in Python 3:
# from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords  # this import was missing in the original snippet

s = set(stopwords.words('english'))

# train_x is a list of raw phrases and train_y their labels (assumed to be defined elsewhere)
train_x_cleaned = []
for i in train_x:
    # remove stop words from each phrase
    sentence = filter(lambda w: w not in s, i.split())
    train_x_cleaned.append(' '.join(sentence))

vectorizer = TfidfVectorizer(binary=True)
train_x_vectors = vectorizer.fit_transform(train_x_cleaned)
print(vectorizer.get_feature_names_out())
print(train_x_vectors.toarray())

from sklearn import svm
clf_svm = svm.SVC(kernel='linear')
clf_svm.fit(train_x_vectors, train_y)

test_x = vectorizer.transform(["test phrase 1", "test phrase 2", "test phrase 3"])
print(type(test_x))
clf_svm.predict(test_x)
