Graph of word frequencies - word-frequency

I want to make a function that takes a text input and produces a word frequency graph like the one in the picture. The picture is taken from a report, so I am not sure how they made it.

There's much more work here than a single function. The following links contain code from projects that perform similar tasks. You could reuse them in your program.
http://www.codeproject.com/Articles/224231/Word-Cloud-Tag-Cloud-Generator-Control-for-NET-Win
http://www.codeproject.com/Articles/19968/WordCloud-A-Squarified-Treemap-of-Word-Frequency
https://github.com/whydoidoit/WordCloud (Silverlight)
Also, check out the accepted answer to "Algorithm to implement a word cloud like Wordle"; it's an answer by the creator of Wordle in which he explains the basic algorithm.

Related

NLP, algorithms for determining if block of text is "similar" to other (after already having matched for keyword)

I've been reading up on NLP as much as I can and searching on here, but haven't found anything that seems to address exactly what I am trying to do. I am pretty new to NLP, having had only some minor exposure before. So far I have gotten the NLP processor I'm using working to the point where I am able to extract the POS from the text.
I am just working with a small sample document and then with one "input phrase" that I am basically trying to find a match for. The code I've written so far basically does this:
It takes the input phrase and the "searchee" (the document being searched) and breaks them down into Lists of individual words, then also gets the POS for each word. The user also puts in one keyword that is in the input phrase (and should be in the document being searched).
Both Lists are searched for the keyword that the user input; then, for the first place this keyword is found in each document, a set number of words before and after are taken (such as 5). These are put into a dataset for processing, so if one article had:
keyword: football
"A lot of sports are fun, football is a great, yet very physical sport."
- Then my process would truncate this down to "are fun, football is a"
My goal is to compare the pieces, such as the "are fun, football is a" for similarity as far as if they are likely to be used in a similar context, etc.
I'm wondering if anyone can point me in the right direction as far as patterns that could be used for this, algorithms, etc. The example above is simplistic, just to give an idea, but I would be planning to make this more complex if I can find the right place to learn more about this. Thanks for any info
It seems you're solving the good old KWIC problem. That can be done with indexing, or just a simple for loop through the words in a text:
# text is a list of words
for i in range(len(text)):
    if text[i] == word:
        # clamp the window so it stays inside the text
        emit(text[max(0, i - 2):i + 3])
Where emit might mean print them, store them in a hashtable, whatever.
What you are trying to do is more of a classic Information Retrieval problem than NLP, though they are very similar. You are building a Term-Frequency dictionary.
I'm not sure what you mean by POS, but you are trying to extract "shingles" of phrases from the text and compare them with other shingles in your corpus. You can compute similarity via cosine similarity or by calculating the String Edit Distance between the phrases.
It may help to review some introductory IR slides to clarify these concepts. Dr. Rao Kambhampati generously makes slides and audio lectures available on his site.
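As a minimal sketch of the cosine similarity mentioned above, the following Python treats each keyword window as a bag of words and compares the two count vectors. The crude tokenizer and the names tokenize and cosine_similarity are illustrative assumptions, not part of any particular library.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep only alphabetic runs; a deliberately crude tokenizer.
    return re.findall(r"[a-z]+", text.lower())

def cosine_similarity(a, b):
    # Represent each snippet as a bag of word counts.
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty snippet is similar to nothing
    return dot / (norm_a * norm_b)

window_a = "are fun, football is a"
window_b = "think that football is a"
score = cosine_similarity(window_a, window_b)  # 3 shared words out of 5 each
```

A score near 1 means the two windows share most of their words; a score near 0 means they share almost none. Weighting the counts (for example with tf-idf) would probably help on real data, since words like "is" and "a" dominate short windows.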
If you just want to generate text, you can look here: http://phpir.com/text-generation. If you want to look for similarities, you can look into a trigram search or, more simply, a wildcard search with a trie: http://phpir.com/tries-and-wildcards. Here is a good article about shingling: http://phpir.com/shingling-near-duplicate-detection

Liblinear how to use it

I'm fairly new to machine learning and text mining in general. I've come across a Ruby library called Liblinear: https://github.com/tomz/liblinear-ruby-swig.
What I want to do so far is train the software to identify whether a text mentions anything related to bicycles or not.
Can someone please highlight the steps that I should be following (i.e: preprocessing text and how), share resources and ideally share a simple example to get me going.
Any help will do, thanks!
The classical approach is:
1. Collect a representative sample of input texts, each labeled as related/unrelated.
2. Divide the sample into training and test sets.
3. Extract all the terms in all the documents of the training set; call this the vocabulary, V.
4. For each document in the training set, convert it into a vector of booleans where the i'th element is true/1 iff the i'th term in the vocabulary occurs in the document.
5. Feed the vectorized training set to the learning algorithm.
Now, to classify a document, vectorize it as in step 4. and feed it to the classifier to get a related/unrelated label for it. Compare this with the actual label to see if it went right. You should be able to get at least some 80% accuracy with this simple method.
To improve this method, replace the booleans with term counts, normalized by document length, or, even better, tf-idf scores.
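The vocabulary and vectorization steps above can be sketched in plain Python as follows. The whitespace tokenizer and the function names are assumptions made for illustration; the resulting vectors would then be handed to Liblinear (or any other linear classifier) together with their labels.

```python
def build_vocabulary(training_docs):
    # Collect every distinct term in the training set, in a fixed order.
    terms = sorted({t for doc in training_docs for t in doc.lower().split()})
    return {term: i for i, term in enumerate(terms)}

def vectorize(doc, vocabulary):
    # Boolean bag of words: 1 if the i-th vocabulary term occurs in the document.
    vector = [0] * len(vocabulary)
    for term in doc.lower().split():
        index = vocabulary.get(term)
        if index is not None:  # terms unseen in training are ignored
            vector[index] = 1
    return vector

train = ["I ride my bicycle to work", "the stock market fell today"]
vocab = build_vocabulary(train)
x = vectorize("my bicycle fell", vocab)  # one entry per vocabulary term
```

Replacing the 1 with a term count divided by document length, or with a tf-idf score, gives the improved variants mentioned above without changing the overall structure.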

How do I approximate "Did you mean?" without using Google?

I am aware of the duplicates of this question:
How does the Google “Did you mean?” Algorithm work?
How do you implement a “Did you mean”?
... and many others.
These questions are interested in how the algorithm actually works. My question is more like: Let's assume Google did not exist or maybe this feature did not exist and we don't have user input. How does one go about implementing an approximate version of this algorithm?
Why is this interesting?
Ok. Try typing "qualfy" into Google and it tells you:
Did you mean: qualify
Fair enough. It uses Statistical Machine Learning on data collected from billions of users to do this. But now try typing this: "Trytoreconnectyou" into Google and it tells you:
Did you mean: Try To Reconnect You
Now this is the more interesting part. How does Google determine this? Have a dictionary handy and guess the most probable words, again using user input? And how does it differentiate between a misspelled word and a sentence?
Now considering that most programmers do not have access to input from billions of users, I am looking for the best approximate way to implement this algorithm and what resources are available (datasets, libraries etc.). Any suggestions?
Assuming you have a dictionary of words (all the words that appear in the dictionary in the worst case, all the phrases that appear in the data in your system in the best case) and that you know the relative frequency of the various words, you should be able to reasonably guess at what the user meant via some combination of the similarity of the word and the number of hits for the similar word. The weights obviously require a bit of trial and error, but generally the user will be more interested in a popular result that is a bit linguistically further away from the string they entered than in a valid word that is linguistically closer but only has one or two hits in your system.
The second case should be a bit more straightforward. You find all the valid words that begin the string ("T" is invalid, "Tr" is invalid, "Try" is a word, "Tryt" is not a word, etc.) and for each valid word, you repeat the algorithm for the remaining string. This should be pretty quick assuming your dictionary is indexed. If you find a result where you are able to decompose the long string into a set of valid words with no remaining characters, that's what you recommend. Of course, if you're Google, you probably modify the algorithm to look for substrings that are reasonably close typos to actual words and you have some logic to handle cases where a string can be read multiple ways with a loose enough spellcheck (possibly using the number of results to break the tie).
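A minimal sketch of the decomposition described for the second case, assuming a tiny hypothetical dictionary (WORDS) and exact matching only. It tries each valid word that begins the remaining string and recurses on the rest, caching results so each position is solved once.

```python
from functools import lru_cache

# Hypothetical tiny dictionary; a real one would hold many thousands of words.
WORDS = frozenset({"try", "to", "reconnect", "you", "re", "connect"})

def segment(text):
    text = text.lower()

    @lru_cache(maxsize=None)
    def split_from(start):
        # Return one list of words covering text[start:], or None if impossible.
        if start == len(text):
            return []
        for end in range(len(text), start, -1):  # prefer longer words first
            if text[start:end] in WORDS:
                rest = split_from(end)
                if rest is not None:
                    return [text[start:end]] + rest
        return None

    return split_from(0)

suggestion = segment("Trytoreconnectyou")
```

Preferring longer words first is only a heuristic for picking one reading; ranking alternative splits by word frequency, as the answer suggests, would be a natural extension.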
From the horse's mouth: How to Write a Spelling Corrector
The interesting thing here is how you don't need a bunch of query logs to approximate the algorithm. You can use a corpus of mostly-correct text (like a bunch of books from Project Gutenberg).
I think this can be done using a spellchecker along with N-grams.
For Trytoreconnectyou, we first check with all 1-grams (all dictionary words) and find a closest match that's pretty terrible. So we try 2-grams (which can be built by removing spaces from phrases of length 2), and then 3-grams and so on. When we try a 4-gram, we find that there is a phrase that is at 0 distance from our search term. Since we can't do better than that, we return that answer as the suggestion.
I know this is very inefficient, but Peter Norvig's post here suggests clearly that Google uses spell correctors to generate its suggestions. Since Google has massive parallelization capabilities, they can accomplish this task very quickly.
An impressive tutorial on how it works can be found here: http://alias-i.com/lingpipe-3.9.3/demos/tutorial/querySpellChecker/read-me.html.
In a few words, it is a trade-off between modifying the query (at the character or word level) and increasing coverage of the searched documents. For example, "aple" leads to 2 million documents, but "apple" leads to 60 million, and the modification is only one character, so it is obvious that you mean "apple".
Datasets/tools that might be useful:
WordNet
Corpora such as the ukWaC corpus
You can use WordNet as a simple dictionary of terms, and you can boost that with frequent terms extracted from a corpus.
You can use the Peter Norvig link mentioned before as a first attempt, but with a large dictionary, this won't be a good solution.
Instead, I suggest you use something like locality sensitive hashing (LSH). This is commonly used to detect duplicate documents, but it will work just as well for spelling correction. You will need a list of terms and strings of terms extracted from your data that you think people may search for - you'll have to choose a cut-off length for the strings. Alternatively if you have some data of what people actually search for, you could use that. For each string of terms you generate a vector (probably character bigrams or trigrams would do the trick) and store it in LSH.
Given any query, you can use an approximate nearest neighbour search on the LSH described by Charikar to find the closest neighbour out of your set of possible matches.
Note: links removed as I'm a new user - sorry.
@Legend - Consider using one of the variations of the Soundex algorithm. It has some known flaws, but it works decently well in most applications that need to approximate misspelled words.
Edit (2011-03-16):
I suddenly remembered another Soundex-like algorithm that I had run across a couple of years ago. In this Dr. Dobb's article, Lawrence Philips discusses improvements to his Metaphone algorithm, dubbed Double Metaphone.
You can find a Python implementation of this algorithm here, and more implementations on the same site here.
Again, these algorithms won't be the same as what Google uses, but for English language words they should get you very close. You can also check out the wikipedia page for Phonetic Algorithms for a list of other similar algorithms.
Take a look at this: How does the Google "Did you mean?" Algorithm work?

Predict next event occurrence, based on past occurrences

I'm looking for an algorithm or example material to study for predicting future events based on known patterns. Perhaps there is a name for this, and I just don't know/remember it. Something this general may not exist, but I'm not a master of math or algorithms, so I'm here asking for direction.
An example, as I understand it would be something like this:
A static event occurs on January 1st, February 1st, March 3rd, April 4th. A simple solution would be to average the days/hours/minutes/something between each occurrence, add that number to the last known occurrence, and have the prediction.
What am I asking for, or what should I study?
There is no particular goal in mind, or any specific variables to account for. This is simply a personal thought, and an opportunity for me to learn something new.
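The averaging approach described in the question can be sketched in a few lines of Python. The year 2023 is an assumption added here so that the example dates are concrete; the function name predict_next is illustrative.

```python
from datetime import date, timedelta

def predict_next(occurrences):
    # Average the gaps between consecutive occurrences and add that to the last one.
    if len(occurrences) < 2:
        raise ValueError("need at least two occurrences")
    gaps = [later - earlier for earlier, later in zip(occurrences, occurrences[1:])]
    mean_gap = sum(gaps, timedelta()) / len(gaps)
    return occurrences[-1] + mean_gap

# The dates from the example, placed in an assumed year.
events = [date(2023, 1, 1), date(2023, 2, 1), date(2023, 3, 3), date(2023, 4, 4)]
prediction = predict_next(events)  # gaps of 31, 30 and 32 days average to 31
```

Note that adding a timedelta to a date ignores any fractional-day part, so with hour- or minute-level data it would be better to work with datetime objects instead.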
I think some topics that might be worth looking into include numerical analysis, specifically interpolation, extrapolation, and regression.
This could be overkill, but Markov chains can lead to some pretty cool pattern recognition stuff. It's better suited to, well, chains of events: the idea is, based on the last N steps in a chain of events, what will happen next?
This is well suited to text: process a large sample of Shakespeare, and you can generate paragraphs full of Shakespeare-like nonsense! Unfortunately, it takes a good deal more data to figure out sparsely-populated events. (Detecting patterns with a period of a month or more would require you to track a chain of at least a full month of data.)
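In Python, here's a rough sketch of a Markov chain builder/prediction script:
import random
from collections import defaultdict

n = 2  # how big a chain you want: number of past events to condition on

def build_map(event_chain):
    chain_map = defaultdict(list)
    # slide a window of n + 1 events over the chain
    for i in range(len(event_chain) - n):
        key = tuple(event_chain[i:i + n])  # tuples are hashable, lists are not
        chain_map[key].append(event_chain[i + n])
    return chain_map

def predict_next_event(whats_happened_so_far, chain_map):
    key = tuple(whats_happened_so_far[-n:])
    candidates = chain_map.get(key)
    # None means this exact history was never seen in the training chain
    return random.choice(candidates) if candidates else None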
There is no single 'best' canned solution; it depends on what you need. For instance, you might want to average the values as you say, but using weighted averages where the old values do not contribute as much to the result as the new ones. Or you might try some smoothing. Or you might try to see if the distribution of events fits a well-known distribution (like normal, Poisson, uniform).
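One possible form of such a weighted average is an exponentially weighted moving average, sketched below. The smoothing factor alpha and the function name are assumptions chosen for illustration; other weighting schemes would fit the description equally well.

```python
def weighted_mean_gap(gaps, alpha=0.5):
    # Exponentially weighted moving average: each step blends the newest gap
    # with the running estimate, so older gaps fade geometrically.
    if not gaps:
        raise ValueError("need at least one gap")
    estimate = gaps[0]
    for gap in gaps[1:]:
        estimate = alpha * gap + (1 - alpha) * estimate
    return estimate

gaps_in_days = [31, 30, 32]  # intervals between past occurrences, oldest first
estimate = weighted_mean_gap(gaps_in_days)
```

A larger alpha makes the estimate react faster to recent gaps; a smaller one smooths out noise at the cost of lagging behind real changes.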
If you have a model in mind (such as the events occur regularly), then applying a Kalman filter to the parameters of that model is a common technique.
The only technique I've worked with for trying to do something like that would be training a neural network to predict the next step in the series. That implies interpreting the issue as a problem in pattern classification, which doesn't seem like that great a fit; I have to suspect there are less fuzzy ways of dealing with it.
The task is very similar to language modelling task where given a sequence of history words the model tries to predict a probability distribution over vocabulary for the next word.
There are open-source tools such as SRILM and NLTK that can simply take your sequences as input sentences (each event_id is a word) and do the job.
If you merely want to find the probability of an event occurring after n days given prior data of its frequency, you'll want to fit to an appropriate probability distribution, which generally requires knowing something about the source of the event (maybe it should be Poisson distributed, maybe Gaussian). If you want to find the probability of an event happening given that prior events happened, you'll want to look at Bayesian statistics and how to build a Markov chain from that.
You should google Genetic Programming Algorithms
They (sort of like the Neural Networks mentioned by Chaos) will enable you to generate solutions programmatically, then have the program modify itself based on some criteria, and create new solutions which are hopefully closer to accurate.
Neural Networks would have to be trained by you, but with genetic programming, the program will do all the work.
Although it is a hell of a lot of work to get them running in the first place!

How does the Google "Did you mean?" Algorithm work? [closed]

I've been developing an internal website for a portfolio management tool. There is a lot of text data, company names etc. I've been really impressed with some search engines' ability to respond very quickly to queries with "Did you mean: xxxx".
I need to be able to intelligently take a user query and respond with not only raw search results but also with a "Did you mean?" response when there is a highly likely alternative answer, etc.
[I'm developing in ASP.NET (VB - don't hold it against me! )]
UPDATE:
OK, how can I mimic this without the millions of 'unpaid users'?
Generate typos for each 'known' or 'correct' term and perform lookups?
Some other more elegant method?
Here's the explanation directly from the source ( almost )
Search 101!
at min 22:03
Worth watching!
Basically, according to Douglas Merrill, former CTO of Google, it works like this:
1) You write a ( misspelled ) word in Google
2) You don't find what you wanted ( don't click on any results )
3) You realize you misspelled the word so you rewrite the word in the search box.
4) You find what you want ( you click in the first links )
This pattern, multiplied millions of times, shows what the most common misspellings are and what the most "common" corrections are.
This way Google can, almost instantaneously, offer spell correction in every language.
This also means that if overnight everyone started to spell night as "nigth", Google would suggest that word instead.
EDIT
@ThomasRutter: Douglas describes it as "statistical machine learning".
They know who corrects the query, because they know which query comes from which user ( using cookies )
If the users perform a query, and only 10% of the users click on a result and 90% go back and type another query ( with the corrected word ) and this time that 90% click on a result, then they know they have found a correction.
They can also know if those are "related" queries of two different, because they have information of all the links they show.
Furthermore, they are now including the context in the spell check, so they can even suggest a different word depending on the context.
See this demo of Google Wave ( @ 44m 06s ) that shows how the context is taken into account to automatically correct the spelling.
Here it is explained how that natural language processing works.
And finally, here is an awesome demo of what can be done by adding automatic machine translation ( @ 1h 12m 47s ) to the mix.
I've added minute-and-second anchors to the videos to skip directly to the content. If they don't work, try reloading the page or scrolling by hand to the mark.
I found this article some time ago: How to Write a Spelling Corrector, written by Peter Norvig (Director of Research at Google Inc.).
It's an interesting read about the "spelling correction" topic. The examples are in Python but it's clear and simple to understand, and I think that the algorithm can be easily translated to other languages.
Below follows a short description of the algorithm.
The algorithm consists of two steps, preparation and word checking.
Step 1: Preparation - setting up the word database
It is best if you can use actual search words and their occurrence.
If you don't have that, a large set of text can be used instead.
Count the occurrence (popularity) of each word.
Step 2: Word checking - finding words that are similar to the one checked
Similar means that the edit distance is low (typically 0-1 or 0-2). The edit distance is the minimum number of inserts/deletes/changes/swaps needed to transform one word to another.
Choose the most popular word from the previous step and suggest it as a correction (if other than the word itself).
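The two steps above can be sketched as follows, in the spirit of Norvig's article but much reduced. The WORD_COUNTS table is a hypothetical stand-in for counts gathered from real search words or a large text, and only edit distance 1 is handled here.

```python
from collections import Counter

# Hypothetical word counts; in practice, count words from a large text corpus.
WORD_COUNTS = Counter({"qualify": 50, "quality": 120, "qualm": 5, "apple": 80})
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    # All strings one insert, delete, change, or adjacent swap away from word.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    swaps = [left + right[1] + right[0] + right[2:]
             for left, right in splits if len(right) > 1]
    changes = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + swaps + changes + inserts)

def correct(word):
    if word in WORD_COUNTS:
        return word  # known words are left alone
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    # Suggest the most popular known word within edit distance 1, if any.
    return max(candidates, key=WORD_COUNTS.get) if candidates else word

fixed = correct("qualfy")
```

Extending the search to distance 2 amounts to applying edits1 to every result of edits1, at the cost of a much larger candidate set.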
For the theory of "did you mean" algorithm you can refer to Chapter 3 of Introduction to Information Retrieval. It is available online for free. Section 3.3 (page 52) exactly answers your question. And to specifically answer your update you only need a dictionary of words and nothing else (including millions of users).
Hmm... I thought that Google used their vast corpus of data (the internet) to do some serious NLP (Natural Language Processing).
For example, they have so much data from the entire internet that they can count the number of times a three-word sequence occurs (known as a trigram). So if they see a sentence like: "pink frugr concert", they could see it has few hits, then find the most likely "pink * concert" in their corpus.
They apparently just do a variation of what Davide Gualano was saying, though, so definitely read that link. Google does of course use all web-pages it knows as a corpus, so that makes its algorithm particularly effective.
My guess is that they use a combination of a Levenshtein distance algorithm and the masses of data they collect regarding the searches that are run. They could pull a set of searches that have the shortest Levenshtein distance from the entered search string, then pick the one with the most results.
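A minimal sketch of that guess: compute the Levenshtein distance from the entered string to each known query, keep the closest ones, and break ties by the number of results. The hit_counts mapping and the max_distance cutoff are hypothetical; nothing here is claimed to be what Google actually does.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, keeping only the previous row.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def suggest(query, hit_counts, max_distance=2):
    # hit_counts maps known queries to a (hypothetical) number of results.
    scored = [(levenshtein(query, known), -hits, known)
              for known, hits in hit_counts.items()]
    close = [s for s in scored if s[0] <= max_distance]
    # Smallest distance wins; among equals, the query with the most hits wins.
    return min(close)[2] if close else None

best = suggest("qualfy", {"qualify": 10, "quality": 100})
```

Scanning every known query is linear in the size of the query log, so a real system would need an index (n-grams, a metric tree, or similar) to narrow the candidates first.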
Normally a production spelling corrector utilizes several methodologies to provide a spelling suggestion. Some are:
Decide on a way to determine whether spelling correction is required. These may include insufficient results, results which are not specific or accurate enough (according to some measure), etc. Then:
Use a large body of text or a dictionary, where all or most entries are known to be correctly spelled. These are easily found online, in places such as LingPipe. Then, to determine the best suggestion, you look for the word that is the closest match based on several measures. The most intuitive one is similar characters. Research and experimentation have shown that matching two- or three-character sequences (bigrams and trigrams) works better. To further improve results, give a higher weight to a match at the beginning or end of the word. For performance reasons, index all these words as trigrams or bigrams, so that when you are performing a lookup, you convert the query to n-grams and look it up via a hashtable or trie.
Use heuristics related to potential keyboard mistakes based on character location. So that "hwllo" should be "hello" because 'w' is close to 'e'.
Use a phonetic key (Soundex, Metaphone) to index the words and lookup possible corrections. In practice this normally returns worse results than using n-gram indexing, as described above.
In each case you must select the best correction from a list. This may be a distance metric such as Levenshtein, the keyboard metric, etc.
For a multi-word phrase, only one word may be misspelled, in which case you can use the remaining words as context in determining a best match.
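The n-gram indexing idea from the list above can be sketched with a plain dictionary used as the hashtable. The padding scheme (two leading spaces, one trailing) is an assumption made here so that the beginning and end of a word contribute their own trigrams; the word list is hypothetical.

```python
from collections import defaultdict

def trigrams(word):
    # Pad so that matches at the start and end of the word also produce trigrams.
    padded = f"  {word} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def build_index(words):
    # Map each trigram to the set of dictionary words containing it.
    index = defaultdict(set)
    for word in words:
        for gram in trigrams(word):
            index[gram].add(word)
    return index

def candidates(query, index, limit=5):
    # Count shared trigrams per indexed word and rank by overlap.
    overlap = defaultdict(int)
    for gram in trigrams(query):
        for word in index.get(gram, ()):
            overlap[word] += 1
    ranked = sorted(overlap, key=lambda w: (-overlap[w], w))
    return ranked[:limit]

index = build_index(["hello", "help", "yellow", "world"])
best = candidates("hwllo", index)
```

The ranked list is only a candidate set; the final choice would still be made with a distance metric such as Levenshtein or the keyboard metric, as described above.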
Use Levenshtein distance, then create a Metric Tree (or Slim tree) to index words.
Then run a 1-Nearest Neighbour query, and you have the result.
Google apparently suggests queries with the best results, not those which are spelled correctly. But in this case, a spell-corrector would probably be more feasible. Of course, you could store some value for every query, based on some metric of how good the results it returns are.
So,
You need a dictionary (English or based on your data)
Generate a word trellis and calculate probabilities for the transitions using your dictionary.
Add a decoder to calculate minimum error distance using your trellis. Of course you should take care of insertions and deletions when calculating distances. The fun thing is that the QWERTY keyboard maximizes the distance if you hit keys close to each other. (cae would turn into car, cay would turn into cat)
Return the word which has the minimum distance.
Then you could compare that to your query database and check if there are better results for other close matches.
Here is the best answer I found, Spelling corrector implemented and described by Google's Director of Research Peter Norvig.
If you want to read more about the theory behind this, you can read his book chapter.
The idea of this algorithm is based on statistical machine learning.
I saw something on this a few years back, so may have changed since, but apparently they started it by analysing their logs for the same users submitting very similar queries in a short space of time, and used machine learning based on how users had corrected themselves.
As a guess... it could
search for words
if it is not found use some algorithm to try to "guess" the word.
Could be something from AI like Hopfield network or back propagation network, or something else "identifying fingerprints", restoring broken data, or spelling corrections as Davide mentioned already ...
Simple. They have tons of data. They have statistics for every possible term, based on how often it is queried, and what variations of it usually yield results the users click... so, when they see you typed a frequent misspelling for a search term, they go ahead and propose the more usual answer.
Actually, if the misspelling is in effect the most frequently searched term, the algorithm will take it for the right one.
Regarding your question of how to mimic the behavior without having tons of data - why not use the tons of data collected by Google? Download the Google search results for the misspelled word and search for "Did you mean:" in the HTML.
I guess that's called mashup nowadays :-)
Apart from the above answers, in case you want to implement something by yourself quickly, here is a suggestion -
Algorithm
You can find the implementation and detailed documentation of this algorithm on GitHub.
Create a Priority Queue with a comparator.
Create a Ternary Search Tree and insert all English words (from Norvig's post) along with their frequencies.
Start traversing the TST and for every word encountered in the TST, calculate its Levenshtein Distance (LD) from input_word.
If LD ≤ 3 then put it in the Priority Queue.
Finally, extract 10 words from the Priority Queue and display them.
Do you mean a spell checker? If it is a spell checker rather than a whole phrase, then I've got a link about spell checking where the algorithm is developed in Python. Check this link.
Meanwhile, I am also working on a project that includes searching databases using text. I guess this would solve your problem.
This is an old question, and I'm surprised that nobody suggested that the OP use Apache Solr.
Apache Solr is a full-text search engine that, among many other features, also provides spellchecking and query suggestions. From the documentation:
By default, the Lucene Spell checkers sort suggestions first by the
score from the string distance calculation and second by the frequency
(if available) of the suggestion in the index.
There is a specific data structure - ternary search tree - that naturally supports partial matches and near-neighbor matches.
Easiest way to figure it out is to Google dynamic programming.
It's an algorithm that's been borrowed from Information Retrieval and is used heavily in modern-day bioinformatics to see how similar two gene sequences are.
Optimal solution uses dynamic programming and recursion.
This is a very solved problem with lots of solutions. Just google around until you find some open source code.

Resources