I have a database where users upload articles.
I would like to build an algorithm so that my web app can suggest texts similar to the one the user is reading.
I saw some examples like Levenshtein distance, but those algorithms measure distance between strings, not between whole articles. Is there a way to extract the most significant keywords from a text? Of course, I understand that "most significant" is an ambiguous term.
How do other sites manage this?
thanks a lot
Is there a way to extract most significant keywords from text?
Yes. Basically, you extract all the words from the text, sort the words by frequency, eliminate the common words (a, an, the, etc.) by matching them against a common word dictionary, and save the top 20 or more words, along with their frequency, from each article.
The number of top words you save is related to both the length of the article and the subject matter of all the articles. Fewer words work for general-interest articles, while more words are necessary for special-interest articles, like answers to programming questions.
Articles that match more than half of the top words could be considered related. The degree of relatedness would depend on the number of matching top words and the frequencies of the matching words.
You could calculate a relatedness score by multiplying the frequencies of each matched word from the two articles and summing all the products. The higher the score, the more the articles are related.
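As a minimal sketch of that scoring idea (the stop-word list and the choice of 20 top words are just placeholders you would tune):

    from collections import Counter

    STOP_WORDS = {"a", "an", "the", "of", "and", "or", "to", "in", "is", "it"}

    def top_words(text, n=20):
        """Return the n most frequent non-stop-words with their counts."""
        words = [w for w in text.lower().split() if w.isalpha() and w not in STOP_WORDS]
        return dict(Counter(words).most_common(n))

    def relatedness(text_a, text_b, n=20):
        """Sum the products of frequencies of the top words the two articles share."""
        a, b = top_words(text_a, n), top_words(text_b, n)
        shared = set(a) & set(b)
        return sum(a[w] * b[w] for w in shared)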
You might try to correct the 'weight' of each word by the frequency with which it appears across all the articles. The best indicators of similarity would then be words that appear only in the two compared articles and nowhere else. This would automatically disregard the common words (a, an, the, etc.) mentioned by @Gilbert Le Blanc.
Related
I have about 3 million words coming from many research papers.
I want to filter those papers according to their metadata.
The papers are about cars, books, and foods.
For example, I have a document with the metadata Toyota,
and another document with the metadata Toiota.
Notice that Toiota is meant to be the same as Toyota.
What approaches are available to solve this problem?
What I have tried
I used a stemmer to take the root of each word:
I stem the first word to get its root,
I stem the second word to get its root,
and then compare the two roots.
My problem
The stemmer only works on words that have a meaning, for example eating, eat, ate. But when a word has no meaning, like Toyota, its root is the exact same word.
Another problem
The stemmer also doesn't work in this case:
"united states" doesn't equal "US", but logically they are the same.
Does anyone have a better approach?
I don't know which Stack Overflow tags fit my problem, so you are welcome to add tags.
Update 1
I want to search for this problem on Google, but I don't know the correct words to use when searching. Could you help me, please?
If you want Toiota to mean the same as Toyota, there are a few options:
Hard code the translation
Auto "spell check" the query/document. If Toiota does not exist in your dictionary, then return the closest word, if it's close. See Norvig's spelling corrector.
Compare documents on character similarity rather than exact word matches: {t,o,y,o,t,a} has 83% overlap with {t,o,i,o,t,a} (see the sketch just below). Check out Jaro-Winkler distance too.
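For a rough feel of that character-similarity option, the standard-library difflib gives this kind of overlap ratio (it reports about 83% for Toyota/Toiota):

    from difflib import SequenceMatcher

    def char_similarity(a, b):
        """Ratio of matching characters between two strings (0.0 - 1.0)."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    print(char_similarity("Toyota", "Toiota"))  # ~0.83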
For US/United States you probably want a synonym file (of countries and their abbreviations), and add synonyms for each document. Another approach would be to take words, auto-abbreviate them, and add those to your index. Example:
abbrev('United States') = {'united', 'states', 'us'} -- take the first letter of each word in multi-word names
abbrev('Canada') = {'canada', 'can'} -- take the first three letters of single-word names
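A small sketch of that abbreviation idea; the rules simply mirror the two examples above and are only a starting point:

    def abbrev(name):
        """Return a set of plausible index terms: the words themselves, an
        acronym for multi-word names, and a 3-letter prefix for single words."""
        words = name.lower().split()
        terms = set(words)
        if len(words) > 1:
            terms.add("".join(w[0] for w in words))   # 'united states' -> 'us'
        else:
            terms.add(words[0][:3])                   # 'canada' -> 'can'
        return terms

    print(abbrev("United States"))  # {'united', 'states', 'us'}
    print(abbrev("Canada"))         # {'canada', 'can'}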
We have a list of about 150,000 words, and when the user enters a free text, the system should present a list of words from the dictionary, that are very close to words in the free text.
For instance, the user enters: "I would like to buy legoe toys in Walmart". If the dictionary contains "Lego", "Car" and "Walmart", the system should present "Lego" and "Walmart" in the list. "Walmart" is obvious because it is identical to a word in the sentence, but "Lego" is similar enough to "Legoe" to be mentioned, too. However, nothing is similar to "Car", so that word is not shown.
Showing the list should happen in real time, meaning that when the user has entered the sentence, the list of words must be present on the screen. Does anybody know a good algorithm for this?
The dictionary actually contains concepts which may include a space. For instance, "Lego spaceship". The perfect solution recognizes these multi-word concepts, too.
Any suggestions are appreciated.
Take a look at http://norvig.com/spell-correct.html for a simple algorithm. The article uses Python, but there are links to implementations in other languages at the end.
You will be doing quite a few lookups of words against a fixed dictionary. Therefore you need to prepare your dictionary. Logically, you can quickly eliminate candidates that are "just too different".
For instance, the words car and dissimilar may share a suffix, but they're obviously not misspellings of each other. Now why is that so obvious to us humans? For starters, the length is entirely different. That's an immediate disqualification (but with one exception - below). So, your dictionary should be sorted by word length. Match your input word with words of similar length. For short words that means +/- 1 character; longer words should have a higher margin (exactly how well can your demographic spell?)
Once you've restricted yourself to candidate words of similar length, you'd want to strip out words that are entirely dissimilar. With this I mean that they use entirely different letters. This is easiest to compare if you sort the letters in a word alphabetically. E.g. car becomes "acr"; rack becomes "ackr". You'll do this in preprocessing for your dictionary, and for each input word. The reason is that it's cheap to determine the (size of the) difference of two sorted sets (see the sketch below). car and rack have a difference of size 1; car and hat have a difference of size 4. This narrows down your set of candidates even further. Note that for longer words, you can bail out early when you've found too many differences. E.g. dissimilar and biography have a total difference of 13, but considering their lengths (10 and 9) you can probably bail out once you've found 5 differences.
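A sketch of that letter-difference prefilter using Counter multisets; the "difference" here is the total number of letters present in one word but not covered by the other:

    from collections import Counter

    def letter_difference(a, b):
        """Total count of letters in one word that the other word does not cover."""
        ca, cb = Counter(a), Counter(b)
        return sum(((ca - cb) + (cb - ca)).values())

    print(letter_difference("car", "rack"))              # 1
    print(letter_difference("dissimilar", "biography"))  # 13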
This leaves you with a set of candidate words that use almost the same letters and are almost the same length. At this point you can start using more refined algorithms; you don't need to run 150,000 comparisons per input word anymore.
Now, for the length exception mentioned before: the problem is with "words" like greencar. It doesn't really match a word of length 8, and yet to humans it's quite obvious what was meant. You can't really break the input word at every possible boundary and run an additional N-1 inexact matches against both halves. However, it is feasible to check for just a missing space. Just do a lookup for all possible prefixes. This is efficient because you'll be using the same part of the dictionary over and over, e.g. g, gr, gre, gree, etc. For every prefix that you've found, check if the remaining suffix is also in the dictionary, e.g. reencar, eencar. If both halves of the input word are in the dictionary, but the word itself isn't, you can assume a missing space.
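A minimal sketch of the missing-space check, assuming dictionary is your preloaded set of words:

    def missing_space_split(word, dictionary):
        """If 'word' is two dictionary words run together, return them; else None."""
        for i in range(1, len(word)):
            prefix, suffix = word[:i], word[i:]
            if prefix in dictionary and suffix in dictionary:
                return prefix, suffix
        return None

    dictionary = {"green", "car", "walmart"}
    print(missing_space_split("greencar", dictionary))  # ('green', 'car')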
You would likely want to use an algorithm which calculates the Levenshtein distance.
However, since your data set is quite large, and you'll be comparing lots of words against it, a direct implementation of typical algorithms that do this won't be practical.
In order to find words in a reasonable amount of time, you will have to index your set of words in some way that facilitates fuzzy string matching.
One of these indexing methods would be to use a suffix tree. Another approach would be to use n-grams.
I would lean towards using a suffix tree, since I find it easier to wrap my head around and better suited to the problem.
It might be of interest to look at some algorithms such as the Levenshtein distance, which can calculate the amount of difference between two strings.
I'm not sure what language you are thinking of using but PHP has a function called levenshtein that performs this calculation and returns the distance. There's also a function called similar_text that does a similar thing. There's a code example here for the levenshtein function that checks a word against a dictionary of possible words and returns the closest words.
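Since the rest of this thread is language-agnostic, here is a rough Python equivalent: a plain dynamic-programming Levenshtein distance used to pick the closest dictionary word (it is not tuned for a 150,000-word dictionary; see the indexing answers above for that):

    def levenshtein(a, b):
        """Classic dynamic-programming edit distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def closest_word(word, dictionary):
        """Return the dictionary word with the smallest edit distance."""
        return min(dictionary, key=lambda d: levenshtein(word.lower(), d.lower()))

    print(closest_word("legoe", ["lego", "car", "walmart"]))  # 'lego'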
I hope this gives you a bit of insight into how a solution could work!
I'm trying to devise a method that will be able to classify a given number of english words into 2 sets - "rare" and "common" - the reference being to how much they are used in the language.
The number of words I would like to classify is bounded - currently at around 10,000 - and includes everything from articles to proper nouns that could be borrowed from other languages (and would thus be classified as "rare"). I've done some frequency analysis within the corpus, and I have a distribution of these words (ranging from a single use to at most about 100).
My intuition for such a system was to use word lists (such as the BNC word frequency corpus, WordNet, and internal corpus frequency) and assign weights to a word's occurrence in each of them.
For instance, a word that has a mid-level frequency in the corpus (say 50) but appears in a word list W can be regarded as common, since it's one of the most frequent in the entire language. My question is: what's the best way to create a weighted score for something like this? Should I go discrete or continuous? In either case, what kind of classification system would work best?
Or do you recommend an alternative method?
Thanks!
EDIT:
To answer Vinko's question on the intended use of the classification -
These words are tokenized from a phrase (eg: book title) - and the intent is to figure out a strategy to generate a search query string for the phrase, searching a text corpus. The query string can support multiple parameters such as proximity, etc - so if a word is common, these params can be tweaked.
To answer Igor's question -
(1) how big is your corpus?
Currently, the list is limited to 10k tokens, but this is just a training set. It could go up to a few hundred thousand once I start testing it on the test set.
(2) Do you have some kind of expected proportion of common/rare words in the corpus?
Hmm, I do not.
Assuming you have a way to evaluate the classification, you can use the "boosting" approach to machine learning. Boosting combines a set of weak classifiers into a strong classifier.
Say, you have your corpus and K external wordlists you can use.
Pick N frequency thresholds. For example, you may have 10 thresholds: 0.1%, 0.2%, ..., 1.0%.
For your corpus and each of the external word lists, create N "experts", one expert per threshold per wordlist/corpus, total of N*(K+1) experts. Each expert is a weak classifier, with a very simple rule: if the frequency of the word is higher than its threshold, they consider the word to be "common". Each expert has a weight.
The learning process is as follows: assign the weight 1 to each expert. For each word in your corpus, make the experts vote. Sum their votes: 1 * weight(i) for "common" votes and (-1) * weight(i) for "rare" votes. If the result is positive, mark the word as common.
Now, the overall idea is to evaluate the classification and increase the weight of experts that were right and decrease the weight of the experts that were wrong. Then repeat the process again and again, until your evaluation is good enough.
The specifics of the weight adjustment depend on how you evaluate the classification. For example, if you don't have per-word evaluation, you may still evaluate the classification as having "too many common" or "too many rare" words. In the first case, promote all the pro-"rare" experts and demote all the pro-"common" experts, or vice versa.
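A bare-bones sketch of the voting step described above; the experts list and the frequencies lookup are assumed inputs you would build from your corpus and word lists:

    def classify_word(word, experts, frequencies):
        """experts: list of (source_name, threshold, weight) tuples.
        frequencies: dict mapping (source_name, word) -> relative frequency.
        Returns 'common' if the weighted vote is positive, else 'rare'."""
        score = 0.0
        for source, threshold, weight in experts:
            freq = frequencies.get((source, word), 0.0)
            vote = 1 if freq > threshold else -1   # each expert's simple rule
            score += weight * vote
        return "common" if score > 0 else "rare"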
Your distribution is most likely a Pareto distribution (a superset of Zipf's law, as mentioned above). I am shocked that the most common word is used only 100 times - does that include "a" and "the" and words like that? You must have a small corpus if that is the case.
Anyways, you will have to choose a cutoff for "rare" and "common". One potential choice is the mean expected number of appearances (see the linked wiki article above to calculate the mean). Because of the "fat tail" of the distribution, a fairly small number of words will have appearances above the mean -- these are the "common". The rest are "rare". This will have the effect that many more words are rare than common. Not sure if that is what you are going for but you can just move the cutoff up and down to get your desired distribution (say, all words with > 50% of expected value are "common").
While this is not an answer to your question, you should know that you are reinventing the wheel here.
Information Retrieval experts have devised ways to weight search words according to their frequency. A very popular weight is TF-IDF, which uses a word's frequency in a document and its frequency in a corpus. TF-IDF is also explained here.
An alternative score is the Okapi BM25, which uses similar factors.
See also the Lucene Similarity documentation for how TF-IDF is implemented in a popular search library.
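For reference, a quick sketch of the basic TF-IDF weight (this is one common variant; real libraries such as Lucene use their own tuned formulas):

    import math

    def tf_idf(term_count, doc_length, num_docs, docs_containing_term):
        """tf = term frequency within the document, idf = log(N / df)."""
        tf = term_count / doc_length
        idf = math.log(num_docs / (1 + docs_containing_term))  # +1 avoids division by zero
        return tf * idf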
Here at work, we often need to find the string from a list of strings that is the closest match to some other input string. Currently, we are using the Needleman-Wunsch algorithm. The algorithm often returns a lot of false positives (if we set the minimum score too low), sometimes it doesn't find a match when it should (when the minimum score is too high), and, most of the time, we need to check the results by hand. We thought we should try other alternatives.
Do you have any experiences with the algorithms?
Do you know how the algorithms compare to one another?
I'd really appreciate some advice.
PS: We're coding in C#, but you shouldn't care about it - I'm asking about the algorithms in general.
Oh, I'm sorry I forgot to mention that.
No, we're not using it to match duplicate data. We have a list of strings that we are looking for - we call it search-list. And then we need to process texts from various sources (like RSS feeds, web-sites, forums, etc.) - we extract parts of those texts (there are entire sets of rules for that, but that's irrelevant) and we need to match those against the search-list. If the string matches one of the strings in search-list - we need to do some further processing of the thing (which is also irrelevant).
We cannot perform a normal comparison because the strings extracted from the outside sources, most of the time, include some extra words, etc.
Anyway, it's not for duplicate detection.
OK, Needleman-Wunsch (NW) is a classic end-to-end ("global") aligner from the bioinformatics literature. It was long ago available as "align" and "align0" in the FASTA package. The difference was that the "0" version wasn't as biased about avoiding end-gapping, which often made it easier to favor high-quality internal matches. Smith-Waterman, as I suspect you're aware, is a local aligner and is the original basis of BLAST. FASTA had its own local aligner as well that was slightly different. All of these are essentially heuristic methods for estimating Levenshtein distance relative to a scoring metric for individual character pairs (in bioinformatics, often given by Dayhoff/"PAM", Henikoff & Henikoff, or other matrices; when applied to natural language these are usually replaced with something simpler that more reasonably reflects replacements in linguistic word morphology).
Let's not be precious about labels: Levenshtein distance, as referenced in practice at least, is basically edit distance, and you have to estimate it because it's not feasible to compute it generally, and it's expensive to compute exactly even in interesting special cases. The water gets deep quickly there, and thus we have heuristic methods of long and good repute.
Now as to your own problem: several years ago, I had to check the accuracy of short DNA reads against reference sequence known to be correct and I came up with something I called "anchored alignments".
The idea is to take your reference string set and "digest" it by finding all locations where a given N-character substring occurs. Choose N so that the table you build is not too big but also so that substrings of length N are not too common. For small alphabets like DNA bases, it's possible to come up with a perfect hash on strings of N characters and make a table and chain the matches in a linked list from each bin. The list entries must identify the sequence and start position of the substring that maps to the bin in whose list they occur. These are "anchors" in the list of strings to be searched at which an NW alignment is likely to be useful.
When processing a query string, you take the N characters starting at some offset K in the query string, hash them, look up their bin, and if the list for that bin is nonempty then you go through all the list records and perform alignments between the query string and the search string referenced in the record. When doing these alignments, you line up the query string and the search string at the anchor and extract a substring of the search string that is the same length as the query string and which contains that anchor at the same offset, K.
If you choose a long enough anchor length N, and a reasonable set of values of offset K (they can be spread across the query string or be restricted to low offsets) you should get a subset of possible alignments and often will get clearer winners. Typically you will want to use the less end-biased align0-like NW aligner.
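A much-simplified sketch of such an anchor table in Python; the choice of N and of the query offsets K is up to you:

    from collections import defaultdict

    def build_anchor_table(search_strings, n=4):
        """Map every n-character substring to (string_index, offset) pairs."""
        table = defaultdict(list)
        for idx, s in enumerate(search_strings):
            for pos in range(len(s) - n + 1):
                table[s[pos:pos + n]].append((idx, pos))
        return table

    def candidate_anchors(query, table, n=4, offsets=(0, 5, 10)):
        """Look up a few n-grams of the query and return candidate anchors
        as (string_index, position_in_search_string, offset_in_query)."""
        hits = []
        for k in offsets:
            if k + n <= len(query):
                hits.extend((idx, pos, k) for idx, pos in table.get(query[k:k + n], []))
        return hits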
This method tries to boost NW a bit by restricting its input, and this gives a performance gain because you do fewer alignments and they are more often between similar sequences. Another good thing to do with your NW aligner is to allow it to give up after some amount or length of gapping occurs, to cut costs, especially if you know you're not going to see or be interested in middling-quality matches.
Finally, this method was used on a system with small alphabets, with K restricted to the first 100 or so positions in the query string and with search strings much larger than the queries (the DNA reads were around 1000 bases and the search strings were on the order of 10000, so I was looking for approximate substring matches justified by an estimate of edit distance specifically). Adapting this methodology to natural language will require some careful thought: you lose on alphabet size but you gain if your query strings and search strings are of similar length.
Either way, allowing more than one anchor from different ends of the query string to be used simultaneously might be helpful in further filtering data fed to NW. If you do this, be prepared to possibly send overlapping strings each containing one of the two anchors to the aligner and then reconcile the alignments... or possibly further modify NW to emphasize keeping your anchors mostly intact during an alignment using penalty modification during the algorithm's execution.
Hope this is helpful or at least interesting.
Related to the Levenshtein distance: you might wish to normalize it by dividing the result by the length of the longer string, so that you always get a number between 0 and 1 and so that you can compare the distances of pairs of strings in a meaningful way (the expression L(A, B) > L(A, C), for example, is meaningless unless you normalize the distance).
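A tiny sketch, assuming a levenshtein() function like the one shown earlier in this thread:

    def normalized_levenshtein(a, b):
        """Levenshtein distance scaled to [0, 1] by the length of the longer string."""
        if not a and not b:
            return 0.0
        return levenshtein(a, b) / max(len(a), len(b))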
We are using the Levenshtein distance method to check for duplicate customers in our database. It works quite well.
Alternative algorithms to look at are agrep (Wikipedia entry on agrep),
FASTA and BLAST biological sequence matching algorithms. These are special cases of approximate string matching, also in the Stony Brook algorithm repository. If you can specify the ways the strings differ from each other, you could probably focus on a tailored algorithm. For example, aspell uses some variant of "sounds like" (soundex-metaphone) distance in combination with a "keyboard" distance to accommodate bad spellers and bad typists alike.
Use an FM-index with backtracking, similar to the one in the Bowtie fuzzy aligner.
In order to minimize mismatches due to slight variations or errors in spelling, I've used the Metaphone algorithm, then Levenshtein distance (scaled to 0-100 as a percentage match) on the Metaphone encodings for a measure of closeness. That seems to have worked fairly well.
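A sketch of that combination, assuming the third-party jellyfish package (which provides both a Metaphone encoder and a Levenshtein distance); the exact scaling to a percentage is just one reasonable choice:

    import jellyfish  # third-party: pip install jellyfish

    def phonetic_match(a, b):
        """Percentage closeness of the Metaphone encodings of two strings."""
        ma, mb = jellyfish.metaphone(a), jellyfish.metaphone(b)
        if not ma and not mb:
            return 100.0
        dist = jellyfish.levenshtein_distance(ma, mb)
        return 100.0 * (1 - dist / max(len(ma), len(mb)))

    print(phonetic_match("Toyota", "Toiota"))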
To expand on Cd-MaN's answer, it sounds like you're facing a normalization problem. It isn't obvious how to handle scores between alignments with varying lengths.
Given what you are interested in, you may want to obtain p-values for your alignment. If you are using Needleman-Wunsch, you can obtain these p-values using Karlin-Altschul statistics http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
BLAST can perform local alignments and evaluate them using these statistics. If you are concerned about speed, this would be a good tool to use.
Another option is to use HMMER. HMMER uses Profile Hidden Markov Models to align sequences. Personally, I think this is a more powerful approach since it also provides positional information. http://hmmer.janelia.org/
When entering a question, stackoverflow presents you with a list of questions that it thinks likely to cover the same topic. I have seen similar features on other sites or in other programs, too (Help file systems, for example), but I've never programmed something like this myself. Now I'm curious to know what sort of algorithm one would use for that.
The first approach that comes to my mind is splitting the phrase into words and looking for phrases containing these words. Before you do that, you probably want to throw away insignificant words (like 'the', 'a', 'does', etc.), and then you will want to rank the results.
Hey, wait - let's do that for web pages, and then we can have a ... watchamacallit ... - a "search engine", and then we can sell ads, and then ...
No, seriously, what are the common ways to solve this problem?
One approach is the so called bag-of-words model.
As you guessed, first you count how many times words appear in the text (usually called document in the NLP-lingo). Then you throw out the so called stop words, such as "the", "a", "or" and so on.
You're left with words and word counts. Do this for a while and you get a comprehensive set of words that appear in your documents. You can then create an index for these words:
"aardvark" is 1, "apple" is 2, ..., "z-index" is 70092.
Now you can take your word bags and turn them into vectors. For example, if your document contains two references for aardvarks and nothing else, it would look like this:
[2 0 0 ... 70k zeroes ... 0].
After this you can compute the "angle" between two vectors with a dot product. The smaller the angle, the closer the documents are.
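A small sketch of that comparison on raw word counts (the stop-word list is only illustrative):

    import math
    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "or", "and", "is", "of", "to"}

    def bag_of_words(document):
        """Word counts for a document, with stop words removed."""
        return Counter(w for w in document.lower().split() if w not in STOP_WORDS)

    def cosine_similarity(doc_a, doc_b):
        """Cosine of the angle between the two count vectors (1.0 = identical direction)."""
        a, b = bag_of_words(doc_a), bag_of_words(doc_b)
        dot = sum(a[w] * b[w] for w in a if w in b)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0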
This is a simple version and there are other, more advanced techniques. May the Wikipedia be with you.
@Hanno, you should try the Levenshtein distance algorithm. Given an input string s and a list of strings t, iterate over each string u in t and return the one with the minimum Levenshtein distance.
http://en.wikipedia.org/wiki/Levenshtein_distance
See a Java implementation example in http://www.javalobby.org/java/forums/t15908.html
To augment the bag-of-words idea:
There are a few ways you can also pay some attention to n-grams, strings of two or more words kept in order. You might want to do this because a search for "space complexity" is much more than a search for things with "space" AND "complexity" in them, since the meaning of this phrase is more than the sum of its parts; that is, if you get a result that talks about the complexity of outer space and the universe, this is probably not what the search for "space complexity" really meant.
A key idea from natural language processing here is that of mutual information, which allows you (algorithmically) to judge whether or not a phrase is really a specific phrase (such as "space complexity") or just words which are coincidentally adjacent. Mathematically, the main idea is to ask, probabilistically, if these words appear next to each other more often than you would guess by their frequencies alone. If you see a phrase with a high mutual information score in your search query (or while indexing), you can get better results by trying to keep these words in sequence.
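A rough sketch of the mutual-information check for a two-word phrase, given bigram and unigram counts from your corpus (using the bigram total as the normaliser for simplicity):

    import math

    def pmi(bigram_count, count_w1, count_w2, total_bigrams):
        """Pointwise mutual information of a two-word phrase.
        High values mean the words co-occur more often than their
        individual frequencies alone would predict."""
        p_xy = bigram_count / total_bigrams
        p_x = count_w1 / total_bigrams
        p_y = count_w2 / total_bigrams
        return math.log2(p_xy / (p_x * p_y))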
From my (rather small) experience developing full-text search engines: I would look up questions which contain some words from query (in your case, query is your question).
Sure, noise words should be ignored and we might want to check query for 'strong' words like 'ASP.Net' to narrow down search scope.
Inverted indexes (http://en.wikipedia.org/wiki/Index_(search_engine)#Inverted_indices) are commonly used to find questions containing the words we are interested in.
After finding questions containing words from the query, we might want to calculate the distance between those words within each question, so that a question with the text 'phrases similarity' ranks higher than a question with the text 'discussing similarity, you hear the following phrases...'.
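A toy sketch of such an inverted index; questions is assumed to be a dict mapping question ids to their text:

    from collections import defaultdict

    STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and", "or"}

    def build_inverted_index(questions):
        """Map each non-noise word to the set of question ids containing it."""
        index = defaultdict(set)
        for qid, text in questions.items():
            for word in text.lower().split():
                if word not in STOP_WORDS:
                    index[word].add(qid)
        return index

    def candidate_questions(query, index):
        """Union of the posting lists for the query words."""
        ids = set()
        for word in query.lower().split():
            ids |= index.get(word, set())
        return ids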
Here is a bag-of-words solution using TfidfVectorizer in Python 3 (scikit-learn plus NLTK stop words).
    # from sklearn.feature_extraction.text import CountVectorizer  # plain counts would also work
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn import svm
    from nltk.corpus import stopwords
    import nltk

    nltk.download('stopwords')
    s = set(stopwords.words('english'))

    # train_x is assumed to be a list of raw phrases and train_y their labels
    train_x_cleaned = []
    for i in train_x:
        # drop stop words before vectorizing
        sentence = filter(lambda w: w.lower() not in s, i.split())
        train_x_cleaned.append(' '.join(sentence))

    vectorizer = TfidfVectorizer(binary=True)
    train_x_vectors = vectorizer.fit_transform(train_x_cleaned)
    print(vectorizer.get_feature_names_out())
    print(train_x_vectors.toarray())

    # train a linear SVM on the TF-IDF vectors
    clf_svm = svm.SVC(kernel='linear')
    clf_svm.fit(train_x_vectors, train_y)

    # vectorize new phrases with the same vocabulary and classify them
    test_x = vectorizer.transform(["test phrase 1", "test phrase 2", "test phrase 3"])
    print(type(test_x))
    clf_svm.predict(test_x)