Algorithm to find keywords of a text

Given a set of texts (might be books, articles, documents, etc.) how would you find relevant keywords for each text?
Common sense suggests to:
split the text into words
exclude common words (also called stop words, like "a", "to", "for", "in")
count word frequencies
give each word a score, using a formula that takes into account the word's frequency in this document and in the other documents, the number of words in the document, and the total number of words across all documents
The question is: what is a good formula for that?

I've developed one.
For each word calculate this ratio:
(frequency of word in this text) * (total number of words in all texts)
-----------------------------------------------------------------------
(number of words in this text) * (frequency of word in all texts)
Keywords are those words whose ratio is in the highest 20% for this document.
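In Python, the whole pipeline (splitting, dropping stop words, counting, and scoring with the ratio above) could look roughly like this; the tiny stop-word list and the function names are just placeholders:

from collections import Counter

STOP_WORDS = {"a", "to", "for", "in"}   # use a real stop-word list in practice

def keyword_scores(texts):
    # texts: list of strings; returns one {word: ratio} dict per text
    tokenized = [[w for w in t.lower().split() if w not in STOP_WORDS]
                 for t in texts]
    all_counts = Counter(w for words in tokenized for w in words)
    total_words = sum(all_counts.values())

    results = []
    for words in tokenized:
        counts = Counter(words)
        n = len(words)
        results.append({w: (counts[w] * total_words) / (n * all_counts[w])
                        for w in counts})
    return results

def top_keywords(scores):
    # keep the words whose ratio is in the highest 20% for this document
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:max(1, len(ranked) // 5)]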
Ankerl also proposes his own formula:
tanh(curVal/curWords*200) - 5*tanh((allVal-curVal)/(allWords-curWords)*200)
Where:
curVal: How often the word to score is present in the to-be-analyzed text
curWords: Total number of words in the to-be-analyzed text
allVal: How often the word to score is present in the indexed dataset
allWords: Total number of words of the indexed dataset
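A direct transcription of that formula (assuming the indexed dataset is strictly larger than the text being scored, otherwise the second denominator is zero):

from math import tanh

def ankerl_score(cur_val, cur_words, all_val, all_words):
    # cur_val: occurrences of the word in the text being analyzed
    # cur_words: total words in that text
    # all_val / all_words: the same counts over the whole indexed dataset
    return (tanh(cur_val / cur_words * 200)
            - 5 * tanh((all_val - cur_val) / (all_words - cur_words) * 200))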
Both algorithms work pretty well, and results often coincide. Do you know any way to do it better?

Related

Sort array of strings based on matching words with input string

I was not able to find any solution for below problem in a coding contest.
Problem:
We have an input string of "good words" separated by underscores, and a list of user reviews (an array of strings where each element contains words separated by underscores).
We have to sort the list of user reviews so that elements containing more good words come first.
Example:
input:
good words: "pool_clean_food".
user review array:["food_bedroom_environment","view_sea_desert","clean_pool_table"].
output: [2,0,1]
Explanation:
Array[2]="clean_pool_table" has 2 good words, i.e. pool and clean
Array[0]="food_bedroom_environment" has 1 good word, i.e. food
Array[1]="view_sea_desert" has 0 good words
How can I approach this problem, and which data structure should I use so that my code can handle large inputs?
Split the input good-words string on underscores and store the words in a hashset.
Now, for each review, start its score at 0, split the review on underscores as well, and check its words against the hashset one by one. If a word is present, add 1 to that review's score.
Now treat every review as a <review, score> pair and sort the reviews by score in descending order, so the reviews with the most good words come first. Any standard O(n log n) sorting algorithm works here.
Instead of a hashset, you can use a trie, which may speed up the lookups when the words are very long.
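A rough sketch of that approach in Python (the tie-break on the original index is my own assumption, since the problem statement does not specify one):

def sort_reviews(good_words, reviews):
    good = set(good_words.split("_"))                 # hashset of good words
    scores = [sum(w in good for w in r.split("_")) for r in reviews]
    # indices sorted by score, highest first; original order breaks ties
    return sorted(range(len(reviews)), key=lambda i: (-scores[i], i))

# sort_reviews("pool_clean_food",
#              ["food_bedroom_environment", "view_sea_desert", "clean_pool_table"])
# returns [2, 0, 1]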

Algorithm for search in inverted index

Consider there are 10 billion words that people have searched for in Google. Corresponding to each word you have the sorted list of all document ids. The list looks like this:
[Word 1]->[doc_i1,doc_j1,.....]
[Word 2]->[doc_i2,doc_j2,.....]
...
...
...
[Word N]->[doc_in,doc_jn,.....]
I am looking for an algorithm to find 100 rare word-pairs.
A rare word-pair is a pair of words that occur together (not necessarily contiguously) in exactly 1 document.
I am looking for something better than O(n^2) if possible.
Order the words by the number of documents they occur in. The idea here is that words that are rare overall will also be rare in pairs. If you find a word that occurs in exactly one document, just pick any other word from that document and you are done.
Then you start inverting the index, starting with the rarest word. That means you create a map where each document points to the set of words in it. At first you create that inverted index with the rarest word only. After you inserted all documents associated with that rarest word into the inverted index, you have a map where each document points to exactly one word.
Then you add the next word with all its documents, still following the ordering from the first step. At some point a document associated with a word will already be present in your inverted map. Here you check all words associated with that document to see whether they form such a rare word pair.
The performance of this depends heavily on how far you have to go to find 100 such pairs; the idea is that you are done after processing only a small fraction of the total data set. To take advantage of processing only a small fraction of the data, the first step should use a sort algorithm that yields the smallest elements long before the entire set has been sorted, like quicksort. The sorting can then be done in roughly O(N log N1), where N1 is the number of words you actually need to add to the inverted index before finding 100 pairs. The other operations, namely adding a word to the inverted index and checking whether a word pair occurs in more than one document, are linear in the number of documents per word, so they are fast at the beginning and slow down later, because later words have more documents each.
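A compact Python sketch of this procedure; the exactly-one-document check here simply intersects the two posting lists, which is fine for a sketch but is exactly where a real implementation would optimize:

from collections import defaultdict

def rare_pairs(index, needed=100):
    # index: {word: list of doc ids}; returns up to `needed`
    # (word_a, word_b) pairs that co-occur in exactly one document
    doc_sets = {w: set(d) for w, d in index.items()}
    words_by_rarity = sorted(index, key=lambda w: len(index[w]))

    seen = defaultdict(set)   # doc id -> words already added that contain it
    pairs = []
    for w in words_by_rarity:
        for doc in index[w]:
            for other in seen[doc]:
                # the pair co-occurs here; it is rare iff this is their only common doc
                if len(doc_sets[w] & doc_sets[other]) == 1:
                    pairs.append((other, w))
                    if len(pairs) == needed:
                        return pairs
            seen[doc].add(w)
    return pairs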
This is the opposite of "Frequent Itemset Mining".
See, for example, this recent publication: Rare pattern mining: challenges and future perspectives

Efficient algorithm to find most common phrases in a large volume of text

I am thinking about writing a program to collect for me the most common phrases in a large volume of text. Had the problem been reduced to just finding words, it would be as simple as storing each new word in a hashmap and then increasing the count on each occurrence. But with phrases, storing each permutation of a sentence as a key seems infeasible.
Basically the problem is narrowed down to figuring out how to extract every possible phrase from a large enough text. Counting the phrases and then sorting by the number of occurrences becomes trivial.
I assume that you are searching for common patterns of consecutive words appearing in the same order (e.g. "top of the world" would not be counted as the same phrase as "top of a world" or "the world of top").
If so then I would recommend the following linear-time approach:
Split your text into words and remove things you don't consider significant (i.e. remove capitalisation, punctuation, word breaks, etc.)
Convert your text into an array of integers (one integer per unique word), e.g. every instance of "cat" becomes 1 and every "dog" becomes 2. This can be done in linear time by using a hash-based dictionary to store the conversions from words to numbers. If a word is not in the dictionary, assign it a new id.
Construct a suffix array for the array of integers (this is a sorted list of all the suffixes of your array and can be constructed in linear time, e.g. using the algorithm and C code here).
Construct the longest common prefix array for your suffix array. (This can also be done in linear time, for example using this C code.) This LCP array gives the number of common words at the start of each suffix between consecutive pairs in the suffix array.
You are now in a position to collect your common phrases.
It is not quite clear how you wish to determine the end of a phrase. One possibility is to simply collect all sequences of 4 words that repeat.
This can be done in linear time by working through your suffix array looking at places where the longest common prefix array is >= 4. Each run of indices x in the range [start+1...start+len] where the LCP[x] >= 4 (for all except the last value of x) corresponds to a phrase that is repeated len times. The phrase itself is given by the first 4 words of, for example, suffix start+1.
Note that this approach will potentially spot phrases that cross sentence ends. You may prefer to convert some punctuation such as full stops into unique integers to prevent this.
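A Python sketch of the steps above; it sorts the suffixes directly instead of using a linear-time suffix array construction, so treat it as illustrative rather than something to run on large inputs:

import re
from collections import defaultdict

def repeated_phrases(text, min_words=4):
    # 1. split into words, dropping case and punctuation
    words = re.findall(r"[a-z']+", text.lower())
    # 2. map each unique word to an integer id
    ids = {}
    arr = [ids.setdefault(w, len(ids)) for w in words]
    n = len(arr)

    # 3. suffix array by direct sorting (materializes every suffix,
    #    so only suitable for small texts)
    sa = sorted(range(n), key=lambda i: arr[i:])

    # 4. LCP array: lcp[i] = words shared by suffixes sa[i-1] and sa[i]
    lcp = [0] * n
    for i in range(1, n):
        a, b, k = sa[i - 1], sa[i], 0
        while a + k < n and b + k < n and arr[a + k] == arr[b + k]:
            k += 1
        lcp[i] = k

    # 5. every run of LCP values >= min_words marks a repeated phrase;
    #    a run of r such values means the phrase occurs r + 1 times
    counts = defaultdict(int)
    i = 1
    while i < n:
        if lcp[i] >= min_words:
            start, run = sa[i - 1], 0
            while i < n and lcp[i] >= min_words:
                run += 1
                i += 1
            counts[" ".join(words[start:start + min_words])] = run + 1
        else:
            i += 1
    return sorted(counts.items(), key=lambda kv: -kv[1])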

relevant text search in a large text file

I have a large text document and I have a search query (e.g. rock climbing). I want to return the 5 most relevant sentences from the text. What are the approaches that can be followed? I am a complete newbie in this text retrieval domain, so any help is appreciated.
One approach I can think of is:
scan the file sentence by sentence, look for the whole search query in the sentence, and if it matches, return the sentence.
The above approach works only if some of the sentences contain the whole search query. What should I do if no sentence contains the whole query and some sentences contain just one of the words? Or what if they contain none of the words?
Another question I have is: can we preprocess the text document to make building an index easier? Is a trie a good data structure for the preprocessing?
In general, relevance is something you define using some sort of scoring function. I will give you an example of a naive scoring algorithm, as well as one of the common search engine ranking algorithms (used for documents, but I modified it for sentences for educational purposes).
Naive ranking
Here's an example of a naive ranking algorithm. The ranking could go as simple as:
Sentences are ranked based on the average proximity between the query terms (e.g. the largest number of words between any pair of query terms), meaning that a sentence "Rock climbing is awesome" is ranked higher than "I am not a fan of climbing because I am lazy like a rock."
More word matches are ranked higher, e.g. "Climbing is fun" is ranked higher than "Jogging is fun."
Pick alphabetical or random favorites in case of a tie, e.g. "Climbing is life" is ranked higher than "I am a rock."
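A toy Python implementation of these heuristics (punctuation is not stripped, and measuring proximity as the smallest gap between each pair of matched terms, averaged, is just one possible reading of the first rule):

def naive_rank(sentences, query):
    terms = query.lower().split()

    def key(sentence):
        words = sentence.lower().split()
        pos = {t: [i for i, w in enumerate(words) if w == t] for t in terms}
        matched = [t for t in terms if pos[t]]
        # smallest gap between occurrences of every pair of matched terms
        gaps = [min(abs(a - b) for a in pos[u] for b in pos[v])
                for i, u in enumerate(matched) for v in matched[i + 1:]]
        avg_gap = sum(gaps) / len(gaps) if gaps else float("inf")
        # more matched terms first, then closer terms, then alphabetical tie-break
        return (-len(matched), avg_gap, sentence.lower())

    return sorted(sentences, key=key)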
Some common search engine ranking
BM25
BM25 is a good robust algorithm for scoring documents with relation to the query. For reference purposes, here's a Wikipedia article about BM25 ranking algorithm. You would want to modify it a little because you are dealing with sentences, but you can take a similar approach by treating each sentence as a 'document'.
Here it goes. Assuming your query consists of keywords q1, q2, ... , qm, the score of a sentence S with respect to the query Q is calculated as follows:
SCORE(S, Q) = SUM(i=1..m) IDF(qi) * f(qi, S) * (k1 + 1) / (f(qi, S) + k1 * (1 - b + b * |S| / AVG_SENT_LENGTH))
k1 and b are free parameters (commonly k1 in [1.2, 2.0] and b = 0.75; you can also find good values empirically). f(qi, S) is the term frequency of qi in the sentence S (you could treat it as just the number of times the term occurs), |S| is the length of the sentence (in words), and AVG_SENT_LENGTH is the average sentence length in your document. Finally, IDF(qi) is the inverse document frequency (or, in this case, inverse sentence frequency) of qi, which is usually computed as:
IDF(qi) = log ((N - n(qi) + 0.5) / (n(qi) + 0.5))
Where N is the total number of sentences, and n(qi) is the number of sentences containing qi.
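A literal transcription of the scoring formula in Python (no inverted index yet, so n(qi) is counted by scanning the sentences; the default k1 = 1.5 is just a value in the usual range):

from math import log

def bm25_scores(sentences, query, k1=1.5, b=0.75):
    # sentences: list of token lists; query: list of query terms
    N = len(sentences)
    avg_len = sum(len(s) for s in sentences) / N
    n = {q: sum(1 for s in sentences if q in s) for q in query}      # n(qi)
    idf = {q: log((N - n[q] + 0.5) / (n[q] + 0.5)) for q in query}   # IDF(qi)

    scores = []
    for s in sentences:
        score = 0.0
        for q in query:
            f = s.count(q)   # term frequency f(qi, S)
            score += idf[q] * f * (k1 + 1) / (f + k1 * (1 - b + b * len(s) / avg_len))
        scores.append(score)
    return scores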
Speed
Assume you don't store an inverted index or any additional data structure for fast access.
These are the terms that can be pre-computed: N and AVG_SENT_LENGTH.
First, notice that the more query terms a sentence matches, the higher it will score (because of the sum over terms). So if you rank the top k sentences, you really need to compute the values f(qi, S), |S|, and n(qi), which takes O(AVG_SENT_LENGTH * m * k) time, or O(DOC_LENGTH * m) in the worst case where you rank all the sentences. Here k is the number of sentences with the highest number of matched terms and m is the number of query terms: each of the k sentences is about AVG_SENT_LENGTH words long, and you scan it once per query term.
Inverted index
Now let's look at the inverted index, which allows fast text searches. We will treat your sentences as documents for educational purposes. The idea is to build a data structure for your BM25 computations. We will need to store term frequencies using inverted lists:
wordi: (sent_id1, tf1), (sent_id2, tf2), ... ,(sent_idk, tfk)
Basically, you have a hashmap where the key is a word and the value is a list of pairs (sent_id, tf) giving the ids of the sentences the word occurs in and its frequency in each. For example, it could be:
rock: (1, 1), (5, 2)
This tells us that the word rock occurs in the first sentence 1 time and in the fifth sentence 2 times.
This pre-processing step gives you O(1) access to the term frequencies for any particular word, so it will be as fast as you want.
Also, you would want to have another hashmap to store sentence length, which should be a fairly easy task.
How do you build the inverted index? I am skipping stemming and lemmatization in your case, but you are welcome to read more about them. In short, you traverse your document, creating pairs and increasing frequencies in the hashmap as you encounter each word. Here are some slides on building the index.
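A minimal index builder along these lines in Python, producing the (sent_id, tf) posting lists and the sentence-length map described above:

from collections import defaultdict

def build_inverted_index(sentences):
    # sentences: list of token lists (stemming/lemmatization skipped, as above)
    index = defaultdict(lambda: defaultdict(int))    # word -> {sent_id: tf}
    sent_length = {}
    for sent_id, tokens in enumerate(sentences):
        sent_length[sent_id] = len(tokens)
        for w in tokens:
            index[w][sent_id] += 1
    # flatten into the (sent_id, tf) posting lists described above
    postings = {w: sorted(tf.items()) for w, tf in index.items()}
    return postings, sent_length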

Finding popular keywords in huge list

I have a huge list with about 100 000 lines like this:
ipadnews
abcipad
cddeeffipad
hellworld
iworldthis
.. and so on
And would like to find popular substrings, in this case "ipad" would be the most popular and "world" would be on second place. Minimum length should be three or four chars.
I can't predict the substrings so using a dictionary is a no no.
This is a relatively complicated problem, but it's tractable using prefix/suffix trees. It's essentially a variation of the longest common subsequence and longest common substring problems, which is where I would start.
There's actually quite a bit of research on problems of this form; you should be able to use the terms above to narrow your search.
You can solve this using a generalized suffix tree which can be built in O(n) time. This is effectively a play on the LCS problem.
I would go about this problem using the following flow of logic:
Extract the set of suffixes for each word. So from 'ipadnews' we get: 'ipadnews', 'padnews', 'adnews', and so on. This way, 'news' will be one of the suffixes, but not 'ipad'.
To make up for the missing substrings in the above step, extract the prefixes as well. We get 'ipadnew', 'ipadne', and so on, including 'ipad'.
For each of the substrings above, hash them towards a count, e.g. $hash{$substr}++.
At the end we will have a large hashtable with substring frequencies as values. Instead of an expensive full sort, suppose you only want the 10 most popular substrings: keep a small set from the beginning whose criterion is that every word in it has a score above the current minimum. Keep track of the word with the minimum score, and when you add an 11th item with a higher score, bump out the minimum-score word and update the minimum.
The maximum number of keys in the hashtable will be about 2*k*n, where k is the average length of the words and n is the total number of words.
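A Python sketch of this suffix-plus-prefix counting; heapq.nlargest stands in for the manual min-score bookkeeping described above:

import heapq
from collections import defaultdict

def popular_substrings(words, min_len=4, top=10):
    counts = defaultdict(int)
    for w in words:
        # suffixes: 'ipadnews', 'padnews', ...; prefixes: ..., 'ipad'
        subs = set()
        for i in range(len(w) - min_len + 1):
            subs.add(w[i:])                # suffix starting at position i
            subs.add(w[:len(w) - i])       # prefix of length len(w) - i
        for s in subs:
            counts[s] += 1
    # keep only the `top` highest counts instead of sorting everything
    return heapq.nlargest(top, counts.items(), key=lambda kv: kv[1])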
