relevant text search in a large text file - algorithm

I have a large text document and a search query (e.g., "rock climbing"). I want to return the 5 most relevant sentences from the text. What approaches can be followed? I am a complete newbie in the text retrieval domain, so any help is appreciated.
One approach I can think of is:
Scan the file sentence by sentence, look for the whole search query in the sentence, and if it matches, return the sentence.
The above approach works only if some of the sentences contain the whole search query. What should I do if no sentence contains the whole query and some sentences contain just one of the words? Or what if they contain none of the words?
Any help?
Another question I have is: can we preprocess the text document to make building an index easier? Is a trie a good data structure for that preprocessing?

In general, relevance is something you define using some sort of scoring function. I will give you an example of a naive scoring algorithm, as well as one of the common search engine ranking algorithms (used for documents, but I modified it for sentences for educational purposes).
Naive ranking
Here's an example of a naive ranking algorithm (a short sketch follows the list below). The ranking could be as simple as:
Sentences are ranked by the average proximity between the query terms (e.g., the largest number of intervening words over all query term pairs; the fewer, the better), so "Rock climbing is awesome" ranks higher than "I am not a fan of climbing because I am lazy like a rock."
Sentences matching more query words rank higher, e.g. "Climbing is fun" ranks higher than "Jogging is fun."
Pick alphabetical or random favorites in case of a tie, e.g. "Climbing is life" ranks higher than "I am a rock."
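A minimal Python sketch of such a naive ranker, assuming plain lowercase whitespace tokenization and using the maximum gap between matched terms as the proximity measure (both are my assumptions, not part of the original description):

```python
# Naive sentence ranker: more matched query terms first, then a smaller
# gap between matched terms, then alphabetical order as a tie-break.
def naive_rank(sentences, query, top_k=5):
    q_terms = set(query.lower().split())

    def key(sentence):
        words = sentence.lower().split()
        positions = [i for i, w in enumerate(words) if w in q_terms]
        matches = len(set(words) & q_terms)
        # Maximum gap between matched terms; 0 if fewer than two matches.
        gap = max(positions) - min(positions) if len(positions) > 1 else 0
        # Sort order: most matches first, smallest gap next, alphabetical last.
        return (-matches, gap, sentence)

    return sorted(sentences, key=key)[:top_k]

print(naive_rank(["Rock climbing is awesome",
                  "I am lazy like a rock",
                  "Jogging is fun"], "rock climbing"))
```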
Some common search engine ranking
BM25
BM25 is a good, robust algorithm for scoring documents with respect to a query. For reference, here's the Wikipedia article about the BM25 ranking algorithm. You would want to modify it a little because you are dealing with sentences, but you can take a similar approach by treating each sentence as a 'document'.
Here it goes. Assuming your query consists of keywords q1, q2, ... , qm, the score of a sentence S with respect to the query Q is calculated as follows:
SCORE(S, Q) = SUM(i=1..m) IDF(qi) * f(qi, S) * (k1 + 1) / (f(qi, S) + k1 * (1 - b + b * |S| / AVG_SENT_LENGTH))
k1 and b are free parameters (commonly k1 in [1.2, 2.0] and b = 0.75; you can find good values empirically). f(qi, S) is the term frequency of qi in sentence S (you can treat it as just the number of times the term occurs), |S| is the length of the sentence (in words), and AVG_SENT_LENGTH is the average sentence length in your document. Finally, IDF(qi) is the inverse document frequency (or, in this case, inverse sentence frequency) of qi, which is usually computed as:
IDF(qi) = log ((N - n(qi) + 0.5) / (n(qi) + 0.5))
Where N is the total number of sentences, and n(qi) is the number of sentences containing qi.
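A short Python sketch of this sentence-level scoring, assuming plain whitespace tokenization and the parameter values mentioned above (k1 = 1.2, b = 0.75):

```python
import math

def bm25_rank(sentences, query, k1=1.2, b=0.75, top_k=5):
    """Rank sentences against the query using sentence-level BM25."""
    docs = [s.lower().split() for s in sentences]
    N = len(docs)
    avg_sent_length = sum(len(d) for d in docs) / N
    q_terms = query.lower().split()

    # n(qi): number of sentences containing each query term
    n = {q: sum(1 for d in docs if q in d) for q in q_terms}

    def idf(q):
        # IDF as defined above; note it can go negative for very common terms.
        return math.log((N - n[q] + 0.5) / (n[q] + 0.5))

    def score(d):
        total = 0.0
        for q in q_terms:
            f = d.count(q)  # term frequency f(qi, S)
            total += idf(q) * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avg_sent_length))
        return total

    ranked = sorted(zip(sentences, docs), key=lambda pair: score(pair[1]), reverse=True)
    return [s for s, _ in ranked[:top_k]]
```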
Speed
Assume you don't store an inverted index or any additional data structure for fast access.
These are the terms that can be pre-computed: N and AVG_SENT_LENGTH.
First, notice that the more query terms a sentence matches, the higher it will score (because of the summed terms). So if you only rank the top k candidate sentences, you need to compute the values f(qi, S), |S|, and n(qi), which takes O(AVG_SENT_LENGTH * m * k), or, if you rank all the sentences, O(DOC_LENGTH * m) time in the worst case, where k is the number of sentences with the highest number of matched terms and m is the number of query terms. This assumes each sentence is about AVG_SENT_LENGTH words long and that you scan it once per query term for each of the k sentences.
Inverted index
Now let's look at an inverted index to allow fast text searches. We will again treat your sentences as documents for educational purposes. The idea is to build a data structure for your BM25 computations. We will need to store term frequencies using inverted lists:
wordi: (sent_id1, tf1), (sent_id2, tf2), ..., (sent_idk, tfk)
Basically, you have a hashmap where the key is a word and the value is a list of pairs (sent_idj, tfj), giving the ids of the sentences the word occurs in and its frequency in each. For example, it could be:
rock: (1, 1), (5, 2)
This tells us that the word rock occurs in the first sentence 1 time and in the fifth sentence 2 times.
This pre-processing step gives you O(1) access to the term frequencies of any particular word, so lookups will be as fast as you need.
Also, you will want another hashmap that stores the sentence lengths, which should be a fairly easy addition.
How do you build the inverted index? I am skipping stemming and lemmatization in your case, but you are welcome to read more about them. In short, you traverse the document, continuously creating pairs and increasing frequencies in the hashmap containing the words. Here are some slides on building the index.
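A minimal sketch of building such an index in Python (no stemming or lemmatization, plain whitespace tokenization):

```python
from collections import defaultdict

def build_index(sentences):
    """Build an inverted index: word -> {sentence_id: term_frequency},
    plus a map of sentence lengths for the BM25 length normalization."""
    index = defaultdict(dict)   # word -> {sent_id: tf}
    sent_length = {}            # sent_id -> number of words

    for sent_id, sentence in enumerate(sentences, start=1):
        words = sentence.lower().split()
        sent_length[sent_id] = len(words)
        for w in words:
            index[w][sent_id] = index[w].get(sent_id, 0) + 1

    return index, sent_length

index, lengths = build_index(["Rock climbing is awesome",
                              "I like the rock near the rock wall"])
print(index["rock"])   # {1: 1, 2: 2}
```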

Related

Efficiently finding a list of near matches for list of words and phrases

I am looking for an algorithm, but I don't know the name of the problem so I can't find anything. Hopefully my explanation of the problem makes sense!
Let's say you have a long list of phrases, where each phrase is a set of words. The user inputs a list of words, and their list "matches" a phrase if every word in the phrase is found in their list. A list's "score" is the number of phrases it matches. The goal is to provide the user with a list of words that would most improve their list's score.
Here's a simple example. We have ten phrases:
wood cabin
camping in woods
camping cabin
fun camping
bon fire
camping fire
swimming hole
fun cabin
wood fire
fire place
And the user provides this list:
wood
fun
camping
We match phrases 1 and 4, so the score is 2. But if the user adds "cabin" to their list, they will match 3 more phrases and get a score of 5. "fire" would add 2 to the score.
With the trivially short list, there isn't any complicated problem, as you can just iterate through the options in almost no time. But as the list grows to the hundreds of thousands, it starts taking hundreds of milliseconds. It feels like there should be a way to build an index to make the process faster, but I can't think of what the index's structure would be.
Anyone who took the time to read all this, thank you! Hopefully someone knows what I'm talking about.
You need to map words to their number of occurrences. If you use a hash table you can do this very quickly, in O(N), with N being the total number of words in the phrases: loop over all phrases, break them into words, and if a word is already in the map increment its count; if not, add it to the map with count 1.
To compute the score of the input, just loop over the input words and accumulate their occurrence counts. This is O(M), this time with M being the number of input words.
I doubt you can get better complexity (you need to scan the phrases at least once), and with a proper implementation of a map (available in almost all modern languages) - it will be fast as well.
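A rough Python sketch of that hash-map counting; note it computes the occurrence-accumulating score described in this answer, which is an approximation of the exact phrase-match score in the question:

```python
from collections import Counter

def build_word_counts(phrases):
    """O(N): count how many times each word occurs across all phrases."""
    counts = Counter()
    for phrase in phrases:
        counts.update(phrase.split())
    return counts

def input_score(counts, input_words):
    """O(M): accumulate the occurrence counts of the input words."""
    return sum(counts[w] for w in input_words)

phrases = ["wood cabin", "camping in woods", "camping cabin", "fun camping"]
counts = build_word_counts(phrases)
print(input_score(counts, ["wood", "fun", "camping"]))   # 5
```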
Suffix tree.
They're rather fiddly and complicated things, but basically we store a node for each character (26 * 2), then we store the suffixes for each character, so entries for th and an and so on, but presumably not for qj or other combinations which won't occur. Then you get the suffixes for those (so the, thr, and, and so on, though plenty of three-letter combinations never appear).
It allows for very fast searching, which doesn't have to be exact. If we want to match a*d, we simply follow all the suffixes of a, then only the d suffixes, then we insist on the terminator (nul).
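A full suffix tree is too fiddly for a short snippet, but a plain character trie already supports this kind of wildcard lookup. Here is a rough Python sketch in which * stands for exactly one character (my assumption about the intended semantics):

```python
def build_trie(words):
    """Character trie; the key None marks the end of a word (the 'nul')."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True
    return root

def match(node, pattern):
    """Match a pattern where '*' stands for exactly one character."""
    if not pattern:
        return None in node                  # insist on the terminator
    ch, rest = pattern[0], pattern[1:]
    if ch == "*":
        return any(match(child, rest)
                   for key, child in node.items() if key is not None)
    return ch in node and match(node[ch], rest)

trie = build_trie(["and", "add", "ant", "bad"])
print(match(trie, "a*d"))   # True  (matches "and" and "add")
print(match(trie, "b*t"))   # False
```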

Correct order output in K Means and document clustering

I am doing single-document clustering with K-Means. I am now working on preparing the data to be clustered and representing the N sentences by their vector representations.
However, if I understand correctly, the K-Means algorithm creates k clusters based on the Euclidean distance to k center points, regardless of the order of the sentences.
My problem is that I want to keep the order of the sentences and consider it in the clustering task.
Let's say S = {1..n} is a set of n vectors representing sentences, with S_1 = sentence 1, S_2 = sentence 2, etc.
I want the clusters to be K_1 = S[1..i], K_2 = S[i..j], etc.
I thought maybe to transform this into 1D and add the index of each sentence to the transformed value, but I am not sure it will help. Maybe there's a smarter way.
A quick and dirty way to do this would be to prefix each lexical item with the number of the sentence it's in. First sentence-segment the document; then, for this document:
This document's really great. It's got all kinds of words in it. All the words are here.
You would get something like:
{"0_this": 1, "0_document": 1, "0_be": 1, "0_really": 1,...}
Whatever k-means you're using, this should be readily accepted.
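A rough sketch of producing such sentence-number-prefixed features in Python, assuming plain whitespace tokenization rather than the lemmatized tokens shown in the example above:

```python
from collections import Counter

def sentence_prefixed_features(sentences):
    """Prefix every token with the index of the sentence it occurs in,
    then count, yielding one feature dict for the whole document."""
    features = Counter()
    for i, sentence in enumerate(sentences):
        for token in sentence.lower().split():
            features[f"{i}_{token}"] += 1
    return dict(features)

doc = ["This document's really great.",
       "It's got all kinds of words in it.",
       "All the words are here."]
print(sentence_prefixed_features(doc))
```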
I'd warn against doing this at all in general, though. You're introducing a lot of data sparsity, and your results will be more harmed by the curse of dimensionality. You should only do it if the genre you're looking at is (1) very predictable in lexical choice and (2) very predictable in structure. I can't think of a good linguistic reason that sentences should align precisely across texts.

What is a "term-vector algorithm"?

Google states that a "term-vector algorithm" can be used to determine popular keywords. I have studied http://en.wikipedia.org/wiki/Vector_space_model, but I can't understand the term "term-vector algorithm".
Please explain it in a brief summary, very simple language, as if the reader is a child.
I believe "vector" refers to the mathematics definition, a quantity having direction as well as magnitude. How is it that keywords have a quantity moving in a direction?
http://en.wikipedia.org/wiki/Vector_space_model states "Each dimension corresponds to a separate term." I thought dimension relates to cardinality, is that correct?
From the book Hadoop In Practice, by Alex Holmes, page 12.
It means that each word forms a separate dimension:
Example: (shamelessly taken from here)
For a model containing only three words you would get:
dict = { dog, cat, lion }
Document 1: "cat cat" → (0, 2, 0)
Document 2: "cat cat cat" → (0, 3, 0)
Document 3: "lion cat" → (0, 1, 1)
Document 4: "cat lion" → (0, 1, 1)
The most popular example for MapReduce is to calculate word frequency; namely, a map step that outputs each word as a key with 1 as its value, and a reduce step that sums the numbers for each word. So if a web page has a list of (possibly duplicate) words, each word in that list maps to 1. The reduce step essentially counts how many times each word occurs in that page. You can do this across pages, websites, or whatever criteria. The resulting data is a dictionary mapping each word to its frequency, which is effectively a term frequency vector.
Example document: "a be see be a"
Resulting data: { 'a':2, 'be':2, 'see':1 }
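In Python, that reduce output is just a word-count dict, i.e. a term frequency vector keyed by term:

```python
from collections import Counter

document = "a be see be a"
term_vector = Counter(document.split())
print(dict(term_vector))   # {'a': 2, 'be': 2, 'see': 1}
```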
Term vector sounds like it just means that each term has a weight or number value attached, probably corresponding to the number of times the term is mentioned.
You are thinking of the geometric meaning of the word vector, but there is another mathematical meaning that just means multiple dimensions, i.e. instead of saying x, y, z you say the vector x (in bold), which has multiple dimensions x1, x2, x3, ..., xn with some values. So for a term vector, the vector ranges over terms and takes the form term1, term2, up to termN. Each can then have a value, just as x, y, or z has a value.
As an example, term 1 could be dog, term 2 cat, term 3 lion, and each has a weight, say 2, 3, 1, meaning the word dog appears twice, cat three times, and lion once.

Algorithm to find keywords of a text

Given a set of texts (might be books, articles, documents, etc.) how would you find relevant keywords for each text?
Common sense suggests to:
split words
exclude common words (also called stop-words, like "a", "to", "for", "in")
count words frequencies
give a score to each word, with a formula that takes into account the frequency of each word in the document and in other documents, the number of words of the document and the total number of words of all documents
The question is: which is a good formula to do that?
I've developed one.
For each word calculate this ratio:
(frequency of word in this text) * (total number of words in all texts)
-----------------------------------------------------------------------
(number of words in this text) * (frequency of word in all texts)
Keywords are those words whose ratio is in the highest 20% (for this document).
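A sketch of this ratio in Python, assuming plain whitespace tokenization (the helper names are made up for illustration):

```python
from collections import Counter

def keyword_ratios(texts, doc_index):
    """Score each word of texts[doc_index] with the ratio above."""
    doc_counts = [Counter(t.lower().split()) for t in texts]
    all_counts = sum(doc_counts, Counter())     # frequency of each word in all texts
    total_words = sum(all_counts.values())      # total number of words in all texts
    doc = doc_counts[doc_index]
    doc_words = sum(doc.values())               # number of words in this text

    return {w: (freq * total_words) / (doc_words * all_counts[w])
            for w, freq in doc.items()}

def top_keywords(ratios, fraction=0.2):
    """Keep the words whose ratio is in the highest 20% for this document."""
    ranked = sorted(ratios, key=ratios.get, reverse=True)
    return ranked[:max(1, int(len(ranked) * fraction))]
```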
Ankerl also proposes his own formula:
tanh(curVal/curWords*200) - 5*tanh((allVal-curVal)/(allWords-curWords)*200)
Where:
curVal: How often the word to score is present in the to-be-analyzed text
curWords: Total number of words in the to-be-analyzed text
allVal: How often the word to score is present in the indexed dataset
allWords: Total number of words of the indexed dataset
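Translated directly into Python:

```python
import math

def ankerl_score(cur_val, cur_words, all_val, all_words):
    """Ankerl's formula with the variables as defined above."""
    return (math.tanh(cur_val / cur_words * 200)
            - 5 * math.tanh((all_val - cur_val) / (all_words - cur_words) * 200))
```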
Both algorithms work pretty well, and results often coincide. Do you know any way to do it better?

Grouping similar sets algorithm

I have a search engine. The search engine generates results when a keyword is searched for. What I need is to find all the other keywords that generate similar results.
For example, keyword k1 gives the result set R1 = {1, 2, 3, 4, 5, ..., 40}, which contains up to 40 document ids. I need to get a list K1 of all other keywords that generate results similar to those of k1.
The similarity S(R1, R2) between two result sets R1 and R2 is computed as follows:
S(R1, R2) = 2 * (number of elements in both R1 and R2) / ((total number of elements in R1) + (total number of elements in R2)). Example: R1 = {1,2,3} and R2 = {2,3,4,5} gives S(R1, R2) = 2 * |{2,3}| / (|{1,2,3}| + |{2,3,4,5}|) = (2*2)/(3+4) = 4/7 ≈ 0.57.
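This is the Dice coefficient on sets; a short Python sketch:

```python
def similarity(r1, r2):
    """Dice coefficient between two result sets."""
    r1, r2 = set(r1), set(r2)
    return 2 * len(r1 & r2) / (len(r1) + len(r2))

print(similarity({1, 2, 3}, {2, 3, 4, 5}))   # 0.571...
```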
There are more than 100,000 keywords and thus more than 100,000 result sets. So far I have only been able to solve this problem the hard way, in O(N^2), where each result set is compared to every other set. This takes a lot of time.
Is there someone with a better idea?
Some similar posts which do not solve the problem completely:
How to store sets, to find similar patterns fast?
efficient algorithm to compare similarity between sets of numbers?
One question: are the results in sorted order?
Something that came to mind: combine both sets, sort, and find the duplicates. That can be reduced to O(n log n).
To keep the problem simple, suppose that every keyword has 10 results and k1 is the keyword to be compared against. Remove 9 results at random from the set of each keyword. Now compare the remaining result with k1's; the keywords with a matching remaining result are the ones you want. If a keyword has 1 result in common with k1, there is only a 1% probability that it will remain. A keyword with 5 results in common with k1 will have a 25% probability of remaining. Maybe you will think that 1% is too big; then you can repeat the process above n times, and a keyword with 1 result in common will have a (1%)^n probability of remaining.
The time is O(N).
Is your similarity criterion fixed, or can we apply a bit of variety to achieve a faster search?
Alternative:
An alternative that came to my mind:
Given your result set R1, you could go through its documents and create a histogram over the other keywords those documents are matched to. Then, if a given alternative keyword gets, say, at least #R1/2 hits, you list it as "similar".
The big difference is that you do not consider documents that are not in R1 at all.
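A sketch of that histogram idea, assuming you already have a forward map from document id to the keywords matching it (the doc_keywords map below is hypothetical):

```python
from collections import Counter

def similar_keywords(r1, doc_keywords, threshold=0.5):
    """Count, for each other keyword, how many documents of R1 it matches,
    and keep those hitting at least threshold * |R1| of them."""
    hits = Counter()
    for doc_id in r1:
        hits.update(doc_keywords.get(doc_id, ()))
    return [kw for kw, h in hits.items() if h >= threshold * len(r1)]
```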
Exact?
If you need a solution that exactly satisfies your requirements, I believe it would suffice to compute the R2 sets only for those keywords that satisfy the above "alternative" criterion. I think (mathematical proof needed!) that if the "alternative" criterion is not satisfied, there is no chance that yours will be.

Resources