ElasticSearch - clarification around `top_terms_boost_N` rewrite parameter

After reading the documentation, I haven't quite wrapped my head around how the rewrite parameter top_terms_boost_N works.
From the documentation:
Assigns each matching document a relevance score equal to the boost
parameter.
This method changes the original query to a bool query. This bool
query contains a should clause and term query for each matching term.
The final bool query only includes term queries for the top N terms.
You can use this method to avoid exceeding the clause limit in the
indices.query.bool.max_clause_count setting.
Example. Say I'm querying with the string A B C D E F G, and I set N to 2, so top_terms_boost_2 is used as the rewrite parameter.
Does that mean Elastic will take two of the terms from the provided string, e.g. C and F?
How does Elasticsearch decide what constitutes a "top" term? Is it term frequency?
Why is it important that Elasticsearch assigns each matching document a relevance score equal to the boost parameter?

Related

Search on ngram tokenizer and minimum score pass and result should get accordingly

I need to search in Elasticsearch using, let's say, a trigram tokenizer, so `elastic` is split into the tokens ela, las, ast, sti, tic, and those tokens should be searched. A minimum score of 60% is given, meaning at least 3 of the tokens should match, so a result like elast would qualify.

Find out if any aggregations will have buckets without doing aggregations

Aggregations in Elasticsearch are pretty expensive. Before actually computing aggregations, I would like to find out if an aggregation would have non-zero count.
For example, say based on my query, N documents are returned. Now I want to find out whether, if I aggregate these N documents over a certain field, that aggregation will have any buckets. If the field is null or an empty string for all documents, it should return false or 0. If even a single document has a non-empty, non-null value for the field, it should return true or a non-zero number. I don't really care about the exact count.
Is it possible to do in a way which is much faster than computing aggregation?

relevant text search in a large text file

I have a large text document and a search query (e.g. rock climbing). I want to return the 5 most relevant sentences from the text. What approaches can be followed? I am a complete newbie in this text-retrieval domain, so any help is appreciated.
One approach I can think of is :
Scan the file sentence by sentence, look for the whole search query in the sentence, and if it matches, return the sentence.
The above approach works only if some of the sentences contain the whole search query. What to do if no sentence contains the whole query and some sentences contain just one of the words? Or what if they contain none of the words?
Another question I have is can we preprocess the text document to make building index easier? Is trie a good data structure for preprocessing?
In general, relevance is something you define using some sort of scoring function. I will give you an example of a naive scoring algorithm, as well as one of the common search engine ranking algorithms (used for documents, but I modified it for sentences for educational purposes).
Naive ranking
Here's an example of a naive ranking algorithm. The ranking could go as simple as:
Sentences are ranked based on the average proximity between the query terms (e.g. the biggest number of words between all possible query term pairs), meaning that a sentence "Rock climbing is awesome" is ranked higher than "I am not a fan of climbing because I am lazy like a rock."
More word matches are ranked higher, e.g. "Climbing is fun" is ranked higher than "Jogging is fun."
Pick alphabetical or random favorites in case of a tie, e.g. "Climbing is life" is ranked higher than "I am a rock."
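A minimal sketch of this naive ranking in Python (the exact weighting of match count vs. proximity is an arbitrary choice for illustration):

```python
def naive_score(sentence, query):
    """Toy scorer: counts distinct matched query terms and rewards proximity."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    terms = set(query.lower().split())
    positions = [i for i, w in enumerate(words) if w in terms]
    matches = len({words[i] for i in positions})
    if matches == 0:
        return 0.0
    # proximity bonus: a smaller span between first and last match is better
    span = positions[-1] - positions[0] if len(positions) > 1 else 0
    return matches + 1.0 / (1 + span)

sentences = [
    "Rock climbing is awesome",
    "I am not a fan of climbing because I am lazy like a rock",
    "Climbing is fun",
    "Jogging is fun",
]
ranked = sorted(sentences, key=lambda s: naive_score(s, "rock climbing"),
                reverse=True)
```

With this toy scorer, "Rock climbing is awesome" (two matches, adjacent) outranks the "lazy like a rock" sentence (two matches, far apart), which in turn outranks "Climbing is fun" (one match).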
Some common search engine ranking
BM25
BM25 is a good, robust algorithm for scoring documents in relation to a query. For reference, here's the Wikipedia article about the BM25 ranking algorithm. You would want to modify it a little because you are dealing with sentences, but you can take a similar approach by treating each sentence as a 'document'.
Here it goes. Assuming your query consists of keywords q1, q2, ... , qm, the score of a sentence S with respect to the query Q is calculated as follows:
SCORE(S, Q) = SUM(i=1..m) IDF(qi) * f(qi, S) * (k1 + 1) / (f(qi, S) + k1 * (1 - b + b * |S| / AVG_SENT_LENGTH))
k1 and b are free parameters (could be chosen as k1 in [1.2, 2.0] and b = 0.75 -- you can find some good values empirically), f(qi, S) is the term frequency of qi in a sentence S (could treat it as just the number of times the term occurs), |S| is the length of your sentence (in words), and AVG_SENT_LENGTH is the average sentence length of your sentences in a document. Finally, IDF(qi) is the inverse document frequency (or, in this case, inverse sentence frequency) of qi, which is usually computed as:
IDF(qi) = log ((N - n(qi) + 0.5) / (n(qi) + 0.5))
Where N is the total number of sentences, and n(qi) is the number of sentences containing qi.
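Putting the two formulas together, a sentence-level BM25 scorer could be sketched like this. One tweak not in the formula above: a +1 is added inside the IDF log (a common smoothing variant, used e.g. by Lucene) so that very common terms don't get a negative IDF:

```python
import math

def bm25_scores(sentences, query, k1=1.5, b=0.75):
    """Score each sentence against the query with BM25,
    treating each sentence as a 'document'."""
    tokenized = [s.lower().split() for s in sentences]
    N = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / N
    terms = query.lower().split()
    # n(qi): number of sentences containing each query term
    n = {q: sum(1 for t in tokenized if q in t) for q in terms}
    scores = []
    for t in tokenized:
        score = 0.0
        for q in terms:
            f = t.count(q)  # term frequency f(qi, S)
            # +1 inside the log keeps the IDF non-negative (smoothing variant)
            idf = math.log((N - n[q] + 0.5) / (n[q] + 0.5) + 1)
            score += idf * f * (k1 + 1) / (
                f + k1 * (1 - b + b * len(t) / avg_len))
        scores.append(score)
    return scores
```

A sentence containing both query terms scores highest, one containing a single term scores lower, and one containing neither scores 0.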
Speed
Assume you don't store inverted index or any additional data structure for fast access.
These are the terms that can be pre-computed: N and AVG_SENT_LENGTH.
First, notice that the more query terms a sentence matches, the higher it will score (because of the sum over terms). So if you rank only the top k candidate sentences, you really need to compute the values f(qi, S), |S|, and n(qi), which will take O(AVG_SENT_LENGTH * m * k) time, or, if you are ranking all the sentences in the worst case, O(DOC_LENGTH * m) time, where k is the number of sentences that have the highest number of terms matched and m is the number of query terms. Each sentence is about AVG_SENT_LENGTH words long, and you scan it once per query term for each of the k sentences.
Inverted index
Now let's look at an inverted index to allow fast text searches. We will treat your sentences as documents for educational purposes. The idea is to build a data structure for your BM25 computations. We will need to store term frequencies using inverted lists:
word_i: (sent_id_1, tf_1), (sent_id_2, tf_2), ..., (sent_id_k, tf_k)
Basically, you have a hashmap where each key is a word and its value is a list of pairs (sent_id_j, tf_j) holding the ids of the sentences that contain the word and the word's frequency in each. For example, it could be:
rock: (1, 1), (5, 2)
This tells us that the word rock occurs in the first sentence 1 time and in the fifth sentence 2 times.
This pre-processing step will allow you to get O(1) access to the term frequencies for any particular word, so it will be fast as you want it.
Also, you would want to have another hashmap to store sentence length, which should be a fairly easy task.
How to build inverted index? I am skipping stemming and lemmatization in your case, but you are welcome to read more about it. In short, you traverse through your document, continuously creating pairs/increasing frequencies for your hashmap containing the words. Here are some slides on building the index.
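The traversal described above could be sketched like this (skipping stemming and lemmatization, as noted):

```python
from collections import defaultdict

def build_index(sentences):
    """Build an inverted index: word -> sorted list of (sentence_id, tf) pairs,
    plus a map of sentence lengths for the BM25 length normalization."""
    counts = defaultdict(lambda: defaultdict(int))
    lengths = {}
    for sid, sentence in enumerate(sentences, start=1):
        words = sentence.lower().split()
        lengths[sid] = len(words)
        for w in words:
            counts[w][sid] += 1  # increase frequency for this (word, sentence)
    # flatten to the (sent_id, tf) pair lists shown above
    index = {w: sorted(tfs.items()) for w, tfs in counts.items()}
    return index, lengths

index, lengths = build_index([
    "rock climbing is fun",
    "I like a good rock",
    "this rock is a rock",
])
```

Here `index["rock"]` is `[(1, 1), (2, 1), (3, 2)]`: the word rock occurs once in sentences 1 and 2, and twice in sentence 3, matching the inverted-list format above.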

Extended Boolean Model explanation?

We are implementing the extended Boolean model, but we cannot figure out how to use the formula given here: http://en.wikipedia.org/wiki/Extended_Boolean_model The formula
contains three "variables", but we have no clue what they mean. Assume we have already processed the collection of documents, so we have mapped all words in the collection, and for each term we have the count of occurrences in each document as well as the count of occurrences (of the concrete term) in the whole collection.
It says right there: "The weight of term Kx associated with document dj".
So we are talking about term 'x' and document 'j'. 'i' is the index that maximizes Idf_i (the term with the highest inverse document frequency).
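Following the Wikipedia formulation the question links, the weight is the term's frequency in the document normalized by the document's maximum term frequency, times its Idf normalized by the maximum Idf over the terms. A sketch (the function and argument names are my own, not from the article):

```python
import math

def term_weights(doc_term_freqs, doc_freqs, num_docs):
    """Compute w_{x,j} = (f_{x,j} / max_i f_{i,j}) * (Idf_x / max_i Idf_i)
    for one document, per the Extended Boolean model.
    doc_term_freqs: {term: occurrences in this document}
    doc_freqs:      {term: number of documents containing the term}
    """
    # Idf_x = log(N / n_x); assumes every term occurs in at least one document
    idf = {t: math.log(num_docs / doc_freqs[t]) for t in doc_term_freqs}
    max_tf = max(doc_term_freqs.values())
    max_idf = max(idf.values())  # assumes at least one term has Idf > 0
    return {t: (tf / max_tf) * (idf[t] / max_idf)
            for t, tf in doc_term_freqs.items()}
```

For example, in a 4-document collection where "rock" occurs twice in document j and in 2 documents overall, while "climbing" occurs once in j and in 1 document overall, both terms end up with weight 0.5.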

Binary String Search - minimum bin width?

I happen to be building the binary search in Python, but the question has more to do with binary search structure in general.
Let's assume I have about one thousand eligible candidates I am searching through using binary search, taking the classic approach of bisecting the sorted dataset and repeating this process to narrow down the eligible set to iterate over. The candidates are just strings of names (first-last format, e.g. "Peter Jackson"). I initially sort the set alphabetically and then proceed with bisection using something like this:
hi = len(names)
lo = 0
while lo < hi:
    mid = (lo + hi) // 2
    midval = names[mid].lower()
    if midval < query.lower():
        lo = mid + 1
    elif midval > query.lower():
        hi = mid
    else:
        return midval
return None
This code adapted from here: https://stackoverflow.com/a/212413/215608
Here's the thing: the above procedure assumes a single exact match or no result at all. What if the query was merely for a "Peter", but there are several Peters with differing last names? In order to return all the Peters, one would have to ensure that the bisected "bins" never got so small as to exclude eligible results. The bisection process would have to cease and cede to something like a regex or plain string match in order to return all the Peters.
I'm not so much asking how to accomplish this as what this type of search is called... what is a binary search with a delimited criterion for "bin size" called? Something that conditionally bisects the dataset and, once the criterion is fulfilled, falls back to some other form of string matching to ensure there can effectively be a trailing wildcard on the query (so a search for "Peter" will get "Peter Jacksons" and "Peter Edwards").
Hopefully I've been clear what I mean. I realize in the typical DB scenario the names might be separated, this is just intended as a proof of concept.
I've not come across this type of two-stage search before, so don't know whether it has a well-known name. I can, however, propose a method for how it can be carried out.
Let's say you've run the first stage and have found no match.
You can perform the second stage with a pair of binary searches and a special comparator. The binary searches would use the same principle as bisect_left and bisect_right. You won't be able to use those functions directly since you'll need a special comparator, but you can use them as the basis for your implementation.
Now to the comparator. When comparing the list element x against the search key k, the comparator would only use x[:len(k)] and ignore the rest of x. Thus when searching for "Peter", all Peters in the list would compare equal to the key. Consequently, bisect_left() to bisect_right() would give you the range containing all Peters in the list.
All of this can be done using O(log n) comparisons.
In your binary search you either hit an exact match or an area where the match would be.
So in your case you need to get the upper and lower boundaries (hi and lo, as you call them) of the area that would include "Peter" and return all the intermediate strings.
But if you aim to do something like suggesting the next words after a given word, you should look into tries instead of BSTs.