Speeding up elasticsearch more_like_this query

I was interested in fetching similar documents for a given input document (similar to kNN). Vectorizing documents of very different sizes (using doc2vec) would result in inconsistent document vectors, and computing a vector for the user's input (which may be just a few terms or sentences, compared to the docs the doc2vec model was trained on, where each doc consists of hundreds or thousands of words) and then finding the k nearest neighbours would produce incorrect results due to the lack of features.
Hence, I went ahead with the more_like_this query, which does a similar job to kNN irrespective of the size of the user's input, since I'm only interested in analyzing text fields.
But I was concerned about performance when I have millions of documents indexed in Elasticsearch. The documentation says that using term_vector to store the term vectors at index time can speed up the analysis.
But what I don't understand is which type of term vector the documentation refers to in this context, as there are three different types: term information, term statistics, and field statistics.
Since term statistics and field statistics reflect the frequency of terms with respect to other documents in the index, wouldn't these vectors become outdated when I introduce new documents into the index?
Hence I presume that the more_like_this documentation refers to the term information (which is the information of the terms in one particular document irrespective of the others).
Can anyone let me know if computing only the term information vector at the index time is sufficient to speed up more_like_this?

There shouldn't be any worries about term vectors being outdated, since they are stored per document, so they are updated along with the document itself.
For more_like_this it is enough to have term_vector: yes; you don't need offsets and positions. So, if you don't plan on using highlighting, you should be fine with plain yes.
So, for your text field, you would need a mapping like this, and it will be enough to speed up MLT execution:
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "term_vector": "yes"
      }
    }
  }
}
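For reference, a minimal more_like_this query against that field could then look like the following sketch (the index name my-index and the sample input text are placeholders, not from the original question):
GET my-index/_search
{
  "query": {
    "more_like_this": {
      "fields": ["text"],
      "like": "a few terms or sentences supplied by the user",
      "min_term_freq": 1,
      "min_doc_freq": 1
    }
  }
}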

Related

How to find if a document is a good match for a query, e.g., normalize elasticsearch score?

The score computed by Elasticsearch provides a ranking between the documents, but it does not tell if the documents are a good match for the request. Currently, the first document can either match on all fields or just one. The only information that the score provides is that it is the best match.
Would it be possible to get a normalized score with respect to the query? For example, a score of 1 would be a document matching the query perfectly and a score of 0.1 a document matching it poorly.
In short, no, it is not possible to get a real normalized score for a query, but it is possible to get a good enough score normalization that works in many cases.
The problem with getting a score that tells whether a document is a good match for a query is finding what the best possible document for that query would be, and consequently the maximum score. With Elasticsearch and most (if not all) similarity metrics, the maximum score is not bounded.
Even with a simple match query, you can technically reach an infinite score with a document that repeats the queried term an infinite number of times. Without a bound on the score, it is not possible to get a true normalized score.
But all hope is not lost. Instead of normalizing against the best possible score, you can normalize against a fake ideal document which is supposed to get the maximum score. For example, if you are querying two fields name and occupation with the queried terms Jane Doe and Cook, your ideal document can be:
{
  "name": "Jane Doe",
  "occupation": "Cook"
}
If the index contains a document with, for example, the name Jane Jane Doe, then the ideal document may not get the maximum score. If the queried fields are relatively short, you probably do not have to worry about term duplication. If you have fields with many terms, you may decide to duplicate some terms that are frequent in the ideal document. If the objective is to find whether the document is a good match or not, it is usually not a problem to have a document scored higher than the ideal document.
The good news is that if you are using at least elasticsearch 6.4 you do not have to index the fake document to get its score for a query. You may use the endpoint _scripts/painless/_execute to obtain the score of the ideal document.
GET _scripts/painless/_execute
{
  "script": {
    "source": "_score"
  },
  "context": "score",
  "context_setup": {
    "index": <INDEX>,
    "document": <THE_IDEAL_DOCUMENT>,
    "query": <YOUR_QUERY>
  }
}
Please note that the field statistics of the fake document, such as the number of documents containing a field and the number of documents containing the queried term, will be taken into account when computing the score. If you have many documents this should not be a problem, but for a very infrequent field or term (appearing in fewer than, say, 20 documents) you may notice a lower score for the ideal document compared to a previously indexed document.
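As an illustration, here is what that call might look like for the Jane Doe example above; the index name people and the bool/match query are hypothetical, not part of the original answer. The normalized score of a real hit is then simply its _score divided by the score returned by this call.
GET _scripts/painless/_execute
{
  "script": {
    "source": "_score"
  },
  "context": "score",
  "context_setup": {
    "index": "people",
    "document": {
      "name": "Jane Doe",
      "occupation": "Cook"
    },
    "query": {
      "bool": {
        "should": [
          { "match": { "name": "Jane Doe" } },
          { "match": { "occupation": "Cook" } }
        ]
      }
    }
  }
}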

"Term Vector API" clarification required

I'm not sure if I've understood the Term Vectors API correctly.
The document starts by saying:
Returns information and statistics on terms in the fields of a particular document. The document could be stored in the index or artificially provided by the user. Term vectors are realtime by default, not near realtime. This can be changed by setting realtime parameter to false.
I'm guessing term here refers to what some other people would call a token, maybe? Or is term defined by the time we get here in the documentation and I've missed it?
Then the document continues by saying there are three sections to the return value: term information, term statistics, and field statistics. I guess that means term information and statistics are not the only things this API returns, correct?
Then Term information includes a field called payloads, which is not defined and I have no idea what it means.
Then in Field statistics, there is sum of document frequencies and sum of total term frequencies with a rather confusing explanation:
Setting field_statistics to false (default is true) will omit :
document count (how many documents contain this field)
sum of document frequencies (the sum of document frequencies for all terms in this field)
sum of total term frequencies (the sum of total term frequencies of each term in this field)
I guess they are simply the sum over their corresponding values reported in term statistics?
Then in the section Behavior it says:
The term and field statistics are not accurate. Deleted documents are not taken into account. The information is only retrieved for the shard the requested document resides in. The term and field statistics are therefore only useful as relative measures whereas the absolute numbers have no meaning in this context. By default, when requesting term vectors of artificial documents, a shard to get the statistics from is randomly selected. Use routing only to hit a particular shard.
So which one is it? Realtime or not? Or is it that term information is realtime and term statistics and field statistics are merely an approximation of the reality?
I'm guessing term here refers to what some other people would call a token, maybe? Or is term defined by the time we get here in the documentation and I've missed it?
term and token are synonyms and simply mean whatever came out of the analysis process and has been indexed in the Lucene inverted index.
Then the document continues by saying there are three sections to the return value: term information, term statistics, and field statistics. I guess that means term information and statistics are not the only things this API returns, correct?
By default, the call returns term information and field statistics, but term statistics have to be requested explicitly with &term_statistics=true.
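For example, assuming a reasonably recent Elasticsearch version, a hypothetical index my-index, a document with id 1 and a field named text, such a request could look like:
GET /my-index/_termvectors/1?fields=text&term_statistics=true&field_statistics=true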
Then Term information includes a field called payloads, which is not defined and I have no idea what it means.
payload is a Lucene concept, which is pretty well explained here. Term payloads are not available unless you have a custom analyzer that makes use of a delimited-payload token filter to extract them.
Then in Field statistics, there is sum of document frequencies and sum of total term frequencies with a rather confusing explanation:
[...]
I guess they are simply the sum over their corresponding values reported in term statistics?
The sum of "document frequencies" is the number of times each term present in the field appears in the same document. So if the field contains "big brown fox", it will count the number of times "big" appears in the same document, the number of times "brown" appears in the same document and the same for "fox".
The sum of "total term frequencies" is the number of times each term present in this field appears in all documents present in the Lucene index (which is located on a single shard of an ES index). So if the field contains "big brown fox", it will count the number of times "big" appears in all documents, the number of times "brown" appears in all documents and the same for "fox".
So which one is it? Realtime or not? Or is it that term information is realtime and term statistics and field statistics are merely an approximation of the reality?
It is realtime by default, which means that a refresh call is made when issuing the _termvectors call in order to get fresh information from the Lucene index. However, statistics are gathered only from a single shard, which does not give an overall view of the statistics of the whole ES index (potentially made of several shards, hence several Lucene indexes).

Solr Boosting Logic Concepts

I'm trying to understand boosting and if boosting is the answer to my problem.
I have an index and that has different types of data.
E.g.: index Animals. One of the fields is animaltype. Its value can be Carnivorous, Herbivorous, etc.
Now when we query in search, I want to show results of type Carnivorous at the top, and then the Herbivorous type.
Also would it be possible to show only say top 3 results from a type and then remaining from other types?
Let's assume that for the herbivorous type we have a field named vegetables. This will have values only for a herbivorous animaltype.
Now, can it be possible to have boosting rules specified as follows:
Boost Levels:
animaltype:Carnivorous
then animaltype:Herbivorous and vegetables:spinach
then animaltype:Herbivorous and vegetables:carrot
etc. Basically, boosting on various fields at various levels. I'm new to this concept. It would be really helpful to get some inputs/guidance.
Thanks,
Kasturi Chavan
Your example is closer to sorting than boosting, as you have a priority list for how important each document is - while boosting (in Solr) is usually applied a bit more fluidly, meaning that there is no hard line between documents of type X and type Y.
However - boosting with appropriately large values will in effect give you the same result, putting the documents into different score "areas" which will then give you the sort order you're looking for. You can see the score contributed by each term by appending debugQuery=true to your query. Boosting says that 'a document with this value is z times more important than those with a different value', but if the document only contains low scoring tokens from the search (usually words that are very common), while other documents contain high scoring tokens (words that are infrequent), the latter document might still be considered more important.
Example: Searching for "city paris", where most documents contain the word 'city', but only a few contain the word 'paris' (and do not contain 'city'). Even if you boost all documents assigned to country 'germany', the score contributed by 'city' might still be lower, even with the boost factor, than what 'paris' contributes alone. This might not occur in real life, but you should know what the boost actually changes.
Using the edismax handler, you can apply the boost in two different ways - one is to use boost=, which is multiplicative, or to use either bq= or bf=, which are additive. The difference is how the boost contributes to the end score.
For your example, the easiest way to get something similar to what you're asking, is to use bq (boost query):
bq=animaltype:Carnivorous^1000&
bq=animaltype:Herbivorous^10
These boosts will probably be large enough to move all documents matching these queries into their own buckets, without moving between groups. To create "different levels" as your example shows, you'll need to tweak these values (and remember, multiple boosts can be applied to the same document if something is both herbivorous and eats spinach).
A different approach would be to create a function query using query, if and similar functions to result in a single integer value that you can use as a sorting value. You can also calculate this value when indexing the document if it's static (which your example is), and then sort by that field instead. It will require you to reindex your documents if the sorting values change, but it might be an easy and effective solution.
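As a rough sketch of that idea (if and termfreq are standard Solr functions, but the exact field names and values here are assumptions based on the example above, untested):
sort=if(termfreq(animaltype,'Carnivorous'),2,if(termfreq(animaltype,'Herbivorous'),1,0)) desc, score desc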
To achieve the "Top 3 results from a type" you're probably going to want to look at Result grouping support - which makes it possible to get "x documents" for each value in a single field. There is, as far as I know, no way to say "I want three of these at the top, then the rest from other values", except for doing multiple queries (and excluding the three you've already retrieved from the second query). Usually issuing multiple queries works just as fine (or better) performance wise.
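For completeness, a minimal grouping request for the "top 3 per type" part could look like this (the query and parameter values are illustrative):
q=lion&group=true&group.field=animaltype&group.limit=3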

Terms aggregation performance high cardinality

I'm hoping for some clarification on query performance I observed recently in Elasticsearch 1.5.2.
I have a string field with high cardinality (approx 200,000,000).
I observed that if I use a simple terms aggregation with execution hint global_ordinals_low_cardinality, two things happen:
1. The query returns same results as with global_ordinals, or global_ordinals_hash.
2. The query performs significantly faster (about twice as fast as global_ordinals, and four times as fast as global_ordinals_hash).
Here's the query:
{
  "aggs": {
    "myTerms": {
      "terms": {
        "field": "myField",
        "size": 1000,
        "shard_size": 1000,
        "execution_hint": "global_ordinals_low_cardinality"
      }
    }
  }
}
I don't understand why it's even legitimate to use global_ordinals_low_cardinality in this instance, because my field has a high cardinality. So perhaps I don't understand what exactly global_ordinals_low_cardinality means?
Secondly, I have another numerical field (long), with roughly same cardinality value.
The values of the long field are actually precomputed hash values (murmur3) for the same string field from above, which I use to greatly speed up the cardinality aggregation.
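(As an aside, one common way to precompute such hashes is a murmur3 sub-field; this is only a sketch, the field names and types are illustrative, and in recent versions the murmur3 type requires the mapper-murmur3 plugin:)
{
  "mappings": {
    "properties": {
      "myField": {
        "type": "keyword",
        "fields": {
          "hash": {
            "type": "murmur3"
          }
        }
      }
    }
  }
}
The cardinality aggregation would then run on myField.hash.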
Running the same terms aggregation on the numerical field performs as bad as global_ordinals_hash.
In fact, it doesn't matter what execution hint I use, the execution time remains the same.
So why is global_ordinals_low_cardinality applicable for string types, but not for long types? Is it because numerical fields do not require global ordinals at all?
Thanks
I think both the official documentation and the source code shed some light on this. First off, one thing that needs to be mentioned is that execution_hint is exactly what its name says, i.e. just a hint that ES will try to honor, but might not in all cases if it deems it not appropriate.
So, the mere fact that you have a high cardinality field precludes the use of global_ordinals_low_cardinality since:
global_ordinals_low_cardinality is only enabled by default on low-cardinality fields.
As for global_ordinals_hash, it is mainly used in inner terms aggregations (not the case here), and map is only used when running an aggregation on scripts and few documents match the query (not the case here either).
So it leaves you with one choice, i.e. global_ordinals which is the default hint used on top-level terms aggregations.
As mentioned earlier, execution_hint is just a hint you specify, but ES will try its best to pick the right execution mode for your data no matter what. Looking into the source code sheds some light on a few things:
Starting here, you'll see that:
on line 201, your hint is read, and then might get overridden if the field doesn't support global ordinals (lines 205-241)
if your field is numeric, execution_hint is completely ignored, and this should probably answer your question.

Performance based on length of "terms"

I have the following criterion in the query. The terms list for seen_by can grow significantly large. There are also a couple of similar kinds of lists in the "must_not" clause, and those can grow large too.
{
  "terms": {
    "seen_by": [
      "54",
      "3",
      "418",
      "411",
      "1",
      "101"
    ]
  }
}
What will be the performance difference if the list of terms in conditions grows or shrinks?
It's difficult to answer this question without knowing details about your data size, term distribution and queries. In general, the number of terms contributes linearly to search time. Basically, the search engine has to pull a list of documents for every term in your query. Because of this, it's typically not recommended to execute queries with a very large number of terms, and Elasticsearch actually limits the number of clauses in boolean queries to 1024 (this can be changed using the indices.query.bool.max_clause_count setting).
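For reference, that limit is a static node-level setting, so raising it (here to an arbitrary 2048) means editing elasticsearch.yml on each node and accepting the extra memory and CPU cost per query:
# elasticsearch.yml
indices.query.bool.max_clause_count: 2048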
