Terms aggregation performance high cardinality - performance

I'm hoping for some clarifications for query performance in elastic 1.5.2 I observed recently.
I have a string field with high cardinality (approx 200,000,000).
I observed that if I use a simple terms aggregation with execution hint global_ordinals_low_cardinality, two things happen:
1. The query returns same results as with global_ordinals, or global_ordinals_hash.
2. The query performs significantly faster. (about twice as fast as global_ordinals, and 4 times as fast as global_ordinals_hash.
Here's the query:
{
"aggs": {
"myTerms": {
"terms": {
"field": "myField",
"size": 1000,
"shard_size": 1000,
"execution_hint": "global_ordinals_low_cardinality"
}
}
}
}
I don't understand why its even legitimate to use global_ordinals_low_cardinality in this instance, because my field has a high cardinality. So perhaps I don't understand what exactly global_ordinals_low_cardinality means?
Secondly, I have another numerical field (long), with roughly same cardinality value.
The values of the long field are actually precomputed hash values (murmur3) for the same string field from above, which i use to greatly speedup cardinality aggregation.
Running the same terms aggregation on the numerical field performs as bad as global_ordinals_hash.
In fact, it doesn't matter what execution hint i use, execution time remains the same.
So why is global_ordinals_low_cardinality applicable for string types, but not for long types? Is it because numerical fields do not require global ordinals at all?
Thanks

I think both the official documentation and the source code shed some light on this. First off, one thing that needs to be mentioned is that execution_hint is exactly what its name says, i.e. just an hint that ES will try to honor, but might not in all cases if it deems not appropriate.
So, the mere fact that you have a high cardinality field precludes the use of global_ordinals_low_cardinality since:
global_ordinals_low_cardinality is only enabled by default on low-cardinality fields.
As for global_ordinals_hash, it is mainly used in inner terms aggregations (not the case here), and map is only used when running an aggregation on scripts and few documents match the query (not the cas either)
So it leaves you with one choice, i.e. global_ordinals which is the default hint used on top-level terms aggregations.
As mentioned earlier, execution_hint is just an hint you specify, but ES will try its best to pick the right execution mode for your data no matter what. Looking into the source code is enlightening and sheds some light on a few things:
Starting here, you'll see that:
on line 201, your hint is read, and then might get overridden if the field doesn't support global ordinals (lines 205-241)
if your field is numeric, execution_hint is completely ignored, and this should probably answer your question.

Related

Speeding up elasticsearch more_like_this query

I was interested in fetching similar documents for a given input document (similar to KNN). As vectorizing documents (using doc2vec) that are not similar in sizes would result in inconsistent document vectors, and then computing a vector for the user's input (which maybe just a few terms/sentences compared the docs on which the doc2vec model was trained on where each doc would consist of 100s or 1000s of words) trying to find k-Nearest Neighbours would produce incorrect results due to lack of features.
Hence, I went ahead with using more_like_this query, which does a similar job compared to kNN, irrespective of the size of the user's input, since I'm interested in analyzing only text fields.
But I was concerned about the performance when I have millions of documents indexed in elasticsearch. The documentation says that using term_vector to store the term vectors at the index time can speed up the analysis.
But what I don't understand is which type of term vector the documentation refers to in this context. As there are three different types of term vectors: term information, term statistics, and field statistics.
And term statistics and field statistics compute the frequency of the terms with respect to other documents in the index, wouldn't these vectors be outdated when I introduce new documents in the index.
Hence I presume that the more_like_this documentation refers to the term information (which is the information of the terms in one particular document irrespective of the others).
Can anyone let me know if computing only the term information vector at the index time is sufficient to speed up more_like_this?
There shouldn't be any worries about term vectors being outdated, since they are stored for every document, so they will be updated respectively.
For More Like This it will be enough just to have term_vectors:yes, you don't need to have offsets and positions. So, if you don't plan using highlighting, you should be fine with just default one.
So, for your text field, you would need to have mappings like this and it will be enough to speed up MLT execution:
{
"mappings": {
"properties": {
"text": {
"type": "text",
"term_vector": "yes"
}
}
}
}

How does Solr's spellcheck.collate influence performance?

The Solr documentation on spellchecking parameters states (emphasis mine):
spellcheck.collate
If true, this parameter directs Solr to take the best suggestion for
each token (if one exists) and construct a new query from the
suggestions. [...]
The spellcheck.collate parameter only returns collations that are guaranteed to result in hits if re-queried, even when applying
original fq parameters. This is especially helpful when there is more
than one correction per query.
This only returns a query to be used. It does not actually run the suggested query.
I'd imagine in order to decide if the corrected terms yield a result Solr still has to run a variant of the original query in the background. Sure, it can ignore most parts of the original query like grouping and does not have to compute the relevance of results, but it still will have to perform the whole filter query, stemming, fuzzy search etc.
So can I expect spellcheck.collate to have a performance impact depending on the complexity of my filter query and certain other parts of the original query?

Solr Boosting Logic Concepts

I'm trying to understand boosting and if boosting is the answer to my problem.
I have an index and that has different types of data.
EG: Index Animals. One of the fields is animaltype. This value can be Carnivorous, herbivorous etc.
Now when a we query in search, I want to show results of type carnivorous at top, and then the herbivorous type.
Also would it be possible to show only say top 3 results from a type and then remaining from other types?
Let assume for a herbivourous type we have a field named vegetables. This will have values only for a herbivourous animaltype.
Now, can it be possible to have boosting rules specified as follows:
Boost Levels:
animaltype:Carnivorous
then animaltype:Herbivorous and vegatablesfield: spinach
then animaltype:herbivoruous and vegetablesfield: carrot
etc. Basically boosting on various fields at various levels. Im new to this concept. It would really helpful to get some inputs/guidance.
Thanks,
Kasturi Chavan
Your example is closer to sorting than boosting, as you have a priority list for how important each document is - while boosting (in Solr) is usually applied a bit more fluent, meaning that there is no hard line between documents of type X and type Y.
However - boosting with appropriately large values will in effect give you the same result, putting the documents into different score "areas" which will then give you the sort order you're looking for. You can see the score contributed by each term by appending debugQuery=true to your query. Boosting says that 'a document with this value is z times more important than those with a different value', but if the document only contains low scoring tokens from the search (usually words that are very common), while other documents contain high scoring tokens (words that are infrequent), the latter document might still be considered more important.
Example: Searching for "city paris", where most documents contain the word 'city', but only a few contain the word 'paris' (but does not contain city). Even if you boost all documents assigned to country 'germany', the score contributed from city might still be lower - even with the boost factor than what 'paris' contributes alone. This might not occur in real life, but you should know what the boost actually changes.
Using the edismax handler, you can apply the boost in two different ways - one is to use boost=, which is multiplicative, or to use either bq= or bf=, which are additive. The difference is how the boost contributes to the end score.
For your example, the easiest way to get something similar to what you're asking, is to use bq (boost query):
bq=animaltype:Carnivorous^1000&
bq=animaltype:Herbivorous^10
These boosts will probably be large enough to move all documents matching these queries into their own buckets, without moving between groups. To create "different levels" as your example shows, you'll need to tweak these values (and remember, multiple boosts can be applied to the same document if something is both herbivorous and eats spinach).
A different approach would be to create a function query using query, if and similar functions to result in a single integer value that you can use as a sorting value. You can also calculate this value when indexing the document if it's static (which your example is), and then sort by that field instead. It will require you to reindex your documents if the sorting values change, but it might be an easy and effective solution.
To achieve the "Top 3 results from a type" you're probably going to want to look at Result grouping support - which makes it possible to get "x documents" for each value in a single field. There is, as far as I know, no way to say "I want three of these at the top, then the rest from other values", except for doing multiple queries (and excluding the three you've already retrieved from the second query). Usually issuing multiple queries works just as fine (or better) performance wise.

Top 10% of results with sort

I'm looking for a setup that actually returns the top 10% of results of a certain query. After the result we also want to sort the subset.
Is there an easy way to do this?
Can anyone provide a simple example for this.
I was thinking scaling the results scores between 0 and 1.0 and basically sepcifiying min_score to 0.9.
I was trying to create function_score queries but those seem a bit complex for a simple requirement such as this one, plus I was not sure how sorting would effect the results, since I want the sort functions work always on the 10% most relevant articles of course.
Thanks,
Peter
As you want to slice response in % of overall docs count, you need to know that anyway. And using from / size params will cut off the required amount at query time.
Assuming this, seems that easiest way to achieve your goal is to make 2 queries:
Filtered query with all filters, no queries and search_type=count to get overall document count.
Perform your regular matching query, applying {"from": 0, "size": count/10} with count got from 1st response.
Talking about tweaking the scoring. For me, it seems as bad idea, as getting multiple documents with the same score is pretty generic situation. So, cutting dataset by min_score will probably result in skewed data.

Difference between Elasticsearch Range Query and Range Filter

I want to query elasticsearch documents within a date range. I have two options now, both work fine for me. Have tested both of them.
1. Range Query
2. Range Filter
Since I have a small data set for now, I am unable to test the performance for both of them. What is the difference between these two? and which one would result in faster retrieval of documents and faster response?
The main difference between queries and filters has to do with scoring. Queries return documents with a relative ranked score for each document. Filters do not. This difference allows a filter to be faster for two reasons. First, it does not incur the cost of calculating the score for each document. Second, it can cache the results as it does not have to deal with possible changes in the score from moment to moment - it's just a boolean really, does the document match or not?
From the documentation:
Filters are usually faster than queries because:
they don’t have to calculate the relevance _score for each document — 
the answer is just a boolean “Yes, the document matches the filter” or
“No, the document does not match the filter”. the results from most
filters can be cached in memory, making subsequent executions faster.
As a practical matter, the question is do you use the relevance score in any way? If not, filters are the way to go. If you do, filters still may be of use but should be used where they make sense. For instance, if you had a language field (let's say language: "EN" as an example) in your documents and wanted to query by language along with a relevance score, you would combine a query for the text search along with a filter for language. The filter would cache the document ids for all documents in english and then the query could be applied to that subset.
I'm over simplifying a bit, but that's the basics. Good places to read up on this:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-filtered-query.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/query-dsl-filtered-query.html
http://exploringelasticsearch.com/searching_data.html
http://elasticsearch-users.115913.n3.nabble.com/Filters-vs-Queries-td3219558.html
Filters are cached so they are faster!
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/filter-caching.html

Resources