Limit and Offset in Terms Aggregation in Elasticsearch

There is a way to get the top n terms. For example:
{
  "aggs": {
    "apiSalesRepUser": {
      "terms": {
        "field": "userName",
        "size": 5
      }
    }
  }
}
Is there any way to set the offset for the terms result?

If you mean something like "ignore the first m results and return the next n results", then no, it is not possible. A workaround would be to set size to m + n and do client-side processing to ignore the first m results.
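For example, to skip the first 10 terms and keep the next 5 (the numbers are only illustrative), you could request 15 buckets and drop the first 10 in your application code:
{
  "aggs": {
    "apiSalesRepUser": {
      "terms": {
        "field": "userName",
        "size": 15
      }
    }
  }
}
Then ignore the first 10 buckets of apiSalesRepUser on the client side.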

A little late, but (at least) since Elasticsearch 5.2.0 you can use partitioning in the terms aggregation to paginate results.
https://www.elastic.co/guide/en/elasticsearch/reference/5.2/search-aggregations-bucket-terms-aggregation.html#_filtering_values_with_partitions
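A rough sketch of what that looks like, splitting the terms into 20 partitions and requesting partition 0 (the numbers are just examples):
{
  "aggs": {
    "apiSalesRepUser": {
      "terms": {
        "field": "userName",
        "include": {
          "partition": 0,
          "num_partitions": 20
        },
        "size": 5
      }
    }
  }
}
You then page through the terms by repeating the request with partition 1, 2, and so on.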

Maybe this helps a bit:
"aggregations": {
"apiSalesRepUser": {
"terms": {
"field": "userName",
"size": 9999 ---> add here a bigger size
}
},
"aggregations": {
"limitBucket": {
"bucket_sort": {
"sort": [],
"from": 10,
"size": 20,
"gap_policy": "SKIP"
}
}
}
}
I am not sure what value to put in the terms size (the 9999 above just stands for "a bigger size"); I would suggest something reasonable. That size limits the initial terms aggregation, and the limitBucket bucket_sort then paginates over those buckets. Elasticsearch will probably still build all the buckets requested by the terms aggregation in memory, so whether this is acceptable depends on your scenario, i.e. whether it is reasonable not to return all results (say you have tens of thousands of terms but are doing a Google-like search where nobody jumps to page 1000).
Compared to the alternative of fetching the data and paginating on the client side, this might save you some data transfer from ES, but weigh it carefully: it still loads a lot of data into Elasticsearch memory and you might run into memory issues in Elasticsearch.

Related

elasticsearch: count appearance of terms aggregation on other fields

I want to count how many times unique values (the result of a terms aggregation) appear in other fields, in the same query. Let's say:
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "unique_products": {
      "terms": {
        "field": "products.name.keyword",
        "min_doc_count": 10
      }
    }
  }
}
What I want is to count how many times each of the keys returned in the bucket appears in other fields.
My ideal output is:
"aggregations": {
"product_stat": {
"key": "<product_name>"
"sold": "<#>" #I want to know how many times the key is appeared in another field like sold
"bought": "<#>"
}
}
Elasticsearch cannot do terms aggregations over multiple fields. In short, if it could, aggregations would not be blazing fast.
As documentation suggests, there are two options:
use a script terms aggregation (with a performance penalty),
change how the documents are indexed so that a normal terms aggregation can be used.
Depending on the structure of your data and your use cases, you might get by with a complex aggregation plus some processing on the client side. This can be done with sub-aggregations like here, for example.
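For illustration only, assuming the documents also have keyword fields named sold and bought that contain product names (these field names are hypothetical), one sibling terms aggregation per field plus a client-side join on the keys might look like this:
{
  "size": 0,
  "aggs": {
    "unique_products": {
      "terms": {
        "field": "products.name.keyword",
        "min_doc_count": 10
      }
    },
    "sold_products": {
      "terms": {
        "field": "sold.keyword"
      }
    },
    "bought_products": {
      "terms": {
        "field": "bought.keyword"
      }
    }
  }
}
On the client you would then match the buckets of sold_products and bought_products against the keys of unique_products to build the product_stat output sketched above.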
Hope that helps!

ES: How do quasi-join queries using global aggregation compare to parent-child / nested queries?

At my work, I came across the following pattern for doing quasi-joins in Elasticsearch. I wonder whether this is a good idea, performance-wise.
The pattern:
It connects docs in one index in a one-to-many relationship.
It is somewhat like ES parent-child, but implemented without it.
Child docs need to be indexed with a field called e.g. "my_parent_id", with its value being the parent ID.
It can be used when querying for the parent, knowing its ID in advance, to also get the children in the same query.
The query with quasi-join (assume 123 is parent ID):
GET /my-index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "id": {
              "value": 123
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "my-global-agg": {
      "global": {},
      "aggs": {
        "my-filtering-all-but-children": {
          "filter": {
            "term": {
              "my_parent_id": 123
            }
          },
          "aggs": {
            "my-returning-children": {
              "top_hits": {
                "_source": {
                  "includes": [
                    "my_child_field1_to_return",
                    "my_child_field2_to_return"
                  ]
                },
                "size": 1000
              }
            }
          }
        }
      }
    }
  }
}
This query returns:
the parent (as search query result), and
its children (as the aggregation result).
Performance-wise, is the above:
definitively a good idea,
definitively a bad idea,
hard to tell / it depends?
It depends ;-) The idea is good; however, by default the maximum number of hits you can return in a top_hits aggregation is 100, and if you try 1000 you'll get an error like this:
Top hits result window is too large, the top hits aggregator [hits]'s from + size must be less than or equal to: [100] but was [1000]. This limit can be set by changing the [index.max_inner_result_window] index level setting.
As the error states, you can increase this limit by changing the index.max_inner_result_window index setting. But, if there's a default, there's usually a good reason. I would take that as a hint that it might not be that great an idea to increase it too much.
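If you do decide to raise it, the limit is an index-level setting, so something along these lines should work (shown here for the example index and the 1000 hits from the question):
PUT /my-index/_settings
{
  "index.max_inner_result_window": 1000
}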
So, if your parent documents have fewer than 100 children, why not; otherwise I'd seriously consider another approach.

How to get only x results from elastic and then stop searching?

My whole index is about 700M docs. This query:
{
  "query": {
    "term": {
      "SOME_FIELD": "SOME_TERM"
    }
  },
  "size": 10
}
matches about 5M docs. SOME_FIELD is indexed, not analyzed.
The query takes about 1s on an average Hetzner machine. Too slow :) I don't care about pagination, sorting, or scoring. I just want the first 10 "random" matching docs.
Is there a way to do it with scoring disabled, the "MySQL way"?
filter or constant_score do not help.
If you go with filters, that will remove the score computation and should provide faster query speeds:
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "SOME_FIELD": "SOME_TERM"
        }
      }
    }
  },
  "size": 10
}
If that's still too slow, you could consider using document routing, but it may not be a viable option for you as you might have just 1 shard or very few terms for SOME_FIELD.
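Purely as a sketch of what routing could look like, assuming the documents were indexed with a routing value and you query with the same one (the value here is hypothetical):
GET /my-index/_search?routing=SOME_TERM
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "SOME_FIELD": "SOME_TERM"
        }
      }
    }
  },
  "size": 10
}
This only helps if the documents were indexed with that same routing value, so the query has to hit only a single shard.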
I also suggest you go over the production deployment documentation by Elastic; it gives you an overview of how to configure your cluster optimally and can also produce a serious performance boost if you currently have a misconfigured cluster, e.g. running on a strong machine but keeping the default ES_HEAP_SIZE value.
The option I was looking for is "terminate_after". Unfortunately it is not "very well" documented, see:
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/query-dsl-limit-query.html
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/search-count.html#_request_parameters
So, my query looks like this:
{
  "query": {
    "term": {
      "SOME_FIELD": "SOME_TERM"
    }
  },
  "size": 10,
  "terminate_after": 10
}
Don't use "10" instead of 10. Elastic does not cast it to integer and ignores the parameter

Elasticsearch:: Sorting giving weird results

When I search for the first time, it sorts all documents and gives me the first 5 records. However, if the same search query is executed with the sort direction changed (ASC -> DESC), it does not sort all documents again; it takes the last 5 retrieved documents (from the previous search query), sorts them in descending order, and gives them back to me. I was expecting it to sort all available documents in DESC order and then retrieve the first 5 results.
Am I doing something wrong, or have I missed a concept?
My search query:
{
  "sort": {
    "taskid": {
      "order": "ASC"
    }
  },
  "from": 0,
  "size": 5,
  "query": {
    "filtered": {
      "query": {
        "match_all": []
      }
    }
  }
}
I have data with taskid 1 to 100. The above query fetched records with taskid 1 to 5 on the first attempt. When I changed the sort direction to desc, I expected documents with taskid 96-100 (in the sequence 100, 99, 98, 97, 96) to be returned; however, I got documents with taskid 5, 4, 3, 2, 1 in that sequence. Which means sorting was done only on the previously returned results.
Please note that taskid and _id are the same in my documents. I added a redundant field to my mapping which is the same as _id.
Just change the case of the value of the order key and you are good to go.
{
  "sort": {
    "taskid": {
      "order": "asc" // or "desc"
    }
  },
  "from": 0,
  "size": 5,
  "query": {
    "filtered": {
      "query": {
        "match_all": []
      }
    }
  }
}
Hope this helps.
In Elasticsearch, the sort is applied after the results are extracted from ES. As per the query mentioned in your question, the results are first filtered based on the search criteria, and then sorting is applied to the filtered result.
If it looks like you are only getting results based on an old subset of your data, then it may be that your newer data has not been indexed yet. This can happen easily in an automated test but with manual testing it is less likely.
The index is refreshed every second by default (making new segments searchable), so adding a delay/sleep of about a second between indexing and searching should fix your test if this is the problem.
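Alternatively, in a test you can force a refresh explicitly instead of sleeping; assuming the index is called my_index, something like:
POST /my_index/_refresh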

Elasticsearch get aggregation bucket size (number of elements in the bucket) without retrieving all data

I am trying to get information about an aggregation in Elasticsearch.
I have an index in which I store mail metadata (sender IP, subject, etc.). What I'm trying to do is get the number of IPs which sent over 1000 mails. (So, for example, let's say we have 3 IP addresses: 2000 mails are sent from the first IP, 1500 from the second, and 200 from the third. Then I want to see 2 as the aggregation result.) I wrote the following query:
GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "ipAddresses": {
      "terms": {
        "field": "senderIpAddress",
        "min_doc_count": 1000,
        "size": 0
      }
    }
  }
}
I can get the buckets and count them in my back-end implementation; however, I need to retrieve all the bucket data in order to do this. It is slow, and I want to get the bucket count without retrieving all the data.
TL;DR: how can I get the total number of aggregation buckets without retrieving the whole data set?
This is the purpose of the cardinality aggregation:
{
  "size": 0,
  "aggs": {
    "ipAddressesCount": {
      "cardinality": {
        "field": "senderIpAddress"
      }
    }
  }
}
Keep in mind that it is an approximation; the precision can be configured using precision_threshold, as documented in the link above.
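For example, a sketch with an explicit precision_threshold (the value 10000 here is just an illustration; higher values trade memory for accuracy):
{
  "size": 0,
  "aggs": {
    "ipAddressesCount": {
      "cardinality": {
        "field": "senderIpAddress",
        "precision_threshold": 10000
      }
    }
  }
}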
