doc_count sub aggregation script - elasticsearch

I have an aggregation that returns a number of buckets. I'd then like to run a script that analyzes all the doc_counts and groups them into categories. Can someone give me an example of how to do this?
For example:
"aggregations": {
“updates_by_account: {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "fbf60d008f0b2f9b3d1d8f1f7fe6e4262662a04c9bcbcc20d92316daade3c25c",
"doc_count": 2
},
{
"key": "916129338fb099792f7b1f414868d45c3fd1a0feb89e1bbeafe24bdb496bec0b",
"doc_count": 1
},
{
"key": "f1b256be780d983549e968f187daef882999fd05889dcab7f1c8c4769ed0996b",
"doc_count": 1
}
]
}
My query looks like this:
{
  "aggs": {
    "updates_by_account": {
      "terms": {
        "script": "doc['account_number'].value"
      }
    }
  }
}
I'd like to do something like:
Number of users with 0-5 updates: 4
Number of users with 6-10 updates: 7
Number of users with 11 or more updates: 12
etc
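One way to get these range counts entirely inside Elasticsearch is a scripted_metric aggregation: tally updates per account in the map phase, then group the per-account totals into ranges in the reduce phase. A minimal sketch, assuming account_number is the field used in the terms script above; the range boundaries and aggregation names are illustrative, and a scripted_metric returns a single object rather than real buckets:
{
  "size": 0,
  "aggs": {
    "updates_distribution": {
      "scripted_metric": {
        "init_script": "state.counts = new HashMap()",
        "map_script": """
          // tally one update per document, keyed by account (assumed field name)
          String acct = doc['account_number'].value;
          if (state.counts.containsKey(acct)) {
            state.counts.put(acct, state.counts.get(acct) + 1);
          } else {
            state.counts.put(acct, 1);
          }
        """,
        "combine_script": "return state.counts;",
        "reduce_script": """
          // merge the per-shard maps into global per-account totals
          Map totals = new HashMap();
          for (shardMap in states) {
            if (shardMap != null) {
              for (entry in shardMap.entrySet()) {
                if (totals.containsKey(entry.getKey())) {
                  totals.put(entry.getKey(), totals.get(entry.getKey()) + entry.getValue());
                } else {
                  totals.put(entry.getKey(), entry.getValue());
                }
              }
            }
          }
          // count how many accounts fall into each range
          Map ranges = ['0-5': 0, '6-10': 0, '11+': 0];
          for (count in totals.values()) {
            if (count <= 5) {
              ranges['0-5'] = ranges['0-5'] + 1;
            } else if (count <= 10) {
              ranges['6-10'] = ranges['6-10'] + 1;
            } else {
              ranges['11+'] = ranges['11+'] + 1;
            }
          }
          return ranges;
        """
      }
    }
  }
}
Like any scripted_metric, this builds the whole per-account map in memory and can be slow on large indices; if the terms aggregation fits in a single response, grouping the doc_counts client side is often simpler.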

Related

Calculate exact count of distinct values for combination of 2 fields in Elasticsearch

I have around 40 million records in my Elasticsearch index. I want to calculate the count of distinct values for a combination of 2 fields.
For example, given this set of documents:
[
  {
    "JobId": 2,
    "DesigId": 12
  },
  {
    "JobId": 2,
    "DesigId": 4
  },
  {
    "JobId": 3,
    "DesigId": 5
  },
  {
    "JobId": 2,
    "DesigId": 4
  },
  {
    "JobId": 3,
    "DesigId": 5
  }
]
For the above example, I should get count = 3, as only 3 distinct combinations exist:
[(2,12),(2,4),(3,5)]
I tried using a cardinality aggregation for this, but that provides an approximate count. I want to calculate the exact count.
Below is the query I used with the cardinality aggregation:
"aggs": {
"counts": {
"cardinality": {
"script": "doc['JobId'].value + ',' + doc['DesigId'].value",
"precision_threshold": 40000
}
}
}
I also tried using a composite aggregation on the combination of the 2 fields, paginating with the after key and counting the overall number of buckets (roughly as sketched below), but that process is really time-consuming and my query times out.
Is there any optimal way to achieve it?
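A sketch of that composite attempt, for reference: each page of buckets is counted client side, and the after_key returned by one response is passed back in the after parameter of the next request (the size value here is illustrative):
{
  "size": 0,
  "aggs": {
    "jobId_and_DesigId": {
      "composite": {
        "size": 10000,
        "sources": [
          { "JobId": { "terms": { "field": "JobId" } } },
          { "DesigId": { "terms": { "field": "DesigId" } } }
        ]
      }
    }
  }
}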
Scripting should be avoided as it affects performance. For your use case, there are 3 ways you can achieve the required result:
Using a composite aggregation (which you have already tried)
Using a multi_terms aggregation, though this is not a memory-efficient solution
Storing the combined field value at index time with a set ingest processor (the best method, described below)
Search Query:
{
  "size": 0,
  "aggs": {
    "jobId_and_DesigId": {
      "multi_terms": {
        "terms": [
          {
            "field": "JobId"
          },
          {
            "field": "DesigId"
          }
        ]
      }
    }
  }
}
Search Result:
"aggregations": {
"jobId_and_DesigId": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": [
2,
4
],
"key_as_string": "2|4",
"doc_count": 2
},
{
"key": [
3,
5
],
"key_as_string": "3|5",
"doc_count": 2
},
{
"key": [
2,
12
],
"key_as_string": "2|12",
"doc_count": 1
}
]
}
}
The third and best option is to store the combined field value (i.e., the combination of "JobId" and "DesigId") at index time. This is possible using a set processor:
PUT /_ingest/pipeline/concat
{
  "processors": [
    {
      "set": {
        "field": "combined_field",
        "value": "{{JobId}} {{DesigId}}"
      }
    }
  ]
}
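Before indexing anything, the pipeline can be sanity-checked with the simulate API (document values taken from the example above):
POST /_ingest/pipeline/concat/_simulate
{
  "docs": [
    {
      "_source": {
        "JobId": 2,
        "DesigId": 12
      }
    }
  ]
}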
Index API
When indexing the documents, you need to add the pipeline=concat query param each time you index a document. An index API call will look like:
POST _doc/1?pipeline=concat
{
  "JobId": 2,
  "DesigId": 12
}
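Alternatively, on recent Elasticsearch versions (6.5+) the pipeline can be attached to the index as its default, so the query param is no longer needed on every request (the index name here is illustrative):
PUT my-index/_settings
{
  "index.default_pipeline": "concat"
}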
Search Query:
{
  "size": 0,
  "aggs": {
    "jobId_and_DesigId": {
      "terms": {
        "field": "combined_field.keyword"
      }
    }
  }
}
Search Result:
"aggregations": {
"jobId_and_DesigId": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "2 4",
"doc_count": 2
},
{
"key": "3 5",
"doc_count": 2
},
{
"key": "2 12",
"doc_count": 1
}
]
}
}
The cardinality aggregation only gives an approximate count, and since the number of distinct combinations can exceed 40,000, the precision_threshold setting (whose maximum supported value is 40000) will not guarantee an exact count either.
You can use a scripted metric aggregation instead. It will give an accurate count but will be considerably slower than the cardinality aggregation.
{
  "aggs": {
    "Distinct_Count": {
      "scripted_metric": {
        "init_script": "state.list = []",
        "map_script": """
          // collect the combined key for every document on this shard
          state.list.add(doc['JobId'].value + '-' + doc['DesigId'].value);
        """,
        "combine_script": "return state.list;",
        "reduce_script": """
          // deduplicate the keys collected across all shards
          Map uniqueValueMap = new HashMap();
          int count = 0;
          for (shardList in states) {
            if (shardList != null) {
              for (key in shardList) {
                if (!uniqueValueMap.containsKey(key)) {
                  count += 1;
                  uniqueValueMap.put(key, key);
                }
              }
            }
          }
          return count;
        """
      }
    }
  }
}

Elasticsearch - Sort results of Terms aggregation by key string length

I am querying ES with a Terms aggregation to find the first N unique values of a string field foo where the field contains a substring bar, and the document matches some other constraints.
Currently I am able to sort the results by the key string alphabetically:
{
  "query": {other constraints},
  "aggs": {
    "my_values": {
      "terms": {
        "field": "foo.raw",
        "include": ".*bar.*",
        "order": { "_key": "asc" },
        "size": N
      }
    }
  }
}
This gives results like
{
  ...
  "aggregations": {
    "my_values": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 145,
      "buckets": [
        {
          "key": "aa_bar_aa",
          "doc_count": 1
        },
        {
          "key": "iii_bar_iii",
          "doc_count": 1
        },
        {
          "key": "z_bar_z",
          "doc_count": 1
        }
      ]
    }
  }
}
How can I change the order option so that the buckets are sorted by the length of the strings in the foo key field, giving results like:
{
  ...
  "aggregations": {
    "my_values": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 145,
      "buckets": [
        {
          "key": "z_bar_z",
          "doc_count": 1
        },
        {
          "key": "aa_bar_aa",
          "doc_count": 1
        },
        {
          "key": "iii_bar_iii",
          "doc_count": 1
        }
      ]
    }
  }
}
This is desired because a shorter string is closer to the search substring, so it is considered a 'better' match and should appear earlier in the results than a longer string.
Any alternative way to sort the buckets by how similar they are to the original substring would also be helpful.
I need the sorting to occur in ES so that I only have to load the top N results from ES.
I worked out a way to do this.
I used a sub-aggregation per dynamic bucket to calculate the length of the key string as another field.
Then I was able to sort by this new length field first, then by the actual key so keys of the same length are sorted alphabetically.
{
  "query": {other constraints},
  "aggs": {
    "my_values": {
      "terms": {
        "field": "foo.raw",
        "include": ".*bar.*",
        "order": [
          { "key_length": "asc" },
          { "_key": "asc" }
        ],
        "size": N
      },
      "aggs": {
        "key_length": {
          "max": { "script": "doc['foo.raw'].value.length()" }
        }
      }
    }
  }
}
This gave me results like
{
  ...
  "aggregations": {
    "my_values": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 145,
      "buckets": [
        {
          "key": "z_bar_z",
          "doc_count": 1
        },
        {
          "key": "aa_bar_aa",
          "doc_count": 1
        },
        {
          "key": "dd_bar_dd",
          "doc_count": 1
        },
        {
          "key": "bbb_bar_bbb",
          "doc_count": 1
        }
      ]
    }
  }
}
which is what I wanted.

query on result aggregation elasticsearch

I have imported millions of records into Elasticsearch. The documents look like the following:
"_source": {
"mt": "w",
"hour": 1
}
I want to find the number of hour values that have occurred more than 5 times.
For example, using a terms aggregation I get the following result:
"aggregations": {
"hours": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 1,
"doc_count": 7
},
{
"key": 4,
"doc_count": 5
},
{
"key": 5,
"doc_count": 2
}
]
}
}
How do I find the count of hour values that occur more than 5 times?
Here it would be 1, because only hour=1 occurs more than 5 times.
You can use "min_doc_count": 6 in the terms aggregation; min_doc_count keeps only buckets containing at least that many documents, so "more than 5 times" means a minimum of 6. See the Elastic doc and the sketch below.
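A sketch of the full request, with the field name taken from the documents above:
{
  "size": 0,
  "aggs": {
    "hours": {
      "terms": {
        "field": "hour",
        "min_doc_count": 6
      }
    }
  }
}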

Elastic Search Aggregation buckets, buckets by number of records

I am new to Elastic Search and I'm trying to create a request without a lot of success. Here is the use case:
Let's imagine I have 4 documents, which have an amount field:
[
  {
    "id": 541436748332,
    "amount": 5,
    "date": "2017-01-01"
  },
  {
    "id": 6348643512,
    "amount": 2,
    "date": "2017-03-13"
  },
  {
    "id": 343687432,
    "amount": 2,
    "date": "2017-03-14"
  },
  {
    "id": 6457866181,
    "amount": 7,
    "date": "2017-05-21"
  }
]
And here is the kind of result I'd like to get:
{
  "aggregations": {
    "my_aggregation": {
      "buckets": [
        {
          "doc_count": 2,
          "sum": 7
        },
        {
          "doc_count": 2,
          "sum": 9
        }
      ]
    }
  }
}
As you can see, I want some kind of histogram, but instead of setting a date interval, I'd like to set a "document" interval. So here, that would be 2 documents per bucket, and the sum of the amount field of those two documents.
Does someone know if that is even possible? That would also imply sorting the records, by date for example, to get the wanted results.
EDIT: some more explanation of the use case:
The real use case is a line graph I'd like to plot, with the number of sales on the X axis and the total amount $$$ of those sales on the Y axis. And I don't want to print thousands of dots on my graph; I want fewer dots, which is why I was hoping to work with buckets and sums.
The example response I gave is just the first step I want to achieve; the second step would be to add to each bucket the sum of the buckets before it (a running total):
{
  "aggregations": {
    "my_aggregation": {
      "buckets": [
        {
          "doc_count": 2,
          "sum": 7
        },
        {
          "doc_count": 2,
          "sum": 16
        }
      ]
    }
  }
}
(7 = 5 + 2); (16 = 7 (from last result) + 2 + 7);
You can use histogram and sum aggregations, like this:
{
  "size": 0,
  "aggs": {
    "prices": {
      "histogram": {
        "field": "id",
        "interval": 2,
        "offset": 1
      },
      "aggs": {
        "total_amount": {
          "sum": {
            "field": "amount"
          }
        }
      }
    }
  }
}
(offset 1 is required if you want the first bucket to start at 1 instead of at 0.) Then you'll get a response like this:
{
  "aggregations": {
    "prices": {
      "buckets": [
        {
          "key": 1,
          "doc_count": 2,
          "total_amount": {
            "value": 7
          }
        },
        {
          "key": 3,
          "doc_count": 2,
          "total_amount": {
            "value": 9
          }
        }
      ]
    }
  }
}
Sorting is not required, because the default order is the order you want. However, there's also an order parameter in case you want a different ordering of the buckets.
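For the running-total second step described in the edit, a cumulative_sum pipeline aggregation (available since Elasticsearch 2.0) can be nested inside the histogram. A sketch building on the query above:
{
  "size": 0,
  "aggs": {
    "prices": {
      "histogram": {
        "field": "id",
        "interval": 2,
        "offset": 1
      },
      "aggs": {
        "total_amount": {
          "sum": { "field": "amount" }
        },
        "running_amount": {
          "cumulative_sum": { "buckets_path": "total_amount" }
        }
      }
    }
  }
}
Each bucket then carries both its own total_amount and the running_amount accumulated over all preceding buckets.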

ElasticSearch - Get Statistics on Aggregation results

I have the following simple aggregation:
GET index1/type1/_search
{
  "size": 0,
  "aggs": {
    "incidentID": {
      "terms": {
        "field": "incidentID",
        "size": 5
      }
    }
  }
}
Results are:
"aggregations": {
"incidentID": {
"buckets": [
{
"key": "0A631EB1-01EF-DC28-9503-FC28FE695C6D",
"doc_count": 233
},
{
"key": "DF107D2B-CA1E-85C9-E01A-C966DC6F7051",
"doc_count": 226
},
{
"key": "60B8955F-38FD-8DFE-D374-4387668C8368",
"doc_count": 220
},
{
"key": "B787868A-F72E-63DC-D837-B3A864D9FFC6",
"doc_count": 174
},
{
"key": "C597EC5F-C60F-F3BA-61CB-4990F12C1893",
"doc_count": 174
}
]
}
}
What I want to do is get the "statistics" of the "doc_count" returned. I want:
Min Value
Max Value
Average
Standard Deviation
No, this is not currently possible; here is the issue tracking the support:
https://github.com/elasticsearch/elasticsearch/issues/8110
Obviously, it is possible to do this client side if you are able to pull the full list of all buckets into memory.
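For what it's worth, the issue linked above was later addressed by pipeline aggregations (Elasticsearch 2.0+): an extended_stats_bucket sibling aggregation over the terms buckets' doc counts returns min, max, avg, and std_deviation in a single request. A sketch against the query above:
GET index1/type1/_search
{
  "size": 0,
  "aggs": {
    "incidentID": {
      "terms": {
        "field": "incidentID",
        "size": 5
      }
    },
    "doc_count_stats": {
      "extended_stats_bucket": {
        "buckets_path": "incidentID>_count"
      }
    }
  }
}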
