Calculate exact count of distinct values for combination of 2 fields in Elasticsearch - elasticsearch

I have around 40 million records in my Elasticsearch index. I want to calculate the exact count of distinct values for a combination of 2 fields.
Example, for a given set of documents:
[
  { "JobId": 2, "DesigId": 12 },
  { "JobId": 2, "DesigId": 4 },
  { "JobId": 3, "DesigId": 5 },
  { "JobId": 2, "DesigId": 4 },
  { "JobId": 3, "DesigId": 5 }
]
For the above example, I should get count = 3, as only 3 distinct combinations exist:
[(2,12),(2,4),(3,5)]
I tried using a cardinality aggregation for this, but that only provides an approximate count. I want the exact count. Below is the query I used with the cardinality aggregation:
"aggs": {
  "counts": {
    "cardinality": {
      "script": "doc['JobId'].value + ',' + doc['DesigId'].value",
      "precision_threshold": 40000
    }
  }
}
I also tried a composite aggregation on the combination of the 2 fields, paging with the after key and counting the total number of buckets, but that approach is very time-consuming and my query times out.
Is there an optimal way to achieve this?

Scripting should be avoided, as it affects performance. For your use case, there are 3 ways to get the result you need:
1. Using a composite aggregation (which you have already tried)
2. Using a multi_terms aggregation, though this is not a memory-efficient solution
3. Storing the combined field value at index time (the best method, covered last)
Search Query:
{
  "size": 0,
  "aggs": {
    "jobId_and_DesigId": {
      "multi_terms": {
        "terms": [
          { "field": "JobId" },
          { "field": "DesigId" }
        ]
      }
    }
  }
}
Search Result:
"aggregations": {
  "jobId_and_DesigId": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      { "key": [2, 4], "key_as_string": "2|4", "doc_count": 2 },
      { "key": [3, 5], "key_as_string": "3|5", "doc_count": 2 },
      { "key": [2, 12], "key_as_string": "2|12", "doc_count": 1 }
    ]
  }
}
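Whichever aggregation you use, the exact distinct count is simply the number of buckets returned. As a sanity check, the same deduplication in plain Python over the five sample documents (illustrative only):

```python
# Count distinct (JobId, DesigId) combinations from the sample documents.
docs = [
    {"JobId": 2, "DesigId": 12},
    {"JobId": 2, "DesigId": 4},
    {"JobId": 3, "DesigId": 5},
    {"JobId": 2, "DesigId": 4},
    {"JobId": 3, "DesigId": 5},
]

# A set of tuples deduplicates the field combinations exactly.
distinct_pairs = {(d["JobId"], d["DesigId"]) for d in docs}
print(len(distinct_pairs))  # 3 distinct combinations: (2,12), (2,4), (3,5)
```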
The best method is to store the combined field value (i.e., the combination of "JobId" and "DesigId") at index time. This is possible using a set processor:
PUT /_ingest/pipeline/concat
{
  "processors": [
    {
      "set": {
        "field": "combined_field",
        "value": "{{JobId}} {{DesigId}}"
      }
    }
  ]
}
Index API
When indexing the documents, you need to add the pipeline=concat query parameter each time you index a document. An index API call will look like:
POST _doc/1?pipeline=concat
{
  "JobId": 2,
  "DesigId": 12
}
Search Query:
{
  "size": 0,
  "aggs": {
    "jobId_and_DesigId": {
      "terms": {
        "field": "combined_field.keyword"
      }
    }
  }
}
Search Result:
"aggregations": {
  "jobId_and_DesigId": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      { "key": "2 4", "doc_count": 2 },
      { "key": "3 5", "doc_count": 2 },
      { "key": "2 12", "doc_count": 1 }
    ]
  }
}
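If you would rather not depend on an ingest pipeline, the same combined field can be added client-side before indexing. A minimal sketch of that preprocessing (the field name combined_field matches the set processor above):

```python
def add_combined_field(doc):
    """Mirror the set processor: combined_field = '<JobId> <DesigId>'."""
    doc = dict(doc)  # copy, to avoid mutating the caller's document
    doc["combined_field"] = f"{doc['JobId']} {doc['DesigId']}"
    return doc

docs = [{"JobId": 2, "DesigId": 12}, {"JobId": 2, "DesigId": 4}]
prepared = [add_combined_field(d) for d in docs]
print(prepared[0]["combined_field"])  # "2 12"
```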

The cardinality aggregation only gives an approximate count, and precision_threshold is capped at 40,000, so if the number of distinct combinations can exceed that, raising the threshold will not guarantee an exact result.
You can use a scripted_metric aggregation instead. It gives an accurate count, but is considerably slower than the cardinality aggregation.
{
  "aggs": {
    "Distinct_Count": {
      "scripted_metric": {
        "init_script": "state.list = []",
        "map_script": "state.list.add(doc['JobId'].value + '-' + doc['DesigId'].value);",
        "combine_script": "return state.list;",
        "reduce_script": """
          Map uniqueValueMap = new HashMap();
          int count = 0;
          for (shardList in states) {
            if (shardList != null) {
              for (key in shardList) {
                if (!uniqueValueMap.containsKey(key)) {
                  count += 1;
                  uniqueValueMap.put(key, key);
                }
              }
            }
          }
          return count;
        """
      }
    }
  }
}
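The reduce_script is just a shard-wise union count: it merges the per-shard key lists and counts the unique keys. The same logic in Python, with each inner list standing in for one shard's state.list (hypothetical shard contents):

```python
def reduce_shards(states):
    """Count unique keys across per-shard lists, as the reduce_script does."""
    seen = set()
    for shard_list in states:
        if shard_list is not None:  # a shard may contribute no state
            seen.update(shard_list)
    return len(seen)

# Two hypothetical shards holding 'JobId-DesigId' keys.
shards = [["2-12", "2-4"], ["3-5", "2-4", "3-5"]]
print(reduce_shards(shards))  # 3
```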

Related

Distinct query in ElasticSearch

I have an index where a field (category) is a list field. I want to fetch all the distinct category lists within the index.
Following is an example.
Doc1 - { "category": [1,2,3,4] }
Doc2 - { "category": [5,6] }
Doc3 - { "category": [1,2,3,4] }
Doc4 - { "category": [1,2,7] }
My output should be
[1,2,3,4]
[5,6]
[1,2,7]
I am using the query below:
GET /products/_search
{
  "size": 0,
  "aggs": {
    "category": {
      "terms": { "field": "category", "size": 1500 }
    }
  }
}
This returns [1], [2], [3], [4], [5], [6], [7]. I don't want the individual unique items of the list field; I'm looking for the complete unique lists.
What am I missing in the above query? I'm using Elasticsearch v7.10.
You can use a terms aggregation with a script that builds one key per whole list. Use a separator between values so that, for example, [1, 23] and [12, 3] do not collide into the same key:
{
  "size": 0,
  "aggs": {
    "category": {
      "terms": {
        "script": {
          "source": """
            def cat = "";
            for (int i = 0; i < doc['category'].length; i++) {
              if (i > 0) { cat += ","; }
              cat += doc['category'][i];
            }
            return cat;
          """
        }
      }
    }
  }
}
The above query will return a result like below:
"aggregations": {
  "category": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      { "key": "1,2,3,4", "doc_count": 2 },
      { "key": "1,2,7", "doc_count": 1 },
      { "key": "5,6", "doc_count": 1 }
    ]
  }
}
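The idea is to turn each whole list into a single bucket key. The same keying in plain Python over the four sample documents (joining with a separator so lists like [1, 23] and [12, 3] cannot produce the same key):

```python
docs = [[1, 2, 3, 4], [5, 6], [1, 2, 3, 4], [1, 2, 7]]

# One key per whole list; the set keeps only the distinct lists.
keys = {",".join(str(c) for c in categories) for categories in docs}
print(sorted(keys))  # ['1,2,3,4', '1,2,7', '5,6']
```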

Interval search for messages in Elasticsearch

I need to split the found messages into intervals. Can this be done with Elasticsearch?
For example, there are 10 messages and you need to divide them into 3 intervals. It should look like this:
[0,1,2,3,4,5,6,7,8,9] => {[0,1,2], [3,4,5,6], [7,8,9]}
I'm only interested in the beginning of each interval. For example: {[count: 3, min: 0], [count: 4, min: 3], [count: 3, min: 7]}
Example.
PUT /test_index
{
  "mappings": {
    "properties": {
      "id": { "type": "long" }
    }
  }
}
POST /test_index/_doc/0
{ "id": 0 }
POST /test_index/_doc/1
{ "id": 1 }
POST /test_index/_doc/2
{ "id": 2 }
POST /test_index/_doc/3
{ "id": 3 }
POST /test_index/_doc/4
{ "id": 4 }
POST /test_index/_doc/5
{ "id": 5 }
POST /test_index/_doc/6
{ "id": 6 }
POST /test_index/_doc/7
{ "id": 7 }
POST /test_index/_doc/8
{ "id": 8 }
POST /test_index/_doc/9
{ "id": 9 }
It is necessary to divide the values into 3 intervals with the same number of elements in each interval:
{
  ...
  "aggregations": {
    "result": {
      "buckets": [
        { "min": 0.0, "doc_count": 3 },
        { "min": 3.0, "doc_count": 4 },
        { "min": 7.0, "doc_count": 3 }
      ]
    }
  }
}
There is a similar function, the variable_width_histogram aggregation:
GET /test_index/_search?size=0
{
  "aggs": {
    "result": {
      "variable_width_histogram": {
        "field": "id",
        "buckets": 3
      }
    }
  },
  "query": {
    "match_all": {}
  }
}
But variable_width_histogram clusters documents by id value, not by the number of elements per bucket.
Assuming your mapping is like:
{
  "some_numeric_field": { "type": "integer" }
}
Then you can build histograms from it with fixed interval sizes:
POST /my_index/_search?size=0
{
  "aggs": {
    "some_numeric_field": {
      "histogram": {
        "field": "some_numeric_field",
        "interval": 7
      }
    }
  }
}
Results:
{
  ...
  "aggregations": {
    "some_numeric_field": {
      "buckets": [
        { "key": 0.0, "doc_count": 7 },
        { "key": 7.0, "doc_count": 7 },
        { "key": 14.0, "doc_count": 7 }
      ]
    }
  }
}
To get the individual values inside each bucket, just add a sub-aggregation, e.g. top_hits or a terms aggregation.
Without knowing more about your data, I really cannot help further.
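As far as I know there is no built-in aggregation that buckets by a fixed document count, so if an exact equal-count split is required, one option is to compute it client-side over the sorted values. A sketch (where the "extra" elements land when the count does not divide evenly is a free choice; the question's example puts the larger bucket in the middle):

```python
def equal_count_buckets(values, n):
    """Split sorted values into n contiguous buckets whose sizes differ
    by at most 1. Returns (min_value, count) per bucket."""
    values = sorted(values)
    base, extra = divmod(len(values), n)
    buckets, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)  # first `extra` buckets get one more
        chunk = values[start:start + size]
        buckets.append((chunk[0], len(chunk)))
        start += size
    return buckets

print(equal_count_buckets(range(10), 3))  # [(0, 4), (4, 3), (7, 3)]
```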

Elasticsearch - Sort results of Terms aggregation by key string length

I am querying ES with a terms aggregation to find the first N unique values of a string field foo, where the field contains the substring bar and the document matches some other constraints.
Currently I am able to sort the results by the key string alphabetically:
{
  "query": {other constraints},
  "aggs": {
    "my_values": {
      "terms": {
        "field": "foo.raw",
        "include": ".*bar.*",
        "order": { "_key": "asc" },
        "size": N
      }
    }
  }
}
This gives results like
{
  ...
  "aggregations": {
    "my_values": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 145,
      "buckets": [
        { "key": "aa_bar_aa", "doc_count": 1 },
        { "key": "iii_bar_iii", "doc_count": 1 },
        { "key": "z_bar_z", "doc_count": 1 }
      ]
    }
  }
}
How can I change the order option so that the buckets are sorted by the length of the strings in the foo key field, giving results like
{
  ...
  "aggregations": {
    "my_values": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 145,
      "buckets": [
        { "key": "z_bar_z", "doc_count": 1 },
        { "key": "aa_bar_aa", "doc_count": 1 },
        { "key": "iii_bar_iii", "doc_count": 1 }
      ]
    }
  }
}
This is desired because a shorter string is closer in length to the search substring, so it is considered a 'better' match and should appear earlier in the results than a longer string.
Any alternative way to sort the buckets by how similar they are to the original substring would also be helpful.
I need the sorting to happen in ES so that I only have to load the top N results from ES.
I worked out a way to do this.
I used a sub-aggregation per dynamic bucket to calculate the length of the key string as another field, and was then able to sort by this new length field first, then by the actual key, so keys of the same length are sorted alphabetically.
{
  "query": {other constraints},
  "aggs": {
    "my_values": {
      "terms": {
        "field": "foo.raw",
        "include": ".*bar.*",
        "order": [
          { "key_length": "asc" },
          { "_key": "asc" }
        ],
        "size": N
      },
      "aggs": {
        "key_length": {
          "max": { "script": "doc['foo.raw'].value.length()" }
        }
      }
    }
  }
}
This gave me results like
{
  ...
  "aggregations": {
    "my_values": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 145,
      "buckets": [
        { "key": "z_bar_z", "doc_count": 1 },
        { "key": "aa_bar_aa", "doc_count": 1 },
        { "key": "dd_bar_dd", "doc_count": 1 },
        { "key": "bbb_bar_bbb", "doc_count": 1 }
      ]
    }
  }
}
which is what I wanted.
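The composite sort key (length first, then the key itself as a tie-breaker) behaves like this Python equivalent, using the example keys from above:

```python
keys = ["aa_bar_aa", "iii_bar_iii", "z_bar_z", "dd_bar_dd"]

# Sort by string length first, then alphabetically, mirroring the
# [{"key_length": "asc"}, {"_key": "asc"}] aggregation order.
ordered = sorted(keys, key=lambda k: (len(k), k))
print(ordered)  # ['z_bar_z', 'aa_bar_aa', 'dd_bar_dd', 'iii_bar_iii']
```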

Elastic Search Aggregation buckets, buckets by number of records

I am new to Elasticsearch and I'm trying to create a request without much success. Here is the use case:
Let's imagine I have 4 documents, which have an amount field:
[
  { "id": 541436748332, "amount": 5, "date": "2017-01-01" },
  { "id": 6348643512, "amount": 2, "date": "2017-03-13" },
  { "id": 343687432, "amount": 2, "date": "2017-03-14" },
  { "id": 6457866181, "amount": 7, "date": "2017-05-21" }
]
And here is the kind of result I'd like to get:
{
  "aggregations": {
    "my_aggregation": {
      "buckets": [
        { "doc_count": 2, "sum": 7 },
        { "doc_count": 2, "sum": 9 }
      ]
    }
  }
}
As you can see, I want some kind of histogram, but instead of a date interval, I'd like to set a "document" interval: here, that would be 2 documents per bucket, and the sum of the amount field of those two documents.
Does someone know if that is even possible? That would also imply sorting the records by date, for example, to get the desired results.
EDIT: Some more explanation of the use case:
The real use case is a line graph I'd like to plot, with the number of sales on the X axis and the total amount of those sales on the Y axis. I don't want to plot thousands of dots on my graph; I want fewer dots, which is why I was hoping to work with buckets and sums.
The example response I gave is just the first step; the second step would be to add to each bucket the sum of all the buckets before it:
{
  "aggregations": {
    "my_aggregation": {
      "buckets": [
        { "doc_count": 2, "sum": 7 },
        { "doc_count": 2, "sum": 16 }
      ]
    }
  }
}
(7 = 5 + 2); (16 = 7 (from last result) + 2 + 7);
You can use histogram and sum aggregations, like this:
{
  "size": 0,
  "aggs": {
    "prices": {
      "histogram": {
        "field": "id",
        "interval": 2,
        "offset": 1
      },
      "aggs": {
        "total_amount": {
          "sum": { "field": "amount" }
        }
      }
    }
  }
}
(offset 1 is required if you want the first bucket to start at 1 instead of at 0. Note that this assumes id holds small, sequential values; with arbitrary ids like those in your example, you would first need to index a sequential counter field to bucket on.) Then you'll get a response like this:
{
  "aggregations": {
    "prices": {
      "buckets": [
        { "key": 1, "doc_count": 2, "total_amount": { "value": 7 } },
        { "key": 3, "doc_count": 2, "total_amount": { "value": 9 } }
      ]
    }
  }
}
Sorting is not required, because the default order is the order you want. However, there's also an order parameter in case you want a different ordering of the buckets.
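For the second step from the question's edit, Elasticsearch has a cumulative_sum pipeline aggregation that can produce the running total on top of a histogram; it is also trivial client-side. A sketch of both steps in Python over the question's amounts:

```python
def bucket_sums(amounts, bucket_size):
    """Group consecutive records into fixed-size buckets and sum each bucket."""
    return [sum(amounts[i:i + bucket_size])
            for i in range(0, len(amounts), bucket_size)]

def running_totals(sums):
    """Cumulative version: each bucket adds all the buckets before it."""
    totals, acc = [], 0
    for s in sums:
        acc += s
        totals.append(acc)
    return totals

amounts = [5, 2, 2, 7]          # records already sorted by date
print(bucket_sums(amounts, 2))  # [7, 9]
print(running_totals([7, 9]))   # [7, 16]
```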

How to aggregate and roll up values from child to parent in Elastic Search

I am a newbie to Elasticsearch and I am trying to find out how to handle the scenario described here. I have a schema where a document may contain data such as
{ "country": "US", "zone": "East", "cluster": "Cluster1", "time_taken": 4500, "status": 0 },
{ "country": "US", "zone": "East", "cluster": "Cluster1", "time_taken": 5000, "status": 0 },
{ "country": "US", "zone": "East", "cluster": "Cluster1", "time_taken": 5000, "status": 1 },
{ "country": "US", "zone": "East", "cluster": "Cluster2", "time_taken": 5000, "status": 0 }
where status = 0 means success and 1 means failure.
I want to show a result that reflects the hierarchy with success percentages, like:
US/East/Cluster1 = 66% (2 successes and 1 failure)
US/East/Cluster2 = 100% (1 success)
US/East = 75%
US = 75%
Alternatively, a way to get the average time taken for success and failure cases across this hierarchy would also be great.
I think a terms aggregation should get the job done for you.
In order to satisfy your first query example (% success per cluster), try something like this:
{
  "aggs": {
    "byCluster": {
      "terms": {
        "field": "cluster"
      },
      "aggs": {
        "success_or_fail": {
          "terms": {
            "field": "status"
          }
        }
      }
    }
  }
}
This returns a result that looks something like this:
"aggregations": {
  "byCluster": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "cluster1",
        "doc_count": 3,
        "success_or_fail": {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0,
          "buckets": [
            { "key": 0, "doc_count": 2 },
            { "key": 1, "doc_count": 1 }
          ]
        }
      },
      {
        "key": "cluster2",
        "doc_count": 1,
        "success_or_fail": {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0,
          "buckets": [
            { "key": 0, "doc_count": 1 }
          ]
        }
      }
    ]
  }
}
You can take the doc_count of the 0 bucket of the "success_or_fail" (arbitrary name) aggregation and divide it by the doc_count of the corresponding cluster. This gives you the % success for each cluster (2/3 for "cluster1" and 1/1 for "cluster2").
The same type of aggregation can be used to group by "country" and "zone".
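That client-side division, applied at each level of the hierarchy, looks like this in Python over the sample documents (a sketch; truncating to whole percent, matching the question's 66%):

```python
from collections import defaultdict

docs = [
    {"country": "US", "zone": "East", "cluster": "Cluster1", "status": 0},
    {"country": "US", "zone": "East", "cluster": "Cluster1", "status": 0},
    {"country": "US", "zone": "East", "cluster": "Cluster1", "status": 1},
    {"country": "US", "zone": "East", "cluster": "Cluster2", "status": 0},
]

def success_rates(docs, key_fields):
    """% success per group, where status == 0 means success."""
    ok, total = defaultdict(int), defaultdict(int)
    for d in docs:
        key = "/".join(d[f] for f in key_fields)
        total[key] += 1
        ok[key] += d["status"] == 0
    return {k: 100 * ok[k] // total[k] for k in total}

print(success_rates(docs, ["country", "zone", "cluster"]))
# {'US/East/Cluster1': 66, 'US/East/Cluster2': 100}
print(success_rates(docs, ["country"]))  # {'US': 75}
```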
UPDATE
You can also nest an avg aggregation next to the "success_or_fail" terms aggregation to get the average time taken you were looking for. As in:
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "byCluster": {
      "terms": {
        "field": "cluster"
      },
      "aggs": {
        "success_or_fail": {
          "terms": {
            "field": "status"
          },
          "aggs": {
            "avg_time_taken": {
              "avg": {
                "field": "time_taken"
              }
            }
          }
        }
      }
    }
  }
}
