Issue with nested aggregations in Elasticsearch: doing a sum after a max

I know sub-aggregations aren't possible under metric aggregations, and that Elasticsearch supports sub-aggregations under bucket aggregations. But I am a bit lost on how to do this.
I want to do a sum after several nested aggregations, and after having aggregated on the max timestamp.
The query below gives me the error "Aggregator [max_date_aggs] of type [max] cannot accept sub-aggregations", which is expected. Is there a way to make it work?
{
  "aggs": {
    "sender_comp_aggs": {
      "terms": {
        "field": "senderComponent"
      },
      "aggs": {
        "activity_mnemo_aggs": {
          "terms": {
            "field": "activityMnemo"
          },
          "aggs": {
            "activity_instance_id_aggs": {
              "terms": {
                "field": "activityInstanceId"
              },
              "aggs": {
                "business_date_aggs": {
                  "terms": {
                    "field": "correlationIdSet.businessDate"
                  },
                  "aggs": {
                    "context_set_id_closing_aggs": {
                      "terms": {
                        "field": "contextSetId.closing"
                      },
                      "aggs": {
                        "max_date_aggs": {
                          "max": {
                            "field": "timestamp"
                          },
                          "aggs": {
                            "sum_done": {
                              "sum": {
                                "field": "itemNumberDone"
                              }
                            }
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
Thank you

I am not 100% sure what you would like to achieve; it would help if you also shared your mapping.
A bucket aggregation is about defining buckets/groups. As you do in your example, you can wrap/nest bucket aggregations to further break down your buckets into sub-buckets, and so on.
By default Elasticsearch always calculates the count metric, but you can have other metrics calculated as well. A metric is calculated per bucket, for a bucket, and never for another metric; this is why you cannot nest a metric aggregation under another metric aggregation: it simply does not make sense.
Depending on what your data looks like, the only change you may need is to move the sum_done aggregation out of the aggs clause, up to the same level as your max_date_aggs aggregation.
Code Snippet
"aggs": {
"max_date_aggs": { "max": {"field": "timestamp"} },
"sum_done": { "sum": { "field": "itemNumberDone"} }
}
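The distinction can be sketched client-side in plain Python (a hypothetical illustration with made-up documents, not how Elasticsearch evaluates aggregations internally): both sibling metrics are computed over the same bucket's documents, never on each other's result.

```python
from collections import defaultdict

# Hypothetical sample documents; field names match the question.
docs = [
    {"senderComponent": "PS", "timestamp": 1, "itemNumberDone": 9},
    {"senderComponent": "PS", "timestamp": 2, "itemNumberDone": 10},
    {"senderComponent": "QS", "timestamp": 5, "itemNumberDone": 3},
]

# Bucket by a terms field, then compute sibling metrics per bucket.
buckets = defaultdict(list)
for doc in docs:
    buckets[doc["senderComponent"]].append(doc)

result = {
    key: {
        "max_date_aggs": max(d["timestamp"] for d in group),
        "sum_done": sum(d["itemNumberDone"] for d in group),
    }
    for key, group in buckets.items()
}
print(result)
# {'PS': {'max_date_aggs': 2, 'sum_done': 19}, 'QS': {'max_date_aggs': 5, 'sum_done': 3}}
```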
After you refined your question and provided sample documents, I managed to come up with a solution that requires only a single request. As previously mentioned, the sum metric aggregation needs to operate on a bucket, not on a metric. The solution is pretty straightforward: rather than calculating the max date with a metric, reformulate that step as a terms aggregation on the timestamp, sorted by descending key and asking for exactly one bucket.
Solution
GET gos_element/_search
{
  "size": 0,
  "aggs": {
    "sender_comp_aggs": {
      "terms": {"field": "senderComponent.keyword"},
      "aggs": {
        "activity_mnemo_aggs": {
          "terms": {"field": "activityMnemo.keyword"},
          "aggs": {
            "activity_instance_id_aggs": {
              "terms": {"field": "activityInstanceId.keyword"},
              "aggs": {
                "business_date_aggs": {
                  "terms": {"field": "correlationIdSet.businessDate"},
                  "aggs": {
                    "context_set_id_closing_aggs": {
                      "terms": {"field": "contextSetId.closing.keyword"},
                      "aggs": {
                        "max_date_bucket_aggs": {
                          "terms": {
                            "field": "timestamp",
                            "size": 1,
                            "order": {"_key": "desc"}
                          },
                          "aggs": {
                            "sum_done": {
                              "sum": {"field": "itemNumberDone"}
                            }
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
As I relied on the default Elasticsearch mapping, I had to refer to the .keyword version of the string fields. If your fields are mapped directly as type keyword, you don't need to do that.
You can test the request above right away after indexing the documents you provided with the following two commands:
PUT gos_element/_doc/AW_yu3dIa2R_HwqpSz
{
"senderComponent": "PS",
"timestamp": "2020-01-28T02:31:00Z",
"activityMnemo": "PScommand",
"activityInstanceId": "123466",
"activityStatus": "Progress",
"activityStatusNumber": 300,
"specificActivityStatus": "",
"itemNumberTotal": 10,
"itemNumberDone": 9,
"itemNumberInError": 0,
"itemNumberNotStarted": 1,
"itemNumberInProgress": 0,
"itemUnit": "Command",
"itemList": [],
"contextSetId": {
"PV": "VAR",
"closing": "PARIS"
},
"correlationIdSet": {
"closing": "PARIS",
"businessDate": "2020-01-27",
"correlationId": "54947df8-0e9e-4471-a2f9-9af509fb5899"
},
"errorSet": [],
"kpiSet": "",
"activitySpecificPayload": "",
"messageGroupUUID": "54947df8-0e9e-4471-a2f9-9af509fb5899"
}
PUT gos_element/_doc/AW_yu3dIa2R_HwqpSz8z
{
"senderComponent": "PS",
"timestamp": "2020-01-28T03:01:00Z",
"activityMnemo": "PScommand",
"activityInstanceId": "123466",
"activityStatus": "End",
"activityStatusNumber": 200,
"specificActivityStatus": "",
"itemNumberTotal": 10,
"itemNumberDone": 10,
"itemNumberInError": 0,
"itemNumberNotStarted": 0,
"itemNumberInProgress": 0,
"itemUnit": "Command",
"itemList": [],
"contextSetId": {
"PV": "VAR",
"closing": "PARIS"
},
"correlationIdSet": {
"closing": "PARIS",
"businessDate": "2020-01-27",
"correlationId": "54947df8-0e9e-4471-a2f9-9af509fb5899"
},
"errorSet": [],
"errorMessages": "",
"kpiSet": "",
"activitySpecificPayload": "",
"messageGroupUUID": "54947df8-0e9e-4471-a2f9-9af509fb5899"
}
As a result you get back the following response (with value 10 as expected):
{
"took" : 8,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"sender_comp_aggs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "PS",
"doc_count" : 2,
"activity_mnemo_aggs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "PScommand",
"doc_count" : 2,
"activity_instance_id_aggs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "123466",
"doc_count" : 2,
"business_date_aggs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 1580083200000,
"key_as_string" : "2020-01-27T00:00:00.000Z",
"doc_count" : 2,
"context_set_id_closing_aggs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "PARIS",
"doc_count" : 2,
"max_date_bucket_aggs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 1,
"buckets" : [
{
"key" : 1580180460000,
"key_as_string" : "2020-01-28T03:01:00.000Z",
"doc_count" : 1,
"sum_done" : {
"value" : 10.0
}
}
]
}
}
]
}
}
]
}
}
]
}
}
]
}
}
]
}
}
}
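The effect of the size-1, descending terms aggregation can also be sketched client-side in Python (a simplified illustration using the two sample documents, not a literal translation of the query): group by timestamp, keep only the bucket with the highest key, and sum inside it.

```python
from collections import defaultdict

# The two sample documents, reduced to the fields the aggregation uses.
docs = [
    {"timestamp": "2020-01-28T02:31:00Z", "itemNumberDone": 9},
    {"timestamp": "2020-01-28T03:01:00Z", "itemNumberDone": 10},
]

buckets = defaultdict(list)
for doc in docs:
    buckets[doc["timestamp"]].append(doc)

latest_key = max(buckets)  # order by _key descending, size 1
sum_done = sum(d["itemNumberDone"] for d in buckets[latest_key])
print(latest_key, sum_done)  # 2020-01-28T03:01:00Z 10
```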

Here are two documents:
{
"_type": "gos_element",
"_id": "AW_yu3dIa2R_HwqpSz-o",
"_score": 5.785128,
"_source": {
"senderComponent": "PS",
"timestamp": "2020-01-28T02:31:00Z",
"activityMnemo": "PScommand",
"activityInstanceId": "123466",
"activityStatus": "Progress",
"activityStatusNumber": 300,
"specificActivityStatus": "",
"itemNumberTotal": 10,
"itemNumberDone": 9,
"itemNumberInError": 0,
"itemNumberNotStarted": 1,
"itemNumberInProgress": 0,
"itemUnit": "Command",
"itemList": [],
"contextSetId": {
"PV": "VAR",
"closing": "PARIS"
},
"correlationIdSet": {
"closing": "PARIS",
"businessDate": "2020-01-27",
"correlationId": "54947df8-0e9e-4471-a2f9-9af509fb5899"
},
"errorSet": [],
"kpiSet": "",
"activitySpecificPayload": "",
"messageGroupUUID": "54947df8-0e9e-4471-a2f9-9af509fb5899"
}
},
{
"_type": "gos_element",
"_id": "AW_yu3dIa2R_HwqpSz8z",
"_score": 4.8696175,
"_source": {
"senderComponent": "PS",
"timestamp": "2020-01-28T03:01:00Z",
"activityMnemo": "PScommand",
"activityInstanceId": "123466",
"activityStatus": "End",
"activityStatusNumber": 200,
"specificActivityStatus": "",
"itemNumberTotal": 10,
"itemNumberDone": 10,
"itemNumberInError": 0,
"itemNumberNotStarted": 0,
"itemNumberInProgress": 0,
"itemUnit": "Command",
"itemList": [],
"contextSetId": {
"PV": "VAR",
"closing": "PARIS"
},
"correlationIdSet": {
"closing": "PARIS",
"businessDate": "2020-01-27",
"correlationId": "54947df8-0e9e-4471-a2f9-9af509fb5899"
},
"errorSet": [],
"errorMessages": "",
"kpiSet": "",
"activitySpecificPayload": "",
"messageGroupUUID": "54947df8-0e9e-4471-a2f9-9af509fb5899"
}
}
I would like to aggregate on a few terms (senderComponent, activityMnemo, activityInstanceId, correlationIdSet.businessDate and contextSetId.closing) and, within each of those groups, keep only the max timestamp. Once this is done, I would like to sum itemNumberDone.
If we take only these two documents and do the aggregations, I would like to get 10 itemNumberDone.
Is it possible with only one query and using buckets?

Related

Query filter for searching rollup index works with epoch time fails with date math

How do we query (filter) a rollup index?
For example, based on the query here
Request:
{
  "size": 0,
  "aggregations": {
    "timeline": {
      "date_histogram": {
        "field": "timestamp",
        "fixed_interval": "7d"
      },
      "aggs": {
        "nodes": {
          "terms": {
            "field": "node"
          },
          "aggs": {
            "max_temperature": {
              "max": {
                "field": "temperature"
              }
            },
            "avg_voltage": {
              "avg": {
                "field": "voltage"
              }
            }
          }
        }
      }
    }
  }
}
Response:
{
"took" : 93,
"timed_out" : false,
"terminated_early" : false,
"_shards" : ... ,
"hits" : {
"total" : {
"value": 0,
"relation": "eq"
},
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"timeline" : {
"buckets" : [
{
"key_as_string" : "2018-01-18T00:00:00.000Z",
"key" : 1516233600000,
"doc_count" : 6,
"nodes" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "a",
"doc_count" : 2,
"max_temperature" : {
"value" : 202.0
},
"avg_voltage" : {
"value" : 5.1499998569488525
}
},
{
"key" : "b",
"doc_count" : 2,
"max_temperature" : {
"value" : 201.0
},
"avg_voltage" : {
"value" : 5.700000047683716
}
},
{
"key" : "c",
"doc_count" : 2,
"max_temperature" : {
"value" : 202.0
},
"avg_voltage" : {
"value" : 4.099999904632568
}
}
]
}
}
]
}
}
}
How can I filter for, say, the last 3 days? Is that possible?
For a test case, I used a fixed_interval rate of 1m (one minute, and also 60 minutes) and I tried the following, but the error was "all query shards failed". Is it possible to add a query filter to rollup aggregations?
Test Query for searching rollup index
{
  "size": 0,
  "query": {
    "range": {
      "timestamp": {
        "gte": "now-3d/d",
        "lt": "now/d"
      }
    }
  },
  "aggregations": {
    "timeline": {
      "date_histogram": {
        "field": "timestamp",
        "fixed_interval": "7d"
      },
      "aggs": {
        "nodes": {
          "terms": {
            "field": "node"
          },
          "aggs": {
            "max_temperature": {
              "max": {
                "field": "temperature"
              }
            },
            "avg_voltage": {
              "avg": {
                "field": "voltage"
              }
            }
          }
        }
      }
    }
  }
}

ELASTICSEARCH - Get a count of values from the most recent document

I can't get a count of values from a document selected by a filter.
I have this json
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "net",
        "_type" : "_doc",
        "_id" : "RTHRTH",
        "_score" : 1.0,
        "_source" : {
          "created_at" : "2020-05-31 19:01:01",
          "data" : [...]
        }
      },
      {
        "_index" : "net",
        "_type" : "_doc",
        "_id" : "LLLoIJBHHM",
        "_score" : 1.0,
        "_source" : {
          "created_at" : "2020-06-23 15:11:59",
          "data" : [...]
        }
      }
    ]
  }
}
The "data" field contains further fields nested within other fields.
I want to select the most recent document, and then count the occurrences of a certain value within that document.
This is my query:
{
"query": {
"match": {
"name.keyword": "net"
}
},
"sort": [
{
"created_at.keyword": {
"order": "desc"
}
}
],
"size": 1,
"aggs": {
"CountValue": {
"terms": {
"field": "data.add.serv.desc.keyword",
"include": "nginx"
}
}
}
}
And the output is:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"CountValue" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "nginx",
"doc_count" : 2
}
]
}
}
}
I suspect that doc_count is the number of documents the value appears in, not the number of times the value is repeated within the filtered document.
Any advice would be greatly appreciated!
Unless the fields under the path data.add.serv are of the nested type, the terms aggregation will produce per-document counts, not per-array-element counts.
For example:
POST example/_doc
{
"serv": [
{
"desc": "nginx"
},
{
"desc": "nginx"
},
{
"desc": "nginx"
}
]
}
then
GET example/_search
{
"size": 0,
"aggs": {
"NAME": {
"terms": {
"field": "serv.desc.keyword"
}
}
}
}
produces doc_count==1.
When, however, specified as nested:
DELETE example
PUT example
{
"mappings": {
"properties": {
"serv": {
"type": "nested"
}
}
}
}
POST example/_doc
{"serv":[{"desc":"nginx"},{"desc":"nginx"},{"desc":"nginx"}]}
then
GET example/_search
{
"size": 0,
"aggs": {
"NAME": {
"nested": {
"path": "serv"
},
"aggs": {
"NAME": {
"terms": {
"field": "serv.desc.keyword"
}
}
}
}
}
}
we end up with doc_count==3.
This has to do with the way non-nested array fields are flattened and de-duplicated. In the end, you may need to reindex your collections after applying the nested mapping.
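A rough client-side illustration of the difference (hypothetical Python, mirroring the example document above): flattening collapses the array into one de-duplicated value set per document, while the nested mapping keeps each array element countable on its own.

```python
# One document whose serv array holds the same value three times.
doc = {"serv": [{"desc": "nginx"}, {"desc": "nginx"}, {"desc": "nginx"}]}

# Non-nested view: the array is flattened into a single, de-duplicated
# value set, so each distinct value counts once per document.
flattened = set(s["desc"] for s in doc["serv"])
doc_count = sum(1 for v in flattened if v == "nginx")

# Nested view: each array element acts like its own hidden document.
nested_count = sum(1 for s in doc["serv"] if s["desc"] == "nginx")

print(doc_count, nested_count)  # 1 3
```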
EDIT
To take only the latest doc, you could do the following:
PUT example
{
"mappings": {
"properties": {
"serv": {
"type": "nested"
},
"created_at": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss"
}
}
}
}
then
POST example/_doc
{
"created_at" : "2020-05-31 19:01:01",
"serv": [
{
"desc": "nginx"
},
{
"desc": "nginx"
},
{
"desc": "nginx"
}
]
}
POST example/_doc
{
"created_at" : "2020-06-23 15:11:59",
"serv": [
{
"desc": "nginx"
},
{
"desc": "nginx"
}
]
}
then use a terms aggregation of size 1, sorted by key descending:
GET example/_search
{
  "size": 0,
  "aggs": {
    "NAME": {
      "terms": {
        "field": "created_at",
        "order": {
          "_key": "desc"
        },
        "size": 1
      },
      "aggs": {
        "NAME2": {
          "nested": {
            "path": "serv"
          },
          "aggs": {
            "NAME": {
              "terms": {
                "field": "serv.desc.keyword"
              }
            }
          }
        }
      }
    }
  }
}
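Client-side, the two-level aggregation amounts to something like this Python sketch (hypothetical, using the two sample documents): keep only the document with the latest created_at, then count matching entries in its serv array.

```python
# The two sample documents, reduced to the relevant fields.
docs = [
    {"created_at": "2020-05-31 19:01:01",
     "serv": [{"desc": "nginx"}, {"desc": "nginx"}, {"desc": "nginx"}]},
    {"created_at": "2020-06-23 15:11:59",
     "serv": [{"desc": "nginx"}, {"desc": "nginx"}]},
]

# terms on created_at with size 1, key descending: keep the latest bucket.
latest = max(docs, key=lambda d: d["created_at"])

# nested terms on serv.desc: count each array element individually.
count = sum(1 for s in latest["serv"] if s["desc"] == "nginx")
print(count)  # 2
```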

Elasticsearch - find IPs from which only anonymous requests came

I have network logs in my Elasticsearch. Each log has an username and an IP field. Something like this:
{"username":"user1", "ip": "1.2.3.4"}
{"username":"anonymous", "ip": "1.2.3.4"}
{"username":"anonymous", "ip": "2.3.4.5"}
{"username":"user2", "ip": "3.4.5.6"}
I have a seemingly simple task: list all IPs from which only anonymous requests came. The problem is that I cannot simply filter for anonymous, because then I would also list IPs that appear with anonymous but not exclusively. Manually I can do this with a 3-step process:
1. List all unique IPs.
2. List the unique IPs that appear with a username other than anonymous.
3. Exclude the items of the 2nd list from the 1st.
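The three manual steps are easy to express as set operations; here is a hypothetical Python sketch using the sample logs:

```python
# Sample log records from the question.
logs = [
    {"username": "user1", "ip": "1.2.3.4"},
    {"username": "anonymous", "ip": "1.2.3.4"},
    {"username": "anonymous", "ip": "2.3.4.5"},
    {"username": "user2", "ip": "3.4.5.6"},
]

all_ips = {log["ip"] for log in logs}                                      # step 1
named_ips = {log["ip"] for log in logs if log["username"] != "anonymous"}  # step 2
only_anonymous = all_ips - named_ips                                       # step 3
print(only_anonymous)  # {'2.3.4.5'}
```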
But is there a way to do this with a single ES query? My first instinct was to use bool query. My current approach is this:
GET /sample1/_search
{
"query": {
"bool": {
"must": {
"wildcard": {
"ip": "*"
}
},
"must_not": {
"term": {
"username": "-anonymous"
}
}
}
},
"size": 0,
"aggs": {
"ips": {
"terms": {
"field": "ip.keyword"
}
}
}
}
I expect "2.3.4.5", but it returns all 3 unique IPs. I searched the web and tried different query types for hours. Any ideas?
Please find below the mapping, sample documents, the query for your scenario, and the response:
Mapping:
PUT my_ip_index
{
"mappings": {
"properties": {
"user":{
"type": "keyword"
},
"ip":{
"type": "ip"
}
}
}
}
Documents:
POST my_ip_index/_doc/1
{
"user": "user1",
"ip": "1.2.3.4"
}
POST my_ip_index/_doc/2
{
"user": "anonymous",
"ip": "1.2.3.4"
}
POST my_ip_index/_doc/3
{
"user": "anonymous",
"ip": "2.3.4.5"
}
POST my_ip_index/_doc/4
{
"user": "user2",
"ip": "3.4.5.6"
}
Aggregation Query:
POST my_ip_index/_search
{
  "size": 0,
  "aggs": {
    "my_valid_ips": {
      "terms": {
        "field": "ip",
        "size": 10
      },
      "aggs": {
        "valid_users": {
          "terms": {
            "field": "user",
            "size": 10,
            "include": "anonymous"
          }
        },
        "min_bucket_selector": {
          "bucket_selector": {
            "buckets_path": {
              "valid_users_count": "valid_users._bucket_count",
              "my_valid_ips_count": "_count"
            },
            "script": {
              "source": "params.valid_users_count == 1 && params.my_valid_ips_count == 1"
            }
          }
        }
      }
    }
  }
}
Note how I've made use of Terms Aggregation and Bucket Selector Aggregation in the above query.
I've added the include part in the terms aggregation so as to consider only anonymous users. The logic inside the bucket_selector aggregation keeps a top-level IP bucket (e.g. 2.3.4.5) only if it has a single doc count and a single bucket in the second-level terms aggregation.
Response:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"my_valid_ips" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "2.3.4.5", <---- Expected IP/Answer
"doc_count" : 1,
"valid_users" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "anonymous",
"doc_count" : 1
}
]
}
}
]
}
}
}
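The selection logic can be sketched client-side in Python (a simplified version that keeps an IP only when anonymous is its sole user, rather than a literal translation of the bucket_selector script):

```python
from collections import defaultdict

# Sample log records from the question.
logs = [
    {"username": "user1", "ip": "1.2.3.4"},
    {"username": "anonymous", "ip": "1.2.3.4"},
    {"username": "anonymous", "ip": "2.3.4.5"},
    {"username": "user2", "ip": "3.4.5.6"},
]

# Outer terms bucket per IP; inner bucket collects the users seen for it.
users_per_ip = defaultdict(set)
for log in logs:
    users_per_ip[log["ip"]].add(log["username"])

# Bucket-selector analogue: keep an IP only if anonymous is its sole user.
selected = [ip for ip, users in users_per_ip.items() if users == {"anonymous"}]
print(selected)  # ['2.3.4.5']
```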
Hope it helps!

How to get multiple fields returned in elasticsearch query?

How can I get multiple unique fields returned together in an Elasticsearch query?
All of my documents have duplicate name and job fields. I would like to use a query to get all the unique values, with name and job in the same response so they stay tied together.
[
{
"name": "albert",
"job": "teacher",
"dob": "11/22/91"
},
{
"name": "albert",
"job": "teacher",
"dob": "11/22/91"
},
{
"name": "albert",
"job": "teacher",
"dob": "11/22/91"
},
{
"name": "justin",
"job": "engineer",
"dob": "1/2/93"
},
{
"name": "justin",
"job": "engineer",
"dob": "1/2/93"
},
{
"name": "luffy",
"job": "rubber man",
"dob": "1/2/99"
}
]
Expected result, in any format (I was trying to use aggs but I only get one field):
[
{
"name": "albert",
"job": "teacher"
},
{
"name": "justin",
"job": "engineer"
},
{
"name": "luffy",
"job": "rubber man"
}
]
This is what I tried so far
GET name.test.index/_search
{
"size": 0,
"aggs" : {
"name" : {
"terms" : { "field" : "name.keyword" }
}
}
}
Using the above query gets me this, which is good in that the values are unique:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 95,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"name" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Justin",
"doc_count" : 56
},
{
"key" : "Luffy",
"doc_count" : 31
},
{
"key" : "Albert",
"doc_count" : 8
}
]
}
}
}
I tried a nested aggregation but that did not work. Is there an alternative solution for getting multiple unique values, or am I missing something?
That's a good start! There are a few ways to achieve what you want, each provides a different response format, so you can decide which one you prefer.
The first option is to leverage the top_hits sub-aggregation and return the two fields for each name bucket:
GET name.test.index/_search
{
  "size": 0,
  "aggs": {
    "name": {
      "terms": {
        "field": "name.keyword"
      },
      "aggs": {
        "top": {
          "top_hits": {
            "_source": [
              "name",
              "job"
            ],
            "size": 1
          }
        }
      }
    }
  }
}
The second option is to use a script in your terms aggregation instead of a field to return a compound value:
GET name.test.index/_search
{
"size": 0,
"aggs": {
"name": {
"terms": {
"script": "doc['name'].value + ' - ' + doc['job'].value"
}
}
}
}
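The compound-key idea can be illustrated client-side with a hypothetical Python sketch: each unique (name, job) pair becomes one bucket key, just as the script builds one terms key per document.

```python
# Sample documents from the question, reduced to the two fields.
people = [
    {"name": "albert", "job": "teacher"},
    {"name": "albert", "job": "teacher"},
    {"name": "justin", "job": "engineer"},
    {"name": "luffy", "job": "rubber man"},
]

# Same compound key as the script: doc['name'].value + ' - ' + doc['job'].value
buckets = {}
for p in people:
    key = p["name"] + " - " + p["job"]
    buckets[key] = buckets.get(key, 0) + 1

print(sorted(buckets))  # ['albert - teacher', 'justin - engineer', 'luffy - rubber man']
```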
The third option is to use two levels of field collapsing:
GET name.test.index/_search
{
"collapse": {
"field": "name.keyword",
"inner_hits": {
"name": "by_job",
"collapse": {
"field": "job.keyword"
},
"size": 1
}
}
}

How to do sum + cardinality together in an aggregation in Elasticsearch, like sum(distinct target) with distinct(sum(amount))

Datatable:
target amount
10 10000
10 10000
15 12000
15 12000
Expected output is:
target amount
1 10000
1 12000
I want to achieve the above result with an aggregation in Elasticsearch.
Something like below, but cardinality and sum can't be used together:
"accessorials": {
  "date_histogram": {
    "field": "invoicedt",
    "interval": "week",
    "format": "Y-MM-dd:w",
    "keyed": true
  },
  "aggs": {
    "net_amount": {
      "sum": {
        "field": "netamt"
      }
    },
    "distinct_trknumber": {
      "cardinality": {
        "field": "target"
      },
      "sum": {
        "field": "amount"
      }
    }
  }
}
A terms aggregation can be used to get the distinct target and amount values.
Query:
{
  "size": 0,
  "aggs": {
    "targets": {
      "terms": {
        "field": "target",
        "size": 10
      },
      "aggs": {
        "amount": {
          "terms": {
            "field": "amount",
            "size": 10
          },
          "aggs": {
            "count": {
              "cardinality": {
                "field": "amount"
              }
            }
          }
        }
      }
    }
  }
}
Result:
"aggregations" : {
"targets" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 10,
"doc_count" : 2,
"amount" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 10000,
"doc_count" : 2,
"count" : {
"value" : 1
}
}
]
}
},
{
"key" : 15,
"doc_count" : 2,
"amount" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 12000,
"doc_count" : 2,
"count" : {
"value" : 1
}
}
]
}
}
]
}
}
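Client-side, the nested terms aggregation amounts to grouping by target and collecting distinct amounts, as in this hypothetical Python sketch:

```python
from collections import defaultdict

# (target, amount) rows from the data table in the question.
rows = [(10, 10000), (10, 10000), (15, 12000), (15, 12000)]

# Outer terms bucket per target; inner set keeps distinct amounts,
# so each amount bucket has a cardinality of 1.
amounts_per_target = defaultdict(set)
for target, amount in rows:
    amounts_per_target[target].add(amount)

result = {t: sorted(a) for t, a in amounts_per_target.items()}
print(result)  # {10: [10000], 15: [12000]}
```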
